Spamming is easy, filtering is tough

We sent 250 million emails through Resdex last month, and that’s way cool! Sustaining that traffic while keeping the user experience good is a tough task, but Resdex gives it the attention it needs to make sure the emails sent are relevant.

What is Resdex?

Resdex is one of the products most commonly used by recruiters in the country to search for candidates. It provides various options to make hiring more targeted and less time-consuming.

Resdex is Naukri.com’s Resume Database Access. Naukri.com has a large database of over 48.1 million jobseekers across industries, functions, locations and experience levels.

With a Resdex subscription, recruiters get not only access to the database but also easy-to-use search tools that help them find the right candidate conveniently.

  • Resdex is not only the largest but also the fastest growing database of jobseekers.
  • Profiles registered are spread across industries, functions and experience levels.
  • Candidates can be contacted in real time through either e-mail or SMS.
  • The database has candidates from various locations, including international ones.

Email is one of the most common and widely used ways of communicating. It serves as both a personal and a professional channel.

Nerd out with me for a minute.

Peak Performance Stats

  • 400 emails per second
  • 24,000 emails per minute
  • 1,440,000 emails per hour
  • 10,000,000 emails per day
  • 60,000,000 emails per week
  • 250,000,000 emails per month!
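
As a quick sanity check on these figures, the per-minute and per-hour numbers follow directly from the peak rate of 400 emails per second. The daily figure is well below what a full day at peak rate would give, so the peak rate is evidently not sustained around the clock. A minimal sketch of the arithmetic (the variable names are illustrative):

```python
# Peak rate quoted in the list above; the next two figures derive from it.
PEAK_PER_SECOND = 400

per_minute = PEAK_PER_SECOND * 60
per_hour = per_minute * 60

# What a full day would carry if the peak rate never dropped.
full_day_at_peak = per_hour * 24

# Daily volume quoted in the list above.
DAILY_VOLUME = 10_000_000

# Share of the theoretical peak capacity used by the daily volume.
utilisation = DAILY_VOLUME / full_day_at_peak

print(per_minute)            # 24000
print(per_hour)              # 1440000
print(full_day_at_peak)      # 34560000
print(f"{utilisation:.0%}")  # 29%
```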

 

Countless mails are exchanged between end users, enabling successful communication. But that is not the whole story of email: malicious or irrelevant mails are also part of the system. For professional communication it is crucial to detect spam and block it, so it is the responsibility of the medium or portal to tell spam from ham.

Source : http://i.dailymail.co.uk/i/pix/2009/04/15/article-1170177-04765D6D000005DC-349_468x391.jpg

 

Let’s define the two terms used above:

1) Spam: Any irrelevant message, usually sent to a large number of users with the intent of advertising, phishing or spreading malware.

2) Ham: A relevant, genuine message whose purpose is communication that is useful to both the sender and the receiver.

Spam and ham are mutually exclusive. Spam degrades both the purpose of email and the user experience, so to fulfil that purpose one needs to eliminate spam.

Resdex, the recruiter portal of Naukri, has one of the largest pools of jobseekers in the country. As a main product of Naukri, Resdex stands on two pillars: viewing jobseeker CVs (the data pool of jobseekers) and mailing the jobseekers (the communication between jobseeker and recruiter).

Mails from recruiters to jobseekers are very important from Resdex’s point of view. On average, Resdex handles approximately 10 million mails per day, which help connect recruiters to jobseekers through naukri.com.

It is the portal’s responsibility to ensure that no spam is sent to jobseekers. Numerous filters are applied to ensure that the right mails reach the right jobseekers. One of these filters is a Naive Bayes spam filter, which detects unsolicited mail and spam with a high success rate.

 

What is a Naive Bayes spam filter?

Bayesian filtering is a statistical technique for filtering email. It calculates, for each word in an email, the probability that a message containing that word is spam. The filter does not know these probabilities in advance; it has to be trained to learn them, and the spam probability varies from word to word. Bayes’ theorem is applied to calculate each probability. The score of an email is then calculated by combining the probabilities of its words. If the score exceeds the defined threshold, the email is marked as spam.
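
The training and scoring steps described above can be sketched in a few lines. This is a generic textbook naive Bayes formulation, not the Resdex or Bogofilter implementation: the function names, the add-one smoothing, the equal spam/ham priors, the 0.9 threshold and the sample messages are all assumptions made for illustration. It combines the per-word probabilities with the usual naive Bayes product rule, computed in log space.

```python
import math
from collections import Counter


def train(messages):
    """Count, per word, how many spam and ham messages contain it.

    `messages` is an iterable of (text, is_spam) pairs.
    """
    spam_words, ham_words = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        words = set(text.lower().split())
        if is_spam:
            spam_words.update(words)
            n_spam += 1
        else:
            ham_words.update(words)
            n_ham += 1
    return spam_words, ham_words, n_spam, n_ham


def word_spam_probability(word, model):
    """Pr(S|W) for one word, assuming Pr(S) = Pr(H) = 0.5."""
    spam_words, ham_words, n_spam, n_ham = model
    # Add-one smoothing keeps unseen words away from exactly 0 or 1.
    p_w_spam = (spam_words[word] + 1) / (n_spam + 2)
    p_w_ham = (ham_words[word] + 1) / (n_ham + 2)
    return p_w_spam / (p_w_spam + p_w_ham)


def spam_score(text, model):
    """Combine the per-word probabilities into one score between 0 and 1."""
    log_odds = 0.0
    for word in set(text.lower().split()):
        p = word_spam_probability(word, model)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))


def classify(text, model, threshold=0.9):
    """Mark the message as spam when its score exceeds the threshold."""
    return "spam" if spam_score(text, model) > threshold else "ham"


# Hypothetical training data, far smaller than any real corpus.
training = [
    ("cheap hotel deals book now", True),
    ("win a free hotel stay", True),
    ("interview scheduled for java developer role", False),
    ("job opening for senior developer", False),
]
model = train(training)

print(classify("free hotel deals", model))             # spam
print(classify("developer interview for job", model))  # ham
```

With so little training data the first message only just clears the threshold, which is one reason a real filter needs a large, continuously updated corpus.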

Let’s suppose the suspected message contains the word “hotel”. Most people who are used to receiving e-mail would guess that such a message is likely to be spam, for example an unsolicited promotional offer. The spam detection software, however, does not “know” such facts; all it can do is compute probabilities.

The formula the software uses to do that is derived from Bayes’ theorem:

Pr(S|W) = (Pr(W|S) * Pr(S)) / (Pr(W|S) * Pr(S) + Pr(W|H) * Pr(H))

where:

  • Pr(S|W) is the probability that a message is spam, given that the word “hotel” is in it;
  • Pr(S) is the overall probability that any given message is spam;
  • Pr(W|S) is the probability that the word “hotel” appears in spam messages;
  • Pr(H) is the overall probability that any given message is not spam (is “ham”);
  • Pr(W|H) is the probability that the word “hotel” appears in ham messages.

 

In layman’s terms, the Naive Bayes spam filter finds the probability that a message is spam given that the word “hotel” occurs in it.
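
To make the formula concrete, here is a small worked example. The input figures are hypothetical, chosen only to show the calculation, and the function name is illustrative; Pr(H) is taken to be 1 - Pr(S).

```python
def spam_probability_given_word(p_w_spam, p_w_ham, p_spam=0.5):
    """Pr(S|W) by Bayes' theorem, with Pr(H) = 1 - Pr(S)."""
    p_ham = 1.0 - p_spam
    numerator = p_w_spam * p_spam
    denominator = numerator + p_w_ham * p_ham
    return numerator / denominator


# Hypothetical figures: "hotel" appears in 20% of spam and 1% of ham,
# and 30% of all mail is spam.
p = spam_probability_given_word(p_w_spam=0.20, p_w_ham=0.01, p_spam=0.30)
print(round(p, 3))  # 0.896
```

Under these assumed figures, a message containing “hotel” would be spam with a probability of roughly 90%, even though only 30% of all mail is spam.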

So a Naive Bayes spam filter is applied in Resdex to maintain the user experience and let recruiters reach jobseekers with relevant messages only. It is one of the most widely used filtering mechanisms. Given the very high volume of mails sent to jobseekers, the filter is trained constantly, which improves its efficiency.

The self-learning mechanism implemented at Naukri, in other words a constantly improving in-house filter, makes it more robust.

Stop Spam with Bogofilter

We use Bogofilter, which makes use of Bayesian spam filtering techniques, to detect spam mails. There are a number of tools available on Linux to prevent spam; Bogofilter is a very well made one that can be integrated easily and seamlessly.

Bogofilter is very easy to use:

Step 1. Install

Open a terminal window and run the command which bogofilter. If it prints a path such as /usr/bin/bogofilter, congratulations, Bogofilter is installed. If not, it is time to install it.
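
The same check can be scripted. A minimal sketch, assuming the executable is simply named bogofilter and should be found on the PATH (the function name is illustrative):

```python
import shutil


def find_bogofilter():
    """Return the full path of the bogofilter executable, or None."""
    return shutil.which("bogofilter")


path = find_bogofilter()
if path:
    print(f"Bogofilter is installed at {path}")
else:
    print("Bogofilter was not found on the PATH; install it first")
```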

Step 2. Bogofilter training

Just because Bogofilter is set up, don’t assume it will work perfectly out of the box. Like most spam filters, it must first be trained: a collection of both ham (good email) and spam (junk email) must be marked as such to begin the training.
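
Training can be scripted by feeding each stored message to Bogofilter with the -s option (register as spam) or the -n option (register as ham). The sketch below is one way to do it, not necessarily how Resdex does: the function names and the database path are placeholders, and the executable is assumed to be on the PATH.

```python
import subprocess

BOGOFILTER = "bogofilter"  # assumed to be on the PATH


def training_command(database_dir, is_spam):
    """Build the bogofilter command that registers one message.

    -s registers the message read from standard input as spam,
    -n registers it as ham, and -d selects the database directory.
    """
    flag = "-s" if is_spam else "-n"
    return [BOGOFILTER, "-d", database_dir, flag]


def train_on_file(mail_path, database_dir, is_spam):
    """Feed one stored e-mail to bogofilter as a training example."""
    with open(mail_path, "rb") as mail:
        result = subprocess.run(training_command(database_dir, is_spam),
                                stdin=mail)
    # Exit code 3 is bogofilter's documented status for I/O or other errors.
    if result.returncode == 3:
        raise RuntimeError(f"bogofilter could not register {mail_path}")


# Placeholder path, shown only to illustrate the generated command.
print(training_command("/path/to/db", is_spam=True))
```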

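Step 3. Execute: spam mail detection

Command:

    /usr/local/bogofilter/bin/bogofilter -d BOGOFILTER_DATABASE < MAIL_CONTENT

The exit code of the command carries the verdict:

    // $return holds the exit code of the bogofilter command
    switch ($return)
    {
        case 0: return "REJECTED";   // classified as spam
        case 1: return "ALLOWED";    // classified as ham
        case 2: return "UNDECIDED";  // bogofilter is unsure
    }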
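
For readers who prefer to see the whole call in one place, here is a hedged sketch of the same step. It assumes the mail content is available as bytes and that the executable and database locations are passed in by the caller; the function names are illustrative and this is not the production code.

```python
import subprocess

# Exit codes documented in the bogofilter manual, mapped to the
# same labels as the snippet above.
VERDICTS = {0: "REJECTED", 1: "ALLOWED", 2: "UNDECIDED"}


def verdict_from_exit_code(code):
    """Translate a bogofilter exit code into a verdict label."""
    if code not in VERDICTS:
        # Exit code 3 signals an I/O or other error.
        raise RuntimeError(f"bogofilter failed with exit code {code}")
    return VERDICTS[code]


def classify_mail(mail_bytes, database_dir, bogofilter="bogofilter"):
    """Pipe one message into bogofilter and return its verdict."""
    result = subprocess.run([bogofilter, "-d", database_dir],
                            input=mail_bytes)
    return verdict_from_exit_code(result.returncode)


print(verdict_from_exit_code(2))  # UNDECIDED
```

Treating any other exit code as an error keeps a failed run from being mistaken for a verdict.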

Do you feel you can help us improve, or do you want to learn from us?

It’s very simple: join us. We are hiring!