How Our Spam Filter Works
From Daily Data
Unless you have specifically requested no processing, all e-mail coming into our mail servers is processed by the ASSP Spam and Amavis Anti-Virus engines. This article describes in broad terms how the mail is processed.
Summary
Email is not automatically accepted by the system. Once it is accepted, however, each of several tests gives it a grade, or score. When the tests are complete, the sum of the scores determines whether the mail is put in your mailbox or in the spam account.
Steps in identifying Spam
Identifying spam is a multi-step process, based mainly on experience with earlier spam messages that resemble the one being checked. The following is a brief outline of the steps taken to determine whether a message is spam.
In what follows, the term "email" means incoming e-mail, i.e. email that someone has sent to your email account.
- The IP address that the email is coming from is checked against our internal "reject" list of addresses. Mail that matches is rejected immediately.
- The IP address that the email is coming from is checked against the Real Time Black Hole List (RBL), a list of IP addresses that are known to be origins of spam. If the IP address matches, the email is either rejected out of hand or marked as "most likely spam".
- If an email is coming from an unknown IP address, and the From line is from an unknown correspondent, the email is put in "jail".
- If the message is already in jail, its initial score will be incremented. After a certain number of attempts to resend a message in jail, the message will be rejected.
- Messages which are in jail and never resent are remembered as possible spam in the future.
- If an email is coming from a known user (on the whitelist), or if it behaved itself well while in "jail", it is processed. Email coming from someone on the whitelist is automatically given a large negative grade in this step, a "very unlikely this is spam" grade.
- Any attachments are processed against our server's anti-virus. If a virus is detected, the email is thrown away. NOTE: you should still check for viruses yourself, as no anti-virus program will catch every virus.
- The email is checked against all other emails which have been marked as spam. This Bayesian Filter scan looks for similarities between the email and those earlier messages. The email is given a positive grade for each of the following tests that it matches:
  - The IP address the email came from is checked to see if spam has been reported from there before (e.g. someone has a virus on their computer).
  - The email is checked for similar content. Note: this is not simply looking for words. Words you notice in spam may also be used in valid emails, so we cannot reject an email just because it uses words associated with other spam. The filter has to look at phrases.
  - The email is compared as a whole. When spam is reported, the result of a numeric calculation is stored, and that result is compared with the same calculation performed on this email. This helps pick up images (i.e. pictures of text).
  - Many other tests are performed.
- The email is graded based upon the results above. If the score is greater than a certain value, it is marked as spam and placed into the spam account. If it is less than that value, it is sent to your mailbox.
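The grading in the steps above can be pictured as a running total that is compared with a cut-off. The following is only a minimal sketch of that idea, not the server's actual code: the threshold, the whitelist grade, the two toy tests and every name in it are assumptions made up for illustration.

```python
from dataclasses import dataclass

# All numbers here are invented for illustration, not the server's real settings.
SPAM_THRESHOLD = 50
WHITELIST_GRADE = -100  # the "very unlikely this is spam" grade


@dataclass
class Email:
    sender: str
    source_ip: str
    body: str


def score_email(email, whitelist, tests):
    """Add up the grade from every test; whitelisted senders start far below zero."""
    score = WHITELIST_GRADE if email.sender in whitelist else 0
    for test in tests:
        score += test(email)
    return score


def destination(email, whitelist, tests):
    """Pick the spam account only when the total exceeds the threshold."""
    if score_email(email, whitelist, tests) > SPAM_THRESHOLD:
        return "spam account"
    return "mailbox"


# Two toy tests standing in for the real ones (both hypothetical).
def ip_previously_reported(email):
    return 40 if email.source_ip in {"203.0.113.7"} else 0


def suspicious_phrase(email):
    return 30 if "act now to claim" in email.body.lower() else 0


TESTS = [ip_previously_reported, suspicious_phrase]
message = Email("stranger@example.com", "203.0.113.7", "Act now to claim your prize")
print(destination(message, whitelist=set(), tests=TESTS))
print(destination(message, whitelist={"stranger@example.com"}, tests=TESTS))
```

The same message lands in the spam account when the sender is unknown and in the mailbox when the sender is on the whitelist, because the large negative grade outweighs the two positive ones.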
Terms and explanations
- IP Address - Each computer on the Internet has a unique address, sort of like a phone number, called an IP address.
- Jail - Many spammers will send a message two or three times automatically, then never try again. A well behaved e-mail server, on the other hand, will attempt to send a message and, when it receives the "I'm busy" flag, will wait a prescribed time before attempting to send it again. Also, spammers will generally try to send the same message to a lot of different destinations at one time. Thus, temporarily stopping a message and seeing whether it acts correctly when the server tries to resend it is a very good test of whether a message is spam. If it is resent too often, especially if it is resent to multiple unrelated users, it is most likely spam. The "jail" concept catches a lot of mail before it even begins being processed.
- Bayesian Filter - An attempt to mimic human recognition of spam. A human can often tell when a message is spam, not just by the words, but by the way things are said, who they are from, etc. For example, if the family doctor uses certain phrases, it is acceptable, but the exact same phrases used by a business associate would not be. The Bayesian Filter mimics this recognition (but not as well as a human).
- Real Time Black Hole List (RBL) - A database of IP addresses that is a co-operative effort of many mail server administrators. When an IP address is noted as a source of spam, administrators worldwide add that IP address to the RBL. If the reason for the spam is cleared up, the systems administrator of the affected server can then have the IP address removed from the RBL.
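The "jail" behaviour described above can be sketched in a few lines. This is an illustration only: the waiting time, the retry limit, the `Jail` class and its return strings are all hypothetical, and the real engine tracks more than this (for example, resends to multiple unrelated users).

```python
MIN_WAIT_SECONDS = 300  # hypothetical "prescribed time" a polite server waits
MAX_ATTEMPTS = 5        # hypothetical limit before the message is rejected


class Jail:
    """Remember first-time senders and judge them by how they retry."""

    def __init__(self):
        # (ip, sender, recipient) -> [time first seen, number of attempts]
        self._entries = {}

    def check(self, ip, sender, recipient, now):
        key = (ip, sender, recipient)
        entry = self._entries.get(key)
        if entry is None:
            # Unknown combination: hold the message and ask for a retry.
            self._entries[key] = [now, 1]
            return "try again later"
        entry[1] += 1
        if entry[1] > MAX_ATTEMPTS:
            return "reject"            # resent too often
        if now - entry[0] < MIN_WAIT_SECONDS:
            return "try again later"   # retried before the waiting time passed
        return "accept"                # waited politely, so process the message


jail = Jail()
print(jail.check("198.51.100.4", "a@example.org", "you@example.com", now=0))
print(jail.check("198.51.100.4", "a@example.org", "you@example.com", now=400))
```

A server that waits and retries once is let through on the second attempt, while one that hammers the same message repeatedly within the waiting time ends up rejected.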
A lot of the filtering is done based on what other people have found. Our spam engine co-operates with other e-mail service providers. When you identify spam, we not only mark it internally, but we also send a message to the co-operative database. This database is not automatically updated (you may incorrectly mark something as spam), but after a certain number of reports have been made on a single message, those results are used worldwide. So, you may be saved from reading the latest spam messages from someone thousands of miles away who reported a message as spam before you were even sent a copy.
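As a rough illustration of why a single mistaken report does not affect anyone else, the sketch below counts reports per message and only treats a message as known spam after enough of them. The threshold of three, the use of distinct reporters, the SHA-256 fingerprint and the `SharedSpamDatabase` class are all assumptions for the example; the co-operative database's real rules are not described here.

```python
import hashlib
from collections import defaultdict

REPORTS_NEEDED = 3  # hypothetical; the real number is not stated in this article


class SharedSpamDatabase:
    """Count distinct reporters per message before trusting a report."""

    def __init__(self):
        self._reporters = defaultdict(set)  # fingerprint -> set of reporters

    @staticmethod
    def fingerprint(message):
        return hashlib.sha256(message.encode("utf-8")).hexdigest()

    def report(self, message, reporter):
        self._reporters[self.fingerprint(message)].add(reporter)

    def is_known_spam(self, message):
        reporters = self._reporters.get(self.fingerprint(message), ())
        return len(reporters) >= REPORTS_NEEDED


db = SharedSpamDatabase()
db.report("Cheap pills, click here", "alice")
db.report("Cheap pills, click here", "alice")  # a repeat by the same person adds nothing
db.report("Cheap pills, click here", "bob")
print(db.is_known_spam("Cheap pills, click here"))
```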
How Spammers Work
Spammers, especially the ones who send millions of spam messages, work with certain constraints brought on by their business (and it is a business). For one thing, they don't worry too much if one or two messages don't get delivered; they are generally working off of lists of e-mail accounts that have high failure rates. Also, they tend to send the same e-mail message to a lot of people over a short period of time.
We have learned this and built engines that check for these two items. Spammers then learn how we are blocking their spam and adjust their habits, and then we must adjust to these adjustments. Basically, it is a war of escalation.
At one time, spammers sent the exact same message to everyone on their list from a few, readily identifiable servers. During this time, it was a matter of seeing if the exact same message was coming in for multiple users from the same servers. Then, spammers started changing their messages slightly by having a computer put in random modifications between each message, so we had to start looking for messages which look similar instead of exactly alike.
Note: Computers are very poor at recognizing the same message, slightly modified. Simply moving a sentence from the first paragraph to the second was sufficient to get a message past a spam filter a few years ago.
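One common way to compare messages that "look similar instead of exactly alike" is to break each message into overlapping runs of words and measure how many runs two messages share. The sketch below uses three-word runs and Jaccard similarity; this is a general technique chosen for illustration, not necessarily what our filter does, and the function names are made up.

```python
def shingles(text, size=3):
    """Return the set of overlapping runs of `size` consecutive words."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b):
    """Jaccard similarity: shared runs divided by all runs seen in either text."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "Buy cheap watches today. Limited offer ends soon. Click the link below."
reordered = "Limited offer ends soon. Buy cheap watches today. Click the link below."
unrelated = "Lunch is at noon on Friday in the main conference room."

print(original == reordered)            # an exact comparison sees two different messages
print(similarity(original, reordered))  # the phrase comparison still finds heavy overlap
print(similarity(original, unrelated))
```

Moving a sentence defeats the exact comparison, but most of the three-word runs survive the move, so the reordered message still scores far closer to the original than the unrelated one does.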
Spammers have, in many cases, started creating "pictures" of their message. These pictures can not be easily deciphered by a computer, and so are even more difficult to recognize. As such, we have started comparing pictures to pictures, to see if they are the same. However, simply changing one dot on a picture is enough to make it a completely different message to a computer. These are minor changes that you and I would not even see.
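The "one dot" problem can be shown with a tiny made-up picture. In the sketch below, an exact fingerprint of the picture changes completely when one dot changes, while a cruder fingerprint (one bit per dot, brighter or darker than average) does not change at all. Both functions are simplified illustrations under that assumption, not the calculation our filter stores; real image comparison usually shrinks the picture first and tolerates a few differing bits.

```python
import hashlib


def exact_fingerprint(pixels):
    """Hash the raw values: any changed dot gives a different result."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def average_fingerprint(pixels):
    """One bit per dot: 1 if the dot is brighter than the picture's average."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]


def differing_bits(a, b):
    return sum(x != y for x, y in zip(a, b))


# A 4x4 greyscale "picture" (0 = black, 255 = white) and a copy with one dot nudged.
picture = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
tweaked = [row[:] for row in picture]
tweaked[0][0] = 1  # a change no person would notice

print(exact_fingerprint(picture) == exact_fingerprint(tweaked))
print(differing_bits(average_fingerprint(picture), average_fingerprint(tweaked)))
```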
Finally, instead of simply sending from a few, poorly behaved servers, spammers are now working with virus creators. In many cases, when a virus takes over your computer, it is turning it into a drone for spammers. Combined with other computers which have been taken over, the collective of, literally, thousands of compromised computers is called a "botnet." A botnet is a collection of computers, also known as zombies or robots, that can all be controlled remotely by one person. (See also http://en.wikipedia.org/wiki/Botnet.) The number of botnets currently in operation is unknown, but some estimates place it at over a million. The "top ten" list from Symantec, as of June 2009, had one botnet of between 1.4 and 2.1 million compromised computers being responsible for 51 million pieces of spam being sent per minute. That is not a typo: 51 million spam messages sent from 1-2 million computers each minute, all under the control of a very small group of people. See http://www.dslreports.com/forum/r22720948-The-thriving-and-criminal-world-of-Botnet-Attacks for more information.
The bottom line is that the source of spam is spread across millions of computers, sending out a lot of different spam messages. And the only way to catch them is to teach a computer to compare in generalities, which computers do not do well.

