August 2002

(This article describes the spam-filtering techniques used in the spamproof web-based mail reader we built to exercise Arc. An improved algorithm is described in Better Bayesian Filtering.)

I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

To the recipient, spam is easily recognizable. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process?

I think we will be able to solve the problem with fairly simple algorithms. In fact, I've found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.

The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins "Dear Friend" or has a subject line that's all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.

And so you do, and in the beginning it works. A few simple rules will take a big bite out of your incoming spam. Merely looking for the word "click" will catch 79.7% of the emails in my spam corpus, with only 1.2% false positives.

I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing that last few percent of spams got very hard, and that as I made the filters stricter I got more false positives.

False positives are innocent emails that get mistakenly identified as spams. For most users, missing legitimate email is an order of magnitude worse than receiving spam, so a filter that yields false positives is like an acne cure that carries a risk of death to the patient.

The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are really good, users will be more likely to ignore everything they catch.

I don't know why I avoided trying the statistical approach for so long. I think it was because I got addicted to trying to identify spam features myself, as if I were playing some kind of competitive game with the spammers. (Nonhackers don't often realize this, but most hackers are very competitive.) When I did try statistical analysis, I found immediately that it was much cleverer than I had been. It discovered, of course, that terms like "virtumundo" and "teens" were good indicators of spam. But it also discovered that "per" and "FL" and "ff0000" are good indicators of spam. In fact, "ff0000" (html for bright red) turns out to be as good an indicator of spam as any pornographic term.

_

Here's a sketch of how I do statistical filtering. I start with one corpus of spam and one of nonspam mail. At the moment each one has about 4000 messages in it.
I scan the entire text, including headers and embedded html and javascript, of each message in each corpus. I currently consider alphanumeric characters, dashes, apostrophes, and dollar signs to be part of tokens, and everything else to be a token separator. (There is probably room for improvement here.) I ignore tokens that are all digits, and I also ignore html comments, not even considering them as token separators.

I count the number of times each token (ignoring case, currently) occurs in each corpus. At this stage I end up with two large hash tables, one for each corpus, mapping tokens to number of occurrences.

Next I create a third hash table, this time mapping each token to the probability that an email containing it is a spam, which I calculate as follows [1]:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
   (unless (< (+ g b) 5)
     (max .01 (min .99 (float (/ (min 1 (/ b nbad))
                                 (+ (min 1 (/ g ngood))
                                    (min 1 (/ b nbad)))))))))

where word is the token whose probability we're calculating, good and bad are the hash tables I created in the first step, and ngood and nbad are the number of nonspam and spam messages respectively.

I explained this as code to show a couple of important details. I want to bias the probabilities slightly to avoid false positives, and by trial and error I've found that a good way to do it is to double all the numbers in good. This helps to distinguish between words that occasionally do occur in legitimate email and words that almost never do. I only consider words that occur more than five times in total (actually, because of the doubling, occurring three times in nonspam mail would be enough). And then there is the question of what probability to assign to words that occur in one corpus but not the other. Again by trial and error I chose .01 and .99. There may be room for tuning here, but as the corpus grows such tuning will happen automatically anyway.

The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives.

When new mail arrives, it is scanned into tokens, and the most interesting fifteen tokens, where interesting is measured by how far their spam probability is from a neutral .5, are used to calculate the probability that the mail is spam. If probs is a list of the fifteen individual probabilities, you calculate the combined probability thus:

(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                     probs)))))

One question that arises in practice is what probability to assign to a word you've never seen, i.e. one that doesn't occur in the hash table of word probabilities. I've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar.

There are examples of this algorithm being applied to actual emails in an appendix at the end.

I treat mail as spam if the algorithm above gives it a probability of more than .9 of being spam. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range.
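To make the mechanics above concrete, here is a minimal sketch of the scoring step in Common Lisp. It is only an illustration of the scheme just described, not the production code: the names (*word-probs*, spam-probability, and so on) are invented, the third hash table described above is assumed to already be sitting in *word-probs*, and details like html comments and header parsing are glossed over.

(defparameter *word-probs* (make-hash-table :test 'equal))  ; token -> spam probability

(defun token-char-p (c)
  ;; alphanumeric characters, dashes, apostrophes, and dollar signs
  ;; are part of tokens; everything else is a separator
  (or (alphanumericp c) (member c '(#\- #\' #\$))))

(defun tokenize (text)
  ;; Return the tokens in TEXT, lowercased, with all-digit tokens
  ;; dropped.  (Html comments are not handled in this sketch.)
  (let ((tokens '())
        (current '()))
    (flet ((flush ()
             (when current
               (let ((tok (string-downcase (coerce (nreverse current) 'string))))
                 (unless (every #'digit-char-p tok)
                   (push tok tokens)))
               (setf current '()))))
      (loop for c across text
            do (if (token-char-p c) (push c current) (flush))
            finally (flush))
      (nreverse tokens))))

(defun word-prob (word)
  ;; .4 for words we have never seen before
  (gethash word *word-probs* .4))

(defun combined-probability (probs)
  ;; the combining formula from the text
  (let ((prod (apply #'* probs)))
    (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs))))))

(defun spam-probability (text)
  ;; Score a message by its fifteen most interesting tokens, i.e. the
  ;; ones whose probability is farthest from a neutral .5.  (Duplicates
  ;; are counted once here -- a guess; the text doesn't say.)
  (let* ((probs (mapcar #'word-prob
                        (remove-duplicates (tokenize text) :test #'equal)))
         (sorted (sort probs #'> :key #'(lambda (p) (abs (- p .5))))))
    (combined-probability (subseq sorted 0 (min 15 (length sorted))))))

(defun spam-p (text)
  (> (spam-probability text) .9))

The interesting part is how *word-probs* gets built and kept up to date, which is sketched a little further on.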
_

One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible.

But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like SpamAssassin assign a spam "score" to email. The Bayesian approach assigns an actual probability. The problem with a "score" is that no one knows what it means. The user doesn't know what it means, but worse still, neither does the developer of the filter. How many points should an email get for having the word "sex" in it? A probability can of course be mistaken, but there is little ambiguity about what it means, or how evidence should be combined to calculate it. Based on my corpus, "sex" indicates a .97 probability of the containing email being a spam, whereas "sexy" indicates .99 probability. And Bayes' Rule, equally unambiguous, says that an email containing both words would, in the (unlikely) absence of any other evidence, have a 99.97% chance of being a spam.

Because it is measuring probabilities, the Bayesian approach considers all the evidence in the email, both good and bad. Words that occur disproportionately rarely in spam (like "though" or "tonight" or "apparently") contribute as much to decreasing the probability as bad words like "unsubscribe" and "opt-in" do to increasing it. So an otherwise innocent email that happens to include the word "sex" is not going to get tagged as spam.

Ideally, of course, the probabilities should be calculated individually for each user. I get a lot of email containing the word "Lisp", and (so far) no spam that does. So a word like that is effectively a kind of password for sending mail to me. In my earlier spam-filtering software, the user could set up a list of such words and mail containing them would automatically get past the filters. On my list I put words like "Lisp" and also my zipcode, so that (otherwise rather spammy-sounding) receipts from online orders would get through. I thought I was being very clever, but I found that the Bayesian filter did the same thing for me, and moreover discovered a lot of words I hadn't thought of.

When I said at the start that our filters let through less than 5 spams per 1000 with 0 false positives, I'm talking about filtering my mail based on a corpus of my mail. But these numbers are not misleading, because that is the approach I'm advocating: filter each user's mail based on the spam and nonspam mail he receives. Essentially, each user should have two delete buttons, ordinary delete and delete-as-spam. Anything deleted as spam goes into the spam corpus, and everything else goes into the nonspam corpus.

You could start users with a seed filter, but ultimately each user should have his own per-word probabilities based on the actual mail he receives. This (a) makes the filters more effective, (b) lets each user decide their own precise definition of spam, and (c) perhaps best of all makes it hard for spammers to tune mails to get through the filters. If a lot of the brain of the filter is in the individual databases, then merely tuning spams to get through the seed filters won't guarantee anything about how well they'll get through individual users' varying and much more trained filters.
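Concretely, the training side might look something like the following sketch. Again the names are invented for illustration (*good-counts*, *bad-counts*, *ngood*, and *nbad* stand for one user's corpus statistics; tokenize and *word-probs* are from the earlier sketch), and it shows only the shape of the scheme described above.

(defparameter *good-counts* (make-hash-table :test 'equal))  ; token -> count in nonspam
(defparameter *bad-counts*  (make-hash-table :test 'equal))  ; token -> count in spam
(defparameter *ngood* 0)                                     ; number of nonspam messages
(defparameter *nbad*  0)                                     ; number of spam messages

(defun record-message (text counts)
  (dolist (tok (tokenize text))
    (incf (gethash tok counts 0))))

;; the two delete buttons
(defun delete-as-spam (text)
  (record-message text *bad-counts*)
  (incf *nbad*))

(defun delete-as-nonspam (text)
  (record-message text *good-counts*)
  (incf *ngood*))

(defun recompute-word-probs ()
  ;; Rebuild *word-probs* from the counts using the formula given
  ;; earlier: double the good counts, skip words seen fewer than five
  ;; times, and clamp the result to [.01, .99].  Assumes both corpora
  ;; contain at least one message.
  (clrhash *word-probs*)
  (let ((words (make-hash-table :test 'equal)))
    (flet ((note-words (table)
             (maphash #'(lambda (w c)
                          (declare (ignore c))
                          (setf (gethash w words) t))
                      table)))
      (note-words *good-counts*)
      (note-words *bad-counts*))
    (maphash
     #'(lambda (word flag)
         (declare (ignore flag))
         (let ((g (* 2 (or (gethash word *good-counts*) 0)))
               (b (or (gethash word *bad-counts*) 0)))
           (unless (< (+ g b) 5)
             (setf (gethash word *word-probs*)
                   (max .01 (min .99 (float (/ (min 1 (/ b *nbad*))
                                               (+ (min 1 (/ g *ngood*))
                                                  (min 1 (/ b *nbad*)))))))))))
     words)))

Starting a user with a seed filter would just mean initializing these tables with copies of someone else's counts rather than with empty ones.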
Content-based spam filtering is often combined with a whitelist, a list of senders whose mail can be accepted with no filtering. One easy way to build such a whitelist is to keep a list of every address the user has ever sent mail to. If a mail reader has a delete-as-spam button then you could also add the from address of every email the user has deleted as ordinary trash.

I'm an advocate of whitelists, but more as a way to save computation than as a way to improve filtering. I used to think that whitelists would make filtering easier, because you'd only have to filter email from people you'd never heard from, and someone sending you mail for the first time is constrained by convention in what they can say to you. Someone you already know might send you an email talking about sex, but someone sending you mail for the first time would not be likely to. The problem is, people can have more than one email address, so a new from-address doesn't guarantee that the sender is writing to you for the first time. It is not unusual for an old friend (especially if he is a hacker) to suddenly send you an email with a new from-address, so you can't risk false positives by filtering mail from unknown addresses especially stringently.

In a sense, though, my filters do themselves embody a kind of whitelist (and blacklist) because they are based on entire messages, including the headers. So to that extent they "know" the email addresses of trusted senders and even the routes by which mail gets from them to me. And they know the same about spam, including the server names, mailer versions, and protocols.

_

If I thought that I could keep up current rates of spam filtering, I would consider this problem solved. But it doesn't mean much to be able to filter out most present-day spam, because spam evolves. Indeed, most antispam techniques so far have been like pesticides that do nothing more than create a new, resistant strain of bugs. I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

Still, anyone who proposes a plan for spam filtering has to be able to answer the question: if the spammers knew exactly what you were doing, how well could they get past you? For example, I think that if checksum-based spam filtering becomes a serious obstacle, the spammers will just switch to mad-lib techniques for generating message bodies.

To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body.
I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.

Assuming they could solve the problem of the headers, the spam of the future will probably look something like this:

Hey there. Thought you should check out the following:
http://www.27meg.com/foo

because that is about as much sales pitch as content-based filtering will leave the spammer room to make. (Indeed, it will be hard even to get this past filters, because if everything else in the email is neutral, the spam probability will hinge on the url, and it will take some effort to make that look neutral.)

Spammers range from businesses running so-called opt-in lists who don't even try to conceal their identities, to guys who hijack mail servers to send out spams promoting porn sites. If we use filtering to whittle their options down to mails like the one above, that should pretty much put the spammers on the "legitimate" end of the spectrum out of business; they feel obliged by various state laws to include boilerplate about why their spam is not spam, and how to cancel your "subscription," and that kind of text is easy to recognize.

(I used to think it was naive to believe that stricter laws would decrease spam. Now I think that while stricter laws may not decrease the amount of spam that spammers send, they can certainly help filters to decrease the amount of spam that recipients actually see.)

All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (at best 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing. The cost is enormous for the recipients, about 5 man-weeks for each million recipients who spend a second to delete the spam, but the spammer doesn't have to pay that.

Sending spam does cost the spammer something, though. [2] So the lower we can get the response rate -- whether by filtering, or by using filters to force spammers to dilute their pitches -- the fewer businesses will find it worth their while to send spam.

The reason the spammers use the kinds of sales pitches that they do is to increase response rates. This is possibly even more disgusting than getting inside the mind of a spammer, but let's take a quick look inside the mind of someone who responds to a spam. This person is either astonishingly credulous or deeply in denial about their sexual interests. In either case, repulsive or idiotic as the spam seems to us, it is exciting to them. The spammers wouldn't say these things if they didn't sound exciting. And "thought you should check out the following" is just not going to have nearly the pull with the spam recipient as the kinds of things that spammers say now. Result: if it can't contain exciting sales pitches, spam becomes less effective as a marketing vehicle, and fewer businesses want to use it.

That is the big win in the end. I started writing spam filtering software because I didn't want to have to look at the stuff anymore. But if we get good enough at filtering out spam, it will stop working, and the spammers will actually stop sending it.

_

Of all the approaches to fighting spam, from software to laws, I believe Bayesian filtering will be the single most effective.
But I also think that the more different kinds of antispam efforts we undertake, the better, because any measure that constrains spammers will tend to make filtering easier. And even within the world of content-based filtering, I think it will be a good thing if there are many different kinds of software being used simultaneously. The more different filters there are, the harder it will be for spammers to tune spams to get through them.

Appendix: Examples of Filtering

Here is an example of a spam that arrived while I was writing this article. The fifteen most interesting words in this spam are:

qvp0045
indira
mx-05
intimail
$7500
freeyankeedom
cdo
bluefoxmedia
jpg
unsecured
platinum
3d0
qves
7c5
7c266675

The words are a mix of stuff from the headers and from the message body, which is typical of spam. Also typical of spam is that every one of these words has a spam probability, in my database, of .99. In fact there are more than fifteen words with probabilities of .99, and these are just the first fifteen seen.

Unfortunately that makes this email a boring example of the use of Bayes' Rule. To see an interesting variety of probabilities we have to look at this actually quite atypical spam.

The fifteen most interesting words in this spam, with their probabilities, are:

madam 0.99
promotion 0.99
republic 0.99
shortest 0.047225013
mandatory 0.047225013
standardization 0.07347802
sorry 0.08221981
supported 0.09019077
people's 0.09019077
enter 0.9075001
quality 0.8921298
organization 0.12454646
investment 0.8568143
very 0.14758544
valuable 0.82347786

This time the evidence is a mix of good and bad. A word like "shortest" is almost as much evidence for innocence as a word like "madam" or "promotion" is for guilt. But still the case for guilt is stronger. If you combine these numbers according to Bayes' Rule, the resulting probability is .9027.

"Madam" is obviously from spams beginning "Dear Sir or Madam." They're not very common, but the word "madam" never occurs in my legitimate email, and it's all about the ratio.

"Republic" scores high because it often shows up in Nigerian scam emails, and also occurs once or twice in spams referring to Korea and South Africa. You might say that it's an accident that it thus helps identify this spam. But I've found when examining spam probabilities that there are a lot of these accidents, and they have an uncanny tendency to push things in the right direction rather than the wrong one. In this case, it is not entirely a coincidence that the word "Republic" occurs in Nigerian scam emails and this spam. There is a whole class of dubious business propositions involving less developed countries, and these in turn are more likely to have names that specify explicitly (because they aren't) that they are republics.[3]

On the other hand, "enter" is a genuine miss. It occurs mostly in unsubscribe instructions, but here is used in a completely innocent way. Fortunately the statistical approach is fairly robust, and can tolerate quite a lot of misses before the results start to be thrown off.

For comparison, here is an example of that rare bird, a spam that gets through the filters. Why? Because by sheer chance it happens to be loaded with words that occur in my actual email:

perl 0.01
python 0.01
tcl 0.01
scripting 0.01
morris 0.01
graham 0.01491078
guarantee 0.9762507
cgi 0.9734398
paul 0.027040077
quite 0.030676773
pop3 0.042199217
various 0.06080265
prices 0.9359873
managed 0.06451222
difficult 0.071706355
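To make the arithmetic visible, here is what the combining formula does with these fifteen numbers (using the combined-probability helper from the earlier sketch; this is an illustration, not output from the actual filter):

(combined-probability
 '(0.01 0.01 0.01 0.01 0.01
   0.01491078 0.9762507 0.9734398 0.027040077 0.030676773
   0.042199217 0.06080265 0.9359873 0.06451222 0.071706355))
;; => a probability vanishingly close to 0 -- nowhere near the
;;    .9 threshold -- so the mail gets filed as nonspam.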
There are a couple pieces of good news here. First, this mail probably wouldn't get through the filters of someone who didn't happen to specialize in programming languages and have a good friend called Morris. For the average user, all the top five words here would be neutral and would not contribute to the spam probability.

Second, I think filtering based on word pairs (see below) might well catch this one: "cost effective", "setup fee", "money back" -- pretty incriminating stuff. And of course if they continued to spam me (or a network I was part of), "Hostex" itself would be recognized as a spam term.

Finally, here is an innocent email. Its fifteen most interesting words are as follows:

continuation 0.01
describe 0.01
continuations 0.01
example 0.033600237
programming 0.05214485
i'm 0.055427782
examples 0.07972858
color 0.9189189
localhost 0.09883721
hi 0.116539136
california 0.84421706
same 0.15981844
spot 0.1654587
us-ascii 0.16804294
what 0.19212411

Most of the words here indicate the mail is an innocent one. There are two bad smelling words, "color" (spammers love colored fonts) and "California" (which occurs in testimonials and also in menus in forms), but they are not enough to outweigh obviously innocent words like "continuation" and "example".

It's interesting that "describe" rates as so thoroughly innocent. It hasn't occurred in a single one of my 4000 spams. The data turns out to be full of such surprises. One of the things you learn when you analyze spam texts is how narrow a subset of the language spammers operate in. It's that fact, together with the equally characteristic vocabulary of any individual user's mail, that makes Bayesian filtering a good bet.

Appendix: More Ideas

One idea that I haven't tried yet is to filter based on word pairs, or even triples, rather than individual words. This should yield a much sharper estimate of the probability. For example, in my current database, the word "offers" has a probability of .96. If you based the probabilities on word pairs, you'd end up with "special offers" and "valuable offers" having probabilities of .99 and, say, "approach offers" (as in "this approach offers") having a probability of .1 or less.

The reason I haven't done this is that filtering based on individual words already works so well. But it does mean that there is room to tighten the filters if spam gets harder to detect. (A rough sketch of pair-based tokenization appears a little further down.)

(Curiously, a filter based on word pairs would be in effect a Markov-chaining text generator running in reverse.)

Specific spam features (e.g. not seeing the recipient's address in the to: field) do of course have value in recognizing spam. They can be considered in this algorithm by treating them as virtual words. I'll probably do this in future versions, at least for a handful of the most egregious spam indicators. Feature-recognizing spam filters are right in many details; what they lack is an overall discipline for combining evidence.

Recognizing nonspam features may be more important than recognizing spam features. False positives are such a worry that they demand extraordinary measures. I will probably in future versions add a second level of testing designed specifically to avoid false positives. If a mail triggers this second level of filters it will be accepted even if its spam probability is above the threshold.

I don't expect this second level of filtering to be Bayesian. It will inevitably be not only ad hoc, but based on guesses, because the number of false positives will not tend to be large enough to notice patterns. (It is just as well, anyway, if a backup system doesn't rely on the same technology as the primary system.)
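Here is a rough sketch of what the word-pair idea above could look like, building on the tokenize function from the earlier sketch. token-pairs is an invented name, and this is just one obvious way to do it:

(defun token-pairs (tokens)
  ;; adjacent pairs, each treated as a single token,
  ;; e.g. "special offers" or "approach offers"
  (loop for (a b) on tokens
        while b
        collect (concatenate 'string a " " b)))

;; e.g. (token-pairs (tokenize "this approach offers special offers"))
;; => ("this approach" "approach offers" "offers special" "special offers")
;;
;; The pairs would then be counted and scored exactly like single words,
;; perhaps alongside them: (append tokens (token-pairs tokens)).
;; Specific spam features could be folded in the same way, as made-up
;; "virtual word" tokens -- e.g. pushing something like "*no-to-address*"
;; onto the list when the recipient's address is missing from the to: field.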
Another thing I may try in the future is to focus extra attention on specific parts of the email. For example, about 95% of current spam includes the url of a site they want you to visit. (The remaining 5% want you to call a phone number, reply by email or to a US mail address, or in a few cases to buy a certain stock.) The url is in such cases practically enough by itself to determine whether the email is spam.

Domain names differ from the rest of the text in a (non-German) email in that they often consist of several words stuck together. Though computationally expensive in the general case, it might be worth trying to decompose them. If a filter has never seen the token "xxxporn" before it will have an individual spam probability of .4, whereas "xxx" and "porn" individually have probabilities (in my corpus) of .9889 and .99 respectively, and a combined probability of .9998. I expect decomposing domain names to become more important as spammers are gradually forced to stop using incriminating words in the text of their messages. (A url with an ip address is of course an extremely incriminating sign, except in the mail of a few sysadmins.)

It might be a good idea to have a cooperatively maintained list of urls promoted by spammers. We'd need a trust metric of the type studied by Raph Levien to prevent malicious or incompetent submissions, but if we had such a thing it would provide a boost to any filtering software. It would also be a convenient basis for boycotts.

Another way to test dubious urls would be to send out a crawler to look at the site before the user looked at the email mentioning it. You could use a Bayesian filter to rate the site just as you would an email, and whatever was found on the site could be included in calculating the probability of the email being a spam. A url that led to a redirect would of course be especially suspicious.

One cooperative project that I think really would be a good idea would be to accumulate a giant corpus of spam. A large, clean corpus is the key to making Bayesian filtering work well. Bayesian filters could actually use the corpus as input. But such a corpus would be useful for other kinds of filters too, because it could be used to test them.

Creating such a corpus poses some technical problems. We'd need trust metrics to prevent malicious or incompetent submissions, of course. We'd also need ways of erasing personal information (not just to-addresses and ccs, but also e.g. the arguments to unsubscribe urls, which often encode the to-address) from mails in the corpus. If anyone wants to take on this project, it would be a good thing for the world.

Appendix: Defining Spam

I think there is a rough consensus on what spam is, but it would be useful to have an explicit definition.
We'll need to do this if we want to establish a central corpus of spam, or even to compare spam filtering rates meaningfully.

To start with, spam is not unsolicited commercial email. If someone in my neighborhood heard that I was looking for an old Raleigh three-speed in good condition, and sent me an email offering to sell me one, I'd be delighted, and yet this email would be both commercial and unsolicited. The defining feature of spam (in fact, its raison d'etre) is not that it is unsolicited, but that it is automated.

It is merely incidental, too, that spam is usually commercial. If someone started sending mass email to support some political cause, for example, it would be just as much spam as email promoting a porn site.

I propose we define spam as unsolicited automated email. This definition thus includes some email that many legal definitions of spam don't. Legal definitions of spam, influenced presumably by lobbyists, tend to exclude mail sent by companies that have an "existing relationship" with the recipient. But buying something from a company, for example, does not imply that you have solicited ongoing email from them. If I order something from an online store, and they then send me a stream of spam, it's still spam.

Companies sending spam often give you a way to "unsubscribe," or ask you to go to their site and change your "account preferences" if you want to stop getting spam. This is not enough to stop the mail from being spam. Not opting out is not the same as opting in. Unless the recipient explicitly checked a clearly labelled box (whose default was no) asking to receive the email, then it is spam.

In some business relationships, you do implicitly solicit certain kinds of mail. When you order online, I think you implicitly solicit a receipt, and notification when the order ships. I don't mind when Verisign sends me mail warning that a domain name is about to expire (at least, if they are the actual registrar for it). But when Verisign sends me email offering a FREE Guide to Building My E-Commerce Web Site, that's spam.

Notes:

[1] The examples in this article are translated into Common Lisp for, believe it or not, greater accessibility. The application described here is one that we wrote in order to test a new Lisp dialect called Arc that is not yet released.

[2] Currently the lowest rate seems to be about $200 to send a million spams. That's very cheap, 1/50th of a cent per spam. But filtering out 95% of spam, for example, would increase the spammers' cost to reach a given audience by a factor of 20. Few can have margins big enough to absorb that.

[3] As a rule of thumb, the more qualifiers there are before the name of a country, the more corrupt the rulers. A country called The Socialist People's Democratic Republic of X is probably the last place in the world you'd want to live.

Thanks to Sarah Harlin for reading drafts of this; Daniel Giffin (who is also writing the production Arc interpreter) for several good ideas about filtering and for creating our mail infrastructure; Robert Morris, Trevor Blackwell and Erann Gat for many discussions about spam; Raph Levien for advice about trust metrics; and Chip Coldwell and Sam Steingold for advice about statistics.