January 2003

(This article was given as a talk at the 2003 Spam Conference. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)

The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don't cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after "A Plan for Spam" [1] was on Slashdot.

Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].

When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn't everyone using it? When I read the papers I found out why. Pantel and Lin's filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.

When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It's always alarming when two people trying the same experiment get widely divergent results. It's especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.

So why did we get such different numbers? I haven't tried to reproduce Pantel and Lin's results, but from reading the paper I see five things that probably account for the difference.

One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with datasets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.

But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn't know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don't ignore data. You'd think this lesson would be too obvious to mention, but I've had to learn it several times.

Third, Pantel and Lin "stemmed" the tokens, meaning they reduced e.g. both "mailing" and "mailed" to the root "mail". They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.

Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you'll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.

Finally, they didn't bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double.
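In code, that per-token calculation looks something like the following. This is a minimal Python sketch along the lines of "A Plan for Spam", not the filter's actual source; the names for the corpus statistics (good_counts, bad_counts, ngood, nbad) are mine.

    # Per-token spam probability, with nonspam occurrences counted double.
    # good_counts/bad_counts map token -> occurrence count; ngood/nbad are
    # the number of nonspam and spam messages in the corpora.
    def token_probability(token, good_counts, bad_counts, ngood, nbad):
        g = 2 * good_counts.get(token, 0)   # the bias against false positives
        b = bad_counts.get(token, 0)
        if g + b < 5:
            return None                     # too rare to say anything about
        p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
        return max(0.01, min(0.99, p))

Doubling g is the knob: turn it up further and you trade filtering rate for fewer false positives.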
I don't think it's a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.

Tokens

Another project I heard about after the Slashdot article was Bill Yerazunis' CRM114 [5]. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.

Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I'll see how far I can get with single words. And the answer is, surprisingly far.

Mostly I've been working on smarter tokenization. On current spam, I've been able to achieve filtering rates that approach CRM114's. These techniques are mostly orthogonal to Bill's; an optimal solution might incorporate both.

"A Plan for Spam" uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case.

Now I have a more complicated definition of a token:

1. Case is preserved.

2. Exclamation points are constituent characters.

3. Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact.

4. A price range like $20-25 yields two tokens, $20 and $25.

5. Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. "foo" in the Subject line becomes "Subject*foo". (The asterisk could be any character you don't allow as a constituent.)
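Here is a rough Python sketch of a tokenizer along these lines. It is an approximation of the rules above rather than the filter's actual code, and the function and parameter names are mine.

    import re

    # Constituent characters: letters, digits, dashes, apostrophes, dollar
    # signs, and exclamation points; periods and commas are (approximately)
    # kept when they sit between digits, so ip addresses and prices survive.
    # Case is preserved.
    TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+(?:[.,][0-9][A-Za-z0-9$'!-]*)*")

    def tokenize(text, context=None):
        tokens = []
        for tok in TOKEN_RE.findall(text):
            # A price range like $20-25 yields two tokens, $20 and $25.
            m = re.fullmatch(r"\$(\d+)-(\d+)", tok)
            parts = ["$" + m.group(1), "$" + m.group(2)] if m else [tok]
            for part in parts:
                # Mark tokens from the To, From, Subject, Return-Path
                # lines, or from urls.
                tokens.append(context + "*" + part if context else part)
        return tokens

    # e.g. tokenize("FREE!! only $20-25", context="Subject")
    #   -> ['Subject*FREE!!', 'Subject*only', 'Subject*$20', 'Subject*$25']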
Such measures increase the filter's vocabulary, which makes it more discriminating. For example, in the current filter, "free" in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.

Here are some of the current probabilities [6]:

  Subject*FREE    0.9999
  free!!          0.9999
  To*free         0.9998
  Subject*free    0.9782
  free!           0.9199
  Free            0.9198
  Url*free        0.9091
  FREE            0.8747
  From*free       0.7636
  free            0.6546

In the Plan for Spam filter, all these tokens would have had the same probability, .7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.

The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.

One solution to this is what I call degeneration. If you can't find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don't find a probability for "Subject*free!", I look for probabilities for "Subject*free", "free!", and "free", and take whichever one is farthest from .5.
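A simplified sketch of that lookup in Python follows. It skips the initial-caps variants shown in the list below, and the probs table and the .4 default for unseen tokens are my stand-ins, not values taken from this article.

    # Degeneration: if a token has no recorded probability, fall back to
    # less specific forms and take whichever probability is farthest from .5.
    def degenerate(token):
        prefix, _, word = token.rpartition("*")   # split off "Subject*" etc.
        prefix = prefix + "*" if prefix else ""
        forms = []
        for w in (word, word.rstrip("!")):        # drop terminal !s
            for case in (w, w.lower()):           # drop uppercase
                for p in (prefix, ""):            # drop the context marking
                    cand = p + case
                    if cand not in forms:
                        forms.append(cand)
        return forms

    def lookup(token, probs, unknown=0.4):
        if token in probs:
            return probs[token]
        known = [probs[f] for f in degenerate(token) if f in probs]
        return max(known, key=lambda p: abs(p - 0.5)) if known else unknown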
Here are the alternatives [7] considered if the filter sees "FREE!!!" in the Subject line and doesn't have a probability for it.

  Subject*Free!!!
  Subject*free!!!
  Subject*FREE!
  Subject*Free!
  Subject*free!
  Subject*FREE
  Subject*Free
  Subject*free
  FREE!!!
  Free!!!
  free!!!
  FREE!
  Free!
  free!
  FREE
  Free
  free

If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative mood, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of "Act" is 98% and for "act" only 62%.
If you increase your filter's vocabulary, you can end up counting the same word multiple times, according to your old definition of "same". Logically, they're not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you'd want to.

Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from .5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.
For example, the token "dalco" occurs 3 times in my spam corpus and never in my legitimate corpus. The token "Url*optmails" (meaning "optmails" within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of .99.

That doesn't feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven't tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is .9999 if they occur more than 10 times and .9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.

I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.
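A sketch of both adjustments in Python; the names are mine, and I've assumed the "ditto" at the low end means .0001 and .0002.

    # Tokens seen in only one corpus get one of two threshold values,
    # so the frequently seen ones sort ahead of the one-offs.
    def threshold_probability(spam_count, good_count):
        if good_count == 0 and spam_count > 0:
            return 0.9999 if spam_count > 10 else 0.9998
        if spam_count == 0 and good_count > 0:
            return 0.0001 if good_count > 10 else 0.0002
        return None    # seen in both corpora: use the normal calculation

    # The 15 most interesting tokens are the ones farthest from .5.
    def most_interesting(token_probs, n=15):
        return sorted(token_probs.items(),
                      key=lambda item: abs(item[1] - 0.5),
                      reverse=True)[:n]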
Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.

Finally, what should one do about html? I've tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it's full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.

I could probably be smarter about dealing with html, but I don't think it's worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.
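Here is one way the middle course might look in Python; a crude sketch using regexes rather than a real html parser.

    import re

    # Keep the contents of a, img, and font tags (urls and attributes are
    # useful spam signs), drop all other markup, keep the visible text.
    INTERESTING_TAG = re.compile(r"<(?:a|img|font)\b[^>]*>", re.I)
    ANY_TAG = re.compile(r"<[^>]+>")

    def reduce_html(html):
        kept = " ".join(INTERESTING_TAG.findall(html))
        visible = ANY_TAG.sub(" ", html)
        return kept + " " + visible       # feed both to the tokenizer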
Performance

Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That's a filtering rate of about 99.75%.

Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.

The third was one of those that exploit an insecure cgi script to send mail to third parties. They're hard to filter based just on the content because the headers are innocent and they're careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of .88, just under the threshold of .9.

Of course, looking at multiple token sequences would catch it easily. "Below is the result of your feedback form" is an instant giveaway.

The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)

If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it's low enough, it won't pay for spammers to send it, and we won't have to work too hard on filtering it.

Now for the really shocking news: during that same one-month period I got three false positives.

In a way it's a relief to get some false positives. When I wrote "A Plan for Spam" I hadn't had any, and I didn't know what they'd be like. Now that I've had a few, I'm relieved to find they're not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].

Two of the false positives were newsletters from companies I've bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn't been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.

The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn't have caught it.

It's hard to say what the overall false positive rate is, because we're up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it's hard to say whether they're spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.

Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo's mail servers it had the most incriminating headers imaginable. Arguably this isn't a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.

Not counting these, I've had a total of five false positives so far, out of about 7740 legitimate emails, a rate of .06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.

I don't think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.

False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.

So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.

So what to make of this .06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.

Future

What next? Filtering is an optimization problem, and the key to optimization is profiling. Don't try to guess where your code is slow, because you'll guess wrong. Look at where your code is slow, and fix that.
In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.

For example, spammers are now working aggressively to evade filters, and one of the things they're doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].

There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they're the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.

The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I'm a programmer too, and the spams are full of the same words as my real mail.

I'll probably focus on the personal ad type first. I think if I look closer I'll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn't do that [11].

The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn't find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can't be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.

Will statistical filtering actually get us to that point? I don't know. Right now, for me personally, spam is not a problem. But spammers haven't yet made a serious effort to spoof statistical filters. What will happen when they do?

I'm not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through Spamassassin and tell you whether it will get filtered out.

Network-level filters won't be completely useless. They may be enough to kill all the "opt-in" spam, meaning spam from companies like Virtumundo and Equalamail who claim that they're really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they'd like to send though, which is something.)

The kind of filters I'm optimistic about are ones that calculate probabilities based on each individual user's mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient's email address base-64 encoded anywhere in a message is a very good spam indicator.
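A check along those lines might decode base-64-looking runs in the body and search them for the address. This is a sketch; the address and the names here are placeholders.

    import base64
    import re

    B64_RUN = re.compile(r"[A-Za-z0-9+/=]{20,}")

    # Spam sign: the recipient's address appears base-64 encoded somewhere
    # in the message, e.g. inside an encoded body part or a tracking url.
    def address_base64_encoded(body, address="user@example.com"):
        for run in B64_RUN.findall(body):
            try:
                decoded = base64.b64decode(run + "=" * (-len(run) % 4))
            except Exception:
                continue
            if address.encode() in decoded:
                return True
        return False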
But the real advantage of individual filters is that they'll all be different. If everyone's filters have different probabilities, it will make the spammers' optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they'll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn't wish that on anyone.

Notes

[1] Paul Graham. "A Plan for Spam." August 2002. http://paulgraham.com/spam.html.

Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.

The first assumption is widespread in text classification. Algorithms that use it are called "naive Bayesian."
The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both .5, they cancel out and you can remove them from the formula.

If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you'd have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.
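To make the cancellation concrete, here is the combining step as a short Python sketch. With prior_spam = prior_good = .5 the priors divide out, leaving the simpler formula; with unequal priors you would keep them.

    from math import prod

    # Combine per-token probabilities under the independence assumption.
    def combined_probability(probs, prior_spam=0.5, prior_good=0.5):
        s = prior_spam * prod(probs)
        g = prior_good * prod(1 - p for p in probs)
        return s / (s + g)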
[2] Patrick Pantel and Dekang Lin. "SpamCop -- A Spam Classification & Organization Program." Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. "A Bayesian Approach to Filtering Junk E-Mail." Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us .03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.

[5] Bill Yerazunis. "Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator." Proceedings of 2003 Spam Conference.

[6] In "A Plan for Spam" I used thresholds of .99 and .01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use .9999 and .0001.

[7] There is a flaw here I should probably fix. Currently, when "Subject*foo" degenerates to just "foo", what that means is you're getting the stats for occurrences of "foo" in the body or header lines other than those I mark. What I should do is keep track of statistics for "foo" overall as well as specific versions, and degenerate from "Subject*foo" not to "foo" but to "Anywhere*foo". Ditto for case: I should degenerate from uppercase to any-case, not lowercase.

It would probably be a win to do this with prices too, e.g. to degenerate from "$129.99" to "$--9.99", "$--.99", and "$--".

You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.

[8] Steven Hauser. "Statistical Spam Filter Works for Me." http://www.sofbot.com.

[9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn't mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that's near spam, but for blacklists nearness is physical, and for filters it's textual.

[10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren't visible in the original text would in itself be evidence of spam.
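A naive sketch of that idea in Python; dictionary here is a set of known lowercase words, and greedy longest-match is the simplest thing that could work.

    import re

    # Strip whitespace and punctuation, then greedily pull known words
    # back out of the squeezed-together letters.
    def recovered_words(text, dictionary, max_len=20):
        squeezed = re.sub(r"[^a-z]", "", text.lower())
        words, i = [], 0
        while i < len(squeezed):
            for j in range(min(len(squeezed), i + max_len), i + 2, -1):
                if squeezed[i:j] in dictionary:
                    words.append(squeezed[i:j])
                    i = j
                    break
            else:
                i += 1
        return words

    # e.g. recovered_words("v.i.a.g.r.a", {"viagra"}) -> ['viagra']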
Picking out the words won't be trivial. It will require more than just reconstructing word boundaries; spammers both add ("xHot nPorn cSite") and omit ("P#rn") letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.

[11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don't allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, "dick" has a spam probability of .9999, but it's also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.

[12] This is what approaches like Brightmail's will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.

[13] It's sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don't want to start over from scratch. But you can't dictate the problem to fit your solution. Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.

Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.