August 2003

We may be able to improve the accuracy of Bayesian spam filters by having them follow links to see what's waiting at the other end. Richard Jowsey of death2spam now does this in borderline cases, and reports that it works well.

Why only do it in borderline cases? And why only do it once?

As I mentioned in Will Filters Kill Spam?, following all the urls in a spam would have an amusing side-effect. If popular email clients did this in order to filter spam, the spammer's servers would take a serious pounding. The more I think about this, the better an idea it seems. This isn't just amusing; it would be hard to imagine a more perfectly targeted counterattack on spammers.

So I'd like to suggest an additional feature to those working on spam filters: a "punish" mode which, if turned on, would spider every url in a suspected spam n times, where n could be set by the user. [1]

As many people have noted, one of the problems with the current email system is that it's too passive. It does whatever you tell it. So far all the suggestions for fixing the problem seem to involve new protocols. This one wouldn't.

If widely used, auto-retrieving spam filters would make the email system rebound. The huge volume of the spam, which has so far worked in the spammer's favor, would now work against him, like a branch snapping back in his face. Auto-retrieving spam filters would drive the spammer's costs up, and his sales down: his bandwidth usage would go through the roof, and his servers would grind to a halt under the load, which would make them unavailable to the people who would have responded to the spam. Pump out a million emails an hour, get a million hits an hour on your servers.

We would want to ensure that this is only done to suspected spams. As a rule, any url sent to millions of people is likely to be a spam url, so submitting every http request in every email would work fine nearly all the time. But there are a few cases where this isn't true: the urls at the bottom of mails sent from free email services like Yahoo Mail and Hotmail, for example.

To protect such sites, and to prevent abuse, auto-retrieval should be combined with blacklists of spamvertised sites. Only sites on a blacklist would get crawled, and sites would be blacklisted only after being inspected by humans. The lifetime of a spam must be several hours at least, so it should be easy to update such a list in time to interfere with a spam promoting a new site. [2]

High-volume auto-retrieval would only be practical for users on high-bandwidth connections, but there are enough of those to cause spammers serious trouble. Indeed, this solution neatly mirrors the problem. The problem with spam is that in order to reach a few gullible people the spammer sends mail to everyone. The non-gullible recipients are merely collateral damage. But the non-gullible majority won't stop getting spam until they can stop (or threaten to stop) the gullible from responding to it. Auto-retrieving spam filters offer them a way to do this.

Would that kill spam? Not quite. The biggest spammers could probably protect their servers against auto-retrieving filters. However, the easiest and cheapest way for them to do it would be to include working unsubscribe links in their mails. And this would be a necessity for smaller fry, and for "legitimate" sites that hired spammers to promote them. So if auto-retrieving filters became widespread, they'd become auto-unsubscribing filters.
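To make the proposal concrete, here is a minimal sketch of a punish mode in Python. It assumes a human-maintained blacklist of spamvertised hosts and a spam probability already computed by the filter; the names, the threshold, and the example host are made up for illustration, not taken from any existing filter.

    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Hypothetical, human-maintained blacklist of spamvertised hosts.
    SPAMVERTISED = {"spamvertised.example.com"}

    def punish(urls, n, spam_probability, threshold=0.9):
        """Spider every blacklisted url in a suspected spam n times."""
        if spam_probability < threshold:
            return                      # only punish suspected spams
        for url in urls:
            host = urlparse(url).hostname or ""
            if host not in SPAMVERTISED:
                continue                # protects Yahoo Mail, Hotmail, etc.
            for _ in range(n):          # n is set by the user
                try:
                    urlopen(url, timeout=30).read()
                except OSError:
                    pass                # an unreachable server is the point

A filter running in punish mode would simply call this on each message it classified as spam, with whatever n the user chose.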
In this scenario, spam would, like OS crashes, viruses, and popups, become one of those plagues that only afflict people who don't bother to use the right software.

Notes

[1] Auto-retrieving filters will have to follow redirects, and should in some cases (e.g. a page that just says "click here") follow more than one level of links. Make sure too that the http requests are indistinguishable from those of popular Web browsers, including the order and referrer. If the response doesn't come back within x amount of time, default to some fairly high spam probability.

Instead of making n constant, it might be a good idea to make it a function of the number of spams that have been seen mentioning the site. This would add a further level of protection against abuse and accidents.

[2] The original version of this article used the term "whitelist" instead of "blacklist". Though they were to work like blacklists, I preferred to call them whitelists because it might make them less vulnerable to legal attack. This just seems to have confused readers, though.

There should probably be multiple blacklists. A single point of failure would be vulnerable both to attack and abuse.

Thanks to Brian Burton, Bill Yerazunis, Dan Giffin, Eric Raymond, and Richard Jowsey for reading drafts of this.
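Note [1] is concrete enough to sketch as well. The Python below is only a rough illustration, with made-up header values and a made-up formula for n: the headers mimic a popular browser (though a real implementation would also have to match their order, which urllib does not control), a slow or missing response defaults to a fairly high spam probability, and n grows with the number of spams seen mentioning the site.

    import math
    from urllib.request import Request, urlopen

    # Headers borrowed from a popular browser; placeholder values only.
    BROWSER_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml",
        "Referer": "https://www.google.com/",
    }

    def retrieval_spam_probability(url, timeout=30):
        """Fetch a url (urlopen follows redirects by itself); if no response
        comes back in time, default to a fairly high spam probability."""
        try:
            urlopen(Request(url, headers=BROWSER_HEADERS), timeout=timeout).read()
            return None      # got a page back; let the text filter judge it
        except OSError:
            return 0.9       # no timely response: default to probably-spam

    def retrieval_count(spams_seen_mentioning_site):
        """Make n a function of how many spams mention the site, rather than
        a constant, as a further guard against abuse and accidents."""
        return min(10, int(math.log2(1 + spams_seen_mentioning_site)))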