There’s been a fairly quiet debate in the spam community for some time as to the effectiveness of “Bayesian poisoning”.
As you probably know, Bayesian filtering is a method proposed back in the late 90s to filter junk email, and popularized by Paul Graham in his 2002 essay, “A Plan for Spam”. (If you’re rusty on your higher math skills, the term Bayesian refers to a family of methods for determining probability, named after the 18th-century mathematician Thomas Bayes.)
Bayesian filtering relies on “training” an engine to recognize the probability of something being spam or not spam. It’s implemented in a variety of antispam products, and is a supplemental antispam method used in our own iHateSpam desktop product (but not in our server product).
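To make the idea concrete, here’s a minimal sketch of that kind of scoring in Python. The token probabilities below are made up for illustration; a real filter learns them by training on your actual mail, and implementations differ in how they combine tokens (Graham’s version, for instance, uses only the most “interesting” ones).

```python
# Minimal sketch of naive Bayesian spam scoring (hypothetical token
# probabilities; a real engine learns these from a training corpus).

def spam_probability(tokens, token_probs, unseen=0.4):
    """Combine per-token spam probabilities via Bayes' rule,
    assuming tokens are independent (the 'naive' assumption)."""
    p_spam, p_ham = 1.0, 1.0
    for t in tokens:
        p = token_probs.get(t, unseen)  # default for never-seen tokens
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Hypothetical values a trained engine might hold: P(spam | token appears)
probs = {"viagra": 0.99, "free": 0.90, "meeting": 0.05, "invoice": 0.10}

print(spam_probability(["viagra", "free"], probs))      # near 1.0: spam
print(spam_probability(["meeting", "invoice"], probs))  # near 0.0: ham
```

The more you train it on your own spam and legitimate mail, the better those per-token probabilities fit your traffic.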
The idea behind Bayesian poisoning is that by throwing a bunch of innocent-looking words into a message, a spammer can confuse the Bayes probability engine. That’s why you see emails with things like the works of Charles Dickens in them — they are trying to confuse both Bayesian filters and the signature-based engines.
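Here’s a sketch of why the padding can work, using the same kind of made-up token probabilities as before. Against a naive scorer that combines every token, appending words the filter has only ever seen in legitimate mail drags the score toward ham (note that smarter schemes, like Graham’s, combine only the most extreme tokens, which blunts exactly this trick):

```python
# Sketch of Bayesian poisoning against a naive all-token scorer
# (hypothetical probabilities; real engines vary in how they combine).

def spam_probability(tokens, token_probs, unseen=0.4):
    p_spam, p_ham = 1.0, 1.0
    for t in tokens:
        p = token_probs.get(t, unseen)
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# "copperfield" and "chuzzlewit" are Dickens words the filter has only
# ever seen in legitimate mail, so their spam probability is low.
probs = {"viagra": 0.99, "copperfield": 0.05, "chuzzlewit": 0.05}

plain = spam_probability(["viagra"], probs)
poisoned = spam_probability(["viagra", "copperfield", "chuzzlewit"], probs)
print(plain, poisoned)  # the padded message scores much less spammy
```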
But does Bayesian poisoning work? John Graham-Cumming of the POPFile project decided to actually find out (note that POPFile itself uses Bayesian filtering, so there is some potential for bias). His conclusion? Bayesian poisoning is real, but not that big of a deal.
“The evidence suggests that Bayesian poisoning is real, but either impractical or defeatable. At the same time the number of published attack methods indicates that Bayesian poisoning should not be dismissed and that further research is needed to ensure that successful attacks and countermeasures are discovered before spammers discover the same ways around statistical spam filtering.”
Off the cuff, I think Bayesian poisoning is real. However, it’s a question of scale.
If a corporate email server is processing 100,000 spam messages a day (probably about average for a company with 1,500 employees) and there’s a slight change in the probability that lets, say, a tenth of a percent more spam through, that’s 100 extra pieces of spam that got into the organization. That’s a small number, but spammers deal in small numbers: a hundred million messages advertising herbal Viagra might result in only 50 sales (or a small spike in a stock price), yet when you’re using the bandwidth of other people’s machines (through botnets/spambots), sending them is dirt cheap.
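The back-of-the-envelope arithmetic above, spelled out:

```python
# The scale argument in numbers (figures from the paragraph above).
daily_spam = 100_000   # spam hitting a ~1,500-seat corporate server per day
leak_rate = 0.001      # poisoning lets an extra 0.1% slip through
print(daily_spam * leak_rate)  # 100 extra spam messages a day
```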
And there may also be a time factor involved. A massive attack of the works of Charles Dickens alters the probabilities only slightly, but possibly for quite a while. When you’re dealing with probabilities on a large scale, you will start to see a difference. This is the problem the pharmaceutical business deals with all the time: they run a small clinical trial and miss a small effect (or ignore it). Then the drug gets used by millions of people and we start to see people dying, committing suicide or growing a third leg. The number may only be a few tenths of a percent, but there’s a large population that’s affected.
We’ve also found that our own Bayes engine in iHateSpam gets “corrupted” after a while and has to be reset, and we think poisoning is the cause. I believe Bayesian filtering absolutely has a place in spam filtering, but it’s not the only solution.
I’m curious to know your thoughts.