A Plan for Spam
August 2002
*(This article describes the spam-filtering techniques
used in the spamproof web-based mail reader we
built to exercise Arc. An
improved algorithm is described in Better
Bayesian Filtering.)*
I think it's possible to stop spam, and that
content-based filters are the way to do it.
The Achilles heel of the spammers is their message.
They can circumvent any other barrier you set up. They have so far, at
least. But they have to deliver their message, whatever it
is. If we can write software that recognizes their messages,
there is no way they can get around that.
_ _ _
To the recipient, spam is easily recognizable. If you hired
someone to read your mail and discard the spam, they would
have little trouble doing it. How much do we have
to do, short of AI, to automate this process?
I think we will be able to solve the problem with fairly
simple algorithms. In fact, I've found that you can filter
present-day spam acceptably well using nothing more than a
Bayesian combination of the spam probabilities of individual
words. Using a slightly tweaked (as described below) Bayesian
filter, we now miss less than 5 per 1000 spams, with 0 false positives.
The statistical approach is not usually the first one people
try when they write spam filters. Most hackers' first instinct is
to try to write software that recognizes individual properties of
spam. You look at spams
and you think, the gall of these guys to try sending me mail
that begins "Dear Friend" or has a subject line that's all
uppercase and ends in eight exclamation points. I can filter
out that stuff with about one line of code.
And so you do,
and in the beginning it works. A few simple rules will take
a big bite out of your incoming spam. Merely looking
for the word "click" will catch 79.7% of the
emails in my spam corpus, with only 1.2% false positives.
I spent about six months writing software that looked for
individual spam features before I tried the statistical
approach. What I found was that recognizing that last few
percent of spams got very hard, and that as I
made the filters stricter I got more false positives.
False positives are innocent emails that get mistakenly
identified as spams.
For most users,
missing legitimate email is
an order of magnitude worse than receiving spam, so a
filter that yields false positives is like an acne cure
that carries a risk of death to the patient.
The more spam a user gets, the less
likely he'll be to notice one innocent mail sitting in his
spam folder. And strangely enough, the better your spam filters get,
the more dangerous false positives become, because when the
filters are really good, users will be more likely to
ignore everything they catch.
I don't know why I avoided trying the statistical approach
for so long. I think it was because I got addicted to
trying to identify spam features myself, as if I were playing
some kind of competitive game with the spammers. (Nonhackers
don't often realize this, but most hackers are very competitive.)
When I did try statistical analysis, I
found immediately that it was much cleverer than I had been.
It discovered, of course, that terms like "virtumundo" and
"teens" were good indicators of spam. But it also
discovered that "per" and "FL" and "ff0000" are good
indicators of spam. In fact, "ff0000" (html for bright red)
turns out to be as good an indicator of spam as any
pornographic term.
_ _ _
Here's a sketch of how I do statistical filtering. I start
with one corpus of spam and one of nonspam mail. At the
moment each one has about 4000 messages in it. I scan
the entire text, including headers and embedded html
and javascript, of each message in each corpus.
I currently consider alphanumeric characters,
dashes, apostrophes, and dollar signs to be part of tokens,
and everything else to be a token separator. (There is
probably room for improvement here.) I ignore tokens that
are all digits, and I also ignore html comments, not even
considering them as token separators.
I count the number
of times each token (ignoring case, currently) occurs in
each corpus. At this stage I end up with two large hash
tables, one for each corpus, mapping tokens to number
of occurrences.
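As a sketch, the tokenizing and counting steps just described might look like this in Python (the regex and function names are my own illustration, not code from the mail reader):

```python
import re
from collections import Counter

# Token characters per the rules above: alphanumerics, dashes,
# apostrophes, and dollar signs; everything else separates tokens.
TOKEN_RE = re.compile(r"[a-z0-9$'-]+")
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def tokens(text):
    # Delete html comments outright; they don't even count as separators,
    # so text on either side of a comment fuses into one token.
    text = COMMENT_RE.sub("", text)
    # Ignore case, and drop tokens that are all digits.
    return [t for t in TOKEN_RE.findall(text.lower()) if not t.isdigit()]

def count_corpus(messages):
    # One occurrence table per corpus, mapping token -> count.
    counts = Counter()
    for msg in messages:
        counts.update(tokens(msg))
    return counts
```

Running `count_corpus` once over the spam corpus and once over the nonspam corpus yields the two hash tables of occurrence counts.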
Next I create a third hash table, this time mapping
each token to the probability that an email containing it is a spam,
which I calculate as follows [1]:
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
   (unless (< (+ g b) 5)
     (max .01
          (min .99 (float (/ (min 1 (/ b nbad))
                             (+ (min 1 (/ g ngood))
                                (min 1 (/ b nbad)))))))))
where word is the token whose probability we're
calculating, good and bad are the hash tables
I created in the first step, and ngood and nbad
are the number of nonspam and spam messages respectively.
I explained this as code to show a couple of important details.
I want to bias the probabilities slightly to avoid false
positives, and by trial and error I've found that a good
way to do it is to double all the numbers in good.
This helps to distinguish between words that occasionally
do occur in legitimate email and words that almost never do.
I only consider words that occur more than five times in
total (actually, because of the doubling, occurring three
times in nonspam mail would be enough). And then there is
the question of what probability to assign to words that
occur in one corpus but not the other. Again by trial and
error I chose .01 and .99. There may be room for tuning
here, but as the corpus grows such tuning will happen
automatically anyway.
The especially observant will notice that while I consider
each corpus to be a single long stream of text for purposes
of counting occurrences, I use the number of emails in
each, rather than their combined length, as the divisor
in calculating spam probabilities. This adds another
slight bias to protect against false positives.
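In Python, the same calculation (a transcription of the Lisp above, with the doubling of good counts, the five-occurrence cutoff, and the .01/.99 clamps; the function name is mine) might read:

```python
def word_prob(word, good, bad, ngood, nbad):
    # good/bad map token -> occurrence count in each corpus;
    # ngood/nbad are the number of messages in each corpus.
    g = 2 * good.get(word, 0)   # double good counts to bias against false positives
    b = bad.get(word, 0)
    if g + b < 5:               # too rare to say anything about
        return None
    spam_freq = min(1.0, b / nbad)
    good_freq = min(1.0, g / ngood)
    return max(0.01, min(0.99, spam_freq / (good_freq + spam_freq)))
```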
When new mail arrives, it is scanned into tokens, and
the most interesting fifteen tokens, where interesting is
measured by how far their spam probability is from a
neutral .5, are used to calculate the probability that
the mail is spam. If probs
is a list of the fifteen individual probabilities, you
calculate the
combined probability thus:
(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x))
                                     probs)))))
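A Python rendering of the selection-and-combination step (helper names mine) might be:

```python
def interesting(token_probs, n=15):
    # The n probabilities farthest from a neutral .5.
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def combined_prob(probs):
    # Same combination as the Lisp snippet: ab / (ab + (1-a)(1-b) ...).
    prod, inv = 1.0, 1.0
    for p in probs:
        prod *= p
        inv *= 1 - p
    return prod / (prod + inv)
```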
One question that arises in
practice is what probability to assign to a word you've
never seen, i.e. one that doesn't occur in the hash table
of word probabilities. I've found, again by trial and
error, that .4 is a good number to use. If you've never
seen a word before, it is probably fairly innocent; spam
words tend to be all too familiar.
There are examples of this algorithm being applied to
actual emails in an appendix at the end.
I treat mail as spam if the algorithm above gives it a
probability of more than .9 of being spam. But in practice
it would not matter much where I put this threshold, because
few probabilities end up in the middle of the range.
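Putting the pieces together, a classification sketch (assuming a `probs` table mapping token to spam probability; the structure and names are my own) could look like:

```python
UNKNOWN = 0.4    # probability assigned to words never seen before
THRESHOLD = 0.9  # mail scoring above this is treated as spam

def is_spam(msg_tokens, probs):
    # Look up each token (unknown words default to .4), keep the
    # fifteen probabilities farthest from a neutral .5, and combine them.
    ps = sorted((probs.get(t, UNKNOWN) for t in msg_tokens),
                key=lambda p: abs(p - 0.5), reverse=True)[:15]
    prod, inv = 1.0, 1.0
    for p in ps:
        prod *= p
        inv *= 1 - p
    return prod / (prod + inv) > THRESHOLD
```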
_ _ _
One great advantage of the statistical approach is that you
don't have to read so many spams. Over the past six months,
I've read literally thousands of spams, and it is really
kind of demoralizing. Norbert Wiener said if you compete
with slaves you become a slave, and there is something
similarly degrading about competing with spammers. To
recognize individual spam features you have to try to get
into the mind of the spammer, and frankly I want to spend
as little time inside the minds of spammers as possible.
[...]