Instead of retrieving from the database only those features which
actually occur in the message, retrieve all of them above a certain
interestingness threshold (0.4, because that's the minimum I've
observed so far) and then match them in-process.
This seems to be a little faster but not by much. May have to revisit if
my database grows.
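In rough Python terms the new matching looks something like the sketch
below; the SQLite schema and the names (features, token,
interestingness) are only illustrative, not the real ones.

    import sqlite3

    INTERESTINGNESS_THRESHOLD = 0.4

    def load_interesting_features(db_path):
        # One query up front instead of one lookup per candidate token.
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT token, probability FROM features"
            " WHERE interestingness >= ?",
            (INTERESTINGNESS_THRESHOLD,))
        features = {token: prob for token, prob in rows}
        conn.close()
        return features

    def match_features(message, features):
        # Plain substring matching, done in-process against the cached set.
        return {tok: p for tok, p in features.items() if tok in message}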
When a feature matches, we split the fragment of the input string in
which it was found at the match and use the resulting pieces for
subsequent feature searches, so overlaps are impossible.
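A minimal sketch of that splitting step (function and variable names
are mine):

    def consume_feature(fragments, feature):
        # Replace each fragment containing the feature with the text on
        # either side of the first match, so later features can only
        # match in the leftover pieces.
        remaining = []
        found = False
        for frag in fragments:
            if feature in frag:
                found = True
                left, _, right = frag.partition(feature)
                if left:
                    remaining.append(left)
                if right:
                    remaining.append(right)
            else:
                remaining.append(frag)
        return found, remaining

For example, consume_feature(["cheap viagra now"], "viagra") returns
(True, ["cheap ", " now"]), and the consumed feature can no longer be
found in what remains.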
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.
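In code the growing rule might look like this, assuming "previously
seen" means already present in the stored token set (the names are
illustrative):

    def extract_tokens(text, known_tokens):
        # Length-1 tokens are always recorded; a longer token is only
        # recorded when the token it extends by one character was
        # already known before this message.
        new_tokens = set()
        for i in range(len(text)):
            new_tokens.add(text[i])
            j = i + 1
            while j < len(text) and text[i:j] in known_tokens:
                new_tokens.add(text[i:j + 1])
                j += 1
        return new_tokens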
Probability computation follows Paul Graham's "A Plan for Spam", except
that I haven't implemented some of his tweaks (most importantly, I don't
account for frequencies within a message like he does).
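For reference, the per-token and combined probabilities from "A Plan
for Spam" look roughly like this (good counts doubled, results clamped
to [0.01, 0.99]; this sketch assumes non-empty ham and spam corpora and
is not a copy of my actual code):

    def token_probability(good_count, bad_count, ngood, nbad):
        # Graham counts each ham occurrence twice to bias against
        # false positives; tokens seen fewer than 5 times are ignored.
        g = 2 * good_count
        b = bad_count
        if g + b < 5:
            return None
        pb = min(1.0, b / nbad)
        pg = min(1.0, g / ngood)
        return max(0.01, min(0.99, pb / (pg + pb)))

    def combined_probability(probs):
        # Naive-Bayes-style combination of the selected token probabilities.
        prod = 1.0
        inv = 1.0
        for p in probs:
            prod *= p
            inv *= 1.0 - p
        return prod / (prod + inv)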
While selecting tokens for judging a message, I ignore any token that
is a substring of a previously seen token. The majority of the selected
tokens still overlap, which is probably not good.
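One way to read that selection step (the interpretation and the
function name are mine; "previously seen" here is taken to mean already
selected in the same pass):

    def select_tokens(candidates):
        # Check longer candidates first so that shorter ones can be
        # dropped when they are substrings of an already selected token.
        selected = []
        for tok in sorted(candidates, key=len, reverse=True):
            if not any(tok in kept for kept in selected):
                selected.append(tok)
        return selected

This only filters exact substring containment, so partially overlapping
tokens still get through, which matches the observation above.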