Instead of retrieving from the database only the features which
actually occur in the message, retrieve all of them above a certain
interestingness threshold (0.4, because that's the minimum I've
observed so far) and then match them in-process.
This seems to be a little faster but not by much. May have to revisit if
my database grows.
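
A minimal sketch of the query-then-match idea, assuming an sqlite3
database with a hypothetical features(text, interestingness) table
(the names are illustrative, not the actual schema):

    import sqlite3

    INTERESTINGNESS_THRESHOLD = 0.4  # minimum interestingness observed so far

    def candidate_features(conn: sqlite3.Connection) -> list[str]:
        # One query for every sufficiently interesting feature, instead
        # of one lookup per feature occurring in the message.
        rows = conn.execute(
            "SELECT text FROM features WHERE interestingness >= ?",
            (INTERESTINGNESS_THRESHOLD,),
        )
        return [text for (text,) in rows]

    def features_in(message: str, features: list[str]) -> list[str]:
        # The matching itself happens in-process, with plain substring tests.
        return [f for f in features if f in message]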
When a feature matches, we use it to split the input string in which
it was found and run subsequent feature searches only on the resulting
fragments. So overlaps are impossible.
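
Roughly how the splitting can work (a sketch with made-up names;
features are assumed to be tried one after another):

    def match_without_overlaps(message: str, features: list[str]) -> list[str]:
        # Text consumed by one feature is never searched again, because
        # only the fragments around the match are kept for later features.
        fragments = [message]
        matched = []
        for feature in features:
            next_fragments = []
            hit = False
            for fragment in fragments:
                if feature in fragment:
                    hit = True
                    next_fragments.extend(fragment.split(feature))
                else:
                    next_fragments.append(fragment)
            if hit:
                matched.append(feature)
            fragments = next_fragments
        return matched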
Instead of clipping the probability to [0.01, 0.99] we just add 1 to
each side. With my current corpus size this results in very similar
limits (they will creep closer to 0 and 1 with a larger corpus, but
never reach them) while avoiding lots of tokens sharing exactly the
same probability. This makes the selection by judge_message less
random and more relevant (it prefers tokens which have been seen more
frequently).
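
A sketch of how I read the "add 1 to each side" smoothing: one
pseudo-count for the spam side and one for the ham side of the ratio
(the real computation may differ in detail):

    def token_probability(spam_count: int, ham_count: int) -> float:
        # One pseudo-observation on each side keeps the result strictly
        # inside (0, 1) without a hard clip at 0.01 / 0.99; tokens with
        # more observations end up further from 0.5.
        return (spam_count + 1) / (spam_count + ham_count + 2)

Under this reading a token seen once in spam and never in ham comes
out at 2/3, one seen ten times at 11/12, so frequently seen tokens get
the more extreme, and therefore more interesting, probabilities.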
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.
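
A sketch of the growth rule, under the assumption that "previously
seen" means tokens already stored in the database (known_tokens below
is hypothetical):

    def extract_tokens(message: str, known_tokens: set[str]) -> set[str]:
        # Always emit single characters; emit a longer token only when
        # everything but its last character is an already known token.
        tokens = set()
        for i, ch in enumerate(message):
            tokens.add(ch)
            length = 2
            while True:
                candidate = message[i:i + length]
                if len(candidate) < length:
                    break
                if candidate[:-1] not in known_tokens:
                    break
                tokens.add(candidate)
                length += 1
        return tokens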
Probability computation follows Paul Graham's "A Plan for Spam",
except that I haven't implemented some of his tweaks (most
importantly, I don't account for frequencies within a message as he
does).
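
For reference, the Graham-style combination the per-token
probabilities feed into (a generic rendering of the formula from the
essay, not a copy of the code here):

    from math import prod

    def combined_probability(token_probs: list[float]) -> float:
        # Naive Bayes combination as in "A Plan for Spam":
        # p = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
        p = prod(token_probs)
        q = prod(1 - x for x in token_probs)
        return p / (p + q)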
While selecting tokens for judging a message, I ignore substrings of
tokens that have been seen previously. This still leaves the majority
of tokens overlapping, which is probably not good.
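
A sketch of that selection filter, with an assumed pre-sort by
interestingness and a made-up limit:

    def select_for_judging(candidates: list[str], limit: int = 15) -> list[str]:
        # candidates are assumed sorted most interesting first.
        chosen: list[str] = []
        for token in candidates:
            # Skip tokens that are substrings of an already selected
            # token; superstrings still get through, hence the
            # remaining overlaps.
            if any(token in earlier for earlier in chosen):
                continue
            chosen.append(token)
            if len(chosen) == limit:
                break
        return chosen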