2 Commits

Author SHA1 Message Date
Peter J. Holzer 631f97abe5 Avoid overlapping tokens
For each token used, record its first, second, and last thirds, and
exclude all tokens that contain any of them.
2019-08-17 11:12:34 +02:00
Peter J. Holzer f3817c4355 Implement basic idea
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.
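
A minimal sketch of this growth rule, not the code from this commit. It assumes that "extend by one character" means appending one character on the right, and that "previously seen" means seen in an earlier message. The names grow_vocabulary and vocabulary are hypothetical.

```python
def grow_vocabulary(vocabulary, text):
    """Return the tokens of `text` that qualify given the tokens known so far.

    Assumption: a token longer than one character qualifies only if the
    token without its last character is already in `vocabulary`.
    """
    found = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            token = text[start:end]
            if len(token) == 1 or token[:-1] in vocabulary:
                found.add(token)
            else:
                # Longer tokens from this start would extend an unknown
                # token, so stop growing here.
                break
    return found


# Tokens grow by at most one character per message seen.
vocabulary = set()
for message in ["abab", "abc"]:
    vocabulary |= grow_vocabulary(vocabulary, message)
```

After the first message only "a" and "b" are known; the second message can then add "ab" and "bc", but not yet "abc".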

Probability computation follows Paul Graham's "A Plan for Spam", except
that I haven't implemented some of his tweaks (most importantly, I don't
account for frequencies within a message like he does).
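
For reference, a sketch of the per-token and combined probabilities as described in "A Plan for Spam". It shows Graham's formula, not necessarily what this commit implements: which tweaks (the doubled weight for good counts, the minimum of 5 occurrences, the clamping to 0.01 and 0.99) were kept is not stated here. The function and parameter names are hypothetical.

```python
from math import prod


def token_spam_probability(good_count, bad_count, n_good, n_bad,
                           good_weight=2, min_occurrences=5):
    """Spam probability of one token, following Graham's essay.

    good_count / bad_count: occurrences in the good and spam corpora.
    n_good / n_bad: number of good and spam messages.
    Returns None for tokens that are too rare to judge.
    """
    weighted_good = good_weight * good_count
    if weighted_good + bad_count < min_occurrences:
        return None
    g = min(1.0, weighted_good / n_good)
    b = min(1.0, bad_count / n_bad)
    # Clamp so that a single token can never decide the verdict alone.
    return max(0.01, min(0.99, b / (g + b)))


def combined_spam_probability(probabilities):
    """Combine per-token probabilities into one probability for the message."""
    spam = prod(probabilities)
    ham = prod(1 - p for p in probabilities)
    return spam / (spam + ham)
```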

While selecting tokens for judging a message, I ignore substrings of
tokens that have already been selected. Even so, most of the selected
tokens still overlap, which is probably not good.
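
One way this selection could look, as a sketch under assumptions: tokens are ranked by how far their probability lies from 0.5 and at most 15 are kept (both taken from Graham's essay, not from this commit), and a token is skipped if it is a substring of one already chosen. select_tokens and its parameters are hypothetical names.

```python
def select_tokens(probabilities, limit=15):
    """Pick the most interesting tokens, skipping substrings of chosen ones.

    probabilities: mapping from token to its spam probability.
    """
    ranked = sorted(probabilities,
                    key=lambda token: abs(probabilities[token] - 0.5),
                    reverse=True)
    selected = []
    for token in ranked:
        # Skip tokens fully contained in a token that was already chosen.
        if any(token in chosen for chosen in selected):
            continue
        selected.append(token)
        if len(selected) == limit:
            break
    return selected
```

This check only catches full containment. Two tokens that share part of their characters without one containing the other, such as "viag" and "agra", both pass, which may be why most tokens still overlap.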
2019-08-17 09:29:11 +02:00