Commit Graph

6 Commits

Author SHA1 Message Date
Peter J. Holzer d96d1fc96e Improve overlap avoidance (#1)
When a feature is used, we split the input string in which it was found
at that feature and search only the resulting fragments for subsequent
features, so overlaps are impossible (see the sketch after this entry).
2019-09-14 11:01:24 +02:00
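
A minimal sketch of this splitting approach, with hypothetical names
(the actual implementation is not shown in this log):

    def split_at_feature(fragments, feature):
        """Replace every fragment containing `feature` with the pieces
        around each occurrence, so later searches cannot overlap it."""
        result = []
        for frag in fragments:
            if feature in frag:
                result.extend(piece for piece in frag.split(feature) if piece)
            else:
                result.append(frag)
        return result

    fragments = ["cheap viagra now"]
    fragments = split_at_feature(fragments, "viagra")
    # fragments == ['cheap ', ' now']; a token like "agra" can no longer match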
Peter J. Holzer e51294bca2 Add option --verbose 2019-09-01 15:19:23 +02:00
Peter J. Holzer c49d6847f3 Write used evidence to database 2019-08-27 22:38:00 +02:00
Peter J. Holzer e6a4ba72f1 Smooth limits of spam probability
Instead of clipping the probability to [0.01, 0.99], we just add 1 to
each side. With my current corpus size this results in very similar
limits (they will creep closer to 0 and 1 with a larger corpus, but
never reach them) while avoiding lots of tokens with exactly the same
probability. This makes the selection by judge_message less random and
more relevant (it prefers tokens which have been seen more frequently).
One possible reading of the formula is sketched after this entry.
2019-08-17 11:32:59 +02:00
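
One plausible reading of "add 1 to each side", shown as a sketch; the
exact formula is not given in this log, and the counts and names here
are assumptions:

    def token_probability(spam_count, ham_count, nspam, nham):
        """Smoothed per-token spam probability: adding 1 to both the
        spam and the ham side keeps the result strictly inside (0, 1)
        and lets frequently seen tokens drift closer to the limits."""
        spam_ratio = (spam_count + 1) / (nspam + 1)
        ham_ratio = (ham_count + 1) / (nham + 1)
        return spam_ratio / (spam_ratio + ham_ratio)

    # A token seen 3 times, only in spam, scores lower than one seen
    # 30 times, so judging can prefer the better-attested token.
    print(token_probability(3, 0, 100, 100))   # 0.8
    print(token_probability(30, 0, 100, 100))  # ~0.97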
Peter J. Holzer 631f97abe5 Avoid overlapping tokens
For each used token, record its first, second and last third and
exclude all candidate tokens which contain any of them (sketched after
this entry).
2019-08-17 11:12:34 +02:00
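
A sketch of the thirds heuristic, assuming "include" means substring
containment (names are hypothetical):

    used_parts = set()

    def thirds(token):
        """First, second and last third of a token (lengths rounded down)."""
        k = max(1, len(token) // 3)
        return {token[:k], token[k:2 * k], token[-k:]}

    def remember(token):
        used_parts.update(thirds(token))

    def acceptable(candidate):
        """Reject any candidate containing a third of a used token."""
        return not any(part in candidate for part in used_parts)

    remember("viagra")          # records {'vi', 'ag', 'ra'}
    print(acceptable("iagr"))   # False - contains 'ag'
    print(acceptable("cheap"))  # True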
Peter J. Holzer f3817c4355 Implement basic idea
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.

Probability computation follows Paul Graham's "A Plan for Spam", except
that I haven't implemented some of his tweaks (most importantly, I don't
account for token frequencies within a message like he does). A rough
sketch of the scheme follows this entry.

While selecting tokens for judging a message, I ignore substrings of
tokens that have been seen previously. This still results in the
majority of tokens overlapping, which is probably not good.
2019-08-17 09:29:11 +02:00
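
A rough sketch of both halves of the basic idea, under stated
assumptions: token growth is keyed on the prefix one character shorter,
and judging uses Graham's naive-Bayes combination. All names are
hypothetical:

    from math import prod

    def grow_tokens(text, seen):
        """Tokens worth recording for `text`: every single character,
        plus any longer substring whose prefix one character shorter is
        already in `seen` (it extends a previously seen token)."""
        tokens = set()
        for i in range(len(text)):
            tokens.add(text[i])
            for j in range(i + 2, len(text) + 1):
                if text[i:j - 1] not in seen:
                    break
                tokens.add(text[i:j])
        return tokens

    def judge(probs):
        """Combine selected token probabilities as in "A Plan for Spam"."""
        return prod(probs) / (prod(probs) + prod(1 - p for p in probs))

    print(sorted(grow_tokens("spam", {"s", "p", "a", "m", "sp"})))
    # ['a', 'am', 'm', 'p', 'pa', 's', 'sp', 'spa'] - 'spa' extends 'sp'
    print(judge([0.97, 0.8, 0.2]))  # 0.97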