Commit Graph

8 Commits

Author SHA1 Message Date
Peter J. Holzer f4983e2472 Read all features for component in one query
Instead of retrieving from the database only those features which actually
occur in the message, retrieve all of them above a certain
interestingness threshold (0.4, because that's the minimum I've observed
so far) and then match them in-process.

This seems to be a little faster but not by much. May have to revisit if
my database grows.
2019-09-14 15:13:50 +02:00
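
A minimal sketch of the one-query approach, with invented table and
column names (features, token, interestingness) and sqlite3 standing in
for whatever database the project actually uses:

    # Fetch every feature above the interestingness threshold in one
    # query, then match candidates against the message in memory,
    # instead of issuing one query per token found in the message.
    # Schema and names are hypothetical.
    import sqlite3

    MIN_INTERESTINGNESS = 0.4  # minimum observed so far, per the commit message

    def interesting_features(conn: sqlite3.Connection) -> dict[str, float]:
        cur = conn.execute(
            "select token, interestingness from features"
            " where interestingness >= ?",
            (MIN_INTERESTINGNESS,),
        )
        return dict(cur.fetchall())

    def match_features(message: str, features: dict[str, float]) -> dict[str, float]:
        # Substring matching now happens in-process; no further queries.
        return {tok: score for tok, score in features.items() if tok in message}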
Peter J. Holzer e6dab8395f Add option --no-used-evidence 2019-09-14 12:09:36 +02:00
Peter J. Holzer d96d1fc96e Improve overlap avoidance (#1)
When a feature is used, we split the input string in which it was found
at that feature and search only the resulting fragments for subsequent
features, so overlaps are impossible.
2019-09-14 11:01:24 +02:00
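
A rough sketch of that splitting scheme (the function name and the
list-of-fragments representation are my own; only the idea comes from
the commit message):

    # Once a feature is used, split the fragment it was found in
    # around the match. Later features can only match entirely inside
    # one of the remaining fragments, so they can never overlap a
    # feature that was already used.
    def consume_feature(fragments: list[str], feature: str) -> list[str]:
        out = []
        for frag in fragments:
            pos = frag.find(feature)
            if pos < 0:
                out.append(frag)
            else:
                out.append(frag[:pos])
                out.append(frag[pos + len(feature):])
        return out

The search would start with fragments = [message]; each used feature
shrinks the text available to later searches.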
Peter J. Holzer e51294bca2 Add option --verbose 2019-09-01 15:19:23 +02:00
Peter J. Holzer c49d6847f3 Write used evidence to database 2019-08-27 22:38:00 +02:00
Peter J. Holzer e6a4ba72f1 Smooth limits of spam probability
Instead of clipping the probability to [0.01, 0.99] we just add 1 to
each side. With my current corpus size this results in very similar
limits (they will creep closer to 0 and 1 with a larger corpus, but
never reach them) while avoiding having lots of tokens with exactly the
same probability. This makes the selection by judge_message less random
and more relevant (it prefers tokens which have been seen more
frequently).
2019-08-17 11:32:59 +02:00
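
One plausible reading of "add 1 to each side" is add-one (Laplace)
smoothing of the per-token spam probability; the actual formula in the
code isn't shown here, so this sketch is an assumption:

    # Smoothed per-token spam probability. "good" and "bad" are the
    # counts of the token in ham and spam. The old behaviour clipped
    # bad / (good + bad) to [0.01, 0.99]; the smoothed version never
    # reaches 0 or 1, and a token seen often in only one class gets a
    # probability further from 0.5 than a rarely seen one.
    def spam_probability(good: int, bad: int) -> float:
        return (bad + 1) / (good + bad + 2)

For example, a token seen 3 times, only in spam, gets 4/5 = 0.8 rather
than a clipped 0.99; at 99 spam sightings it creeps up to 100/101 ≈ 0.99.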
Peter J. Holzer 631f97abe5 Avoid overlapping tokens
For each used token, record its first, middle and last thirds and
exclude all tokens which contain any of those substrings.
2019-08-17 11:12:34 +02:00
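
A sketch of that exclusion rule as I read it (helper names are
invented; the original code isn't shown):

    # For each token that has been used, remember its first, middle
    # and last thirds. A candidate token that contains any recorded
    # third is treated as overlapping a used token and is excluded.
    def thirds(token: str) -> list[str]:
        n = len(token)
        return [token[: n // 3], token[n // 3 : 2 * n // 3], token[2 * n // 3 :]]

    def overlaps(candidate: str, used_thirds: set[str]) -> bool:
        # Guard against empty thirds from very short tokens, which
        # would otherwise match everything.
        return any(t and t in candidate for t in used_thirds)

After each token is used, the caller would do
used_thirds.update(thirds(token)).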
Peter J. Holzer f3817c4355 Implement basic idea
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.

Probability computation follows Paul Graham's "A Plan for Spam", except
that I haven't implemented some of his tweaks (most importantly, I don't
account for frequencies within a message like he does).

While selecting tokens for judging a message, I ignore substrings of
tokens that have been seen previously. This still leaves the majority
of tokens overlapping, which is probably not good.
2019-08-17 09:29:11 +02:00
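
A sketch of the token-growing idea, under two assumptions of mine: that
"extend" means appending one character to the end of a known token, and
that "previously seen" means seen in earlier messages (so tokens grow
across messages, not within one):

    # Length-1 tokens are always counted. A longer token is counted
    # only if the token it extends by one character (its prefix,
    # under my assumption) is already known from earlier messages.
    def extract_tokens(message: str, known: set[str], max_len: int = 20) -> set[str]:
        found = set()
        for i in range(len(message)):
            for length in range(1, max_len + 1):
                tok = message[i : i + length]
                if len(tok) < length:
                    break  # ran past the end of the message
                if length == 1 or tok[:-1] in known:
                    found.add(tok)
                else:
                    break  # can't extend an unknown token any further
        return found

After training on a message, known |= found, so the next message can
grow each surviving token by one more character.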