Instead of clipping the probability to [0.01, 0.99], I just add 1 to each
side (the spam count and the ham count). With my current corpus size this
results in very similar limits (they will creep closer to 0 and 1 with a
larger corpus, but never reach them) while avoiding lots of tokens ending
up with exactly the same probability. This makes the selection by
judge_message less random and more relevant (it prefers tokens that have
been seen more frequently).
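In rough Python, the per-token probability then looks something like the
sketch below; the names and the exact placement of the +1 are illustrative,
not the real code.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    # Relative frequency of the token in each corpus, with 1 added to each
    # side so neither frequency is ever exactly zero.
    spam_freq = (spam_count + 1) / (n_spam + 1)
    ham_freq = (ham_count + 1) / (n_ham + 1)
    # Strictly between 0 and 1; with a larger corpus the attainable
    # extremes creep closer to 0 and 1, and frequently seen tokens get
    # more extreme values than rare ones instead of all hitting the same
    # clipped limit.
    return spam_freq / (spam_freq + ham_freq)
```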
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.
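A sketch of how the token set can grow this way; the `known_tokens`
argument and the extend-at-the-end assumption are illustrative, the real
tokenizer may differ in details.

```python
def extract_tokens(message, known_tokens):
    found = set()
    for start in range(len(message)):
        length = 1
        while start + length <= len(message):
            candidate = message[start:start + length]
            # Single characters are always tokens; anything longer is only
            # kept if the token one character shorter is already known.
            if length > 1 and candidate[:-1] not in known_tokens:
                break
            found.add(candidate)
            length += 1
    return found
```

After training on a message, `known_tokens` would be extended with the
tokens found in it, so the next message can grow them by one more
character.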
Probability computation follows Paul Graham's "A Plan for Spam", except
that I haven't implemented some of his tweaks (most importantly, I don't
account for token frequencies within a message the way he does).
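For reference, the combination step from that essay multiplies the
per-token probabilities of the selected tokens; a minimal sketch, assuming
the standard naive-Bayes combination and nothing specific to my code:

```python
from math import prod  # Python 3.8+

def combine_probabilities(token_probs):
    # token_probs: spam probabilities of the tokens picked for judging
    # (Graham uses the most "interesting" ones, i.e. furthest from 0.5).
    p_spam = prod(token_probs)
    p_ham = prod(1.0 - p for p in token_probs)
    return p_spam / (p_spam + p_ham)
```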
While selecting tokens for judging a message, I ignore substrings of
tokens that have been seen previously. This still leaves the majority of
the selected tokens overlapping each other, which is probably not good.
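One plausible reading of that filter, as a sketch (the candidate order and
the exact substring test are assumptions):

```python
def select_for_judging(candidates):
    selected = []
    for token in candidates:
        # Skip a candidate that is contained in a token already selected
        # for this message; partially overlapping tokens still get through,
        # which is where the remaining overlap comes from.
        if any(token != kept and token in kept for kept in selected):
            continue
        selected.append(token)
    return selected
```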