Instead of retrieving from the database only the features which
actually occur in the message, retrieve all of them above a certain
interestingness threshold (0.4, because that's the minimum I've
observed so far) and then match them in-process.
This seems to be a little faster but not by much. May have to revisit if
my database grows.
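
A minimal sketch of the query-then-match idea, assuming an sqlite3
database with a hypothetical features(text, interestingness) table
(the names are illustrative, not the actual schema):

    import sqlite3

    INTERESTINGNESS_THRESHOLD = 0.4  # minimum interestingness observed so far

    def candidate_features(conn: sqlite3.Connection) -> list[str]:
        # One query for every sufficiently interesting feature, instead
        # of one lookup per feature occurring in the message.
        rows = conn.execute(
            "SELECT text FROM features WHERE interestingness >= ?",
            (INTERESTINGNESS_THRESHOLD,),
        )
        return [text for (text,) in rows]

    def features_in(message: str, features: list[str]) -> list[str]:
        # The matching itself happens in-process, with plain substring tests.
        return [f for f in features if f in message]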
When a feature matches, we use it to split the input string in which
it was found and run subsequent feature searches only on the resulting
fragments. So overlaps are impossible.
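
Roughly how the splitting can work (a sketch with made-up names;
features are assumed to be tried one after another):

    def match_without_overlaps(message: str, features: list[str]) -> list[str]:
        # Text consumed by one feature is never searched again, because
        # only the fragments around the match are kept for later features.
        fragments = [message]
        matched = []
        for feature in features:
            next_fragments = []
            hit = False
            for fragment in fragments:
                if feature in fragment:
                    hit = True
                    next_fragments.extend(fragment.split(feature))
                else:
                    next_fragments.append(fragment)
            if hit:
                matched.append(feature)
            fragments = next_fragments
        return matched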
Instead of clipping the probability to [0.01, 0.99] we just add 1 to
each side. With my current corpus size this results in very similar
limits (they will creep closer to 0 and 1 with a larger corpus, but
never reach them) while avoiding lots of tokens sharing exactly the
same probability. This makes the selection by judge_message less
random and more relevant (it prefers tokens which have been seen more
frequently).
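
A sketch of how I read the "add 1 to each side" smoothing: one
pseudo-count for the spam side and one for the ham side of the ratio
(the real computation may differ in detail):

    def token_probability(spam_count: int, ham_count: int) -> float:
        # One pseudo-observation on each side keeps the result strictly
        # inside (0, 1) without a hard clip at 0.01 / 0.99; tokens with
        # more observations end up further from 0.5.
        return (spam_count + 1) / (spam_count + ham_count + 2)

Under this reading a token seen once in spam and never in ham comes
out at 2/3, one seen ten times at 11/12, so frequently seen tokens get
the more extreme, and therefore more interesting, probabilities.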
I start with tokens of length 1, and add longer tokens iff they extend a
previously seen token by one character.
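
A sketch of the growth rule, under the assumption that "previously
seen" means tokens already stored in the database (known_tokens below
is hypothetical):

    def extract_tokens(message: str, known_tokens: set[str]) -> set[str]:
        # Always emit single characters; emit a longer token only when
        # everything but its last character is an already known token.
        tokens = set()
        for i, ch in enumerate(message):
            tokens.add(ch)
            length = 2
            while True:
                candidate = message[i:i + length]
                if len(candidate) < length:
                    break
                if candidate[:-1] not in known_tokens:
                    break
                tokens.add(candidate)
                length += 1
        return tokens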
Probability computation follows Paul Graham's "A Plan for Spam",
except that I haven't implemented some of his tweaks (most
importantly, I don't account for frequencies within a message as he
does).
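
For reference, the Graham-style combination the per-token
probabilities feed into (a generic rendering of the formula from the
essay, not a copy of the code here):

    from math import prod

    def combined_probability(token_probs: list[float]) -> float:
        # Naive Bayes combination as in "A Plan for Spam":
        # p = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
        p = prod(token_probs)
        q = prod(1 - x for x in token_probs)
        return p / (p + q)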
While selecting tokens for judging a message, I ignore substrings of
tokens that have been seen previously. This still leaves the majority
of tokens overlapping, which is probably not good.
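
A sketch of that selection filter, with an assumed pre-sort by
interestingness and a made-up limit:

    def select_for_judging(candidates: list[str], limit: int = 15) -> list[str]:
        # candidates are assumed sorted most interesting first.
        chosen: list[str] = []
        for token in candidates:
            # Skip tokens that are substrings of an already selected
            # token; superstrings still get through, hence the
            # remaining overlaps.
            if any(token in earlier for earlier in chosen):
                continue
            chosen.append(token)
            if len(chosen) == limit:
                break
        return chosen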