bayes

Commit Graph

Author	SHA1	Message	Date
Peter J. Holzer	f4983e2472	Read all features for component in one query: Instead of retrieving from the database only the features which actually occur in the message, retrieve all of them above a certain interestingness threshold (0.4, because that's the minimum I've observed so far) and then match them in-process. This seems to be a little faster, but not by much. May have to revisit if my database grows.	2019-09-14 15:13:50 +02:00
Peter J. Holzer	e6dab8395f	Add option --no-used-evidence	2019-09-14 12:09:36 +02:00
Peter J. Holzer	d96d1fc96e	Improve overlap avoidance (#1): When a feature is used, we use it to split the input string in which it was found and use the fragments for subsequent feature searches. So overlaps are impossible.	2019-09-14 11:01:24 +02:00
Peter J. Holzer	e51294bca2	Add option --verbose	2019-09-01 15:19:23 +02:00
Peter J. Holzer	c49d6847f3	Write used evidence to database	2019-08-27 22:38:00 +02:00
Peter J. Holzer	631f97abe5	Avoid overlapping tokens: For each used token, record its first, second and last third, and exclude all tokens which include any of those.	2019-08-17 11:12:34 +02:00
Peter J. Holzer	f3817c4355	Implement basic idea: I start with tokens of length 1, and add longer tokens iff they extend a previously seen token by one character. Probability computation follows Paul Graham's "A Plan for Spam", except that I haven't implemented some of his tweaks (most importantly, I don't account for frequencies within a message like he does). While selecting tokens for judging a message, I ignore substrings of tokens that have been seen previously. This still results in the majority of tokens overlapping, which is probably not good.	2019-08-17 09:29:11 +02:00
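
The probability computation mentioned in f3817c4355 can be sketched roughly as follows. This is an illustration based on the formulas in "A Plan for Spam", not code taken from this repository: the function names, the per-message counting, the fallback of 0.4 for unseen tokens and the limit of 15 tokens are assumptions made for the example.

```python
from math import prod


def token_probability(bad, good, nbad, ngood):
    """Graham-style spam probability for a single token.

    bad / good:   number of spam / ham messages containing the token
                  (per-message counts are an assumption; Graham counts
                  occurrences)
    nbad / ngood: total number of spam / ham messages seen so far
    """
    if bad + good == 0:
        return 0.4  # assumed fallback for tokens never seen before
    g = 2 * good  # ham counts are doubled to bias against false positives
    b_rate = min(1.0, bad / nbad)
    g_rate = min(1.0, g / ngood)
    p = b_rate / (g_rate + b_rate)
    # Clamp so that no single token can decide the result on its own.
    return max(0.01, min(0.99, p))


def combined_probability(probs, n=15):
    """Combine the n most interesting token probabilities.

    "Interesting" means far away from the neutral value 0.5.
    """
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = prod(top)
    ham = prod(1 - p for p in top)
    return spam / (spam + ham)
```

A message would then be scored by computing `token_probability` for each selected token and passing the results to `combined_probability`; values close to 1 suggest spam, values close to 0 suggest ham.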

7 Commits
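
The overlap avoidance described in d96d1fc96e (split the input string at each used feature and search only the remaining fragments) might look something like the sketch below. It is a guess at the general shape, not the repository's implementation: `select_features` is a hypothetical name, and the assumption that candidates arrive ordered from most to least interesting is mine.

```python
def select_features(text, candidates):
    """Pick non-overlapping features from text.

    candidates: feature strings, assumed to be ordered from most to
                least interesting.
    Returns the list of used features and the leftover fragments.
    """
    fragments = [text]
    used = []
    for feature in candidates:
        if not feature:
            continue  # an empty string would match everywhere
        for i, frag in enumerate(fragments):
            pos = frag.find(feature)
            if pos < 0:
                continue
            used.append(feature)
            # Replace the fragment with the pieces before and after the
            # match, so later features cannot overlap this one.
            before = frag[:pos]
            after = frag[pos + len(feature):]
            fragments[i:i + 1] = [part for part in (before, after) if part]
            break  # each feature is used at most once in this sketch
    return used, fragments
```

For example, with the text `"buy cheap pills now"` and the candidates `["cheap pills", "pills", "buy", "y ch"]`, the feature `"cheap pills"` is used first, which leaves the fragments `"buy "` and `" now"`. Neither `"pills"` nor `"y ch"` can match any more, because the characters they would need are no longer part of any fragment.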