Smooth limits of spam probability

Instead of clipping the probability at [0.01, 0.99] we just add 1 to
each side. With my current corpus size this results in very similar
limits (they will creep closer to 0 and 1 with a larger corpus, but
never reach them) while avoiding having lots of tokens with exactly the
same probability. This makes the selection by judge_message less random
and more relevant (it prefers tokens which have been seen more
frequently).
This commit is contained in:
Peter J. Holzer 2019-08-17 11:32:59 +02:00
parent 631f97abe5
commit e6a4ba72f1
1 changed files with 4 additions and 17 deletions

View File

@ -41,37 +41,24 @@ csr.execute(
select
type, length, feature,
spam_count, ham_count,
spam_count::float8 / spam_message_count as spam_ratio,
ham_count::float8 / ham_message_count as ham_ratio
(spam_count + 1.0) / (spam_message_count + 1.0) as spam_ratio,
(ham_count + 1.0) / (ham_message_count + 1.0) as ham_ratio
from f, m
),
p as (
select
type, length, feature,
spam_count, ham_count,
case
when spam_count + ham_count > 4 then spam_ratio / (spam_ratio + ham_ratio)
end as spam_prob
spam_ratio / (spam_ratio + ham_ratio) as spam_prob
from f1
),
p1 as (
select
type, length, feature,
spam_count, ham_count,
case
when spam_prob < 0.01 then 0.01
when spam_prob > 0.99 then 0.99
else spam_prob
end as spam_prob
from p
),
p2 as (
select
type, length, feature,
spam_count, ham_count,
spam_prob,
abs(spam_prob - 0.5) as interesting
from p1
from p
)
select * from p2
order by interesting desc