Smooth limits of spam probability
Instead of clipping the probability to [0.01, 0.99], we add 1 to both the numerator and the denominator of each ratio (spam_count / spam_message_count and ham_count / ham_message_count). With my current corpus size this gives very similar limits (they creep closer to 0 and 1 as the corpus grows, but never reach them) while avoiding lots of tokens with exactly the same probability. This makes the selection by judge_message less random and more relevant: it prefers tokens that have been seen more frequently.
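To illustrate the effect of the smoothing, here is a minimal Python sketch of the same arithmetic as the new SQL expressions. The function name smoothed_spam_prob and the corpus size of 100 spam and 100 ham messages are assumptions made for this example only; they are not taken from the repository.

```python
def smoothed_spam_prob(spam_count, ham_count, spam_messages, ham_messages):
    """Spam probability of one token, with 1 added to both sides of each ratio."""
    spam_ratio = (spam_count + 1.0) / (spam_messages + 1.0)
    ham_ratio = (ham_count + 1.0) / (ham_messages + 1.0)
    return spam_ratio / (spam_ratio + ham_ratio)


# Hypothetical corpus size, chosen only for illustration.
SPAM_MESSAGES = 100
HAM_MESSAGES = 100

# Three tokens that never occur in ham. Without smoothing, each has a raw
# probability of 1.0, so clipping would assign all of them exactly 0.99.
rare = smoothed_spam_prob(5, 0, SPAM_MESSAGES, HAM_MESSAGES)      # 6/7, about 0.857
common = smoothed_spam_prob(50, 0, SPAM_MESSAGES, HAM_MESSAGES)   # 51/52, about 0.981
always = smoothed_spam_prob(100, 0, SPAM_MESSAGES, HAM_MESSAGES)  # 101/102, about 0.990

# The mirror case: a token in every ham message and no spam message.
never = smoothed_spam_prob(0, 100, SPAM_MESSAGES, HAM_MESSAGES)   # 1/102, about 0.0098

print(rare, common, always, never)
```

With this assumed corpus size the extreme values land close to the old 0.01 and 0.99 limits, but tokens seen more often end up further from 0.5, so ordering by abs(spam_prob - 0.5) ranks them first instead of leaving a large tie.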
parent 631f97abe5
commit e6a4ba72f1
aggregate: 21 lines changed
@@ -41,37 +41,24 @@ csr.execute(
 select
 type, length, feature,
 spam_count, ham_count,
-spam_count::float8 / spam_message_count as spam_ratio,
-ham_count::float8 / ham_message_count as ham_ratio
+(spam_count + 1.0) / (spam_message_count + 1.0) as spam_ratio,
+(ham_count + 1.0) / (ham_message_count + 1.0) as ham_ratio
 from f, m
 ),
 p as (
 select
 type, length, feature,
 spam_count, ham_count,
-case
-when spam_count + ham_count > 4 then spam_ratio / (spam_ratio + ham_ratio)
-end as spam_prob
+spam_ratio / (spam_ratio + ham_ratio) as spam_prob
 from f1
 ),
-p1 as (
-select
-type, length, feature,
-spam_count, ham_count,
-case
-when spam_prob < 0.01 then 0.01
-when spam_prob > 0.99 then 0.99
-else spam_prob
-end as spam_prob
-from p
-),
 p2 as (
 select
 type, length, feature,
 spam_count, ham_count,
 spam_prob,
 abs(spam_prob - 0.5) as interesting
-from p1
+from p
 )
 select * from p2
 order by interesting desc