Smooth limits of spam probability

Instead of clipping the probability to [0.01, 0.99], we add 1 to both the numerator and the denominator of the spam and ham ratios. With my current corpus size this results in very similar limits (they creep closer to 0 and 1 as the corpus grows, but never reach them) while avoiding lots of tokens with exactly the same probability. This makes the selection by judge_message less random and more relevant: among tokens that would otherwise tie, it prefers those which have been seen more frequently.
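
To illustrate the effect outside the database, here is a minimal Python sketch of both calculations. The function names and the corpus sizes (1000 spam and 1000 ham messages) are assumptions made up for this example; only the formulas are taken from the diff.

```python
from typing import Optional


def clipped_spam_prob(spam_count: int, ham_count: int,
                      spam_messages: int, ham_messages: int) -> Optional[float]:
    """Old behaviour: plain ratios, a minimum-count condition, then clipping."""
    if spam_count + ham_count <= 4:
        return None  # the old query produced NULL for rarely seen tokens
    spam_ratio = spam_count / spam_messages
    ham_ratio = ham_count / ham_messages
    prob = spam_ratio / (spam_ratio + ham_ratio)
    return min(max(prob, 0.01), 0.99)


def smoothed_spam_prob(spam_count: int, ham_count: int,
                       spam_messages: int, ham_messages: int) -> float:
    """New behaviour: add 1 to numerator and denominator of each ratio."""
    spam_ratio = (spam_count + 1.0) / (spam_messages + 1.0)
    ham_ratio = (ham_count + 1.0) / (ham_messages + 1.0)
    return spam_ratio / (spam_ratio + ham_ratio)


# Hypothetical corpus sizes, chosen only for illustration.
SPAM_MESSAGES = 1000
HAM_MESSAGES = 1000

# Two tokens that only ever appeared in spam: one rare, one common.
for spam_count in (5, 500):
    old = clipped_spam_prob(spam_count, 0, SPAM_MESSAGES, HAM_MESSAGES)
    new = smoothed_spam_prob(spam_count, 0, SPAM_MESSAGES, HAM_MESSAGES)
    print(f"spam_count={spam_count}: clipped={old}, smoothed={new:.4f}")
```

With clipping, both tokens end up at exactly 0.99, so ordering by abs(spam_prob - 0.5) cannot tell them apart. With the smoothed ratios, the token seen 500 times lands closer to 1 than the token seen 5 times, so it sorts first.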
2019-08-17 11:32:59 +02:00 · e6a4ba72f1
parent 631f97abe5
commit e6a4ba72f1
1 changed files with 4 additions and 17 deletions
--- a/21
+++ b/21
@@ -41,37 +41,24 @@ csr.execute(
                select
                    type, length, feature,
                    spam_count, ham_count,
-                    spam_count::float8 / spam_message_count as spam_ratio,
-                    ham_count::float8 / ham_message_count as ham_ratio
+                    (spam_count + 1.0) / (spam_message_count + 1.0) as spam_ratio,
+                    (ham_count + 1.0) / (ham_message_count + 1.0) as ham_ratio
                from f, m
            ),
            p as (
                select
                    type, length, feature,
                    spam_count, ham_count,
-                    case 
-                        when spam_count + ham_count > 4 then spam_ratio / (spam_ratio + ham_ratio)
-                    end as spam_prob
+                    spam_ratio / (spam_ratio + ham_ratio) as spam_prob
                from f1
            ),
-            p1 as (
-                select
-                    type, length, feature,
-                    spam_count, ham_count,
-                    case
-                        when spam_prob < 0.01 then 0.01
-                        when spam_prob > 0.99 then 0.99
-                        else spam_prob
-                    end as spam_prob
-                from p
-            ),
            p2 as (
                select
                    type, length, feature,
                    spam_count, ham_count,
                    spam_prob,
                    abs(spam_prob - 0.5) as interesting
-                from p1
+                from p
            )
        select * from p2
        order by interesting desc