Notes on Working in Log Space

To avoid underflow, I suggested you work with log probabilities. How exactly does this work?

Original equations:

    P(spam|words)  = a P(words|spam) P(spam)
    P(~spam|words) = a P(words|~spam) P(~spam)

The same normalizing constant a appears in both equations. Use the fact that P(spam|words) + P(~spam|words) = 1 to solve for a.

Use Naive Bayes (words are treated as independent given the class) to calculate:

    P(w1 w2 w3|spam) = P(w1|spam) P(w2|spam) P(w3|spam)

Log space:

    log P(spam|words)  = log [a P(words|spam) P(spam)]
                       = log a + log P(words|spam) + log P(spam)
                       = log a + SUM(log P(w|spam)) + log P(spam)

    log P(~spam|words) = log a + SUM(log P(w|~spam)) + log P(~spam)

But how do we solve for a? Answer: we do not need to. Subtract the second equation from the first, and log a cancels:

    log P(spam|words) - log P(~spam|words)
        = SUM(log P(w|spam)) + log P(spam) - SUM(log P(w|~spam)) - log P(~spam)

Rewrite the left side as a log of a ratio:

    log( P(spam|words)/P(~spam|words) )
        = SUM(log P(w|spam)) + log P(spam) - SUM(log P(w|~spam)) - log P(~spam)

Suppose the threshold is 50%. Then say the message is spam if

    log( P(spam|words)/P(~spam|words) ) > log(.5/.5) = 0

Suppose the threshold is 75%. Then say the message is spam if

    log( P(spam|words)/P(~spam|words) ) > log(.75/.25) = 1.0986

(All logs here are natural logs; 1.0986 is ln 3.)
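
The log-ratio rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names (log_odds, is_spam), the dictionaries p_word_spam and p_word_ham, and the probability values are all made up for the example, and every word is assumed to have a nonzero probability in both tables (for instance after smoothing).

```python
import math


def log_odds(words, p_word_spam, p_word_ham, p_spam):
    """Return log( P(spam|words) / P(~spam|words) ) under Naive Bayes.

    p_word_spam[w] stands for P(w|spam) and p_word_ham[w] for P(w|~spam).
    These tables are hypothetical; all entries are assumed to be nonzero.
    """
    # Prior term: log P(spam) - log P(~spam)
    score = math.log(p_spam) - math.log(1.0 - p_spam)
    # Likelihood terms: SUM(log P(w|spam)) - SUM(log P(w|~spam))
    for w in words:
        score += math.log(p_word_spam[w]) - math.log(p_word_ham[w])
    return score


def is_spam(words, p_word_spam, p_word_ham, p_spam, threshold=0.5):
    """Say the message is spam if the log odds exceed log(t / (1 - t))."""
    cutoff = math.log(threshold / (1.0 - threshold))
    return log_odds(words, p_word_spam, p_word_ham, p_spam) > cutoff


# Made-up probabilities, for illustration only.
p_word_spam = {"free": 0.3, "money": 0.2, "meeting": 0.01}
p_word_ham = {"free": 0.03, "money": 0.02, "meeting": 0.1}

print(is_spam(["free", "money"], p_word_spam, p_word_ham, 0.5, threshold=0.75))
print(is_spam(["meeting"], p_word_spam, p_word_ham, 0.5))
```

Note that the constant a never appears in the code, and no probabilities are multiplied together. Multiplying a few hundred small probabilities directly would underflow to 0.0 in floating point, while the corresponding sum of logs remains an ordinary finite number.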