Notes on Working in Log Space

To avoid underflow when multiplying many small probabilities, I suggested
you work with log probabilities.  How exactly does this work?

Original equations:

P(spam|words) = a P(words|spam)P(spam)
P(~spam|words) = a P(words|~spam)P(~spam)

Use the fact that P(spam|words) + P(~spam|words) = 1 to solve for a:

a = 1 / [ P(words|spam)P(spam) + P(words|~spam)P(~spam) ]
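
As a small illustration of the normalization step, here is a sketch in Python.
All four input numbers are made up for the example; they are not taken from
any real data set.

```python
# Hypothetical values, chosen only to illustrate the arithmetic.
p_words_spam, p_spam = 0.002, 0.4    # P(words|spam), P(spam)
p_words_ham, p_ham = 0.0005, 0.6     # P(words|~spam), P(~spam)

# The two posteriors must sum to 1, which fixes the constant a.
a = 1.0 / (p_words_spam * p_spam + p_words_ham * p_ham)

post_spam = a * p_words_spam * p_spam   # P(spam|words), about 0.727
post_ham = a * p_words_ham * p_ham      # P(~spam|words), about 0.273

print(post_spam, post_ham, post_spam + post_ham)
```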

Use the Naive Bayes assumption (words are independent given the class) to calculate:
P(w1 w2 w3|spam) = P(w1|spam)P(w2|spam)P(w3|spam)
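
To see why the product form is a problem for long messages, here is a minimal
Python sketch.  The message length (500 words) and the per-word probability
(0.01) are assumptions picked for illustration.

```python
import math

# Hypothetical message: 500 words, each with P(w|spam) = 0.01.
word_probs = [0.01] * 500

# Direct product: the true value is 1e-1000, far smaller than the
# smallest positive double (about 5e-324), so it underflows to zero.
product = 1.0
for p in word_probs:
    product *= p

# Summing logs instead keeps the number in a comfortable range.
log_sum = sum(math.log(p) for p in word_probs)

print(product)   # 0.0
print(log_sum)   # roughly -2302.6
```

Once the product has collapsed to 0.0 for both classes, the two hypotheses can
no longer be compared; the sum of logs does not have this problem.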

Log space (SUM runs over the words w in the message):

log P(spam|words) = log [a P(words|spam)P(spam)]
		  = log a + log P(words|spam) + log P(spam)
		  = log a + SUM(log P(w|spam)) + log P(spam)

log P(~spam|words) = log a + SUM(log P(w|~spam)) + log P(~spam)

But how do we solve for a?  Answer: we do not need to.
Subtract the equations above; the log a terms cancel:

log P(spam|words) - log P(~spam|words) 
	= SUM(log P(w|spam)) + log P(spam) - SUM(log P(w|~spam)) - log P(~spam)

Rewrite the left side as the log of a ratio (the log odds):

log( P(spam|words)/P(~spam|words) )
        = SUM(log P(w|spam)) + log P(spam) - SUM(log P(w|~spam)) - log P(~spam)
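
The log-odds formula can be sketched in Python as follows.  The function name
log_odds, the dictionary layout, and the word probabilities are all
hypothetical; the sketch also assumes every word has a nonzero probability in
both classes (for example after smoothing), since log(0) is undefined.

```python
import math

def log_odds(words, p_w_spam, p_w_ham, p_spam):
    """Return log( P(spam|words) / P(~spam|words) ).

    p_w_spam[w] is P(w|spam), p_w_ham[w] is P(w|~spam),
    and P(~spam) is taken to be 1 - p_spam.
    """
    score = math.log(p_spam) - math.log(1.0 - p_spam)
    for w in words:
        score += math.log(p_w_spam[w]) - math.log(p_w_ham[w])
    return score

# Made-up probabilities, for illustration only.
p_w_spam = {"free": 0.30, "meeting": 0.02}
p_w_ham = {"free": 0.05, "meeting": 0.20}

print(log_odds(["free", "free"], p_w_spam, p_w_ham, p_spam=0.5))  # positive: leans spam
print(log_odds(["meeting"], p_w_spam, p_w_ham, p_spam=0.5))       # negative: leans not spam
```

Note that the constant a never appears: only per-word logs and the two priors
are needed.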

Suppose the threshold is 50%, i.e. we require P(spam|words) > .5.
Then say the message is spam if
log( P(spam|words)/P(~spam|words) ) > log(.5/.5) = 0

Suppose the threshold is 75%.  Then say the message is spam if
log( P(spam|words)/P(~spam|words) ) > log(.75/.25) = 1.0986   (natural log)
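
In general, a probability threshold t corresponds to a log-odds cutoff of
log(t/(1-t)).  A short Python sketch, with hypothetical helper names
log_odds_cutoff and is_spam, and assuming natural logs throughout:

```python
import math

def log_odds_cutoff(threshold):
    """Log-odds value equivalent to requiring P(spam|words) > threshold."""
    return math.log(threshold / (1.0 - threshold))

def is_spam(log_odds_value, threshold=0.5):
    """Classify from a precomputed log-odds value."""
    return log_odds_value > log_odds_cutoff(threshold)

print(log_odds_cutoff(0.5))              # 0.0
print(round(log_odds_cutoff(0.75), 4))   # 1.0986
print(is_spam(2.0, threshold=0.75))      # True
```

The log-odds value must be computed with the same log base as the cutoff;
mixing log10 scores with a natural-log cutoff would shift the decision.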