573 -- 10/24  Introduction to Probability

1:30 (10min) 1. Discussion: what is a probability
1:40 (20min) 2. Definitions & disease problem
2:00 (15min) 3. Bayesian networks: factoring & causality
2:15 (15min) 4. Independence, d-separation, & Markov blanket
2:30 (20min) 5. Exercise: given a large set of variables:
                - determine a good causal factoring
                - what is relevant to a query

*************************************************************
1:30 (10min) 1. Discussion: what is a probability

FREQUENTIST: long-term frequency of a repeating event
SUBJECTIVIST / BAYESIAN: degree of belief

How could these very different ideas be connected?

*************************************************************
1:40 (20min) 2. Definitions & disease problem

P: function from sentences in propositional logic to [0,1]

    0 <= P(a) <= 1
    P(true) = 1
    P(false) = 0
    P(a v b) = P(a) + P(b) - P(a & b)

    therefore (take b = ~a): P(~a) = 1 - P(a)

Definition: conditional probability

    P(a|b) def= P(a&b) / P(b)

Discuss: Why does this definition make sense?  Venn diagram.

Rearranging the definition gives the product rule:

    P(a&b) = P(a|b)P(b) = P(b|a)P(a)

and dividing through by P(b) gives Bayes' rule:

    P(a|b) = P(b|a)P(a) / P(b)

Any P is defined with respect to some background knowledge.
Suppose P0 is based on what we know at time 0.
Then we learn a fact "b".
What is P1(b)?  What is P1(a), defined in terms of P0?
This is called a Bayesian update.

*************************************************************
Exercise: medical diagnosis

Suppose that a test for a particular disease has a very high success rate:
* if a tested patient has the disease, the test accurately reports this,
  a 'positive', 99% of the time (i.e. with probability 0.99), and
* if a tested patient does not have the disease, the test accurately
  reports that, a 'negative', 95% of the time (i.e. with probability 0.95).
Suppose also, however, that only 0.1% of the population have that disease
(i.e. with probability 0.001).

You got the test and it came back positive.  Should you panic?

Let a be the event that the patient has the disease, and
b be the event that the test returns a positive result.

    P(a|b) = P(b|a)P(a) / P(b)
           = P(b|a)P(a) / ( P(b|a)P(a) + P(b|~a)P(~a) )
           = (.99)(.001) / ( (.99)(.001) + (.05)(.999) )
           = .019

*************************************************************
More definitions:

Vector notation: where X is a multi-valued variable,
e.g. X=x1 v X=x2 v X=x3, then

    P(X) def= < P(X=x1), P(X=x2), ... , P(X=xn) >

Joint distribution:

    P(X,Y) = matrix{ P(X=xi & Y=yj) }

Conditional probabilities:

    P(X|Y) = matrix{ P(X=xi | Y=yj) }
    P(X|Y=y1) = vector{ P(X=xi | Y=y1) }

Marginalizing a joint distribution:

    Given: P(X,Y)
    Compute P(Y):

    P(Y) = Sum_x P(Y,X=x) = Sum_x P(Y|X=x)P(X=x)

Example:
    D = diseases
    S = symptoms
Suppose you know P(D,S) and want to compute P(D|S=s1).
By the definition of conditional probability:

    P(D|S=s1) = < P(d1,s1)/P(s1), P(d2,s1)/P(s1), ... , P(dn,s1)/P(s1) >

Note: every term in the vector is divided by P(s1).
I can look up P(di,s1), but I still have to calculate P(s1): or do I?
Instead, use the normalization shortcut:

    P(D|S=s1) = (1/P(s1)) < P(d1,s1), P(d2,s1), ... , P(dn,s1) >
              = alpha P(D,S=s1)

where alpha is chosen so that the vector sums to 1.

*************************************************************
2:00 (15min) 3. Bayesian networks: factoring & causality

Problem with representing the joint distribution as a lookup table: size!
 - Hard to store, harder to learn

Idea: Break the large joint probability distribution into a product of
smaller factors, leveraging independence.

Independence:

    a || b  def=  P(a&b) = P(a)P(b)

Equivalently, P(a|b) = P(a&b)/P(b) = P(a):
the portion of b's that are a's is the same as the portion of
everything that are a's.
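
A tiny numeric sketch of this definition (Python; the 2x2 joint table below
is invented for illustration, not taken from the lecture):

    # Invented 2x2 joint table over Boolean a, b; entries were chosen so that
    # a and b are independent: every entry equals P(a-value) * P(b-value).
    joint = {
        (True, True): 0.12,  (True, False): 0.18,
        (False, True): 0.28, (False, False): 0.42,
    }

    # Marginals, by summing out the other variable.
    p_a = sum(p for (a, _), p in joint.items() if a)   # 0.30
    p_b = sum(p for (_, b), p in joint.items() if b)   # 0.40

    # Independence check, both forms: P(a&b) = P(a)P(b), and P(a|b) = P(a).
    print(abs(joint[(True, True)] - p_a * p_b) < 1e-9)     # True
    print(abs(joint[(True, True)] / p_b - p_a) < 1e-9)     # True

When a and b are independent, the four-entry table carries no more
information than the two marginals P(a) and P(b); that is the kind of
saving the factoring below exploits.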
Conditional independence:

    a || b | c  def=  P(a,b|c) = P(a|c)P(b|c)

Lemma: if a || b | c, then P(a|c) = P(a|b,c)

Proof: by conditional independence, P(a|c) = P(a,b|c)/P(b|c);
by the product rule, P(a,b|c) = P(a|b,c)P(b|c); so
P(a|c) = P(a|b,c)P(b|c)/P(b|c) = P(a|b,c).

What this means: you can simplify a conditional probability by
eliminating conditioning terms that are (conditionally) independent.

Bayesian network:

Choose an ordering of the variables.  Then, by repeated application of
the product rule (the chain rule), ANY joint distribution can be factored:

    P(D,C,B,A) = P(D|C,B,A)P(C|B,A)P(B|A)P(A)

Potential for savings if P has some conditional independencies.

Examples:
 - Suppose C || B | A.  Then how can we simplify P(C|B,A)?
 - Suppose D || A | B,C.  Then how can we simplify P(D,C,B,A)?

A Bayesian network is a graphical representation of a factored JPD
simplified by independencies.

Example:

    Study   Bribe
         \  /
         Pass
         /  \
    Diploma  Car

What is the factorization?
How many parameters in the CONDITIONAL PROBABILITY TABLES?

    Study   Bribe
         \  /   \
         Pass   Arrested
         /  \   /
    Diploma  Car

What is the factorization?

A good Bayesian network (factoring) is one that captures as many
independencies as possible.  But other independencies may still exist,
hidden in the tables!

Can use simple graph operations to determine the independencies that
MUST hold for any P that fits the graph.

    Diploma || Study ?              (no)
    Diploma || Study | Pass ?       (yes)
    Study || Bribe ?                (yes)
    Study || Bribe | Pass ?         (no, explaining away)
    Pass || Arrested ?              (no)
    Pass || Arrested | Bribe ?      (yes)

But some are hard to figure out:

    Diploma || Arrested | Bribe ????

Definition of d-separation:
Two nodes X and Y are d-separated if, for all paths between X and Y,
there is an intermediate node A for which either:
 1. the connection is serial or diverging and the state of A is known
    for certain; or
 2. the connection is converging and neither A nor any of its
    descendants has received any evidence at all.

THEREFORE:
    Diploma || Arrested | Bribe            (YES)
    Diploma || Arrested | (Bribe & Car)    (NO)
    Diploma || Arrested | (Pass & Car)     (YES)

Final concept: Markov blanket
 - Evidence that d-separates a node from EVERYTHING else:
    - parents
    - children
    - other parents of children
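
A minimal sketch of how the first Study/Bribe network's factorization could
be encoded and queried by brute-force enumeration (Python; every number in
the tables below is an invented placeholder, not a value from the lecture):

    from itertools import product

    # Factored joint for the first network:
    # P(S) P(B) P(Pass|S,B) P(Diploma|Pass) P(Car|Pass).
    P_STUDY = 0.6                        # P(Study=True), invented
    P_BRIBE = 0.1                        # P(Bribe=True), invented
    P_PASS = {                           # P(Pass=True | Study, Bribe), invented
        (True, True): 0.99, (True, False): 0.90,
        (False, True): 0.80, (False, False): 0.10,
    }
    P_DIPLOMA = {True: 0.95, False: 0.01}    # P(Diploma=True | Pass), invented
    P_CAR = {True: 0.70, False: 0.20}        # P(Car=True | Pass), invented

    def bern(p_true, value):
        """Probability a Boolean variable takes `value`, given P(True)=p_true."""
        return p_true if value else 1.0 - p_true

    def joint(s, b, ps, d, c):
        """P(S=s, B=b, Pass=ps, D=d, C=c) as the product of the five factors."""
        return (bern(P_STUDY, s) * bern(P_BRIBE, b) * bern(P_PASS[(s, b)], ps) *
                bern(P_DIPLOMA[ps], d) * bern(P_CAR[ps], c))

    # Query P(Study | Diploma=True): sum out Bribe, Pass, Car, then normalize
    # with the alpha shortcut instead of computing P(Diploma=True) separately.
    unnormalized = {True: 0.0, False: 0.0}
    for s, b, ps, c in product([True, False], repeat=4):
        unnormalized[s] += joint(s, b, ps, True, c)
    alpha = 1.0 / sum(unnormalized.values())
    print({s: round(alpha * v, 3) for s, v in unnormalized.items()})

Note that the factored form stores 1 + 1 + 4 + 2 + 2 = 10 numbers for this
five-variable Boolean network, versus 2^5 - 1 = 31 for the full joint table;
that gap is the "size" saving that motivated the factoring.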