573 -- 10/24  Introduction to Probability

1:30 (10min) 1. Discussion: what is a probability
1:40 (20min) 2. Definitions & disease problem
2:00 (15min) 3. Bayesian networks: factoring & causality
2:15 (15min) 4. Independence, d-separation, & Markov blanket
2:30 (20min) 5. Exercise: given a large set of variables:
                - determine a good causal factoring
                - what is relevant to a query

*************************************************************
1:30 (10min) 1. Discussion: what is a probability

FREQUENTIST: long-term frequency of a repeating event
SUBJECTIVIST / BAYESIAN: degree of belief

How could these very different ideas be connected?

*************************************************************
1:40 (20min) 2. Definitions & disease problem

P: function from sentences in propositional logic to [0,1]

    0 <= P(a) <= 1
    P(true) = 1
    P(false) = 0
    P(a v b) = P(a) + P(b) - P(a & b)

    therefore (take b = ~a): P(~a) = 1 - P(a)

Definition: conditional probability

    P(a|b) def= P(a&b) / P(b)

Discuss: Why does this definition make sense?  Venn diagram.

Rearranging the definition gives the product rule:

    P(a&b) = P(a|b)P(b) = P(b|a)P(a)

and dividing through by P(b) gives Bayes' rule:

    P(a|b) = P(b|a)P(a) / P(b)

Any P is defined with respect to some background knowledge.
Suppose P0 is based on what we know at time 0.
Then we learn a fact "b".
What is P1(b)?  What is P1(a), defined in terms of P0?
This is called a Bayesian update.

*************************************************************
Exercise: medical diagnosis

Suppose that a test for a particular disease has a very high success rate:
* if a tested patient has the disease, the test accurately reports this,
  a 'positive', 99% of the time (i.e. with probability 0.99), and
* if a tested patient does not have the disease, the test accurately
  reports that, a 'negative', 95% of the time (i.e. with probability 0.95).
Suppose also, however, that only 0.1% of the population have that disease
(i.e. with probability 0.001).

You got the test and it came back positive.  Should you panic?

Let a be the event that the patient has the disease, and
b be the event that the test returns a positive result.

    P(a|b) = P(b|a)P(a) / P(b)
           = P(b|a)P(a) / ( P(b|a)P(a) + P(b|~a)P(~a) )
           = (.99)(.001) / ( (.99)(.001) + (.05)(.999) )
           = .019

*************************************************************
More definitions:

Vector notation: where X is a multi-valued variable,
e.g. X=x1 v X=x2 v X=x3, then

    P(X) def= < P(X=x1), P(X=x2), ... , P(X=xn) >

Joint distribution:

    P(X,Y) = matrix{ P(X=xi & Y=yj) }

Conditional probabilities:

    P(X|Y) = matrix{ P(X=xi | Y=yj) }
    P(X|Y=y1) = vector{ P(X=xi | Y=y1) }

Marginalizing a joint distribution:

    Given: P(X,Y)
    Compute P(Y):

    P(Y) = Sum_x P(Y,X=x) = Sum_x P(Y|X=x)P(X=x)

Example:
    D = diseases
    S = symptoms
Suppose you know P(D,S) and want to compute P(D|S=s1).
By the definition of conditional probability:

    P(D|S=s1) = < P(d1,s1)/P(s1), P(d2,s1)/P(s1), ... , P(dn,s1)/P(s1) >

Note: every term in the vector is divided by P(s1).
I can look up P(di,s1), but I still have to calculate P(s1): or do I?
Instead, use the normalization shortcut:

    P(D|S=s1) = (1/P(s1)) < P(d1,s1), P(d2,s1), ... , P(dn,s1) >
              = alpha P(D,S=s1)

where alpha is chosen so that the vector sums to 1.

*************************************************************
2:00 (15min) 3. Bayesian networks: factoring & causality

Problem with representing the joint distribution as a lookup table: size!
 - Hard to store, harder to learn

Idea: Break the large joint probability distribution into a product of
smaller factors, leveraging independence.

Independence:

    a || b  def=  P(a&b) = P(a)P(b)

Equivalently, P(a|b) = P(a&b)/P(b) = P(a):
the portion of b's that are a's is the same as the portion of
everything that are a's.
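
A tiny numeric sketch of this definition (Python; the 2x2 joint table below
is invented for illustration, not taken from the lecture):

    # Invented 2x2 joint table over Boolean a, b; entries were chosen so that
    # a and b are independent: every entry equals P(a-value) * P(b-value).
    joint = {
        (True, True): 0.12,  (True, False): 0.18,
        (False, True): 0.28, (False, False): 0.42,
    }

    # Marginals, by summing out the other variable.
    p_a = sum(p for (a, _), p in joint.items() if a)   # 0.30
    p_b = sum(p for (_, b), p in joint.items() if b)   # 0.40

    # Independence check, both forms: P(a&b) = P(a)P(b), and P(a|b) = P(a).
    print(abs(joint[(True, True)] - p_a * p_b) < 1e-9)     # True
    print(abs(joint[(True, True)] / p_b - p_a) < 1e-9)     # True

When a and b are independent, the four-entry table carries no more
information than the two marginals P(a) and P(b); that is the kind of
saving the factoring below exploits.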
Conditional independence:

    a || b | c  def=  P(a,b|c) = P(a|c)P(b|c)

Lemma: if a || b | c, then P(a|c) = P(a|b,c)

Proof: by conditional independence, P(a|c) = P(a,b|c)/P(b|c);
by the product rule, P(a,b|c) = P(a|b,c)P(b|c); so
P(a|c) = P(a|b,c)P(b|c)/P(b|c) = P(a|b,c).

What this means: you can simplify a conditional probability by
eliminating conditioning terms that are (conditionally) independent.

Bayesian network:

Choose an ordering of the variables.  Then, by repeated application of
the product rule (the chain rule), ANY joint distribution can be factored:

    P(D,C,B,A) = P(D|C,B,A)P(C|B,A)P(B|A)P(A)

Potential for savings if P has some conditional independencies.

Examples:
 - Suppose C || B | A.  Then how can we simplify P(C|B,A)?
 - Suppose D || A | B,C.  Then how can we simplify P(D,C,B,A)?

A Bayesian network is a graphical representation of a factored JPD
simplified by independencies.

Example:

    Study   Bribe
         \  /
         Pass
         /  \
    Diploma  Car

What is the factorization?
How many parameters in the CONDITIONAL PROBABILITY TABLES?

    Study   Bribe
         \  /   \
         Pass   Arrested
         /  \   /
    Diploma  Car

What is the factorization?

A good Bayesian network (factoring) is one that captures as many
independencies as possible.  But other independencies may still exist,
hidden in the tables!

Can use simple graph operations to determine the independencies that
MUST hold for any P that fits the graph.

    Diploma || Study ?              (no)
    Diploma || Study | Pass ?       (yes)
    Study || Bribe ?                (yes)
    Study || Bribe | Pass ?         (no, explaining away)
    Pass || Arrested ?              (no)
    Pass || Arrested | Bribe ?      (yes)

But some are hard to figure out:

    Diploma || Arrested | Bribe ????

Definition of d-separation:
Two nodes X and Y are d-separated if, for all paths between X and Y,
there is an intermediate node A for which either:
 1. the connection is serial or diverging and the state of A is known
    for certain; or
 2. the connection is converging and neither A nor any of its
    descendants has received any evidence at all.

THEREFORE:
    Diploma || Arrested | Bribe            (YES)
    Diploma || Arrested | (Bribe & Car)    (NO)
    Diploma || Arrested | (Pass & Car)     (YES)

Final concept: Markov blanket
 - Evidence that d-separates a node from EVERYTHING else:
    - parents
    - children
    - other parents of children
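
A minimal sketch of how the first Study/Bribe network's factorization could
be encoded and queried by brute-force enumeration (Python; every number in
the tables below is an invented placeholder, not a value from the lecture):

    from itertools import product

    # Factored joint for the first network:
    # P(S) P(B) P(Pass|S,B) P(Diploma|Pass) P(Car|Pass).
    P_STUDY = 0.6                        # P(Study=True), invented
    P_BRIBE = 0.1                        # P(Bribe=True), invented
    P_PASS = {                           # P(Pass=True | Study, Bribe), invented
        (True, True): 0.99, (True, False): 0.90,
        (False, True): 0.80, (False, False): 0.10,
    }
    P_DIPLOMA = {True: 0.95, False: 0.01}    # P(Diploma=True | Pass), invented
    P_CAR = {True: 0.70, False: 0.20}        # P(Car=True | Pass), invented

    def bern(p_true, value):
        """Probability a Boolean variable takes `value`, given P(True)=p_true."""
        return p_true if value else 1.0 - p_true

    def joint(s, b, ps, d, c):
        """P(S=s, B=b, Pass=ps, D=d, C=c) as the product of the five factors."""
        return (bern(P_STUDY, s) * bern(P_BRIBE, b) * bern(P_PASS[(s, b)], ps) *
                bern(P_DIPLOMA[ps], d) * bern(P_CAR[ps], c))

    # Query P(Study | Diploma=True): sum out Bribe, Pass, Car, then normalize
    # with the alpha shortcut instead of computing P(Diploma=True) separately.
    unnormalized = {True: 0.0, False: 0.0}
    for s, b, ps, c in product([True, False], repeat=4):
        unnormalized[s] += joint(s, b, ps, True, c)
    alpha = 1.0 / sum(unnormalized.values())
    print({s: round(alpha * v, 3) for s, v in unnormalized.items()})

Note that the factored form stores 1 + 1 + 4 + 2 + 2 = 10 numbers for this
five-variable Boolean network, versus 2^5 - 1 = 31 for the full joint table;
that gap is the "size" saving that motivated the factoring.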