CSE 527 10/9/06
Scribe: David Langworthy (dlan@microsoft.com)

--Course Home Page
Update "change your subscription options" if you do not use an @U address.
Lecture notes and slide numbering will be out of sync.
New resource links.

--Schedule
See new readings.
HW1 - trying to sort out wiki permissions.

Lecture Slide Numbers 4 & 5

Alignment is widely used, but speed is a problem.
The most used tool is BLAST.
Discuss scoring significance.

--1
Some pieces of a protein matter more than others. Some parts are active and some parts are scaffolding.

--2
Nothing in biology makes sense except in the light of evolution.
Most changes make a protein worse, some won't matter, and a few make it better.
Nothing about the process of change is focused on making a protein (or organism) better.
Changes are less tolerated in parts of a protein that are involved in many interactions.

--3
Basic Local Alignment Search Tool (BLAST)
Could run Smith-Waterman, but it's slow, and the database is growing faster than computers are getting faster, which is pretty fast.
Small good matches are more meaningful than long mediocre matches.
BLAST is a heuristic. It may miss some long weak matches.

--4
E-value is a measure of statistical confidence.

--5
There are 8000 (20^3) possible amino acid sequences of length 3.
Build an ungapped alignment based on seed matches.

--6
The example uses substrings of length 2 for convenience.

:20

It's a heuristic. A lower first threshold might find a longer alignment with a higher overall score.

--7
BLOSUM 62
Default score matrix for BLAST.
Entries on the diagonal are positive. Most but not all off-diagonal entries are negative.
Diagonal entries are not all equal. VV is 4, WW is 11. V is more common than W.
The matrix is symmetric [sigma is commutative: sigma(x,y) = sigma(y,x)].
PAM is the other accepted score matrix.

62 means the designers eliminated (by clustering) sequences that were more than 62% identical.
BLOSUM 50 likewise used only sequences with less than 50% identity.
This is to prevent extrapolation from close relatives.
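As a rough illustration of the seed-and-extend idea from slides 5 and 6, the sketch below indexes every length-3 word of a database sequence, looks up exact word matches from the query, and extends each hit along its diagonal without gaps. It is a simplification, not BLAST itself: real BLAST scores with a matrix such as BLOSUM 62 and uses its own thresholds, whereas the toy `score` function (+2 match, -1 mismatch), the `drop` cutoff, and all function names here are assumptions made up for the example.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
WORD_LEN = 3


def count_words(k=WORD_LEN):
    """Number of possible amino acid words of length k (20**k)."""
    return len(AMINO_ACIDS) ** k


def score(a, b):
    # Toy scores (assumption): a stand-in for a real matrix such as BLOSUM 62.
    return 2 if a == b else -1


def index_words(db, k=WORD_LEN):
    """Map each length-k word in db to the positions where it starts."""
    index = defaultdict(list)
    for j in range(len(db) - k + 1):
        index[db[j:j + k]].append(j)
    return index


def extend(query, db, i, j, k=WORD_LEN, drop=3):
    """Ungapped extension of the seed query[i:i+k] == db[j:j+k].

    Extends right, then left, and gives up in a direction once the running
    score falls `drop` or more below the best score seen so far.
    Returns (score, query_start, db_start, length).
    """
    seed = sum(score(query[i + t], db[j + t]) for t in range(k))

    # Extend to the right of the seed.
    best = running = seed
    right = k  # alignment length measured from the seed start
    t = k
    while i + t < len(query) and j + t < len(db):
        running += score(query[i + t], db[j + t])
        if running > best:
            best, right = running, t + 1
        elif best - running >= drop:
            break
        t += 1

    # Extend to the left, starting from the best right-hand score.
    running = best
    left = 0
    t = 1
    while i - t >= 0 and j - t >= 0:
        running += score(query[i - t], db[j - t])
        if running > best:
            best, left = running, t
        elif best - running >= drop:
            break
        t += 1

    return best, i - left, j - left, left + right


def seed_and_extend(query, db, k=WORD_LEN):
    """Find exact k-word seeds and return the distinct ungapped extensions."""
    index = index_words(db, k)
    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.add(extend(query, db, i, j, k))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    print(count_words())  # 8000
    print(seed_and_extend("MKVLAAGIW", "TTMKVLSAGIWPP"))
```

In the usage example, the four seeds on the same diagonal all extend across the single A/S mismatch to the same 9-residue alignment. A smaller `drop` would stop at the mismatch and report shorter pieces, which is the kind of trade-off the threshold remark on slide 6 points at.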
--8
BLAST Refinements '97 -- Databases are growing larger. Need more selectivity in the filter. Require two "close" non-overlapping hits.
Allow for gaps. Run bidirectional Smith-Waterman until the score drops below some dynamic threshold. This is more flexible than just searching a fixed number of diagonals off the main i=j diagonal.

BLAST does not do well with weaker matches. Position Specific Iterated (PSI-) BLAST uses iterated search to boost results for distant matches: it builds a similarity matrix from the initial hits and requeries using this as a weight matrix.

--9
Is 42 a good score?

--Board
Hypothesis testing

Coin is either fair, P(H) = 1/2, or unfair, P(H) = 2/3.
Gather data, say HHHHH.
Null model (hypothesis) M0: coin is fair
Alternate M1: coin is biased

P(Data | M0) = (1/2)^5 = 1/32
P(Data | M1) = (2/3)^5 = 32/243

These are small numbers, and with more trials the numbers will necessarily get smaller.
Typically use the "likelihood" ratio:
P(D | M1) / P(D | M0) = (32/243) / (1/32) = 1024/243 ~ 4.2
The math works out nicely for 5 heads, but the same calculation works for any sequence.

Neyman-Pearson Theorem
You can come up with whatever means of testing you want, but you are not going to come up with anything better than the likelihood ratio test.

:55

Often convenient to look at the log likelihood. It makes the math easier. Threshold tests work out the same.

P-value: the probability, given that M0 is true, that you see data as extreme or more extreme than what was observed. This is the probability of one of the two possible errors: rejecting the null when it is actually true.

Suppose 100 flips and 80 heads.

In a simple hypothesis like the coin flip, we can calculate the P-value. In other scenarios the P-value is not amenable to analytic solution. It could be some complex process with feedback. Can do simulations in this case.

--Back to slides

--10
BLAST tells if two proteins are similar.
Want to know if they are homologous: similar because they evolved from a common ancestor. Want to know if they are so similar that they could not have evolved independently.

--11
Where the numbers in the BLOSUM matrix come from.

--Back to BLOSUM 62 table
1:15

--12
Any matrix you come up with reflects some expectation of P_XY (how often residues X and Y are expected to be aligned).

This is one of the reasons for the success of BLAST. The authors worked out the probabilistic significance of a match, not just some ad hoc match score.
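The coin example from the board discussion can be checked numerically. The sketch below computes the likelihood of HHHHH under the fair (M0) and biased (M1) models, their ratio, the log likelihood ratio, and an exact P-value for the "100 flips, 80 heads" case. The function names are made up for illustration, and the P-value is taken as one-sided (counting only outcomes with at least as many heads as observed), which is an assumption about what "as or more extreme" means here.

```python
from fractions import Fraction
from math import comb, log

FAIR = Fraction(1, 2)    # M0: P(H) = 1/2
BIASED = Fraction(2, 3)  # M1: P(H) = 2/3


def likelihood(flips, p_heads):
    """Probability of one exact sequence of 'H'/'T' flips given P(H)."""
    result = Fraction(1)
    for flip in flips:
        result *= p_heads if flip == "H" else 1 - p_heads
    return result


def likelihood_ratio(flips, p1=BIASED, p0=FAIR):
    """P(data | M1) / P(data | M0)."""
    return likelihood(flips, p1) / likelihood(flips, p0)


def p_value_heads(n, k, p0=FAIR):
    """P(at least k heads in n flips | M0): one-sided exact binomial tail."""
    return sum(comb(n, h) * p0**h * (1 - p0)**(n - h) for h in range(k, n + 1))


if __name__ == "__main__":
    ratio = likelihood_ratio("HHHHH")
    print(likelihood("HHHHH", FAIR))    # 1/32
    print(likelihood("HHHHH", BIASED))  # 32/243
    print(ratio)                        # 1024/243
    print(log(float(ratio)))            # log likelihood ratio; positive favors M1
    print(float(p_value_heads(5, 5)))   # five heads in five flips under M0
    print(float(p_value_heads(100, 80)))  # 80 or more heads in 100 flips under M0
```

Using `Fraction` keeps the small examples exact, so the 1/32 and 32/243 values from the board come out as written. For processes where no closed form like the binomial tail exists, the same P-value would have to be estimated by simulating data under M0, as the notes mention.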