CSE 527   10/11/06
Alignment Statistics; PCR
Notes by Zack Frazier

Administrative:
---------------
The second homework has been posted. Working in groups is allowed. Do not simply divide the work; everyone should work on every part. Turn in one solution, and clearly identify your partners in each file.

Lecture Slides:
---------------
13) Review of the hypothesis-testing examples from Friday. The likelihood ratio represents how much more probable the observed data are under the alternative hypothesis than under the null hypothesis.

14) The log likelihood ratio is often used instead, because it is numerically more convenient. The Neyman-Pearson Lemma: for any scheme you can devise to identify the model for a given data observation, you cannot do better than the likelihood ratio test. Here "better" means that you cannot find a test with more statistical power.

15) For a given data observation, we can calculate a p-value: the probability, under the null model, of seeing data as extreme or more extreme than what was observed.

16) Homologous does not (just) mean similar. Homologous items are necessarily similar, but the converse is not true. Homologous means similar by descent from a common ancestor.

17) The BLOSUM matrices are derived from sets of expert alignments. In this model the lambda values are scaling factors that bring everything close to integers. Evolutionary distance can cause problems, and leads to different matrices for different sets of sequences, depending on the distance to their common ancestor. Sequences should be aligned using a scoring matrix that reflects their evolutionary distance.

18) The movement from probabilities to scoring matrices can be reversed; in fact, all reasonable matrices have a corresponding implied probability distribution.

20) Extreme Value Distribution, also known as the Gumbel distribution. Parameters:
    N      : the number of items in the sample.
    K      : a free parameter.
    lambda : scaling parameter, similar to the lambda in the BLOSUM derivation.
For an ungapped alignment of two sequences, N is the product of the sequence lengths, since there are approximately that many possible ungapped alignments.

21) Although there is no statistical proof that the EVD provides reasonable p-values for gapped alignments, it often does. For gapped alignments, the lambda and K parameters can be fit using Maximum Likelihood Estimation (MLE). This is straightforward and relatively cheap.

22) In cases where you do not trust the EVD, or want an empirical estimate of the p-value, randomization can be used.

23) Pseudocode for permutations.

24) The definition of a 'random' sequence could be improved. Maybe sample adjacent pairs, or use some other method that preserves more of the statistics of the given sequences? This is limited by the amount of original material given. Transmembrane proteins violate the independence assumption in dramatic fashion. The resolution of the method depends on the number of trials: the best p-value you can assign after 100 randomizations is 0.01, so for low p-values, many iterations must be performed. The EVD, on the other hand, allows you to estimate p-values at these low levels.

25)

26) p-values are not the whole picture. If we have a p-value of 10^-3 based on the probability of matching a *single* sequence by chance, but are searching through an NCBI database with 150 million sequences, we do not have the statistical significance we might imagine. I.e., database size matters. One fix: get the p-value for a match with a score as good or better than observed in a random database *of the size you just searched* (see the sketch below).
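A minimal sketch in Python of how these pieces could fit together numerically: an EVD p-value for a single comparison (slides 20-21) and the database-size correction from slide 26. The function names are mine, and the formula E = K*m*n*exp(-lambda*score) is the standard ungapped form implied by the slide-20 parameters, not something spelled out explicitly in these notes.

    import math

    def evd_pvalue(score, m, n, lam, K):
        # Expected number of chance ungapped alignments scoring >= score;
        # m*n plays the role of N from slide 20 (approximate number of
        # possible ungapped alignments of the two sequences).
        E = K * m * n * math.exp(-lam * score)
        # Probability of seeing at least one such alignment by chance.
        return 1.0 - math.exp(-E)

    def database_pvalue(p_single, db_size):
        # Slide 26's fix: probability of at least one hit this good
        # anywhere in a random database of db_size sequences.
        return 1.0 - (1.0 - p_single) ** db_size

For small p_single, database_pvalue(p_single, db_size) is approximately db_size * p_single, which is essentially the E-value idea discussed next.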
Another common approach is to quote E-values, defined as the expected number of hits, taking the database size into account. E-value and p-value cutoffs are often set to very stringent levels: since we have assumed independence, and made other assumptions about the distribution of the data that we know to be false, it is better to err on the side of caution and require very strict E-values.

30) PCR
Step 1) Add a sample of the DNA we would like to copy into a solution that also contains primers flanking the region of interest, some DNA polymerase, and some free nucleotides.
Step 2) Melt the DNA by heating the mixture.
Step 3) Cool down the mixture; this causes the primers to associate with the DNA. The association will not be perfect, and there will be some mismatched primer/DNA segments.
Step 4) Heat everything up again, but to a lower temperature than in step 2. This melts off the primers that were sticking to the DNA with poor specificity.
Step 5) DNA polymerase can now do what it does best and duplicate the region between the primers.
Step 6) We have now doubled the number of DNA strands! GOTO #2. Repeating this loop amplifies the target exponentially (see the sketch at the end of these notes).

32) This method had been used for years, but it required a lot of effort. When the DNA melts at 95C, the polymerase denatures, and it does not refold when the temperature is brought back down, so traditionally new polymerase had to be added at each iteration. This was solved by using the DNA polymerase of Thermus aquaticus, a bacterium that lives in geysers with water often very close to 95C. Using this polymerase, the process could be automated. It is now a billion+ dollar industry, with many applications.

33) The FBI has a DNA database that was constructed shortly after PCR became widespread. They use 13 locations on the genome that are highly variable in the copy number of repeat sequences. After running PCR, they have 13 chunks of DNA, each of a certain size, which hopefully uniquely identify the individual. Of course, you could have two different copy numbers at a locus, one on each chromosome.

34) How do we measure the copy number? DNA moves through a gel at a rate that depends on its size: larger molecules cannot move through the obstacles of the gel as quickly. This can be used to detect differences in DNA length as small as one nucleotide.
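As a toy illustration of the amplification behind slide 30's GOTO loop (assuming idealized, perfectly efficient doubling every cycle, which real reactions only approximate):

    def pcr_copies(initial_molecules, cycles):
        # Each pass through slide 30's loop doubles the number of
        # template strands, so growth is exponential in cycle count.
        return initial_molecules * 2 ** cycles

    # A single starting molecule after 30 cycles:
    # pcr_copies(1, 30) == 2**30, roughly a billion copies.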