CSE 527   10/11/06
Alignment Statistics; PCR
Notes by Zack Frazier

Administrative:
---------------
The second homework has been posted. Working in groups is allowed. Do not simply divide the work; everyone should work on every part. Turn in one solution, and clearly identify your partners in each file.

Lecture Slides:
---------------
13) Review of the hypothesis-testing examples from Friday. The likelihood ratio represents how much more probable the observed data are under the alternative hypothesis than under the null hypothesis.

14) The log likelihood ratio is often used instead, because it is numerically more convenient. The Neyman-Pearson Lemma: for any scheme you can devise to identify the model for a given data observation, you cannot do better than the likelihood ratio test. Here "better" means that you cannot find a test with more statistical power.

15) For a given data observation, we can calculate a p-value: the probability, under the null model, of seeing data as extreme or more extreme than what was observed.

16) Homologous does not (just) mean similar. Homologous items are necessarily similar, but the converse is not true. Homologous means similar by descent from a common ancestor.

17) The BLOSUM matrices are derived from sets of expert alignments. In this model the lambda values are scaling factors that bring everything close to integers. Evolutionary distance can cause problems, and leads to different matrices for different sets of sequences, depending on the distance to their common ancestor. Sequences should be aligned using a scoring matrix that reflects their evolutionary distance.

18) The movement from probabilities to scoring matrices can be reversed; in fact, all reasonable matrices have a corresponding implied probability distribution.

20) Extreme Value Distribution, also known as the Gumbel distribution. Parameters:
    N      : the number of items in the sample.
    K      : a free parameter.
    lambda : scaling parameter, similar to the lambda in the BLOSUM derivation.
For an ungapped alignment of two sequences, N is the product of the sequence lengths, since there are approximately that many possible ungapped alignments.

21) Although there is no statistical proof that the EVD provides reasonable p-values for gapped alignments, it often does. For gapped alignments, the lambda and K parameters can be fit using Maximum Likelihood Estimation (MLE). This is straightforward and relatively cheap.

22) In cases where you do not trust the EVD, or want an empirical estimate of the p-value, randomization can be used.

23) Pseudocode for permutations.

24) The definition of a 'random' sequence could be improved. Maybe sample adjacent pairs, or use some other method that preserves more of the statistics of the given sequences? This is limited by the amount of original material given. Transmembrane proteins violate the independence assumption in dramatic fashion. The resolution of the method depends on the number of trials: the best p-value you can assign after 100 randomizations is 0.01, so for low p-values, many iterations must be performed. The EVD, on the other hand, allows you to estimate p-values at these low levels.

25)

26) p-values are not the whole picture. If we have a p-value of 10^-3 based on the probability of matching a *single* sequence by chance, but are searching through an NCBI database with 150 million sequences, we do not have the statistical significance we might imagine. I.e., database size matters. One fix: get the p-value for a match with a score as good or better than observed in a random database *of the size you just searched* (see the sketch below).
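A minimal sketch in Python of how these pieces could fit together numerically: an EVD p-value for a single comparison (slides 20-21) and the database-size correction from slide 26. The function names are mine, and the formula E = K*m*n*exp(-lambda*score) is the standard ungapped form implied by the slide-20 parameters, not something spelled out explicitly in these notes.

    import math

    def evd_pvalue(score, m, n, lam, K):
        # Expected number of chance ungapped alignments scoring >= score;
        # m*n plays the role of N from slide 20 (approximate number of
        # possible ungapped alignments of the two sequences).
        E = K * m * n * math.exp(-lam * score)
        # Probability of seeing at least one such alignment by chance.
        return 1.0 - math.exp(-E)

    def database_pvalue(p_single, db_size):
        # Slide 26's fix: probability of at least one hit this good
        # anywhere in a random database of db_size sequences.
        return 1.0 - (1.0 - p_single) ** db_size

For small p_single, database_pvalue(p_single, db_size) is approximately db_size * p_single, which is essentially the E-value idea discussed next.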
Another common approach is to quote E-values, defined as the expected number of hits, taking the database size into account. E-value and p-value cutoffs are often set to very stringent levels: since we have assumed independence, and made other assumptions about the distribution of the data that we know to be false, it is better to err on the side of caution and require very strict E-values.

30) PCR
Step 1) Add a sample of the DNA we would like to copy into a solution that also contains primers flanking the region of interest, some DNA polymerase, and some free nucleotides.
Step 2) Melt the DNA by heating the mixture.
Step 3) Cool down the mixture; this causes the primers to associate with the DNA. The association will not be perfect, and there will be some mismatched primer/DNA segments.
Step 4) Heat everything up again, but to a lower temperature than in step 2. This melts off the primers that were sticking to the DNA with poor specificity.
Step 5) DNA polymerase can now do what it does best and duplicate the region between the primers.
Step 6) We have now doubled the number of DNA strands! GOTO #2. Repeating this loop amplifies the target exponentially (see the sketch at the end of these notes).

32) This method had been used for years, but it required a lot of effort. When the DNA melts at 95C, the polymerase denatures, and it does not refold when the temperature is brought back down, so traditionally new polymerase had to be added at each iteration. This was solved by using the DNA polymerase of Thermus aquaticus, a bacterium that lives in geysers with water often very close to 95C. Using this polymerase, the process could be automated. It is now a billion+ dollar industry, with many applications.

33) The FBI has a DNA database that was constructed shortly after PCR became widespread. They use 13 locations on the genome that are highly variable in the copy number of repeat sequences. After running PCR, they have 13 chunks of DNA, each of a certain size, which hopefully uniquely identify the individual. Of course, you could have two different copy numbers at a locus, one on each chromosome.

34) How do we measure the copy number? DNA moves through a gel at a rate that depends on its size: larger molecules cannot move through the obstacles of the gel as quickly. This can be used to detect differences in DNA length as small as one nucleotide.
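As a toy illustration of the amplification behind slide 30's GOTO loop (assuming idealized, perfectly efficient doubling every cycle, which real reactions only approximate):

    def pcr_copies(initial_molecules, cycles):
        # Each pass through slide 30's loop doubles the number of
        # template strands, so growth is exponential in cycle count.
        return initial_molecules * 2 ** cycles

    # A single starting molecule after 30 cycles:
    # pcr_copies(1, 30) == 2**30, roughly a billion copies.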