CSE 527, Lecture 13, 11-8-2006
Steve Lewis
 

Talk after class today: Isochores and symmetry breaking

In the human genome there are some regions which are GC rich, some AT rich
more genes in the GC rich regions

HMMs

PFAM - specifically search for alignment of protein "domains" (regions) which constitute structural or functional "units".

basic transition architecture -
     main line of match states (emit from the alphabet), plus insert states and delete states (no emission)

Scoring -
   use Baum-Welch to learn the model parameters

   Globins and training data score high

   scoring for pattern is probability / length since odds fall with every increase in length

   Better scoring: take the ratio of the odds of the data being generated by the globin model vs. the data being generated by a background model - i.e. compare to background
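A minimal sketch of the model-vs-background comparison, as a log-odds score. The 3-column model, the 2-letter alphabet and all probabilities are made-up toy values; a real profile HMM would score over paths (match/insert/delete) rather than a fixed column-by-column alignment as assumed here.

```python
import math

def log_odds(seq, model, background):
    """Sum over positions of log2( P(residue | model column) / P(residue | background) )."""
    return sum(
        math.log2(model[i][res] / background[res])
        for i, res in enumerate(seq)
    )

# Hypothetical toy model: one emission distribution per column.
model = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.2, "B": 0.8},
    {"A": 0.9, "B": 0.1},
]
# Hypothetical background: both letters equally likely.
background = {"A": 0.5, "B": 0.5}

good = log_odds("ABA", model, background)  # matches the model at every column
bad = log_odds("BAB", model, background)   # mismatches at every column
```

A positive score means the model explains the sequence better than background, a negative score means worse. Because both probabilities shrink with length, the ratio is much less length-dependent than the raw probability.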

  Alternative - convert to Z score

     (score - mean) / standard deviation

     Z is a way to correct for length variance -

     mean is mean for proteins of similar length

    variance is variance for proteins of similar length
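A small sketch of the Z-score correction, assuming we have a reference set of (length, score) pairs for unrelated proteins. The `window` parameter (what counts as "similar length") and the reference numbers are hypothetical.

```python
from statistics import mean, stdev

def z_score(score, length, reference, window=10):
    """Z-score of `score` relative to reference proteins of similar length.

    reference: list of (length, score) pairs.
    window:    assumed tolerance, in residues, for "similar length".
    """
    similar = [s for n, s in reference if abs(n - length) <= window]
    if len(similar) < 2:
        raise ValueError("need at least two reference proteins of similar length")
    return (score - mean(similar)) / stdev(similar)

# Made-up reference scores: longer proteins score lower under the model.
reference = [(100, -50.0), (105, -54.0), (95, -46.0), (300, -150.0), (310, -156.0)]

# A 100-residue protein scoring -30 is compared only with the ~100-residue references.
z = z_score(-30.0, 100, reference)
```

Here the three references near length 100 have mean -50 and sample standard deviation 4, so the protein sits 5 standard deviations above its length class.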

PFAM initial seeds are hand coded; train from hand alignments

Train with hand selected sample

Now automatically classify from Swiss-Prot

8000 families in PFAM covering 75% of proteins

Families

    are hierarchical - globin might be the head of a large family

=====

Pseudocounts - this is the insertion of a small count to handle the case where a residue is simply not seen in some position - probability does not handle 0 well

Pseudocount - represents a Bayesian prior probability

More elaborate pseudocount: Dirichlet mixture priors - adding separate pseudocounts for different regions, i.e. hydrophobic region, buried region ...
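A sketch of the simple pseudocount only (not the Dirichlet mixture): every residue gets the same small extra count before normalizing. The alignment column and the pseudocount value of 1 are assumptions for illustration.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def emission_probs(column, alphabet=ALPHABET, pseudocount=1.0):
    """Emission probabilities for one alignment column, with a uniform pseudocount."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Hypothetical column from a hand alignment: mostly leucine.
column = "LLLLIVLL"
probs = emission_probs(column)
```

With this column, L gets (6 + 1) / 28 = 0.25 and an unseen residue such as W gets 1 / 28 instead of 0. A uniform pseudocount like this corresponds to a single symmetric Dirichlet prior; the mixture version would use several differently shaped priors.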

====

Computational gene prediction
-----------------------------

Motivation - lots of sequences - are they genes - are they expressed ???

state of the art  60% accuracy - based on first principles

   80% with similarity training

BUT - predictions are VERY  expensive to verify


Basics - there is a start Codon , stop codon and prefix and suffix RNA (UTR)

Transcription - DNA ->RNA

Translation - RNA->Protein

Codon table - reviewed

there are a few variations in codon meaning

might be a dozen different code tables

   mitochondria use a slightly different table

   in Tetrahymena 2 of the stop codons code for other amino acids; sometimes so does the third
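A sketch of transcription and translation with a swappable codon table. The standard table is built from the usual compact 64-letter string (codons ordered TTT, TTC, TTA, TTG, TCT, ...). The variant shown is the ciliate nuclear code, in which TAA and TAG encode glutamine; the mitochondrial tables are not included. Function names and the example sequence are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Ciliate nuclear code (e.g. Tetrahymena): two of the three stops are reassigned.
CILIATE = dict(STANDARD, TAA="Q", TAG="Q")

def transcribe(dna):
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.replace("T", "U")

def translate(mrna, table=STANDARD):
    """mRNA -> protein, reading from position 0 and stopping at the first stop codon."""
    dna = mrna.replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGGCCTAAGGG")
standard_protein = translate(mrna)          # stops at UAA
ciliate_protein = translate(mrna, CILIATE)  # reads through UAA as Q
```

The same message gives "MA" under the standard table and "MAQG" under the ciliate table, which is why a gene finder needs to know which code table applies.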

Gene finding

   First issue: define "reading frame" which base is the start of a codon -

       in RNA, 3 possible frames; in DNA, 6 frames (3 per strand)

       Open frame - long run without stop codon.  

      in random DNA a stop happens each 21 triplets, on average (64 codons/3 stops)

      So, very low odds of a long (several hundred codons) open reading frame by chance.  A long open reading frame is a good way to look for genes

Idea2 - compare codon frequency with known codon frequency - look at amino acid sequence

     also synonym usage is species dependent - certain species preferentially use certain synonyms  (Why? probably regulation - if different tRNAs are present at different levels, genes using codons matching the more common tRNAs will be translated more efficiently.)

    Markov model - compare likelihoods - 5th or 6th order Markov model - say spanning 2 codons
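A sketch of comparing likelihoods under two k-th order Markov models, one trained on coding sequence and one on background. The training strings are tiny made-up examples, so the demo uses k = 2 rather than 5 or 6; the add-one pseudocount and all function names are assumptions. This version is not frame-aware.

```python
import math
from collections import defaultdict

def train(seqs, k):
    """Count how often each base follows each k-base context."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts

def log_likelihood(seq, counts, k, pseudo=1.0):
    """Log probability of seq (after its first k bases) under the model."""
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        num = ctx.get(seq[i], 0) + pseudo
        den = sum(ctx.values()) + 4 * pseudo  # 4 possible next bases
        total += math.log(num / den)
    return total

def coding_score(seq, coding, background, k):
    """Log-likelihood ratio: positive means 'looks more like coding'."""
    return log_likelihood(seq, coding, k) - log_likelihood(seq, background, k)

K = 2  # toy order; the notes suggest 5 or 6 with real training data
coding_model = train(["ATGGCCGCCGCCGCCGCC"], K)      # made-up "coding" sample
background_model = train(["ATATATATATATATATAT"], K)  # made-up "background" sample

gene_like = coding_score("GCCGCCGCC", coding_model, background_model, K)
junk_like = coding_score("ATATATATA", coding_model, background_model, K)
```

Sliding this score along a genome in a window gives the kind of probability trace where expressed genes show up as spikes.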

   using likelihood in a virus: expressed genes show up as spikes in probability; however, there is an issue - in viral DNA, genes might overlap and be read in two different frames

     in prokaryotes - most DNA is for coding so open frames look good but
       - there are a few short genes
       - some genes use abnormal sequences

    in eukaryotes -

        same signals but MANY introns - Phil Sharp matched the mRNA to the DNA and saw intron loops

        mRNA shorter than DNA with much editing

Biology of splicing

     splicing performed by the spliceosome - at least 50 proteins and half a dozen RNA molecules - RNA is the most conserved portion

     splicing MUST preserve reading frame

     exon length is not necessarily a multiple of 3, so splicing must be high fidelity - any error in a splice position shifts the reading frame
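A toy illustration of why splice positions must be exact. The pre-mRNA, the exon coordinates (0-based, end-exclusive) and the function names are all made up; exon 1 is 5 bases long, so it is not a multiple of 3.

```python
def splice(pre_mrna, exons):
    """Join exon intervals into the mature message, discarding introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def codons(seq):
    """Split a sequence into complete codons starting at position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

exon1, intron, exon2 = "ATGGC", "GTAAGTCCCAG", "CAAATAA"
pre = exon1 + intron + exon2

correct = [(0, 5), (16, 23)]  # exact exon boundaries
shifted = [(0, 5), (15, 23)]  # 3' splice site off by one base

good_codons = codons(splice(pre, correct))
bad_codons = codons(splice(pre, shifted))
```

With the correct boundaries the message reads ATG GCC AAA TAA. With the splice site off by a single base, every codon after the junction changes (ATG GCG CAA ATA) and the stop codon is lost.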

     introns are recognized by binding near the start and end of the intron

     why have introns

         after introns discovered

           in Tetrahymena - the gene for ribosomal RNA has an intron but uses a much simpler mechanism where the RNA self-excises the intron region; the intron region catalyzes its own excision
 
           Some introns self-excise in mRNAs

           might there be uses for discarded introns? -  Some introns have short RNA segments which can interfere with pieces of other genes (so-called "microRNAs")

Gene finding in eukaryotes

      need to find introns and exons.  some sequence signals - need to look at

      Some human data:
	- internal exons median 122 bp, mean 145 bp
	- introns median 1000 bp, mean 3300 bp -  REALLY BIG tail - some really long introns
	- genes 14 kb median, 27 kb average, but some big ones, e.g. one gene (dystrophin) is 2.4 Mb
	- the 2.4 Mb dystrophin gene takes 16 hours to transcribe