CSE527 Lecture 17: RNA Secondary Structure Prediction

Notes by Nan Li, (annli@cs, 12/03)

Modified by Manu Forero (manumanu at u) on 11/04

1. Background
RNA has some of the properties of both DNA and proteins. It has the same information storage capability as DNA due to its sequence of nucleotides. But its ability to form 3D structures allows it to have enzymatic properties like those of proteins. We call the linear polynucleotide chain the primary structure of RNA; the intra-molecular base-pairing is the secondary structure.

Secondary structure is important for RNA function. For example, tRNA molecules all form similar structures: a characteristic "clover-leaf", one branch of which is the "anticodon loop", which pairs with a codon as part of the process of translating a messenger RNA into protein.

Tertiary (3-D) structure is important in RNA, but the models that predict it are not yet mature. Fortunately, the secondary structure already carries a lot of useful information.

Of the 16 possible base pairings, 6 are considered stable (counting each pair in both orientations). The Watson-Crick pairs, C-G and A-U, are stabilized by hydrogen bonds between donor and acceptor sites on the bases. The G-U wobble pair is weaker but still stable. Other bases also pair sometimes, especially if chemically modified.
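The pairing rule above can be written as a small lookup. This is a minimal Python sketch; the names STABLE_PAIRS and can_pair are illustrative choices, not part of the lecture, and modified bases are ignored.

```python
from itertools import product

# Ordered pairs treated as stable in these notes:
# Watson-Crick (C-G, A-U) plus the G-U wobble pair, in both orientations.
STABLE_PAIRS = {
    ("C", "G"), ("G", "C"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def can_pair(x: str, y: str) -> bool:
    """Return True if bases x and y form one of the stable pairs."""
    return (x.upper(), y.upper()) in STABLE_PAIRS


# Enumerate all 16 ordered pairings and keep the stable ones.
all_pairings = list(product("ACGU", repeat=2))
stable = [p for p in all_pairings if can_pair(*p)]
```

Enumerating the ordered pairings this way recovers the 6-out-of-16 count quoted above.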

Transfer RNA (tRNA) helps ribosomes translate mRNA into proteins. At one end it carries an anticodon that is complementary to an mRNA codon; that codon codes for the amino acid (AA) attached at the other end of the tRNA. Ribosomes make proteins by binding one of these tRNAs and adding its AA to the polypeptide chain. The amino-acid-less tRNA is then recycled by enzymes that replace the missing amino acid. Note that, unlike other RNAs, tRNAs are not degraded quickly by RNases, so they can be recycled.

Why predict structure? Like proteins, RNAs carry out important functions when they fold. Some argue that RNA-based organisms (no DNA or proteins) were the first life forms, in an 'RNA world' from which life evolved. Here are 3 RNAs that people have found after much grad student sweat (can we find more using a computer?):
1. (Big one) Hammerhead ribozyme: cleaves particular nucleic acid sequences.
2. (Top) A ribozyme from the Hepatitis virus.
3. (Bottom) A riboswitch: sits in the 5' untranslated region of an mRNA, controlling the production of a particular protein. When it recognizes a particular metabolite, which signals ample presence of the protein, it catalyzes cleavage of its own mRNA, thus down-regulating production of that protein.

Of course, this is of great interest to many people, including biotechnologists trying to use RNAi to treat diseases or stop viruses, and everyone trying to understand the many functions of RNA.

And the trick is: the pair of RNAs on the slide have (almost) the same function, which is to recognize glycine and then turn on some gene. However, their sequences are less than 30% similar. The good news is that they seem to have a similar structure, so by predicting structure we should be able to find more interesting RNAs.

2. Algorithms
In the literature on RNA secondary structure prediction, a commonly used assumption is the absence of pseudoknots (defined below). There has been some work on relaxing this assumption, but it will not be covered in this lecture. Although assuming no pseudoknots is obviously unrealistic, in practice they probably constitute only a few percent of base pairs, and although sometimes critical for 3D structure, they are unlikely to cause major changes in secondary structure.

Definition of the prediction problem

Given a sequence s = r1...rn over the alphabet {A, C, G, U}, the output secondary structure is a set of pairs {i_j} such that
(1) i < (j-4)                  /* hairpin loops should be not smaller than 3*/
(2) if i_j and i'_j' are two pairs, with i ≤ i' then
   (a) i = i' && j = j', or
   (b) j < i', or               /*one pair precedes the other*/
   (c) i < i' < j' < j        /*nested pairs; excludes pseudo-knots*/
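Conditions (1) and (2) can be checked mechanically. The sketch below is one possible checker, not part of the lecture: the function name is_valid_structure and the min_gap parameter are illustrative, pairs are given as (i, j) tuples of 1-based positions, and min_gap=4 simply mirrors the inequality in condition (1).

```python
def is_valid_structure(pairs, min_gap=4):
    """Check conditions (1) and (2) for a set of pairs (i, j), 1-based."""
    pairs = sorted(set(pairs))  # sorting guarantees i <= i' for a < b below

    # Condition (1): i < j - min_gap for every pair.
    for i, j in pairs:
        if not i < j - min_gap:
            return False

    # Condition (2): any two distinct pairs either follow one another (b)
    # or are strictly nested (c). Case (a), identical pairs, is removed
    # by the set() above.
    for a in range(len(pairs)):
        i, j = pairs[a]
        for b in range(a + 1, len(pairs)):
            i2, j2 = pairs[b]
            precedes = j < i2
            nested = i < i2 < j2 < j
            if not (precedes or nested):
                return False  # crossing pairs or a shared base
    return True
```

Two crossing pairs such as (1, 10) and (5, 15) satisfy neither (b) nor (c), so the checker rejects them; the same test also rejects two pairs that share a base.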

Violation of condition 2(c) is essentially the definition of a pseudoknot: two pairs that cross, with i < i' < j < j', satisfy none of (a)-(c).

There are three popular approaches for RNA secondary structure prediction. Both the first and second methods (Sections 2.1 and 2.2) are applications of dynamic programming, and their applicability depends entirely on the "pseudoknot-free" assumption.

Approaches:
- Gold standard: physical experiments. Problem: very painful, as it takes a lot of work to get one of these structures figured out.
- Silver standard: comparative sequence analysis. Key idea: use compensatory mutations to figure out structure. Problem: requires several aligned, appropriately diverged sequences.
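To make the compensatory-mutation idea concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the method used in practice: the alignment is a made-up toy, columns are 0-based, and the rule "every sequence pairs and at least two different pair types occur" is a deliberately crude stand-in for real covariation statistics.

```python
PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def column_support(alignment, i, j):
    """Return (number of sequences whose columns i and j can pair,
    set of distinct pair types observed among them)."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    pairing = [p for p in observed if p in PAIRS]
    return len(pairing), set(pairing)


def is_compensatory(alignment, i, j):
    """Crude test: all sequences pair at (i, j), and the pair type varies,
    i.e. a mutation in one column was compensated in the other."""
    n_paired, kinds = column_support(alignment, i, j)
    return n_paired == len(alignment) and len(kinds) >= 2


# Hypothetical toy alignment (equal-length, gap-free sequences).
toy_alignment = [
    "GCAAAAGC",
    "GUAAAAAC",
    "CCAAAAGG",
]
```

In the toy alignment, columns 1 and 6 read C-G, U-A, C-G: the sequence changes but the ability to pair is preserved, which is the signal comparative analysis looks for.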

2.1 Maximum Pairings, the Nussinov algorithm [1]: finding the structure with maximal pairings (Dynamic programming)
Let B(i, j) be the number of base pairs in the optimal pairing of subsequence ri...rj.
B(i, j) = 0 for all i, j with i >= j-4; /*loop min size*/

Otherwise, B(i, j) is the maximum over four cases: ri is unpaired, rj is unpaired, ri pairs with rj, or the structure splits at some rk (i < k < j) into two independent substructures.
B(i, j) = max( B(i+1, j),  B(i, j-1),  B(i+1, j-1) + δ(ri, rj),  max {B(i, k) + B(k+1, j) | i < k < j} )

Here, δ(ri, rj) is 1 if ri and rj can pair, and 0 otherwise.
The time complexity is O(|s|^3): there are O(|s|^2) entries B(i, j), and each takes a max over O(|s|) values of k.
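The recurrence can be filled in bottom-up by increasing subsequence length. The following is a minimal Python sketch under a few assumptions of ours: indices are 0-based, the helper names (delta, nussinov_table, traceback) are illustrative, and min_gap=4 encodes the base case B(i, j) = 0 for i >= j-4. The traceback returns one optimal structure; there may be others with the same number of pairs.

```python
PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def delta(x, y):
    """1 if bases x and y can pair, 0 otherwise."""
    return 1 if (x, y) in PAIRS else 0


def nussinov_table(seq, min_gap=4):
    """Fill B[i][j] = max number of base pairs in seq[i..j] (0-based, inclusive)."""
    n = len(seq)
    B = [[0] * n for _ in range(n)]
    # Entries with j - i <= min_gap keep the base-case value 0.
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(B[i + 1][j],                               # ri unpaired
                       B[i][j - 1],                               # rj unpaired
                       B[i + 1][j - 1] + delta(seq[i], seq[j]))   # ri pairs with rj
            for k in range(i + 1, j):                             # split at k
                best = max(best, B[i][k] + B[k + 1][j])
            B[i][j] = best
    return B


def traceback(B, seq, min_gap=4):
    """Recover one optimal set of pairs from a filled table."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_gap:
            continue
        if B[i][j] == B[i + 1][j]:
            stack.append((i + 1, j))
        elif B[i][j] == B[i][j - 1]:
            stack.append((i, j - 1))
        elif delta(seq[i], seq[j]) and B[i][j] == B[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if B[i][j] == B[i][k] + B[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return sorted(pairs)
```

For example, on the sequence GGGAAAACCC the table gives three nested G-C pairs, while removing one A leaves room for only two because the innermost pair would violate the minimum loop size.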

2.2 Minimum Energy [2]: find the structure with minimal free energy.

This is the same as max pairing, but instead of δ(ri, rj), which is 1 if ri and rj can pair and 0 otherwise, we use e(ri, rj) = e*δ(ri, rj), where e is the free-energy change of that particular pairing (G-C, A-U, or wobble), taken from physical experiments. Since pairing lowers the energy (e is negative), we minimize instead of maximizing.
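As a sketch of this change, the max-pairing table can be turned into an energy table by swapping the 0/1 score for a per-pair energy and max for min. The energy values below are made-up placeholders in arbitrary units, chosen only so that G-C is strongest and the wobble pair weakest; they are not experimental parameters. The names PAIR_ENERGY, e, and min_pairing_energy are likewise illustrative.

```python
# Hypothetical per-pair energies (arbitrary units), NOT measured values.
PAIR_ENERGY = {
    ("G", "C"): -3.0, ("C", "G"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}


def e(x, y):
    """Energy of pairing x with y; 0.0 if they cannot pair."""
    return PAIR_ENERGY.get((x, y), 0.0)


def min_pairing_energy(seq, min_gap=4):
    """Minimum total pair energy of seq, using the max-pairing recurrence
    with max replaced by min and delta replaced by e (0-based indices)."""
    n = len(seq)
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_gap + 1, n):
        for i in range(n - span):
            j = i + span
            best = min(E[i + 1][j],
                       E[i][j - 1],
                       E[i + 1][j - 1] + e(seq[i], seq[j]))
            for k in range(i + 1, j):
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1] if n else 0.0
```

With these placeholder values, three nested G-C pairs score -9.0 and three nested A-U pairs score -6.0, so the two structures are no longer equivalent as they were under pure pair counting.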

Loop Based:
To calculate a secondary structure's free energy, the structure is decomposed into basic loops, and the free energies of the loops are summed.
There are 5 types of basic loops:
Hairpin: contains one uninterrupted strand of unpaired bases.
Bulge: an interior loop with unpaired bases on only one strand.
Interior loop: contains two strands of unpaired bases.
Stacking: loop formed by pairing of the ith nucleotide with the jth and the (i+1)th nucleotide with the (j-1)th.
Multi-loop: a loop with more than two unpaired strands.

The algorithm is similar to the maximum pairing algorithm. Assume:

W(i, j) is the minimal free energy in optimal pairing of subsequence ri...rj.
V(i, j) is the minimal free energy in optimal pairing of subsequence ri...rj with i_j pairing.
W(i, j) = V(i, j) = ∞ for j-i < 4
W(i, j) = min( W(i+1, j), W(i, j-1), V(i, j), min {W(i, k)+W(k+1, j) | i < k < j } )
V(i, j) = min( eh(i, j), es(i, j)+V(i+1, j-1), VBI(i, j), VM(i, j) )
VBI(i, j) = min{ ebi(i, j, i', j') + V(i', j') | i < i' < j' < j and i'-i+j-j' > 2 }
VM(i, j) = min{ W(i+1, k) + W(k+1, j-1) | i < k < j-1 }
where:
eh(i, j): free energy of a hairpin loop closed by the pair i_j
es(i, j): free energy of the stacked pairs i_j and (i+1)_(j-1)
ebi(i, j, i', j'): free energy of an interior loop or bulge closed by the outer pair i_j and the inner pair i'_j'
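The recurrences for W, V, VBI and VM translate almost line for line into memoized recursion. The sketch below is a toy instance only: the energy functions eh, es and ebi return invented constants rather than experimental values, indices are 0-based, and setting V(i, j) to infinity when ri and rj cannot pair is our assumption (the notes leave it implicit). It uses the naive four-index search for VBI and plain recursion, so it is suitable for short sequences only.

```python
import math
from functools import lru_cache

PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
INF = math.inf


# Toy energy functions: placeholder values, NOT experimental parameters.
def eh(i, j):
    return 3.0                      # hairpin closed by pair (i, j)


def es(i, j):
    return -2.0                     # pair (i, j) stacked on (i+1, j-1)


def ebi(i, j, i2, j2):
    unpaired = (i2 - i - 1) + (j - j2 - 1)
    return 1.0 + 0.5 * unpaired     # bulge / interior loop


def min_free_energy(seq):
    """W(0, n-1) for seq under the toy energies above (0-based indices)."""
    n = len(seq)

    @lru_cache(maxsize=None)
    def V(i, j):
        # Assumption: V is infinite if the bases at i and j cannot pair.
        if j - i < 4 or (seq[i], seq[j]) not in PAIRS:
            return INF
        best = min(eh(i, j), es(i, j) + V(i + 1, j - 1))
        # VBI: every inner pair (i2, j2), excluding the stacking case.
        for i2 in range(i + 1, j):
            for j2 in range(i2 + 1, j):
                if (i2 - i) + (j - j2) > 2:
                    best = min(best, ebi(i, j, i2, j2) + V(i2, j2))
        # VM: split the interior into two parts, i < k < j-1.
        for k in range(i + 1, j - 1):
            best = min(best, W(i + 1, k) + W(k + 1, j - 1))
        return best

    @lru_cache(maxsize=None)
    def W(i, j):
        if j - i < 4:
            return INF
        best = min(W(i + 1, j), W(i, j - 1), V(i, j))
        for k in range(i + 1, j):
            best = min(best, W(i, k) + W(k + 1, j))
        return best

    return W(0, n - 1) if n else INF
```

Under these toy values, GGGAAAACCC folds into a hairpin (3.0) extended by two stacked pairs (-2.0 each), for a total of -1.0. Because W is infinite in the base case, a sequence with no admissible pair at all yields infinity rather than 0.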

The energies are usually determined experimentally. A notable exception is the multi-loop, whose energy is hard to determine. 

The time complexity is O(|s|^4), due to the VBI term: each of the O(|s|^2) pairs (i, j) minimizes over O(|s|^2) inner pairs (i', j'). Some algorithms impose an arbitrary limit of 30-40 bases on the size of bulge/interior loops to avoid this n^4 cost. Alternatively, if ebi(-) satisfies certain (realistic) assumptions, then the time can be reduced to O(|s|^3); e.g. see [3].

Packages such as mfold and the Vienna RNA package are based on this algorithm. 

Suboptimal energy: We care about sub-optimal folds because:
- These states may be populated if they are close in energy to the minimum.
- The models and energy parameters are not perfect, so there is no reason to believe that the lowest predicted energy state is the actual best.
- RNAs interact with proteins and other RNAs, and these interactions can modify the energies of the folds.
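To see why near-optimal states may be populated, one can estimate relative populations from energies. The sketch below assumes thermal equilibrium with Boltzmann weights exp(-E/RT), which the notes do not state explicitly; the three energies in the usage note are invented, and the function name and default temperature (37 °C) are our choices.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def boltzmann_populations(energies, temperature_k=310.15):
    """Equilibrium fractions for structures with the given free energies
    (kcal/mol), assuming Boltzmann weights exp(-E / RT)."""
    rt = R_KCAL * temperature_k
    e_min = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / rt) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]
```

For hypothetical folds at -10.0, -9.5 and -7.0 kcal/mol, the fold only 0.5 kcal/mol above the minimum still takes up roughly 30% of the population, while the one 3 kcal/mol above is below 1%.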

References
[1] R. Nussinov, G. Pieczenik, J. Griggs, and D. Kleitman, "Algorithms for loop matchings," SIAM J. Appl. Math., vol. 35, pp. 68-82, 1978.
[2] R. Nussinov and A. B. Jacobson, "Fast algorithm for predicting the secondary structure of single stranded RNA," Proc. Natl. Acad. Sci. USA, vol. 77, pp. 6309-6313, 1980.
[3] R. B. Lyngsø, M. Zuker, and C. N. Pedersen, "Fast evaluation of internal loops in RNA secondary structure prediction," Bioinformatics, vol. 15, no. 6, pp. 440-445, 1999.