CSE 527 Lecture 21: RNA Secondary Structure Prediction

Notes by Nan Li (annli@cs)

1. Background
RNA has some of the properties of both DNA and proteins. Like DNA, it stores information in its sequence of nucleotides. Like proteins, it can fold into 3D structures, which gives it enzymatic properties. We call the linear polynucleotide chain the primary structure of RNA; the intra-molecular base-pairing is the secondary structure.

Secondary structures are critical to RNA function. For example, tRNA molecules all form similar structures: a characteristic "clover-leaf", one branch of which is the "anticodon loop", which pairs with a codon as part of the process of translating a messenger RNA into protein.

Of the 16 possible ordered base pairings, 6 are considered stable (C-G, A-U, and G-U, each counted in both orders). The Watson-Crick pairs, C-G and A-U, are stabilized by hydrogen bonds between donor and acceptor sites on the bases. The G-U wobble pair is weaker but still stable. Other bases also pair sometimes, especially if chemically modified.
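
As a small illustration, the stable pairings can be written as a lookup table. This is only a sketch: the names STABLE_PAIRS and can_pair are hypothetical, and the table covers unmodified bases only.

```python
# Ordered pairs (x, y) treated as stable: Watson-Crick pairs plus the G-U wobble.
# STABLE_PAIRS and can_pair are illustrative names, not part of the lecture.
STABLE_PAIRS = {
    ("C", "G"), ("G", "C"),  # Watson-Crick
    ("A", "U"), ("U", "A"),  # Watson-Crick
    ("G", "U"), ("U", "G"),  # wobble
}


def can_pair(x: str, y: str) -> bool:
    """Return True if bases x and y form one of the stable pairs."""
    return (x, y) in STABLE_PAIRS


# Count the stable pairings among all 16 ordered combinations of A, C, G, U.
stable_count = sum(can_pair(x, y) for x in "ACGU" for y in "ACGU")
```

Counting ordered combinations this way gives the 6 stable pairings out of 16 mentioned above.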

2. Algorithms
In the literature on RNA secondary structure prediction, a commonly used assumption is the absence of pseudoknots (defined below). There has been some work on relaxing this assumption, but it will not be covered in this lecture. Although the assumption is unrealistic, in practice pseudoknots probably constitute only a few percent of base pairs, and although they are sometimes critical for 3D structure, they are not too likely to cause major changes in secondary structure.

Definition of the prediction problem

Given a sequence s = r₁…r_n over the alphabet {A, C, G, U}, the output secondary structure is a set of pairs {i_j} such that
(1) i ≤ j-4                    /* a hairpin loop must contain at least 3 unpaired bases; matches the base case j-i < 4 in Section 2.2 */
(2) if i_j and i'_j' are two pairs, with i ≤ i' then
   (a) i = i' && j = j', or
   (b) j < i', or     /*one pair precedes the other*/
   (c) i < i' < j' < j        /*nested pairs; excludes pseudo-knots*/
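
The two conditions can be checked mechanically. The following is a minimal sketch, not part of the lecture: the function name is_valid_structure is hypothetical, pairs are assumed to be (i, j) tuples with i < j, and the minimum hairpin size is left as a parameter min_loop (default 3 unpaired bases, following the comment on condition (1)).

```python
from itertools import combinations


def is_valid_structure(pairs, min_loop=3):
    """Check conditions (1) and (2) for a set of (i, j) pairs with i < j.

    min_loop is the assumed minimum number of unpaired bases between i and j.
    """
    pairs = sorted(set(pairs))  # duplicates collapse, which covers case 2(a)

    # Condition (1): every pair encloses at least min_loop bases.
    for i, j in pairs:
        if j - i - 1 < min_loop:
            return False

    # Condition (2): two distinct pairs must be disjoint or nested.
    # Sorting guarantees i <= i2 for each combination below.
    for (i, j), (i2, j2) in combinations(pairs, 2):
        precedes = j < i2            # case 2(b)
        nested = i < i2 < j2 < j     # case 2(c)
        if not (precedes or nested):
            return False             # shared base or crossing pairs
    return True
```

For example, the nested pairs (1, 10) and (2, 9) pass the check, while the crossing pairs (1, 10) and (5, 15) are rejected.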

Crossing pairs, i < i' < j < j', violate condition (2); this is essentially the definition of a pseudoknot. There are three popular approaches to RNA secondary structure prediction. The first two are applications of dynamic programming, whose applicability depends entirely on the pseudoknot-free assumption.

2.1 Maximum Pairings, the Nussinov algorithm [1]: finding the structure with maximal pairings
Let B(i, j) be the number of base pairs in the optimal pairing of subsequence r_i...r_j.
B(i, i) = 0;
B(i, i+1) = 0;
B(i, j) = max( B(i+1, j), B(i, j-1), B(i+1, j-1) + δ(r_i, r_j), max {B(i, k) + B(k+1, j) | i < k < j} )

Here, δ(r_i, r_j) is 1 if r_i and r_j can pair, zero otherwise.
The time complexity is O(|s|³): there are O(|s|²) entries B(i, j), and each takes O(|s|) time for the max over k.
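
The recurrence above can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than a reference one: the names nussinov and can_pair are hypothetical, indices are 0-based, the pairing rule is Watson-Crick plus G-U wobble, and only the optimal count is returned (no traceback). It follows the recurrence literally, so the only hairpin restriction is the base case B(i, i+1) = 0.

```python
def can_pair(x, y):
    """delta(x, y): Watson-Crick pairs plus the G-U wobble (assumed rule)."""
    return (x, y) in {("C", "G"), ("G", "C"), ("A", "U"),
                      ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov(seq):
    """Return the maximum number of base pairs in seq."""
    n = len(seq)
    if n < 2:
        return 0
    # B[i][j] = 0 covers the base cases B(i, i) and B(i, i+1).
    B = [[0] * n for _ in range(n)]
    # Fill by increasing span j - i so shorter subsequences are ready first.
    for span in range(2, n):
        for i in range(n - span):
            j = i + span
            best = max(
                B[i + 1][j],                              # r_i unpaired
                B[i][j - 1],                              # r_j unpaired
                B[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0),
            )
            # Bifurcation: split into two independent subsequences.
            for k in range(i + 1, j):
                best = max(best, B[i][k] + B[k + 1][j])
            B[i][j] = best
    return B[0][n - 1]
```

The two nested loops over (span, i) visit each table entry once, and the inner loop over k is the source of the cubic running time.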

2.2 Minimum Energy [2]: find the structure with minimal free energy
To calculate the free energy of a secondary structure, the structure is decomposed into basic loops, and the free energies of the loops are summed.
There are 5 types of basic loops.

Hairpin: contains one uninterrupted strand/sequence of unpaired bases.

Bulge: internal loop with a single strand/sequence of unpaired bases.

Interior loop: contains two strands/sequences of unpaired bases.

Stacking: loop formed by pairing of the ith nucleotide with the jth and the (i+1)th nucleotide with the (j-1)th.

Multi-loop: a loop with more than two unpaired strands/sequences.
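
The five loop types can be told apart from the closing pair and the pairs directly enclosed by it. The sketch below is one way to express this; the function name classify_loop and its argument inner (the list of pairs immediately inside the closing pair) are hypothetical, introduced only for illustration.

```python
def classify_loop(i, j, inner):
    """Classify the loop closed by pair (i, j).

    inner: list of pairs (i2, j2) directly enclosed by (i, j), i.e. with
    no other pair between them and (i, j).
    """
    if len(inner) == 0:
        return "hairpin"            # a single strand of unpaired bases
    if len(inner) >= 2:
        return "multi-loop"         # three or more strands meet in the loop
    i2, j2 = inner[0]
    left = i2 - i - 1               # unpaired bases on the 5' side
    right = j - j2 - 1              # unpaired bases on the 3' side
    if left == 0 and right == 0:
        return "stacking"           # (i, j) directly on top of (i+1, j-1)
    if left == 0 or right == 0:
        return "bulge"              # unpaired bases on one side only
    return "interior loop"          # unpaired bases on both sides
```

For instance, classify_loop(1, 10, [(3, 9)]) has one unpaired base on the left and none on the right, so it is reported as a bulge.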

The algorithm is similar to the maximum pairing algorithm. Let:

W(i, j) be the minimal free energy over pairings of subsequence r_i...r_j.
V(i, j) be the minimal free energy over pairings of subsequence r_i...r_j in which i and j are paired with each other.
W(i, j) = V(i, j) = ∞ for j-i < 4
W(i, j) = min( W(i+1, j), W(i, j-1), V(i, j), min {W(i, k)+W(k+1, j) | i < k < j } )
V(i, j) = min( eh(i, j), es(i, j)+V(i+1, j-1), VBI(i, j), VM(i, j) )
VBI(i, j) = min{ ebi(i, j, i', j') + V(i', j') | i < i' < j' < j, i'-i+j-j' > 2}
VM(i, j) = min{ W(i+1, k) + W(k+1, j-1) | i < k < j-1}
where:
eh(i, j): free energy of a hairpin loop closed by i_j
es(i, j): free energy of a stacked pair i_j, (i+1)_(j-1)
ebi(i, j, i', j'): free energy of an interior/bulge loop closed by the pairs i_j and i'_j'

The time complexity is O(|s|⁴), due to the VBI term. Some algorithms impose an arbitrary limit of 30-40 bases on the size of bulge/interior loops to avoid this n⁴ cost. Alternatively, if ebi(·) satisfies certain (realistic) assumptions, the time can be reduced to O(|s|³); see, e.g., [3].
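
The W/V recurrences, together with the optional loop-size limit, can be sketched as follows. Everything specific here is an assumption made for illustration: the energy functions eh, es, and ebi return made-up toy values (real values come from experimentally derived tables), V(i, j) is set to ∞ when r_i and r_j cannot pair, the parameter max_loop is a hypothetical name for the loop-size cap, and indices are 0-based. Only the optimal energy is returned, with no traceback.

```python
import math

INF = math.inf
PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


# Toy energy functions with made-up values, for illustration only.
def eh(i, j):
    return 4.0                       # hairpin loop closed by (i, j)


def es(i, j):
    return -2.0                      # (i, j) stacked on (i+1, j-1)


def ebi(i, j, i2, j2):
    unpaired = (i2 - i - 1) + (j - j2 - 1)
    return 1.0 + 0.5 * unpaired      # bulge/interior loop penalty


def min_energy(seq, max_loop=None):
    """Return W(0, n-1); max_loop optionally caps bulge/interior loop size."""
    n = len(seq)
    if n == 0:
        return INF
    # Base case: W = V = INF whenever j - i < 4.
    W = [[INF] * n for _ in range(n)]
    V = [[INF] * n for _ in range(n)]
    for span in range(4, n):
        for i in range(n - span):
            j = i + span
            if (seq[i], seq[j]) in PAIRS:        # assumption: else V stays INF
                best = min(eh(i, j), es(i, j) + V[i + 1][j - 1])
                # VBI: inner pair (i2, j2) with at least one unpaired base.
                cap = span if max_loop is None else max_loop
                for i2 in range(i + 1, min(j, i + 2 + cap)):
                    left = i2 - i - 1
                    for j2 in range(max(i2 + 1, j - 1 - (cap - left)), j):
                        if left + (j - j2 - 1) == 0:
                            continue             # stacking, handled by es
                        best = min(best, ebi(i, j, i2, j2) + V[i2][j2])
                # VM: multi-loop split into two non-empty parts.
                for k in range(i + 1, j - 1):
                    best = min(best, W[i + 1][k] + W[k + 1][j - 1])
                V[i][j] = best
            w = min(W[i + 1][j], W[i][j - 1], V[i][j])
            for k in range(i + 1, j):
                w = min(w, W[i][k] + W[k + 1][j])
            W[i][j] = w
    return W[0][n - 1]
```

With a cap in place, the two VBI loops examine at most O(max_loop²) inner pairs per (i, j) instead of O(|s|²), which is how the size limit avoids the quartic cost. Because the base case sets W to ∞ on short subsequences, the value returned is the energy of the best structure containing at least one pair, or ∞ if none exists.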

References
[1] R. Nussinov, G. Pieczenik, J. Griggs, and D. Kleitman, "Algorithms for loop matchings," SIAM J. Appl. Math., vol. 35, pp. 68-82, 1978.
[2] R. Nussinov and A. B. Jacobson, "Fast algorithm for predicting the secondary structure of single stranded RNA," Proc. Natl. Acad. Sci. USA, vol. 77, pp. 6309-6313, 1980.
[3] R. B. Lyngsø, M. Zuker, and C. N. Pedersen, "Fast evaluation of internal loops in RNA secondary structure prediction," Bioinformatics, vol. 15, no. 6, pp. 440-445, 1999.
