Notes by Nan Li, (annli@cs)
1. Background
RNA has some of the properties of both DNA and proteins. It has the
same information storage capability as DNA due to its sequence of
nucleotides. But its ability to form 3D structures allows it to have
enzymatic properties like those of proteins. We call the linear
polynucleotide chain the primary structure of RNA; the intra-molecular
base-pairing is the secondary structure.
Secondary structures are important for RNA functions. For example, tRNA molecules all form similar structures (a characteristic "clover-leaf", one branch of which is the "anticodon loop", which pairs with a codon as part of the process of translating a messenger RNA into protein). Overall, the structures are critical to their function.
Of the 16 possible ordered base pairings, 6 are considered stable: the Watson-Crick pairs C-G and A-U, the G-U wobble pair, and their reverses. The Watson-Crick pairs are stabilized by hydrogen bonds between donor and acceptor sites on the bases. The G-U wobble pair is weaker but still stable. Other bases also pair sometimes, especially if chemically modified.
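As a small illustration of the pairing rule just described, the sketch below encodes the stable pairs as unordered sets and counts how many of the 16 ordered combinations qualify. The names STABLE_PAIRS and can_pair are illustrative choices, not part of the lecture.

```python
# Unordered stable pairs from the text: Watson-Crick C-G and A-U, plus the G-U wobble.
STABLE_PAIRS = {frozenset("CG"), frozenset("AU"), frozenset("GU")}


def can_pair(x: str, y: str) -> bool:
    """Return True if bases x and y form one of the stable pairs."""
    return frozenset((x, y)) in STABLE_PAIRS


# Count the ordered combinations (x, y) over {A, C, G, U} that can pair.
stable_count = sum(can_pair(x, y) for x in "ACGU" for y in "ACGU")
```

Modified bases and other occasional pairings mentioned above are deliberately left out of this sketch.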
2. Algorithms
In the literature on RNA secondary structure prediction, a commonly
used assumption is the absence of pseudoknots (defined below).
There has been some work on relaxing this assumption, but it will
not be covered in this lecture. Although the absence of
pseudoknots is clearly unrealistic, in practice they probably
constitute only a few percent of base pairs, and although sometimes
critical for 3D structures, they are unlikely to cause major
changes in secondary structure.
Definition of the prediction problem
Given a sequence s = r1…rn over the alphabet {A, C, G, U}, the output
secondary structure is a set of pairs {i_j} such that
(1) i < (j-4)
    /* hairpin loops should not be smaller than 3 */
(2) if i_j and i'_j' are two pairs, with i ≤ i', then
    (a) i = i' and j = j', or
    (b) j < i', or
        /* one pair precedes the other */
    (c) i < i' < j' < j
        /* nested pairs; excludes pseudoknots */
Violation of condition 2, i.e. two crossing pairs with i < i' < j < j', is essentially the definition of a pseudoknot.
There are three popular approaches to RNA secondary structure prediction. The first and second methods are both applications of dynamic programming, whose applicability depends entirely on the pseudoknot-free assumption.
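Conditions (1) and (2) of the definition can be checked mechanically. The following is a minimal sketch, assuming a structure is given as a collection of (i, j) index tuples; the function name is_valid_structure and the min_gap parameter (which mirrors the constant 4 in condition (1)) are illustrative choices.

```python
from itertools import combinations


def is_valid_structure(pairs, min_gap=4):
    """Check conditions (1) and (2) for a collection of (i, j) index pairs."""
    # Removing duplicates handles case 2(a); sorting guarantees i <= i2 below.
    pairs = sorted(set(pairs))
    # Condition (1): every pair must satisfy i < j - min_gap.
    if any(not (i < j - min_gap) for i, j in pairs):
        return False
    # Condition (2): any two distinct pairs must be disjoint or nested.
    for (i, j), (i2, j2) in combinations(pairs, 2):
        precedes = j < i2           # case 2(b)
        nested = i < i2 < j2 < j    # case 2(c)
        if not (precedes or nested):
            return False            # crossing pairs or a shared base
    return True
```

Two crossing pairs such as (1, 10) and (5, 15) are rejected, as are two pairs that share a base.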
2.1 Maximum Pairings, the Nussinov algorithm [1]: finding the structure with maximal pairings
Let B(i, j) be the number of base pairs in the
optimal pairing of subsequence ri...rj.
B(i, i) = 0;
B(i, i+1) = 0;
B(i, j) = max(
    B(i+1, j),
    B(i, j-1),
    B(i+1, j-1) + δ(ri, rj),
    max { B(i, k) + B(k+1, j) | i < k < j } )
Here, δ(ri, rj) is 1 if ri and rj can pair, and 0 otherwise.
The time complexity is O(|s|^3): there are O(|s|^2) entries B(i, j), and the max over k takes O(|s|) time for each entry.
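The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, assuming 0-based indices and the stable pairs C-G, A-U and G-U; the names nussinov and can_pair are illustrative. It follows the recurrence literally, so the only loop-size restriction is the base case B(i, i+1) = 0, and it returns only the count, not the structure itself.

```python
def can_pair(x, y):
    """delta(x, y): True for the stable pairs C-G, A-U and G-U."""
    return {x, y} in ({"C", "G"}, {"A", "U"}, {"G", "U"})


def nussinov(s):
    """Return B(0, n-1), the maximum number of base pairs for sequence s."""
    n = len(s)
    # B[i][j] covers s[i..j] inclusive; spans 0 and 1 keep the base value 0.
    B = [[0] * n for _ in range(n)]
    for span in range(2, n):              # span = j - i, shortest first
        for i in range(n - span):
            j = i + span
            best = max(
                B[i + 1][j],              # base i left unpaired
                B[i][j - 1],              # base j left unpaired
                B[i + 1][j - 1] + (1 if can_pair(s[i], s[j]) else 0),
            )
            for k in range(i + 1, j):     # split into two sub-intervals
                best = max(best, B[i][k] + B[k + 1][j])
            B[i][j] = best
    return B[0][n - 1] if n else 0
```

Filling the table in order of increasing span ensures every entry on the right-hand side is already computed. The three nested loops over span, i and k give the cubic running time.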
2.2 Minimum Energy [2]: finding the structure with minimal free energy
To calculate a secondary structure's free energy, the structure is
decomposed into a combination of basic loops, and the free energy of
each loop is summed.
There are 5 types of basic loops.
Hairpin: contains one uninterrupted strand/sequence of unpaired bases.
Bulge: internal loop with a single strand/sequence of unpaired bases.
Interior loop: contains two strands/sequences of unpaired bases.
Stacking: loop formed by pairing of the ith nucleotide with the jth and the (i+1)th nucleotide with the (j-1)th.
Multi-loop: a loop with more than two unpaired strands/sequences.
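The five loop types can be told apart from the closing pair and the pairs directly enclosed by it. The sketch below is one reading of the descriptions above; the function classify_loop and its inner argument (the list of pairs immediately inside the closing pair i_j) are hypothetical, and treating any loop with two or more inner pairs as a multi-loop is an assumption.

```python
def classify_loop(i, j, inner):
    """Classify the loop closed by pair (i, j).

    inner lists the pairs (i2, j2) directly enclosed by (i, j), i.e. not
    nested inside any other pair that is itself inside (i, j).
    """
    if not inner:
        return "hairpin"
    if len(inner) > 1:
        return "multi-loop"       # assumption: two or more inner pairs
    (i2, j2), = inner
    left = i2 - i - 1             # unpaired bases between i and i2
    right = j - j2 - 1            # unpaired bases between j2 and j
    if left == 0 and right == 0:
        return "stacking"
    if left == 0 or right == 0:
        return "bulge"            # unpaired bases on one strand only
    return "interior loop"        # unpaired bases on both strands
```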
W(i, j) is the minimal free energy in the optimal pairing of subsequence ri...rj.
V(i, j) is the minimal free energy in the optimal pairing of subsequence ri...rj in which i and j are paired (i_j).
W(i, j) = V(i, j) = ∞,  if j-i < 4
W(i, j) = min(
    W(i+1, j),
    W(i, j-1),
    V(i, j),
    min { W(i, k) + W(k+1, j) | i < k < j } )
V(i, j) = min(
    eh(i, j),
    es(i, j) + V(i+1, j-1),
    VBI(i, j),
    VM(i, j) )
VBI(i, j) = min { ebi(i, j, i', j') + V(i', j') | i < i' < j' < j, i'-i+j-j' > 2 }
VM(i, j) = min { W(i+1, k) + W(k+1, j-1) | i < k < j-1 }
where:
eh(i, j): free energy of a hairpin loop closed by i_j
es(i, j): free energy of the stacked pair i_j, (i+1)_(j-1)
ebi(i, j, i', j'): free energy of an interior/bulge loop with exterior pairs i_j and i'_j'
The time complexity is O(|s|^4), due to the VBI
term. Some algorithms impose an arbitrary limit of 30-40 bases on the
size of bulge/interior loops to avoid this n^4 cost.
Alternatively, if ebi(-) satisfies certain (realistic) assumptions,
then the time can be reduced to O(|s|^3); e.g.
see [3].
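The W/V recurrences can be sketched directly with memoized recursion. This is only an illustration under stated assumptions: the energy functions are passed in by the caller, the toy_* functions below use made-up constants rather than measured thermodynamic parameters, V(i, j) is additionally set to infinity when ri and rj cannot pair (implied by the definition of V, but not written in the recurrence), and indices are 0-based. It is the plain O(|s|^4) form with no limit on interior loop size, so it is only suitable for short sequences.

```python
import math
from functools import lru_cache

INF = math.inf


def can_pair(x, y):
    """True for the stable pairs C-G, A-U and G-U."""
    return {x, y} in ({"C", "G"}, {"A", "U"}, {"G", "U"})


def min_energy(s, eh, es, ebi):
    """Return W(0, n-1) for sequence s under caller-supplied loop energies."""
    n = len(s)

    @lru_cache(maxsize=None)
    def W(i, j):
        if j - i < 4:
            return INF
        split = min((W(i, k) + W(k + 1, j) for k in range(i + 1, j)),
                    default=INF)
        return min(W(i + 1, j), W(i, j - 1), V(i, j), split)

    @lru_cache(maxsize=None)
    def V(i, j):
        # Assumption: V is infinite when the two end bases cannot pair.
        if j - i < 4 or not can_pair(s[i], s[j]):
            return INF
        return min(eh(i, j), es(i, j) + V(i + 1, j - 1), VBI(i, j), VM(i, j))

    def VBI(i, j):
        # All interior pairs (i2, j2) except the stacked one (sum == 2).
        return min((ebi(i, j, i2, j2) + V(i2, j2)
                    for i2 in range(i + 1, j)
                    for j2 in range(i2 + 1, j)
                    if (i2 - i) + (j - j2) > 2), default=INF)

    def VM(i, j):
        return min((W(i + 1, k) + W(k + 1, j - 1)
                    for k in range(i + 1, j - 1)), default=INF)

    return W(0, n - 1) if n else INF


# Hypothetical toy energies (made-up constants, for illustration only).
def toy_eh(i, j):
    return 3.0      # hairpin loops cost energy


def toy_es(i, j):
    return -2.0     # stacked pairs are favorable


def toy_ebi(i, j, i2, j2):
    return 2.0      # bulge/interior loops cost energy


energy = min_energy("GGGAAACCC", toy_eh, toy_es, toy_ebi)
```

With these toy values the best structure for GGGAAACCC is a hairpin closed by three stacked G-C pairs: two stacking terms plus one hairpin term, 2 × (-2.0) + 3.0 = -1.0. A sequence with no possible pairs yields infinity, since the base case assigns W = ∞ to short intervals.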
References
[1] R. Nussinov, G. Pieczenik, J. Griggs, and D. Kleitman, "Algorithms
for loop matchings," SIAM J. Appl. Math., vol. 35, pp. 68-82, 1978.
[2] R. Nussinov and A. B. Jacobson, "Fast algorithm for predicting the
secondary structure of single stranded RNA," Proc. Natl. Acad. Sci.
USA, vol. 77, pp. 6309-6313, 1980.
[3] R. B. Lyngso, M. Zuker, and C. N. Pedersen, "Fast evaluation of
internal loops in RNA secondary structure prediction," Bioinformatics,
vol. 15, no. 6, pp. 440-445, 1999.