Assignment 5

Due: Friday, March 7

Catalyst Handin

Reading:

Section 6.5

Problem:

Summary: Implement the Nussinov RNA folding algorithm.

Background: To briefly summarize section 6.5 and my lecture on RNA structure prediction, aka RNA folding, RNA molecules often fold back on themselves, forming stable double-helix structures akin to the famous DNA double helix, with G-C and A-U pairs forming. (We'll ignore other pairs, which sometimes form, too.) A commonly used set of rules for what structures may form is: ri * rj allowed only if i < j - 4 (it can't bend too sharply) and if ri * rj rj and ri'* rj' are two pairs with i <= i' then either

  1. i = i' and j = j' (ri can't pair with two different bases)
  2. j < i' (one pair completely precedes the other), or
  3. j' < j (one is completely nested within the other).
Given these conditions, we can represent the structure for any given sequence with parentheses for paired bases, and dots for unpaired bases. Eg., here is an example RNA sequence, and one possible structure (not necessarily an optimal one):
GCCCACCUUCGAAAAGACUGGAUGACCAUGGGCCAUGAUU [1]
((((....((.....)).(((....))).))))....... [2]
9 Pairs.                                 [3]
For a given RNA sequence, there will usually be a huge number of structures satisfying the above restrictions. Nature usually favors the "most stable" structure. One approximation to "most stable" is the structure having the maximum possible number of pairs (subject to the above restrictions). The goal of this homework is an algorithm for this problem.

Input: a single line containing a string of letters x1x2 . . . xn from the 4 letter alphabet {A,G,C,U} (all uppercase, for simplicity), as in line [1] above.

Output: one line containing the input, a second line containing (one of) its optimal structure(s), as in [2] above, plus a third line giving the total number of pairs in that structure, as in [3]. I say "one of" since there may be different structures with equal numbers of pairs; often slight variants of each other. Giving any one of them is OK. The structure line will be a string of parens and dots, vertically aligned with the input string. A dot in the structure line means that the corresponding position in the RNA is unpaired; a left paren means it is paired with a position to its right, marked by a right paren. Furthermore, parens must be properly balanced/nested, so specific paired positions are marked by "matching" left/right parens.

Method: Use the Nussinov algorithm, described in class and section 6.5 of the text.

What You Need To Do:

  1. (30 points) Implement the above algorithm for calculating OPT[i, j]. For n <=25, print out OPT. (You'll probably find this useful for debugging.) Show your OPT matrix for the first test case below.
  2. (30 points) Devise and implement an algorithm to construct and print the structure (i.e., the string of parens and dots). This is a "traceback," similar to ones we've seen with other dynamic programming algorithms. I strongly recommend that you look for a recursive algorithm to do this, but it is not required. You may (or may not) find it convenient to create auxiliary data structures while you're building OPT to facilitate the traceback. Print the input, with the structure aligned vertically below it. Also print the number of pairs. Run your algorithm on all test cases given below.
  3. (10 points) Write a description of your traceback algorithm, explaining how it works/why it is correct.
  4. (10 points) Analyze (separately and collectively) the (big-O) run time of both parts 1 and 2.
  5. (20 points) Measure the actual run time of your algorithm (total time for both parts) on random RNA sequences of length 20-2000, say, plot them on a graph (e.g. Excel might be convenient, but is not required), and discuss how this compares to the theoretical performance predicted in step 4.

Test Cases:

  1. AGCUCAUAUGGC
  2. GCUCCAGUGGCCUAAUGGAUAUGGCUUUGGACUUCUAAUCCAAAGGUUGCGGGUUCGAGUCCCGUCUGGAGUA
  3. CAGUAGCACAGGUUAUUUAAAGCCAGAGAGAGUGAGACCCAUGAACACUUGAGACUGU CACUUGAACUGAUUCACAUCUCAUUUU

FYI, Sequence 2 is a naturally occurring example, specifically an arginine tRNA from Trypanosoma brucei, the African sleeping sickness parasite; cf. Mottram,J.C.; Eier,W.; Sloof,P.; Bell,S.; Nelson,R.G.; Barry,J.D.; tRNAs of Trypanosoma brucei J. Biol. Chem. 266:1 (1991). If you are interested, you can find examples of more naturally occuring RNA sequences at Rfam. You can grab the sequence for an RNA, run it through your program, and compare with the "consensus secondary structure" reported by Rfam. For example, here is the Rfam entry for Ribosomal protein L20 leader. Don't worry if your structure is different -- real RNA folding involves a few more details which we haven't discussed here.

Language: Java or C/C++; ask first if you wish to use another language.

What to turn in: Turn in your program electronically via Catalyst. Also turn in a printed version of your writeup in class.