Summary: Implement the Nussinov RNA folding algorithm.
Background: To briefly summarize section 6.5 and my lecture on RNA structure prediction, aka RNA folding, RNA molecules often fold back on themselves, forming stable double-helix structures akin to the famous DNA double helix, with G-C and A-U pairs forming. (We'll ignore other pairs, which sometimes form, too.) A commonly used set of rules for what structures may form is: ri * rj allowed only if i < j - 4 (it can't bend too sharply) and if ri * rj rj and ri'* rj' are two pairs with i <= i' then either
GCCCACCUUCGAAAAGACUGGAUGACCAUGGGCCAUGAUU [1] ((((....((.....)).(((....))).))))....... [2] 9 Pairs. [3]For a given RNA sequence, there will usually be a huge number of structures satisfying the above restrictions. Nature usually favors the "most stable" structure. One approximation to "most stable" is the structure having the maximum possible number of pairs (subject to the above restrictions). The goal of this homework is an algorithm for this problem.
Input: a single line containing a string of letters x1x2 . . . xn from the 4 letter alphabet {A,G,C,U} (all uppercase, for simplicity), as in line [1] above.
Output: one line containing the input, a second line containing (one of) its optimal structure(s), as in [2] above, plus a third line giving the total number of pairs in that structure, as in [3]. I say "one of" since there may be different structures with equal numbers of pairs; often slight variants of each other. Giving any one of them is OK. The structure line will be a string of parens and dots, vertically aligned with the input string. A dot in the structure line means that the corresponding position in the RNA is unpaired; a left paren means it is paired with a position to its right, marked by a right paren. Furthermore, parens must be properly balanced/nested, so specific paired positions are marked by "matching" left/right parens.
Method: Use the Nussinov algorithm, described in class and section 6.5 of the text.
What You Need To Do:
Test Cases:
FYI, Sequence 2 is a naturally occurring example, specifically an arginine tRNA from Trypanosoma brucei, the African sleeping sickness parasite; cf. Mottram,J.C.; Eier,W.; Sloof,P.; Bell,S.; Nelson,R.G.; Barry,J.D.; tRNAs of Trypanosoma brucei J. Biol. Chem. 266:1 (1991). If you are interested, you can find examples of more naturally occuring RNA sequences at Rfam. You can grab the sequence for an RNA, run it through your program, and compare with the "consensus secondary structure" reported by Rfam. For example, here is the Rfam entry for Ribosomal protein L20 leader. Don't worry if your structure is different -- real RNA folding involves a few more details which we haven't discussed here.
Language: Java or C/C++; ask first if you wish to use another language.
What to turn in: Turn in your program electronically via Catalyst. Also turn in a printed version of your writeup in class.