University of Washington Computer Science & Engineering
  CSE 527Au '06:  Assignment #6

Reading

Homework
  1. Download the Infernal software package (infernal.tar.gz; index page says version 0.55, but get version 0.7, or 0.71; they're much improved).

  2. Read 00README and sections 1-3 of Userguide.pdf (skip "local alignments" on page 12).

  3. Build and install it following the instructions in section 2 of the manual. (If you do the make install step, it copies the 4 executable files cmalign, cmbuild, cmscore, cmsearch into /usr/local/bin; you can easily delete them afterwards if you don't want to keep them. Alternatively, add .../infernal-0.7/src to your path so these programs can be found. Contrary to the manual's warning, I've found it to work properly on Mac OS X; I have not tried it on Windows, but I think it works. Please let me know if you have trouble.)

    I had a slight problem with the build in 0.7, getting the error

      gcc -I. -g -O2  -c easel.c
      easel.c:13: error: static declaration of 'esl_error_handler' follows non-static declaration
      ./easel.h:140: error: previous declaration of 'esl_error_handler' was here
    
    Deleting "static" from the front of line 13 of easel.c:
      static esl_error_handler_f esl_error_handler = NULL;
      ^^^^^^
    
    then rerunning make seemed to cure this; your mileage may vary. Please let me know if you also see this problem; I'll report it to the developers.
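If you would rather script that one-word edit than make it by hand, here is a minimal sketch. It assumes line 13 of your copy of easel.c still begins with the static keyword; the function name drop_leading_static is my own and is not part of Infernal or Easel.

```python
def drop_leading_static(source: str, line_number: int) -> str:
    """Return `source` with a leading 'static' removed from one line (1-based)."""
    lines = source.split("\n")
    target = lines[line_number - 1]
    stripped = target.lstrip()
    # Refuse to touch the line unless it really starts with the keyword.
    if not stripped.startswith("static "):
        raise ValueError(f"line {line_number} does not start with 'static'")
    indent = target[: len(target) - len(stripped)]
    lines[line_number - 1] = indent + stripped[len("static "):]
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in for easel.c: 12 filler lines, then the offending declaration.
    demo = "\n".join(
        ["/* filler */"] * 12
        + ["static esl_error_handler_f esl_error_handler = NULL;"]
    )
    print(drop_leading_static(demo, 13).split("\n")[12])
```

To apply it to the real file, read easel.c into a string, pass it through the function with line_number=13, and write the result back (keep a backup copy first), then rerun make.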

  4. Follow the tutorial steps outlined in section 3 "Getting Started".

  5. The cmbuild example builds a model ("my.cm") for tRNA based on 5 yeast tRNAs. Given so few sequences and such closely related ones, it's a surprisingly good model. I've extracted a handful of tRNA sequences from the Genbank records for Pyrococcus furiosus (an anaerobic archaeon found in 100°C sediments near sea floor vents, presumably not a close relative of S. cerevisiae). Here are 3 versions of the sequences:

    Run cmsearch on pfur.fa using your "my.cm" model. [Note: cmsearch will also work on pfur.gb, but it seems to silently replace letters other than ACGT/U by random nucleotides, so the genbank comments (lines beginning with semicolons) become "junk DNA". This junk is unlikely to match the CM, but it does shift the coordinates of everything that follows it.]
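Because that replacement is silent, it can be worth checking an input file for stray characters before searching it. The following is a rough sketch, assuming plain FASTA input (header lines start with ">"); the helper name unexpected_residues is hypothetical and has nothing to do with Infernal itself.

```python
from collections import Counter


def unexpected_residues(fasta_text: str) -> Counter:
    """Count sequence characters other than A, C, G, T, U (case-insensitive)."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # header line, not sequence
        for ch in line:
            if ch.isspace():
                continue
            if ch.upper() not in "ACGTU":
                counts[ch] += 1
    return counts


if __name__ == "__main__":
    sample = ">demo\nACGTNNacgu\nGG-CC\n"
    print(unexpected_residues(sample))
```

An empty Counter means every sequence character is one of ACGT/U, so nothing should be replaced and the coordinates should be undisturbed.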

  6. Deliverable #1: send me the output of cmsearch above, together with the scores of the lowest scoring true tRNA and highest scoring false tRNA (true/false according to the Genbank annotation). How do these compare to the "rough guide" for score significance given near the bottom of page 9 of the user guide?

    Note: cmsearch searches both strands, and coordinates on the "hit n:..." lines are always with respect to the input sequence, but the coordinates it reports in its alignments for hits on the reverse strand count positions from the front of the reversed sequence.
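To relate the two coordinate systems for a reverse-strand hit, a small conversion helps. This sketch assumes 1-based positions and that "the front of the reversed sequence" is position 1 of the reverse strand, so position p there corresponds to position L - p + 1 of an input sequence of length L. The function names are mine.

```python
def reversed_to_input(pos: int, seq_len: int) -> int:
    """Map a 1-based position on the reversed sequence to the input sequence."""
    if not 1 <= pos <= seq_len:
        raise ValueError("position outside the sequence")
    return seq_len - pos + 1


def reversed_span_to_input(start: int, end: int, seq_len: int) -> tuple:
    """Map a (start, end) pair; on the input strand the start will exceed the end."""
    return reversed_to_input(start, seq_len), reversed_to_input(end, seq_len)


if __name__ == "__main__":
    # Hypothetical hit at positions 10..80 of a reversed 100 nt sequence.
    print(reversed_span_to_input(10, 80, 100))
```

Applying the same function twice returns the original position, which is a quick sanity check on any coordinates you convert by hand.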

  7. This model did pretty well, but maybe that's all due to Eddy having very carefully selected his example tRNA sequences and very carefully aligned them by hand.

    Use Zizhen Yao's CMfinder to automatically discover a tRNA motif in the P. furiosus sequences. I think you'll find it more convenient to download and install the software via the above link, but you may use the web server version if you prefer. (It may take 5-15 minutes to run on this example, depending on web server load.) Since 8 of the 10 tRNAs in this data happen to be on the reverse strand, I suggest you use pfurrc.fa rather than pfur.fa for this step; CMfinder only looks at one strand. I'd suggest you set CMfinder's parameter for expected number of stemloops to 3.
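If you want to run a single-strand tool on the other strand of some other file, the sketch below shows one way to reverse-complement FASTA records. I am assuming pfurrc.fa is simply the reverse complement of pfur.fa, as its name suggests; the function names and the " (reverse complement)" header suffix are my own choices.

```python
# Complement table; U is read as T, and any other letter (e.g. N) is left as-is.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string (DNA alphabet out)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(text: str) -> list:
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


def reverse_complement_fasta(text: str, width: int = 60) -> str:
    """Reverse-complement every record, wrapping sequence lines at `width`."""
    out = []
    for header, seq in read_fasta(text):
        rc = reverse_complement(seq)
        out.append(header + " (reverse complement)")
        out.extend(rc[i:i + width] for i in range(0, len(rc), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(reverse_complement_fasta(">demo\nAAC\nGT\n"), end="")
```

Remember that coordinates in the output refer to the reversed sequence, so hits found there need to be mapped back before comparing them with the Genbank annotation.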

  8. Use this model to cmsearch pfur.fa. It will do pretty well -- no surprise, it can find the sequences it was built from. Also use it to search the tutorial.fa file from the Infernal distribution, which just contains the 5 yeast tRNAs from which you built my.cm. You should find that the CMfinder model built from the P. furiosus data doesn't do as well at recognizing yeast tRNAs as the hand-built yeast model did at finding P. furiosus tRNAs.

  9. Deliverable #2: Send me the results of scanning tutorial.fa, together with the lowest true positive and highest false positive scores.
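Once you have labeled each hit as true or false against the annotation, the two numbers requested in the deliverables are a min and a max. A minimal sketch, assuming you have transcribed the hits by hand into (score, is_true) pairs; the scores in the demo are made up for illustration and are not real cmsearch output.

```python
def score_thresholds(hits):
    """Return (lowest true-positive score, highest false-positive score).

    `hits` is an iterable of (score, is_true) pairs; either value is None
    when there are no hits of that kind.
    """
    hits = list(hits)
    true_scores = [score for score, is_true in hits if is_true]
    false_scores = [score for score, is_true in hits if not is_true]
    lowest_true = min(true_scores) if true_scores else None
    highest_false = max(false_scores) if false_scores else None
    return lowest_true, highest_false


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    demo_hits = [(60.1, True), (41.5, True), (18.2, False), (9.0, False)]
    print(score_thresholds(demo_hits))
```

If the lowest true score is above the highest false score, any threshold between the two separates the hits perfectly on this data; if not, the model confuses at least one pair.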

  10. Improve the CMfinder model so that it does a better job of finding yeast tRNAs, without significantly reducing its success on P. furiosus tRNAs. Try to think about doing this in a situation where you have a few "trusted" examples, e.g. the ones in P. furiosus, but none in yeast. There are several ways I can think of that might accomplish this. E.g.:

    You can probably think of other strategies.

  11. Deliverable #3: Try one or more of the above strategies, and/or one or more of your own, and tell me in a couple of paragraphs what you did and how well it worked (e.g., send me scan results and true/false score thresholds as above). Also send me the refined .sto file you created. If you have the time and patience, scan more of the yeast genome to see how it does. To assess its false negative rate, you can feed it the tRNAs annotated in Genbank (plus maybe 100 nt of flanking sequence). Assessing the false positive rate is harder; you need to feed it a lot of sequence, which is slow. If running the raw CM is too slow, Zasha Weinberg's RaveNnA filtering software might be useful. An early version is included with Infernal (but requires some non-default options during installation); alternatively, the latest version is here.
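For the false negative test, cutting out each annotated tRNA with its flanking sequence is a simple slice. A sketch, assuming 1-based inclusive coordinates as in Genbank feature tables and a forward-strand feature (a reverse-strand feature would also need reverse-complementing); the function name with_flank is hypothetical.

```python
def with_flank(genome: str, start: int, end: int, flank: int = 100) -> str:
    """Return genome[start..end] (1-based, inclusive) plus up to `flank` nt each side."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("feature coordinates outside the sequence")
    lo = max(1, start - flank)          # clamp at the start of the sequence
    hi = min(len(genome), end + flank)  # clamp at the end of the sequence
    return genome[lo - 1:hi]


if __name__ == "__main__":
    # Toy 10 nt "genome" with a feature at 4..6 and 2 nt of flank.
    print(with_flank("ACGTACGTAC", 4, 6, flank=2))
```

Clamping matters for features near either end of a chromosome or contig, where fewer than 100 nt of flank exist.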

  12. "Extra Credit": If you found this interesting, I have dozens to hundreds of novel CMfinder motifs that are largely unexplored. Based on conservation near orthologous genes in various bacterial clades, they are plausibly cis-regulatory motifs of some kind. I'd love to see some of these worked out more fully. Let me know if you're interested.

Bundle the "deliverables" into one file with .zip or .tgz and UPLOAD IT HERE.

Please don't hesitate to contact me if you have questions, problems installing the software, etc.

