Computational Biology Capstone: Project

University of Washington Computer Science & Engineering

Correlating Genomic Annotations

Project goals

Software design goal: a pipeline that takes as input a genome annotation table A and, for each feasible Genome Browser table B, determines whether A and B are more correlated than expected by chance.
Analysis goal: apply the pipeline to the extremely conserved elements in order to understand the function of such extreme conservation.

Phase Alpha restrictions
We start with a simplified version of the project, but design the software for ease of later extensions.

The input genome annotation table A consists of unmarked segments given by the extremely conserved elements.
Genome Browser table B is feasible if a set of unmarked segments can be extracted from it.
We will only look at tables from the hg19 human genome assembly.
Correlation of A and B is measured by the number of base-pairs by which they overlap.
A Monte Carlo simulation determines the p-value of this overlap. Table B is held fixed, and the segment and intersegment lengths of table A are preserved while the positions of those segments and intersegments are randomized, as described in statistical tests, "US-US Overlap" section, "Null Hypothesis 3" subsection.
There will be one such analysis per human chromosome.
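
The Phase Alpha test for one chromosome can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: segments are half-open (start, end) pairs, sorted and non-overlapping; the function names overlap_bp, shuffle_segments, and overlap_p_value are hypothetical; and the randomization shown (independently shuffling the order of segment lengths and of intersegment lengths) is only one plausible reading of "Null Hypothesis 3", so check it against the statistical tests document before relying on it.

```python
import random

def overlap_bp(a, b):
    """Base pairs shared by two sorted, non-overlapping lists of
    half-open (start, end) segments on the same chromosome."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

def shuffle_segments(a, chrom_len, rng):
    """Return new segments with the same multiset of segment lengths and
    intersegment (gap) lengths as `a`, in a random order.
    ASSUMPTION: this ordering shuffle stands in for Null Hypothesis 3."""
    seg_lens = [end - start for start, end in a]
    # Boundaries: chromosome start, every segment edge, chromosome end.
    bounds = [0] + [x for seg in a for x in seg] + [chrom_len]
    # Gaps include the leading and trailing ones, so there are len(a) + 1.
    gap_lens = [bounds[k + 1] - bounds[k] for k in range(0, len(bounds), 2)]
    rng.shuffle(seg_lens)
    rng.shuffle(gap_lens)
    out, pos = [], 0
    for gap, seg in zip(gap_lens, seg_lens):  # the last gap trails the final segment
        pos += gap
        out.append((pos, pos + seg))
        pos += seg
    return out

def overlap_p_value(a, b, chrom_len, n_sim=1000, seed=0):
    """One-sided Monte Carlo p-value: how often a randomized A overlaps B
    by at least as many base pairs as the observed A does."""
    rng = random.Random(seed)
    observed = overlap_bp(a, b)
    hits = sum(
        overlap_bp(shuffle_segments(a, chrom_len, rng), b) >= observed
        for _ in range(n_sim)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_sim + 1)
```

Calling overlap_p_value(a, b, chrom_len) once per chromosome gives the per-chromosome analysis described above. The add-one correction in the last line is a common convention for Monte Carlo p-values and is a choice made here, not something the project description specifies.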

Phase Beta restrictions

The input genome annotation table A consists of unmarked segments given by the extremely conserved elements.
Genome Browser table B is feasible if it can be treated as a partial function from genomic positions to real numbers. This includes tables of type wig, bigwig, and bedgraph.
We will only look at tables from the hg19 human genome assembly.
Correlation of A and B is measured by the average value of the table B function over positions in table A segments.
A Monte Carlo simulation determines the p-value of this average value. Table B is held fixed, and the segment and intersegment lengths of table A are preserved while the positions of those segments and intersegments are randomized.
There will be one such analysis per human chromosome.
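
The Phase Beta statistic can be sketched in the same style. This is a hedged illustration rather than a specified method: table B is assumed to be given as bedGraph-like half-open (start, end, value) intervals, sorted and non-overlapping; the name mean_signal is hypothetical; and positions in table A where B is undefined are assumed to be left out of the average, which is one way to handle a partial function that the description above does not settle.

```python
def mean_signal(a, b):
    """Average value of B over the positions of A's segments.

    a: sorted, non-overlapping half-open (start, end) segments.
    b: sorted, non-overlapping half-open (start, end, value) intervals.
    ASSUMPTION: positions of A where B is undefined are ignored.
    Returns None when B is undefined on all of A.
    """
    i = j = 0
    total = 0.0   # sum of B's value over covered positions
    covered = 0   # number of A positions where B is defined
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += (hi - lo) * b[j][2]
            covered += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total / covered if covered else None
```

The Monte Carlo step is unchanged in structure from Phase Alpha: randomize table A the same way, recompute this statistic on each randomized table, and compare the simulated values against the observed one.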

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to cse428-owner]