CSE 527 - Lecture 10, Monday October 30, 2006
Class notes, Hanna Filipsson

Gibbs sampler:
EM and Gibbs are similar methods. The biggest difference is that EM uses hill climbing to find a maximum, whereas Gibbs samples from the different instances.

Extensions
  The basic approach is simple but doesn't always work. Tricks to help:
  Phase shift
    once you've found an alignment, look around to see if shifting it over gives better results
      similar --> broaden search
      not similar --> narrow search
  Algorithmic adjustment of pattern width
    add/remove flanking positions (to maximize average relative entropy per position)

  Example (of algorithmic adjustment of pattern width):
    AATAAA
    AATAAG
    AATAAG
  The algorithm might settle on the last 5 positions. Theory says that if you run the algorithm long enough, it will find the better alignment of the first 5. But that requires a series of unlikely choices, and we don't want to wait that long. Instead, try moving the instances right and left.

------------

Methodology

There are several tools for finding transcription factor binding sites. The paper mentioned on the slides (Assessing computational tools for the discovery of transcription factor binding sites) presents a study of whether any tools (algorithms) are better than others:
  A group of experts was asked to find transcription factor binding sites.
  species used: human, mouse, fly, yeast
  datasets:
    'real' - collected real data
    'generic' - randomly chosen upstream regions
    'markov' - generated from a third-order Markov model of upstream regions
  they created datasets of all 3 kinds for all species

Unfortunately nature isn't as simple as the sequences we model this way, and we might bias the assessment toward some algorithms. Using the real data might penalize an algorithm for "false" predictions that are actually good (because the site wasn't known when the dataset was made). The generic data isn't perfect either, since we don't know the exact stochastic process.
There are arguments against all 3 kinds of dataset, so they used all 3 to improve the odds of a fair assessment. The experts on each algorithm ran it themselves, and only the top prediction was taken.

(Table of how well the different methods did)
  The different methods performed about equally well.
  Weeder did surprisingly well, but it is conservative (it made few wrong predictions but also few right ones).
  They all have their pros and cons.
  One might think a fancier algorithm is better, but it doesn't look that way in this table.
  Consensus - greedy algorithm, doing okay here
  (Gibbs was originally for protein alignment, but is adjusted here for DNA alignment)
  many use a variant of Gibbs
  many use a variant of EM
  MEME, MEME3 - variants of the same thing run by different groups, got similar results

(Table where the results are broken down by species, i.e. mouse, human, yeast, fly)
  Yeast got a high score
    we know a lot about yeast
    the models are probably overfit to yeast (yeast is what we created the models from)
  remember that the models are missing things
    e.g. a position can affect a place 100 000 positions away; this is not in any model

-------------

Comparative genomics: cross-species comparison

Phylogenies (aka evolutionary trees):
  Complex question: given data (sequence, anatomy), find the phylogeny
    has been discussed for a long time
    can make big mistakes (e.g. looks like a fish but isn't one)
  Simpler question: given data and a phylogeny, how well does the tree fit the data?
    (this will be our focus here)

Parsimony:
The general idea is like Occam's Razor: "All things being equal, the simplest solution tends to be the best one." Given data where change is rare, prefer an explanation that requires few events (mutations).

We compare sequences from different species and look at the places where they differ. If the species are related, each difference means a mutation happened somewhere along the tree. We want to construct a tree from this, to see how closely related the different species are.
Constructing the tree is difficult, however (the complex question above), so instead we look at how good an existing tree is (the simpler question above). The tree with the smallest number of required mutations is probably the right one. (It is "more likely that animals with an eyeball came from the same heritage, than several inventing the eyeball separately".)

Given a tree, we count the number of required mutations. For example:
    (1) (2)
     A   T
     A   T
     A   G
     C   G
     C   T
(1) and (2) might look equally good, but with the given tree (1) requires 1 change and (2) requires 2 changes (see slides for the tree).

This method works when change is rare.

Counting events parsimoniously:
  use the Sankoff & Rousseau algorithm
  it is dynamic programming, but instead of filling in a matrix (like in the last homework), we fill in a tree

Maximum likelihood is usually considered a better way to evaluate a phylogeny, but parsimony is a natural approach and it is fast.

Phylogenetic footprinting:
  goal: identify regulatory elements
  functional sequences evolve more slowly than non-functional ones
    mutations in non-functional sequence are not subject to selection (and may spread fast)
    a functional sequence is usually already the best version, so a mutated version is usually worse and will not spread
  consider a set of orthologous sequences (evolved from a common ancestor) from different species
  identify unusually well conserved regions (a hint that the region is functional)
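
Sketch (not from the lecture) of the parsimony counting described in the Parsimony section above: a small Sankoff-style dynamic program over a tree, written in Python. Several things here are assumptions made only for illustration: the unit cost per mutation, the function and variable names, and in particular the tree shape. The tree from the slides is not reproduced in these notes, so the sketch uses a hypothetical tree (((s1,s2),s3),(s4,s5)), which happens to give the same counts as the example (1 change for column (1), 2 changes for column (2)).

```python
import math

ALPHABET = "ACGT"


def substitution_cost(a, b):
    """Unit cost (an assumption): 0 if unchanged, 1 for any mutation."""
    return 0 if a == b else 1


def sankoff(tree, leaf_states):
    """Return a table {state: min number of changes in this subtree,
    given that the subtree's root carries that state}.

    `tree` is either a leaf name (str) or a tuple of child subtrees.
    `leaf_states` maps each leaf name to its observed nucleotide.
    """
    if isinstance(tree, str):
        # Leaf: only the observed nucleotide is possible.
        observed = leaf_states[tree]
        return {s: (0 if s == observed else math.inf) for s in ALPHABET}

    # Internal node: solve each child first, then combine (DP on the tree).
    child_tables = [sankoff(child, leaf_states) for child in tree]
    table = {}
    for s in ALPHABET:
        # For each child, pick the child state t that is cheapest overall.
        table[s] = sum(
            min(child[t] + substitution_cost(s, t) for t in ALPHABET)
            for child in child_tables
        )
    return table


def parsimony_score(tree, leaf_states):
    """Minimum number of mutations needed to explain one alignment column."""
    return min(sankoff(tree, leaf_states).values())


# Hypothetical tree (the real one is on the slides, not in these notes).
TREE = ((("s1", "s2"), "s3"), ("s4", "s5"))
SPECIES = ["s1", "s2", "s3", "s4", "s5"]

site1 = dict(zip(SPECIES, "AAACC"))  # column (1) of the example
site2 = dict(zip(SPECIES, "TTGGT"))  # column (2) of the example

print(parsimony_score(TREE, site1))  # 1
print(parsimony_score(TREE, site2))  # 2
```

The table kept at each node plays the role of a row of the DP matrix from the homework: one entry per possible nucleotide, filled in from the leaves up to the root. Replacing substitution_cost with a non-uniform cost function should work without other changes.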