CSE 527 Lecture 2, Wednesday 10/01/03
Notes by Katarzyna Wilamowska <kasiaw@u>
Talks today
1:30 in HSB K-069
Dr. Phil Green
"Finishing the Gene-ome: Computationally Directed Gene Structure
Verification in C. elegans"
3:30 in Hitchcock Hall 132
Dr. Mark Chee
"Accessing Genetic Information: Technology for Large Scale SNP
Genotyping"
SNP - single nucleotide polymorphism
Good websites for information on computational biology:
Genome sizes
NAME                     | Genome length | Number of genes | Notes
Mycoplasma genitalium    | 580,073       | 483             | smallest bacterium by genome size
E. coli                  | 4,639,221     | 4,290           |
Saccharomyces cerevisiae | 12,495,682    | 5,726           | baker's yeast
Caenorhabditis elegans   | 95.5 x 10^6   | 19,820          | little see-through worm
Arabidopsis thaliana     | 115,409,949   | 25,498          |
Drosophila melanogaster  | 122,653,977   | 13,472          | fruit fly
humans                   | 3.3 x 10^9    | ~25,000         |
Gene Expression
proteins do most of the work
they are dynamically created/destroyed
so are their mRNA blueprints
different mRNAs expressed at different times/places
knowing mRNA "expression levels" tells a lot about the state of the cell
Expression Microarrays
thousands to hundreds of thousands of spots per square inch
each holds millions of copies of a DNA sequence from one gene
take mRNA from cells, put it on the array
see where it sticks - mRNA from gene x should stick to spot x
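As a toy illustration of the "mRNA from gene x sticks to spot x" idea, the sketch below treats each spot as a short DNA probe and counts an mRNA molecule as bound when it contains the probe's reverse complement. The gene names, the sequences, and the exact-match rule are all assumptions made up for illustration; real hybridization tolerates mismatches and depends on the chemistry.

```python
# Toy model of array hybridization. The exact-match rule is an assumption.
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}  # DNA base -> pairing RNA base

def target_of(probe_dna):
    """RNA sequence that pairs perfectly with a DNA probe (its reverse complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(probe_dna))

def hybridize(probes, mrnas):
    """Count, per spot, how many mRNA molecules contain that probe's target."""
    counts = {}
    for name, probe in probes.items():
        target = target_of(probe)
        counts[name] = sum(target in mrna for mrna in mrnas)
    return counts

# Hypothetical probes and sample, invented for this example.
probes = {"geneX": "ACGTTGCA", "geneY": "GGATCCAA"}
sample = ["AAUGCAACGUCC", "UGCAACGU", "CCUUGGAUCCGG"]
signal = hybridize(probes, sample)
```

Here `signal` maps each spot to the number of sample molecules that stuck to it, a stand-in for the per-spot intensity read off a real array.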
An Expression Array Experiment
cells
|
| extract mRNA
\/
mRNA
|
\/
array of spots
|
| UV light
\/
array of spots
UV light shows by color where mRNA sticks
An example application
72 leukemia patients
47 ALL (acute lymphoblastic leukemia)
25 AML (acute myeloid leukemia)
1 chip per patient
7132 human genes per chip
Key issue: What's Different?
What genes are behaving differently between ALL & AML (or other disease/normal states)?
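One simple way to make "behaving differently" concrete is to score each gene by how far apart its two class means are, relative to the spread within each class. The sketch below is only one possible score, not necessarily the one used in lecture, and the gene names and expression values are hypothetical.

```python
from statistics import mean, stdev

def separation_score(all_values, aml_values):
    """Difference of class means scaled by the summed class standard deviations.

    Raises ZeroDivisionError if both classes have zero spread.
    """
    spread = stdev(all_values) + stdev(aml_values)
    return (mean(all_values) - mean(aml_values)) / spread

# Hypothetical expression levels: gene -> (ALL patients, AML patients).
expression = {
    "geneA": ([10.0, 12.0, 11.0], [30.0, 29.0, 31.0]),
    "geneB": ([20.0, 22.0, 18.0], [21.0, 19.0, 20.0]),
}
scores = {gene: separation_score(a, b) for gene, (a, b) in expression.items()}

# Genes with the largest absolute score differ most between the classes.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
```

A large positive score means higher expression in ALL, a large negative score means higher expression in AML, and a score near zero means the gene does not separate the two.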
Potential uses:
diagnosis
prognosis
insight into underlying biology
treatment
A classification problem
Given an array from a new patient, is it ALL or AML?
Many possible approaches (LDA, logistic regression, NN)
Problems - noise, high dimensionality (thousands of genes, only dozens of patients)
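To show the shape of the classification problem, here is a minimal sketch using a nearest-centroid rule. This is a simpler stand-in, not one of the methods listed above: each class is summarized by its mean expression vector, and a new array gets the label of the closer mean. The 3-gene arrays are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(rows):
    """Per-gene mean over a list of equal-length expression vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def train(labelled):
    """labelled: dict mapping class label -> list of expression vectors."""
    return {label: centroid(rows) for label, rows in labelled.items()}

def classify(model, array):
    """Return the label whose centroid is closest to the new array."""
    return min(model, key=lambda label: dist(model[label], array))

# Hypothetical 3-gene arrays; a real chip measures thousands of genes.
training = {
    "ALL": [[1.0, 5.0, 2.0], [1.2, 4.8, 2.1]],
    "AML": [[4.0, 1.0, 2.0], [3.8, 1.2, 1.9]],
}
model = train(training)
prediction = classify(model, [1.1, 4.5, 2.3])
```

With thousands of noisy genes and few patients, distances like this are dominated by uninformative genes, which is one reason noise and dimensionality are listed as problems.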
PolyA tail
on the 3' end of mRNA
likely recognized by the machinery that transports mRNA from the nucleus to the rest of the cell
useful to us: mRNA can be separated out with polyT sequences, which bind the polyA tail
Practical Application of Microarrays
gene target discovery
pharmacology and toxicology
diagnostics
study gene function and regulation
refined categorization of diseases
e.g. "prostate cancer" is almost certainly not one disease. Are subtypes distinguishable at the expression level?
Microarray platforms
oligonucleotide-based arrays
25mers spotted on a glass wafer, Affymetrix GeneChip arrays
custom spotted 50-80mers generated from known sequences
cDNA (complementary DNA) - initially easier and cheaper to do
inserts from cDNA libraries
PCR products generated from gene-specific or universal primers
DNA is more stable and easier to work with in lab. RNA degrades
quickly.
How unique is a 20mer?
VERY crude model: DNA is random - every position is equally likely to be A, C, G, or T, independent of each other.
Then the probability of a random 20mer is
(1/4)^20 = (1/2)^40 = ((1/2)^10)^4 = (1/1024)^4
which is about (10^-3)^4 = 10^-12
So a random 20mer occurs somewhere in random human-sized DNA (about 3.3 x 10^9 positions) with probability about 3.3 x 10^9 x 10^-12, i.e. roughly 0.003
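The arithmetic above can be checked with a few lines of Python. This is a sketch under the same crude model (independent, equally likely bases), using the human genome length of 3.3 x 10^9 from the genome size table. The function names are made up for this example. It counts one strand only, and it computes an expected number of occurrences, which for values this small is close to the probability of at least one occurrence.

```python
def kmer_probability(k, alphabet_size=4):
    """Probability that k random bases spell out one particular k-mer."""
    return (1 / alphabet_size) ** k

def expected_occurrences(k, genome_length):
    """Expected number of positions in a random genome that match the k-mer."""
    positions = genome_length - k + 1  # number of possible start positions
    return positions * kmer_probability(k)

p20 = kmer_probability(20)                       # (1/4)^20, about 10^-12
hits = expected_occurrences(20, 3_300_000_000)   # about 0.003
```

Trying smaller values of `k` shows how quickly uniqueness is lost: under this model a 15mer is expected to appear about three times in a human-sized genome.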
How random is a Genome?
G/C content can vary from ~40-60% across and within organisms
Adjacent pairs are not independent
Adjacent triples are not independent
...
Many large-scale repeats, e.g.
similar genes, domains within genes
transpositions and other junk (within primates, ~5% of all DNA is composed of (noisy) copies of a 300bp Alu sequence)
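The first two departures from randomness can be measured directly on a sequence. The sketch below computes G/C content and, for each adjacent pair, the ratio of its observed frequency to the frequency expected if neighboring bases were independent. The function names and the toy sequence are assumptions for illustration, and the input is assumed to contain only A, C, G, and T.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def dinucleotide_ratios(seq):
    """Observed / expected frequency of each adjacent pair that occurs in seq.

    Expected frequency assumes the two bases are independent.
    """
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    ratios = {}
    for pair, count in pair_counts.items():
        expected = base_freq[pair[0]] * base_freq[pair[1]]
        ratios[pair] = (count / (n - 1)) / expected
    return ratios

toy = "ACACACAC"  # made-up sequence with an obvious pair bias
gc = gc_content(toy)
ratios = dinucleotide_ratios(toy)
```

A ratio near 1 is what the crude independent model predicts; ratios well above or below 1 mean adjacent bases are not independent. In the toy sequence the pair AC is over-represented and AA never occurs at all.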