CSE 527 Lecture 2, Monday 10/04/04
Notes by Luca Cazzanti <lucagc@u>,
Katarzyna Wilamowska <kasiaw@u> (Autumn 2003);
Gene Structure
Historically, a gene has been defined as a unit of heritable material,
a segment of a chromosome that codes for a protein. However, we now know that
in a gene there are also non-coding sequences (sequences that do not code for
a protein) and regulatory sequences (sequences that regulate whether other
sequences code for a protein, or are "expressed"). In a gene:
- transcription happens from the 5' to the 3' terminus;
- the promoter region and the transcription factor binding sites precede the 5' terminus;
- the transcribed region includes 5' and 3' untranslated regions;
- most eukaryotes have introns, regions of DNA that are transcribed but then
spliced out of the RNA before transport out of the nucleus, that is, before translation.
Genome sizes
It is interesting to compare the size of the genome and the number of
genes for several organisms.
NAME | Genome length (base pairs) | Number of genes | Comment
Mycoplasma genitalium | 580,073 | 483 | a bacterium, one of the smallest living organisms
E. coli | 4,639,221 | 4,290 |
Saccharomyces cerevisiae | 12,495,682 | 5,726 | baker's yeast
Caenorhabditis elegans | 95.5 x 10^6 | 19,820 | little see-through worm
Arabidopsis thaliana | 115,409,949 | 25,498 |
Drosophila melanogaster | 122,653,977 | 13,472 | fruit fly
humans | 3.3 x 10^9 | ~25,000 |
The table shows some interesting facts:
- Arabidopsis thaliana and humans have about the same number of genes, but
the human genome is more than an order of magnitude larger (roughly 30 times).
This means that in humans a great number of base pairs do not code for proteins.
- It's somewhat surprising that humans have such a low number of genes,
given the complexity of the human body. It's believed that in humans the
regulatory structure that accompanies the genes is of great importance, and
therefore the number of genes by itself is not as significant.
The importance of the regulatory structure is also evidenced by the different ways in which humans and Arabidopsis thaliana obtain genes with similar functionality. In Arabidopsis, similar functionality is achieved by genes that are almost identical. In humans, on the other hand, similar functions are often carried out by variants of a gene produced via alternative splicing.
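To make the comparison concrete, the short Python sketch below computes the average number of base pairs per gene from the figures in the table above. The variable names are illustrative only, and the human and C. elegans lengths are the rounded table values, so the output is a rough guide rather than a precise measurement.

```python
# Genome length (base pairs) and gene counts copied from the table above.
# Human and C. elegans lengths are the rounded values given in the table.
genomes = {
    "Mycoplasma genitalium": (580_073, 483),
    "E. coli": (4_639_221, 4_290),
    "Saccharomyces cerevisiae": (12_495_682, 5_726),
    "Caenorhabditis elegans": (95_500_000, 19_820),
    "Arabidopsis thaliana": (115_409_949, 25_498),
    "Drosophila melanogaster": (122_653_977, 13_472),
    "humans": (3_300_000_000, 25_000),
}

# Average base pairs per gene: a crude measure of how sparsely genes are packed.
bp_per_gene = {name: length / genes for name, (length, genes) in genomes.items()}

# How many times larger the human genome is than the Arabidopsis genome.
size_ratio = genomes["humans"][0] / genomes["Arabidopsis thaliana"][0]

for name, density in bp_per_gene.items():
    print(f"{name:26s} {density:12,.0f} bp per gene")
print(f"human / Arabidopsis genome size: {size_ratio:.1f}x")
```

With these figures the bacteria come out on the order of 1,000 base pairs per gene, while humans come out above 100,000.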
Other "surprises" in the genome:
Humans have about 1/3 as many genes as were expected (see note above too)
There are unexpectedly many non-coding RNAs
Many regions of non-coding RNA are conserved across mammals, yet more
evidence of the importance of regulatory networks, of which we know little.
Gene Expression
proteins do most of the work
they are dynamically created/destroyed
so are their mRNA blueprints
different mRNAs expressed at different times/places
knowing mRNA "expression levels" tells a lot about the state of the cell
Note however that gene regulation happens at all levels.
Expression Microarrays
thousands to hundreds of thousands of spots per square inch
each holds millions of copies of a DNA sequence from one gene
take mRNA from cells, put it on array
see where it sticks - mRNA from gene x should stick to spot x
Special processing and manufacturing techniques are used to make sure that the
sample in the liquid solution binds to the DNA in the spots rather than to other
material, and to ensure a proper selection and layout of the spots on the mask.
An Expression Array Experiment
The diagram below shows the typical sequence of steps involved in a microarray
experiment. The mRNA is extracted from the particular cells to be tested and
put in a liquid solution, which is washed over the microarray. Typically the mRNAs
are tagged with fluorescent dyes, so that on exposure
to UV light the bright spots show the genes that have hybridized.
cells
  |
  | extract mRNA
  \/
mRNA
  |
  | wash over microarray
  \/
microarray (grid of spots)
  |
  | UV light
  \/
microarray (spots where mRNA has hybridized light up)
UV light shows by color where mRNA sticks
An example application of microarrays: distinguishing two types of leukemia
We want to distinguish between two types of leukemia: ALL and AML. It's
important to distinguish because the therapy for one might actually make the
other type worse. Accurate diagnosis is therefore needed.
72 leukemia patients: 47 ALL and 25 AML
1 chip per patient
7132 human genes per chip
Key issue: What's Different?
What genes are behaving differently between ALL & AML (or other
disease/normal states)? The differences in the two microarrays can help
us determine which type of leukemia is present in a patient. There
are other potential uses:
diagnosis: determine the risk of a type of cancer in a patient, based on genotype
prognosis, which may change depending on the genotype
insight into underlying biology
treatment
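One simple way to ask "what's different?" gene by gene is to score each gene by how far apart its expression levels are in the two groups, relative to their spread. The sketch below uses Welch's t-statistic for that purpose. This choice is an assumption made for illustration, not a method named in these notes, and the gene names and expression values are made up.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances.

    Assumes each sample has at least two values and that the two samples
    are not both constant (otherwise the denominator is zero).
    """
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )

def rank_genes(expr_all, expr_aml):
    """Return (gene, t) pairs sorted by |t|, most different first.

    expr_all / expr_aml map a gene name to its expression values,
    one value per patient in that group.
    """
    scores = {gene: welch_t(expr_all[gene], expr_aml[gene]) for gene in expr_all}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up expression values for three hypothetical genes, four patients per group.
expr_all = {
    "geneA": [5.1, 4.8, 5.3, 5.0],
    "geneB": [2.0, 2.2, 1.9, 2.1],
    "geneC": [3.0, 3.5, 2.5, 3.2],
}
expr_aml = {
    "geneA": [5.0, 5.2, 4.9, 5.1],
    "geneB": [4.0, 4.1, 3.8, 4.2],
    "geneC": [3.6, 3.4, 3.9, 3.3],
}

ranking = rank_genes(expr_all, expr_aml)
for gene, t in ranking:
    print(f"{gene}: t = {t:+.2f}")
```

With thousands of genes and only tens of patients, some genes will score highly by chance alone, so a ranking like this is a starting point rather than a conclusion.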
From a computational point of view, distinguishing between the two types of
leukemia is a classification problem. Many approaches have been and are
being investigated: linear discriminant analysis (LDA), logistic regression,
neural networks, and support vector machines are some examples. When these
techniques are applied to this problem, as to many problems in biology,
the major difficulties are the high dimensionality of the data (here, 7132
genes but only 72 patients) and noise.
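As a minimal sketch of the classification view, the Python code below fits a logistic regression (one of the approaches listed above) by plain gradient descent. Everything about the data is an assumption for illustration: the patients are synthetic, the label encoding (0 = ALL, 1 = AML) is arbitrary, and only the first 5 of 50 hypothetical genes carry any signal. The learning rate, epoch count, and L2 penalty are likewise illustrative choices, not tuned values.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, b, x):
    """Linear score w.x + b for one patient's expression vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_logistic(X, y, lr=0.1, epochs=200, l2=0.01):
    """Fit weights and bias by batch gradient descent with an L2 penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if score(w, b, x) >= 0 else 0

def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)

# Synthetic "patients": 50 hypothetical genes, only the first 5 are informative.
N_GENES, N_INFORMATIVE = 50, 5
rng = random.Random(0)

def make_patient(label):
    x = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]
    shift = 2.0 if label == 1 else -2.0
    for j in range(N_INFORMATIVE):
        x[j] += shift
    return x

train_labels = [0] * 20 + [1] * 20
test_labels = [0] * 10 + [1] * 10
X_train = [make_patient(label) for label in train_labels]
X_test = [make_patient(label) for label in test_labels]

w, b = train_logistic(X_train, train_labels)
train_accuracy = accuracy(w, b, X_train, train_labels)
test_accuracy = accuracy(w, b, X_test, test_labels)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

The L2 penalty is there because, with more genes than patients, the training data can usually be separated perfectly, and unpenalized weights would keep growing. Real expression data is much noisier and higher-dimensional than this toy setup.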
Another application of microarrays: yeast sporulation
Sporulation is one process by which yeast reproduce. (They also divide asexually.) We want to track
which genes are active at different time points during the sporulation process:
7 time points over 18 hours
1 microarray/time point
All 6200 yeast genes on each microarray
This experiment has yielded a 3- to 10-fold increase in the number of genes
known to be involved in sporulation. Many of these genes have recognizable
analogs among human genes. This is an impressive achievement, considering that
the process only takes 18 hours. This example will be discussed in much more
detail in the next lecture.
Other Applications of Microarrays
Other applications of microarrays include the study of gene function and
regulation. For example, by studying how the expression levels of two or more
genes covary, it's possible to infer whether they are co-regulated and/or share
a common pathway. Another application is gene target discovery, where
comparisons of microarray experiments on diseased and normal cells can suggest
which genes play a role in the disease. These genes can then be targeted
specifically by a drug.
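A minimal sketch of the covariation idea: compute the Pearson correlation between the expression profiles of two genes across a series of arrays. Pearson correlation is one common choice and is an assumption here. The three gene names and their 7-point profiles are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Made-up expression levels at 7 time points for three hypothetical genes.
profiles = {
    "gene1": [0.1, 0.5, 1.2, 2.0, 2.4, 2.1, 1.5],
    "gene2": [0.2, 0.6, 1.1, 1.9, 2.6, 2.0, 1.4],
    "gene3": [2.0, 1.6, 1.0, 0.4, 0.2, 0.5, 0.9],
}

r12 = pearson(profiles["gene1"], profiles["gene2"])  # rise and fall together
r13 = pearson(profiles["gene1"], profiles["gene3"])  # move in opposite directions

print(f"gene1 vs gene2: r = {r12:+.3f}")
print(f"gene1 vs gene3: r = {r13:+.3f}")
```

A correlation near +1 or -1 only suggests a shared regulatory mechanism. With few time points, strong correlations also arise by chance, so such pairs are candidates for follow-up rather than proof of co-regulation.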
In pharmacology and toxicology microarrays may also be of help.
For example, microarrays can identify a drug's activity and toxicity with
respect to specific genes, and test subjects for drug trials may be selected
accordingly.
Diagnostics and the study of gene function and regulation are other areas in which
microarrays can be helpful, again by virtue of their ability to identify, quickly
and with high throughput, genes that are expressed or suppressed. Finally,
as indicated in the leukemia example, microarrays can be used to refine the
type of disease being diagnosed.
Microarray platforms
oligonucleotide-based arrays
- Affymetrix GeneChip: 25-mers synthesized directly on a glass wafer, monochromatic, can achieve small feature size but are costly due to manufacturing technique (photolithography)
- Agilent: custom spotted 50-80mers generated from known sequences, 2-color,
easy to reprogram and less costly.
cDNA (complementary DNA) - initially easier and cheaper to do;
inserts from cDNA libraries.
(DNA is more stable and easier to work with in the lab, whereas RNA degrades
quickly. For this reason, all the technologies work with DNA and use cDNA copies of the
mRNAs.)
How unique is a 20mer?
VERY crude model: DNA is random - every position is equally likely to
be A, C, G, or T, independent of each other.
Then the probability that a random 20mer matches one specific 20mer is
(1/4)^20 = (1/2)^40 = ((1/2)^10)^4 = (1/1024)^4
which is about (10^-3)^4 = 10^-12.
So the expected number of occurrences of a specific 20mer in random human-sized
DNA is about
(3 x 10^9) x 10^-12 = 0.003, which is also roughly the probability that it occurs
at all. Recall from the table above that the human genome is roughly 3 x 10^9 base pairs long.
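The arithmetic above can be checked without the rounding of 1/1024 to 10^-3. The sketch below assumes the same crude model (uniform, independent bases), a single strand, and the rounded genome length of 3 x 10^9. The variable names are illustrative.

```python
import math
from fractions import Fraction

K = 20                      # probe length
GENOME_LENGTH = 3 * 10**9   # rounded human genome size, in base pairs

# Probability that one random 20mer equals a given 20mer
# (uniform, independent bases).
p_match = Fraction(1, 4) ** K

# Expected number of matches over all start positions on one strand.
n_positions = GENOME_LENGTH - K + 1
expected_matches = float(n_positions * p_match)

# Probability of at least one match, treating start positions as independent
# (an approximation: overlapping windows are not truly independent).
p_at_least_one = -math.expm1(n_positions * math.log1p(-float(p_match)))

print(f"P(match at one position) = {float(p_match):.3e}")
print(f"expected matches         = {expected_matches:.4f}")
print(f"P(at least one match)    = {p_at_least_one:.4f}")
```

The exact expected count comes out a little under 0.003, because 1/1024 is slightly smaller than 10^-3. For such a small expected count, the probability of at least one match is nearly the same number.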
This model is very rough. Some things that it does not take into account are:
G/C content can vary from ~40-60% across and within organisms
Adjacent pairs are not independent
Adjacent triplets are not independent
Many large-scale repeats, e.g. similar genes, domains within genes;
transpositions and other junk (within primates, ~5% of all DNA is composed of
(noisy) copies of a 300bp Alu sequence)
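To illustrate the first caveat only, the sketch below drops the "equally likely" assumption while keeping independence between positions. It assumes G and C each occur with probability gc/2, and A and T each with probability (1 - gc)/2. The function name and this parameterization are illustrative, not from the notes.

```python
import math

def kmer_probability(kmer, gc_content):
    """Probability of a specific k-mer when bases are independent but
    G/C and A/T have different frequencies.

    gc_content is the fraction of bases that are G or C.
    """
    p_gc = gc_content / 2        # probability of G, and of C
    p_at = (1 - gc_content) / 2  # probability of A, and of T
    prob = 1.0
    for base in kmer.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return prob

poly_a = "A" * 20
uniform = kmer_probability(poly_a, 0.5)  # same as (1/4)^20
at_rich = kmer_probability(poly_a, 0.4)  # genome with 40% G/C
ratio = at_rich / uniform                # equals (0.3 / 0.25)^20

print(f"uniform model : {uniform:.3e}")
print(f"40% G/C genome: {at_rich:.3e}")
print(f"ratio         : {ratio:.1f}")
```

Even this small departure from uniformity changes the estimate for an all-A 20mer by more than an order of magnitude, which is one reason the 0.003 figure should be read as a rough guide.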
Sources of Noise in Microarrays
Lot-to-lot variation
Experiment-to-experiment variation: cell state, culture purity, sample
preparation, and hybridization conditions all affect the experiments
Spot-to-spot variation: unequal dye incorporation
Non-linear dye response/saturation
Uneven spot size
Self- and cross-hybridization
Image capture and processing techniques (spot finding, quantization, etc.)
Nevertheless, the ability to get this system-level view of cell activity is very valuable.