CSE 527 Lecture 2, Monday 10/04/04

Notes by Luca Cazzanti <lucagc@u>,
Katarzyna Wilamowska <kasiaw@u> (Autumn 2003);

Gene Structure

Historically, a gene has been defined as a unit of heritable material, a segment of a chromosome that codes for a protein. However, we now know that a gene also contains non-coding sequences (sequences that do not code for a protein) and regulatory sequences (sequences that regulate whether other sequences code for a protein, or are "expressed"). In a gene:

Genome sizes

It is interesting to compare the size of the genome and the number of genes for several organisms.
NAME                       Genome length   Number of genes   Comment
mycoplasma genitalium            580,073               483   a bacterium, one of the smallest living organisms
E. coli                        4,639,221             4,290
saccharomyces cerevisiae      12,495,682             5,726   baker's yeast
caenorhabditis elegans       95.5 x 10^6            19,820   little see-through worm
arabidopsis thaliana         115,409,949            25,498
drosophila melanogaster      122,653,977            13,472   fruit fly
humans                        3.3 x 10^9           ~25,000
The table shows some interesting facts: The importance of the regulatory structure is also evidenced by the different ways in which humans and arabidopsis thaliana obtain genes with similar functionality. In arabidopsis, similar functions are carried out by genes that are almost identical. In humans, on the other hand, similar functions are often carried out by alternatively spliced forms of a gene.

Other "surprises" in the genome:
  • Humans have about 1/3 as many genes as was expected (see note above too)
  • There are unexpectedly many non-coding RNAs
  • Many regions of non-coding RNA are conserved across mammals, yet more evidence of the importance of regulatory networks, of which we know little.
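
The comparison in the genome-size table above can be made concrete by computing gene density. The following is a minimal sketch: the numbers are copied from the table, while the function name and the choice of "genes per megabase" as the summary are ours, not part of the lecture.

```python
# Genome length (base pairs) and number of genes, copied from the table above.
GENOMES = {
    "mycoplasma genitalium": (580_073, 483),
    "E. coli": (4_639_221, 4_290),
    "saccharomyces cerevisiae": (12_495_682, 5_726),
    "caenorhabditis elegans": (95.5e6, 19_820),
    "arabidopsis thaliana": (115_409_949, 25_498),
    "drosophila melanogaster": (122_653_977, 13_472),
    "humans": (3.3e9, 25_000),  # the gene count is approximate (~25,000)
}


def genes_per_megabase(length_bp, n_genes):
    """Return the number of genes per million base pairs of genome."""
    return n_genes / (length_bp / 1e6)


density = {name: genes_per_megabase(*row) for name, row in GENOMES.items()}

# Print the organisms from the most gene-dense to the least gene-dense.
for name, value in sorted(density.items(), key=lambda item: -item[1]):
    print(f"{name:26s} {value:8.1f} genes/Mb")
```

With these figures the two bacteria come out at several hundred genes per megabase, while humans come out below ten, which is one way to see how much of the human genome is not protein-coding sequence.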

    Gene Expression

  • proteins do most of the work
  • they are dynamically created/destroyed
  • so are their mRNA blueprints
  • different mRNAs expressed at different times/places
  • knowing mRNA "expression levels" tells a lot about the state of the cell
  • Note however that gene regulation happens at all levels.

    Expression Microarrays

  • thousands to hundreds of thousands of spots per square inch
  • each holds millions of copies of a DNA sequence from one gene
  • take mRNA from cells, put it on array
  • see where it sticks - mRNA from gene x should stick to spot x
  • Special processing and manufacturing techniques are used to make sure that the DNA does not stick to the cell in the liquid solution instead of the genes and to ensure a proper selection and layout of the spots on the mask.
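
    The idea that material from gene x "sticks" to spot x can be illustrated with base pairing: a strand binds to a probe whose sequence is its reverse complement. This is a minimal sketch that assumes exact complementarity is the only criterion (real hybridization tolerates some mismatches and depends on experimental conditions). The gene sequences, the probe design, and all names below are invented for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the strand that pairs with seq, read in the usual direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sticks_to(sample, probe):
    """True if some stretch of the sample is exactly complementary to the probe."""
    return reverse_complement(probe) in sample


# Hypothetical (very short) gene sequences; real genes are far longer.
GENES = {
    "gene_x": "GGATCCATTGCA",
    "gene_y": "TTGACCGGTAAC",
}

# One spot per gene: the probe is the reverse complement of a piece of the gene.
probes = {name: reverse_complement(seq[2:10]) for name, seq in GENES.items()}

# A sample that contains only material from gene_x.
sample = GENES["gene_x"]
lit_spots = [name for name, probe in probes.items() if sticks_to(sample, probe)]
print(lit_spots)
```

    In this toy setup only the spot built for gene_x reports a match, because no stretch of the sample is complementary to the gene_y probe.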

    An Expression Array Experiment

    The diagram below shows the typical sequence of steps involved in a microarray experiment. The mRNA is extracted from the particular cells to be tested and put in a liquid solution, which is washed over the microarray. Typically the mRNAs are tagged with fluorescent dyes, so that on exposure to UV light the bright spots show the genes that have hybridized.
    o o o o
     o o o
      cells

      |
      | extract mRNA
      |
     \/
    ~ ~ ~
     ~ ~ ~
    mRNA

      |
      |
      |
     \/
    O O O O O O
    O O O O O O
    O O O O O O
    O O O O O O
    O O O O O O
    O O O O O O
      |
      |
      |             UV light
     \/
    O O O O O O
    O O O O O O
    O O O O O O
    O O O O O O
    O O O O O O
    O O O O O O

    UV light shows by color where mRNA sticks

    An example application of microarrays: distinguishing two types of leukemia

    We want to distinguish between two types of leukemia: ALL and AML. It's important to distinguish because the therapy for one might actually make the other type worse. Accurate diagnosis is therefore needed.
  • 72 leukemia patients: 47 ALL and 25 AML
  • 1 chip per patient
  • 7132 human genes per chip

  • Key issue: What's Different? What genes are behaving differently between ALL & AML (or other disease/normal states)? The differences in the two microarrays can help us determine which type of leukemia is present in a patient. There are other potential uses:
          diagnosis: determine the risk of a type of cancer in a patient, based on genotype
          prognosis, which may change depending on the genotype
          insight into underlying biology
          treatment
    From a computational point of view, distinguishing between the two types of leukemia is a classification problem. Many approaches have been and are being investigated: linear discriminant analysis (LDA), logistic regression, neural networks, and support vector machines are some examples. As with many problems in biology, the major difficulties in applying these techniques are the high dimensionality of the data (here, 7132 genes measured on only 72 patients) and noise.
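
    To make the classification framing concrete, here is a minimal sketch of a nearest-centroid classifier. This is not one of the methods listed above; it is a deliberately simpler stand-in chosen because it fits in a few lines. The expression values are made up (4 genes, 3 patients per class), whereas the real data set has 7132 genes per patient.

```python
from statistics import mean


def centroid(profiles):
    """Per-gene mean over a list of equal-length expression profiles."""
    return [mean(values) for values in zip(*profiles)]


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(profile, centroids):
    """Return the label whose class centroid is closest to the profile."""
    return min(centroids, key=lambda label: squared_distance(profile, centroids[label]))


# Hypothetical training data: one row per patient, one column per gene.
TRAINING = {
    "ALL": [[9.0, 2.0, 5.0, 5.1], [8.5, 2.5, 4.8, 5.0], [9.2, 1.8, 5.2, 4.9]],
    "AML": [[3.0, 7.5, 5.1, 5.0], [2.6, 8.0, 4.9, 5.2], [3.3, 7.8, 5.0, 4.8]],
}

centroids = {label: centroid(rows) for label, rows in TRAINING.items()}

new_patient = [8.8, 2.2, 5.0, 5.0]  # hypothetical chip readout
predicted = classify(new_patient, centroids)
print(predicted)
```

    In the toy data only the first two genes differ between the classes; the last two are uninformative. With thousands of such uninformative, noisy genes and only a few dozen patients, distances like the one above become dominated by noise, which is the dimensionality problem mentioned in the text.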

    Another application of microarrays: yeast sporulation

    Sporulation is one process by which yeast reproduce. (They also divide asexually). We want to track which genes are active at different time points during the sporulation process:
  • 7 time points over 18 hours
  • 1 microarray/time point
  • All 6200 yeast genes on each microarray
  • This process has yielded a 3- to 10-fold increase in the number of genes known to be involved in sporulation. Many of these genes have recognizable analogs in human genes. This is an impressive achievement, considering that the process only takes 18 hours. This example will be discussed in much more detail in the next lecture.

    Other Applications of Microarrays

    Other applications of microarrays include the study of gene function and regulation. For example, by studying how the expression levels of two or more genes vary together, it's possible to infer whether they are co-regulated and/or share a common pathway. Another application is gene target discovery, where comparisons of microarray experiments on diseased and normal cells can suggest which genes play a role in the disease. These genes can then be targeted specifically by a drug.
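
    One simple way to quantify how two genes vary together is the Pearson correlation of their expression profiles across experiments. The sketch below assumes that choice of measure (the lecture does not prescribe one) and uses made-up profiles over 7 time points.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical expression levels at 7 time points.
gene_a = [1.0, 2.0, 4.0, 8.0, 4.0, 2.0, 1.0]
gene_b = [2.0, 4.0, 8.0, 16.0, 8.0, 4.0, 2.0]  # rises and falls with gene_a
gene_c = [8.0, 4.0, 2.0, 1.0, 2.0, 4.0, 8.0]   # falls when gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(f"r(a, b) = {r_ab:.3f}")
print(f"r(a, c) = {r_ac:.3f}")
```

    A correlation near +1 or -1 is only a hint of co-regulation: with noisy data and thousands of genes, some pairs will correlate strongly by chance.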

    In pharmacology and toxicology microarrays may also be of help. For example, microarrays can identify a drug's activity and toxicity with respect to specific genes, and test subjects for drug trials may be selected accordingly.

    Diagnostics and gene function and regulation are other areas in which microarrays can be helpful, again, by virtue of their ability to identify genes that are expressed or suppressed quickly and with large throughput. Finally, as indicated in the leukemia example, microarrays can be used to refine the type of disease being diagnosed.

    Microarray platforms

  • oligonucleotide-based arrays
    1. Affymetrix GeneChip: 25-mers spotted on a glass wafer, monochromatic; can achieve small feature size but are costly due to the manufacturing technique (photolithography)
    2. Agilent: custom spotted 50-80mers generated from known sequences, 2-color, easy to reprogram and less costly.
  • cDNA (complementary DNA) - initially easier and cheaper to do; inserts from cDNA libraries. (DNA is more stable and easier to work with in the lab, whereas RNA degrades quickly. For this reason all the technologies work with DNA, using cDNA copies of the mRNAs.)

    How unique is a 20mer?

    VERY crude model: DNA is random - every position is equally likely to be A, C, G, or T, independent of each other.
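    Then the probability that one specific 20mer starts at a given position is
    (1/4)^{20} = (1/2)^{40} = ((1/2)^{10})^4 = (1/1024)^4, which is about (10^{-3})^4 = 10^{-12}.
    So the expected number of occurrences of a specific 20mer in random human-sized DNA is about (3 x 10^9) x 10^{-12} = 0.003, which is also an upper bound on the probability that it occurs at all. Recall from the table above that about 3 x 10^9 is the size of the human genome.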
    This model is very rough. Some things that it does not take into account are:

  • G/C content can vary from ~40-60% across and within organisms
  • Adjacent pairs are not independent
  • Adjacent triplets are not independent
  • Many large-scale repeats, e.g. similar genes, domains within genes; transpositions and other junk (within primates, ~5% of all DNA is composed of (noisy) copies of a 300bp Alu sequence)
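
    The back-of-the-envelope estimate for a 20mer can be checked numerically. This is a minimal sketch under the same crude model (uniform, independent bases). Treating the start positions as independent trials is a further simplifying assumption of ours, and the variable names are illustrative.

```python
import math

K = 20                # length of the oligo (a 20mer)
GENOME_LENGTH = 3e9   # approximate size of the human genome, in bases

# Probability that one specific K-mer starts at one given position.
p_exact = 0.25 ** K

# Expected number of occurrences over all start positions
# (there are GENOME_LENGTH - K + 1 of them, which is essentially GENOME_LENGTH).
expected_hits = GENOME_LENGTH * p_exact

# Probability of at least one occurrence, treating positions as independent:
# 1 - (1 - p)^N, computed with log1p/expm1 to avoid rounding problems.
p_at_least_one = -math.expm1(GENOME_LENGTH * math.log1p(-p_exact))

print(f"p(specific 20mer at one position) = {p_exact:.3e}")
print(f"expected occurrences              = {expected_hits:.5f}")
print(f"P(at least one occurrence)        = {p_at_least_one:.5f}")
```

    The exact value of (1/4)^{20} is a little under 10^{-12}, so the expected count comes out slightly below the rounded 0.003, and the probability of at least one occurrence is nearly identical to the expected count because both are small.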

    Sources of Noise in Microarrays

  • lot-to-lot variation
  • experiment-to-experiment variation: cell state, culture purity, sample preparation, hybridization conditions all affect the experiments
  • spot-to-spot variation: unequal dye incorporation
  • non-linear dye/saturation
  • uneven spot size
  • self-cross hybridization
  • image capture and processing techniques (spot finding, quantization etc.)

    Nevertheless, the ability to get this system-level view of cell activity is very valuable.