RE: sequence fragments

From: Dimitris Achlioptas (optas@microsoft.com)
Date: Sat Apr 24 2004 - 16:52:31 PDT

  • Next message: Kelli McGee \(Kelly Services Inc\): "5/27/2004 Distinguishing Chambers of the Moment Polytope - Tara Holm; University of California, Berkeley"

    I believe that the exact answer is given by Ewens' distribution. See,
    for example, the first page of

    http://stat-www.berkeley.edu/~pitman/345.pdf

    Maya's approximation for a single length frequency should work well if:
    n = Theta(L) and L is very large. For collections of lengths you can
    take the corresponding products, i.e. assume independence, and that
    should be a reasonable estimate if the number of lengths considered is
    small (and the lengths themselves are "typical").

    Cheers
    dimitris

    -----Original Message-----
    From: theory-group-bounces@cs.washington.edu
    [mailto:theory-group-bounces@cs.washington.edu] On Behalf Of
    gupta@ee.washington.edu
    Sent: Friday, April 23, 2004 6:18 PM
    To: William Noble
    Cc: theory-group@cs.washington.edu; Martin Tompa
    Subject: Re: sequence fragments

    I'm an EE, so I know the continuous world a little better. In the
    continuous
    case, let's say you threw down N points uniformly randomly on a
    continuous line
    between 0 and a. Then the distances between each pair of adjacent N
    points
    (and the endpoints) would be your 'sequence lengths' and they would be
    distributed exponentially, so that the probability of short ones is much

    greater than the probabilty of long ones.

    Since this is a discrete case (your sequence lengths are discrete not
    continuous) it changes things slightly. I'd have to go back over the
    math, but
    the Poisson is the binomial distribution in the limit, and the
    exponential
    distribution should be a limiting case of 'how many trials until the
    first
    success', so a geometric random variable (also the only memoryless
    discrete
    r.v.).

    So I think your sequence lengths would just be geometric r.v.'s.

    Someone have this proved already? Otherwise I suppose I could spend
    another few
    minutes verifying my assertion :)

    -Maya Gupta

    -- 
    Maya Gupta
    gupta@ee.washington.edu
    (206)616-2303
    Assistant Professor
    Dept. of Electrical Engineering
    University of Washington
    Quoting William Noble <noble@gs.washington.edu>:
    > 
    > Hello, theory group.
    > 
    > I posed a question to Martin Tompa this morning that is relevant to my
    > analysis of the human genome, and he suggested that I try running it
    > by y'all.  So here goes ...
    > 
    >   You are given a sequence of letters of length L.  You select
    >   uniformly at random n distinct positions within the sequence, and
    >   break the sequence into n+1 fragments.  You then count the number X
    >   of occurrences of all fragments of a specified length m.  I'd like
    >   to know the distribution of that resulting count X.  In particular,
    >   I need to compute the p-value of observing at least p length-m
    >   fragments, so I need a cumulative density function.
    > 
    >   Actually, to make it a bit more complex, the specified value m will
    >   be a collection of values (e.g., not just fragments of length 13,
    >   but fragments of length 11, 12, 13, 17 or 22).  But I assume that if
    >   I can solve the problem above, it will be straightforward to
    >   generalize to this case.
    > 
    > I also assume that somewhere, in some form, someone has solved this or
    > a closely related problem before.  If anyone knows the answer or can
    > point me toward a solution, I'd be grateful.
    > 
    > Thanks.
    > Bill Noble
    > 
    > -----
    > William Stafford Noble
    > Assistant Professor
    > Department of Genome Sciences
    > University of Washington
    > Health Sciences Center, Box 357730
    > 1705 NE Pacific Street
    > Seattle, WA 98195
    > Tel: (206) 543-8930
    > Fax: (206) 685-7301
    > Office: J-205
    > http://www.gs.washington.edu/~noble
    > _______________________________________________
    > Theory-group mailing list
    > Theory-group@cs.washington.edu
    > http://mailman.cs.washington.edu/mailman/listinfo/theory-group
    > 
    _______________________________________________
    Theory-group mailing list
    Theory-group@cs.washington.edu
    http://mailman.cs.washington.edu/mailman/listinfo/theory-group
    _______________________________________________
    Theory-group mailing list
    Theory-group@cs.washington.edu
    http://mailman.cs.washington.edu/mailman/listinfo/theory-group
    

  • Next message: Kelli McGee \(Kelly Services Inc\): "5/27/2004 Distinguishing Chambers of the Moment Polytope - Tara Holm; University of California, Berkeley"

    This archive was generated by hypermail 2.1.6 : Sat Apr 24 2004 - 16:53:07 PDT