From: Dimitris Achlioptas (optas@microsoft.com)
Date: Sat Apr 24 2004 - 16:52:31 PDT
I believe that the exact answer is given by Ewens' distribution. See,
for example, the first page of
http://stat-www.berkeley.edu/~pitman/345.pdf
Maya's approximation for a single length frequency should work well if:
n = Theta(L) and L is very large. For collections of lengths you can
take the corresponding products, i.e. assume independence, and that
should be a reasonable estimate if the number of lengths considered is
small (and the lengths themselves are "typical").
Cheers
dimitris
-----Original Message-----
From: theory-group-bounces@cs.washington.edu
[mailto:theory-group-bounces@cs.washington.edu] On Behalf Of
gupta@ee.washington.edu
Sent: Friday, April 23, 2004 6:18 PM
To: William Noble
Cc: theory-group@cs.washington.edu; Martin Tompa
Subject: Re: sequence fragments
I'm an EE, so I know the continuous world a little better. In the
continuous
case, let's say you threw down N points uniformly randomly on a
continuous line
between 0 and a. Then the distances between each pair of adjacent N
points
(and the endpoints) would be your 'sequence lengths' and they would be
distributed exponentially, so that the probability of short ones is much
greater than the probabilty of long ones.
Since this is a discrete case (your sequence lengths are discrete not
continuous) it changes things slightly. I'd have to go back over the
math, but
the Poisson is the binomial distribution in the limit, and the
exponential
distribution should be a limiting case of 'how many trials until the
first
success', so a geometric random variable (also the only memoryless
discrete
r.v.).
So I think your sequence lengths would just be geometric r.v.'s.
Someone have this proved already? Otherwise I suppose I could spend
another few
minutes verifying my assertion :)
-Maya Gupta
-- Maya Gupta gupta@ee.washington.edu (206)616-2303 Assistant Professor Dept. of Electrical Engineering University of Washington Quoting William Noble <noble@gs.washington.edu>: > > Hello, theory group. > > I posed a question to Martin Tompa this morning that is relevant to my > analysis of the human genome, and he suggested that I try running it > by y'all. So here goes ... > > You are given a sequence of letters of length L. You select > uniformly at random n distinct positions within the sequence, and > break the sequence into n+1 fragments. You then count the number X > of occurrences of all fragments of a specified length m. I'd like > to know the distribution of that resulting count X. In particular, > I need to compute the p-value of observing at least p length-m > fragments, so I need a cumulative density function. > > Actually, to make it a bit more complex, the specified value m will > be a collection of values (e.g., not just fragments of length 13, > but fragments of length 11, 12, 13, 17 or 22). But I assume that if > I can solve the problem above, it will be straightforward to > generalize to this case. > > I also assume that somewhere, in some form, someone has solved this or > a closely related problem before. If anyone knows the answer or can > point me toward a solution, I'd be grateful. > > Thanks. > Bill Noble > > ----- > William Stafford Noble > Assistant Professor > Department of Genome Sciences > University of Washington > Health Sciences Center, Box 357730 > 1705 NE Pacific Street > Seattle, WA 98195 > Tel: (206) 543-8930 > Fax: (206) 685-7301 > Office: J-205 > http://www.gs.washington.edu/~noble > _______________________________________________ > Theory-group mailing list > Theory-group@cs.washington.edu > http://mailman.cs.washington.edu/mailman/listinfo/theory-group > _______________________________________________ Theory-group mailing list Theory-group@cs.washington.edu http://mailman.cs.washington.edu/mailman/listinfo/theory-group _______________________________________________ Theory-group mailing list Theory-group@cs.washington.edu http://mailman.cs.washington.edu/mailman/listinfo/theory-group
This archive was generated by hypermail 2.1.6 : Sat Apr 24 2004 - 16:53:07 PDT