TIME: 1:30-2:20 pm, January 23, 2007 PLACE: EEB 037 SPEAKER: Amit Chakrabarti Dartmouth College TITLE: Estimating Entropy (and its Friends) on Data Streams ABSTRACT: Statistical analysis of large streams of data is an increasingly important problem in the modern heavily networked world. Over the last couple of years, a number of researchers have worked on estimating information-theoretic measures on data streams, motivated by applications in computer networks and databases. One particularly important information-theoretic measure is the empirical entropy of a data stream. A number of recent research efforts had yielded sub-linear space algorithms to approximate this quantity. In this talk, we describe a new algorithm to estimate the empirical entropy of a stream of $m$ items, read in a single pass, within a $1\pm\epsilon$ approximation factor, using $O(\eps^{-2} \log m)$ words of storage space. This improves upon all earlier algorithms for the problem. It is also considerably simpler than the only previously known algorithm with $polylog(m)$ space complexity. We also prove the first nontrivial lower bound, namely $\Omega(\eps^{-2} / \log(\eps^{-1}))$, for this problem, proving that our algorithm is near-optimal in terms of its dependency on $\eps$. In addition, we prove upper and lower bounds for approximating related stream statistics such as the entropy norm and $k$th order entropy. One key algorithmic technique in our work is a novel extension to a widely-used method originating with Alon, Matias, and Szegedy (1996). This technique might be of independent interest. This is joint work with Graham Cormode and Andrew McGregor.