TIME: 1:30-2:20 pm, January 23, 2007

PLACE: EEB 037

SPEAKER: Amit Chakrabarti
         Dartmouth College

TITLE: Estimating Entropy (and its Friends) on Data Streams

ABSTRACT:
Statistical analysis of large streams of data is an increasingly
important problem in the modern heavily networked world. Over the last
couple of years, a number of researchers have worked on estimating
information-theoretic measures on data streams, motivated by
applications in computer networks and databases.  One particularly
important information-theoretic measure is the empirical entropy of a
data stream. A number of recent research efforts have yielded
sublinear-space algorithms to approximate this quantity.

In this talk, we describe a new algorithm to estimate the empirical
entropy of a stream of $m$ items, read in a single pass, within a
$1\pm\epsilon$ approximation factor, using $O(\epsilon^{-2} \log m)$ words
of storage space. This improves upon all earlier algorithms for the
problem. It is also considerably simpler than the only previously known
algorithm with $\mathrm{polylog}(m)$ space complexity.  We also prove the
first nontrivial lower bound, namely $\Omega(\epsilon^{-2} / \log(\epsilon^{-1}))$,
for this problem, showing that our algorithm is near-optimal in its
dependence on $\epsilon$.  In addition, we prove upper and lower bounds
for approximating related stream statistics such as the entropy norm and
$k$th order entropy.
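
For concreteness, the quantity being estimated is the standard empirical
entropy: for a stream in which item $i$ occurs $m_i$ times (so
$\sum_i m_i = m$), it is
$$H = \sum_i \frac{m_i}{m} \log \frac{m}{m_i}.$$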

One key algorithmic technique in our work is a novel extension of a
widely used method originating with Alon, Matias, and Szegedy (1996).
This technique might be of independent interest.
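
To convey the flavor of that method, here is a minimal, illustrative
one-pass sketch in Python of the basic AMS-style estimator for entropy.
This is not the refined algorithm from the talk (which adds further
ideas, such as averaging many copies, to reach the stated space bound);
the function name and the toy usage are purely for illustration.

    import math
    import random

    def ams_entropy_estimate(stream):
        """Basic one-pass estimator in the Alon-Matias-Szegedy style.

        Reservoir-sample a uniformly random position j of the stream and
        count the occurrences r of the sampled item from position j
        onward. Then m * (g(r) - g(r-1)), with g(x) = (x/m) log2(m/x),
        is an unbiased estimate of H = sum_i (m_i/m) log2(m/m_i).
        """
        m = 0           # stream length seen so far
        sampled = None  # item at the currently sampled position
        r = 0           # occurrences of `sampled` since it was sampled
        for item in stream:
            m += 1
            if random.randrange(m) == 0:  # reservoir sampling: prob 1/m
                sampled, r = item, 0
            if item == sampled:
                r += 1
        g = lambda x: 0.0 if x == 0 else (x / m) * math.log2(m / x)
        return m * (g(r) - g(r - 1))

    # Usage: average many independent copies to reduce the variance.
    stream = list("abracadabra")
    estimates = [ams_entropy_estimate(stream) for _ in range(2000)]
    print(sum(estimates) / len(estimates))  # approaches H of the stream

The unbiasedness follows from a telescoping sum: averaging over the
random position gives $E[X] = \sum_i \sum_{k=1}^{m_i} (g(k) - g(k-1))
= \sum_i g(m_i) = H$.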

This is joint work with Graham Cormode and Andrew McGregor.