TIME: 3:30-4:30 pm, Friday, May 23, 2008

PLACE: CSE 503  

SPEAKER: David Kempe
         USC

TITLE: Subset Selection for Sampling


ABSTRACT:

One of the central problems in many data-driven sciences is selecting
the right attributes to observe in order to accurately predict a
variable of interest. Applications include:

1. Choosing medical attributes to measure in order to predict the risk
of medical conditions such as heart disease.

2. Choosing socio-economic factors to add to surveys to predict
behaviors or achievements.

3. Choosing which sensors to read in order to estimate an overall
quantity such as the average or maximum temperature.

These examples share several key features:
1. In all cases, a quantity of interest (risk of disease, future
success, global average) cannot be measured directly, but is assumed
or known to correlate with many measurable variables.
2. Restrictions on time, cost, or energy, as well as the risk of
overfitting, make it undesirable or impossible to sample all
variables.

Thus, a key task is the selection of the "best" set of variables to
measure, yielding as accurate a prediction as possible. In this talk,
we will explore the question from two viewpoints:
1. A stochastic view, in which all variables are assumed to be random
and their correlations known. The goal is then to find the subset
yielding the smallest expected mean squared error or the best R^2
fit. We give exact and approximation algorithms for several
interesting special cases of this problem.
2. An adversarial view, in which all variables are assumed to be
controlled by an adversary. The adversary is limited in that the
variables reside in a metric space, and their values cannot differ by
more than their distance. For this problem, we prove equivalence with
standard clustering objectives such as k-median and k-center.
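To make the stochastic view concrete, here is a minimal sketch of greedy forward selection: repeatedly add the variable that most improves the R^2 of a least-squares fit on the variables chosen so far. This classic heuristic is an illustrative assumption for the abstract's setting; the talk's actual exact and approximation algorithms are not spelled out here.

```python
# Greedy forward selection for subset selection in regression
# (illustrative sketch, not the talk's specific algorithms).
import numpy as np

def r_squared(X, y, subset):
    """R^2 of the least-squares fit of y on the chosen columns of X."""
    if not subset:
        return 0.0
    # Fit with an intercept plus the selected variables.
    A = np.column_stack([np.ones(len(y)), X[:, subset]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def greedy_select(X, y, k):
    """Greedily pick k column indices of X to predict y."""
    chosen = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        # Add the single variable giving the largest R^2 improvement.
        best = max(remaining, key=lambda j: r_squared(X, y, chosen + [j]))
        chosen.append(best)
    return chosen
```

On synthetic data where y depends on only a few columns, the greedy procedure recovers exactly those columns; in general it gives no worst-case guarantee without further structural assumptions, which is part of what motivates the algorithmic questions above.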
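For the adversarial view, the k-center objective it connects to has a well-known greedy (farthest-point) 2-approximation, sketched below. This is standard background on the clustering side, assuming an arbitrary metric given as a distance function; the reduction from the talk itself is not reproduced.

```python
# Greedy farthest-point heuristic for k-center: a classic
# 2-approximation on any metric (background sketch, not the
# talk's equivalence proof).

def k_center_greedy(points, k, dist):
    """Pick k centers: start from the first point, then repeatedly
    add the point farthest from the current centers."""
    centers = [points[0]]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(far)
    return centers

def radius(points, centers, dist):
    """k-center objective: max distance from any point to its
    nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)
```

For example, on the line with points {0, 1, 2, 10, 11, 12} and k = 2, the greedy picks 0 and 12, achieving radius 2, at most twice the optimal radius of 1 (centers 1 and 11).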

We conclude with a large array of exciting open questions.

[This talk is based on joint work with Abhimanyu Das, appearing in
STOC 2008 and IPSN 2008.]