TIME: 3:30-4:30 pm, Friday, May 23, 2008
PLACE: CSE 503
SPEAKER: David Kempe, USC
TITLE: Subset Selection for Sampling

ABSTRACT:

A central problem in many data-driven sciences is selecting the right set of attributes to observe in order to accurately predict a variable of interest. Applications include:

1. Choosing medical attributes to measure in order to predict the risk of conditions such as heart disease.
2. Choosing socio-economic factors to add to surveys in order to predict behaviors or achievements.
3. Choosing sensors to read in order to estimate an aggregate quantity such as the average or maximum temperature.

These examples share several key features:

1. In all cases, the quantity of interest (risk of disease, future success, global average) cannot be measured directly, but is assumed or known to correlate with many measurable variables.
2. Restrictions on time, cost, or energy, as well as the risk of overfitting, make it undesirable or impossible to sample all variables.

Thus, a key task is the selection of the "best" set of variables to measure, yielding as accurate a prediction as possible. In this talk, we will explore the question from two viewpoints:

1. A stochastic view, in which all variables are assumed to be random and their correlations known. The goal is then to find the subset yielding the smallest expected mean square error or the best R^2 fit. We give exact and approximation algorithms for several interesting special cases of this problem.
2. An adversarial view, in which all variables are assumed to be controlled by an adversary. The adversary is limited in that the variables reside in a metric space, and their values cannot differ by more than their distance. For this problem, we prove equivalence with standard clustering objectives such as k-median and k-center.

We conclude with a large array of exciting open questions.

[This talk is based on joint work with Abhimanyu Das, appearing in STOC 2008 and IPSN 2008.]
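
To make the stochastic viewpoint concrete: if the covariances of the candidate variables with each other and with the target are known, the R^2 of a subset S is b_S^T C_S^{-1} b_S / Var(Z), where C_S is the covariance matrix of the chosen variables and b_S their covariances with the target Z. A natural heuristic is greedy forward selection, adding at each step the variable that most increases R^2. The sketch below is an illustration of this generic greedy approach under those assumptions, not the specific algorithms from the talk; greedy is not optimal in general, which is why approximation guarantees are of interest.

```python
import numpy as np

def greedy_subset_selection(C, b, var_z, k):
    """Greedy forward selection of k predictors maximizing R^2.

    C: (n x n) covariance matrix among the candidate variables.
    b: length-n vector of covariances between each variable and the target Z.
    var_z: variance of the target Z.
    Returns the chosen indices (in selection order) and the final R^2.
    Illustrative sketch only; not the paper's algorithm.
    """
    n = len(b)
    chosen = []
    best_r2 = 0.0
    for _ in range(k):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in chosen:
                continue
            S = chosen + [j]
            C_S = C[np.ix_(S, S)]
            b_S = b[S]
            # R^2 of the best linear predictor of Z from the variables in S
            r2 = b_S @ np.linalg.solve(C_S, b_S) / var_z
            if r2 > best_gain:
                best_gain, best_j = r2, j
        chosen.append(best_j)
        best_r2 = best_gain
    return chosen, best_r2

# Toy example: three uncorrelated unit-variance predictors, so each
# variable's marginal R^2 is just its squared covariance with Z.
chosen, r2 = greedy_subset_selection(
    np.eye(3), np.array([0.1, 0.5, 0.7]), 1.0, k=2)
# Picks index 2 (R^2 = 0.49), then index 1 (total R^2 = 0.74).
```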