Efficient Probabilistic SQL Queries over Uncertain Data

by
Brian Harris

Abstract:

Relational databases and the SQL query language provide a robust and practical means for storing and exploring many types of data. However, an increasingly prevalent type of data which is not supported by this model is data containing uncertainty; this sort of data shows up in a variety of applications including record matching, data cleaning, and RFID and sensor data. We developed a system called MystiQ which defines new semantics for SQL in the presence of such datasets using probability to quantify the degree of uncertainty. To allow for efficient probability computation at a large scale, we concentrate the system's resources at providing correct ranking for the k most probable answers to SQL queries, and attempt to minimize the amount of computation needed to rule out the non-top-k answers. Additionally, we can sometimes push some, or even all, of the probability computations into the query engine to avoid expensive simulations.

Advised by Dan Suciu

CSE 403
Wednesday
April 11, 2007
4:30 - 5:20 pm