SQL Query Evaluation in the Probabilistic Relational Model.
Query evaluation techniques for "simple" probabilistic databases have been developed in [1]. Probabilistic Relational Models (PRMs) [2] are a more advanced specification formalism for probabilistic databases: they generalize Bayesian Networks by allowing the attributes of a record to depend not only on other attributes of the same record but also on attributes of related records, and they are concise representations of complex probabilistic spaces. In this project you will study SQL processing techniques on PRMs, and demonstrate them in an implementation. You will restrict the investigation to simple select-distinct-from-where queries.
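As a warm-up for issue (a) below, recall the extensional ("safe plan") style of evaluation from [1]: over a tuple-independent database, a distinct answer is present iff at least one of its independent derivations is, so its probability is 1 - prod_i(1 - p_i). A minimal Python sketch of this rule over a toy hand-coded relation (all names and data hypothetical):

    from math import prod

    # Toy tuple-independent relation R(a, b): each tuple carries an
    # independent marginal probability (data hypothetical).
    R = [(("x", 1), 0.4), (("x", 2), 0.7), (("y", 1), 0.5)]

    def prob_distinct_answer(a_value):
        """P that a_value appears in SELECT DISTINCT a FROM R:
        with independent tuples, 1 - prod(1 - p_i) over its support."""
        probs = [p for ((a, _b), p) in R if a == a_value]
        return 1 - prod(1 - p for p in probs)

    print(prob_distinct_answer("x"))   # 1 - 0.6*0.3 = 0.82

The question in (a) is whether a PRM instance can be compiled down to such a simple representation, so that this style of evaluation still applies.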
Issues to address:
a) Can the SQL evaluation problem on PRMs be reduced to evaluating the SQL query over a simple tuple-independent or tuple disjoint-and-independent probabilistic database (as in [1]), or is it significantly harder?
b) Extend the 'safe plans' from [1] to work over PRMs. One should expect this to be possible only for some of the queries (the 'safe' queries), and to fail for the #P-hard ones.
c) Search the literature (e.g., starting from [2]) for papers describing experiments on inference in PRMs. Compare them with safe query processing over probabilistic data and discuss scalability.
d) (Optional) Propose optimization methods for query processing on PRMs.
Difficulty: (a)-(c) seem relatively simple; (d) is open-ended and may be difficult.
Publishable: (d) is probably publishable.
Utility: medium
Conditional Query Evaluation P(Q | V) on Probabilistic Databases.
The techniques in [1] only discuss the evaluation of P(Q). Often, however, certain events V are known to hold, and users are interested in evaluating the conditional probability P(Q | V) rather than P(Q). For example, [3] considers the event V = "there is a person at the door" and the query Q = "Bob returns from work": with the techniques in [1] we know how to compute P(Q) and, separately, P(V), but users may want to know P(Q | V) instead.
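By definition P(Q | V) = P(Q and V) / P(V), so any method that can evaluate conjunctions of events can in principle be used. A minimal rejection-sampling sketch (this is the naive possible-worlds estimator, not the algorithm of [4]; all names and data hypothetical):

    import random

    # Tuple-independent database: tuple id -> marginal probability.
    TUPLES = {"t1": 0.4, "t2": 0.7, "t3": 0.5}

    def sample_world():
        """Draw one possible world by independent coin flips."""
        return {t for t, p in TUPLES.items() if random.random() < p}

    def estimate_conditional(holds_q, holds_v, n=100_000):
        """Estimate P(Q | V) = P(Q and V) / P(V): keep only sampled
        worlds where V holds and count how often Q also holds."""
        hits = kept = 0
        for _ in range(n):
            w = sample_world()
            if holds_v(w):
                kept += 1
                hits += holds_q(w)
        return hits / kept if kept else float("nan")

    # Example: V = "t1 is present", Q = "t1 and t2 are both present".
    print(estimate_conditional(lambda w: {"t1", "t2"} <= w,
                               lambda w: "t1" in w))

Rejection sampling degrades badly when P(V) is small, which is one reason issue (c) below asks for a proper extension of the Monte Carlo method of [4].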
Issues to address:
a) Define a query paradigm in which the user can specify conditions V. You may want to consider the case where V is a SQL query for which the user "confirms" the answers, or a constraint that is known to hold, or something else. Define an application scenario and give examples of conditional queries.
b) Are there "safe plans" for conditional evaluation of P(Q | V)? If so, provide an algorithm for finding the safe plans, as in [1].
c) If not, then check out the Monte Carlo simulation method for unsafe queries in [4], and extend that algorithm to evaluate conditional probabilities.
Difficulty: medium (depends a lot on what query paradigm is selected for (a))
Publishable: yes, provided (b) and (c) are done thoroughly.
Utility: low/medium (may become high in the future)
Explaining Answers in a Probabilistic Database.
Reference [5] describes "lineage", which is a full explanation of why an answer exists; however, unsophisticated users do not want the full explanation (which is often too bulky) but an insightful piece of information about why a tuple is in the answer. Specifically, a query's result is a ranked list of answers t1, t2, t3, ..., in decreasing order of their probabilities. When users ask "why is ti there?" they may want to know the few strongest pieces of evidence supporting ti, or they may want to know why ti ranked so low (i.e., why the tj's with j < i have a higher probability), or why it ranked so high. This project grew out of Mike Cafarella's "querying the Web" project, and will require lots of exploration (and meetings with Mike). Interest/background in AI is a plus.
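One simple, concrete notion of "strongest evidence" that such a taxonomy might include: rank each lineage atom by how much the answer's probability drops when that atom is removed. A minimal sketch for an answer supported by independent evidence atoms (names and data hypothetical):

    from math import prod

    # Hypothetical lineage of one answer tuple: independent pieces
    # of evidence, each with a marginal probability.
    LINEAGE = {"e1": 0.9, "e2": 0.3, "e3": 0.6}

    def answer_prob(evidence):
        return 1 - prod(1 - p for p in evidence.values())

    def top_evidence(evidence, k=2):
        """Rank evidence atoms by the drop in answer probability
        when the atom is removed; keep the k most influential."""
        base = answer_prob(evidence)
        drop = {e: base - answer_prob({f: p for f, p in evidence.items()
                                       if f != e})
                for e in evidence}
        return sorted(drop.items(), key=lambda kv: -kv[1])[:k]

    print(top_evidence(LINEAGE))   # e1 dominates, then e3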
Issues to address (choose a subset):
a) Define a taxonomy of explanations for query answers.
b) Find "safe plans" (a la [1]) for finding the explanations for a specific query answer.
c) Design a good user interface for explaining the query's answers.
Additional references: Jennifer Widom's work on data lineage (talk to Dan).
Difficulty: open-ended and difficult
Publishable: yes
Utility: high
Define a data model where correlations between tuples can be explicitly stated. For example, R(a,b) is correlated 0.8 with S(a,d). For motivation see [3].
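One possible reading of "correlated 0.8" (an assumption, not the only choice) is the Pearson correlation between the two tuples' Bernoulli indicator variables; together with the marginals this pins down the full joint distribution, though not every coefficient is feasible for given marginals. A minimal sketch:

    from math import sqrt

    def joint_from_correlation(p_r, p_s, rho):
        """Candidate semantics: two Bernoulli tuple events with the
        given marginals and Pearson correlation rho; recover the
        joint distribution over the four presence/absence cases."""
        both = p_r * p_s + rho * sqrt(p_r * (1 - p_r) * p_s * (1 - p_s))
        joint = {(True, True): both,
                 (True, False): p_r - both,
                 (False, True): p_s - both,
                 (False, False): 1 - p_r - p_s + both}
        if any(not 0 <= v <= 1 for v in joint.values()):
            raise ValueError("rho infeasible for these marginals")
        return joint

    # "R(a,b) is correlated 0.8 with S(a,d)", marginals 0.6 and 0.5:
    print(joint_from_correlation(0.6, 0.5, 0.8))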
Questions to address:
a) Define a data model that allows users to say "tuple so-and-so from R is correlated 0.8 with tuple such-and-such from S". Define a semantics for this model.
b) Find an application domain and illustrate with several examples. Are two-way correlations enough, or do we need multi-way correlations (e.g., among three tuples)? What exactly would the latter mean?
c) Study query processing.
The FMA (Foundational Model of Anatomy) is a very large ontology, and this project requires you to develop and implement a query paradigm by which to query the FMA on a relational database.
Extend a stream-processing engine with the capability of querying historical data.
This project will be advised by Magda Balazinska and Dennis Lee
(magda@cs, lee@cs). Please contact them directly, and let us know.
Monitoring applications enable users to continuously observe the
current state of a system, and receive alerts when interesting
combinations of events occur. For example, a system administrator may
monitor a network of machines and receive an alert every time a machine
experiences a failure (runs out of disk space, becomes overloaded,
etc.). In many scenarios, however, historical information is
needed to explain what caused the alert. For example, to understand why
a machine failed, the administrator may need to know if the same error
occurred in the past, and what was happening in the system then. The
problem is that existing data management systems enable administrators either to observe new data in real time or to query historical data offline, but not to do both at the same time. The goal of this
project is to combine a stream processing engine (that enables
real-time continuous queries) with a traditional RDBMS. The combined
system will enable queries over both current and historical data. In
our example, this means that the administrator will be able to receive
complementary historical information at the same time as the original
alert.
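The intended behavior, in miniature: when an alert fires on the stream side, the system immediately runs a lookup on the archived logs and attaches the result to the alert. The sketch below only illustrates that coupling, with SQLite standing in for the RDBMS and a plain Python callback standing in for the Borealis alert path (schema and data hypothetical):

    import sqlite3

    # Toy log archive standing in for the RDBMS side.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE log (ts INTEGER, machine TEXT, error TEXT)")
    db.executemany("INSERT INTO log VALUES (?, ?, ?)",
                   [(1, "m1", "disk_full"), (5, "m2", "overload"),
                    (9, "m1", "disk_full")])

    def on_alert(machine, error):
        """Stand-in for the stream engine's alert callback: fetch
        matching historical context and deliver it with the alert."""
        history = db.execute(
            "SELECT ts FROM log WHERE machine = ? AND error = ? ORDER BY ts",
            (machine, error)).fetchall()
        print(f"ALERT {machine}/{error}; past occurrences at ts={history}")

    # A new 'disk_full' event arrives on the stream for m1:
    on_alert("m1", "disk_full")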
Project Description: The goal of this project is to design, build, and evaluate
the system described above. To do so, we propose that you follow the
steps below:
First, build a simple prototype that uses the Borealis stream
processing engine (which we will provide) to run real-time queries and
slow historical queries over logs of data produced by machines in CSE
(we already have samples of these logs).
In order to handle a large log and high event rates, try extending Borealis in one or more of the following ways:
a) Design and implement an algorithm to automatically index logs based on the historical queries that the user registers in the system. Such indexes will speed up historical queries.
b) Design and implement an algorithm for partitioning historical
queries into smaller chunks, and prioritizing the different query
pieces in order to maximize the timeliness and relevance of the
information returned to the user.
c) Try to extend the system to perform speculative execution. If an
alert is about to occur, the system may start to process historical
information before the alert is even produced. By doing so, the system
may be able to provide historical information almost at the same time
as the alert.
d) Because there may sometimes be a lot of historical information relevant to an alert, implement an operator that ranks and filters the historical information returned to the user (a minimal ranking sketch follows this list).
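A minimal sketch of the rank-and-filter operator from (d), under the assumption that alerts and historical events are flat dicts of fields (e.g. {"machine": "m1", "error": "disk_full", "ts": ...}); the scoring rule here, field overlap with exponential age decay, is only one hypothetical choice:

    import time

    def rank_history(alert, events, k=5, half_life=3600.0):
        """Score each historical event by how many fields it shares
        with the alert, discounted exponentially by its age, and
        keep only the top k."""
        now = time.time()
        def score(ev):
            shared = sum(1 for f, v in alert.items()
                         if f != "ts" and ev.get(f) == v)
            age = max(0.0, now - ev.get("ts", now))
            return shared * 0.5 ** (age / half_life)
        return sorted(events, key=score, reverse=True)[:k]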
Deliverables:
A running system that processes both real-time data and historical data, albeit not necessarily fast.
At least one extension to handle large logs and high event rates.
Experimental results showing the performance of the system without and with your enhancements.
Publishable: yes, but will require additional work after the quarter is over.
Utility: yes. Such a tool would be useful for system administrators.
Team: 2 people