SQL Query Evaluation in the Probabilistic Relational Model.
Query evaluation techniques for "simple" probabilistic databases have been developed in [1]. Probabilistic Relational Models (PRMs) [2] are a more advanced specification formalism for probabilistic databases: they generalize Bayesian Networks by allowing the attributes of a record to depend not only on other attributes of the same record but also on attributes of related records, and they are concise representations of complex probabilistic spaces. In this project you will study SQL processing techniques on PRMs, and demonstrate them in an implementation. You will restrict the investigation to simple select-distinct-from-where queries.
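As a warm-up for issue (a) below, recall the extensional ("safe plan") style of evaluation from [1]: over a tuple-independent database, a distinct answer is present iff at least one of its independent derivations is, so its probability is 1 - prod_i(1 - p_i). A minimal Python sketch of this rule over a toy hand-coded relation (all names and data hypothetical):

    from math import prod

    # Toy tuple-independent relation R(a, b): each tuple carries an
    # independent marginal probability (data hypothetical).
    R = [(("x", 1), 0.4), (("x", 2), 0.7), (("y", 1), 0.5)]

    def prob_distinct_answer(a_value):
        """P that a_value appears in SELECT DISTINCT a FROM R:
        with independent tuples, 1 - prod(1 - p_i) over its support."""
        probs = [p for ((a, _b), p) in R if a == a_value]
        return 1 - prod(1 - p for p in probs)

    print(prob_distinct_answer("x"))   # 1 - 0.6*0.3 = 0.82

The question in (a) is whether a PRM instance can be compiled down to such a simple representation, so that this style of evaluation still applies.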
Issues to address:
a) Can the SQL evaluation problem on PRMs be reduced to evaluating the SQL query over a simple tuple-independent or tuple disjoint-and-independent probabilistic database (as in [1]), or is it significantly harder?
b) Extend the 'safe plans' from [1] to work over PRMs. One should expect this to be possible only for some of the queries (the 'safe' queries), and to fail for the #P-hard ones.
c) Search the literature (e.g., starting from [2]) for papers describing experiments on inference in PRMs. Compare them with safe query processing over probabilistic data and discuss scalability.
d) (Optional) Propose optimization methods for query processing on PRMs.
Difficulty: (a)-(c) seem relatively simple; (d) is open-ended and may be difficult.
Publishable: (d) is probably publishable.
Utility: medium
Conditional Query Evaluation P(Q | V) on Probabilistic Databases.
The techniques in [1] only discuss the evaluation of P(Q). Often, however, certain events V are known to hold, and users are interested in evaluating the conditional probability P(Q | V) rather than P(Q). For example, [3] considers the event V = "there is a person at the door" and the query Q = "Bob returns from work": with the techniques in [1] we know how to compute P(Q) and, separately, P(V), but users may want to know P(Q | V) instead.
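By definition P(Q | V) = P(Q and V) / P(V), so any method that can evaluate conjunctions of events can in principle be used. A minimal rejection-sampling sketch (this is the naive possible-worlds estimator, not the algorithm of [4]; all names and data hypothetical):

    import random

    # Tuple-independent database: tuple id -> marginal probability.
    TUPLES = {"t1": 0.4, "t2": 0.7, "t3": 0.5}

    def sample_world():
        """Draw one possible world by independent coin flips."""
        return {t for t, p in TUPLES.items() if random.random() < p}

    def estimate_conditional(holds_q, holds_v, n=100_000):
        """Estimate P(Q | V) = P(Q and V) / P(V): keep only sampled
        worlds where V holds and count how often Q also holds."""
        hits = kept = 0
        for _ in range(n):
            w = sample_world()
            if holds_v(w):
                kept += 1
                hits += holds_q(w)
        return hits / kept if kept else float("nan")

    # Example: V = "t1 is present", Q = "t1 and t2 are both present".
    print(estimate_conditional(lambda w: {"t1", "t2"} <= w,
                               lambda w: "t1" in w))

Rejection sampling degrades badly when P(V) is small, which is one reason issue (c) below asks for a proper extension of the Monte Carlo method of [4].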
Issues to address:
a) Define a query paradigm in which the user can specify conditions V. You may want to consider the case where V is a SQL query for which the user "confirms" the answers, or a constraint that is known to hold, or something else. Define an application scenario and give examples of conditional queries.
b) Are there "safe plans" for conditional evaluation of P(Q | V)? If so, provide an algorithm for finding the safe plans, as in [1].
c) If not, then check out the Monte Carlo simulation method for unsafe queries in [4], and extend that algorithm to evaluate conditional probabilities.
Difficulty: medium (depends a lot on what query paradigm is selected for (a))
Publishable: yes, provided (b) and (c) are done thoroughly.
Utility: low/medium (may become high in the future)
Explaining Answers in a Probabilistic Database.
Reference [5] describes "lineage", which is a full explanation of why an answer exists; however, unsophisticated users do not want the full explanation (which is often too bulky) but an insightful piece of information about why a tuple is in the answer. Specifically, a query's result is a ranked list of answers t1, t2, t3, ..., in decreasing order of their probabilities. When users ask "why is ti there?" they may want to know the few strongest pieces of evidence supporting ti, or they may want to know why ti ranked so low (i.e., why the tj's with j < i have a higher probability), or why it ranked so high. This project grew out of Mike Cafarella's "querying the Web" project, and will require lots of exploration (and meetings with Mike). Interest/background in AI is a plus.
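One simple, concrete notion of "strongest evidence" that such a taxonomy might include: rank each lineage atom by how much the answer's probability drops when that atom is removed. A minimal sketch for an answer supported by independent evidence atoms (names and data hypothetical):

    from math import prod

    # Hypothetical lineage of one answer tuple: independent pieces
    # of evidence, each with a marginal probability.
    LINEAGE = {"e1": 0.9, "e2": 0.3, "e3": 0.6}

    def answer_prob(evidence):
        return 1 - prod(1 - p for p in evidence.values())

    def top_evidence(evidence, k=2):
        """Rank evidence atoms by the drop in answer probability
        when the atom is removed; keep the k most influential."""
        base = answer_prob(evidence)
        drop = {e: base - answer_prob({f: p for f, p in evidence.items()
                                       if f != e})
                for e in evidence}
        return sorted(drop.items(), key=lambda kv: -kv[1])[:k]

    print(top_evidence(LINEAGE))   # e1 dominates, then e3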
Issues to address (choose a subset):
a) Define a taxonomy of explanations for query answers.
b) Find "safe plans" (a la [1]) for finding the explanations for a specific query answer.
c) Design a good user interface for explaining the query's answers.
Additional references: Jennifer Widom's work on data lineage (talk to Dan).
Difficulty: open-ended and difficult
Publishable: yes
Utility: high
Define a data model where correlations between tuples can be explicitly stated. For example, R(a,b) is correlated 0.8 with S(a,d). For motivation see [3].
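One possible reading of "correlated 0.8" (an assumption, not the only choice) is the Pearson correlation between the two tuples' Bernoulli indicator variables; together with the marginals this pins down the full joint distribution, though not every coefficient is feasible for given marginals. A minimal sketch:

    from math import sqrt

    def joint_from_correlation(p_r, p_s, rho):
        """Candidate semantics: two Bernoulli tuple events with the
        given marginals and Pearson correlation rho; recover the
        joint distribution over the four presence/absence cases."""
        both = p_r * p_s + rho * sqrt(p_r * (1 - p_r) * p_s * (1 - p_s))
        joint = {(True, True): both,
                 (True, False): p_r - both,
                 (False, True): p_s - both,
                 (False, False): 1 - p_r - p_s + both}
        if any(not 0 <= v <= 1 for v in joint.values()):
            raise ValueError("rho infeasible for these marginals")
        return joint

    # "R(a,b) is correlated 0.8 with S(a,d)", marginals 0.6 and 0.5:
    print(joint_from_correlation(0.6, 0.5, 0.8))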
Questions to address:
a) Define a data model that allows users to say "tuple so-and-so from R is correlated 0.8 with tuple such-and-such from S". Define a semantics for this model.
b) Find an application domain and illustrate with several examples. Are two-way correlations enough, or do we need multi-way correlations (e.g., among three tuples)? What exactly would the latter mean?
c) Study query processing.
The FMA (Foundational Model of Anatomy) is a very large ontology, and this project requires you to develop and implement a query paradigm by which to query the FMA on a relational database.
Extend a stream-processing engine with the capability of querying historical data.
This project will be advised by Magda Balazinska and Dennis Lee
(magda@cs, lee@cs). Please contact them directly, and let us know.
Monitoring applications enable users to continuously observe the
current state of a system, and receive alerts when interesting
combinations of events occur. For example, a system administrator may
monitor a network of machines and receive an alert every time a machine
experiences a failure (runs out of disk space, becomes overloaded,
etc.). In many scenarios, however, historical information is
needed to explain what caused the alert. For example, to understand why
a machine failed, the administrator may need to know if the same error
occurred in the past, and what was happening in the system then. The
problem is that existing data management systems enable administrators either to observe new data in real time or to query historical data offline, but not to do both at the same time. The goal of this
project is to combine a stream processing engine (that enables
real-time continuous queries) with a traditional RDBMS. The combined
system will enable queries over both current and historical data. In
our example, this means that the administrator will be able to receive
complementary historical information at the same time as the original
alert.
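The intended behavior, in miniature: when an alert fires on the stream side, the system immediately runs a lookup on the archived logs and attaches the result to the alert. The sketch below only illustrates that coupling, with SQLite standing in for the RDBMS and a plain Python callback standing in for the Borealis alert path (schema and data hypothetical):

    import sqlite3

    # Toy log archive standing in for the RDBMS side.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE log (ts INTEGER, machine TEXT, error TEXT)")
    db.executemany("INSERT INTO log VALUES (?, ?, ?)",
                   [(1, "m1", "disk_full"), (5, "m2", "overload"),
                    (9, "m1", "disk_full")])

    def on_alert(machine, error):
        """Stand-in for the stream engine's alert callback: fetch
        matching historical context and deliver it with the alert."""
        history = db.execute(
            "SELECT ts FROM log WHERE machine = ? AND error = ? ORDER BY ts",
            (machine, error)).fetchall()
        print(f"ALERT {machine}/{error}; past occurrences at ts={history}")

    # A new 'disk_full' event arrives on the stream for m1:
    on_alert("m1", "disk_full")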
Project Description: The goal of this project is to design, build, and evaluate
the system described above. To do so, we propose that you follow the
steps below:
First, build a simple prototype that uses the Borealis stream
processing engine (which we will provide) to run real-time queries and
slow historical queries over logs of data produced by machines in CSE
(we already have samples of these logs).
In order to handle a large log and high event rates, try extending Borealis in one or more of the following ways:
a) Design and implement an algorithm to automatically index logs based on the historical queries that the user registers in the system. Such indexes will speed up historical queries.
b) Design and implement an algorithm for partitioning historical
queries into smaller chunks, and prioritizing the different query
pieces in order to maximize the timeliness and relevance of the
information returned to the user.
c) Try to extend the system to perform speculative execution. If an
alert is about to occur, the system may start to process historical
information before the alert is even produced. By doing so, the system
may be able to provide historical information almost at the same time
as the alert.
d) Because there may sometimes be a lot of historical information relevant to an alert, implement an operator that ranks and filters the historical information returned to the user (a minimal ranking sketch follows this list).
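A minimal sketch of the rank-and-filter operator from (d), under the assumption that alerts and historical events are flat dicts of fields (e.g. {"machine": "m1", "error": "disk_full", "ts": ...}); the scoring rule here, field overlap with exponential age decay, is only one hypothetical choice:

    import time

    def rank_history(alert, events, k=5, half_life=3600.0):
        """Score each historical event by how many fields it shares
        with the alert, discounted exponentially by its age, and
        keep only the top k."""
        now = time.time()
        def score(ev):
            shared = sum(1 for f, v in alert.items()
                         if f != "ts" and ev.get(f) == v)
            age = max(0.0, now - ev.get("ts", now))
            return shared * 0.5 ** (age / half_life)
        return sorted(events, key=score, reverse=True)[:k]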
Deliverables:
A running system that processes both real-time data and historical data, albeit not necessarily fast.
At least one extension to handle large logs and high event rates.
Experimental results showing the performance of the system without and with your enhancements.
Publishable: yes, but will require additional work after the quarter is over.
Utility: yes. Such a tool would be useful for system administrators.
Team: 2 people