CSE599T: Topics in Probabilistic and Statistical Databases
Friday, 9:30-12:20 CSE 503
Instructor: Dan Suciu
Overview:
We will discuss concepts, algorithms, and systems used for
process probabilistic data, and for applying statistical
techniques to data management. Applications include management
of uncertain data, data anonymization, approximate query
processing, and query size estimation. We will discuss the
probabilistic data model, several approaches to query
evaluation, data lineage/provenance, the random graph data
model, sketches from data, and sampling techniques. The course
will consists mostly of lectures, very few reading assignments,
and discussions in class.
Grading:
Will be based on participation to the discussions in class
Mailing List:
Please subscribe cse599t mailing list.
Lectures
1. Introduction
- Lecture notes
- Recommended readings
- Dalvi, Suciu, "Management of Probabilistic Data: Foundations and
Challenges", PODS'2007
- Gupta, Sarawagi, "Creating Probabilistic Databases from
Information Extraction Models", VLDB'2006
- Rastogi, Suciu, Hong, "The Boundary Between Privacy and Utility in
Data Publishing", VLDB'2007
- Markl, Megiddo, Kutsch, Tran, Haas, Srivastava, "Consistently
Estimating the Selectivity of Conjuncts of Predicates", VLDB'2005.
- Jaynes, "Probability Theory", Cambridge Press, 2003.
2. Representation of Probabilistic Databases
- Lecture notes
- Recommended Readings
- Background on Incomplete Databases
- Incomplete Information in Relational Databases
Tomasz Imielinski and Witold Lipski. Jr.
Comments: this is the base paper that started everything, but it is
very techincal and not easy to read. I recommend that you read sect 1:
beyond that you are on your own...
- Models for Incomplete and Probabilistic Information
Todd J. Green, Val Tannen (in Data Engineering Bulletin, 2006)
A succinct and up to date summary of Imielinski and Lipsik's paper.
I recommend you read it in lieu of Imielisnki.
- Background on Or Sets
- Semantic representations and query languages for Or-sets
Leonid Libkin and Limsoon Wong, JCSS 1996
Read sections 2, 3
- Background on graphical models
- Probabilistic Networks and Expert Systems
R. Cowell and P. Dawid and S. Lauritzen and D. Spiegelhalter
Chapters 5.1 and 5.2.
Probabilistic reasoning in intelligent systems, J. Pearl
A standard reference for graphical models, and a pleasure to read, but
we won't discuss it in class.
- Representation of Prob DBs
- Databases with uncertainty and lineage
Benjelloum, Das Sarma, Halevy, Theobald, Widom, VLDBJ 2008.
Read up to Sec. 4.2 (inclusive)
- Fast and Simple Relational Processing of Uncertain Data
Lyublena Antova, Thomas Jansen, Christoph Koch, and Dan Olteanu
Read sections 1-6. The purpose here is to understand WS-relations
(which have been described in several prior papers) and the
U-relations (which are described here for the first time)
- Queries and Materialized Views on Probabilistic Databases
Nilesh Dalvi Christopher Re Dan Suciu(Manuscript)
Sections 2 and 4.
- Approximate Lineage for Probabilistic Databases
Christopher Re, Dan Suciu, VLDB'2008
Sec. 1 and 2
3. Representation and Query Evaluation
- We will finish the discussion on representing probabilistic
databases, then start query evaluation
- Lecture notes
- Recommended readings: items 9, 12, 13, 14, and 1 (listed above)
4. Dichotomy Theorem
- Lecture notes
- Recommended readings: items 9, 12, 13, 14, and 1 (listed above)
5. Query Evaluation
- Lecture notes
- Recommended readings: items 9, 12, 13, 14, and 1 (listed above)
6. Aspects of Query Processing
- Lecture notes
- Recommended readings
- Graham Cormode and Minos Garofalakis. Histograms and wavelets on probabilistic data. In ICDE, 2009
- Xi Zhang and Jan Chomicki. On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases. In DBRank, 2008
- Mohamed Soliman, Ihab Ilyas, and Kevin C. Chang. Top-k Query Processing in Uncertain Databases. In ICDE, 2007
- Ke Yi, Feifei Li, Divesh Srivastava, and George Kollios. Efficient Processing of Top-k Queries in Uncertain Databases. In ICDE, 2008
- Ming Hua, Jian Pei, Wenjie Zhang, and Xuemin Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD, 2008
7. Probabilistic logic, Conditional logic
8. Implicit Probabilistic Data
9. Histograms and Sampling
- Lecture notes
- Recommended readings
- Guha, Koudas, Srivastava. Fast Algorithms For Hierarchical Range Histogram Construction. In PODS, 2002
- Getoor, Taskar, Koller. Selectivity Estimation using Probabilistic Models. In SIGMOD, 2001
- Chaudhuri, Motwani, Narasayya. On random sampling over joins. In SIGMOD, 1999
- Babcock, Chaudhuri. Towards a robust query optimizer. In SIGMOD, 2005
10. Sampling and Review