Please subscribe to the Course Calendar
Unlike textbook-level SQL, modern big data management systems support a nested data model and/or iteration. In this seminar we will read some fundamental papers on language design and expressive power focusing on nested relations and their interaction with iteration.
Please see the notes below. Additional recommended reading: some basic background on datalog, e.g. the first paper here http://courses.cs.washington.edu/courses/cse590q/12au/
Week | Topic | Paper(s) | Link | Notes | Presenter | |
---|---|---|---|---|---|---|
Note unusual day: Firday, 9/27 |
Overview and Organization | introduction | ||||
1 | Monday, 10/7 | Comprehension | Main: Buneman et al: Comprehension Syntax. SIGMOD Record, 1994 |
paper (restr) link paper (restr) |
Jennifer Ortiz, | |
2 | Monday, 10/14 | Principles | Main: Tannen et al: Naturally Embedded Query Languages. ICDT 1992 (Sec 1-4 only) | paper (restr) | read the technical report |
Paris Koutris |
3 | Monday, 10/21 | Case Studies | Main: Melnik et al., Dremel: interactive analysis of web-scale datasets. CACM 2011 |
paper (restr) manual paper (restr) link |
focus on query language |
Laurel Orr, Gabriel Bender, Dominik Mortiz |
4 | Monday, 10/28 | Implementation | Main: Melnik et al, Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 2011 Optional: Abadi et al, Column-stores vs. row-stores: howdifferent are they really? SIGMOD 2008 Optional: Liefke, Suciu: XMILL: An Efficient Compressor for XML Data. SIGMOD 2000 |
paper (restr) paper (restr) paper (restr) |
focus on storage layout |
Jeremy Hyrkas, Emad Soroush, Daniel Li |
5 | Monday, 11/4 | Conservativity (1) | Main: Wong: Normal Forms and Conservative Properties for Query Languages over Collection Types. PODS 1993 | paper (restr) | Jingjing, Ryan | |
Monday, 11/11 | -- external speakers | |||||
6 | Monday, 11/18 | Conservativity (2) | Main: Suciu: Bounded Fixpoints for Complex Objects. Theor. Comput. Sci. 1997 (Sec. 1-4) | slides paper (restr) |
Shumo, Prasang | |
7 | Monday, 11/25 | Parallelism | Main: Suciu, Tannen: A Query Language for NC. J. Comput. Syst. Sci. 1997 | slides | read slides paper is optional |
Eric |
8 | Monday, 12/2 | While-Language | Main: Datalog and Recursive Query Processing |
paper | Shengliang |
Modern nested data models include Protobuf and Json. We should do a very brief overview of the syntax for one of them only.
Take-aways from the comprehension paper:
-- comprehension seems natural to express queries over deeper collections
-- aggregates become natural in conjunction with nested collections
-- need explicit set/bag/list distinction
-- functions that define queries (what we call "macros") can be defined in the same language/style as the rest of the query language
-- variant types; how are these captured in modern data models?
Comprehending Monads, Phil Wadler: Read Sec. 2, 3 only. Take-aways: the definition of a monad, important examples of monads (lists, identity, strictness), basic terminology (object, morphisms, functor).
Naturally embedded query languages; read only sec. 1-4 (and just the first 3 paragraphs of 4).
This paper is the key to understanding the landscape of query languages for nested collections. While the paper should be easy to read (make sure you read the Tech Report, not the journal version!), there is a lot of stuff to think about.
Sec. 1.4: skip the details of sri/sru (we'll discuss them later); all you need from this section is to understand what map and ext do, namely:
map(f)({x1,...,xn}) = {f(x1), ..., f(xn)}
ext(f)({x1,...,xn}) = f(x1) U ... U f(xn)
Sec. 2. Assume that an equality operator on the base type b is added to MC and to MA:
= : b x b --> {()} (recall {()} is the same as Bool)
(IMHO it was a mistake not to add this operator to MC/MA, just assume it's there.)
Take-away from Sec 2,3:
(a) MC = MA = "conjunctive queries for nested relations" (assuming we have the = operator!)
(b) MC({}, union) = MA({}, union) = "unions of conjunctive queries for nested relations"
(c) to get negation, it suffices to add any single non-monotone operator:
intersection, or
= on non-atomic types (not the same as = on atomic types!), or
set difference, or
etc
Then: MC({}, union, difference) = "THE Nested Relational Algebra" = NRA
Sec. 3, "Discussion" on the "nest/unnest" operators. Here is what I know and is not stated in the paper:
"Standard Relational Algebra" + nest + unnest = NRA (almost)
To discuss the significance of this equivalence. To discuss what nest/unnest cannot express.
Sec 4. If you have "powerset", how do you express transitive closure?
Skip the discussion on cardinality and the rest of the paper.
Pig-Latin, Dremmel (aka google's Big Query), and AQL are query languages for "big data" that support nested collections. We'll discuss their design. How can they express sample queries on nested collections? Suggestions, what about:
unnest: {A x {B}} --> {A x B}
nest: {A x B} --> {A x {B}}
join-plus-nested-join: {A x {B x C}} x {A x {B x D}} --> {A x {B x C x D}}
nested-unnest: {A x {B x {C}}} --> {A x {B x C}}
etc
All implementations of nested collections into flat collections use some form of column-store, hence we should discuss the C-store paper too.
First proven by Paredaens and Van Gucht, the conservativity theorem says that adding nested relations to NRA does not help one express more queries; it only helps at manipulating nested collections. We will discuss Wong's proof, which is elegant in spirit because it consists of simple rules that would normally be applied by modern query optimizer. The take-away of the paper is that, any good query optimizer should be able to remove unnecessary nested relations from intermediate results. We will see a second proof technique for the conservativity theorem next.
The old slides are very concise and almost sufficient for this topic. Sections 1-4 of the paper are a bit long-ish, but quite readable, and offer a survey of our discussion from previous seminars.
There are two main ideas in this paper:
(a) an alternative proof technique for the conservativity theorem, which uses a (natural?) encoding of nested relations into flat relations, then translates NRA into RA over that encoding. This is useful for implementations (which we discussed in another seminar).
(b) recursion plus nested relations can express powerset. This was first shown by Abiteboul and Beeri (that paper is harder to read), and is shown here in one line. This is important: in other words, it says that datalog plus nested relations is a no-no! The "natural" way to combine iteration plus recursion is to use a "bounded" fixpoint. The main result in this paper is that NRA extended with bounded fixpoint is a conservative extension of RA with fixpoint. Paraphrasing: in "bounded" datalog^neg we can always remove unnecessary nesting in intermediate results. (And this does NOT hold for datalog^neg over nested relations!) An alternative way to combine recursion and nested relations while keeping the language in PTIME was described by Gyssens and Van Gucht, and is to simply disallow the construction of nested relations inside the recursion: later S. and Van Gucht showed these two languages are equivalent, hence they seem to be the "right" query language with recursion plus nested relations. So far, no practical query language has adopted such a form of recursion/iteration over nested relations
"Divide-and-conquer" on sets captures precisely the complexity class NC, in the same way in which fixpoint (or datalog^neg) captures precisely PTIME. In contrast to "divide-and-conquer", "step-wise-iteration" on sets captures PTIME. Taken with a grain of salt, this suggests that "divide-and-conquer" is the natural language construct for expressing parallel queries, while "step-wise" is the natural construct for sequential queries. One take-away is the easy way to rewrite any divide-and-conquer function into to step-wise-iteration function. Clearly, no automatic rewriting in the other direction is possible, unless NC=PTIME!.
Modern database systems allow users to define new aggregate operators. We will discuss UDA's in postgres. The UDA is essentially defined as step-wise-iteration, hence it is "sequential". To discuss: how to we convert the "sequential" UDA into a parallel (divide-and-conquer) UDA?
Fixpoint extensions of first-order logic and datalog-like languages
Abiteboul and Vianu
Graphlab, Pregel consists of iteration + local-updates. What is the expressive power of such languages? This paper is a dense (but luckily short) and clarifies the interaction between updates and iteration.
Take away: updates during iteration lead to nondeterminism (the updates order is non-deterministic). The paper captures this in Logic using an operator called W, defined as follows:
Wy.R(x,y) = all possible maximal subsets R' of R s.t. x is a key in R'.
Note that both Wx.Wy.R(x,y) and Wy.Wx.R(x,y) denote partial matchings in R (i.e. both x and y are keys), but they are obtained in different order thus may be different. Of course, Wxy.R(x,y) means simply "choose some tuple (x,y) from R" and is completely different from the other two. More recently, W has been called the "repair by key" operator in (Lyublena Antova, Christoph Koch, Dan Olteanu: From complete to incomplete information and back. SIGMOD Conference 2007): repair-by-key[x](R(x,y)) means "choose a unique y for each x", and is the same as Wy.R(x,y)
A discussion and comparison to Graphlab and/or Pregel would be nice.