CSE 544
University of Washington
Spring 2000
Homework 2
The following questions are intended to make you think about XML and storage. They are both open ended. Your solutions should be reflections of this thought process. As a consequence, be sure to be clear about the assumptions you are making. This is particularly true when you are considering a particular workload.
Some things to keep in mind: XML data is a graph; there may be need for sets and/or lists; queries follow a path through the graph.
Assume a large quantity of data (petabytes). Here are some sample DTDs and XML data to act as examples. Feel free to start from one example and generalize.
Some citations which might be useful are:
In particular, your proposal should include: your storage strategy; a DTD; and some example queries (based on the DTD) for which your proposal is not efficient.
Some citations which might be useful, in addition to the above:
From "The Cow Book" Exercise 12.4
SELECT R.a, S.b FROM R, S WHERE R.c = S.c
SELECT S.a, T.b FROM S, T WHERE S.c = T.c
SELECT R.a, T.a FROM R, T WHERE R.c = T.c
Some citations for double pipelined hash join
"Nesting" takes tuples from one table and 'nests' them under another tuple in an XML document. The "parent" table is assumed to have only a few tuples, while the "child" table can have many tuples that belong as subelements under each parent tuple. Notice that for a given parent tuple, there could be no child tuple. You can assume that each parent tuple has an (unique) ID, and that child tuples have a foreign key referencing their parent's ID. Here is an example:
|
nested under |
|
|||||||||||||||||||||
results in | |||||||||||||||||||||||
|
<items> <item> <quantity> 4 </quantity> <price> $1.50 </price> <order number> 0001 </order number> </item> <item> <quantity> 3 </quantity> <price> $1.75 </price> <order number> 0001 </order number> </item> <item> <quantity> 9 </quantity> <price> $1.29 </price> <order number> 0001 </order number> </item> </items> |
nested under |
<orders> <order> <order number> 0001 </order number> <customer> Bob Jones </customer> <address> Bob Jones University </address> </order> <order> <order number> 0002 </order number> <customer> Ralph Nader </customer> <address> Ralph Nader Institute </address> </order> </orders> |
results in |
<order history> <customer> Bob Jones </customer> <address> Bob Jones University </address> <order> 0001 <quantity> 4 </quantity> <price> $1.50 </price> <quantity> 3 </quantity> <price> $1.75 </price> <quantity> 9 </quantity> <price> $1.29 </price> </order> </order history> |
For each of the following page replacement policies, suggest (and justify) a relational operator which will cause poor performance.
Optimizer plans are usually considered to be trees. Give a query for which the "best" plan is a graph rather than a tree. The query can be over a schema of your choice. You do not need to prove that the plan is the absolute best, but should (at the very least) give a strong intuitive argument that a graph is better than a tree for this query.