CSE 544

University of Washington

Spring 2000

Homework 2

Storing XML

The following questions are intended to make you think about XML and storage. They are both open ended. Your solutions should be reflections of this thought process. As a consequence, be sure to be clear about the assumptions you are making. This is particularly true when you are considering a particular workload.

Some things to keep in mind: XML data is a graph; there may be need for sets and/or lists; queries follow a path through the graph.

Assume a large quantity of data (petabytes). Here are some sample DTDs and XML data to act as examples. Feel free to start from one example and generalize.

Propose a relational schema for efficiently storing XML data, where efficiently has several potential meanings. Propose for each of the following desired behavior scenarios.
1. Efficient loading
2. Efficient querying only, including considerations of predicates (where do they occur on the path expression), potential execution strategies (can the execution be done bottom up or top down), what characteristics of a query are most important (are you optimizing for joins or redundancy), etc.
3. Efficient updating, including insertions and deletions of subgraphs, and updates to nodes and edges (e.g. data values and schema).
4. Some balance between the above needs
Some citations which might be useful are:
- Daniela Florescu, Donald Kossmann: Storing and Querying XML Data using an RDMBS. IEEE Data Engineering Bulletin 22(3): 27-34 (1999).
- Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, Jeffrey F. Naughton: Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314.
- Alfons Kemper, Guido Moerkotte: Access Support in Object Bases. SIGMOD Conference 1990: 364-374.
- Roy Goldman, Jennifer Widom: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. VLDB 1997: 436-445.
Propose a storage layout to store XML natively. This could be inspired by object oriented design, relational schemas, plain text with indices, or completely of your own creation. It should allow for efficient processing of some interesting set of queries. To approach this problem, it might be useful to choose an example data set, and develop a few queries over that data. You should explore both queries which work well over your storage layout and queries which work poorly.
In particular, your proposal should include: your storage strategy; a DTD; and some example queries (based on the DTD) for which your proposal is not efficient.

Some citations which might be useful, in addition to the above:
- C. C. Kanne, Guido Moerkotte: Efficient Storage of XML Data. ICDE 2000: 198
- Feng Tian, David J. DeWitt, Jianjun Chen, Chun Zhang: The Design and Performance Evaluation of Various XML Storage Strategies. Unpublished.
- Alin Deutsch, Mary Fernandez, Dan Suciu: Storing semistructured data with STORED. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1999.

B+ trees

Exercise 9.2 from "The Cow Book", parts 1 through 5.
Give an example B+ tree and a delete operation on that tree such that the tree height is increased by the delete operation.

Join Algorithms

Given the following data stored on disk as specified. For each query, analyze the join algorithms we have seen (Nested Loops, Sort-Merge, Hash, Double Pipelined Hash, Semijoin), arguing for a best and a worst algorithm.
From "The Cow Book" Exercise 12.4
- The cost metric is the number of page I/Os and the cost of writing the result should be ignored. Disk and buffer pages are assumed to be the same size.
- Relations R and T contain 10,000 tuples and have 10 tuples per page.
- Relation S contains 2,000 tuples and also has 10 tuples per page.
- Both relations are stored as simple heap files.
- 52 buffer pages are available.
- There is only one index, on attribute c of relation R. It is a clustered index using a B+ tree.
- Any attribute of any relation can be hashed such that 100 hash entries fit on a page.
1. ```
         SELECT R.a, S.b
         FROM   R, S
         WHERE  R.c = S.c
```
2. ```
         SELECT S.a, T.b
         FROM   S, T
         WHERE  S.c = T.c
```
3. Assume for this query that interactivity is very important.
```
         SELECT R.a, T.a
         FROM   R, T
         WHERE  R.c = T.c
```
Some citations for double pipelined hash join
- Zachary G. Ives, Marc Friedman, Daniela Florescu, Alon Levy, Daniel S. Weld. An Adaptive Query Execution System for Data Integration, SIGMOD 1999, Philadelphia, PA.
- Tolga Urhan, Michael Franklin: XJoin: Getting Fast Answers From Slow and Bursty Networks. Technical Report. CS-TR-3994, UMIACS-TR-99-13. February, 1999
- Peter J. Haas, Joseph M. Hellerstein: Ripple Joins for Online Aggregation. SIGMOD Conference 1999: 287-298.
- Annita N. Wilschut, Peter M. G. Apers: Dataflow Query Execution in a Parallel Main-Memory Environment. PDIS 1991: 68-77 (no online version!?)

Propose three join algorithms (perhaps extensions of the above algorithms) to perform "Nesting" as described below. Evaluate your algorithms, demonstrating situations in which one works significantly better (or worse) than the others. At least one of you algorithms should be efficient in a general purpose environment.

"Nesting" takes tuples from one table and 'nests' them under another tuple in an XML document. The "parent" table is assumed to have only a few tuples, while the "child" table can have many tuples that belong as subelements under each parent tuple. Notice that for a given parent tuple, there could be no child tuple. You can assume that each parent tuple has an (unique) ID, and that child tuples have a foreign key referencing their parent's ID. Here is an example:

Relational Example

quantity	price	order (parent)
4	$1.50	0001
3	$1.75	0001
9	$1.29	0001

nested under

order	name	address
0001	Bob Jones	Bob Jones University
0002	Ralph Nader	Ralph Nader Institute

results in

name	address	order	quantity	price
Bob Jones	Bob Jones University	0001	4	$1.50
Bob Jones	Bob Jones University	0001	3	$1.75
Bob Jones	Bob Jones University	0001	9	$1.29

XML Example

   <items>
    <item>
     <quantity>
	4
     </quantity>
     <price>
	$1.50
     </price>
     <order number>
	0001
     </order number>
    </item>

    <item>
     <quantity>
	3
     </quantity>
     <price>
	$1.75
     </price>
     <order number>
	0001
     </order number>
    </item>

    <item>
     <quantity>
	9
     </quantity>
     <price>
	$1.29
     </price>
     <order number>
	0001
     </order number>
    </item>    
   </items>

nested under

   <orders>
    <order>
     <order number>
	0001
     </order number>
     <customer>
	Bob Jones
     </customer>
     <address>
	Bob Jones University
     </address>
    </order>
    <order>
     <order number>
	0002
     </order number>
     <customer>
	Ralph Nader
     </customer>
     <address>
	Ralph Nader Institute
     </address>
    </order>
   </orders>

results in

   <order history>
    <customer>
	Bob Jones
    </customer>
    <address>
	Bob Jones University
    </address>
    <order>
	0001
     <quantity>
	4
     </quantity>
     <price>
	$1.50
     </price>
     <quantity>
	3
     </quantity>
     <price>
	$1.75
     </price>
     <quantity>
	9
     </quantity>
     <price>
	$1.29
     </price>
    </order>
   </order history>

Buffer management

For each of the following page replacement policies, suggest (and justify) a relational operator which will cause poor performance.

LRU
MRU
Random
Random, except for k most recently used pages (where k is perhaps 10% of the whole).

Optimizer Plans

Optimizer plans are usually considered to be trees. Give a query for which the "best" plan is a graph rather than a tree. The query can be over a schema of your choice. You do not need to prove that the plan is the absolute best, but should (at the very least) give a strong intuitive argument that a graph is better than a tree for this query.