CSE 544

University of Washington

Spring 2000

Homework 2

Storing XML

The following questions are intended to make you think about XML and storage. They are both open ended. Your solutions should be reflections of this thought process. As a consequence, be sure to be clear about the assumptions you are making. This is particularly true when you are considering a particular workload.

Some things to keep in mind: XML data is a graph; there may be need for sets and/or lists; queries follow a path through the graph.

Assume a large quantity of data (petabytes). Here are some sample DTDs and XML data to act as examples. Feel free to start from one example and generalize.

  1. Propose a relational schema for efficiently storing XML data, where efficiently has several potential meanings. Propose for each of the following desired behavior scenarios.
    1. Efficient loading
    2. Efficient querying only, including considerations of predicates (where do they occur on the path expression), potential execution strategies (can the execution be done bottom up or top down), what characteristics of a query are most important (are you optimizing for joins or redundancy), etc.
    3. Efficient updating, including insertions and deletions of subgraphs, and updates to nodes and edges (e.g. data values and schema).
    4. Some balance between the above needs

    Some citations which might be useful are:

  2. Propose a storage layout to store XML natively. This could be inspired by object oriented design, relational schemas, plain text with indices, or completely of your own creation. It should allow for efficient processing of some interesting set of queries. To approach this problem, it might be useful to choose an example data set, and develop a few queries over that data. You should explore both queries which work well over your storage layout and queries which work poorly.

    In particular, your proposal should include: your storage strategy; a DTD; and some example queries (based on the DTD) for which your proposal is not efficient.

    Some citations which might be useful, in addition to the above:

B+ trees

  1. Exercise 9.2 from "The Cow Book", parts 1 through 5.
  2. Give an example B+ tree and a delete operation on that tree such that the tree height is increased by the delete operation.

Join Algorithms

  1. Given the following data stored on disk as specified. For each query, analyze the join algorithms we have seen (Nested Loops, Sort-Merge, Hash, Double Pipelined Hash, Semijoin), arguing for a best and a worst algorithm.

    From "The Cow Book" Exercise 12.4

    • The cost metric is the number of page I/Os and the cost of writing the result should be ignored. Disk and buffer pages are assumed to be the same size.
    • Relations R and T contain 10,000 tuples and have 10 tuples per page.
    • Relation S contains 2,000 tuples and also has 10 tuples per page.
    • Both relations are stored as simple heap files.
    • 52 buffer pages are available.
    • There is only one index, on attribute c of relation R. It is a clustered index using a B+ tree.
    • Any attribute of any relation can be hashed such that 100 hash entries fit on a page.

    1.          SELECT R.a, S.b
               FROM   R, S
               WHERE  R.c = S.c
      
    2.          SELECT S.a, T.b
               FROM   S, T
               WHERE  S.c = T.c
      
    3. Assume for this query that interactivity is very important.
               SELECT R.a, T.a
               FROM   R, T
               WHERE  R.c = T.c
      

    Some citations for double pipelined hash join

  2. Propose three join algorithms (perhaps extensions of the above algorithms) to perform "Nesting" as described below. Evaluate your algorithms, demonstrating situations in which one works significantly better (or worse) than the others. At least one of you algorithms should be efficient in a general purpose environment.

    "Nesting" takes tuples from one table and 'nests' them under another tuple in an XML document. The "parent" table is assumed to have only a few tuples, while the "child" table can have many tuples that belong as subelements under each parent tuple. Notice that for a given parent tuple, there could be no child tuple. You can assume that each parent tuple has an (unique) ID, and that child tuples have a foreign key referencing their parent's ID. Here is an example:

    Relational Example

    quantity price order (parent)
    4 $1.50 0001
    3 $1.75 0001
    9 $1.29 0001
    nested under
    order name address
    0001 Bob Jones Bob Jones University
    0002 Ralph Nader Ralph Nader Institute
    results in
    name address order quantity price
    Bob Jones Bob Jones University 0001 4 $1.50
    Bob Jones Bob Jones University 0001 3 $1.75
    Bob Jones Bob Jones University 0001 9 $1.29

    XML Example

       <items>
        <item>
         <quantity>
    	4
         </quantity>
         <price>
    	$1.50
         </price>
         <order number>
    	0001
         </order number>
        </item>
    
        <item>
         <quantity>
    	3
         </quantity>
         <price>
    	$1.75
         </price>
         <order number>
    	0001
         </order number>
        </item>
    
        <item>
         <quantity>
    	9
         </quantity>
         <price>
    	$1.29
         </price>
         <order number>
    	0001
         </order number>
        </item>    
       </items>
       
    nested under
       <orders>
        <order>
         <order number>
    	0001
         </order number>
         <customer>
    	Bob Jones
         </customer>
         <address>
    	Bob Jones University
         </address>
        </order>
        <order>
         <order number>
    	0002
         </order number>
         <customer>
    	Ralph Nader
         </customer>
         <address>
    	Ralph Nader Institute
         </address>
        </order>
       </orders>
       
    results in
       <order history>
        <customer>
    	Bob Jones
        </customer>
        <address>
    	Bob Jones University
        </address>
        <order>
    	0001
         <quantity>
    	4
         </quantity>
         <price>
    	$1.50
         </price>
         <quantity>
    	3
         </quantity>
         <price>
    	$1.75
         </price>
         <quantity>
    	9
         </quantity>
         <price>
    	$1.29
         </price>
        </order>
       </order history>
       

Buffer management

For each of the following page replacement policies, suggest (and justify) a relational operator which will cause poor performance.

  1. LRU
  2. MRU
  3. Random
  4. Random, except for k most recently used pages (where k is perhaps 10% of the whole).

Optimizer Plans

Optimizer plans are usually considered to be trees. Give a query for which the "best" plan is a graph rather than a tree. The query can be over a schema of your choice. You do not need to prove that the plan is the absolute best, but should (at the very least) give a strong intuitive argument that a graph is better than a tree for this query.