CSE 544

University of Washington

Spring 2000

Homework 2 - Solutions

Storing XML

    Problem 1) There is no universally good answer. Most of the time an answer is good only relative to the assumptions made. We accepted answers that disagree with the one presented here because of different initial assumptions, whenever those assumptions were clearly stated by the student.

    1.
    Store the XML data as one big blob in a table. The table has arity 1. Whenever an XML file needs to be loaded, perform one insert that takes the file as one big binary blob and adds a tuple to the table. The loading cost is minimal since the operation involves only data copying. (A rough SQL sketch of this scheme appears after item 4 below.)

    2.
    As mentioned above, the answer here depends on the queries that are most frequently asked. For each reasonable way of storing XML data in an RDBMS one can come up with an ideal scenario and a worst-case one. Some of the options are:
    (a)
    Use the Attribute+Inline approach from the Florescu and Kossmann paper. They present fifteen benchmark queries as their general query load, and Table 8 in the paper gives the execution times for each of the queries. Overall, Attribute+Inline gives good performance.
    (b)
    If the query load involves queries with simple predicates on attribute names, the Edge approach (as per Florescu and Kossmann) might be more suitable. Without an index, a simple traversal of the table will produce the output; with an index, significant speed-ups can be achieved. (See the Edge table sketch after item 4 below.)
    (c)
    For queries involving path expressions, another scheme might be more suitable: store the XML data using either the Edge or the Universal approach and build an index over all the possible paths in the XML tree (the XML data can be represented as a tree if we ignore IDREFs). Path expressions can then be answered more easily by using the index.
    Another issue here is the ever-present normalization versus performance tradeoff: whether to spread the data across many tables, reducing redundancy (normalization), or to store it in a few tables with increased redundancy but potentially better performance. In the first approach, queries may have to perform a substantial number of joins, and joins are expensive. The latter approach accepts redundancy and may be better for queries with long predicates (involving many attributes). For a thorough discussion of this issue (which arises in the relational model too), see Section 16.1 in the Cow Book.

    3.
    The answer here is tied to the answer above. Performing updates, inserts, and deletes often requires executing a query in order to locate the right tuple to be modified. Therefore, a substantial portion of the total cost is the query execution cost discussed above; on top of that there is the cost of creating new objects and updating indexes. Another issue is adding a new attribute to the document. The Universal schemes presented by Florescu and Kossmann would perform very poorly here since the table must be altered - a very expensive operation in all RDBMSs. On the other hand, the Edge and Attribute+Inline approaches would perform better, since in the worst case only a new table must be created. Table 9 gives performance times for a few update cases, which seem to indicate that the Attribute+Inline method is robust (given their query load).
    4.
    All the answers above seem to indicate that the Attribute+Inline approach is the winner in most situations. Again, this is all relative to the workload that needs to be supported. Consider one scenario: XML data is stored in an RDBMS merely as a backup for a very busy website. Crashes are very rare, so the bottleneck is storing the XML data fast, without worrying about querying it. A simple binary blob with a timestamp would certainly do the job.
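
    To make items 1 and 4 (the blob scheme) and item 2(b) (the Edge scheme) more concrete, here is a rough SQL sketch. All table and column names are hypothetical, and the Edge layout only approximates the table used in the Florescu and Kossmann paper.

        -- Items 1 and 4: the whole XML file is stored uninterpreted in a table of arity 1.
        -- (For the backup scenario of item 4 one might add a timestamp column.)
        CREATE TABLE XmlDoc (
            content  BLOB
        );

        -- Item 2(b): the Edge approach stores every parent-child edge of the XML graph
        -- as one tuple; 'flag' says whether the edge leads to another object or to a value.
        CREATE TABLE Edge (
            source     INTEGER,       -- id of the parent object
            ordinal    INTEGER,       -- position among the parent's children
            name       VARCHAR(64),   -- tag or attribute name on the edge
            flag       CHAR(1),       -- 'r' = reference to an object, 'v' = value
            target     VARCHAR(255)   -- child object id, or the value itself
        );

        -- A simple predicate on an attribute name is a selection on Edge;
        -- an index on Edge(name) gives the speed-up mentioned in 2(b).
        SELECT target FROM Edge WHERE name = 'booktitle';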




    Problem 2) Again, there is no universal good answer to this question.

    One initial, naive view is that most of the XML data out there is just relational data that has been "XML-ized". Therefore a native storage format for it should be compatible with native relational storage. In that case we should be able to leverage all the techniques we already have for storing relational data and indexing it efficiently.

    In general, however, XML data conceptually corresponds to a graph layout.



    Now let's assume that we have the following DTD.

    <!ELEMENT book (booktitle, author*, collection*)>
    <!ELEMENT author (name, biography)>
    <!ATTLIST author id ID #REQUIRED>
    <!ELEMENT name (firstname?, lastname)>
    <!ELEMENT firstname (#PCDATA)>
    <!ELEMENT lastname (#PCDATA)>
    <!ELEMENT biography ANY>
    <!ELEMENT collection (series, book*)>
    <!ELEMENT series (#PCDATA)>
    

    Moreover, let's assume that there are three types of queries that we ask:

    1.
    Given a booktitle, give me the authors.
    2.
    Given an author, give me his/her biography.
    3.
    Given a booktitle, give me all the booktitles for which there is a sequence of collections that links the books (i.e., given booktitle A, return all booktitles B such that
    A -> collection -> booktitle -> collection -> ... -> booktitle -> collection -> B, recursively).
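
    Query 3 is inherently recursive. Just to make its semantics precise (independently of how the data is stored), here is how it could be phrased in SQL:1999-style recursive syntax over a purely hypothetical relational encoding, with tables Book(bookid, booktitle), BookCollection(bookid, collectionid) listing the collections under a book, and CollectionBook(collectionid, bookid) listing the books under a collection:

        WITH RECURSIVE Reachable(bookid) AS (
            -- start from the book(s) having the given title A
            SELECT b.bookid FROM Book b WHERE b.booktitle = 'A'
          UNION
            -- follow book -> collection -> book links, one step at a time
            SELECT cb.bookid
            FROM Reachable r
            JOIN BookCollection bc ON bc.bookid = r.bookid
            JOIN CollectionBook cb ON cb.collectionid = bc.collectionid
        )
        SELECT DISTINCT b.booktitle
        FROM Reachable r JOIN Book b ON b.bookid = r.bookid
        WHERE b.booktitle <> 'A';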

    • One obvious way of storing data conforming to this DTD is to treat each tag as one graph node and store it on a page along with pointers to the other tags it contains (its children). In this scheme one record corresponds to one tag. There are different proposed schemes for encoding the pointers - using either disk page ids and offsets, or uniquely generated ids. For faster access one might also want to store pointers to parents (the reverse direction).

      The big plus of this scheme is its simplicity. Indexes can be built in the usual way using hash tables or B+ trees. Insertions and deletions are more complicated, since pointers need to be updated, which may require touching an arbitrary number of pages.

      In order to answer the given queries, indexes on both booktitles and authors would be helpful. If the order of the XML data is unimportant, clustered indexes can be created for both booktitles and authors (and for any other tags if needed). The last query is doable using a recursive algorithm, and will involve multiple seeks on the booktitles.

    • A different approach would be to view each book along with its booktitle, authors, collections as a subtree. Each page then can store a number of subtrees. There are pointers from one subtree to another given by the collection attribute. In this case one stored record corresponds exactly to one subtree.

      Using this approach, only one clustered index can be built, on the booktitle. Other indexes can be created too, but only unclustered ones. Given the query load, an unclustered index on authors should be built.

      The advantage of this layout is that we can answer the first and second queries very efficiently, since the information is clustered together. The last query, as before, involves many seeks on booktitles, but overall there are fewer disk accesses than in the previous scheme.

    There are further tricks for minimizing space occupancy. One idea is to map the tags to small integers and store the map separately from the tags themselves. Another idea is to identify types for the attributes given in the DTD; one example in our case is to label series as an ISBN-like field consisting of 10 digits. More clever optimizations can be found in the 'XMill' paper of Hartmut Liefke and Dan Suciu, which deals with compressing XML.
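
    A minimal sketch of the tag-map idea (all names hypothetical): the map itself is just a tiny two-column table, and the stored records then carry the small integer instead of the tag string.

        CREATE TABLE TagMap (
            tagid    SMALLINT,      -- small integer stored in place of the tag
            tagname  VARCHAR(64)    -- the original tag, e.g. 'booktitle' or 'author'
        );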

B+ trees

  1. Exercise 9.2 from "The Cow Book":
    1. I1, I2, and everything in the range [L2, ..., L8]. Notice that I3 is not visited because the leaf nodes are connected via forward and backward sibling pointers, so the scan can proceed along the leaf level.
    2. There are many search keys X such that inserting X would increase the height of the tree. Any search key in the range [65 ... 79] would suffice. A key in this range would go in L5 if there were room for it, but since L5 is full already and since it cannot redistribute any data entries over to L4 (L4 is full also), it must split; this in turn causes I2 to split, which causes I1 to split, and assuming I1 is the root node, a new root is created and the tree becomes taller.
    3. We can infer several things about subtrees A, B, and C. First of all, each must have height one, since its "sibling" trees (those rooted at I2 and I3) have height one. Also, we know the key ranges of these trees (assuming duplicates fit on the same leaf): subtree A holds search keys less than 10, B contains keys between 10 and 19, and C keys between 20 and 29. In addition, each intermediate node has at least 2 key values and 3 pointers.
  2. BONUS

    If the keys are allowed to be of variable length: when a node becomes underfull, it attempts to borrow from a neighbor but is unable to do so (or to do so in such a way that both nodes meet the minimum occupancy) without moving an entry equal to the separating key. As a result, the intermediate (parent) node must be reorganized. This can eventually percolate up to the root, causing a new level to be added.

Join Algorithms

  1. For each query, analyze the join algorithms we have seen.

    This analysis is not meant to be exhaustive; it is rather 'back of the envelope' math.

    Basic Nested Loops join: (pages in outer) + [(tuples/page in outer) * (pages in outer) * (pages in inner)]

    Index Nested Loops join: (pages in outer) + [(tuples/page in outer) * (pages in outer) * (height of tree (2-4) + leaf exploration (1))]

    Block Nested Loops join: (pages in outer) + [(blocks in outer) * (pages in inner)]

    Sort Merge join: (sorting of inner) + (sorting of outer) + (pages in outer) + (pages in inner)

    Hash join: 3 * [(pages in outer) + (pages in inner)], if B > sqrt(smaller relation)

    Semijoin: This technique is not likely to apply in this context. Moreover, it can be implemented in number of ways, and we do not have enough information here to give a cost estimate.

    1.          SELECT R.a, S.b
               FROM   R, S
               WHERE  R.c = S.c
      

      1. Basic Nested Loops is clearly going to lose; choosing R as the outer (to make it that much worse): 1,000 + (10 * 1,000 * 200) = 2,001,000
      2. Index Nested Loops, choosing S as the outer (to use the index on R): 200 + (10 * 200 * (3 + 1)) = 8,200
      3. Block Nested Loops, choosing S as the outer: 200 + (200/(52 - 2) * 1000) = 4,200
      4. Sort-Merge, using S as the outer and a simple N log N sort cost estimate with base-2 logarithms (R is already sorted, so its sort cost is 0): 200 log 200 + 0 + 200 + 1,000 = 2,729
      5. Hash join, B = 52 > sqrt(200): 3 * (200 + 1000) = 3,600

      In this case, Sort Merge join is going to win, since we have already sorted R (by having a clustered index on it).

    2.          SELECT S.a, T.b
               FROM   S, T
               WHERE  S.c = T.c
      

      This is very similar to the first query, except that index nested loops is not applicable, and T would need to be sorted for Sort-Merge. Sorting T drives the cost of Sort-Merge well above Hash, so Hash join wins.

      • Sort-Merge, using S as the outer and using a simple sort cost estimate: 200 log 200 + 1,000 log 1,000 + 200 + 1,000 = 12,694

  2. Assume for this query that interactivity is very important.
             SELECT R.a, T.a
             FROM   R, T
             WHERE  R.c = T.c
    

    1. Basic Nested Loops is clearly going to lose (again): 1,000 + (10 * 1,000 * 1,000) = 10,001,000
    2. Index Nested Loops, choosing T as the outer (to use the index on R): 1,000 + (10 * 1,000 * (3 + 1)) = 41,000
    3. Block Nested Loops, with either as the outer: 1000 + (1000/(52 - 2) * 1000) = 21,000
    4. Sort-Merge, with either as the outer using a simple sort cost estimate (R is already sorted): 1,000 log 1,000 + 0 + 1,000 + 1,000 = 11,966
    5. Hash join, B = 52 > sqrt(1,000): 3 * (1000 + 1000) = 6,000

    Clearly, Hash join is the superior algorithm in terms of raw performance; however, it blocks, which disrupts interactivity. The stress on interactivity suggests one of the nested loops joins or, if available, Double Pipelined Hash join.

  • Propose three join algorithms to perform "Nesting".

    Nesting is a special form of joining. The important property a nesting algorithm must provide is locality, i.e., all children must be listed with their parent. Some things that may be useful for optimization are that every child has a parent, that parents are unique, and that there are many fewer parents than children.

    1. Sort Merge join can be used to do nesting as long as the parent is the outer relation (see the SQL sketch after this list).
    2. Nested Loops join cannot be used without modifications. The parent needs to be the outer. The child relation could be sorted, with a binary search used to find each child cluster, or a clustered index could be used.
    3. Hash join could be used, hashing the child relation on the join key. The join key is the parent ID, which could be used as a perfect hash to partition the children by parent. Spinning through the parents, each parent would get exactly one bucket. This is not really hash join, though it uses some of the same mechanisms. Additionally, the number of buckets is equal to the number of parents, which could get ugly. Normal hash join (where several parents' children would share the same bucket) could also be used. Notice that this use of hash join is backwards from the normal one, as we are hashing the larger relation.
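
    As a small illustration of option 1 above (a sort/merge-style plan with the parent as the outer), consider hypothetical Parent(id, ...) and Child(parent_id, ...) tables. Producing the join output ordered by the parent key gives exactly the locality that nesting needs: all of a parent's children appear together, immediately following their parent. If both inputs are already sorted on the key, a sort-merge plan produces this order without any extra sort.

        -- Hypothetical tables; the ORDER BY expresses the clustering-by-parent
        -- that a sort-merge plan with the parent as the outer delivers for free.
        -- The outer join also keeps parents that have no children.
        SELECT p.id, c.*
        FROM Parent p LEFT OUTER JOIN Child c ON c.parent_id = p.id
        ORDER BY p.id;
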
  • Buffer management

    1. LRU

      Any operation which requires repeated sequential scans of the same data will perform poorly under LRU when the buffer size is (slightly) less than the data size. An example would be Nested Loops Join, where the size of the inner is (slightly) larger than the available buffers.

    2. MRU

      Any operation which uses a few pages repeatedly while accessing a large number of pages only once will perform poorly under MRU. The frequently accessed pages will continuously be thrown out of the buffer pool to make room for the incoming pages. The infrequently used pages will stick in the buffers even though they are used only once. An example is Hash Join, where the hashed relation fits into a few pages, while the other relation is very large, relative to the buffer pool size.

    3. Random

      Any operation with predictable behavior will perform worse under a random strategy than under a strategy tailored to its needs. Both of the above examples (repeated sequential scans, and a few pages used repeatedly) should perform better under random than under LRU and MRU, respectively. However, there are clearly approaches to both of these problems that are better than random. For instance, a repeated sequential scan of data that is one page larger than the buffer could be handled more efficiently by toggling only one page in and out, rather than randomly throwing out all the pages.

    4. Random, except for k most recently used pages (where k is perhaps 10% of the whole).

      As with random, this will work poorly for particular operations, but for a general workload it may be a bit better than LRU or MRU. In particular, it is poor for anything random is poor for, and it has an additional bias against things LRU is poor for, such as sequential scans. In a sequential scan, this strategy will keep the most recently seen pages until just before (loosely speaking) they are needed again.

Optimizer Plans

    There are several situations in which a graph-shaped plan is more useful or efficient than a tree. One is recursive queries, since some kind of cycle is required. Another is avoiding redundant scans of the same relation. This typically occurs when the same subquery appears twice in the same query, or during a self join such as the one below.

       SELECT A.c
       FROM T A, T B
       WHERE A.a != B.a AND
             A.z > 20 AND
             A.y < 30