University of Washington
Homework 2 - Solutions
Problem 2) Again, there is no universally good answer to this question.
One initial naive view is that most of the XML data out there is just relational data that has been "XML-ized". Therefore a native storage for it should be compatible with relational native storage. In this case we should be able to leverage all the techniques we already have for storing and efficiently indexing relational data.
In general, however, XML data conceptually corresponds to a graph layout.
Now let's assume that we have the following DTD.
<!ELEMENT book (booktitle, author*, collection*)>
<!ELEMENT author (name, biography)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT biography ANY>
<!ELEMENT collection (series, book*)>
<!ELEMENT series (#PCDATA)>
Moreover, let's assume that there are three types of queries that we ask:
The big plus of this scheme is its simplicity. Indexes can be built in the usual way using hash tables or B+ trees. Insertions and deletions are more complicated, since pointers need to be updated, which may require touching an arbitrary number of pages.
In order to answer the given queries, indexes on both booktitles and authors would be helpful. If the order of the XML data is unimportant, clustered indexes can be created for both booktitles and authors (and any other tags if needed). The last query is doable using a recursive algorithm, but it will involve multiple seeks on the booktitles.
Using this approach, only one clustered index, on the booktitle, can be built. Other indexes can be created as well, but only unclustered ones. Given the query load, an unclustered index on authors should be built.
The advantage of this layout is that we can answer the first and second queries very efficiently, since the information is clustered together. The last query, as before, involves many seeks on booktitles, but overall there are fewer disk accesses than before.
There are further tricks for minimizing space occupancy. One idea is to map the tags to small integers and store the map separately from the tags themselves. Another is to identify types for the attributes given in the DTD; one example in our case would be labeling the series as an ISBN-like field, i.e., a sequence of 10 digits. More clever optimizations can be found in the 'XMill' paper of Hartmut Liefke and Dan Suciu, which deals with compressing XML.
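The tag-dictionary idea above can be sketched in a few lines. This is a simplified illustration (not XMill's actual format): tags are assigned consecutive small integers in first-seen order, and the mapping is stored separately from the now-compact tag stream. The sample tag stream uses the element names from the DTD above.

```python
# Sketch of the tag-dictionary idea: map each XML tag to a small integer
# and store the mapping separately from the compact tag stream.
# The encoding scheme here is a simplified illustration of the idea only.

def build_tag_dictionary(tags):
    """Assign consecutive small integers to tags in first-seen order."""
    mapping = {}
    for tag in tags:
        if tag not in mapping:
            mapping[tag] = len(mapping)
    return mapping

# A hypothetical stream of start-tags from a document conforming to the DTD.
tag_stream = ["book", "booktitle", "author", "name", "firstname",
              "lastname", "author", "name", "lastname", "collection",
              "series", "book"]

dictionary = build_tag_dictionary(tag_stream)
encoded = [dictionary[t] for t in tag_stream]

print(dictionary)   # e.g. {'book': 0, 'booktitle': 1, 'author': 2, ...}
print(encoded)      # the tag stream as small integers, much cheaper to store
```

Since a document repeats the same few tags many times, storing one small integer per occurrence (plus the map once) is far cheaper than storing the tag strings inline.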
If the keys are allowed to be variable length: when a node becomes underfull, it attempts to borrow from a neighbor but is unable to do so (or to do so in a way that leaves both nodes at the minimum occupancy) without moving an element that is equal to the key. As a result, the intermediate (parent) node must be reorganized. This can eventually percolate up to the root, causing a new level to be added.
This analysis is not meant to be exhaustive; it is rather 'back of the envelope' math.
Basic Nested Loops join: (pages in outer) + [(tuples/page in outer) * (pages in outer) * (pages in inner)]
Index Nested Loops join: (pages in outer) + [(tuples/page in outer) * (pages in outer) * (height of tree (2-4) + leaf exploration (1))]
Block Nested Loops join: (pages in outer) + [(blocks in outer) * (pages in inner)]
Sort Merge join: (sorting of inner) + (sorting of outer) + (pages in outer) + (pages in inner)
Hash join: 3 * [(pages in outer) + (pages in inner)], if B > sqrt(smaller relation)
Semijoin: This technique is not likely to apply in this context. Moreover, it can be implemented in a number of ways, and we do not have enough information here to give a cost estimate.
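The formulas above can be written as one function per algorithm, which makes it easy to plug in numbers and compare plans. The page counts, tuples per page, and tree height used in the example are hypothetical, chosen only to illustrate the arithmetic.

```python
# Back-of-the-envelope I/O cost formulas from the text, one per algorithm.
# All inputs (page counts, tuples/page, tree height) are hypothetical.

def nested_loops(outer_pages, tuples_per_page, inner_pages):
    return outer_pages + tuples_per_page * outer_pages * inner_pages

def index_nested_loops(outer_pages, tuples_per_page, tree_height, leaf_io=1):
    # tree_height is typically 2-4; leaf exploration costs one more I/O
    return outer_pages + tuples_per_page * outer_pages * (tree_height + leaf_io)

def block_nested_loops(outer_pages, inner_pages, block_pages):
    blocks = -(-outer_pages // block_pages)  # ceiling division
    return outer_pages + blocks * inner_pages

def sort_merge(outer_pages, inner_pages, sort_outer_cost, sort_inner_cost):
    return sort_outer_cost + sort_inner_cost + outer_pages + inner_pages

def hash_join(outer_pages, inner_pages):
    # valid only when B > sqrt(pages of the smaller relation)
    return 3 * (outer_pages + inner_pages)

# Hypothetical relations: 1000-page outer, 100 tuples/page, 500-page inner.
print(nested_loops(1000, 100, 500))        # 50,001,000 I/Os
print(index_nested_loops(1000, 100, 3))    # 401,000 I/Os
print(block_nested_loops(1000, 500, 100))  # 6,000 I/Os
print(hash_join(1000, 500))                # 4,500 I/Os
```

Even with made-up numbers, the rough ordering of the algorithms is visible immediately, which is all a back-of-the-envelope analysis is for.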
SELECT R.a, S.b FROM R, S WHERE R.c = S.c
SELECT S.a, T.b FROM S, T WHERE S.c = T.c
This is very similar to the first, except index nested loops is not valid, and T would need to be sorted for Sort-Merge. Sorting T drives the cost of Sort-Merge well above Hash, so Hash wins.
SELECT R.a, T.a FROM R, T WHERE R.c = T.c
Clearly, Hash join is the superior algorithm for performance; however, it blocks, disrupting interactivity. The stress on interactivity would suggest one of the nested loops joins or, if available, Double Pipelined Hash.
Nesting is a special form of joining. The important property that a nesting algorithm must provide is locality, i.e., all children must be listed with their parents. Some things that may be useful for optimization are that every child has a parent, that parents are unique, and that there are many fewer parents than children.
Any operation which requires repeated sequential scans of the same data will perform poorly under LRU when the buffer size is (slightly) less than the data size. An example would be Nested Loops Join, where the size of the inner is (slightly) larger than the available buffers.
Any operation which uses a few pages repeatedly while accessing a large number of pages only once will perform poorly under MRU. The frequently accessed pages will continuously be thrown out of the buffer pool to make room for the incoming pages. The infrequently used pages will stick in the buffers even though they are used only once. An example is Hash Join, where the hashed relation fits into a few pages, while the other relation is very large, relative to the buffer pool size.
Any operation with predictable behavior will perform worse under a random strategy than under a strategy tailored to its needs. Both of the above examples (sequential scans, and a few pages used repeatedly) should perform better under random than under LRU and MRU (respectively). However, clearly there are approaches to both these problems that are better than random. For instance, sequential scans of data which is one page larger than the buffer could be done more efficiently by toggling only one page in and out, rather than randomly throwing out all the pages.
As with random, this will work poorly for particular operations, but for a general workload it may be a bit better than LRU or MRU. In particular, it is poor for anything that random is poor for, but has an additional bias against things for which LRU is poor, such as sequential scans. In a sequential scan, this strategy will keep the most recently seen pages until just (speaking loosely) before they are needed again.
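The sequential flooding effect described above is easy to demonstrate with a small simulation: repeated scans of N+1 pages through an N-page buffer make LRU miss on every single access, while MRU keeps most of the buffer stable. The buffer size and page counts below are made up for illustration.

```python
# Simulation of sequential flooding: 3 scans of 11 pages through a
# 10-page buffer. LRU misses on every access; MRU misses far less.
# Page ids and sizes are hypothetical.

def simulate(reference_string, buffer_size, policy):
    """Count buffer misses for policy 'LRU' or 'MRU'."""
    buffer, recency, misses = set(), [], 0
    for page in reference_string:
        if page in buffer:
            recency.remove(page)          # refresh this page's recency
        else:
            misses += 1
            if len(buffer) == buffer_size:
                # LRU evicts the least recently used page (front of list),
                # MRU evicts the most recently used one (back of list).
                victim = recency.pop(0) if policy == "LRU" else recency.pop()
                buffer.remove(victim)
            buffer.add(page)
        recency.append(page)              # page is now most recently used
    return misses

scan = list(range(11)) * 3                # three sequential scans of 11 pages
print(simulate(scan, 10, "LRU"))          # 33 misses: every access misses
print(simulate(scan, 10, "MRU"))          # 13 misses: buffer stays mostly warm
```

With LRU, the page about to be needed next is always the one just evicted; MRU sacrifices a single recently-read page instead, leaving the rest of the buffer useful on the next pass.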
There are several situations in which a graph is more useful or efficient than a tree. One is recursive queries, since some kind of cycle is required. Another is to avoid redundant scans of the same relation. This typically occurs when the same subquery appears twice in the same query, or during a self join such as below.
SELECT A.c FROM T A, T B WHERE A.a != B.a AND A.z > 20 AND A.y < 30