Project Requirements

The course project is due Tuesday, March 6. The project will be completed in groups, and a written proposal will be due by each group on Thursday of the third week. Some possible project ideas are described below. Original ideas are welcome, but should be approved by the instructor before the proposal deadline.

Indexing patterns
Consider a large semistructured database assumed to be a single XML file, but much larger than main memory. For a limited class of queries (to be determined) design indexing techniques to minimize disk i/o. Analyze and compare the efficiency of each proposed index, taking into account the cost of maintainence over updates to the database.
This problem remains unsolved today. Readings on related topics are:
- Indexes for semistructured data: DataGuides (http://www.informatik.uni-trier.de/~ley/db/conf/vldb/GoldmanW97.html) and T-indexes (http://www.informatik.uni-trier.de/~ley/db/conf/icdt/MiloS99.html; download the paper from my website, http://www.cs.washington.edu/homes/suciu). DataGuides and T-indexes are the same for tree-like data, which is the case with XML. T-indexes are better for cyclic data.
- The approach to indexing in Xyleme (http://www-rocq.inria.fr/verso/LEVEL1/Xyleme.html).
- A system from Berkeley, XSet (http://www.cs.berkeley.edu/~ravenben/xset/), that implements some index for XML, much in the spirit of dataguides/t-indexes.
Xpath containment
Choose a manageable, well-defined class of Xpath expressions and investigate algorithms for deciding the containment of such expressions. The main technique to be used here is that of containment of conjunctive queries.
Related reading: Peter T. Wood, in Dood 2000,On the Equivalence of XML Patterns.
Xpath query pruning
Given an Xpath expression over XML data conforming to a given schema, investigate techniques for transforming the original expression into a simpler expression that returns the same results over all data conforming to the schema. Define a reasonable subset of Xpath and of XML-Schema for which such pruning works. There is some relevant literature on this topic.
Some readings on query pruning:
- This paper http://www.informatik.uni-trier.de/~ley/db/conf/icde/FernandezS98.html describes pruning of regular path queries (read only up to the section on query pruning: the "rewriting using state extents" section is irrelevant). It does not handle order or branching.
- Extensions to order and/or branching are in: Horizontal Query Optimization on Ordered Semistructured Data by Hartmut Liefke (in WebDB'99, http://www-rocq.inria.fr/~cluet/WEBDB/procwebdb99.html ) and in a technical report (http://cs.kaist.ac.kr/library/tr/, see the paper "Preprocessing Multiple Regular Path Expressions for the Rewriting-based Optimization in Semistrucured Databases" by Chang-Won Park and Chin-Wan Chung). These are more technical, you may want to postpone reading them.
- Klemens Bohm, Kathrin Gayer, Karl Aberer, M. Tamer Özsu Query Optimization for Structured Documents Based on Knowledge on the Document Type Definition
- Kuper, Simeon. Subsumption for XML Types
Storing XML as relational data
Compare and analyze proposed methods of storing XML in a relational DBMS. There is a some literature on this topic. This project would require digesting existing literature, and either proposing original improvements, or providing a comparison of known techniques.
There are three main publications on this topic:
- Florescu and Kossman (http://www.informatik.uni-trier.de/~ley/db/journals/debu/FlorescuK99.html)
- Shanmugasundaram et al. (http://www.informatik.uni-trier.de/~ley/db/conf/vldb/ShanmugasundaramGTZDN99.html)
- STORED (http://www.informatik.uni-trier.de/~ley/db/conf/sigmod/DeutschFS99.html)
Publishing relational data as XML
Assess the current state of methods for publishing XML views of relational data, both research prototypes and database vendor capabilities. All three major database vendors offer XML extensions to their database systems: Oracle, IBM, Microsoft. Provide a comparison study of their approaches.

There are two reading on this topic:
- SilkRoute ( http://www9.org/w9cdrom/)
- Xperanto (http://www.research.att.com/conf/webdb2000/program.html)
Bulk processing of recursive transformations

XML is processed recursively with XSLT, essentially a language for defining recursive functions. But recursive tree traversals are inefficient to implement and impossible to optimize. Bulk processing of such transformations recognizes some restricted forms for recursive processing that can be implemented with standard relational operators. Define a fragment of XSLT that can be processed in bulk, and implement a translator from that fragment to SQL. There is some relevant literature that needs to be digested for this project.

The only reading here is a bulky journal paper (http://link.springer-ny.com/link/service/journals/00778/bibs/0009001/00090076.htm). Come to my office, and I can explain you in a few minutes what is going on. It is not trivial: but it is exciting and highly relevant.
Processing ordered collections in SQL

SQL processes only unordered collections (bags). XML query languages on the other hand must consider ordered collections, since XML is an ordered data model. Can operations on ordered collections be implemented efficiently in SQL ? In this project you are required to design a suite of techniques for processing ordered collections in SQL, and implement a translator from a query language on ordered collections into SQL, implementing those techniques. As starting point, represent an ordered collection by adding an extra integer field that stores the "index" of each tuple in the collection.

Project Requirements

Indexing patterns

Xpath containment

Xpath query pruning

Storing XML as relational data

Publishing relational data as XML

Bulk processing of recursive transformations