544 Projects

Project Ideas

milestones

Enhance SQL with WordNet.

Motivation: SQL Queries require exact knowledge of the data in the database. For example the query:


SELECT Product.name, Product.price

FROM Product

WHERE Product.name = 'Cellphone'

requires us to know that the Product table contains some products whose name is 'Cellphone'. If we are wrong, and the products we are looking for are called 'Cell phone' or 'Mobile phone' or 'Cellular', then we get an empty answer, with no clue how to fix our query. WordNet [1] could provide the much needed help. WordNet is a large ontology of the entire English language. Using this ontology one can compute the 'distance' between two words (concepts), and this can be used to enhance SQL queries.

Project Description: Implement a system that integrates a SQL query engine with the WordNet ontology. Your system needs to support only single-table queries (i.e. no need for joins) and without aggregates, much like the query above. Your system should return answers where the equality predicates in the original SQL query are interpreted as 'short distance' in the WordNet ontology. For example, the answer to the query above may look like this:

Answer:


Product.name  |  Product.price   |  Score (Wordnet distance)

================================================================

Cell phone    |  99.99           |  1

Cell phone    |  49.99           |  1

Mobile phone  |  59.99           |  2

Mobile phone  |  89.99           |  2

Cordless phone|  59.99           |  3

. . . .

Challenge. The main challenge here is integrating WordNet with the database server. Can it reside outside the engine, and if so how is it integrated with the engine ? Or is it possible to push WordNet inside the engine ?

References:

[1] WordNet: A Lexical Database for the English Language.

[2] Agrawal, Chaudhari, Das, Gionis. Automated Ranking of Database Query results. CIDR, 2003.

[3] Theobald, Weikum. The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking. EDT, 2002.

Publishable: yes (the WordNet/RDB engineering)

Pratical Utility: not yet (requires lots of extra work)

Team: Max 2 people

Implement a stream-based XML transformation tool as part of the XML Toolkit project.

Motivation The XML Toolkit project consists of a collection of highly efficient, highly scalable tools for processing large XML files. They support only limited operations, but perform them very efficiently, much like Unix command lines for text processing. The new transformation tool will allow users to express simple transformations on an XML file, which can be done in stream fashion, yet will be quite powerful.

Project Description. Design a simple XML transformation language. The starting point can be XSAG, in reference [1]: add or delete features as needed. Then implement it in the XML Toolkit framework. Use the Toolkit's parser and XPath processor, to benefit from their highly efficient implementations. You will need to buffer some XML elements, but hopefully only a few, depending on the particular transformation. Your tool should support efficiently the following simple transformations (some refer to the examples in [2]):

Replace every element with a element
Inside replace with ; inside
replace with .
Suppose all elements consist of a and . Replace the two elements with a single string, obtained by concatenating the first and last name.
To each element add a subelement which whose value is listprice/text() * disount/text()

Deliverables: The tool; a dozen or so simple transformations on a large document; running times for these transformations; running times using the SIX.

References:

[1] Christoph Koch, Stefanie Scherzinger. Attribute Grammars for Scalable Query Processing on XML Streams. DBPL 2003: 233-256.

[2] Iliana Avila-Campillo, Todd J. Green, Ashish Gupta, Makoto Onizuka, Demian Raven, Dan Suciu. XMLTK: An XML Toolkit for Scalable XML Stream Processing. In Proceedings of PLANX, October, 2002.

[3] Todd Green et al.Processing XML Streams with Deterministic Automata.

[4] The XML toolkit.

Publishable: not in a major conference

Pratical Utility: very, very useful. It is badly needed in the XML Toolkit project

Team: 1 person; maybe 2

Implement a SQL interface to a Unix file system.

Motivation. The Unix file system offers a very limited user interface. Commands like 'ls', 'find', 'grep' allow us to list and find files, but only support a limited query language. It would be nice to be able to express much richer queries, as supported in SQL. Examples are:

find all directories that contain an *.c file and an *.obj file with the same name but where the obj file is older than the c file
list all directories in decreasing order of the number of files; list all directories in decreasing order of the total amount of space used in that directory
find duplicate files across all directories, i.e. files with the same name, same content.
search in all directories that contain some .pdf file for all files that contain the words 'process' and 'algebra'

Project Description: Design a relational schema for file metadata. Your schema should include at least the following:

file name
directory structure (parent, ancestors, etc)
file attributes (date, size, ...)
file content: the set of words (for text files only)
file signature (using a hash function), to be used in comparisons.

Write a program to populate your database from a large directory tree. Your program should be quite efficient: my own directory tree has about 2GB of data, and I'd expect this program to run in only a few minutes. Then, show some interesting SQL queries over your database. Extra bonus: integrate your query interface with a file browser (such as the Windows browser of the Mac finder) such that when the file names are returned they are displayed in the bowser.

Challenge. The only challenge here is dealing with the nitty-gritty systems issues: read the files, determine their types robustly (figure out which ones are text files), and if they are text files then scan their content to index all words. Also: choose a good signature function (to detect duplicates efficiently)

References
none here

Publishable: not in a database conference

Pratical Utility: maybe

Team: 1 person

Allow loose schema specification in SQL queries.

Motivation. SQL queries require detailed knowledge of the database schema. A wrong table name or a wrong attribute name results in a syntax error. It would be nice to have a system that tolerates queries with deviations from the database schema: misspelled table/attribute names, different ways to split the data into tables. Such a system will translate an incorrect query into a set of approximations. For example a query like:


SELECT Author.title

FROM Author

WHERE Author.name = 'Ullman'

would be translated by the system into two queries:


SELECT Publication.title

FROM Author, Wrote, Publication

WHERE Author.aid = Wrote.aid and Wrote.pid = Publication.pid

   and Author.name = 'Ullman'



and



SELECT Employee.title

FROM Author, Employee

WHERE Author.ssn = Employee.ssn and Author.name = 'Ullman'

(or something like that...)

Project Description Given a relational schema construct a graph that represents all key - foreign-key relationships in the database. Given a SQL query construct a similar graph using only the relations and attributes mentioned the query. Then implement some graph matching algorithm that finds the best matchings from the query graph to the database graph, and for each such mapping generate the corresponding correct SQL query.

Deliverables: You must demonstrate your system on a real database. You can use, for example, the movie database in HW1, or TPC/H, etc.

References.

[1] Hristidis, Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. VLDB, 2002.

[2] AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy. Learning to Map between Ontologies on the Semantic Web. VLDB, 2001.

Publishable: yes (after lots of innovation and lots of hard work)

Pratical Utility: not yet

Team: 2 people; 3 are possible

Probabilistic XQuery

Motivation. Probabilities occur in databases in various contexts. For example in [1] probabilities are introduced when uncertain predicates are evaluated on database values. We need to be able to represent and query probabilistic XML data. Representation is easy: just add to each element some @prob attribute, indicating the probability of that element. Query evaluation is hard: we need to be able to return a query answer as a 'probabilistic' answer.

Project Description. Implement a probabilistic query interpretor for XQuery. Your interpretor will actually rewrite XQuery expressions into expressions that manipulate the probabilities explicitly, then use a standard XQuery interpretor (e.g. Galax) to run them. You choose a subset of XQuery to implement: at a minimum it will have only FOR-LET-WHERE-RETURN clauses whithout nested queries and without element constructors in the RETURN clause (i.e. only a variable is returned). More challenging is to define the meaning of queries that return constructed XML values.

ChallengeS. As shown in [1], the general query evaluation problem on probabilistic databases is #P complete. You need to define a fragment of XQuery where all queries can be evaluated efficiently. Doing this only for simple XPath expressions is probably easy, given the techniques in [1]. The challenge is to do this for richer XQuery fragments.

References:

[1] Nilesh Dalvi and Dan Suciu. Efficient Query Evaluation on Probabilitistic Databases.

[2] E. Hung, L. Getoor, V. S. Subrahmanian. PXML: A Probabilistic Semistructured Data Model and Algebra. International Conference on Data Engineering, Bangalore, India, March 2003.

Publishable: no (unless you discover something I don't anticipate now)

Pratical Utility: yes. (can be implemented in Galax with approximate predicates)

Team: 1-2 persons

Build a privacy monitoring tool based on "Privacy Views", and "Purpose Statements".

Motivation. Privacy protection is a major concern in data exchange: for example a medical research institution cannot share its data with partners if that reveals private information about its human subjects, unless the parters use the data for a purpose for which the human subject(s) have explicitly grant permission. To enforce that we need to be able to track private information. What exactly constitutes private data in a large database ? We also need to track purpose statements: for what kind of usage have owners of private data authorized its use ? Finally, we need to be able decide, for a given query, if it can be answered or not, based on the query's purpose and on its beneficiary.

Project Specifications: Implement a system that keeps track of the following privacy components: privacy views, purpose statements, the requester's query and the requester's purpose. The system should decide whether permission should be granted or not. At a finer level the system could return a number between 0 and 1, representing the amount of privacy disclosure for that query, where 0 means no disclosure at all (hence permission granted) and 1 means total disclosure (hence permission denied). You also need to define several meaningful purposes supported by your system.

ChallengeS. You need to use a database system here, to define views, evaluate views, etc. This is the simple part. The harder part consists of deciding whether the answer returned by a query is private or not. This may require implementing a simple version of query containment.

Deliverables: You should illustrate your system on the TPC/H benchmark. For example define the data in ORDER and LINEITEM as private, owned by the customer, and specify for what purpose that piece of data can be accepted. Then run the benchmark queries on behalf of a variety of customers, and with different purposes. Does your system make the right decisions ?

References:

[1] Dan Suciu et al. Privacy Preserving Data Integration and Sharing.

Publishable: yes

Pratical Utility: not now

Team: 2-3 people

Design and implement a database editor.

Motivation. We have editors for text files. We don't have editors for relational databases.

Project Description. Implement a simple editor than can work with any relational database system, by using only an ODBC connection. For the purpose of the project the editor can be restricted to a command-line oriented editor (i.e. doesn't have to be WYSIWYG). It should support at a minimum the following commands. Extensions are desirable, and some are suggested in parenthesis.

T  < table name> < attr name> ...

       - open a table for scrolling; the tuples are sorted by attr
         names


L
        - list the current record. [May want to list several records
          'around' the current point]

N, P
        - move to the next, previous record [May want to move one
          screen at a time; may want to move to the next tuple
          satisfying a given SQL query]


D
         - delete the current record [Delete several records; delete
           all records defined by a SQL query]

I
         - insert a new record [Or records, specified by a SQL query]

C
         - 'copy' the current record  [Copy several records]


P
         - 'paste' the record(s) previously copied [Need to be smart
           about schema: what if the previously copied record has a
           different schema ?]

F 
         - follow a foreign key: close the current table and open the
           other table, placing the cursor at the appropriate record.

ChallengeS. User your creativity. This can be a very cool tool.

Deliverables: Show this working on at least two database systems, e.g. Postgres and on SQL Server. (Reason for two: despite the fact that ODBC is 'standard', it is actually not that standard...)

References
none here

Publishable: yes/maybe

Pratical Utility: yes/maybe

Team: 2 persons

Implement a "Leakage Mesasuring Tool". Given a view V and a query S, and a database instance D, the tool will measure how much information the view V leaks about the query S.

Motivation. Traditional security mechanisms protect data at the physical level, such as firewalls and other perimeter mechanisms. In data exchange, however, such mechanisms are limited since they can only protect the data up to the first authorized recipient. When data is exchanged with multiple partners, information may be unintentionally disclosed, even when all physical protection mechanisms work as intended. Whenever we publish a view V about our database D, we need to understand how much it leaks about a query S that we want to keep secret.

Project Description. The first task consists of coming up with a reasonable definition of leakage for given V, S, and D. The following is a suggestion: the members of this project are encouraged to consider alternatives: leak(V,S,D) = (Prob(S | V) - Prob(S))/Prob(S) (see reference [1]; notice that in [1] all probabilities are over all databases D; here you consider a fixed database D instead). Your tool will take as input the queries V, S (expresses as conjunctive queries), the database instance D, and some description of the domains of the attributes, and will compute leak(V,S,D).

Challenge. No matter how you define leak(V,S,D), it will be a function of probabilities of S and V. These probabilities can be computed by brute force, if one iterates over all possible database instances constructed from a given domain. For example P(S)=k/n where where k is the total number of database instances D s.t. S(D)=s (for some given output s, see [1]) and n is the total number of database instances. The major challenge here is to compute P(S) given that n and k are huge numbers. At a minimum a brute force approach should be tried on tiny domains (of, say, 10 elements or so). However you are encouraged to think hard and find more clever algorithm, e.g. one that works on large domains (say all integers, i.e. 2^32) but only on tiny databases (say 5-10 tuples).

References [1] Miklau/Suciu SIGMOD 2003 [2] Adam, Wortmann. Security-control methods for statistical databases: a comparative study. [from the 590db fall/2003 website]

Publishable: yes

Pratical Utility: no

Team: 2 persons


	Department of Computer Science & Engineering University of Washington Box 352350 Seattle, WA 98195-2350 (206) 543-1695 voice, (206) 543-2969 FAX [comments to nilesh]