Homework 2 - Query Execution

Due Wednesday, April 28, 1999

Goals

For this assignment, you will gain a feel for how query execution works in a real system, specifically focusing on the operation of simple join algorithms.  You will also hopefully see how different query execution trees have different performance results, which will provide some motivation for query optimization.  (Note that the differences in your simple execution system will probably be minor, because you will be using a "toy" data set and an execution system that ignores many of the complex aspects of a real system.)

Specifications

This assignment will be a two-person group assignment, so find a partner.  Each group member will choose and implement a different join type, from one of the various methods:  nested loops, sort merge, hash, and double pipelined.

Code

A skeleton query execution engine written in Java is provided here in zip and tar.gz format.  This code includes a simple query execution plan parser, the basic iterator model, and the scan operator for reading data files.  You will need to modify PlanParser.java to create the appropriate objects which implement your join operators; you will also need to create classes for those operators.

For your join implementations, you do not need to concern yourself with hash table overflow resolution, external sorting, etc. for the sort merge, hash, and double pipelined joins.; assume that the data will all fit in memory.  You may also assume the nested loops join can also be done in memory, but you may not assume any ordering on the data (i.e. for each outer tuple, your nested loops join must examine all elements in the inner relation sequentially).  Assume you are working with equality joins (equijoins).

Begin by familiarizing yourselves with the basic architecture of the code and the object model.  This should be quite simple, as you are only provided with 4 classes (QueryExec, PlanParser, Tuple, and Scan) and one interface (Operator).  You and your partner will want to add a generic Join class and two specific subclasses for your operator implementations.

Note that one important aspect of joining two tuples is that the resulting "combined" tuple may actually have "name clashes" when the original attributes each contain an attribute (other than the join attribute) with the same name.   This can cause a problem if you feed the output to another operator which needs to refer to a specific attribute.  Thus your Join operator should return metadata (in this case, just attribute names) where attributes from the second occurrence of the attribute are renamed to eliminate the clash; do this by appending hashes (#'s) to the end of the offending attribute name until it no longer matches a previous attribute.

Note also that your join operator should only return one copy of the join attribute, rather than one from each source tuple.

Query Plan Language

The query plan language that this execution engine expects is very primitive, but a simple parser is already provided for you.

The BNF grammar is:

query ::- operator

operator ::- join(operator.attribute operator.attribute) Join results of 2 operators using the specified attributes
::- tablename Scan table named tablename.tbl

This is basically a simple prefix-based encoding of joins and scans.  A sample query would be:

join(table1.attr1 join(table2.attr3 table3.attr4).attr2)

Which would construct a tree with a join at the top, fed by a scan of table1 and a join of table2 and table3.

The results of your query will go into a file called output.dat, which is crude but human-readable.

Experiments

There are four tables for you to experiment with, which are a subset of a real data set from a study of high school students.  (See http://augustus.csscr.washington.edu/icpsr/comp_227.html for the original data.)

Popularity (SCHOOLID, STUDENTID, TEACHERID, TEACHERKNOWS, KNOWSPARENT, INCLASS, COLLEGE, TOPOTENTIAL, POPULAR, TALKATIVE, DISLIKESSCHOOL, DISCIPLINE, HANDICAP) - 10001 tuples, 1.5MB

Language (SCHOOLID, STUDENTID, FIRSTLANG, OTHERLANG, USUALLANG, NONENGLISH, SPEAKS, READS, WRITES) - 10001 tuples, 1.1MB

School (SCHOOLTYPE, SCHOOLID, CENSUSREGION, LOWESTGRADE, HIGHESTGRADE, STUDENTS, SPECIALIZATION, GRADS, YEARLENGTH, PERIODLENGTH, DAILYPERIODS, ATTENDANCE, NAMSTUDENTS, NAMFACULTY, ASIANSTUDENTS, ASIANFACULTY, HISPSTUDENTS, HISPFACULTY, BLACKSTUDENTS, BLACKFACULTY, WHITESTUDENTS, WHITEFACULTY, INCOLLEGE, NONCOLLEGE, MILITARY) - 1976 tuples, 140KB

SchoolType (STUDENTS, SCHOOLTYPE, SCHOOLID, STUDENTID, CENSUSREGION, TWINDATA, LANGDATA, SPANLANG, TESTDATA, GRADE) - 5001 tuples, 620KB

For Experiment 1, run the following joins of two relations:

Run each join using the two possible plans, under both your join algorithm and your partner's.  (You should end up with four timing results for each join.)

For Experiment 2, suppose we would like to know the students, their teachers, their first languages, and the appropriate school types.  (We do not have projection, so simply run a join of the appropriate tables.)  Be aware that student ID's are unique, so STUDENTID supersedes SCHOOLID as a key where appropriate.

You and your partner should create three different execution plans for this query and execute them with your different join algorithms. Now do a combined write-up of the results your group obtained.

In your experimental write-up, include the timings, the plans used, and a discussion of the differences (if any) between timings on different algorithms and different plans.   Also discuss why query execution plan orderings can make a more significant difference in a real-world system with larger data sets.  Note that the experimental section will comprise approximately half of your grade, so you should put some thought into the discussion.

Submitting Your Homework Solution and Discussion

Take your (and your partner's) various new and modified source files and your experimental write-up and zip or tar/gzip them into a single file.  Send via attachment to zives@cs and rap@cs.