CSE 326 Data Structures Project 4

CSE 326 Data Structures
Project 4

Objective

In this assignment you will build a processor for database queries (well, simple forms of database queries). Rather that being instructed to use specific data structures, in this assignment you will design the necessary data structures based on the knowledge you've acquired during the course. Hence, you get to see how different data structures come together to construct a complete application.

A crash course in databases

(Note: this does not replace CSE-444!)

We begin by introducing the concept of a database table, and then explain the kinds of queries that we pose on them.

Data in a relational database is stored in tables. A Table has a name, and a set of attributes. The attributes are the names of the columns. Each attribute has a type (e.g., integer, string, etc). The table is filled with a set of rows (called tuples), where each row has a value for each one of the attributes. In the example below, the table Employee has four attributes: SSN is an integer, Name is a string, DepartmentID and Salary are integers.

Employee
SSN	Name	DepartmentID	Salary
999999999	John	1	30,000
777777777	Tony	1	32,000
888888888	Alice	2	45,000

A database can include several tables. The schema of a table includes the name of the table, the name of its attributes and the data type for every attribute. The schema of the database is the collection of schemas of the individual tables.

Given a set of tables, we define several operations on them for the purpose of querying. Every operation takes as input one or two tables and returns a table as a result. We consider the following query operators:

Selection

The selection operator selects a subset of the rows in a table, given some selection condition. For example, the selection find all rows in Employee with Salary more than $31,000 returns the following table.

Employee
SSN	Name	DepartmentID	Salary
777777777	Tony	1	32,000
888888888	Alice	2	45,000

The conditions in the selection operator can be of the following forms: attribute = value, attribute < value, attribute > value, attribute1 = attribute2, or attribute1 < attribute2.

Projection

The projection operator simply selects a subset of the attributes in the table and disposes of the rest. For example, if we project the attributes SSN and Name from our table, we obtain the following table:

Employee
SSN	Name
999999999	John
777777777	Tony
888888888	Alice

Order-by

The order-by operator takes one of the attributes as input, and produces the same table, but in ascending order, sorted by the given attribute.

Join

This is where things get interesting. The join operator takes as input two tables, T1 and T2, and two attributes: A1 (of table T1) and A2 (of table T2). The join produces all the tuples in the Cartesian product of T1 and T2, where T1.A1=T2.A2. The following example we take the join of the attribute SSN of table Employee and the attribute EmployeeSSN of table Dependents. Note that in the resulting table we have only one column of the joined attribute (i.e., we have SSN but we don't have EmployeeSSN). Also note that Alice is not represented in the result because it didn't match any tuple in the Dependents table.

Employee
Name	SSN
John	999999999
Tony	777777777
Alice	888888888

Dependents
EmployeeSSN	Dname
777777777	Emily
999999999	Joe
777777777	Pam

*Employee_Dependents*
Name	SSN	Dname
Tony	777777777	Emily
John	999999999	Joe
Tony	777777777	Pam

Join conditions in queries are specified by equalities of the form A1=A2, where A1 and A2 refer to attributes in two distinct tables. In this assignment you can assume that attribute names are distinct across tables (i.e, no pair of tables share an attribute name). Hence, A1=A2 uniquely specifies a join condition. Exactly one of A1 or A2 is in the result of the join, and you can choose which arbitrarily.

What you need to do

Your program will read in a database, and then process queries on it.

Step 1: reading in the input. The first step you need to do is read the database schema and the tuples in the database. The schema is given in a file with the following form:

Table name; number of attributes; filename
Attribute₁; type
Attribute₂; type
...
Attribute_n; type
Table name; number of attributes; filename
...

You can assume that there are three data types for columns: integer, string (of size at most 30), and float. Feel free to support more types if you want.

You will first read in the database schema, then create the data structures you need to save the tables (note: the entire database will be in main memory; no need to deal with data on disk).

Next, you read in the data. Each table will be stored in a different file (whose name is specified with the schema). The internal format of the file is a tuple per line, where the different values are separated by spaces. We will assume that strings do not contain spaces in our database to simplify the task of reading in the data.

Step 2: read the query. Having read the database, your program should now go into a loop where it reads a query and answers it, until the user exits the program. The queries will be typed in by the user. You can also have the ability to read a query from a file.

A query has the following form: (this is a simplified version of the SQL database query language. For the full version, come to CSE-444).

Select A₁,..., A_n
From Table₁,..., Table_m
Where selectionCondition₁,..., selectionCondition_k,
JoinCondition₁,..., joinCondition_m-1
OrderBy A

What this means: The From clause tells you what tables participate in the query. The Where clause gives two types of conditions: the selection predicates (as explained above), and the join predicates. If the From clause has m tables, then there will be m-1 conditions. Every table in the From clause participates in at least one join condition (but some tables may participate in more than one join condition).

The Select clause tells you which attributes to produce in the result of the query (this is the projection operator described above). Finally, the OrderBy clause tells you how to order the output. This clause is optional; if it's not specified, then any ordering suffices.

Step 3: processing the query. Processing the query will proceed as follows:

You first perform the joins between the tables (note; if you have m tables in the From clause, you'll have to perform m-1 joins). The order of the joins does not matter! To perform the joins, you start with two tables, join them and produce an intermediate table. You then take the intermediate table and join it with a third table, etc., until you're done. Let's call the table resulting from performing all the joins T1.
Next, you apply the selection conditions to T1. This results in a table T2 that includes a subset of the tuples in T1.
Next, you apply the OrderBy clause. This results in a table T3 with the same tuples as T2, but in a different order.
Finally, you apply the projection conditions to T3. This results in a table T4 that has a subset of the columns as T3.
You print out: (1) the tuples in T4 (in whatever format you like), and (2) a count of the number of tuples in T4.

Extra credit: introducing optimizations. The method described above is, in general, quite inefficient for processing queries over large amounts of data. One very common optimization is to process some selections before the joins. The reason for this is that the cost of a join may, in the worst case, be proportional to the product of the sizes of the two tables. For example, consider the query

Select name, Dname
From Employee, Dependent
Where SSN=EmployeeSSN and SSN=77777777

This query asks for the name and the name of the dependents of the employee whose social security number is 77777777. While it is possible to first join the Employee table with the Dependent table, and then perform the selection on SSN=77777777, it would be much more efficient to first perform a selection SSN=777777777 on the Employee table, then a selection EmployeeSSN=777777777 on the Dependent table, and finally do a join on the two tables resulting from the selections. The join would then be done on two relatively small tables.

Here, your task is to process the queries, but do the selections as early as possible in the query evaluation. Note: you need to be very careful here about where you place selections to make sure your answer is correct.

Mechanics

The assignment will be done in groups of 3-4 students. You should send email to Maya, Nic and Alon by Monday, the 21st of May telling us the members of your group.
The join operator should be implemented using an algorithm called hash-join. It will be explained in section.
You should hand in a 2-page description of your project on June 1st. The description should specify the data structures and algorithms you used in your project, and any innovations you've incorporated into your database system. You should also specify what was the challenging part about the assignment, and what could be improved.
We will schedule times for demonstrations of your project. All group members need to be at the demo, and it will take about 10-15 minutes. You'll show us some queries running, and answer some general questions about the project. Demonstration times will be on June 1st on June 4th.
Note: you have a lot of freedom in this project. You'll have to make a lot of choices on your own. Whenever you do, make sure there is some justification for it.