Project 3: Database Administration

ESTIMATED TIME: 4 hours.

DUE DATE: Wednesday, 11/5, 11:59pm

TURN IN INSTRUCTIONS: Turn in the file tune.sql to the Catalyst drop box.

WHAT YOU WILL LEARN

Some basic database tuning: given a database and a workload, tune the database to make the workload faster.

SETUP

POSTGRES, JDBC DRIVERS, CLASSPATH: You need the same setup as in project 2. See instructions there.

STARTER CODE: DatabaseGenerator.java, importDatabase.sql, TestQueries.java, Dowload them to your local directory. You should not modify any of them. You will examine importDatabse.sql and TestQueries.java.

FORCING POSTGRES TO OPTIMIZE DECENTLY: Find the file postgresql.conf (it is is normally in the directory where your postgres data file resides). Set constraint_exclusion = on (you also need to uncomment this line). If your server is already running, you need to reload the config file, by running pg_ctl reload -D [your-data-file].

CREATE YOUR DATABASE

createdb project3

RUN THE STARTER CODE Edit TestQueries.java to update the username and password, then:

javac DatabaseGenerator.java
javac TestQueries.java
java DatabaseGenerator 100
psql -f importDatabase.sql project3
java TestQueries project3 1
The numbers 100 and 1 represent the scale factors; normally you wouldn't change them, but if you want you may increase them test your solution more extensively.

If you get any error up to here, contact Michael or myself.

TESTING AND RUNNING YOUR SOLUTION All you have to do in this project is to create a file tune.sql; you will turn in only this file. After you create the file or update it, to test your solution do this:

dropdb project3
createdb project3
psql -f importDatabase.sql project3
psql -f tune.sql project3
java TestQueries project3 1
You need to drop the database then re-create it because TestQueries performs some insertions/deletions, so the database is no longer the same.

PROJECT DESCRIPTION

You are now a database administrator. The database is generated by the Java program DatabaseGenerator.java, and imported by the script importDatabase.sql. The database has four tables, and in each the attribute K is a key:
R(K,A,B,C,D,E)
S(K,A,B,C), T(K,D, E)   -- S.B is a foreign key to T.K
U(K,A,B)
You don't care what the database means or contains, all you care about this database is that there are some users that access it extensively by running TestQueries.java. You are in charge of administrating the database and optimizing it for performance. As you saw, TestQueries.java takes several minutes to run: you need to improve that.

Create a file tune.sql where you put some star dust to sprinkle over the database that makes TestQueries run faster. The customers don't care what you do to the database, but after your star dust has settled, TestQueries will run much faster and your customers will love you.

Here are the rules.

  1. You are restricted to doing the following:
    1. Create an index on one ore more attributes. You may create several indexes per table, and may decide which one to make a primary index.
    2. Split a table horizontally, by some attribute value; see the lecture notes.
    3. Split the table vertically; see the lecture notes.
    4. Create a materialized view.
  2. You may modify the database schema, by creating and populating new tables, but your database must still contain four tables or views called R, S, T, U with the same schema as before.
  3. You may not drop data. The content of R, S, T, U must remain unchanged, even though some of these may become views in your solution.
  4. You may not modify TestQueries.java. The application must continue to run, even without recompiling it.
  5. You may focus on optimizing ONLY on the workload TestQueries.java. Some other queries may become slower as a result of your optimization, but that's OK as long as TestQueries runs faster.
Hint: you may want to check what plan postgres picks for a given query, and for that you may use explain, e.g.
explain select count(*) from R where A = 88;

COMMENTS

Keep in mind that this is designed to be a simple project: there are no gotchas here. A reasonable solution should make TestQueries run about 5-6 times faster. Based on the material covered in class, there is one rather canonical solution, and that's all you need to find; several variations on this canonical solution are also acceptable. What this means is that once you find one solution that makes sense to you, stop experimenting with alternatives just to see if they run faster.

An unpleasant surprise is postgres does not always cooperate with your instructor's lectures. Those lectures assume that SQL does certain optimizations, but postgres does not do them. I managed to find out how optimize horizontal partitions, and we will discuss this in class. (If you want to read yourself, you'll find it here http://www.postgresql.org/docs/8.3/interactive/ddl-partitioning.html, but don't worry about it, we'll discuss this in class). I did not figure out how to do vertical partitions, and the lack of optimization does affect performance. Here we will cheat a bit, details in class.