Project 3: Database Administration

Due: Wednesday, May 13 2010, Midnight

ESTIMATED TIME: 12 hours.

TURN IN INSTRUCTIONS: Turn in tune.sql, question2.txt, trigger.sql, access-control.sql using the regular assignment drop box.

WHAT YOU WILL LEARN

Some basic DBMS administration

  1. Database tuning: given a database and a workload, tune the database to make the workload faster.
  2. Making the database active: add a trigger to a database that automatically reacts to actions performed on the database.
  3. Security: protect the data inside a database using views and access control lists.

Note that this assignment covers topics complementary to what we discussed in class and in sections. Therefore, a large fraction of time for this assignment should be spent reading documentation (we include all necessary links below) and figuring out how to perform the various administrative tasks. We took this time into account when preparing the assignment. Learning how to learn more about DBMSs is an important skill to acquire. We also want to make sure that you know there is a lot more to DBMSs than we have time to cover in CSE 444.

SETUP

POSTGRES, JDBC DRIVERS, CLASSPATH: You need the same setup as in project 2. See instructions there.

STARTER CODE: Please download project3.tar.gz. This archive contains:

CREATE YOUR DATABASE

createdb project3

RUN THE STARTER CODE

  1. Edit dbconn.config to update the username (you don't need to update the password unless you added a password to your username when you set up postgres).
  2. Run the following commands:
    javac DatabaseGenerator.java
    javac TestQueries.java
    java DatabaseGenerator 100
    psql -f importDatabase.sql project3     
    java TestQueries 1
    
    Note 1: Be aware that the last two lines will take 2-3 minutes each on the lab machines. 
    
    Note 2: You can also run the script from within psql with the following command: \i importDatabase.sql 
    You may need to specify the absolute path to the script and also modify importDatabase.sql to
    add absolute pathnames to all filenames.
    
The numbers 100 and 1 represent the scale factors; normally you wouldn't change them, but if you want you may increase them to test your database tuning solution more extensively.

If you get any error up to here, contact one of the TAs.

PROJECT DESCRIPTION

You are the database administrator for the UW Medical Center (UWMC). In this project, the UWMC database is generated by the Java program DatabaseGenerator.java, and imported by the script importDatabase.sql.

The database has the following schema:

Patient(pid INTEGER, fname VARCHAR(20), lname VARCHAR(20), age INT, street VARCHAR(20), city VARCHAR(10), zipcode VARCHAR(5))
Disease(pid INTEGER, disease VARCHAR(20))
Doctor(did INTEGER, fname VARCHAR(20), lname VARCHAR(20), specialty VARCHAR(20)) 
Sees(pid INTEGER,did INTEGER) 
Product(eid INTEGER, description VARCHAR(20)) 
Stock(eid INTEGER, quantity INTEGER) 
Supplier(sid INTEGER, name VARCHAR(20), street VARCHAR(20), city VARCHAR(10), zipcode VARCHAR(5)) 
Supplies(eid INTEGER,sid INTEGER) 
The Patient, and Doctor tables contain information about the patients and doctors, respectively. The Disease table stores which patient has which disease(s). The Sees table says which patient is seeing which doctor(s). n.b. For the Patient table, it is usually better to use date-of-birth instead of age since the d.o.b. does not change every year! However, for the purposes of this problem, it is easier to use age.

The Product table contains information about the medical supplies and equipment that the UWMC has. The Stock table maintains the remaining quantity of each product. The Supplier table contains information about the supplier companies, whereas the Supplies table says which products are supplied by which supplier companies.

For each relation, the attibutes that form the primary key are underlined.

Additionally,

1. Database Tuning

The staff at UWMC access the database extensively by running TestQueries.java. In TestQueries.java, you'll see that there are six kinds of queries that they run very often (e.g. finding the list of doctors a specific patient has seen, finding the number of patients with a specific disease). You are in charge of administrating the database and optimizing it for performance. As you saw, TestQueries.java takes a couple minutes to run: you need to improve that.

Create a file tune.sql where you put some star dust to sprinkle over the database that makes TestQueries run faster. After your star dust has settled, TestQueries will run much faster and your customers will love you. The UWMC is paying you a lot of money, so they would like the TestQueries to run at least 20 times faster. Although they currently do not have any INSERT statements in their workload, the database should be able to support them without a significant cost. Therefore, you are allowed to use no more than 7-8 indexes (in addition to the default ones created by postgres). Also, for each index, explain in 1-2 sentences (as a comment in tune.sql) why you use that particular index (as opposed to another one - if applicable). If you want a challenge: on the lab machines, there is a solution that improves things by 30x using only 6 indexes. Can you beat it?

In this assignment, you will only tune the performance by adding indexes. You should create indexes on one or more attributes. You may create several indexes per table, and may decide which one to make a primary index. Be aware that, by default, Postgres builds a primary index on the primary key of each table. You should focus ONLY on optimizing the workload TestQueries.java. Some other queries may become slower as a result of your optimization, but that's OK as long as TestQueries runs faster.

Note that in practice, physical database tuning is more involved and includes table partitioning, materialized views, and other.

To create an index, use the following command:

CREATE INDEX index_name ON table_name(column_name1, column_name2, ..., column_namek); 

To make an index be the primary index, use the following command:

CLUSTER index_name ON table_name;
Hint: you may want to check what plan postgres picks for a given query, and for that you may use explain, e.g.
explain select count(*) from Supplier where city='kirkland';

TESTING AND RUNNING YOUR SOLUTION For this question, you need to create a file tune.sql that you will turn-in. After you create the file or update it, to test your solution do the following:

createdb project3
psql -f importDatabase.sql project3
psql -f tune.sql project3
java TestQueries 1

If you need to start from scratch (i.e. remove your indexes and get back postgres's default indexes), the dropdb command may also be helpful. You can use it as:

dropdb project3

2. Making the database active

Note: This question (and the next one too) is difficult because you will learn something that we did not discuss in class nor in sections. The instructions below point you to all the documentation that you need. If you have further questions, please do not hesitate to post general questions on the message board or send us email. This should be a worthwhile learning experience.

Motivation: It is often useful to have the DBMS perform some actions automatically in response to operations on the database. The UWMC database already does that to some extent. For example, if you delete a patient from the database, the corresponding entry in the Sees relation will also be deleted automatically

2.1 Maintaining referential integrity in face of updates

Examine the CREATE TABLE statements in importDatabase.sql. What actions will the DBMS perform automatically to maintain referential integrity in face of insert/delete/update statements performed on the various tables? Please write your answer in a file called question2.txt and please remember to turn in that file (you may want to review Lecture 4).

2.2 Triggers

Sometimes, we would like the DBMS to perform more complex actions in response to more complex events. For example, if a product is no longer being used, a nurse will remove it from the Supplies table. We would like, however, to keep this product in the database until its stock goes to zero. Once the product is out of stock, it should be removed from the Product and Stock tables. We can make the DBMS perform the latter two deletes automatically by defining a trigger on the database. You can read more about triggers in Section 7.5 of the book [Old edition: Section 7.4].

Add a trigger to the database that performs the following function: if the quantity of a product in stock goes down to zero and there is no supplier for the product, remove the product from the Product and Stock tables.

You do not need to worry about the cases when:

Before you start:

  1. Read the PostgreSQL documentation on triggers (Sections 35.1 and 35.2 only; ignore the C example).
  2. Read the overview of PL/pgSQL .
  3. Read the example of trigger using PL/pgSQL.
  4. Take a look at the sample trigger we provide you in trigger-sample.sql.
The sample trigger is very similar to the one that you need to write. This sample trigger prints an error message when someone updates the stock for an item that no longer appears in the Supplies table. As a comment inside trigger-sample.sql, you will find the detailed instructions for loading and running this sample trigger. To write your trigger, you should only need to add about three lines of code to this example and you can remove the error message if you'd like.

Here are the steps you need to take to complete this task:

  1. Tell PostgreSQL that you want to use PL/pgSQL with your database:
     CREATE LANGUAGE plpgsql 
  2. Put your trigger definition in a file called trigger.sql (to start, you can simply copy & paste the contents of trigger-sample.sql).
  3. Load your trigger definition with (note that you may need to specify the absolute path):
     \i trigger.sql 
  4. Add your trigger with the CREATE TRIGGER statement as shown in the example and also in the file "trigger-sample.sql".

This question is also a good opportunity to learn about PL/pgSQL and the concept of stored procedures (optional).

TESTING AND RUNNING YOUR SOLUTION: Try to perform actions on the database and verify that the trigger fires at the appropriate times and performs the appropriate actions.

3. Security

Hint: For several of the links we provide for this problem, you can scroll down to the bottom of the linked page to see a few examples - which can be helpful to see the syntax of the statement.

There is nothing about access control and security in our book, but you can find more information here:

3.1 Basic database access control

The hospital enforces a strict access control policy to its database. It distinguishes four kinds of users:

  1. doctors
  2. nurses
  3. administrators
  4. members of the public

The hospital also classifies the data into three overlapping categories:

  1. MEDICAL DATA: data about patients, their doctors, and their diseases
  2. SUPPLY DATA: data about products, their suppliers and their stocks
  3. PUBLIC DATA: doctors' names, their speciality, the diseases treated in this hospital, and the names of the suppliers for the hospital. Patient data is never public.

The access control policy is the following. The public is the least privileged: doctors, nurses, and admin are allowed to access everything that the public can access.

Your task is to write the SQL access policy statements that enforce this policy. Please procede as follows.

Please add all your commands to a file called access-control.sql and hand-in this file.

3.1.1 Creating roles and users

As a first step, create one database role for each of the three kinds of users: doctors, nurses, and administrators. Call these roles "doctor", "nurse", and "uwmcadmin" respectively (make sure not to use capital letters, they seem to cause problems). A PUBLIC role already exists. To create roles, use the CREATE ROLE statement. You should not need to use any of the options available with this statement. Database roles are like groups: you will grant privileges to roles and will associate users with roles.

Now that you have the roles, create one test user for each role (doctor1, nurse1, admin1, and guest1) using the CREATE USER statement. Make sure that:

Note: guest1 need not be associated with any special roles.

In addition to the PostgreSQL documentation website, you can also use the postgres help command to get the correct syntax for these statements, e.g. \h CREATE ROLE.

TESTING AND RUNNING YOUR SOLUTION: Try to connect to the project3 database as each user. Verify that you can connect but that you cannot view any data nor perform any inserts/updates at this time.

3.1.2 Creating public views and granting access to these views

Create one or more views on the database. These views should expose only information classified as PUBLIC DATA. They should hide everything else.

Grant access this these views to the appropriate role(s). For this, use the GRANT STATEMENT. Consider the following question: do you need to grant access explicitly to all roles? Or is it enough to grant access to PUBLIC?

TESTING AND RUNNING YOUR SOLUTION: Verify that all your users can see the public data.

3.1.3 Creating other views and granting other privileges

Create other views and grant other privileges as appropriate to implement the UWMC access control policies. Note that you do not always have to create views. When appropriate, you can grant access rights directly to existing tables.

TESTING AND RUNNING YOUR SOLUTION: To test your solution, try to run psql as each of the four users and check what operations you are allowed to perform.

3.2 The challenge of data anonymization

Because UWMC is affiliated with a university, it wants to support medical research. This is a difficult requirement. On one hand, UWMC wants to keep all patient data confidential. On the other hand, it wants to show the most detailed data possible to researchers.

To address this requirement, UWMC decides to add another role, called "researcher". A researcher should have read-only access to a view with the following schema:

DiseaseResearch(zipcode VARCHAR(5), age INTEGER, disease VARCHAR(20))

This table indicates the diseases for patients with different ages and at different locations. It does not include any personally identifiable information.

Please add the "researcher" role with the appropriate view and access control rights. Create a user researcher1 to test your solution. Remember to add all commands to the access-control.sql file.

As a smart administrator, however, you uncover the following problem: most patients at UWMC are also registered voters in the state of Washington and the voter's registry is publicly available. The voter's registry has the following schema

Voter(vid INTEGER, fname VARCHAR(20), lname VARCHAR(20), age INT, zipcode VARCHAR(5))

A voter's registry was generated along with the tables in the UWMC database. Please run the command:

psql -f importVoter.sql project3

to load it into your DBMS. (If you deleted the text files for the various database tables, run java DatabaseGenerator 100 to get a new set of tables. Note that you will have to reimport both the UWMC database and Voter table after this.)

Grant public read access to this new table (this is a public database).

Show how a researcher can now uncover with high likelihood the disease(s) that certain people have. Include the SQL Statement(s) that you write into the access-control.sql file.

Please discuss your findings (is this problem significant? how could it be solved? etc.). Add your answer as a comment into the file access-control.sql