ESTIMATED TIME: 15 hours.
DUE DATE: Tuesday, May 26th, 6:30pm
TURN IN INSTRUCTIONS: Turn in tune.sql, question2.txt, trigger.sql, and access-control.sql to the Catalyst drop box.
Some basic DBMS administration
Note that this assignment covers topics which we only briefly
discussed in class. Therefore, a lot of time for this assignment will
be spent reading documentation (we include all necessary links below)
and figuring out how to perform the various administrative tasks. We
took this time into account when preparing the assignment. Learning how
to learn more about DBMSs is an important skill to acquire. We also
want to make sure that you know there is a lot more to DBMSs than we
have time to cover in CSE P544.
POSTGRES, JDBC DRIVERS, CLASSPATH: You need the same setup as in Homework 3.
STARTER CODE: Please download hw5.zip. This archive contains:
CREATE YOUR DATABASE
createdb -U postgres hw5
RUN THE STARTER CODE
javac DatabaseGenerator.java
javac TestQueries.java
java DatabaseGenerator 100
psql -U postgres -f importDatabase.sql hw5
java TestQueries 1
You are the database administrator for the UW Medical Center (UWMC). In this project, the UWMC database is generated by the Java program DatabaseGenerator.java, and imported by the script importDatabase.sql.
The database has the following schema:
Patient(pid INTEGER, fname VARCHAR(20), lname VARCHAR(20), age INT, street VARCHAR(20),The Patient, and Doctor tables contain information about the patients and doctors, respectively. The Disease table stores which patient has which disease(s). The Sees table says which patient is seeing which doctor(s). n.b. For the Patient table, it is usually better to use date-of-birth instead of age since the d.o.b. does not change every year! However, for the purposes of this problem, it is easier to use age.
city VARCHAR(10), zipcode VARCHAR(5))
Disease(pid INTEGER, disease VARCHAR(20))
Doctor(did INTEGER, fname VARCHAR(20), lname VARCHAR(20), specialty VARCHAR(20))
Sees(pid INTEGER,did INTEGER)
Product(eid INTEGER, description VARCHAR(20))
Stock(eid INTEGER, quantity INTEGER)
Supplier(sid INTEGER, name VARCHAR(20), street VARCHAR(20), city VARCHAR(10), zipcode VARCHAR(5))
Supplies(eid INTEGER,sid INTEGER)
The Product table contains information about the medical supplies and equipment that the UWMC has. The Stock table maintains the remaining quantity of each product. The Supplier table contains information about the supplier companies, whereas the Supplies table says which products are supplied by which supplier companies.
For each relation, the attributes that form the primary key are underlined.
Additionally,
The staff at UWMC access the database extensively by running TestQueries.java. In TestQueries.java, you'll see that there are six kinds of queries that they run very often (e.g. finding the list of doctors a specific patient has seen, finding the number of patients with a specific disease). You are in charge of administrating the database and optimizing it for performance. As you saw, TestQueries.java takes a couple minutes to run: you need to improve that.
Create a file tune.sql where you put some star dust to sprinkle over the database that makes TestQueries run faster. After your star dust has settled, TestQueries will run much faster and your customers will love you. The UWMC is paying you a lot of money, so they would like the TestQueries to run atleast 3 times faster. Although they currently do not have any INSERT statements in their workload, the database should be able to support them without a significant cost. Therefore, you are allowed to use no more than 7-8 indexes (in addition to the default ones created by postgres). If you want a challenge: Our solution improves things by over 10 times using 7 indexes. Can you beat it?
In this assignment, you will only tune the performance by adding indexes. You should create indexes on one or more attributes. You may create several indexes per table, and may decide which one to make a primary index. Be aware that, by default, Postgres builds a primary index on the primary key of each table. You should focus ONLY on optimizing the workload TestQueries.java. Some other queries may become slower as a result of your optimization, but that's OK as long as TestQueries runs faster.
Note that in practice, physical database tuning is more involved and
includes table partitioning, materialized views, and other things. Open
a psql window using the following command:
psql -U postgres hw5
To create an index, use the following command:
CREATE INDEX index_name ON table_name(column_name1, column_name2, ..., column_namek);
To make an index be the primary index, use the following command:
CLUSTER index_name ON table_name;Hint: you may want to check what plan postgres picks for a given query, and for that you may use
explain select count(*) from Supplier where city='kirkland';
TESTING AND RUNNING YOUR SOLUTION For this question, you need to create a file tune.sql that you will turn-in. After you create the file or update it, to test your solution do the following:
psql -U postgres -f tune.sql hw5
java TestQueries 1
If you need to start from scratch (i.e. remove your indexes and get back postgres's default indexes), the dropdb command may also be helpful. You can use it as:
dropdb -U postgres hw5
Motivation: It is often useful to have the DBMS perform some actions automatically in response to operations on the database. The UWMC database already does that to some extent. For example, if you delete a patient from the database, the corresponding entry in the Sees relation will also be deleted automatically
Examine the CREATE TABLE statements in importDatabase.sql. What actions will the DBMS perform automatically to maintain referential integrity in face of insert/delete/update statements performed on the various tables? Please write your answer in a file called question2.txt and please remember to turn in that file (you may want to review Lecture 3).
Sometimes, we would like the DBMS to perform more complex actions in response to more complex events. For example, if a product is no longer being used, a nurse will remove it from the Supplies table. We would like, however, to keep this product in the database until its stock goes to zero. Once the product is out of stock, it should be removed from the Product and Stock tables. We can make the DBMS perform the latter two deletes automatically by defining a trigger on the database. You can read more about triggers in Section 5.8 of the book.
Add a trigger to the database that performs the following function: if the quantity of a product in stock goes down to zero and there is no supplier for the product, remove the product from the Product and Stock tables.
You do not need to worry about the cases when:
Before you start:
Here are the steps you need to take to complete this task:
CREATE LANGUAGE plpgsql
\i trigger.sql
This question is also a good opportunity to learn about PL/pgSQL and the concept of stored procedures (optional).
TESTING AND RUNNING YOUR SOLUTION: Try to perform actions on the database and verify that the trigger fires at the appropriate times and performs the appropriate actions.
For this problem, you may want to first review the last section of the slides of Lecture 6. For several of the links provided below, you can scroll down to the bottom of the linked page to see a few examples - which can be helpful to see the syntax of the statement.
The hospital enforces a strict access control policy to its database. It distinguishes four kinds of users:
The hospital also classifies the data into three overlapping categories:
The access control policy is the following. The public is the least privileged: doctors, nurses, and admin are allowed to access everything that the public can access.
Your task is to write the SQL access policy statements that enforce
this policy.
Please add all your commands to a file called access-control.sql and turn-in this file.
3.1.1 Creating roles and users
As a first step, create one database role for each of the three kinds of users: doctors, nurses, and administrators. Call these roles "doctor", "nurse", and "uwmcadmin" respectively (make sure not to use capital letters, they seem to cause problems). A PUBLIC role already exists. To create roles, use the CREATE ROLE statement. You should not need to use any of the options available with this statement. Database roles are like groups: you will grant privileges to roles and will associate users with roles.
Now that you have the roles, create one test user for each role (doctor1, nurse1, admin1, and guest1) using the CREATE USER statement. Make sure that:
Note: guest1 need not be associated with any special roles.
In addition to the PostgreSQL documentation website, you can also use the postgres help command to get the correct syntax for these statements, e.g. \h CREATE ROLE.
TESTING AND RUNNING YOUR SOLUTION: Try to connect to the hw5 database as each user. Verify that you can connect but that you cannot view any data nor perform any inserts/updates at this time.
3.1.2 Creating public views and granting access to these views
Create one or more views on the database. These views should expose only information classified as PUBLIC DATA. They should hide everything else.
Grant access to these views to the appropriate role(s). For this, use the GRANT STATEMENT. Consider the following question: do you need to grant access explicitly to all roles? Or is it enough to grant access to PUBLIC?
TESTING AND RUNNING YOUR SOLUTION: Verify that all your users can see the public data.
3.1.3 Creating other views and granting other privileges
Create other views and grant other privileges as appropriate to implement the UWMC access control policies. Note that you may not always need to create views. When appropriate, just grant access rights directly to the tables.
TESTING AND RUNNING YOUR SOLUTION: To test your solution, try to run psql as each of the four users and check what operations you are allowed to perform.
Because UWMC is affiliated with a university, it wants to support medical research. This is a difficult requirement. On one hand, UWMC wants to keep all patient data confidential. On the other hand, it wants to show the most detailed data possible to researchers.
To address this requirement, UWMC decides to add another role, called "researcher". A researcher should have read-only access to a view with the following schema:
DiseaseResearch(zipcode VARCHAR(5), age INTEGER, disease VARCHAR(20))
This table indicates the diseases for patients with different ages and at different locations. It does not include any personally identifiable information.
Please add the "researcher" role with the appropriate view and access control rights. Create a user researcher1 to test your solution. Remember to add all commands to the access-control.sql file.
As a smart administrator, however, you uncover the following problem: most patients at UWMC are also registered voters in the state of Washington and the voter's registry is publicly available. The voter's registry has the following schema
Voter(vid INTEGER, fname VARCHAR(20), lname VARCHAR(20), age INT, zipcode VARCHAR(5))
A voter's registry was generated along with the tables in the UWMC database. Please run the command:
psql -U postgres -f importVoter.sql hw5
to load it into your DBMS. (If you deleted the text files for the
various database tables, run java DatabaseGenerator 100
to get a new set of tables. Note that you will have to reimport both
the UWMC database and Voter table after this.)
Grant public read access to this new table (this is a public database).
Show how a researcher can now uncover with high likelihood the disease(s) that certain people have. Include the SQL Statement(s) that you write into the access-control.sql file. Briefly discuss your findings (is this problem significant? how could it be solved? etc.). Add these as comments into access-control.sql.