CSE 544, Winter 2009, Assignment 1

Using a database management system

Due: Friday, 23 January 2009, 11:00pm

Assignment 1 Clarifications

This assignment can also be done in a group of two students.
Please let us know if you encounter any problems with this assignment. We will post clarifications and fixes here as necessary.

Assignment goal

The goal of this assignment is to ensure that you are comfortable using a relational database management system (RDBMS), that you are comfortable with SQL, and that you know how to write a simple application that uses a DBMS. SQL, the Structured Query Language, is the most widely used relational database language today. It allows you to specify the data you are interested in by using logical expressions.

For the purpose of the class, we will be using the PostgreSQL open-source DBMS. PostgreSQL provides a fairly standard implementation of SQL. Please keep in mind, however, that when you change to a different DBMS, you may need to use different variants of SQL. The differences normally affect only advanced features.

In real life (i.e., outside of class), when you will decide to use a DBMS, nothing will be setup and ready for you. For this reason, in this assignment, you will setup and use a database from scratch. Well, not quite. At least, the DBMS server and some client applications have already been installed for you on cubist.cs.washington.edu. You should log on to cubist to complete the assignment. The assignment will walk you through some of the most important features of relational databases.

Documentation

You may find the following documentation helpful when completing this assignment:

Chapters 1 through 5 of the book by Ramakrishnan and Gehrke (R&G), especially chapter 5. Please read the book to refresh your memory of SQL.
PostgreSQL tutorial. The PostgreSQL tutorial is very nice, except the part that discusses SQL queries (in particular joins). That part will make your life harder than it should. Please read the book instead.
PostgreSQL documentation
Additional online introduction to SQL

Overview

We are going to build and use an inventory database for a toy store. This is a classical type of application that DBMSs have been designed to support and are really good at supporting.

Database schema

Our toy-store database has the following schema:

Product(pid: integer, name: varchar(20), min_age: integer)
Manufacturer(mid: integer, name: varchar(20), address: varchar(50))
Supplier(sid: integer, name: varchar(20), address: varchar(50))
Inventory(pid:integer, stock: integer)
Manufactures(mid:integer, pid: integer)
Supplies(sid: integer, pid: integer)

The Product relation contains information about all the toys sold by the store. The Product.pid column is the primary key for the relation. Product.name is the name of each toy, and Product.min_age indicates the minimum age recommended to play with the toy.

Manufacturer and Supplier list the names and addresses of all toy manufacturers and suppliers respectively. Manufacturer.mid and Supplier.sid are the primary keys for these relations.

Inventory indicates the number of toys in stock. Inventory.pid is a foreign key referring to Product.pid. Inventory.stock indicates the number of toys of the given type available in stock.

Manufactures and Suppliers associate products with their manufacturers and suppliers respectively. Note that there can be only one manufacturer for a product, but there can be many suppliers.

Database content

The content of the database is currently stored in six files corresponding to the six relations above. These files are available as one tar ball that you can download from here: hw1.tar.gz. If you want to get the files directly onto cubist, the easiest might be to log on to cubist and execute the following command:

> wget http://www.cs.washington.edu/education/courses/544/09wi/assignments/hw1.tar.gz

To extract the files, use the following command:

> tar zxvf hw1.tar.gz

You should now see a directory called hw1 that contains the files with all the data. If this procedure did not work for you, please contact the TA or talk to another student in the class.

1. Creating your database (1pt)

Before you start, take a quick look at the "Architecture Fundamentals" section of the PostgreSQL documentation. All of today's DBMSs support this simple client-server architecture. We will discuss the architecture of a DBMS in greater detail in class. You will also need to setup your postgres account and password by going to http://cubist.cs.washington.edu/~evan/reg.cgi

In order to store data in a database, you first need to create a database. The name of your database should be cse544-09wi-yourlogin, where yourlogin is your login on cubist. This naming convention will ensure that each database has a unique name. To create a database, follow these simple instructions.

Write the command that you used in a file called problem1.txt.

Now that you created your database, you can try to access it and play with it as described here.

2. Creating tables (6pt)

Using the following instructions, create the six tables in the inventory database. At this point, do not worry about specifying what attributes are the primary keys or the foreign keys. You will do this in problem 6. Create the tables without any constraints on their content other than the types of the different attributes.

Write all the statements that you used in a file called problem2.txt.

3. Loading data (6 pt)

Now that you created the relations, you can load the data into the database. To insert data into a table, you can use the SQL INSERT statement as demonstrated here.

3.1) Write a SQL statement to insert a new product with pid=-1, name='my product', and min_age=3 into the Product table.

There are, however, 10,000 products in our database. Entering these products one at the time would be somewhat tedious.

In general, when you will use a database, you will be in the same situation: you will start with a large amount of data. For example, you will have data from experiments that you will be running as part of your research projects.

Fortunately, in all DBMSs, it is possible to load a large amount of data from a text file in a single operation. In PostgreSQL, you can perform such a load using the COPY statement. COPY, however, requires administrator privileges. As a user, you need to use the psql meta-command \copy instead. You can read more about psql and its meta-commands here.

3.2) Populate the six tables form the six text files using \copy.

3.3) Write a SQL statement to remove the product that you entered in 3.1 (only that product).

Write all the commands that you used in a file called problem3.txt.

4. SQL Queries (17 pts)

Now that you have a database with data in it, you can use SQL to ask various queries about the data. In this problem, you will practice writing SQL queries.

You can assume that every product in the table Product appears in the table Inventory.

Write SQL queries for the following:

4.1) List the ids and names of all products whose inventory is below 5.

4.2) List the ids and names of all suppliers for products manufactured by "manufacturer_2". The id and name of each supplier should appear only once.

4.3) List the ids, names, and number in stock of all products in inventory. Order the list by decreasing number in stock and decreasing product ids.

4.4) List the ids and names of all products for whom there is only one supplier.

4.5) Find the ids and names of the products with the lowest inventory. Do NOT assume these are always products with an inventory of zero.

4.6) List the id and name of each supplier along with the total number of products it supplies.

4.7) Find the id and name of the manufacturer who produces toys on average for the youngest children.

Store your queries in a file called problem4.txt.

5. The system catalog (0 pts)

A relational DBMS maintains all metadata information about relations (e.g., their schemas, their integrity constraints, their indexes) in a set of special tables called the catalog tables. Storing metadata in the same way as the data itself is a neat design features of relational DBMSs. Catalog tables can be queried with SELECT statements just like any other table. For example, to see the list of relations in the current database you can use the following statement:

# SELECT * FROM pg_tables WHERE tableowner='yourlogin';

To make your life easier, psql provides several meta-commands to look-up information about relations. For example, the command:

# \d

Shows the set of relations in the current database.

# \d Product;

Shows the definition of relation Product.

You can read more about the catalog tables in Section 12.1 of R&G.

You can read more about PostgreSQL's catalog here.

You can read more about psql here.

We encourage you to look-up the definitions of your different tables again after you create integrity constraints and indexes below.

6. Integrity constraints (18 pts)

It is possible to restrict the data that can be stored in a database by defining integrity constraints: i.e., conditions on the database schema. Only database instances that satisfy the integrity constraints are allowed to exist. The DBMS will ensure that this property holds.

In our schema, we have designated some columns to be the primary key of a relation and other columns to be a foreign key referencing another relation. These constraints are called key constraints. They are the most fundamental type of integrity constraints. To enforce these constraints, the DBMS must know about them. Key constraints are normally specified when first defining a table using the CREATE TABLE statement. If a table already exists, its definition can be modified using the ALTER TABLE statement.

6.1) Modify the schemas of relations Product, Manufacturer, and Supplier to specify that pid, mid, and sid are the respective primary keys for these relations. Use the ALTER TABLE statement. If you already specified the primary key constraint when you created the relation, you do not need to do anything. Write the exact statement that you used (either the ALTER TABLE statement or a copy of your CREATE TABLE statement) in a file called problem6.txt.

6.2) Similarly, modify the schemas of relations Inventory, Manufactures, and Supplies to specify the foreign key constraints. Once again, use the ALTER TABLE syntax or do nothing if you already specified the constraints when you created the relations. Add the exact statement that you used (either the ALTER TABLE statement or a copy of your CREATE TABLE statement) to the file called problem6.txt.

6.3) Try to insert a product into the Product table with an identifier that already exists in the table. Try to insert an entry in the Inventory table for a product that does not exist in the Product table. What happens when you try to execute these operations? What would happen if the tables did not have the key constraints specified on them? Add the operations that you performed and your answer to the question to the file called problem6.txt.

6.4) What happens if you try to delete a product from the Product table without first deleting it from the Inventory, Supplies, and Manufactures table? Is there a way to ask the DBMS to remove these related entries automatically? (Refer to section 3.3 or Ramakrishnan and Gehrke). Add your answer to the file problem6.txt.

DBMSs support more general and more sophisticated types of integrity constraints. You can read more about complex integrity constraints in Section 5.7 of R&G.

6.5) (Hard question) It is often useful to have the DBMS perform some actions automatically in response to operations on the database. For example, if a product is discontinued, a user will remove it from the Supplies table. We would like, however, to keep this product in the database until its inventory goes to zero. Once the product is out of stock, it should be removed from the Product and Inventory tables. We can make the DBMS perform the latter two deletes automatically by defining a trigger on the database. You can read more about triggers in Section 5.8 of R&G. Add a trigger to the database that performs the following function: if the quantity of a product in stock goes down to zero and there is no supplier for the product, remove the product from the Product and Inventory tables. You will need to read the PostgreSQL documentation on triggers. We encourage you to use PL/pgSQL to write your trigger function (example of trigger using PL/pgSQL). This question is also a good opportunity to learn about PL/pgSQL.

You do not need to worry about the cases when:

a supplier is deleted or
a new product is added with no supplier and 0 stock.

7. Indexes (14pts)

Databases typically contain large amounts of data. Because this data must persist even if the DBMS goes down, it is stored on disk. To answer a query, the DBMS must thus read data from disk, which can be slow if a lot of data must be read. Indexes are data structures that can speed-up this process by enabling the DBMS to read only a subset of all data from disk when answering a query. You can read more about indexes in Chapters 8 through 11 of R&G. We will also discuss them in class.

When you altered the schemas of relations Product to specify the primary key constraint, you might have noticed the following message:

NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "product_pkey" for table "product"
ALTER TABLE

This message indicates that PostgreSQL automatically creates an index on the attributes that serve as the primary key of a relation. This index speeds-up queries that search for tuples by their primary key. In the case of PostgreSQL, this index is also the mechanism that enforces the primary key constraint.

In this question, we will practice using indexes on a single large table. Create a second table, called ProductBig, with the following schema. At first, do not specify that pid is a primary key. Load data into this table from the file called /cse/courses/cse544/09wi/public/productbig.dat (beware that the load will take several minutes).

ProductBig(pid: integer, name: varchar(20), description: varchar(100), rating: integer, min_age: integer)

Consider the following query:

SELECT * FROM ProductBig WHERE pid=9876543;

7.1) Try to execute this query and see how long it takes. You can use the command \timing to have PostgreSQL time your queries. How long does the query take? How many tuples from relation ProductBig must be read from disk in order to answer this query? Write your answers in a file called problem7.txt.

7.2) Now modify the schema of ProductBig to indicate that pid is a primary key. Execute the above query again. How long does it take this time? Why is that? Add your answer to the file called problem7.txt.

Note: you can examine the query execution plan that the DBMS uses to answer the query. To see the plan, use the EXPLAIN command. You can find more information about EXPLAIN in the PostgreSQL documentation about EXPLAIN. We will talk about query plans starting from lecture 7. In the meantime, you can at least see if the DBMS uses the index or not when answering your query. You do not have to use EXPLAIN. Use it only if it helps you understand what is going on.

7.3) An index on the primary key speeds-up only a specific type of queries, which may not necessarily correspond to the query workload of the system. Consider the following query:

SELECT count(*) FROM ProductBig WHERE rating > 98;

Try to execute this query. How long does it take? How many tuples from relation ProductBig must be read from disk in order to answer this query? Write your answer in a file called problem7.txt.

7.4) Create an index on ProductBig that will speed-up the query above. Add the SQL statement that you use to create the index and an explanation for your choice of index to the file problem7.txt. How long does the query take now?

7.5) Delete the index on rating and add the statement that you used to problem7.txt.

7.6) Since indexes are so great, why not create an index on every single attribute of ever single table? Add your explanation to the file problem7.txt.

When you are done working on this question, please drop the ProductBig table to avoid using all that disk space.

8. Views (8pts)

A view is a table whose content is not physically stored on disk. Instead, it is computed from a query that corresponds to the view definition.

8.1) Create a view that shows only toys in the Product table for children over age 6. Write the SQL statement that you use in a file called problem8.txt.

8.2) Write a query that uses the newly created view to find all toys for children over age 6 that contain the substring '666' in their name. Add the SQL statement that you used in a file called problem8.txt.

8.3) Create a view that shows the id and names of toys along with the ids and names of their manufacturers. Add the SQL statement that you used to the file called problem8.txt.

8.4) What do you think are some of the benefits of using views? Add your answer to problem8.txt.

8.5) What happens if you try to insert a tuple into one of those views? Should users be allowed to perform inserts or updates on views? Add your answer to problem8.txt.

9. Appilcations that use a DBMS (20pts)

The goal of this final question is to ensure that you are comfortable writing an application that use a DBMS as a back-end. This question requires that you write a small amount of Java code.

You may find the following documentation helpful when completing this question:

Chapters 6 ("Database Application Development") and 7("Internet Applications") of the book by Ramakrishnan and Gehrke (R&G), Third Edition. Chapter 6 includes an introduction to JDBC with examples.
JDBC Documentation
The PostgreSQL JDBC Interface documentation.
An example to get you started.

To use JDBC, you need to have the jar file with the Postgres JDBC driver in you classpath. Set your classpath as follows if you're using tcsh:

> setenv CLASSPATH /usr/share/java/postgresql-jdbc4.jar:.

or as follows for bash:

> export CLASSPATH=/usr/share/java/postgresql-jdbc4.jar:.

Create a class called ToyDB. Executing:

> java ToyDB

on cubist should produce the following menu:

   *** Please enter one of the following commands ***
  > list
  > list <age>
  > describe <pid>
  > inventory <pid>
  > suppliers <pid>
  > manufacturer <pid>
  > quit
  >

These commands should perform the following actions:

list: lists all the products in the Product relation.

list <age>: shows all the toys where the min_age is less than or equal to <age>.

describe <pid>: shows the detailed description of a product given its <pid>.

inventory <pid>: shows the number of items left in stock for product <pid>.

suppliers <pid>: shows the id and name of all suppliers for product <pid>.

manufacturer <pid>: shows the id and name of the manufacturer for product <pid>.

quit: quits the application.

Your application should perform some rudimentary error checking, but we will not test it on erroneous inputs. Please do not spend time making things bullet-proof.

To ensure you work directly on the interesting parts of this assignment, we provide you a starter file ToyDBMenu.java that includes the code to print the menu and read user commands. The ToyDBMenu class also shows samples of how the results should be printed. You only need to rename the file and add the code to connect to the database, execute the appropriate SQL statements, and display the results.

Please write your program such that it reads the database name, login, and password necessary to connect to the database from a file. The format of the file should be:

    jdbc:postgresql:cse544-09wi-yourlogin
    cse544-09wi-yourlogin
    password

This will help us when we test your assignment.

Testing your application

Inside hw1.tar.gz, we provide you a few test files called toydbX.test. Executing

> java ToyDB < toydbX.test

Should produce the same output as the one shown in toydbX.out.

We will test your application on inputs similar to toydbX.test.

Turn-in instructions

As for all the assignments, you can work on this assignment in a group of two. Make sure to turn-in the following files. Each team should submit one set of files only. Please include a text file with the name of both partners:

partner-names.txt
problem1.txt
problem2.txt
problem3.txt
problem4.txt
nothing to turn-in for problem 5
problem6.txt
problem7.txt
problem8.txt
All the files that comprise the application that you built in problem 9.

To hand-in your assignment, issue the following command under the directory where you saved the above files:

tar zcvf cse544-09wi-youraccountname.tar.gz problem?.txt partner-names.txt problem-9-files

After executing this command, you should see a file named cse544-09wi-youraccountname.tar.gz. Please send this file as an email attachment to your TA.