CSE 544, Fall 2007, Assignment 1

Using a database management system

Due: Friday, 19 October 2007, 11:00pm

Assignment 1 Clarifications

For Q4: you can assume that every product appears in the inventory table.
There was a typo in Q7.3 and Q7.4 - we changed all occurrences of Product to ProductBig.
For Q6.5: You do not need to worry about the cases when
- a supplier is deleted OR
- a new product is added with no supplier and 0 stock.

Assignment goal

The goal of this assignment is to ensure that you are comfortable using a relational database management system (RDBMS) and that you are comfortable with SQL. SQL, the Structured Query Language, is the most widely used relational database language today. It allows you to specify the data you are interested in by using logical expressions.

For the purpose of the class, we will be using the PostgreSQL open-source DBMS. PostgreSQL provides a fairly standard implementation of SQL. Please keep in mind, however, that when you change to a different DBMS, you may need to use different variants of SQL. The differences normally affect only advanced features.

In real life (i.e., outside of class), when you will decide to use a DBMS, nothing will be setup and ready for you. For this reason, in this assignment, you will setup and use a database from scratch. Well, not quite. At least, the DBMS server and some client applications have already been installed for you on cubist.cs.washington.edu. You should log on to cubist to complete the assignment. The assignment will walk you through some of the most important features of relational databases.

Documentation

You may find the following documentation helpful when completing this assignment:

Chapters 1 through 5 of the book by Ramakrishnan and Gehrke (R&G), especially chapter 5. Please read the book to refresh your memory of SQL.
PostgreSQL tutorial. The PostgreSQL tutorial is very nice, except the part that discusses SQL queries (in particular joins). That part will make your life harder than it should. Please read the book instead.
PostgreSQL documentation
Additional online introduction to SQL

Overview

We are going to build and use an inventory database for a toy store. This is a classical type of application that DBMSs have been designed to support and are really good at supporting.

Database schema

Our toy-store database has the following schema:

Product(pid: integer, name: varchar(20), min_age: integer)
Manufacturer(mid: integer, name: varchar(20), address: varchar(50))
Supplier(sid: integer, name: varchar(20), address: varchar(50))
Inventory(pid:integer, stock: integer)
Manufactures(mid:integer, pid: integer)
Supplies(sid: integer, pid: integer)

The Product relation contains information about all the toys sold by the store. The Product.pid column is the primary key for the relation. Product.name is the name of each toy, and Product.min_age indicates the minimum age recommended to play with the toy.

Manufacturer and Supplier list the names and addresses of all toy manufacturers and suppliers respectively. Manufacturer.mid and Supplier.sid are the primary keys for these relations.

Inventory indicates the number of toys in stock. Inventory.pid is a foreign key referring to Product.pid. Inventory.stock indicates the number of toys of the given type available in stock.

Manufactures and Suppliers associate products with their manufacturers and suppliers respectively. Note that there can be only one manufacturer for a product, but there can be many suppliers.

Database content

The content of the database is currently stored in six files corresponding to the six relations above. These files are available as one tar ball that you can download from here: hw1.tar.gz. If you want to get the files directly onto cubist, the easiest might be to log on to cubist and execute the following command:

> wget http://www.cs.washington.edu/education/courses/544/07au/assignments/hw1.tar.gz

To extract the files, use the following command:

> tar zxvf hw1.tar.gz

You should now see a directory called hw1 that contains the files with all the data. If this procedure did not work for you, please contact the TA or talk to another student in the class.

1. Creating your database (1pt)

Before you start, take a quick look at the "Architecture Fundamentals" section of the PostgreSQL documentation. All of today's DBMSs support this simple client-server architecture. We will discuss the architecture of a DBMS in greater detail in class.

In order to store data in a database, you first need to create a database. The name of your database should be cse544-07au-yourlogin, where yourlogin is your login on cubist. This naming convention will ensure that each database has a unique name. To create a database, follow these simple instructions.

Write the command that you used in a file called problem1.txt.

Now that you created your database, you can try to access it and play with it as described here.

2. Creating tables (6pt)

Using the following instructions, create the six tables in the inventory database. At this point, do not worry about specifying what attributes are the primary keys or the foreign keys. You will do this in problem 6. Create the tables without any constraints on their content other than the types of the different attributes.

Write all the statements that you used in a file called problem2.txt.

3. Loading data (6 pt)

Now that you created the relations, you can load the data into the database. To insert data into a table, you can use the SQL INSERT statement as demonstrated here.

3.1) Write a SQL statement to insert a new product with pid=-1, name='my product', and min_age=3 into the Product table.

There are, however, 10,000 products in our database. Entering these products one at the time would be somewhat tedious.

In general, when you will use a database, you will be in the same situation: you will start with a large amount of data. For example, you will have data from experiments that you will be running as part of your research projects.

Fortunately, in all DBMSs, it is possible to load a large amount of data from a text file in a single operation. In PostgreSQL, you can perform such a load using the COPY statement. COPY, however, requires administrator privileges. As a user, you need to use the psql meta-command \copy instead. You can read more about psql and its meta-commands here.

3.2) Populate the six tables form the six text files using \copy.

3.3) Write a SQL statement to remove the product that you entered in 3.1 (only that product).

Write all the commands that you used in a file called problem3.txt.

4. SQL Queries (17 pts)

Now that you have a database with data in it, you can use SQL to ask various queries about the data. In this problem, you will practice writing SQL queries.

You can assume that every product in the table Product appears in the table Inventory.

Write SQL queries for the following:

4.1) List the ids and names of all products whose inventory is below 5.

4.2) List the ids and names of all suppliers for products manufactured by "manufacturer_2". The id and name of each supplier should appear only once.

4.3) List the ids, names, and number in stock of all products in inventory. Order the list by decreasing number in stock and decreasing product ids.

4.4) List the ids and names of all products for whom there is only one supplier.

4.5) Find the ids and names of the products with the lowest inventory. Do NOT assume these are always products with an inventory of zero.

4.6) List the id and name of each supplier along with the total number of products it supplies.

4.7) Find the id and name of the manufacturer who produces toys on average for the youngest children.

Store your queries in a file called problem4.txt.

5. The system catalog (0 pts)

A relational DBMS maintains all metadata information about relations (e.g., their schemas, their integrity constraints, their indexes) in a set of special tables called the catalog tables. Storing metadata in the same way as the data itself is a neat design features of relational DBMSs. Catalog tables can be queried with SELECT statements just like any other table. For example, to see the list of relations in the current database you can use the following statement:

# SELECT * FROM pg_tables WHERE tableowner='yourlogin';

To make your life easier, psql provides several meta-commands to look-up information about relations. For example, the command:

# \d

Shows the set of relations in the current database.

# \d Product;

Shows the definition of relation Product.

You can read more about the catalog tables in Section 12.1 of R&G.

You can read more about PostgreSQL's catalog here.

You can read more about psql here.

We encourage you to look-up the definitions of your different tables again after you create integrity constraints and indexes below.

6. Integrity constraints (18 pts)

It is possible to restrict the data that can be stored in a database by defining integrity constraints: i.e., conditions on the database schema. Only database instances that satisfy the integrity constraints are allowed to exist. The DBMS will ensure that this property holds.

In our schema, we have designated some columns to be the primary key of a relation and other columns to be a foreign key referencing another relation. These constraints are called key constraints. They are the most fundamental type of integrity constraints. To enforce these constraints, the DBMS must know about them. Key constraints are normally specified when first defining a table using the CREATE TABLE statement. If a table already exists, its definition can be modified using the ALTER TABLE statement.

6.1) Modify the schemas of relations Product, Manufacturer, and Supplier to specify that pid, mid, and sid are the respective primary keys for these relations. Use the ALTER TABLE statement. If you already specified the primary key constraint when you created the relation, you do not need to do anything. Write the exact statement that you used (either the ALTER TABLE statement or a copy of your CREATE TABLE statement) in a file called problem6.txt.

6.2) Similarly, modify the schemas of relations Inventory, Manufactures, and Supplies to specify the foreign key constraints. Once again, use the ALTER TABLE syntax or do nothing if you already specified the constraints when you created the relations. Add the exact statement that you used (either the ALTER TABLE statement or a copy of your CREATE TABLE statement) to the file called problem6.txt.

6.3) Try to insert a product into the Product table with an identifier that already exists in the table. Try to insert an entry in the Inventory table for a product that does not exist in the Product table. What happens when you try to execute these operations? What would happen if the tables did not have the key constraints specified on them? Add the operations that you performed and your answer to the question to the file called problem6.txt.

6.4) What happens if you try to delete a product from the Product table without first deleting it from the Inventory, Supplies, and Manufactures table? Is there a way to ask the DBMS to remove these related entries automatically? (Refer to section 3.3 or Ramakrishnan and Gehrke). Add your answer to the file problem6.txt.

DBMSs support more general and more sophisticated types of integrity constraints. You can read more about complex integrity constraints in Section 5.7 of R&G.

6.5) (Hard question) It is often useful to have the DBMS perform some actions automatically in response to operations on the database. For example, if a product is discontinued, a user will remove it from the Supplies table. We would like, however, to keep this product in the database until its inventory goes to zero. Once the product is out of stock, it should be removed from the Product and Inventory tables. We can make the DBMS perform the latter two deletes automatically by defining a trigger on the database. You can read more about triggers in Section 5.8 of R&G. Add a trigger to the database that performs the following function: if the quantity of a product in stock goes down to zero and there is no supplier for the product, remove the product from the Product and Inventory tables. You will need to read the PostgreSQL documentation on triggers. We encourage you to use PL/pgSQL to write your trigger function (example of trigger using PL/pgSQL). This question is also a good opportunity to learn about PL/pgSQL.

You do not need to worry about the cases when:

a supplier is deleted or
a new product is added with no supplier and 0 stock.

7. Indexes (14pts)

Databases typically contain large amounts of data. Because this data must persist even if the DBMS goes down, it is stored on disk. To answer a query, the DBMS must thus read data from disk, which can be slow if a lot of data must be read. Indexes are data structures that can speed-up this process by enabling the DBMS to read only a subset of all data from disk when answering a query. You can read more about indexes in Chapters 8 through 11 of R&G. We will also discuss them in class.

When you altered the schemas of relations Product to specify the primary key constraint, you might have noticed the following message:

NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "product_pkey" for table "product"
ALTER TABLE

This message indicates that PostgreSQL automatically creates an index on the attributes that serve as the primary key of a relation. This index speeds-up queries that search for tuples by their primary key. In the case of PostgreSQL, this index is also the mechanism that enforces the primary key constraint.

In this question, we will practice using indexes on a single large table. Create a second table, called ProductBig, with the following schema. At first, do not specify that pid is a primary key. Load data into this table from the file called /cse/courses/cse544/07au/public/productbig.dat (beware that the load will take several minutes).

ProductBig(pid: integer, name: varchar(20), description: varchar(100), rating: integer, min_age: integer)

Consider the following query:

SELECT * FROM ProductBig WHERE pid=9876543;

7.1) Try to execute this query and see how long it takes. You can use the command \timing to have PostgreSQL time your queries. How long does the query take? How many tuples from relation ProductBig must be read from disk in order to answer this query? Write your answers in a file called problem7.txt.

7.2) Now modify the schema of ProductBig to indicate that pid is a primary key. Execute the above query again. How long does it take this time? Why is that? Add your answer to the file called problem7.txt.

Note: you can examine the query execution plan that the DBMS uses to answer the query. To see the plan, use the EXPLAIN command. You can find more information about EXPLAIN in the PostgreSQL documentation about EXPLAIN. We will talk about query plans starting from lecture 7. In the meantime, you can at least see if the DBMS uses the index or not when answering your query. You do not have to use EXPLAIN. Use it only if it helps you understand what is going on.

7.3) An index on the primary key speeds-up only a specific type of queries, which may not necessarily correspond to the query workload of the system. Consider the following query:

SELECT count(*) FROM ProductBig WHERE rating > 98;

Try to execute this query. How long does it take? How many tuples from relation ProductBig must be read from disk in order to answer this query? Write your answer in a file called problem7.txt.

7.4) Create an index on ProductBig that will speed-up the query above. Add the SQL statement that you use to create the index and an explanation for your choice of index to the file problem7.txt. How long does the query take now?

7.5) Delete the index on rating and add the statement that you used to problem7.txt.

7.6) Since indexes are so great, why not create an index on every single attribute of ever single table? Add your explanation to the file problem7.txt.

When you are done working on this question, please drop the ProductBig table to avoid using all that disk space.

8. Views (8pts)

A view is a table whose content is not physically stored on disk. Instead, it is computed from a query that corresponds to the view definition.

8.1) Create a view that shows only toys in the Product table for children over age 6. Write the SQL statement that you use in a file called problem8.txt.

8.2) Write a query that uses the newly created view to find all toys for children over age 6 that contain the substring '666' in their name. Add the SQL statement that you used in a file called problem8.txt.

8.3) Create a view that shows the id and names of toys along with the ids and names of their manufacturers. Add the SQL statement that you used to the file called problem8.txt.

8.4) What do you think are some of the benefits of using views? Add your answer to problem8.txt.

8.5) What happens if you try to insert a tuple into one of those views? Should users be allowed to perform inserts or updates on views? Add your answer to problem8.txt.

Turn-in instructions

You should work on this assignment by yourself. Make sure to turn-in the following files:

problem1.txt
problem2.txt
problem3.txt
problem4.txt
nothing to turn-in for problem 5
problem6.txt
problem7.txt
problem8.txt

To hand-in your assignment, issue the following command under the directory where you saved the above files:

tar zcvf cse544-07au-youraccountname.tar.gz problem?.txt

After executing this command, you should see a file named cse544-07au-youraccountname.tar.gz. Please send this file as an email attachment to your TA.