CSEP544 Homework 1

Objectives:: To create and import databases, to practice simple, then advanced SQL queries, to create and use database indexes, and to get familiar with Postgres
Reading assignments:: Creating tables in SQL (section 3.1), SQL queries (sections 5.1-5.6), writing complex queries (Three Query Language Formalisms).
Assignment tools:: PostgreSQL, IMDB dataset
Due date:: Monday, October 10th, 2011, 11:59pm, in the Dropbox
What to turn in:: hw1-create-tables.sql, hw1-create-indexes.sql, and hw1-queries.sql (see below)
Maximum number of points:: 150

In this homework, you will write several SQL queries on a relational movie database. The data in this database is from the IMDB website. The database consists of six tables:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
DIRECTORS (id, fname, lname)
CASTS (pid, mid, role)
MOVIE_DIRECTORS (did, mid)
GENRE (mid, genre)

All id fields and year are integers, all other fields are character strings. You can use either text or varchar(xx) for the character strings (people's names have maximum 30 characters, roles and genres have maximum 50, and movie titles have maximum 150). You should define the gender field to be one single character long.

In addition, impose the following constraints:
ACTOR.id, MOVIE.id, DIRECTOR.id = a key
CASTS.pid = foreign key to ACTOR.id
CASTS.mid, MOVIE_DIRECTORS.mid = foreign key to MOVIE.id
MOVIE_DIRECTORS.did = foreign key to DIRECTORS.id

We provide the movie database as a set of plain-text data files in the linked .tar.gz archive. Each file in this archive contains all the rows for the named table, one row per line.

In this homework, you need to do three things: A import the movie dataset into postgres, B. create indexes on the data to make the queries run fast, and C.run SQL queries to answer a set of questions about the data.

POSTGRES

In this homework we will use a client-server database system, PostgreSQL, a.k.a. postgres. Client-server means that there are two programs involved in order to run queries on the database: the server, which runs in the background, and the client, where you type the SQL queries. This is in contrast with sqlite, a client-side database system that is today available on most PC's (try running sqlite3 from the command line). For that reason postgres requires more work to install than, say, sqlite, but it should not be too much work. We are already providing postgres on the computers in the Allen Center basement labs, but if you use your personal computer we strongly encourage you to install it yourself on your computer. Brief instructions on how to start postgres are bleow; detailed instructions are here.

[Download and install postgres from http://www.postgresql.org/download/]
[Depending on your installation you many need to initialize the database; do this at most once]
[Set PATH as necessary]
initdb C:\pgsql_data             [creates a data folder, do this only once. Copy it to Z: before logout!]

[Depending on your installation, you may need to start the postgres server; do this at most once after each system reboot]
[Open a postgres shell]
pg_ctl start -D C:\pgsql_data    [starts the postgres server - do this each time the computer boots]

[Now you must create a database; you do this only once fo the database: the database is persistent!]
[Open a postgres shell]
createdb imdb                    [creates the imdb database, do this only once]

[Start the postgres client: every time you want to access the database]
[Open a postgres shell]
psql imdb                        [opens a postgres SQL command line]

you can also run:
dropdb imdb                    [deletes the imdb database; do this if you want a clean start]

Once you start psql, remember these three important commands:\h = help for SQL commands \? = help for metacommands \q = quit psql

SQL SERVER

You are required to use postgres for this assignment. This creates a challenge for you, because queries 6-11 below will be difficult to run on postgres: you need to find a clever way to formulate queries 6-11 to help postgres run them; depending on your postgres installation, you may want to get your fingers dirty and increase its buffer space, for example. We will not give you any suggesions on how to solve queries 6-11: this is your challenge.

However, we do offer this help: you can test and debug these queries on SQL SERVER first. This is a commercial database engine and is much more efficient than postgres. However, it is not free: you need to connect to the sever iprojsrv, where you have an account; see instructions in lecture 1. We strongly encourage you to play with SQL Server, in addition to using postgres.

Brief instructions on connecting to iprojsrv are below; more detailed instructions are here.

  - use a Windows machine, from within the CSE building
    or, remote desktop to aria.cs.washington.edu
  - start SQL Server Management Studio; if you don't have it, you can get it for free here XXX
  - in the CONNECT window, type the server name iprojsrv, select SQL Server Authentication
  - user name = your UW user id; password = given in lecture 1 in class
  - in the left pane click Security - Logins - your login; then change your passowrd
  - in the left pane click Databases - imdb
  - click the New Query button (top left): type "select count(*) from movie", click "Execute".

A. IMPORTING THE IMDB DATABASE (10 points):

To import the movie database into postgres, first get the database from here: IMDB Database: it comes as six large text files. Create the database (createdb imdb) then start a postgres client to connect to the database (psql imdb). Then use the CREATE TABLE SQL statement to create the tables, choosing appropriate types for each column:

	CREATE TABLE table_name (  . . . );

Then, you can use the postgres \copy command

\copy tablename from 'filename' with delimiter '|' csv quote E'\n'

Here tablename is one of ACTOR,CASTS, etc, and 'filename' is the full path to actor-ascii.txt, casts-ascii.txt etc. You need to run this command six times, once for each table/filename pair.

Note: the \copy command is not executed by the psql process, but by the postgres server ! You must ensure that the server has access to the filename: give the full path name, and make sure it has read permissions to the file (chmod 644 filename). See \h copy.

You may get a conversion error during the \copy command, depending on your postgres installation, because the six text files are in UTF8, while postgres accepts only ASCII files (depending on its setup). One way to fix that, is to convert from UTF8 to ASCII by using the following commands (should work on Linux, MAC, and Windows/cygwin):

iconv -c -t ASCII actor.txt > actor-ascii.txt iconv -c -t ASCII movie.txt > movie-ascii.txt iconv -c -t ASCII casts.txt > casts-ascii.txt iconv -c -t ASCII genre.txt > genre-ascii.txt iconv -c -t ASCII directors.txt > directors-ascii.txt iconv -c -t ASCII movie_directors.txt > movie_directors-ascii.txt

After doing this conversion, if you are using Windows, you may get postgres errors when you try to copy the files, e.g. ERROR: unquoted carriage return found in data. Check your file to see if there's an extra carriage return after certain lines on the specified line numbers. You may need to search and replace those.

Put all the create table an \copy statements in a hw1-create-tables.sql, which is the first file you need to turn in.

B. CHOOSING INDEXES (30 points):

NOTE: you may want to start working on C first, then return to B.

Once you have imported the movie database, your main task is to write SQL queries (task C below). However, you will find that even simple queries will take a long time; this is because the movie database you created lacks indexes on frequently accessed columns of its tables.

Hence, you will need to choose indexes for the movie database and create them using postres' CREATE INDEX SQL statement:

CREATE [UNIQUE] INDEX index_name ON table_name(col_1, col_2, ...);

Run \help create index at the postgres command line to get more detailed information. If given, the optional keyword UNIQUE says that no row can exist in the table with duplicate values of all the index columns. If you did not mark those constraints already in the CREATE TABLE statements, telling SQLite about them here may let it choose a faster way to create and maintain the index, or to read through the indexed table when running a query.

Your goal is to choose up to about 10 indexes, such that once the indexes are created, queries 1-5 below should not take more than 1 minute on a modern computer, while queries 6-11 should run in 5-10 minutes each (could be slower if your machine is too slow). You have some flexibility about which indexes to choose, there is no absolutely perfect solution; our solution takes only about XXX seconds to run on a desktop machine from 2009, using 11 indexes. Be sure to create only indexes that matter.

Before each CREATE INDEX statement write a one-line comment with the reason for choosing each index (e.g. "in order to speed up the selection in query 7" or "in order to speed up the join in query 11").

Put all your code for part B (CREATE INDEX code) in a file called hw1-create-indexes.sql.

C. SQL QUERIES (110 points; 10 points per question):

For each question below, write a single SQL query to answer that question. Put your queries in a file called hw1-queries.sql. Add a comment to each query indicating with the question it answers and the number of rows in the query result.

List the first and last names of all the actors who played in the movie 'Officer 444'. [~13 rows expected]
List all the directors who directed a 'Film-Noir' movie in a leap year. (You need to check that the genre is 'Film-Noir' and simply assume that every year divisible by 4 is a leap year.) Your query should return director name, the movie name, and the year. [~113 rows]
List all the actors who acted in a film before 1900 and also in a film after 2000. (That is: < 1900 and > 2000.) [~48 rows]
How can this be? Actors can't live more than 100 years, right? Please find the explanation. For that you need to investigate a bit, perhaps run 1-2 additional queries (include them in your hw1-queries.sql file). Once you identify one logical explanation why some actors appear in movies more than 100 years apart, write it in your SQL code, as a comment to the SQL query; keep your answer below 1-2 sentences.
List all directors who directed 500 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed. [~47 rows]
We want to find actors that played five or more roles in the same movie during the year 2010. Notice that CASTS may have occasional duplicates, but we are not interested in these: we want actors that had five or more distinct roles in the same movie in the year 2010. Write a query that returns the actors' names, the movie name, and the number of distinct roles that they played in that movie (which will be ≥ 5). [~24 rows]
Consider again all actors that had five or more roles in a movie in 2010 from the previous question. Return the names of these actors and their roles. Your answer should have one tuple for each combination of (actor, movie, role) - so if an actor has 10 roles in a given movie, there should be 10 tuples for that actor and movie. Approx. 140 rows, 1-2 minutes.
For each year, count the number of movies in that year that had only female actors. Recall the meaning of the universal quantifier: a movie without any actors is also a movie with only female actors (since there are no male actors in such a movie!). Approx. 130 rows, 1-2 minutes.
Now make a small change: for each year, report the percentage of movies with only female actors made that year, and also the total number of movies made that year. For example, one answer will be:
```
1990 31.81 13522
```
meaning that in 1990 there were 13,522 movies, and 31.81% had only female actors. You do not need to round your answer. Approx. 130 rows, 1-2 minutes.
Find the film(s) with the largest cast. Return the movie title and the size of the cast. By "cast size" we mean the number of distinct actors that played in that movie: if an actor played multiple roles, or if the actor is simply listed more than once in CASTS, we still count her/him only once. You may not assume that only one film has the largest cast. 1 row, 10 minutes.
A decade is a sequence of 10 consecutive years. For example 1965, 1966, ..., 1974 is a decade, and so is 1967, 1968, ..., 1976. Find the decade with the largest number of films. 1 row, 5 minutes.
The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the "co-acting" graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same film as KB have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 (but not with Bacon himself) have Bacon number 2, etc. Count how many actors have Bacon number is 2. 1 row, 5 minutes (Hint: correlated subqueries are SLOW; uncorrelated subqueries are faster).

Put all your code for part C (SELECT-FROM-WHERE code) in a file called hw1-queries.sql. This is the third file you will turn in.

NOTES:

Unless otherwise specified, in each question that asks you to return all actors, or all directors, or all movies, you should return their names (first and last for people).
Each question lists a suggested number of rows returned for a correct query, and an approximate running time. If your query produces a slightly different number of rows, that does not necessarily mean that you made a mistake, because there are inconsistencies in the data and variations in the SQL query may lead to slightly different answers.
Save all files you create during this homework: you may need them in later homeworks.