CSE 344 Homework 3

Objectives:: To practice advanced SQL, and get familiar with Postgres for future homeworks.
Reading assignments:: SQL queries (sections 6.1-6.5), selecting and creating indexes (sections 8.4-8.5)
Assignment tools:: PostgreSQL, IMDB dataset ascii version
Due date:: April 18, 2011, at 11:59 pm -->in the dropbox. (But start early!)
What to turn in:: create-tables.sql, create-indexes.sql, and hw3-queries.sql (see below)

This homework is a continuation of homework 2, letting you practice more advanced SQL on the IMDB database. Here it is again for your reference:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
DIRECTORS (id, fname, lname)
CASTS (pid, mid, role)
MOVIE_DIRECTORS (did, mid)
GENRE (mid, genre)

All id fields and MOIVE.id (updated) are integers, all other fields are character strings.You must use varchar(xx) for the character strings (people's names have maximum 30 characters, roles and genres have maximum 50, and movie titles have maximum 150). You should define the gender field to be one single character long.
Note: postgres does not have text, hence you must use char(xx) or varchar(xx)

In addition, impose the following constraints:
ACTOR.id, MOVIE.id, DIRECTOR.id = a key
CASTS.pid = foreign key to ACTOR.id
CASTS.mid, MOVIE_DIRECTORS.mid ~~and GENRE.mid~~= foreign keys to MOVIE.id (Updated: GENRE.mid is no longer a foreign key)
MOVIE_DIRECTORS.did = foreign key to DIRECTORS.id

In this homework, you will do three things, similar to what you did in homework 2: A. import the database into Postgres, B. run SQL queries to answer a set of questions about the data, and C. create indexes on the data to make the queries run fast.

POSTGRES

In this homework we will use a client-server database system, PostgreSQL, a.k.a. postgres. Client-server means that there are two programs involved in order to run queries on the database: the server, which runs in the background, and the client, where you type the SQL queries. For that reason postgres requires more work to install than sqlite, but it should not be too much work. We are already providing postgres on the computers in the Allen Center basement labs, but if you use your personal computer we strongly encourage you to install it yourself on your computer. Here are detailed instructions on setting up Postgres. Read them now ! In brief, you will need to run the following (but again, read the detailed instructions first, so you can see what applies to you):

[Open a postgres shell using hw3shell.cmd]
initdb C:\pgsql_data             [creates a data folder, do this only once. Copy it to Z: before logout!]
[in another postgres shell]
pg_ctl start -D C:\pgsql_data    [starts the postgres server - do this each time the computer boots]
[back in the first shell]
createdb imdb                    [creates the imdb database, do this only once]
psql imdb                        [opens a postgres SQL command line]

you can also run:
dropdb imdb                    [deletes the imdb database; do this if you want a clean start]

Once you start psql, remember these three important commands:\h = help for SQL commands \? = help for metacommands \q = quit psql

A. DATA IMPORT (8 points):

Start from the create-tables.sql script you used with sqlite in homework 2, and modify it appropriately. To speed up import, do not create any indexes before you import the data.

Importing data in postgres is different from sqlite, here is what you need to do:

Create pure-ASCII version of the IMDB database. (Why ? Because the files that came with hw1 are in UTF8, and are not accepted by postgres). For that, do only one of the following:

Dowload the IMDB dataset ASCII version from the link above, OR
Use the IMDB txt files that you downloaded for the first homework, convert them from UTF8 to ASCII by using the following commands (should work on Linux, MAC, and Windows/cygwin):
iconv -c -t ASCII actor.txt > actor-ascii.txt iconv -c -t ASCII movie.txt > movie-ascii.txt iconv -c -t ASCII casts.txt > casts-ascii.txt iconv -c -t ASCII genre.txt > genre-ascii.txt iconv -c -t ASCII directors.txt > directors-ascii.txt iconv -c -t ASCII movie_directors.txt > movie_directors-ascii.txt

Import the ascii files in the postgres database. After you create the tables (CREATE TABLE ...), for each table execute the \copy statement:

\copy tablename from 'filename' with delimiter '|' csv quote E'\n'

Here tablename is one of ACTOR,CASTS, etc, and 'filename' is the full path to actor-ascii.txt, casts-ascii.txt etc.
Note: the \copy command is not executed by the psql process, but by the postgres server ! You must ensure that the server has access to the filename: give the full path name, and make sure it has read permissions to the file (chmod 644 filename). See \h copy.

Put all the create table an \copy statements in a create-tables.sq. Mark where you made changes from homework 1 to homework 2.

B. SQL QUERIES (72 points; 12 points per question):

For each question below, write a single SQL query to answer that question. Add a comment to each query indicating the question number and the number of rows your query returns.

Note: you may want to create the indexes (part C) before, or while you run the queries (part B). Even with indexes, these queries will be much slower than those in HW1. Poorly written queries may run forever: you must write efficient queries. Times below are approximate, and on a powerful computer.

Consider all actors that had five or more roles in a movie in 2010. In homework 2, we asked you to list each such actor's name, the movie name, and the number of roles he/she played. Do the same thing, but instead of giving the number of roles, give the name of each role. Your answer should have one tuple for each combination of (actor, movie, role) - so if an actor has 10 roles in a given movie, there should be 10 tuples for that actor and movie. Approx. 140 rows, 1-2 minutes.
For each year, count the number of movies in that year that had only female actors. Recall the meaning of the universal quantifier: a movie without any actors is also a movie with only female actors (since there are no male actors in such a movie!). Approx. 130 rows, 1-2 minutes.
Now make a small change: for each year, report the percentage of movies with only female actors made that year, and also the total number of movies made that year. For example, one answer will be:
```
1990 31.81 13522
```
meaning that in 1990 there were 13,522 movies, and 31.81% had only female actors. You do not need to round your answer. Approx. 130 rows, 1-2 minutes.
Find the film(s) with the largest cast. Return the movie title and the size of the cast. By "cast size" we mean the number of distinct actors that played in that movie: if an actor played multiple roles, or if the actor is simply listed more than once in CASTS, we still count her/him only once. You may not assume that only one film has the largest cast. 1 row, 10 minutes.
A decade is a sequence of 10 consecutive years. For example 1965, 1966, ..., 1974 is a decade, and so is 1967, 1968, ..., 1976. Find the decade with the largest number of films. 1 row, 5 minutes.
The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the "co-acting" graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same film as KB have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 (but not with Bacon himself) have Bacon number 2, etc. Count how many actors have Bacon number is 2. 1 row, 5 minutes (Hint: correlated subqueries are SLOW; uncorrelated subqueries are faster).

Put all your code for part B (SELECT-FROM-WHERE code) in a file called hw3-queries.sql.

C. CHOOSING INDEXES (20 points):

Create all necessary indexes to run the workload in reasonable time. The syntax in Postgres is the same as SQLite's:

CREATE [UNIQUE] INDEX index_name ON table_name(col_1, col_2, ...);

You should add your CREATE INDEX code for part C to the existing create-indexes.sql file. As in homework 2, before each CREATE INDEX statement write a one-line comment with the reason for choosing each index (e.g. "in order to speed up the selection in query 7" or "in order to speed up the join in query 11").

Put all your code for part B (CREATE INDEX code) in a file called create-indexes.sql.

NOTES:

Unless otherwise specified, in each question that asks you to return all actors, or all directors, or all movies, you should return their names (first and last for people).
Each question lists a suggested number of rows returned for a correct query, and an approximate running time. If your query produces a slightly different number of rows, that does not necessarily mean that you made a mistake, because there are inconsistencies in the data and variations in the SQL query may lead to slightly different answers.