CSEP544 Homework 1

Objectives:: To create a simple database schema, import data, practice simple, then advanced SQL queries, create and use database indexes, and to get familiar with a database management system, such as Postgres or SQL Server
Reading assignments:: Creating tables in SQL (section 3.1), SQL queries (sections 5.1-5.6), writing complex queries (Three Query Language Formalisms).
Assignment tools:: Install your own PostgreSQL or install your own SQL Server; SQL Azure (your login = your UW email, your password = given in classs)
IMDB dataset
Due date:: Monday, January 20th, 2014, 11:00pm, in the Dropbox
What to turn in:: hw1-create-tables.sql, hw1-create-indexes.sql, and hw1-queries.sql
Maximum number of points:: 150
Extra credit: Instead of using our copy of IMDB (which was downloaded in 2010), download and import a recent version of IMDB from ftp://ftp.fu-berlin.de/pub/misc/movies/database; Points:; Up to 5 extra points for creating up-to-date versions of the six current tables (ACTOR, MOVIE, DIRECTOR, CASTS, MOVE-DIRECTORS, GENRE), or; Up to 10 extra points for creating a more complete (but well designed!) schema, based on the data available now; Note: the extra credit question requires some python programming, some conceptual design, and some wrestling with data formats (non-ASCII characters, or who knows what). You will receive credit only for a good job, meaning that most (say 99.99%) of the data ends up uploaded in the databases.

Homework Description

In this homework, you will create a database, import data, then run several SQL queries on this database. In addition to your own database, you will also run some SQL queries on an identical database that is already uploaded on SQL Azure. The data is from imdb.com, and was downloaded around 2010. You should create a database with the following six tables:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
DIRECTORS (id, fname, lname)
CASTS (pid, mid, role)
MOVIE_DIRECTORS (did, mid)
GENRE (mid, genre)

All id fields and year are integers, all other fields are character strings. You can use either text or varchar(xx) for the character strings (people's names have maximum 30 characters, roles and genres have maximum 50, and movie titles have maximum 150). You should define the gender field to be one single character long.

In addition, impose the following constraints:

ACTOR.id, MOVIE.id, DIRECTOR.id = a key
CASTS.pid = foreign key to ACTOR.id
CASTS.mid, MOVIE_DIRECTORS.mid = foreign key to MOVIE.id
MOVIE_DIRECTORS.did = foreign key to DIRECTORS.id

We provide the IMDB dataset as a set of plain-text data files in the linked .tar.gz archive. Each file in this archive contains all the rows for the named table, one row per line.

In this homework, you need to do three things: (A) Import the movie dataset into your own installation of postgres or SQL Server, (B) Create indexes on the data to make the queries run fast, and (C) Run SQL queries to answer a set of questions about the data. Your queries should execute correctly on both your local copy of database and in SQL Azure.

POSTGRES

In this homework we will use a client-server database system, PostgreSQL, a.k.a. postgres. Client-server means that there are two programs involved in order to run queries on the database: the server, which runs in the background, and the client, where you type the SQL queries. This is in contrast with sqlite, a client-side database system that is today available on most PC's (try running sqlite3 from the command line). For that reason postgres requires more work to install than, say, sqlite, but it should not be too much work. In CSEP544 you are expected to install your own copy of postgres on your machine. Alternatively, you can install SQL Server, but we do not provide support for your local installation of SQL Server. Brief instructions on how to start postgres:

[Download and install postgres from http://www.postgresql.org/download/]
[Depending on your installation you many need to initialize the database; do this at most once]
[Set PATH as necessary]
initdb C:\pgsql_data             [creates a data folder, do this only once. Copy it to Z: before logout!]

[Depending on your installation, you may need to start the postgres server; do this at most once after each system reboot]
[Open a postgres shell]
pg_ctl start -D C:\pgsql_data    [starts the postgres server - do this each time the computer boots]

[Now you must create a database; you do this only once fo the database: the database is persistent!]
[Open a postgres shell]
createdb imdb                    [creates the imdb database, do this only once]

[Start the postgres client: every time you want to access the database]
[Open a postgres shell]
psql imdb                        [opens a postgres SQL command line]

you can also run:
dropdb imdb                    [deletes the imdb database; do this if you want a clean start]

Once you start psql, remember these three important commands:\h = help for SQL commands \? = help for metacommands \q = quit psql

SQL SERVER

You can choose to install and use SQL Server instead of postgres on your machine, but we do not provide support for installation. With a UW Student ID, you can download SQL Server from free here.

Connecting to SQL Server on Windows Azure:

In this homework you will be required to run all your SQL queries both on your local database (postgres or SQL Server) and also on Microsoft Azure Cloud, which runs SQL Server in the cloud. Azure should run faster than your local postgres, but probably slower than your local SQL Server. You can connect in two ways: (a) use your Web browser, https://m01rrgdwg2.database.windows.net (normally you will connect to the IMDB database, but the first time you should connect to the master database in order to change your password) or (b) run SQL Server Management Studio 2008 R2; you must use R2:, which you can download with your UW student ID from here.

In both cases, your login is your UW login (without '@washington.edu') and the initial password is given in class.

When connecting using SQL Server Management Studio, do the following

Server type: Database Engine
Server name: m01rrgdwg2.database.windows.net
Authentication: SQL Server Authentication
Login: Your UW netid
Password: Initially, use the one we gave in class

Once you are connected, change your password by running the following command in the master database:

ALTER LOGIN yourlogin WITH PASSWORD='some_new_password' OLD_PASSWORD = 'old_password'

More precisely:

If you connect using a Web browser, first connect to the master database, then run the ALTER LOGIN command above; then connect to the IMDB database.
If you connect from Management Studio, then: in the Object Explorer on the left, select Databases -> System Databases -> master
- Click on New Query (at the top)
- Execute the ALTER LOGIN command above. Make sure to use some capital letters and some numbers.
- To run queries against IMDB, first select the database called IMDB in the Object Explorer and then click on "New Query"

Note that you will get an error message if your new password is not sufficiently complex. If you have any problems connecting to Windows Azure, please let the instructor or the TAs know.

A. IMPORTING THE IMDB DATABASE (10 points):

Please note that all instructions refer to postgres; minor changes are needed for SQL Server

To import the movie database into postgres, first get the database from here: IMDB Database: it comes as six large text files. Create the database (createdb imdb) then start a postgres client to connect to the database (psql imdb). Then use the CREATE TABLE SQL statement to create the tables, choosing appropriate types for each column:

	CREATE TABLE table_name (  . . . );

Then, you can use the postgres \copy command

\copy tablename from 'filename' with delimiter '|' csv quote E'\n'

Here tablename is one of ACTOR,CASTS, etc, and 'filename' is the full path to actor-ascii.txt, casts-ascii.txt etc. You need to run this command six times, once for each table/filename pair.

Note: the \copy command is not executed by the psql process, but by the postgres server ! You must ensure that the server has access to the filename: give the full path name, and make sure it has read permissions to the file (chmod 644 filename). See \h copy.

You may get a conversion error during the \copy command, depending on your postgres installation, because the six text files are in UTF8, while postgres accepts only ASCII files (depending on its setup). One way to fix that, is to convert from UTF8 to ASCII by using the following commands (should work on Linux, MAC, and Windows/cygwin):

iconv -c -t ASCII actor.txt > actor-ascii.txt iconv -c -t ASCII movie.txt > movie-ascii.txt iconv -c -t ASCII casts.txt > casts-ascii.txt iconv -c -t ASCII genre.txt > genre-ascii.txt iconv -c -t ASCII directors.txt > directors-ascii.txt iconv -c -t ASCII movie_directors.txt > movie_directors-ascii.txt

After doing this conversion, if you are using Windows, you may get postgres errors when you try to copy the files, e.g. ERROR: unquoted carriage return found in data. Check your file to see if there's an extra carriage return after certain lines on the specified line numbers. You may need to search and replace those.

Put all the create table an \copy statements in a hw1-create-tables.sql, which is the first file you need to turn in.

B. CHOOSING INDEXES (30 points):

NOTE: you may want to start working on C first, then return to B.

Once you have imported the movie database, your main task is to write SQL queries (task C below). However, you will find that even simple queries will take a long time; this is because the movie database you created lacks indexes on frequently accessed columns of its tables.

Hence, you will need to choose indexes for the movie database and create them using postres' CREATE INDEX SQL statement:

CREATE [UNIQUE] INDEX index_name ON table_name(col_1, col_2, ...);

Run \help create index at the postgres command line to get more detailed information. If given, the optional keyword UNIQUE says that no row can exist in the table with duplicate values of all the index columns. If you did not mark those constraints already in the CREATE TABLE statements, telling SQLite about them here may let it choose a faster way to create and maintain the index, or to read through the indexed table when running a query.

Your goal is to choose up to about 10 indexes, such that once the indexes are created, queries 1-5 below should not take more than 1 minute on a modern computer, while queries 6-11 should run in 5-10 minutes each (could be slower if your machine is too slow). You have some flexibility about which indexes to choose, there is no absolutely perfect solution; our solution takes only about XXX seconds to run on a desktop machine from 2009, using 11 indexes. Be sure to create only indexes that matter.

Before each CREATE INDEX statement write a one-line comment with the reason for choosing each index (e.g. "in order to speed up the selection in query 7" or "in order to speed up the join in query 11").

Put all your code for part B (CREATE INDEX code) in a file called hw1-create-indexes.sql.

C. SQL QUERIES (110 points; 10 points per question):

For each question below, write a single SQL query to answer that question. Put your queries in a file called hw1-queries.sql. Add a comment to each query indicating with the question it answers and the number of rows in the query result.

List the first and last names of all the actors who played in the movie 'Officer 444'. [~13 rows expected]
List all the directors who directed a 'Film-Noir' movie in a leap year. (You need to check that the genre is 'Film-Noir' and simply assume that every year divisible by 4 is a leap year.) Your query should return director name, the movie name, and the year. [~113 rows]
List all the actors who acted in a film before 1900 and also in a film after 2000. (That is: < 1900 and > 2000.) [~53 rows]
How can this be? Actors can't live more than 100 years, right? Please find the explanation. For that you need to investigate a bit, perhaps run 1-2 additional queries (include them in your hw1-queries.sql file). Once you identify one logical explanation why some actors appear in movies more than 100 years apart, write it in your SQL code, as a comment to the SQL query; keep your answer below 1-2 sentences.
List all directors who directed 500 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed. [~47 rows]
We want to find actors that played five or more roles in the same movie during the year 2010. Notice that CASTS may have occasional duplicates, but we are not interested in these: we want actors that had five or more distinct roles in the same movie in the year 2010. Write a query that returns the actors' names, the movie name, and the number of distinct roles that they played in that movie (which will be ≥ 5). [~24 rows]
Consider again all actors that had five or more roles in a movie in 2010 from the previous question. Return the names of these actors and their roles. Your answer should have one tuple for each combination of (actor, movie, role) - so if an actor has 10 roles in a given movie, there should be 10 tuples for that actor and movie. Approx. 140 rows, 1-2 minutes.
For each year, count the number of movies in that year that had only female actors. Recall the meaning of the universal quantifier: a movie without any actors is also a movie with only female actors (since there are no male actors in such a movie!). Approx. 130 rows, 1-2 minutes.
Now make a small change: for each year, report the percentage of movies with only female actors made that year, and also the total number of movies made that year. For example, one answer will be:
```
1990 31.81 13522
```
meaning that in 1990 there were 13,522 movies, and 31.81% had only female actors. You do not need to round your answer. Approx. 130 rows, 1-2 minutes.
Find the film(s) with the largest cast. Return the movie title and the size of the cast. By "cast size" we mean the number of distinct actors that played in that movie: if an actor played multiple roles, or if the actor is simply listed more than once in CASTS, we still count her/him only once. You may not assume that only one film has the largest cast. 1 row, 10 minutes. The cast size is around 1300.
A decade is a sequence of 10 consecutive years. For example 1965, 1966, ..., 1974 is a decade, and so is 1967, 1968, ..., 1976. Find the decade with the largest number of films. 1 row, 5 minutes. If you were to count those movies in the decade then you would get around 457,500.
The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the "co-acting" graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same film as KB have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 (but not with Bacon himself) have Bacon number 2, etc. Count how many actors have Bacon number is 2. 1 row, 5 minutes The number of actors is around 521,900. (Hint: correlated subqueries are SLOW; uncorrelated subqueries are faster).

Put all your code for part C (SELECT-FROM-WHERE code) in a file called hw1-queries.sql. This is the third file you will turn in.

NOTES:

Unless otherwise specified, in each question that asks you to return all actors, or all directors, or all movies, you should return their names (first and last for people).
Each question lists a suggested number of rows returned for a correct query, and an approximate running time. If your query produces a slightly different number of rows, that does not necessarily mean that you made a mistake, because there are inconsistencies in the data and variations in the SQL query may lead to slightly different answers.
Save all files you create during this homework: you may need them in later homeworks.