CSE 344 Homework 3

Objectives:
To practice advanced SQL. To get familiar with commercial database management systems (SQL Server), both in client-server and in cloud configurations.
Assignment tools:
SQL Server on IISQLSRV, Windows Azure.
Due date:
Wednesday, Feb. 1, 2012, at 11:59 pm Dropbox
What to turn in:
hw3-queries.sql

This homework is a continuation of homework 2, with two changes. The queries are more challenging, and you will have to use a commercial database system . SQLite simply cannot execute these queries in any reasonable amount of time; hence, we will use SQL Server, which has one of the most advanced query optimizers.. In fact, you will be asked to use two database servers, both running SQL Server.

Here is again the schema of the IMDB database, for your reference:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
DIRECTORS (id, fname, lname)
CASTS (pid, mid, role)
MOVIE_DIRECTORS (did, mid)
GENRE (mid, genre)

All id fields are integers. MOVIE.year is an integer. All other fields are character strings.

id column in ACTOR, MOVIE & DIRECTOR tables is a key for the respective table.

CASTS.pid refers to ACTOR.id
CASTS.mid refers to MOVIE.id
MOVIE_DIRECTORS.did refers to DIRECTORS.id
MOVIE_DIRECTORS.mid refers to MOVIE.id
GENRE.mid refers to MOVIE.id

In this homework, you will do three things. First, you will connect to two database systems that we use in this homework: IISQLSRV and Windows Azure. Second, you will write and test the six SQL queries below (you can use either database server for that); keep in mind that the queries are quite challenging, both for you and for the database engine. Third, you will do a final run of your queries on both database systmes, and report their running times, then write a brief opinion on their similarities or differences.

A. Connecting to SQL Server on IISQLSRV and to Windows Azure:

You will connect to two servers: IISQL, and Windows Azure Cloud. The good news is that the IMDB database is already uploaded on both servers, and all indices are created; all you need is to do is to connect successfully, then run your queries.

1. We will be usingĀ Microsoft SQL Server 2008 running on IISQLSRV.cs.washington.edu. Follow the instructions here to connect to the database server.

2. We also use an instance of SQL Server running in the Microsoft Azure Cloud. You can connect in two ways: (a) use your Web browser, https://xbn4slk6hn.database.windows.net (normally you will connect to the IMDB database, but the first time you should connect to the master database in order to change your password) or (b) run SQL Server Management Studio 2008 R2 (it has to be R2 -- this is already installed on the PC labs), just as you did to connect to iisqlsrv.cs.washington.edu, but use instead xbn4slk6hn.database.windows.net. For either type of connection, your login is your UW login (without '@washington.edu') and the initial password is given in class (change it immediately).

If you have any problems connecting to IISQLSRV or to Windows Azure, please let the instructor or the TAs know. Once you connect, for fun, try and run some of the queries from the previous homework and see how fast they run compared to your local SQLite application.

 

B. SQL QUERIES (90 points; 15 points per question):

For each question below, write a single SQL query to answer that question. Add a comment to each query indicating the question number and the number of rows your query returns. At this stage you can run the query either on IISQLSRV or on Azure.

  1. Consider all actors that had five or more roles in a movie in 2010. In homework 2, we asked you to list each such actor's name, the movie name, and the number of roles he/she played. Do the same thing, but instead of giving the number of roles, give the name of each role. Your answer should have one tuple for each combination of (actor, movie, role) - so if an actor has 10 roles in a given movie, there should be 10 tuples for that actor and movie. Approx. 140 rows.

  2. For each year, count the number of movies in that year that had only female actors. Recall the meaning of the universal quantifier: a movie without any actors is also a movie with only female actors (since there are no male actors in such a movie!). Approx. 130 rows.

  3. Now make a small change: for each year, report the percentage of movies with only female actors made that year, and also the total number of movies made that year. For example, one answer will be:

    1990 31.81 13522

    meaning that in 1990 there were 13,522 movies, and 31.81% had only female actors. You do not need to round your answer. Approx. 130 rows.

  4. Find the film(s) with the largest cast. Return the movie title and the size of the cast. By "cast size" we mean the number of distinct actors that played in that movie: if an actor played multiple roles, or if the actor is simply listed more than once in CASTS, we still count her/him only once. You may not assume that only one film has the largest cast. 1 row.

  5. A decade is a sequence of 10 consecutive years. For example 1965, 1966, ..., 1974 is a decade, and so is 1967, 1968, ..., 1976. Find the decade with the largest number of films. 1 row.

  6. The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the "co-acting" graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same film as KB have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 (but not with Bacon himself) have Bacon number 2, etc. Count how many actors have Bacon number is 2. 1 row.

C. Running Times (10 points)

Now run all six queries on IISQLSRV, then on SQL Azure, and measure their running times. Report all 12 running times, then write a brief comment (2-3 sentences) explaining why you think the running times are similar, or different.

Put all your code for part B (SELECT-FROM-WHERE code) in a file called hw3-queries.sql and add the answers to part C as SQL comments.