CSE 444 Project 1: SQL Queries

Due date: January 21, 2011

In this project you are asked to write several SQL queries on a relational movie database. The data in this database is from the IMDB website. The database exists already on IISQLSRV (click here for instructions on how to connect). The database consists of six tables:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year, rank)
DIRECTORS (id, fname, lname)
CASTS (pid, mid, role)
MOVIE_DIRECTORS (did, mid)
GENRE (genre, mid)

id column in ACTOR, MOVIE & DIRECTOR tables is a key for the respective table.
CASTS.pid refers to ACTOR.id,
CASTS.mid refers to MOVIE.id
MOVIE_DIRECTORS.did refers to DIRECTORS.id
MOVIE_DIRECTORS.mid refers to MOVIE.id
GENRE.mid refers to MOVIE.id

Notes:

Please write queries to answer the following questions:

  1. List all the directors who directed a 'Film-Noir' movie in a leap year. (You need to check that the genre is 'Film-Noir' and simply assume that a leap-year is divisible by 4.) Your query should return director name, the movie name, and the year. [113 rows]

  2. List the first and last names of all the actors who played in the movie 'Officer 444'. [13 rows]

  3. List all the actors who acted in a film before 1900 and also in a film after 2000. (That is: < 1900 and > 2000.) [48 rows]

    How can this be? Actors can't live more than 100 years, right? Please find the explanation. For that you need to investigate a bit, perhaps run 1-2 additional queries. Once you identify one logical explanation why some actors appear in movies more than 100 years apart, write it in your SQL code, as a comment to the SQL query; keep your answer below 1-2 sentences.

  4. List all directors who directed 150 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed. [621 rows]

  5. We want to find actors that played two or more roles in the same movie during the year 2010. Notice that CASTS may have occasional duplicates, but we are not interested in these: we want actors that had two or more distinct roles in the same movie in the year 2010.

    1. Write a query that returns the actors' names, the movie name, and the number of distinct roles that they played in that movie (which will be ≥ 2). [1,574 rows]
    2. Write a query that returns the actors' names, the movie name, and all the distinct roles (two ore more) that the actor played in that movie. [3,420 rows]

    1. For each year, count the number of movies in that year that had only female actors. [122 or 129 rows]
    2. Now make a small change: for each year, report the percentage of movies with only female actors made that year, and also the total number of movies made that year. For example, one answer will be:
      1990 31.81 13522
      meaning that in 1990 there were 13,522 movies, and 31.81% had only female actors. You do not need to round your answer. [121 or 128 rows]
  6. Find the film(s) with the largest cast. Return the movie title and the size of the cast. By "cast size" we mean the number of distinct actors that played in that movie: if an actor played multiple roles, or if the actor is simply listed more than once in CASTS, we still count her/him only once. [1 row]

  7. A decade is a sequence of 10 consecutive years. For example 1965, 1966, ..., 1974 is a decade, and so is 1967, 1968, ..., 1976. Find the decade with the largest number of films. [1 row]

    Hint: this query combines several subtleties. One thing you may want to check is that the number of movies in that decade doesn't exceed the total number of movies in the database.

  8. The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the "co-acting" graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same film as KB have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 have Bacon number 2, etc. Return all actors whose Bacon number is 2. [521,876 rows]

    To ponder: how can you compute the Bacon number of each actor in the database?

These 9 problems are of increasing degree of complexity. Some of them are quite advanced SQL queries. If you write them correctly, none should take more than a few minutes on IISQLSRV. There are lots of hints for this assignment in the lecture notes, and more may be given in class.

WHAT TO TURN IN: a .sql file with all SQL queries and, for each query, the number of answers that you received. Submit it to the Catalyst dropbox linked from the class homepage.

WHAT YOU WILL LEARN: advanced SQL.

ESTIMATED TIME: 8 hours, but with a high variance. (The entire answer consists of as little as 120 lines of SQL code.)