Assignment 1: SQL Queries, Web interface and XQuery

This assignment has three parts. The first two parts deal with SQL queries on a relational database while the third part deals with XQuery. Each part has a separate deadline. For the relational database, we use the film database that can be downloaded from here. This is a stripped down version of a dataset from the UCI KDD archive. The dataset consists of three tables: filmid is a key for FILMS and stagename is a key for ACTORS. refers to FILMS.filmid and refers to FILMS.stagename. Each of the tables is in a separate text file where every tuple is on a new line and the attributes in a tuple are separated by vertical bars.

  1. [30 points] Load the database into PostgreSQL server. Documentation for PostgreSQL can be found here. You need to log on to `cubist' to access PostgreSQL server (using the command `psql'). Accounts on cubists will be created for you. To load the database, create a table for each of the relations and use the 'COPY' function of PostgreSQL to import from a file. After that, write SQL expressions for each of the following queries and execute them:
    1. List all actors in descending order of the number of films they acted in.
    2. Find the film(s) with the largest cast. Find the film(s) with the smallest cast. In both cases, also return the size of the cast.
    3. The Bacon number of an actor is the length of the shortest path between the actor and Kevin Bacon in the "co-acting" graph. That is, Kevin Bacon has Bacon number 0; all actors who acted in the same film as KB have Bacon number 1; all actors who acted in the same film as some actor with Bacon number 1 have Bacon number 2, etc. Return all actors whose Bacon number is 2. Bonus: Suppose you write a single SELECT-FROM-WHERE SQL query that returns all actors whose Bacon number is infinity. How big is the query?
    4. Find the actors who acted in films of all categories in the database.
    5. Find all actors who acted only in films before 1960.
    6. Find the films with more women actors than men.
    7. A decade is a sequence of 10 consecutive years. For example 1965, 1966, ..., 1974 is a decade, and so is 1967, 1968, ..., 1976. Find the decade with the largest number of films.
    8. Create a new table with the following schema: ACTORS_DEBUT(stagename, lname, fname, debutfilmname, year), listing for each actor their debut film. You need to use the INSERT SQL statement.

    Revised Deadline:April 8(Thursday), 9.00 p.m. You need to turn in SQL queries for each of the above problems. Send your solution by email to Nilesh. If you are turning in a hard-copy, put it in Nilesh's mailbox.


  2. [35 points] In this assignment you will create a small website to provide an interface to the film database. The website should allow the users to search for films, and to browse the list of actors/films.

    You have to construct a website with the following structure:

    Website Overview

    To accomplish this task you will need to implement Java Server Pages, which will provide a web form for user input, interface with the database instance, and return HTML content. Instructions on setting up JSPs on cubist can be found here. You can also refer to Sun's JSP tutorial.

    Due Date:April 14.

  3. [35 points] Consider the XML data instance Mondial, avaialable here (at the bottom of the page). Write XQueries to answer the following questions. In formulating your questions, you need to understand how various elements are nested: e.g. what is under a country, under which element is a city etc. For that it helps if you inspect the DTD (ignore the warning that the data is not valid), or inspect the data directly. For each question below turn in the XQuery, and the result of running the query on the XML data (which should be an XML document).

    1. Retrieve all the names of all cities located in Peru, sorted alphabetically.
    2. Find all countries with more than 20 provinces.
    3. Find all ethnic groups that live in more than more than 10 countries.
    4. Find the countries adjacent to the 'Pacific Ocean' sea.
    5. Find the names of all countries that have at least 5 mountains over 2000m high, and list the names and heights of all mountains in these countries (regardless of their height). Note: the height attribute is in meters, so you don't have to do any conversions.
    6. One user is interested in long rivers. Produce the following view of the data, containing only rivers longer than 2000 (all units are in km), in the format described below:
      • The root element is user and contains several river elements
      • Each rivercontains a name element with the river's name, and several country elements, one for each country through which it flows. (Note: some rivers may not have any country, due to noise in the data. It is OK to include these rivers, even if they look as they flow through no country at all.)
      • Each country element contains only the name of the country (a string).

    For this question you need to run galax. For that you need to download it from here. Here is a brief example on how to run Galax.

    Due Date:April 21.