CSE 344 section 03 -- A tour of Postgres, practice with relational algebra

================================================================
A QUICK TOUR OF POSTGRESQL FOR HOMEWORK 3

Why postgres?
  You'll need it for the upcoming homework 5.
  It has more features and a better optimizer than sqlite.

The minimal set of commands to start (per hw instructions):
  [Open a postgres shell using hw3shell.cmd]
  initdb Z:\pgsql_data
  [in another postgres shell]
  pg_ctl start -D Z:\pgsql_data
  [back in the first shell]
  createdb imdb
  psql imdb

If you're running on your own machine:
  - download postgres here: http://www.postgresql.org/download/
  - set it up and remember the "database superuser's password"
  - add postgres commands to PATH if needed (especially on Windows or Mac);
    see hw3shell.cmd (which does NOT work at home) for a guide to what
    you might need to do
  - don't do initdb and pg_ctl start; there's already a data directory
    and postgres auto-starts on boot.
  - put `-U postgres' before the database name in createdb and psql commands,
    example:
      psql -U postgres imdb
    and enter the superuser password

psql:
  metacommands: sqlite3 has . commands, psql has \ commands - neither are SQL

  \h = help for SQL commands 
  \? = help for metacommands 
  \q = quit psql

  \ commands you need:
     \copy to import imdb data - see hw for exact syntax
     \d (describe) - like .schema, .indices, etc.

Postgres EXPLAIN command:
  "EXPLAIN query" tells you postgres's plan for executing your query,
  without actually running it.  Here is example output from EXPLAIN: 

  imdb=> explain 
    select count(*) as role_count 
    from casts
    group by mid
    ;
                                         QUERY PLAN
  ----------------------------------------------------------------------------------------
   GroupAggregate  (cost=0.00..402257.46 rows=240821 width=4)
     ->  Index Scan using castsmid on casts  (cost=0.00..342017.96 rows=11445847 width=4)
  (2 rows)

  Each line/row in the plan corresponds to one "physical" query operator,
  which usually but not always maps to a node in the relational algebra
  query tree.  (Optimization can cause the tree to be rearranged or pruned.)
  Usually it's obvious what kind of relational algebra operator is 
  implemented, for example "GroupAggregate" is the group-and-aggregate
  \gamma operator. 

  Higher levels of indentation indicate greater depth in the tree,
  so to do the high-level \gamma, we need to do a "index scan" on Casts,
  using the castsmid index to (presumably) find all the tuples with
  each possible mid value in succession.

  Each operator is labeled with 3 values:
  - The "cost," or estimated elapsed time (in no particular units), 
    for the operation, displayed as [time to first row]..[time to finish].

  - How many tuples are expected in the operation's output.

  - How many attributes per tuple. (Will be more than you think,
    postgres stores extra bookkeeping info in tuples).

ANALYZE:
  If the plan from EXPLAIN is surprisingly bad - wildly off
  estimates for row counts, or there's a "sequential scan" (read in file order)
  of a table based on a column you know is indexed, it may help to 
  tell postgres to refresh the internal statistics used by the 
  optimizer:
    ANALYZE VERBOSE;
  or 
    VACUUM ANALYZE VERBOSE;
  (which also frees up "garbage" blocks on disk).

Quick postgres tips:
  - Don't create the table indexes before you import the data - 
    updating the indexes will make the import take a LONG time.
    (This was true with sqlite as well, but it's more likely here
     because you're using the old sql scripts.)

  - Always run "EXPLAIN query" before actually running the query.

  - "EXPLAIN ANALYZE query" will tell you the actual runtime of the query
    (but sadly not the query results as well).

Postgres documentation:
  http://www.postgresql.org/docs/8.4/interactive/


================================================================
RELATIONAL ALGEBRA

For this section, we use the same IMDB database you are already using
for homeworks 2 and 3:

  Actor (id, fname, lname, gender)
  Movie (id, name, year)
  Directors (id, fname, lname)
  Casts (pid, mid, role)
  Movie_Directors (did, mid)
  Genre (mid, genre)

We will translate the following SQL queries into relational algebra:

1. For each movie with at least 5 actors, compute the avergae number
   of roles per actor.

   SELECT m.name, COUNT(*) / COUNT(DISTINCT c.aid) AS average
   FROM Movie m, Casts c
   WHERE m.id = c.mid
   GROUP BY m.id, m.name
   HAVING COUNT (DISTINCT c.aid) >= 5
   ;

2. For all actors who appear in more than 50 movies made after 1950,
   get the number of such movies they were in.

   SELECT a.fname, a.lname, COUNT(DISTINCT m.id) AS movie_count
   FROM Actor a, Casts c, Movie m
   WHERE a.id = c.mid
     AND c.mid = m.id
     AND m.year > 1950
   GROUP BY a.id, a.fname, a.lname
   HAVING COUNT(DISTINCT m.id) > 50
   ;

3. Compute the average number of roles per movie.

   SELECT AVG(role_count)
   FROM (
     SELECT COUNT(*) AS role_count
     FROM Casts
     GROUP BY mid
   ) counts
   ;

4. Give all the actors who only appear in movies before 2000.

   SELECT *
   FROM Actor a
   WHERE 2000 > ALL (
     SELECT m.year
     FROM Casts c, Movie m
     WHERE a.id = c.pid
       AND c.mid = m.id
   )
   ;

   [do alternative SQL queries exist?]

5. Give all the directors who did not make a movie in 2004.

   SELECT *
   FROM Directors d
   WHERE NOT EXISTS (
     SELECT *
     FROM Movie_Directors md, Movie m
     WHERE md.did = d.id AND m.year = 2004
   )
   ;