CSE 344 section 03 -- A tour of Postgres, practice with relational algebra ================================================================ A QUICK TOUR OF POSTGRESQL FOR HOMEWORK 3 Why postgres? You'll need it for the upcoming homework 5. It has more features and a better optimizer than sqlite. The minimal set of commands to start (per hw instructions): [Open a postgres shell using hw3shell.cmd] initdb Z:\pgsql_data [in another postgres shell] pg_ctl start -D Z:\pgsql_data [back in the first shell] createdb imdb psql imdb If you're running on your own machine: - download postgres here: http://www.postgresql.org/download/ - set it up and remember the "database superuser's password" - add postgres commands to PATH if needed (especially on Windows or Mac); see hw3shell.cmd (which does NOT work at home) for a guide to what you might need to do - don't do initdb and pg_ctl start; there's already a data directory and postgres auto-starts on boot. - put `-U postgres' before the database name in createdb and psql commands, example: psql -U postgres imdb and enter the superuser password psql: metacommands: sqlite3 has . commands, psql has \ commands - neither are SQL \h = help for SQL commands \? = help for metacommands \q = quit psql \ commands you need: \copy to import imdb data - see hw for exact syntax \d (describe) - like .schema, .indices, etc. Postgres EXPLAIN command: "EXPLAIN query" tells you postgres's plan for executing your query, without actually running it. Here is example output from EXPLAIN: imdb=> explain select count(*) as role_count from casts group by mid ; QUERY PLAN ---------------------------------------------------------------------------------------- GroupAggregate (cost=0.00..402257.46 rows=240821 width=4) -> Index Scan using castsmid on casts (cost=0.00..342017.96 rows=11445847 width=4) (2 rows) Each line/row in the plan corresponds to one "physical" query operator, which usually but not always maps to a node in the relational algebra query tree. (Optimization can cause the tree to be rearranged or pruned.) Usually it's obvious what kind of relational algebra operator is implemented, for example "GroupAggregate" is the group-and-aggregate \gamma operator. Higher levels of indentation indicate greater depth in the tree, so to do the high-level \gamma, we need to do a "index scan" on Casts, using the castsmid index to (presumably) find all the tuples with each possible mid value in succession. Each operator is labeled with 3 values: - The "cost," or estimated elapsed time (in no particular units), for the operation, displayed as [time to first row]..[time to finish]. - How many tuples are expected in the operation's output. - How many attributes per tuple. (Will be more than you think, postgres stores extra bookkeeping info in tuples). ANALYZE: If the plan from EXPLAIN is surprisingly bad - wildly off estimates for row counts, or there's a "sequential scan" (read in file order) of a table based on a column you know is indexed, it may help to tell postgres to refresh the internal statistics used by the optimizer: ANALYZE VERBOSE; or VACUUM ANALYZE VERBOSE; (which also frees up "garbage" blocks on disk). Quick postgres tips: - Don't create the table indexes before you import the data - updating the indexes will make the import take a LONG time. (This was true with sqlite as well, but it's more likely here because you're using the old sql scripts.) - Always run "EXPLAIN query" before actually running the query. - "EXPLAIN ANALYZE query" will tell you the actual runtime of the query (but sadly not the query results as well). Postgres documentation: http://www.postgresql.org/docs/8.4/interactive/ ================================================================ RELATIONAL ALGEBRA For this section, we use the same IMDB database you are already using for homeworks 2 and 3: Actor (id, fname, lname, gender) Movie (id, name, year) Directors (id, fname, lname) Casts (pid, mid, role) Movie_Directors (did, mid) Genre (mid, genre) We will translate the following SQL queries into relational algebra: 1. For each movie with at least 5 actors, compute the avergae number of roles per actor. SELECT m.name, COUNT(*) / COUNT(DISTINCT c.aid) AS average FROM Movie m, Casts c WHERE m.id = c.mid GROUP BY m.id, m.name HAVING COUNT (DISTINCT c.aid) >= 5 ; 2. For all actors who appear in more than 50 movies made after 1950, get the number of such movies they were in. SELECT a.fname, a.lname, COUNT(DISTINCT m.id) AS movie_count FROM Actor a, Casts c, Movie m WHERE a.id = c.mid AND c.mid = m.id AND m.year > 1950 GROUP BY a.id, a.fname, a.lname HAVING COUNT(DISTINCT m.id) > 50 ; 3. Compute the average number of roles per movie. SELECT AVG(role_count) FROM ( SELECT COUNT(*) AS role_count FROM Casts GROUP BY mid ) counts ; 4. Give all the actors who only appear in movies before 2000. SELECT * FROM Actor a WHERE 2000 > ALL ( SELECT m.year FROM Casts c, Movie m WHERE a.id = c.pid AND c.mid = m.id ) ; [do alternative SQL queries exist?] 5. Give all the directors who did not make a movie in 2004. SELECT * FROM Directors d WHERE NOT EXISTS ( SELECT * FROM Movie_Directors md, Movie m WHERE md.did = d.id AND m.year = 2004 ) ;