CSE 544 Homework 1: Data Analytic Pipeline

Objectives:
To get familiar with the main components of the Data Analytic pipeline: schema design, data acquisition, data transformation, querying, and visualizing.
Assignment tools:
postgres, excel (or some other tool for visualization)
Due date:
Sunday, April 8, 2012, at 11:59pm. Turn it in here.
What to turn in:
These files: pubER.pdf, solution.sql, graph.pdf, and, optionally, extracredit.sql. Your solution.sql file should be executable using the command psql -f solution.sql.
Starter code:
Here.

In this homework you will analyze publication data. The data comes from DBLP, the reference citation website created and maintained by Michael Ley. The analysis will be done in postgres; the visualization in Excel, or any other tool of your choice.

1. Schema Design

Design and create a database schema about publications. We will refer to this schema as PubSchema, and to the data as PubData.
  1. E/R Diagram. Design the E/R diagram, consisting of the following entity sets and relationships.

    Draw the E/R diagram for this schema. Identify all keys in all entity sets, and indicate the correct type of all relationships (many-many or many-one); make sure you use the ISA box where needed. Aim to produce a textbook-grade perfect schema.

    You have to turn in an E/R Diagram in a file pubER.pdf

  2. Create Database. Install postgres on your computer from http://www.postgresql.org/download/. Then create a database called dblp by running the command:

    createdb dblp

    If for any reason you need to delete the entire database and restart, run:

    dropdb dblp

    To run queries in postgres, type

    psql dblp

    then type in your SQL commands. Remember three special commands:

    \q -- quit (exit psql)
    \h -- help
    \? -- help for internal commands

  3. Create the Publication Schema in SQL. We will refer to this schema below as PubSchema. Write appropriate statements:

    create table Author ( . . . ); . . .

    that implement the E/R diagram above, then run them in postgres. For simplicity, you may choose all your data types to be int and text only. You need to create keys, foreign keys, and unique constraints; you may either do this now, or postpone it until later. To run your commands, do one of the following. Either start postgres interactively (as explained above), then cut/paste your SQL commands one by one; you still need to store these commands in a separate file (solution.sql), which you have to turn in. Or write the commands in a file createPubSchema.sql, then type:

    psql -f createPubSchema.sql dblp

    If for any reason you need to delete the tables and restart, type:

    drop table Author;

    Drop the tables in the reverse order in which you created them (otherwise you will violate foreign key constraints), or type drop table Author cascade;

    You should turn in all commands in the file solution.sql.
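
    For illustration only, here is a minimal sketch of what two such statements might look like. The table and column names below are assumptions; your actual tables must follow your own E/R diagram:

    create table Author (
        name     text primary key,   -- assumed key; your design may use a separate surrogate id instead
        homepage text
    );

    create table Publication (
        pubid int primary key,
        title text,
        year  int
    );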

2. Data Acquisition.

This step typically involves downloading data, or extracting it with a software tool, or inputting it manually, or all of the above. Then it involves writing and running a python script, called a wrapper, that reformats the data into a CSV format that we can upload to the database. You need to download the DBLP data yourself; we have already provided the python script for you.
  1. Import DBLP into postgres. (A sketch of one way to load a CSV file is shown after this list.)
  2. Write SQL Queries to answer the following simple questions, directly using the RawSchema:
    You should turn in your answer in the file solution.sql. The answer consists of SELECT-FROM-WHERE queries and CREATE INDEX statements. In addition, insert into the file all answers to the queries, in the form of SQL comments.
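
As a rough illustration of the import step, loading one CSV file into a raw table and indexing it in psql might look like the sketch below. The table name Pub is borrowed from the example in Section 3, but its columns and the CSV file name are placeholders; use whatever the provided wrapper script actually produces.

create table Pub (k text, p text);    -- placeholder column names
\copy Pub from 'pub.csv' with csv
create index Pub_k on Pub(k);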

3. Data Transformation.

Transform the DBLP data from RawSchema to PubSchema. Your transformation will consist of several SQL queries, one per PubSchema table. For example, to populate your Article table, you will likely run a query like:

insert into Article (select ... from Pub, Field ... where ...)

Since PubSchema is a well-designed schema, you will need to go through some trial and error to get the transformation right: use SQL interactively to get a sense of RawData, and figure out how to map it to PubData. We give you a few hints:


You should turn in your INSERT, CREATE TABLE, ALTER TABLE, ... statements in the file solution.sql.
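
To make the shape of such a statement more concrete, here is a sketch of how an Author-like table could be populated. It assumes, purely for illustration, that the raw Field table stores one (publication key, field name, field value) triple per row in columns named k, p, and v; adapt it to the actual RawSchema column names:

insert into Author (name)
    select distinct v
    from Field
    where p = 'author';   -- 'author' is the assumed field name in the raw data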


4. Queries.

Finally, now comes the fun part. Write SQL queries to answer the following questions:

  1. Find the top 20 authors with the largest number of publications. Runtime: under 10s.

  2. Find the top 20 authors with the largest number of publications in STOC. Repeat this for two more conferences, of your choice (suggestions: top 20 authors in SOSP, or CHI, or SIGMOD, or SIGGRAPH; note that you need to do some digging to find out how DBLP spells the name of your conference). Runtime: under 10s.

  3. The two major database conferences are 'PODS' (theory) and 'SIGMOD Conference' (systems). Find (a) all authors who published at least 10 SIGMOD papers but never published a PODS paper, and (b) all authors who published at least 5 PODS papers but never published a SIGMOD paper. Runtime: under 10s.

  4. A decade is a sequence of ten consecutive years, e.g. 1982, 1983, ..., 1991. For each decade, compute the total number of publications in DBLP in that decade. Hint: for this and the next query you may want to compute a temporary table with all distinct years (a sketch of such a table appears after this list). Runtime: under 1 minute.

  5. For each decade, find the most prolific author in that decade. Hint: you may want to first compute a temporary table, storing for each decade and each author the number of publications of that author in that decade. Runtime: about 1 hour.

  6. Find the top 20 most collaborative authors. That is, for each author determine the number of collaborators, then find the top 20. Hint: for this and a question below you may want to compute a temporary table of coauthors. Runtime: a couple of minutes.

  7. Find the institutions that have published the most papers in STOC; return the top 20 institutions. Then repeat this query with your favorite conference (SOSP, or CHI, or ...), and see which are the top places you didn't know about. Hint: where do you get information about institutions? Use the Homepage information: convert a Homepage like http://www.cs.washington.edu/homes/levy/ to http://www.cs.washington.edu, or even to www.cs.washington.edu; now you have grouped all authors from our department, and we use this URL as a surrogate for the institution. Google for substring, position, and trim in postgres (a sketch of this URL trimming appears after this list).
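
For question 4, the temporary table of years mentioned in the hint could be built along the lines sketched below. It assumes your PubSchema stores the publication year in a column Publication.year; substitute your own table and column names:

create table Years as
    select distinct year
    from Publication
    where year is not null;
-- a decade can then be taken to start at each year y in Years and cover y, y+1, ..., y+9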
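
For question 7, one way to trim a homepage URL down to its host in postgres is with substring and a regular expression, as sketched below. It assumes, for illustration, that your schema stores the URL in a column Author.homepage:

select name, substring(homepage from 'https?://([^/]*)') as institution
from Author
where homepage is not null;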

You should turn in your answer in the file called solution.sql.



5. Data Visualization.

 

Compute and draw two histograms: the histogram of the number of collaborators, and the histogram of the number of publications. In each case, start by writing a SQL query that returns a set (k, f(k)), where k=1,2,3,... and f(k) = the number of authors that have exactly k collaborators (for the first histogram) or f(k) = the number of authors that have k publications (for the second histogram). Your query should output the result into a CSV file (use the copy or the \copy command); from there, import it into Excel and draw the scatter plot.
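
For instance, exporting one of the histogram queries to CSV from inside psql might look like the sketch below. The helper table AuthorPubCount(author, npubs), holding the number of publications per author, is a placeholder name; replace the inner query with your own:

\copy (select npubs as k, count(*) as fk from AuthorPubCount group by npubs order by npubs) to 'pub_histogram.csv' with csv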

For each of the two histograms, indicate whether the scatter plot is exponential, a power law, or something else. You may want to try either a log scale or a log-log scale in your plot to best determine the type of law. Turn in the plots using the best scale that illustrates your finding (e.g. if you used a log-log scale to determine the type of distribution, then your plot should be in log-log).

You should turn in your answer in solution.sql (where you write the two SQL queries) and in graph.pdf (where you show the scatter plot graphs).


6. Extra credit.

Find out something interesting or surprising in DBLP. Write the SQL query (or queries) and turn it in, together with a brief description of your finding and why it is surprising.

If you decide to do the extra credit, turn in a file called extracredit.sql