CSE 544 Homework 1: Data Analytics Pipeline

Objectives:
To get familiar with the main components of the data analytics pipeline: schema design, data acquisition, data transformation, querying, and visualization.
Assignment tools:
postgres, Excel (or some other tool for visualization)
Due date:
Monday, January 26, 2015, at 11:45pm. Turn it in here.
What to turn in:
These files: pubER.pdf, solution.sql, graph.pdf, and, optionally, extracredit.sql. Your solution.sql file should be executable using the command psql -f solution.sql
Starter code:
Here.

IMPORTANT: Please keep track of how much time the assignment is taking. If you find yourself spending an overly large amount of time on this assignment, please let us know! Similarly, if you find that some parts of this assignment are not contributing to your learning, please let us know as well.

 

In this homework you will analyze publication data. The data comes from DBLP, the reference citation website created and maintained by Michael Ley. The analysis will be done in postgres; the visualization in Excel, or any other tool of your choice.

1. Schema Design

Design and create a database schema about publications. We will refer to this schema as PubSchema, and to the data as PubData.

  1. E/R Diagram. Design the E/R diagram, consisting of the following entity sets and relationships.

    Draw the E/R diagram for this schema. Identify all keys in all entity sets, and indicate the correct type of all relationships (many-many or many-one); make sure you use the ISA box where needed. Aim to produce a textbook-grade perfect schema.

    You have to turn in an E/R Diagram in a file pubER.pdf

  2. Install postgres on your computer, from http://www.postgresql.org/download/ Note that postgres is a client/server database system, which means that your server needs to start when your OS boots and must be active at all times. Normally this happens automatically, but your mileage may vary depending on your OS; if you get stuck, ask the staff. Also, check the README file: for example, the Mac OS X distribution asks you to increase your shared memory. Do as they say. If you plan to run postgres from the command line, make sure that your PATH variable points to the right postgres installation: you must be able to run these commands: createdb, dropdb, psql. Alternatively, you can use pgAdmin, which has a graphical UI. Depending on your luck, the installation may go perfectly smoothly, or you may have to uninstall and reinstall a few times and get your fingers very dirty; treat this as a great learning experience.

    *NEW* An alternative (and possibly simpler) way to install postgres as an application: http://postgresapp.com/.

    Note to Windows users: remember the username you set up during the installation process. Once installed, to run any postgres command you will need to pass the "-U username" parameter with each command (where you would replace username with the name you set up). For example, to start the psql shell you would type "psql -U username". Also, each time you start the postgres interactive shell, make sure you set the correct encoding by typing "SET CLIENT_ENCODING TO 'UTF8';".

  3. Create the Database. Create a database called dblp by running the command:

    createdb dblp

    If for any reason you need to delete the entire database and restart, run:

    dropdb dblp

    To run queries in postgres, type

    psql dblp

    then type in your SQL commands. Remember three special commands:

    \q -- quit (exit psql)
    \h -- help
    \? -- help for internal commands

  4. Create the Publication Schema in SQL. We will refer below to this schema as PubSchema. Write appropriate statements:

    create table Author ( . . . ); . . .

    that implement the E/R diagram above, then run them in postgres. For simplicity, you may choose all your data types to be int and text only. You need to create keys, foreign keys, and unique constraints; you may either do this now, or postpone it for later. Also, do NOT use the inherit or pivot functionality provided by postgres when creating these tables (a sketch of what such statements might look like appears at the end of this step). To run your commands, you can do one of the following. Either start psql interactively (as explained above), then cut/paste your SQL commands one by one; you still need to store these commands in a separate file (solution.sql), which you have to turn in. Or, write the commands in a file createPubSchema.sql, then type:

    psql -f createPubSchema.sql dblp

    If for any reason you need to delete the tables and restart, type:

    drop table Author;

    Drop the tables in the reverse order you created them (otherwise you will violate foreign key constraints), or type drop table Author cascade;

    You should turn in all commands in the file solution.sql.
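
    As an illustration of the kind of statements expected, here is a sketch. It is not the intended solution: the actual tables must come from your own E/R diagram, and all names below (Publication, Article, Author, Authored) are hypothetical.

    create table Publication (
        pubid int primary key,
        title text,
        year  int
    );
    -- an ISA relationship becomes a table whose key is also a foreign key
    create table Article (
        pubid   int primary key references Publication(pubid),
        journal text
    );
    create table Author (
        id   int primary key,
        name text not null
    );
    -- a many-many relationship becomes a table with a composite key
    create table Authored (
        id    int references Author(id),
        pubid int references Publication(pubid),
        primary key (id, pubid)
    );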

2. Data Acquisition.

Typically, this step consists of downloading data, or extracting it with a software tool, or inputting it manually, or all of the above. It then involves writing and running a script, called a wrapper, that reformats the data into some CSV format that we can upload to the database. You need to download the DBLP data yourself; we have already provided the python wrapper script for you (please use Python 2.7.x; we have not tested it with Python 3 or later).
  1. Import DBLP into postgres (see the import sketch after this list).
  2. Write SQL Queries to answer the following questions, directly against the RawSchema:
    You should turn in your answers in the file solution.sql. The answers consist of SELECT-FROM-WHERE queries and CREATE INDEX statements. In addition, insert all answers to the queries into the file, in the form of SQL comments.
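
For the import step, one common approach is psql's \copy command. The sketch below is only a guess at the shape of the commands: the table name RawPub, its column k, and the file pub.csv are hypothetical, and depend on the RawSchema and on what the provided wrapper actually produces.

    \copy RawPub from 'pub.csv' with (format csv, null '')
    -- an index on the raw tables can make the queries in step 2 much faster:
    create index RawPub_k on RawPub(k);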

3. Data Transformation.

Transform the DBLP data from RawSchema to PubSchema. Your transformation will consist of several SQL queries, one per PubSchema table. For example, to populate your Article table, you will likely run a query like:

insert into Article (select ... from Pub, Field ... where ...)

Since PubSchema is a well-designed schema (you designed it yourself!), you will need to go through some trial and error to get the transformation right: use SQL interactively to get a sense of RawData, and figure out how to map it to PubData. We give you a few hints:


You should turn in your statements (several INSERT, CREATE TABLE, ALTER TABLE, ... statements) in the file solution.sql.
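
To make the shape of these queries concrete, here is a sketch. It assumes hypothetical RawSchema tables Pub(k, type) and Field(k, name, value) and a hypothetical PubSchema table Article(pubid, title, year); your actual names, types, and join conditions will differ.

    insert into Article (pubid, title, year)
        select p.k, f1.value, cast(f2.value as int)
        from   Pub p, Field f1, Field f2
        where  p.k = f1.k and f1.name = 'title'
          and  p.k = f2.k and f2.name = 'year'
          and  p.type = 'article';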


4. Queries.

Finally, now comes the fun part. Write SQL queries to answer the following questions:

  1. Find the top 20 authors with the largest number of publications (a sketch of the general query shape appears after this list). Runtime: under 10s.

  2. Find the top 20 authors with the largest number of publications in STOC. Repeat this for two more conferences of your choice (suggestions: top 20 authors in SOSP, or CHI, or SIGMOD, or SIGGRAPH; note that you need to do some digging to find out how DBLP spells the name of your conference). Runtime: under 10s.

  3. The two major database conferences are 'PODS' (theory) and 'SIGMOD Conference' (systems). Find (a) all authors who published at least 10 SIGMOD papers but never published a PODS paper, and (b) all authors who published at least 5 PODS papers but never published a SIGMOD paper. Runtime: under 10s.

  4. A decade is a sequence of ten consecutive years, e.g. 1982, 1983, ..., 1991. For each decade, compute the total number of publications in DBLP in that decade. Hint: for this and the next query you may want to compute a temporary table with all distinct years (see the generate_series sketch after this list). Runtime: under 1 minute.

  5. Find the top 20 most collaborative authors. That is, for each author, determine their number of collaborators, then find the top 20. Hint: for this and some questions below you may want to compute a temporary table of coauthors (see the coauthor-table sketch after this list). Runtime: a couple of minutes.

  6. Extra credit: For each decade, find the most prolific author in that decade. Hint: you may want to first compute a temporary table, storing for each decade and each author the number of publications of that author in that decade. Runtime: a few minutes.

  7. Extra credit: Find the institutions that have published the most papers in STOC; return the top 20 institutions. Then repeat this query with your favorite conference (SOSP, or CHI, or ...), and see which the top places are; you may discover some you didn't know about. Hint: where do you get information about institutions? Use the Homepage information: convert a Homepage like http://www.cs.washington.edu/homes/levy/ to http://www.cs.washington.edu, or even to www.cs.washington.edu: now you have grouped all authors from our department, and we use this URL as a surrogate for the institution. Google for substring, position, and trim in postgres.
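
To illustrate the general shape of these queries, here is a sketch for question 1. It assumes the hypothetical PubSchema tables Author(id, name) and Authored(id, pubid); adapt it to your own schema.

    select a.name, count(*) as numpubs
    from   Author a, Authored w
    where  a.id = w.id
    group by a.name
    order by numpubs desc
    limit 20;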
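For question 4, postgres's generate_series function can play the role of the temporary table of years; Publication(year) is again hypothetical, and the range of start years is a guess you should adjust to the data.

    select d.decade as first_year, d.decade + 9 as last_year, count(*) as numpubs
    from   generate_series(1940, 2005) as d(decade), Publication p
    where  p.year between d.decade and d.decade + 9
    group by d.decade
    order by d.decade;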
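For question 5 (and reused in section 5 below), the hinted coauthor table can be materialized once, again assuming the hypothetical Authored(id, pubid) table:

    create table Coauthor as
        select distinct w1.id as author, w2.id as coauthor
        from   Authored w1, Authored w2
        where  w1.pubid = w2.pubid and w1.id <> w2.id;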

You should turn in your answers in the file solution.sql.



5. Using a DBMS from Python or Java and Data Visualization.

The goal of this final question is to ensure that you are comfortable writing an application that uses a DBMS as a back-end. This question requires that you write a small amount of Java or Python code. You can choose the language that you prefer.

You may find the documentation for your language's postgres driver helpful when completing this question: for Java, the JDBC documentation; for Python, the documentation of a driver such as psycopg2.

From your Java or Python program, connect to your database and submit queries to compute two histograms: the histogram of the number of collaborators, and the histogram of the number of publications. In each case, start by writing a SQL query that returns a set of pairs (k, f(k)), where k=1,2,3,... and f(k) is the number of authors that have exactly k collaborators (for the first histogram) or exactly k publications (for the second histogram). Retrieve the results back into your program. From there, either output a CSV file and import it into Excel, or use whatever other method you like to produce the graphs.
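
As a sketch of the Python route, assuming the psycopg2 driver and the hypothetical Coauthor(author, coauthor) table sketched in section 4 (adapt the query and names to your own schema):

    # sketch only (Python 2.7): assumes the psycopg2 driver is installed,
    # and that a Coauthor(author, coauthor) table exists in your schema
    import psycopg2

    conn = psycopg2.connect(dbname='dblp')
    cur = conn.cursor()

    # histogram of the number of collaborators: pairs (k, f(k))
    cur.execute("""
        select cnt as k, count(*) as fk
        from (select author, count(*) as cnt
              from Coauthor
              group by author) t
        group by cnt
        order by cnt
    """)

    # write a CSV file that Excel (or any plotting tool) can import
    out = open('collaborators.csv', 'w')
    out.write('k,f(k)\n')
    for k, fk in cur.fetchall():
        out.write('%d,%d\n' % (k, fk))
    out.close()

    cur.close()
    conn.close()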

For each of the two histograms, indicate whether the distribution is exponential, a power law, or something else. You may want to try either a log scale or a log-log scale in your plot to best determine the type of law. Turn in the plots using the scale that best illustrates your finding (e.g. if you used a log-log scale to determine the type of distribution, then your plot should be in log-log).

You should turn in your answer in solution.sql (where you write the two SQL queries) and in graph.pdf (where you show the scatter plot graphs). Also, please turn in your completed Java or Python program.


6. Extra credit.

Find out something interesting or surprising in DBLP. Write the SQL query (or queries) and turn them in, together with a brief description of your finding and why it is surprising.

If you decide to do the extra credit, turn in a file called extracredit.sql