CSE 544 Homework 1: Data Analytic Pipeline
- Objectives:
- To get familiar with the main components of the Data Analytic pipeline: schema design, data acquisition, data transformation, querying, and visualizing.
- Assignment tools:
- postgres, excel (or some other tool for visualization)
- Due date:
- Sunday, April 8, 2012, at 11:59pm. Turn it in here.
- What to turn in:
- These files: pubER.pdf, solution.sql, graph.pdf, and, optionally, extracredit.sql. Your solution.sql file should be executable using the command psql -f solution.sql
- Starter code:
- Here.
In this homework you will analyze publication data. The data comes from DBLP, the reference citation website created and maintained by Michael Ley. The analysis will be done in postgres; the visualization in Excel, or any other tool of your choice.
1. Schema Design
Design and create a database schema about publications. We will refer to this schema as PubSchema, and to the data as PubData.
- E/R Diagram. Design the E/R diagram, consisting of the following entity sets and relationships.
- Author - has the following attributes: id (a key; must be unique), name, and homepage (a URL).
- Publication - has the following attributes: pubid (the key -- an integer), pubkey (an alternative key, text; must be unique), title, and year. It has the following subclasses:
- Article - has the following extra attributes: journal, month, volume, number
- Book - has the following extra attributes: publisher, isbn
- Incollection - has the following extra attributes: booktitle, publisher, isbn
- Inproceedings - has the following extra attributes: booktitle, editor
- There is a many-many relationship Authored from Author to Publication.
Draw the E/R diagram for this schema. Identify all keys in all entity sets, and indicate the correct type of all relationships (many-many or many-one); make sure you use the ISA box where needed. Aim to produce a textbook-grade perfect schema.
You have to turn in an E/R Diagram in a file pubER.pdf
- Create Database. Install postgres on your computer from http://www.postgresql.org/download/. Then create a database called dblp by running the command:
createdb dblp
If for any reason you need to delete the entire database and restart, run:
dropdb dblp
To run queries in postgres, type
psql dblp
then type in your SQL commands. Remember three special commands:
\q -- quit (exit psql)
\h -- help
\? -- help for internal commands
- Create the Publication Schema in SQL. We will refer below to this schema as PubSchema. Write appropriate statements:
create table Author ( . . . ); . . .
that implement the E/R diagram above, then run them in postgres. For simplicity, you may choose all your data types to be int and text only. You need to create keys, foreign keys, and unique constraints; you may either do it now, or postpone this for later. To run your commands you can do one of the following. Either start postgres interactively (as explained above) and cut/paste your SQL commands one by one; you still need to store these commands in a separate file (solution.sql), which you have to turn in. Or, write the commands in a file createPubSchema.sql and then type:
psql -f createPubSchema.sql dblp
If for any reason you need to delete the tables and restart, type:
drop table Author;
Drop the tables in the reverse order in which you created them (otherwise you will violate foreign key constraints), or type drop table Author cascade;
You should turn in all commands in the file solution.sql.
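To sanity-check the shape of your DDL before running it in postgres, here is a minimal sketch (not the required solution: the column types, the single subclass shown, and the choice to implement the ISA hierarchy by reusing pubid as the subclass key are all illustrative assumptions), run through Python's bundled sqlite3:

```python
import sqlite3

# Illustrative sketch of a few PubSchema tables; int/text types only,
# as the assignment allows. sqlite3 is used here only to check that
# the statements parse -- your real solution runs in postgres.
ddl = """
CREATE TABLE Author (id INTEGER PRIMARY KEY, name TEXT, homepage TEXT);
CREATE TABLE Publication (pubid INTEGER PRIMARY KEY, pubkey TEXT UNIQUE,
                          title TEXT, year INTEGER);
-- one way to implement ISA: the subclass shares the superclass key
CREATE TABLE Article (pubid INTEGER PRIMARY KEY REFERENCES Publication,
                      journal TEXT, month TEXT, volume TEXT, number TEXT);
CREATE TABLE Authored (id INTEGER REFERENCES Author,
                       pubid INTEGER REFERENCES Publication);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['Article', 'Author', 'Authored', 'Publication']
```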
2. Data Acquisition.
This step typically involves downloading data, extracting it with a software tool, inputting it manually, or all of the above. Then it involves writing and running a python script, called a wrapper, that reformats the data into some CSV format that we can upload to the database. You need to download the DBLP data yourself; we have already provided the python script for you.
- Import DBLP into postgres.
- Download dblp.dtd and dblp.xml from http://dblp.uni-trier.de/xml/. We don't need dblp.dtd, but the SAX parser in the python wrapper needs it, so you should download it as well.
Before you proceed, make sure you understand dblp.xml. Look inside by typing:
more dblp.xml
The file looks like this. There is a giant root element:
<dblp> . . . . </dblp>
Inside there are publication elements:
<article> . . . </article>
<inproceedings> . . . </inproceedings>
etc
Inside each publication element there are fields:
<author> . . . </author>
<title> . . . </title>
<year> . . . </year>
etc
- Run wrapper.py, which is provided in the starter code. (You may have to first edit wrapper.py appropriately to point to the correct location of dblp.xml file, and of the output files, pubFile.txt and fieldFile.txt.)
python wrapper.py
This step produces two large files: pubFile.txt and fieldFile.txt.
Before you proceed, make sure you understand what happened during this step. Note that we are using two tools here:
- python: if you don't know it already, you should learn it quickly, and this homework is a great excuse. There are zillions of python tutorials on the Web. Here's a brief one: http://hetland.org/writing/instant-python.html.
- a python XML SAX parser, which is a simple, event driven parser. The advantage of a SAX parser (over a DOM parser) is that it can process the XML data in a streaming fashion, without storing it in main memory. For a quick illustration, see Example 1-1 in http://oreilly.com/catalog/pythonxml/chapter/ch01.html. A SAX application like wrapper.py needs to be written to process nested elements in a streaming fashion. For each publication element, like <article>...</article>, wrapper.py writes one line into pubFile.txt, and for each field element, like <year>...</year>, it writes one line into fieldFile.txt. Notice how startElement handles a publication element differently from a field element; also notice that most of the useful work (writing to the files) is done by endElement.
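Here is a toy analogue of wrapper.py using Python's xml.sax, stripped down to a handful of element names (the real wrapper handles more publication types and fields, and writes to files instead of lists):

```python
import io
import xml.sax

class DblpHandler(xml.sax.ContentHandler):
    """Toy analogue of wrapper.py: one output line per publication element,
    one per field element, emitted from endElement (streaming, no DOM)."""
    PUBS = {"article", "inproceedings", "book", "incollection"}
    FIELDS = {"author", "title", "year"}

    def __init__(self):
        super().__init__()
        self.pub_lines, self.field_lines = [], []
        self.key, self.text = None, ""

    def startElement(self, name, attrs):
        if name in self.PUBS:
            self.key = attrs.get("key")  # publication elements carry the key
        self.text = ""                   # reset the field text buffer

    def characters(self, content):
        self.text += content

    def endElement(self, name):          # the useful work happens here
        if name in self.PUBS:
            self.pub_lines.append(f"{self.key}\t{name}")
        elif name in self.FIELDS:
            self.field_lines.append(f"{self.key}\t{name}\t{self.text}")

sample = b"""<dblp>
  <article key="journals/x/Doe20">
    <author>Jane Doe</author><title>A Paper</title><year>2020</year>
  </article>
</dblp>"""
handler = DblpHandler()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.pub_lines)    # ['journals/x/Doe20\tarticle']
print(handler.field_lines)
```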
Also, look inside pubFile.txt and fieldFile.txt by typing:
more pubFile.txt
more fieldFile.txt
These are tab-separated files, ready to be imported in postgres.
- Run createRawSchema.sql, which is provided in the starter code. First, modify it with the absolute file path to pubFile.txt and fieldFile.txt. (The path must be absolute, because the copy command is executed by the server, not by the client.) Then run:
psql -f createRawSchema.sql dblp
This imports the data to postgres. It creates two tables, Pub and Field, which we call the RawSchema, and we call their data the RawData.
Before you proceed, make sure you understand what you did. Inspect createRawSchema.sql: you should understand every single bit of this file. Also, start an interactive postgres, and type in some simple queries, like:
select * from Pub limit 50;
select * from Field limit 50;
Here is one way to play with this data, in its raw format. Go to DBLP and check out one of your favorite papers, then click on the Bibtex icon for that paper. Say, you check out Hank Levy's DBLP entry, and click on your favorite paper:
@inproceedings{DBLP:conf/uss/GeambasuKLL09,
author = {Roxana Geambasu and
Tadayoshi Kohno and
Amit A. Levy and
Henry M. Levy},
title = {Vanish: Increasing Data Privacy with Self-Destructing Data},
booktitle = {USENIX Security Symposium},
year = {2009},
pages = {299-316},
ee = {http://www.usenix.org/events/sec09/tech/full_papers/geambasu.pdf},
crossref = {DBLP:conf/uss/2009},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
The key of this entry is conf/uss/GeambasuKLL09. Use it in the SQL query below:
select * from Pub p, Field f where p.k='conf/uss/GeambasuKLL09' and f.k='conf/uss/GeambasuKLL09';
- Write SQL Queries to answer the following simple questions, using directly the RawSchema:
- For each type of publication, count the total number of publications of the type. Your query should return a set of (publication-type, count) pairs. For example (article, 20000), (inproceedings, 30000), ... (not the real answer).
- We say that a field "occurs" in a publication type if there exists at least one publication of that type having that field. For example, "publisher occurs in incollection", but "publisher does not occur in inproceedings" (because no inproceedings entry has a publisher). Find the fields that occur in all publication types. Your query should return a set of field names: for example, it may return author and title, if each occurs in all publication types (article, inproceedings, etc.; notice that a field does not have to occur in every publication instance, only in some instance of every type), but it should not return publisher (since the latter does not occur in any publication of type inproceedings).
- Your two queries above may be slow. Speed them up by creating appropriate indexes, using the CREATE INDEX statement. You also need indexes on Pub and Field for the next question; create all indices you need on RawSchema at this point.
You should turn in your answer in the file solution.sql. The answer consists of SELECT-FROM-WHERE queries and CREATE INDEX statements. In addition, insert into the file all answers to the queries, in form of SQL comments.
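For a sense of the shape of the first query, here is a miniature mock of the RawSchema in Python's sqlite3 (the data is made up; in the homework the query runs in postgres over the real Pub and Field tables, where the column names k, p, v match the sample queries above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pub (k TEXT, p TEXT);            -- p = publication type
CREATE TABLE Field (k TEXT, p TEXT, v TEXT);  -- p = field name, v = value
INSERT INTO Pub VALUES ('a1','article'), ('a2','article'),
                       ('c1','inproceedings');
INSERT INTO Field VALUES ('a1','title','T1'), ('a2','title','T2'),
                         ('c1','title','T3'), ('a1','journal','J1');
CREATE INDEX pub_k ON Pub(k);                 -- same idea as the indexes
CREATE INDEX field_k ON Field(k);             -- you must create on RawSchema
""")
# Count publications per type: the (publication-type, count) pairs.
counts = conn.execute(
    "SELECT p, COUNT(*) FROM Pub GROUP BY p ORDER BY p").fetchall()
print(counts)  # [('article', 2), ('inproceedings', 1)]
```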
3. Data Transformation.
Transform the DBLP data from RawSchema to PubSchema. Your transformation will consist of several SQL queries, one per PubSchema table. For example, to populate your Article table, you will likely run a query like:
insert into Article (select ... from Pub, Field ... where ...)
Since PubSchema is a well-designed schema, you will need to go through some trial and error to get the transformation right: use SQL interactively to get a sense of RawData, and find out how to map it to PubData. We give you a few hints:
- You may create temporary tables (and indices) to speed up the data transformation. Remember to drop all your temp tables when you are done.
- Databases are notoriously inefficient at bulk inserting into a table that contains a foreign key, because they need to check the foreign key constraint after each insert. Hint: do not declare foreign keys in PubSchema; instead, populate the tables first, then run the ALTER TABLE command (see \h ALTER TABLE in postgres). Way faster...
- PubSchema requires you to generate an integer key for every author, and for every publication. Use a sequence. For example, try this and see what happens:
create table R(a text);
insert into R values ('a');
insert into R values ('b');
insert into R values ('c');
create table S(id int, a text);
create sequence q;
insert into S (select nextval('q') as id, a from R);
drop sequence q;
select * from S;
- DBLP knows the Homepage of some authors, and you need to store these in the Author table. But where do you get the homepages from the RawData? DBLP uses a hack. Some publications of type www are not publications, but instead represent homepages. For example Hank's official name in DBLP is 'Henry M. Levy'. Here's how you find out his homepage (this command must run very fast, in 1 second or less; if it doesn't, then you are missing some indexes):
select z.* from Pub x, Field y, Field z where x.k=y.k and y.k=z.k and x.p='www' and y.p='author' and y.v='Henry M. Levy';
Get it? Now you know Hank's homepage. However, you are not there yet. Some www entries are not homepages, but are real publications. Try this:
select z.* from Pub x, Field y, Field z where x.k=y.k and y.k=z.k and x.p='www' and y.p='author' and y.v='Dan Suciu';
Your challenge is to find out how to identify each author's correct Homepage. (A small number of authors have two correct, but distinct homepages; you may choose any of them to insert in Author.)
- What if a publication in RawData has two titles? Or two publishers? Or two years? (You will encounter duplicate fields, though not necessarily these ones.) Your PubSchema is textbook-perfect and does not allow multiple attributes or other nonsense; if you try inserting them, you should get an error at some point. There are only a few repeated fields, but they prevent you from loading the data into PubSchema, so you must address them. It doesn't matter how you resolve these conflicts, but your data should load into PubSchema correctly.
- Once you are done loading PubData, make sure you add all foreign keys and unique constraints that you have omitted for performance reasons. Hint: use ALTER TABLE.
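Here is a small sqlite3 sketch of the populate-and-resolve idea (illustrative only: MIN() stands in for whatever duplicate-resolution policy you choose, the Python-side enumerate stands in for the postgres sequence, and the query only keeps publications that have a year):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Field (k TEXT, p TEXT, v TEXT);
-- a publication with a duplicate 'year' field, as warned about above
INSERT INTO Field VALUES ('x1','title','T'),
                         ('x1','year','1999'), ('x1','year','2000');
CREATE TABLE Publication (pubid INT, pubkey TEXT, title TEXT, year INT);
""")
# Resolve the duplicate arbitrarily with MIN(): one row per publication...
rows = conn.execute("""
    SELECT t.k, t.v, MIN(y.v)
    FROM Field t JOIN Field y ON t.k = y.k
    WHERE t.p = 'title' AND y.p = 'year'
    GROUP BY t.k, t.v
""").fetchall()
# ...then assign integer keys, standing in for nextval on a sequence.
for pubid, (k, title, year) in enumerate(rows, start=1):
    conn.execute("INSERT INTO Publication VALUES (?,?,?,?)",
                 (pubid, k, title, year))
print(conn.execute("SELECT * FROM Publication").fetchall())
```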
You should turn in several INSERT, CREATE TABLE, ALTER TABLE, ... statements in the file solution.sql.
4. Queries.
Finally, now comes the fun part. Write SQL queries to answer the following questions:
- Find the top 20 authors with the largest number of publications. Runtime: under 10s.
- Find the top 20 authors with the largest number of publications in STOC. Repeat this for two more conferences, of your choice (suggestions: top 20 authors in SOSP, or CHI, or SIGMOD, or SIGGRAPH; note that you need to do some digging to find out how DBLP spells the name of your conference). Runtime: under 10s.
- The two major database conferences are 'PODS' (theory) and 'SIGMOD Conference' (systems). Find (a) all authors who published at least 10 SIGMOD papers but never published a PODS paper, and (b) all authors who published at least 5 PODS papers but never published a SIGMOD paper. Runtime: under 10s.
- A decade is a sequence of ten consecutive years, e.g. 1982, 1983, ..., 1991. For each decade, compute the total number of publications in DBLP in that decade. Hint: for this and the next query you may want to compute a temporary table with all distinct years. Runtime: under 1 minute.
- For each decade, find the most prolific author in that decade. Hint: you may want to first compute a temporary table, storing for each decade and each author the number of publications of that author in that decade. Runtime: about 1 hour.
- Find the top 20 most collaborative authors. That is, for each author determine their number of collaborators, then find the top 20. Hint: for this and some question below you may want to compute a temporary table of coauthors. Runtime: a couple of minutes.
- Find the institutions that have published the most papers in STOC; return the top 20 institutions. Then repeat this query with your favorite conference (SOSP, or CHI, or ...), and see which top institutions you didn't know about. Hint: where do you get information about institutions? Use the Homepage information: convert a Homepage like http://www.cs.washington.edu/homes/levy/ to http://www.cs.washington.edu, or even to www.cs.washington.edu: now you have grouped all authors from our department, and we use this URL as a surrogate for the institution. Google for the postgres string functions substring, position, and trim.
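Several of the queries above share a classic top-k shape: join, GROUP BY, ORDER BY the count descending, LIMIT. A miniature sqlite3 illustration with made-up data (in the homework the same shape runs in postgres over your PubSchema tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Author (id INT, name TEXT);
CREATE TABLE Authored (id INT, pubid INT);
INSERT INTO Author VALUES (1,'Ann'), (2,'Bob');
INSERT INTO Authored VALUES (1,101), (1,102), (1,103), (2,101);
""")
# Top authors by number of publications: join, group, order, limit.
top = conn.execute("""
    SELECT a.name, COUNT(*) AS cnt
    FROM Author a JOIN Authored t ON a.id = t.id
    GROUP BY a.name
    ORDER BY cnt DESC
    LIMIT 20
""").fetchall()
print(top)  # [('Ann', 3), ('Bob', 1)]
```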
You should turn in your answer in the file called solution.sql.
5. Data Visualization.
Compute and draw two histograms: the histogram of the number of collaborators, and the histogram of the number of publications. In each case, start by writing a SQL query that returns a set (k, f(k)), where k=1,2,3,... and f(k) = number of authors that have exactly k collaborators (for the first histogram) and f(k) = number of authors that have k publications (for the second histogram). Your query should output the result into a CSV file (use the copy, or the \copy command); from there, import it into Excel and draw the scatter plot.
For each of the two histograms, indicate whether the scatter plot is exponential, or a power law, or something else. You may want to try either a log scale, or a log-log scale in your plot to best determine the type of law. Turn in the plots using the best scale that illustrates your finding (e.g. if you used a log-log scale to determine the type of distribution, then your plot should be in log-log).
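The aggregation is a "group-by of a group-by": first count publications per author, then count authors per k. Here is a small Python sketch of the second step and the CSV export (the per-author counts are made up; in the homework they come from your SQL query, and the export uses copy or \copy):

```python
import csv
import io
from collections import Counter

# Hypothetical per-author publication counts (author -> k)
pub_counts = {"Ann": 3, "Bob": 1, "Cal": 3, "Dee": 2}

# f(k) = number of authors with exactly k publications
f = Counter(pub_counts.values())
hist = sorted(f.items())
print(hist)  # [(1, 1), (2, 1), (3, 2)]

# Write the (k, f(k)) pairs as CSV, ready for Excel's scatter plot
buf = io.StringIO()
csv.writer(buf).writerows(hist)
print(buf.getvalue())
```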
You should turn in your answer in solution.sql (where you write the two SQL queries) and in graph.pdf (where you show the scatter plot graphs).
6. Extra credit.
Find out something interesting or surprising in DBLP. Write the SQL query (or queries) and turn it in, together with a brief description of your finding and why it is surprising.
If you decide to do the extra credit, turn in a file called extracredit.sql