CSE 544 Homework 1: Data Analytics Pipeline
- Objectives:
- To get familiar with the main components of the data analytic pipeline: schema design, data acquisition, data transformation, querying, and visualizing.
- Assignment tools:
- postgres, excel (or some other tool for visualization)
- Due date:
- Monday, October 17, 2015, at 11:00pm. Turn it in here.
- What to turn in:
- These files: pubER.pdf, solution.sql, graph.pdf, and, optionally, extracredit.sql. Your solution.sql file should be executable using the command psql -f solution.sql
- Starter code:
- Here.
IMPORTANT: Please keep track of how much time the assignment is taking. If you find yourself spending an overly large amount of time on this assignment, please let us know! Similarly, if you find that some parts of this assignment are not contributing to your learning, please let us know as well.
In this homework you will, you will go through a very basic data analysis pipeline, i.e. data acquisition, extraction, cleaning, analysis and sharing of results. For the purpose of this assignment you will analyze publication data. The data is DBLP, the reference citation website created and maintained by Michael Ley. The analysis will be done in postgres. The visualization in excel, or any other tool of your choice.
1. Schema Design
Design and create a database schema about publications. We will refer to this schema as PubSchema, and to the data as PubData.
- E/R Diagram. Design the E/R diagram, consisting of the following entity sets and relationships.
- Author - has the following attributes: id (a key; must be unique), name, and homepage (a URL).
- Publication - has the following attributes: pubid (the key -- an integer), pubkey (an alternative key, text; must be unique), title, and year. It has the following subclasses:
- Article - has the following extra attributes: journal, month, volume, number
- Book - has the following extra attributes: publisher, isbn
- Incollection - has the following extra attributes: booktitle, publisher, isbn
- Inproceedings - has the following extra attributes: booktitle, editor
- There is a many-many relationship Authored from Author to Publication.
Draw the E/R diagram for this schema. Identify all keys in all entity sets, and indicate the correct type of all relationships (many-many or many-one); make sure you use the ISA box where needed. While there is no "optimal" answer here, but aim to produce a textbook-grade perfect scheme.
Refer to Chapter 2, "Introduction to Database Design," and Chapter 3.5, "Logical Database Design: ER to Relational" in R&G if you need additional references.
Please turn in your E/R Diagram in a file pubER.pdf
- Install postgres on your computer. Download
postgres from http://www.postgresql.org/download/.
Note that postgres is a client/server database system, which means that
your server needs to start when your OS boots, and must be active at
all times. Normally, this happens automatically, but your luck may vary
depending on your OS; if you get stuck, ask the staff. Also, check the README
file: for example, the Mac OS X distribution asks you to increase your shared
memory. Do as they say. If you plan to run postgres from the command line,
then make sure that you PATH variable points to the right postgres
installation: you must be able to run these commands: createdb, dropdb, psql.
Alternatively, you can use pgadmin, which has a UI. Depending on your luck, the installation may go perfectly smoothly, or you may have to uninstall and reinstall a few times and get your fingers very dirty; treat this as a great learning experience.
*NEW* An alternative (and possibly simpler way) to install postgres as an application: http://postgresapp.com/.
Note to Windows users: Depending on the installation, remember the username you set up during the installation process. Once installed, to run any postgres command, you will need to input the "-U username" parameter after each command (where you would replace username with the name you set up). For example, to start postgres you would need to type "psql -U username". Also each time you start the postgres interactive shell, make sure you set the correct encoding by typing "SET CLIENT_ENCODING TO 'UTF8';" (without the quotes).
- Create a Database. Create a database called dblp by running the command:
createdb dblp
If for any reason you need to delete the entire database and restart, run:
dropdb dblp
To run queries in postgres, type
psql dblp
then type in your SQL commands. Remember three special commands:
\q -- quit (exit psql)
\h -- help
\? -- help for internal commands
- Create the Publication Schema in SQL. We will refer below to this schema as PubSchema. Write appropriate statements:
create table Author ( . . . ); . . .
that implement the E/R diagram above, then run them in postgres. For you simplicity, you may choose all your data types to be int and text only. You need to create keys, foreign keys, and unique constraints; you may either do it now, or postpone this for later. Also, do NOT use the inherit or pivot functionality provided by postgres when creating these tables. To run your commands you can do one of the following. Either start postgres interactively (as explained above), then cut/paste your SQL commands one by one; you still need to store these commands in a separate file (solution.sql), which you have to turn in. Or, write the commands in a file createPubSchema.sql then type:
psql -f createPubSchema.sql dblp
If for any reason you need to delete the tables and restart, type:
drop table Author;
Drop them in the reverse order you created (otherwise you will violate foreign key constraints), or type drop table Author cascade;
Turn in all commands in solution.sql.
2. Data Acquisition
Typically, this step consists of downloading data, or extracting it with a software tool, or inputting it manually, or all of the above. Then it involves writing and running some python script, called a wrapper that reformats the data into some CSV format that we can upload to the database. You need to download the DBLP data yourself; we have already provided the python script for you. Please use python 2.7.x as we have not tested the script with python 3 or above.- Import DBLP into postgres.
- Download dblp.dtd and dblp.xml (300MB gziped as of Oct 2015) from
http://dblp.uni-trier.de/xml/.
We don't need dblp.dtd, but the SAX parser in the python wrapper needs it, so you better download
it.
Before you proceed, make sure you understand dblp.xml. Look inside by typing:
more dblp.xml
The file looks like this. There is a giant root element:
<dblp> . . . . </dblp>
Inside there are publication elements:
<article> . . . </article>
<inproceedings> . . . </inproceedings>
etc
Inside each publication element there are fields:
<author> . . . </author>
<title> . . . </title>
<year> . . . </year>
etc
- Run wrapper.py, which is provided in the starter code.
(You may have to first edit wrapper.py appropriately to point to the
correct location of dblp.xml file, and of the output files, pubFile.txt
and fieldFile.txt.)
python wrapper.py
This step takes several minutes and produces two large files: pubFile.txt and fieldFile.txt.
Before you proceed, make sure you understand what happened during this step. Note that we are using two tools here:- python: if you don't know it already, you should learn it quickly, and this homework is a great excuse. There are zillions of python tutorials on the Web. Here's a brief one: http://hetland.org/writing/instant-python.html.
- a python XML SAX parser, which is a simple, event driven parser. The
advantage of a SAX parser (over a DOM parser) is that it can process the XML
data in a streaming fashion, without storing it in main memory. For a quick
illustration, see Example 1-1 in http://www.cafeconleche.org/books/xmljava/chapters/ch09s11.html.
A SAX application like wrapper.py needs to be written to process nested
elements in a streaming fashion. For each publication element, like
<article>...</article>, wrapper.py writes one
line into pubFile.txt, and for each field element, like
<year>...</year>, it writes one line into fieldFile.txt.
Notice how startElement handles differently a publication element
from a field element; also notice that most of the useful work (writing to
the files) is done by endElement.
Also, look inside pubFile.txt and fieldFile.txt by typing:
more pubFile.txt
more fieldFile.txt
These are tab-separated files, ready to be imported in postgres. - Run createRawSchema.sql, which is provided in the starter code. First, modify it with the absolute file path to pubFile.txt and fieldFile.txt. (The path must be absolute, because the copy command is executed by the server, not by the client.) Then run:
psql -f createRawSchema.sql dblp
This imports the data to postgres. It creates two tables, Pub and Field, which we call the RawSchema, and we call their data the RawData.
Before you proceed, make sure you understand what you did. Inspect createRawSchema.sql: you should understand every single bit of this file. Also, start an interactive postgres by typing this command:
psql dblp
and type in some simple queries, like:
select * from Pub limit 50;
select * from Field limit 50;
Here is one way to play with this data, in its raw format. Go to DBLP and check out one of your favorite papers, then click on the Bibtex icon for that paper. Say, you check out Hank Levy's DBLP entry, and click on your favorite paper:
@inproceedings{DBLP:conf/uss/GeambasuKLL09, author = {Roxana Geambasu and Tadayoshi Kohno and Amit A. Levy and Henry M. Levy}, title = {Vanish: Increasing Data Privacy with Self-Destructing Data}, booktitle = {USENIX Security Symposium}, year = {2009}, pages = {299-316}, ee = {http://www.usenix.org/events/sec09/tech/full_papers/geambasu.pdf}, crossref = {DBLP:conf/uss/2009}, bibsource = {DBLP, http://dblp.uni-trier.de} }
The key of this entry is conf/uss/GeambasuKLL09. Use it in the SQL query below:
select * from Pub p, Field f where p.k='conf/uss/GeambasuKLL09' and f.k='conf/uss/GeambasuKLL09';
- Download dblp.dtd and dblp.xml (300MB gziped as of Oct 2015) from
http://dblp.uni-trier.de/xml/.
We don't need dblp.dtd, but the SAX parser in the python wrapper needs it, so you better download
it.
- Write SQL Queries to answer the following questions, using directly the RawSchema:
- For each type of publication, count the total number of publications of that type. Your query should return a set of (publication-type, count) pairs. For example (article, 20000), (inproceedings, 30000), ... (not the real answer).
- We say that a field "occurs" in a publication type, if there exists at least one publication of that type having that field. For example, "publisher occurs in incollection", but "publisher does not occur in inproceedings" (because no inproceedings entry with a publisher). Find the fields that occur in all publications types. Your query should return a set of field names: for example it may return title, if title occurs in all publication types (article, inproceedings, etc. notice that title does not have to occur in every publication instance, only in some instance of every type), but it should not return publisher (since the latter does not occur in any publication of type inproceedings).
- Your two queries above may be slow. Speed them up by creating appropriate indexes, using the CREATE INDEX statement. You also need indexes on Pub and Field for the next question; create all indices you need on RawSchema at this point.
You should turn in your answer in solution.sql. The answer consists of SELECT-FROM-WHERE queries (you can use GROUP BY as well) and CREATE INDEX statements. In addition, insert into the file all answers to the queries, in form of SQL comments.
3. Data Transformation.
Transform the DBLP data from RawSchema to PubSchema. Your transformation will consist of several SQL queries, one per PubSchema table. For example, to populate your Article table, you will likely run a query like:
insert into Article (select ... from Pub, Field ... where ...)
Since PubSchema is a well designed schema (you designed it yourself!), you will need to go through some trial and error to get the transformation right: use SQL interactively to get a sense of RawData, and find how to map it to PubData. We give you a few hints:
- You may create temporary tables (and indices) to speedup the data transformation. Remember to drop all your temp tables when you are done.
- Databases are notoriously inefficient at bulk inserting into a table that contains a foreign key, because they need to check the foreign key constraint after each insert. Hint: do not declare foreign keys in PubSchema; instead, populate the tables first, then run the ALTER TABLE command (see \h ALTER TABLE in postgres). Way faster...
- PubSchema requires you to generate an integer key for every author, and for every publication. Use a sequence. For example, try this and see what happens:
create table R(a text);
insert into R values ('a');
insert into R values ('b');
insert into R values ('c');
create table S(id int, a text);
create sequence q;
insert into S (select nextval('q') as id, a from R);
drop sequence q;
select * from S;
- DBLP knows the Homepage of some authors, and you need to store these in the Author table. But where do you get the homepages from the RawData? DBLP uses a hack. Some publications of type www are not publications, but instead represent homepages. For example Hank's official name in DBLP is 'Henry M. Levy'. Here's how you find out his homepage (this command must run very fast, in 1 second or less; if it doesn't, then you are missing some indexes):
select z.* from Pub x, Field y, Field z where x.k=y.k and y.k=z.k and x.p='www' and y.p='author' and y.v='Henry M. Levy';
Get it? Now you know Hank's homepage. However, you are not there yet. Some www entries are not homepages, but are real publications. Try this:
select z.* from Pub x, Field y, Field z where x.k=y.k and y.k=z.k and x.p='www' and y.p='author' and y.v='Dan Suciu';
Your challenge is to find out how to identify each author's correct Homepage. (A small number of authors have two correct, but distinct homepages; you may choose any of them to insert in Author).
- What if a publication in RawData has two titles? Or two publishers? Or two years? (You will encounter duplicate fields, but not necessarily these ones.) Your PubSchema is textbook-perfect, and does not allow multiple attributes or other nonsense; if you try inserting, should get an error at some point. There are only few repeated fields, but they prevent you from uploading PubSchema, so you must address them. It doesn't matter how you resolve these conflicts, but your data should load into PubSchema correctly.
- Once you are done loading PubData, make sure you add all foreign keys and unique constraints that you have omitted for performance reasons. Hint: use ALTER TABLE.
Turn in several INSERT SQL, CREATE TABLE, ALTER TABLE, ... statements, in solution.sql.
4. Queries.
Finally, now comes the fun part. Write SQL queries to answer the following questions:
- Find the top 20 authors with the largest number of publications. Runtime: under 10s.
- Find the top 20 authors with the largest number of publications in STOC. Repeat this for two more conferences, of your choice (suggestions: top 20 authors in SOSP, or CHI, or SIGMOD, or SIGGRAPH; note that you need to do some digging to find out how DBLP spells the name of your conference). Runtime: under 10s.
- The two major database conferences are 'PODS' (theory) and 'SIGMOD Conference' (systems). Find (a) all authors who published at least 10 SIGMOD papers but never published a PODS paper, and (b) all authors who published at least 5 PODS papers but never published a SIGMOD paper. Runtime: under 10s.
- A decade is a sequence of ten consecutive years, e.g. 1982, 1983, ..., 1991. For each decade, compute the total number of publications in DBLP in that decade. Hint: for this and the next query you may want to compute a temporary table with all distinct years. Runtime: under 1minute.
Note: we are looking for ANY period of ten consecutive years, not just those that start and end with 0's.
- Find the top 20 most collaborative authors. That is, for each author determine its number of collaborators, then find the top 20. Hint: for this and some question below you may want to compute a temporary table of coauthors. Runtime: a couple of minutes.
- Extra credit: For each decade, find the most prolific author in that decade. Hint: you may want to first compute a temporary table, storing for each decade and each author the number of publications of that author in that decade. Runtime: a few minutes.
- Extra credit: Find the institutions that have published most papers in STOC; return the top 20 institutions. Then repeat this query with your favorite conference (SOSP or CHI, or ...), and see which are the best places and you didn't know about. Hint: where do you get information about institutions? Use the Homepage information: convert a Homepage like http://www.cs.washington.edu/homes/levy/ to http://www.cs.washington.edu, or even to www.cs.washington.edu: now you have grouped all authors from our department, and we use this URL as surrogate for the institution. Google for substring, position and trim in postgres.
Turn in your answer in the file called solution.sql.
5. Using a DBMS from Python or Java and Data Visualization.
The goal of this final question is to ensure that you are comfortable writing an application that uses a DBMS as a back-end. This question requires that you write a small amount of Java or Python code. You can choose the language that you prefer.
You may find the following documentation helpful when completing this question:
Java documentation:
- Chapters 6 ("Database Application Development") and 7("Internet Applications") of the book by Ramakrishnan and Gehrke (R&G), Third Edition. Chapter 6 includes an introduction to JDBC with examples.
- JDBC Documentation
- The PostgreSQL JDBC Interface documentation.
- An example to get you started.
Python documentation:
- Documentation for accessing PostgreSQL with Python
- For sample python code, see pythonpsql.py in the starter code provided
From your Java or Python program, connect to your database, submit queries to compute two histograms: the histogram of the number of collaborators, and the histogram of the number of publications. In each case, start by writing a SQL query that returns a set (k, f(k)), where k=1,2,3,... and f(k) = number of authors that have exactly k collaborators (for the first histogram) and f(k) = number of authors that have k publications (for the second histogram). Retrieve the results back into your program. From there, either output a CSV file and import it into Excel or use whatever other method you like to produce the graphs.
For each of the two histograms, indicate whether the scatter plot is exponential, or a power law, or something else. You may want to try either a log scale, or a log-log scale in your plot to best determine the type of law. Turn in the plots using the best scale that illustrates your finding (e.g. if you used a log-log scale to determine the type of distribution, then your plot should be in log-log).
You should turn in your answer in solution.sql (where you write the two SQL queries) and in graph.pdf (where you show the scatter plot graphs). Also, please turn in your completed Java or Python program.
6. Extra credit.
Find out something interesting, or surprising from DBLP. Write the SQL query (or queries), and turn it in in the solution.sql file, together with a brief description of your finding and why it is surprising.
If you decide to do the extra credit, turn in a file called extracredit.sql