CSED 516: Class Project

The class project is open-ended. The only requirement is that it uses a big data system.

You can work on the project individually, or form a team of two students.

There are no late days for the project.

Milestones

0. Form Teams: October 14

Please fill out this form (even if you plan on working solo): Sign Up

If you work in a team of two students, only the first team member needs to submit the form, proposal, milestone, and final report (described below).

1. Project Proposal (1 page): October 28

Submit via GitLab, in the project directory; e.g., call your file proposal.pdf.

Your proposal must include your name(s) and a project title (it's OK to change it later).

Describe very briefly what you want to do:

  1. What question are you planning to address?
  2. What system are you using? Redshift/Snowflake/Spark/SQL Server/Postgres/etc.? Do you have access to it?
  3. What data are you planning to use? Do you have access to it? If you know the details, describe briefly how large it is, how many files it comes in, what format it is in (CSV/JSON/XML/whatever), and whether you have thought about how you will import it into the system.
  4. What do you hope to report in your project? For example, a graph showing the runtime as a function of the data size, or a bar chart showing the runtimes for 10 queries with feature X enabled and disabled.
  5. If you work in a team of two, tell us about how you are planning to split the work.

2. Project Milestone (2-3 pages): November 25

Submit via GitLab, in the project directory; e.g., project-milestone.pdf.

This is a preliminary draft of your final report. Summarize what you did already and what you are still planning to do. If you have any preliminary graphs, show them! You can change them later. Mention how you split the work, in case you are working in a team.

I am planning to meet with each of you individually shortly after you submit the milestone.

3. In-class presentation (5 minutes): December 6

On December 6, we will have a mini-conference where each team gets to present their project. This may take up to 4 hours, perhaps even more, so please plan ahead. Stay tuned for details.

4. Project Final Report (4-5 pages): December 9

Submit via GitLab, in the project directory; e.g., project-final-report.pdf.

The final project report should have the following sections:


Suggestions

The goal of the project is for you to build some experience with scaling up computations to large data sets. Normally, your project will include some form of performance evaluation, for example comparing two different systems, or evaluating how a particular system scales with the size of the data. Sometimes systems do not scale, and then it's interesting to find out why, and whether we can fix that.

Below are some suggestions; some are a bit old, so use them for inspiration only. Also, it's OK if several people/teams choose the same project; it will be fun to compare the approaches and the results.

Benchmarking

Here you will choose your favorite big data system X (where X = Spark on AWS, X = Spark on Azure, X = Snowflake, X = Redshift, X = BigQuery, ...) and test something on this system. For example, you can benchmark the system's performance on a specific task (e.g., ingesting data, making backups, rescaling a cluster) or on a new set of queries (e.g., run queries on a different dataset, or run new types of queries). Another good type of project is to try a system that we did not have a chance to try in class and report what you find in terms of ease of use and performance, for example Microsoft Azure (ask the instructor for credits).
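Whatever you benchmark, keep the measurement methodology honest: do warm-up runs, repeat each trial several times, and report a robust statistic like the median. A minimal harness sketch in Python; the `run_query` callable here is a placeholder for however you actually submit a query to your chosen system:

```python
import statistics
import time

def bench(run_query, warmups=1, reps=5):
    """Median wall-clock time of run_query() over reps trials.

    Warm-up runs are executed first and discarded, so cold-cache
    effects don't contaminate the steady-state numbers (measure
    those separately if caching is what you're studying).
    """
    for _ in range(warmups):
        run_query()
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Example: time a trivial local computation standing in for a query.
t = bench(lambda: sum(x * x for x in range(100_000)))
```

In a real project you would swap the lambda for a function that runs a query on your system and blocks until the result is returned.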

Sampling

Suppose you have to compute a query, say join-group-by-aggregate. The query takes too long, and/or costs too much, to run on system X. Instead, you want to sample from your base tables, run the query on the sample, and return an estimate of the answer. (This is called Approximate Query Processing, or AQP.) In this project you would explore X's sampling capabilities. What sampling primitives does it support? Are they efficient? Are they guaranteed to be uniform? (Sometimes database systems "cheat" and sample entire blocks instead of individual records, for efficiency; how does X work?) How effective is sampling in answering your big join-group-by-aggregate query? Does it save any money?
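The basic idea of a scaled estimate can be sketched in a few lines. This uses SQLite as a local stand-in (an assumption; the cloud systems above have their own sampling syntax, e.g. TABLESAMPLE variants) and selects each row with probability p ≈ 1%, then scales the sampled sum by 1/p:

```python
import random
import sqlite3

# Stand-in for a big table: 100k rows with a numeric column v.
random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (x INTEGER, v REAL)")
conn.executemany("INSERT INTO r VALUES (?, ?)",
                 [(i, random.random()) for i in range(100_000)])

# Exact answer, for comparison.
exact = conn.execute("SELECT SUM(v) FROM r").fetchone()[0]

# ~1% uniform row-level sample: SQLite's random() returns a 64-bit
# integer, so "random() % 100 = 0" keeps roughly 1 row in 100.
p = 0.01
sampled = conn.execute(
    "SELECT SUM(v) FROM r WHERE random() % 100 = 0").fetchone()[0]

# Scale the sampled sum by 1/p to estimate the true sum.
estimate = sampled / p
```

On a real system you would replace the `random() % 100` trick with the system's native sampling primitive and check whether it is row-level or block-level.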

Skew

Distributed query processing becomes much harder when there is skew. Study how system X copes with skew in query processing. For example, when computing the join R(x,y) JOIN S(y,z), what happens if y is skewed in either R or S? Some things to expect: (a) the query runs out of memory and crashes; (b) the query doesn't crash but takes significantly longer; or (c) it runs as efficiently as on non-skewed data. Extend this to more complex queries: multiway joins, and/or add selection predicates.
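To see why skew hurts, recall that distributed joins typically hash-partition rows on the join key, so all rows with the same key land on the same worker. A small simulation (a sketch, with worker count and key distribution chosen arbitrarily) shows the load imbalance a single hot key creates:

```python
import random
from collections import Counter

random.seed(0)
WORKERS = 8
N = 100_000

def max_partition(keys):
    # Hash-partition each key to a worker; return the heaviest load.
    loads = Counter(hash(k) % WORKERS for k in keys)
    return max(loads.values())

# Uniform keys: every worker gets roughly N / WORKERS rows.
uniform_keys = [random.randrange(10_000) for _ in range(N)]

# Skewed keys: 50% of rows share one hot key, the rest are uniform.
skewed_keys = [0 if random.random() < 0.5 else random.randrange(10_000)
               for _ in range(N)]

print(max_partition(uniform_keys))  # close to N / WORKERS = 12,500
print(max_partition(skewed_keys))   # over N / 2: one worker gets the hot key
```

The worker that receives the hot key does roughly half the total work by itself, which is exactly the situation where a real system either spills, crashes, or quietly becomes the straggler.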

Comparing X to Y

Compare system X to system Y. For example, you may want to compare the performance of Spark on AWS with the performance of Spark on Azure; or compare SQL on Redshift with SQL on Snowflake, and/or with SQL on Azure. Find some benchmarks to do the comparison. For SQL, the best-known benchmark is TPC-H (and there are other TPC variants), as well as the newer Join Order Benchmark (which uses IMDB data). If you choose Spark, look for benchmarks for Spark (or design/adapt your own). Feel free to adapt any benchmark as you see fit, e.g., simplify it, or drop some queries that would take more effort to run on X or on Y.

ML in Big Data

A typical ML algorithm runs gradient descent, or stochastic gradient descent, until convergence (or for a fixed number of iterations). One gradient descent step can be expressed in SQL or in Spark as an aggregate query (with or without group-by). Evaluate the effectiveness of running gradient descent on a big data system, e.g., in SQL or in Spark, and compare this to the standard approach of using Python for gradient descent. Which one scales better? (One expects system X, which is usually distributed, to scale better.) Which one is faster? How does this depend on the size of the data? (Perhaps Python is faster for small training data but crashes when the training data reaches a certain size.) Your goal here should be to express a very simple ML algorithm (e.g., linear regression or logistic regression) using a SQL query: one step of gradient descent requires computing a sum over the data, so write that sum as a SQL query, update the model parameters, then repeat the gradient descent step. Some references to (much more advanced) uses of SQL for ML tasks are here and here.
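The "gradient step as an aggregate query" idea can be sketched concretely. This minimal example (using SQLite in place of a real big data system, with one feature and no intercept, both simplifying assumptions) fits y ≈ w·x by least squares, where each gradient step is a single SUM query over the training table:

```python
import random
import sqlite3

# Synthetic training data: y = 3x + small Gaussian noise.
random.seed(0)
n = 10_000
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO train VALUES (?, ?)",
    [(x, 3.0 * x + random.gauss(0, 0.1))
     for x in (random.random() for _ in range(n))])

w, lr = 0.0, 0.5
for _ in range(100):
    # One gradient step of the least-squares loss (1/n) * sum (w*x - y)^2,
    # expressed as a single aggregate query with w bound as a parameter.
    (grad,) = conn.execute(
        "SELECT 2.0 * SUM(x * (? * x - y)) / ? FROM train", (w, n)
    ).fetchone()
    w -= lr * grad

print(round(w, 2))  # converges close to the true slope, 3.0
```

The driver loop stays outside the database; only the data-dependent sum runs as SQL, which is the part that benefits from a distributed engine when the training table is large.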

Exploring an interesting system or dataset

Yago is one of the best known open-source knowledge graphs; see here. Feeling adventurous? Download Yago 4 / English Wikipedia (about 2-3GB compressed), store it in a relational database (I recommend using a cloud database, e.g., Redshift, Snowflake, BigQuery, or SQL Azure), then run some interesting queries that Google cannot answer. Suggestions (I don't know which of these is even answerable from the data):

  1. Who won both a Turing Award and an Oscar?
  2. Has anyone published a paper with both Erdos and Einstein?
  3. Who are the actors who acted in a movie directed by their spouse? (Or their father? Or their child?)
  4. Was there any parent and child who both won the Nobel Prize? (Or an Oscar; or the Fields Medal; or any major award X.) What about a grandparent/parent/child triple?
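To give a feel for the shape of such queries: Yago ships as graph-style triples, which you would flatten into relational tables during import. The tiny `parent_of` / `won` tables below are a hypothetical schema invented for illustration; the "parent and child both won the Nobel Prize" question then becomes a three-way join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: the real Yago data is a graph of triples,
    -- which you would flatten into tables like these on import.
    CREATE TABLE parent_of (parent TEXT, child TEXT);
    CREATE TABLE won (person TEXT, award TEXT);
    INSERT INTO parent_of VALUES ('Marie Curie', 'Irene Joliot-Curie');
    INSERT INTO won VALUES
        ('Marie Curie', 'Nobel Prize'),
        ('Irene Joliot-Curie', 'Nobel Prize'),
        ('Albert Einstein', 'Nobel Prize');
""")

# Parent/child pairs who both won the Nobel Prize.
pairs = conn.execute("""
    SELECT p.parent, p.child
    FROM parent_of p
    JOIN won w1 ON w1.person = p.parent AND w1.award = 'Nobel Prize'
    JOIN won w2 ON w2.person = p.child  AND w2.award = 'Nobel Prize'
""").fetchall()

print(pairs)  # [('Marie Curie', 'Irene Joliot-Curie')]
```

The grandparent/parent/child variant just adds one more self-join on `parent_of`; at Yago scale, how well the system handles these self-joins is itself something worth measuring.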

Salesforce open-sourced an interesting library for automating data prep and feature extraction: TransmogrifAI. See also here. Try to upload a large knowledge graph into an RDBMS and run some interesting queries. Free, large KGs are hard to find: Facebook apparently has an open-source knowledge graph, FB15K (but I couldn't find it). Semantic Scholar makes their KG available: it's great, but, warning, it's big!


General Advice

The best projects will ask a question and will vary one or more parameters to see how performance changes. We expect each final report to include at least one graph with some performance numbers (you can have more than one graph). Beyond the performance graph(s), you can also include a qualitative discussion based on what you observed and what you read in the documentation. The scale of the final project should be similar to the scale of the other three assignments.


Examples of Past Projects

Presentations from 2019 are here.

Selected final reports from 2019: