CSED 516: Class Project

The final project is open-ended. The only requirement is that it uses a big data system. Watch this space for more updates.

Suggestions

Here are some suggestions for the CSED 516 class project. You can come up with your own project, choose an idea below, or start from an idea below and modify it. It’s OK if several people choose the same project; it will be fun to compare the approaches and the results.

Benchmarking

Here you will choose your favorite big data system X (where X=Spark on AWS, or X=Spark on Azure, or X=Snowflake, or X=Redshift, or X=BigQuery, …) then test something on this system. For example, you can try to benchmark a system's performance on a specific task (e.g., ingesting data, making backups, rescaling a cluster, etc.) or on a new set of queries (e.g., try to run queries on a different dataset or try to run new types of queries). Another good type of project is to try a system that we did not have a chance to try in class and report what you find in terms of ease of use and performance, for example Microsoft Azure (ask the instructor for credits).
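Whichever task you benchmark, measure it the same way on every run and repeat each measurement, since caching and cluster warm-up can dominate a single trial on systems like Spark or Snowflake. A minimal timing harness might look like the sketch below; the workload passed to `benchmark` is a placeholder, and you would substitute a call that submits your query and waits for the result:

```python
import statistics
import time

def benchmark(task, trials=5):
    """Time a zero-argument callable over several trials; return the median
    elapsed seconds. Repeated trials smooth over caching and warm-up effects."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        task()  # placeholder: submit a query and block until the result is back
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Example with a stand-in workload (replace with a real query submission).
elapsed = benchmark(lambda: sum(range(1_000_000)))
```

Reporting the median (rather than the mean) makes the number less sensitive to one unusually slow trial.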

Sampling

Suppose you have to compute a query, say join-group-by-aggregate. The query takes too long, and/or costs too much to run on system X. Instead, you want to sample from your base tables, then run the query on the sample, and return an estimate of the answer. (This is called Approximate Query Processing, or AQP). In this project you would explore X’s sampling capabilities. What sampling primitives does it support? Are they efficient? Are they guaranteed to be uniform? (Sometimes database systems “cheat” and sample entire blocks instead of sampling individual records, for efficiency purposes; how does X work?) How effective is sampling in answering your big join-group-by-aggregate query? Does it save any money?
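To build intuition for the estimates involved, here is a plain-Python sketch of the basic AQP idea, independent of any particular system: sample rows uniformly, aggregate the sample, and scale the result by the inverse sampling fraction. The table and sampling fraction below are toy placeholders:

```python
import random

def estimate_sum(table, sample_frac, seed=0):
    """Estimate SUM(value) over `table` from a uniform row sample.

    Each sampled value is scaled up by 1/sample_frac, so the estimate is
    unbiased -- but only if the sampling is truly uniform over rows,
    which is exactly the property to check in system X."""
    rng = random.Random(seed)
    sample = [v for v in table if rng.random() < sample_frac]
    return sum(sample) / sample_frac

table = list(range(10_000))            # stand-in for a big base table
exact = sum(table)
approx = estimate_sum(table, 0.10)     # 10% uniform sample
error = abs(approx - exact) / exact    # typically a few percent here
```

Note that block-level sampling (sampling whole storage blocks instead of rows) breaks the uniformity assumption that makes this scaling unbiased, which is why it matters how X implements its sampling primitives.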

Skew

Distributed query processing becomes much harder when there is skew. Study how system X copes with skew in query processing. For example, when joining two tables R(x,y) and S(y,z), what happens if y is skewed in R or in S? Some things to expect: (a) the query runs out of memory and crashes; (b) the query doesn’t crash but takes significantly longer; or (c) it runs as efficiently as on non-skewed data. Extend this to more complex queries: multiway joins, and/or added selection predicates.
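To see why skew hurts a hash-partitioned (shuffle) join, here is a small simulation, not tied to any particular system, of how rows land on workers when one join key is hot. The worker count and key distributions are illustrative:

```python
import random
from collections import Counter

def partition_sizes(keys, num_workers):
    """Hash-partition join keys across workers, as a shuffle join would,
    and return the number of rows landing on each worker."""
    counts = Counter(hash(k) % num_workers for k in keys)
    return [counts.get(w, 0) for w in range(num_workers)]

random.seed(0)
uniform = [random.randrange(1000) for _ in range(100_000)]
# Skewed: one "hot" key accounts for 90% of the rows.
skewed = [0] * 90_000 + [random.randrange(1000) for _ in range(10_000)]

max_uniform = max(partition_sizes(uniform, 10))  # roughly 100_000 / 10 rows
max_skewed = max(partition_sizes(skewed, 10))    # one worker gets >= 90_000 rows
```

The slowest worker determines the join's completion time, so the skewed case runs roughly as slowly as a single machine processing 90% of the data (or crashes if that worker runs out of memory), which is exactly the behavior to probe in system X.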

Comparing X to Y

Compare system X to system Y. For example, you may want to compare the performance of Spark on AWS with the performance of Spark on Azure. Or compare SQL on Redshift with SQL on Snowflake, and/or with SQL on Azure. Find some benchmarks to do the comparison. For SQL, the best-known benchmark is TPC-H (and there are other TPC variants), as well as the newer “Join Order Benchmark” (which uses IMDB data). If you choose Spark, look for benchmarks for Spark (or design/adapt your own). Feel free to adapt any benchmark as you see fit, e.g., simplify it, or drop some queries that would take more effort to run on X or on Y.

ML in Big Data

A typical ML algorithm runs gradient descent, or stochastic gradient descent, until convergence (or for a fixed number of iterations). One gradient descent step can be expressed in SQL or in Spark as an aggregate query (with or without group by). Evaluate the effectiveness of running gradient descent on a big data system, e.g. in SQL or in Spark, and compare this to the standard approach of using Python. Which one scales better? (One expects system X, which is usually distributed, to scale better.) Which one is faster? How does this depend on the size of the data? (Perhaps Python is faster for small training data but crashes when the training data reaches a certain size.) Your goal here should be to express a very simple ML algorithm (e.g. linear regression or logistic regression) using a SQL query: one step of gradient descent requires computing a sum over the data, so write that sum as a SQL query, update the parameters, then repeat the gradient descent step. Some references to (much more advanced) uses of SQL for ML tasks are here and here.
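As a toy illustration of the “one step = one aggregate” idea, here is one gradient-descent step for one-dimensional linear regression in plain Python; the SQL query in the comment is illustrative only, assuming a hypothetical table points(x, y) and parameter :w:

```python
def gd_step(data, w, lr=0.1):
    """One gradient-descent step for 1-D linear regression y ~ w*x.

    The gradient is a single aggregate over the data -- the same shape as
    an (illustrative) SQL query like:
        SELECT SUM(2 * (:w * x - y) * x) / COUNT(*) FROM points;
    """
    n = len(data)
    grad = sum(2 * (w * x - y) * x for x, y in data) / n
    return w - lr * grad

data = [(x, 3.0 * x) for x in range(1, 11)]  # toy data with true slope 3
w = 0.0
for _ in range(50):
    w = gd_step(data, w, lr=0.005)
# w converges toward 3.0
```

On a big data system the loop stays on the client and only the aggregate runs server-side, so each iteration is one round trip: run the aggregate query with the current parameter, update the parameter locally, repeat.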

Exploring an interesting system or dataset

Salesforce open sourced an interesting library for automating data prep and feature extraction: TransmogrifAI. See also here. Try to upload a large knowledge graph into an RDBMS and run some interesting queries. Free, large KGs are hard to find: Facebook apparently has an open-source knowledge graph, FB15K (but I couldn't find it). Semantic Scholar makes their KG available: it's great, but, warning, it's big!


Milestones

1. Project Proposal

1 page. Describe very briefly what you want to do. Then answer three questions: (1) what data will you use, and do you already have access to it, or how are you planning to access it? (2) what system are you using (redshift/snowflake/spark/sql-server/postgres/etc.)? (3) what cloud service are you using (AWS/Azure/Google-cloud) and do you have credits for it? Submit via gitlab.

2. Project Milestone

2-3 pages. A preliminary draft of your final report. Summarize what you have done so far and what you are still planning to do; I am planning to meet with each of you individually shortly after you submit the milestone. Submit via gitlab.

3. In-class presentation (5 minutes)

This will be on Dec. 8, and may take up to 4 hours. Stay tuned for details.

4. Project Final Report

A 4-5 page report. The final project report should have the following sections:

Submit via gitlab.


General Advice

The best projects will ask a question and will vary one or more parameters to see how performance changes. We expect each final report to include at least one graph with some performance numbers (you can have more than one graph). Beyond the one graph, you can also include a qualitative discussion based on what you observed and what you read about in the documentation. The scale of the final project should be similar to the scale of the other three assignments.


Examples of Past Projects

Presentations from 2019 are here.

Selected final reports from 2019: