CSE 544 Homework 3: Myria

Objectives:
To get experience with running data analytics on the cloud.
Assignment tools:
Amazon Web Services (AWS), Myria.
Due date:
Friday, November 18th, 2016, at 11:00pm. Turn it in here.
What to turn in:
See below.
NOTE:
You will need a Unix shell (or a Windows equivalent such as Git Bash) and pip for this assignment.

Amazon AWS account

Please follow the instructions on this page to sign up for an account with free credits that you can use for this assignment. Use your CSE / UW email address when you sign up.

Check the instructions. You are learning how to use the Amazon cloud, which is by far the most popular cloud today!

Assignment Goal

In this assignment, you will learn how to deploy and manage a database management system (Myria) on the cloud. Using a cloud platform allows you to scale well beyond the capabilities of local machines or clusters, and you pay only for the time you actually need the compute and storage resources.

Description

In this assignment, we will perform some basic analysis over a genomics/oceanography dataset. For the purpose of this assignment, we will be using just a subset of the entire dataset. The datasets are available on S3; follow this link. NOTE: You don't need to download these datasets yet. Please read through the full instructions.
Myria is a distributed, shared-nothing Big Data management system and Cloud service from the University of Washington. It derives requirements from real users and complex workflows, especially in science.
The Myria service has three major components: a web UI frontend called Myria Web, a REST server that serves as both query optimizer and middleware, and a distributed query execution backend called MyriaX. Myria Web takes MyriaL programs as input. MyriaL is an imperative-yet-declarative high-level data flow language based on the relational algebra that includes support for SQL syntax, iteration, user-defined functions, and familiar language constructs such as set comprehensions. For a detailed description of the MyriaL syntax and examples, visit the MyriaL language reference.
It is strongly advised that you go through the language reference before starting the assignment.

Getting Started

Set up a Myria cluster on EC2: http://myria.cs.washington.edu/docs/myria-ec2.html
After you follow the steps listed on the website, you will see the Myria Web URL on the console. Open this URL in your browser.

At this point, you should be able to run MyriaL programs on Myria Web, for example:
T = LOAD("https://s3-us-west-2.amazonaws.com/uwdb/sampleData/TwitterK.csv", csv(schema(a:int, b:int), skip=0)); -- skip specifies how many lines you need to skip from the head of the csv file.
STORE(T, TwitterK, [a, b]);
What to turn in:
Nothing for this step.

Dataset description:

The datasets you will be using for this assignment are a summarization of information about the biodiversity at various locations (and depths) in the ocean. Each sample contains the genomic sequences of the organisms found in a specific water sample.
For the purpose of this assignment, you are given two such samples:
https://s3-us-west-2.amazonaws.com/cse544data/S0001.csv
https://s3-us-west-2.amazonaws.com/cse544data/S0002.csv

The schema for each sample is (id:string, seq:string), where id is the sequenceId and seq is the corresponding genomic sequence.
Each file also contains a header which you will want to skip while loading the dataset.

Task 1: Compute k-mer counts

Write a query to compute abundances of all possible kmers for each sequence in each sample. A kmer is a substring of length k of the sequence.
The output relation should be partitioned by the attribute [kmer] (see the output schema below).
For example, all possible 5mers for the sequence ATGCGCAT are:
ATGCG, TGCGC, GCGCA, CGCAT
Hint: Use a flatmap. Myria has a built-in flatmap operator for ngram extraction: ngram(col_name, 11).
-- The following code does 15mer extraction.
s = select ngram(seq, 15) as kmer from T1;

Output schema: (sampleid, kmer, count), where sampleid is the same as the sample filename.
For the rest of the assignment, use k=11.
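
To make the kmer definition concrete, here is a minimal local sketch in Python. It is not MyriaL and not the required query; the function names kmers and kmer_counts are hypothetical, chosen for illustration only. It reproduces the 5mer example above with a sliding window and counts kmers across a list of sequences.

```python
from collections import Counter

def kmers(seq, k):
    """Return every length-k substring of seq, in order (sliding window)."""
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_counts(sequences, k):
    """Count kmers over all sequences belonging to one sample."""
    counts = Counter()
    for seq in sequences:
        counts.update(kmers(seq, k))
    return counts

print(kmers("ATGCGCAT", 5))  # ['ATGCG', 'TGCGC', 'GCGCA', 'CGCAT']
```

A sequence of length n contributes n - k + 1 kmers, which is why the 8-character example yields four 5mers.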

Task 2: Compute Normalized abundances

For each sample, compute normalized abundances. The normalized abundance of a kmer is its abundance rescaled to the range 0 to 1.
Output schema (sampleid, kmer, norm_count).
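
The text above does not say which normalization to use. One common reading, assumed here, is to divide each kmer count by the total kmer count of its sample, so that the values within a sample sum to 1. The following Python sketch illustrates that assumption locally; the function name normalize is hypothetical, and this is not the MyriaL query itself.

```python
def normalize(counts):
    """Scale one sample's kmer counts so they sum to 1.

    Assumption: normalization divides by the sample's total count.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: count / total for kmer, count in counts.items()}

print(normalize({"AA": 3, "AC": 1}))  # {'AA': 0.75, 'AC': 0.25}
```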

Task 3: Compute pairwise Bray-Curtis distance

Compute pairwise Bray-Curtis distances for the two samples. Bray-Curtis distance is a measure of dissimilarity between two different samples.
https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity
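
As a rough local illustration of the formula on the linked page, the Bray-Curtis dissimilarity between two abundance vectors a and b can be written as the sum of |a_i - b_i| divided by the sum of (a_i + b_i), taken over every kmer present in either sample. The Python sketch below assumes each sample is a dict from kmer to abundance, with missing kmers treated as 0; the function name bray_curtis is hypothetical, and this is not the MyriaL query itself.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two kmer -> abundance dicts."""
    keys = set(a) | set(b)  # kmers missing from one sample count as 0
    numerator = sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)
    denominator = sum(a.get(k, 0.0) + b.get(k, 0.0) for k in keys)
    # Two empty samples are treated as identical (an assumption).
    return numerator / denominator if denominator else 0.0

s1 = {"AA": 0.75, "AC": 0.25}
s2 = {"AA": 0.25, "AG": 0.75}
print(bray_curtis(s1, s2))  # 0.75
```

The result is 0 for identical samples and 1 for samples that share no kmers.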

Extra Credit:

Discuss the scaling issues you will face as the value of k increases.
How would you address these?