HW 3

Tuesday, November 13th at 11:59pm

Objective: Become familiar with Apache Spark. Learn how to use Spark to perform data management tasks and simple data analysis tasks.

Assignment tools: Apache Spark on Amazon Elastic MapReduce.

What to turn in: Turn in code as well as results (with plots) as listed in each question.

Assignment Details

Application

In this assignment, we will work with astronomy data. We first give an overview of the application that motivates the analysis we will perform in the assignment.

Large-scale astronomical imaging surveys (e.g., the Sloan Digital Sky Survey) collect databases of telescope images. A key analysis is to extract sources (galaxies, stars, quasars, etc.) from these images. While extracting sources is a simple process, classifying these sources into object types is difficult. An example of this type of classification is the use of multiple-wavelength observations to separate high-redshift quasars (i.e., galaxies powered by a central black hole) from stars within our Galaxy. Given that both quasars and stars are point sources (i.e., they cannot be distinguished from a single image alone) and that there are 400 times more stars than quasars within a data set, the accuracy of this classification determines the success of finding some of the highest-redshift sources in the universe.

The Gaussian Mixture Model (GMM) algorithm can serve as a foundation to perform this classification. It is a clustering algorithm that will help us group sources into different types. We describe this algorithm below.

In this assignment, we will use as input a table that contains sources extracted from telescope images and their features. We will apply the GMM algorithm to build a model of those sources.

For more details about the application, please refer to [RM15].

Data

The data takes the form of a table with five attributes:

    ID                   X         Y        Z      W
    1237661088029606015  0.575469  1.37509  1.941  -0.0360003
    1237661088024822536  1.00735   3.06909  3.701  -0.059
    1237661088024822606  1.4684    2.5072   3.184  -0.105
    1237661088024887302  0.761256  1.44754  1.356  -0.0959997

The first attribute, 'ID', is a unique identifier for each source. Each of the other four attributes, X, Y, Z, and W, is a measurement for that source. Thus, each column is a type of measurement and each row is a source.

Algorithm

In our application, we have points (i.e., sources) in a 4D space, and we want to cluster them to identify different types of sources. Several clustering algorithms could be used; GMM has been shown to work well for this type of application, so it is the one we will use.

A Gaussian mixture model represents a distribution as a mixture of K separate multivariate normal distributions, each with three parameters: a mean, a covariance, and a weight. We speak of covariance rather than variance or standard deviation because our data lives in a 4D space: the mean is a 4D vector and the covariance is a 4x4 matrix. The mean and covariance describe the position and shape of a Gaussian, while the weight captures its relative amplitude in the model. The weights of the Gaussians in the model sum to 1.0. GMM assumes all the data points are generated from this mixture of Gaussian distributions.
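
In symbols, a mixture with K components having weights w_k, means mu_k, and covariance matrices Sigma_k models the density of a point x as

    p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} w_k = 1,

where \mathcal{N}(x \mid \mu_k, \Sigma_k) denotes the multivariate normal density with mean \mu_k and covariance \Sigma_k.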

There is no analytic solution for estimating the parameters of a mixture of Gaussians from a sample data set. The standard approach is to iteratively maximize the likelihood function of the mixture model with the Expectation-Maximization (EM) algorithm [AD77]. EM starts from a set of K random components. In the E-step, it computes, for each observation, the probability of belonging to each component of the model. In the M-step, it uses the points associated with each component to estimate better parameters for each of the K components. The E-step and the M-step are repeated until the parameters converge.
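
Concretely, for observations x_1, ..., x_N, EM iteratively increases the log-likelihood of the mixture model,

    \log L = \sum_{n=1}^{N} \log \sum_{k=1}^{K} w_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),

and each E-step/M-step pair is guaranteed not to decrease this quantity.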

The following figure shows the results of applying GMM to a similar astronomical data set [RM15]; it shows two 2D projections of the 4D model.

Questions

Spark MLlib implements GMM and, in this assignment, we will use it to cluster 1.9M astronomical sources.

  1. Deploy EMR with Spark: To deploy Spark on Elastic MapReduce, use the instructions from Sections 5-6.

  2. Data load and transform: Data is provided in the form of CSV files in S3. Read the data from S3, parse it, and otherwise prepare it as needed for the analysis. (One possible PySpark approach is sketched under "Example code sketches" after this question list.)
    Bucket: csed516
    Key: smalldatasetspark/wise-colors-15-20-subsetsmall8.csv

  3. Run GMM: The MLlib GMM implementation is available in Scala, Java, Python, and R. Choose the language that you prefer and run GMM on the sources that you prepared above. Remember that the points are in four dimensions.

    1. Find and describe components: Use the MLlib implementation of GMM with k=7 components to build a model of the data. List the weight, mean (mu), and covariance matrix of each component. Run this experiment on a 2-node (i.e., 2-instance) cluster, with a single worker per instance. If you choose to change the number of partitions, report the number of partitions. (20 points)
    2. Plot the source clusters: Each point is represented in four dimensions (X, Y, Z, W). Produce one or more 3D or 2D plots over a subset of the dimensions of your choosing, showing the sources with each cluster of points denoted by a different color. (A plotting sketch appears after this question list.) (10 points)
    3. Speed-up: In this part of the assignment, we will evaluate Spark's speed-up on the GMM algorithm. For this, we keep the data size and task fixed and vary the number of cores/compute nodes available for processing. Please answer the following question: how does the runtime vary when we change the number of instances in the Spark cluster? Vary the number of instances from 2 to 4 to 8 and plot the runtime for finding 7 components as in question 1. You already ran part 1 with 2 instances, so for this part you only need to run GMM on 4- and 8-instance clusters. You can continue to use one worker per instance. Report the runtimes. If you choose to change the number of partitions, report the number of partitions. (20 points)
    4. Scale-up: In this part, we increase the data set size and see how the change in data size impacts the analysis time. For this, a larger dataset with 15M sources is located at:

      Bucket: csed516
      Key: largerdatasetspark/wise-colors-15-20-subset1.csv

      Run GMM on this larger dataset on the 2-node cluster. Report the execution time as well as the number of partitions for this dataset. Compare the components that you get on this larger dataset with the ones you got on the smaller dataset. (20 points)

    5. Data management and analysis: Generate different-size subsets of the larger dataset (you can use selections to extract subsets of the data). Run the GMM algorithm on those subsets. Comment on the query execution time and on the components that you find. (See the sketches after this question list for one way to extract subsets.) (10 points)
    6. Data management and analysis: On the large dataset, repeat the experiment by running GMM using three of the four available dimensions. Comment on the query execution time and on the components that you find. (See the sketches after this question list for one way to select three dimensions.) (20 points)
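
Example code sketches

The sketches below illustrate one possible way to approach questions 2, 3.1, 3.2, 5, and 6 using PySpark and the DataFrame-based MLlib API. They are only illustrations, not the required solution: the column names (id, x, y, z, w), the CSV options (header, delimiter), the seed, and the s3:// URI form are assumptions that you should adjust to match the actual files and your cluster configuration.

    # Sketch for questions 2 and 3.1: load the small CSV from S3 and fit a
    # 7-component GMM. Column names and CSV options are assumptions; check
    # the file and adjust.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import GaussianMixture

    spark = SparkSession.builder.appName("hw3-gmm").getOrCreate()

    df = (spark.read
          .csv("s3://csed516/smalldatasetspark/wise-colors-15-20-subsetsmall8.csv",
               inferSchema=True, header=False)    # set header=True if the file has one
          .toDF("id", "x", "y", "z", "w"))

    # Pack the four measurements into a single feature vector column
    assembler = VectorAssembler(inputCols=["x", "y", "z", "w"], outputCol="features")
    features = assembler.transform(df).select("features")

    gmm = GaussianMixture(k=7, seed=42)            # featuresCol defaults to "features"
    model = gmm.fit(features)

    print(model.weights)                           # one weight per component
    for i, row in enumerate(model.gaussiansDF.collect()):
        print(i, row["mean"], row["cov"])          # mean vector and covariance matrix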
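
For the plot in question 3.2, one option (continuing the sketch above) is to attach cluster labels with model.transform and pull a small sample to the driver for matplotlib. The sampling fraction and the chosen pair of dimensions (X vs. Y here) are arbitrary choices.

    # Sketch for question 3.2: color a 2D projection (X vs. Y) by cluster label.
    import matplotlib.pyplot as plt

    pred = model.transform(assembler.transform(df))    # adds a "prediction" column
    sample = (pred.select("x", "y", "prediction")
                  .sample(withReplacement=False, fraction=0.05, seed=1)  # keep it small
                  .toPandas())

    plt.scatter(sample["x"], sample["y"], c=sample["prediction"], s=2, cmap="tab10")
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.savefig("clusters_xy.png")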
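
For questions 5 and 6, the larger file can be loaded the same way (same session as above); subsets can then be extracted with selections or random sampling, and the three-dimensional variant simply assembles three of the four measurement columns. The predicate, sampling fraction, and dropped column below are placeholders.

    # Sketch for questions 5 and 6 on the larger dataset.
    large = (spark.read
             .csv("s3://csed516/largerdatasetspark/wise-colors-15-20-subset1.csv",
                  inferSchema=True, header=False)
             .toDF("id", "x", "y", "z", "w"))

    # Question 5: different-size subsets via a selection or a random sample.
    subset = large.where("x < 1.0")                                       # placeholder predicate
    sample25 = large.sample(withReplacement=False, fraction=0.25, seed=1) # random 25% subset

    # Question 6: use three of the four dimensions (here dropping W, as an example).
    assembler3 = VectorAssembler(inputCols=["x", "y", "z"], outputCol="features")
    model3 = GaussianMixture(k=7, seed=42).fit(
        assembler3.transform(large).select("features"))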

References

[AD77] A. P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, JRSS, 1977.

[RM15] Ryan Maas et al., Gaussian Mixture Models Use-Case: In-Memory Analysis with Myria, VLDB, 2015.