Due: Tuesday, November 13th at 11:59pm
Objective: Become familiar with Apache Spark. Learn how to use Spark to perform data management tasks and simple data analysis tasks.
Assignment tools: Apache Spark on Amazon Elastic MapReduce (EMR).
What to turn in: Turn in code as well as results (with plots) as listed in each question.
In this assignment, we will work with astronomy data. We first give an overview of the application that motivates the analysis we will perform in the assignment.
Large-scale astronomical imaging surveys (e.g., the Sloan Digital Sky Survey) collect databases of telescope images. A key analysis task is to extract sources (galaxies, stars, quasars, etc.) from these images. While extracting sources is a simple process, classifying these sources into object types is difficult. An example of this type of classification is the use of multiple-wavelength observations to separate high-redshift quasars (i.e., galaxies powered by a central black hole) from stars within our own Galaxy. Given that both quasars and stars are point sources (i.e., they cannot be distinguished by data from a single image alone) and that stars outnumber quasars roughly 400 to 1 within a data set, the accuracy of this classification determines the success of finding some of the highest-redshift sources within the universe.
The Gaussian Mixture Model (GMM) algorithm can serve as a foundation to perform this classification. It is a clustering algorithm that will help us group sources into different types. We describe this algorithm below.
In this assignment, we will use as input a table that contains sources extracted from telescope images and their features. We will apply the GMM algorithm to build a model of those sources.
For more details about the application, please refer to [RM15].
Sample rows from the input table; each source has an ID and four features (X, Y, Z, W):

| ID | X | Y | Z | W |
|---|---|---|---|---|
| 1237661088029606015 | 0.575469 | 1.37509 | 1.941 | -0.0360003 |
| 1237661088024822536 | 1.00735 | 3.06909 | 3.701 | -0.059 |
| 1237661088024822606 | 1.4684 | 2.5072 | 3.184 | -0.105 |
| 1237661088024887302 | 0.761256 | 1.44754 | 1.356 | -0.0959997 |
A Gaussian mixture model is a model of a distribution as a mixture of K separate multivariate normal distributions, each with three parameters: mean, covariance, and weight. We will talk about covariance rather than variance or standard deviation because our data is in a 4D space. Thus, the mean is a 4D vector and the covariance is a 4x4 matrix. The mean and covariance describe the position and shape of the Gaussian. The weight captures the relative amplitude of the Gaussian in the model. The sum of the weights of the Gaussians in the model is 1.0. GMM assumes all the data points are generated from this mixture of Gaussian distributions.
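Concretely, in the standard formulation (using the notation above), the mixture density is

$$p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} w_k = 1,$$

where, for our data, each $\mu_k$ is a 4D mean vector, each $\Sigma_k$ is a 4x4 covariance matrix, and $w_k$ is the weight of component $k$.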
There is no analytic solution for estimating the parameters of a mixture of Gaussians from a sample data set. The standard approach is to iteratively maximize the likelihood function of the mixture model using the Expectation-Maximization (EM) algorithm [AD77]. EM starts by assuming a set of K random components. In the E-step, it computes, for each observation, the probability of belonging to each component of the model. In the M-step, it uses the points associated with each component to estimate better parameters for each of the K components. The E-step and the M-step are repeated until the parameters converge.
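In the standard formulation, the E-step computes responsibilities and the M-step re-estimates the parameters from them:

$$\gamma_{ik} = \frac{w_k \,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)},$$

$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \quad \mu_k = \frac{1}{N_k}\sum_{i=1}^{N} \gamma_{ik}\, x_i, \quad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{N} \gamma_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^{\top}, \quad w_k = \frac{N_k}{N}.$$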
The following figure shows the results of applying GMM to similar astronomical data [RM15]. The figure shows two 2D projections of the 4D model.
Spark MLlib implements GMM and, in this assignment, we will use it to cluster 1.9M astronomical sources.
Deploy EMR with Spark: To deploy Spark on Elastic MapReduce, use the instructions from Sections 5-6.
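For reference, a cluster with Spark can also be created programmatically. The sketch below uses boto3; the release label, instance types, region, and roles are illustrative assumptions, not a substitute for the instructions in Sections 5-6:

```python
# Hypothetical boto3 sketch; all configuration values below are illustrative.
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # region is an assumption

response = emr.run_job_flow(
    Name="csed516-spark",
    ReleaseLabel="emr-6.10.0",             # any recent Spark-capable release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 2,                # matches the 2-node cluster used below
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",     # assumes the default EMR roles exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```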
Data load and transform: The data is provided as CSV files in S3. Read the data from S3, then parse it and otherwise prepare it as needed for the analysis (see the sketch after the bucket and key below).
Bucket: csed516
Key: smalldatasetspark/wise-colors-15-20-subsetsmall8.csv
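A minimal PySpark sketch of this step; it assumes the file has no header row and that the columns are ID, X, Y, Z, W as in the sample table above (the column names and the exact S3 path construction are assumptions):

```python
# Minimal sketch: load the CSV from S3 and assemble a 4D feature vector.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("csed516-gmm").getOrCreate()

# Path assembled from the bucket and key above (assumed layout).
path = "s3://csed516/smalldatasetspark/wise-colors-15-20-subsetsmall8.csv"

# Assumed: no header row; columns are ID, X, Y, Z, W as in the sample table.
df = (spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv(path)
      .toDF("id", "x", "y", "z", "w"))

# MLlib expects the features in a single vector column.
assembler = VectorAssembler(inputCols=["x", "y", "z", "w"], outputCol="features")
points = assembler.transform(df).select("features").cache()
```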
Run GMM: MLlib's GMM implementation is available in Scala, Java, Python, and R. Choose the language that you prefer and run GMM on the sources that you prepared above (a minimal PySpark sketch follows). Remember that the points are in four dimensions.
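A minimal PySpark sketch of the fitting step, continuing from the `points` DataFrame above; `k=2` and the other parameters are illustrative choices, not values prescribed by the assignment:

```python
# Minimal sketch: fit a Gaussian mixture to the prepared 4D points.
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=2, featuresCol="features", maxIter=100, seed=42)
model = gmm.fit(points)

print(model.weights)                     # component weights; they sum to 1.0
model.gaussiansDF.show(truncate=False)   # per-component 4D mean and 4x4 covariance
```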
Bucket: csed516
Key: largerdatasetspark/wise-colors-15-20-subset1.csv
Run GMM on this larger dataset on the 2-node cluster. Report the execution time as well as the number of partitions for this dataset (a sketch of how to measure both follows). Compare the components that you get on this larger dataset versus those you got on the smaller dataset. (20 Points)
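One way to obtain both numbers, reusing `gmm` and `points` from the earlier sketches; measuring wall-clock time around `fit` is an assumption about what "execution time" should capture:

```python
import time

# Number of partitions of the prepared dataset.
print(points.rdd.getNumPartitions())

# Wall-clock time of the fit itself.
start = time.time()
model = gmm.fit(points)
print(f"GMM fit took {time.time() - start:.1f} s")
```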
[AD77] A. P. Dempster et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977.
[RM15] Ryan Maas et al. Gaussian Mixture Models Use-Case: In-Memory Analysis with Myria. VLDB, 2015.