HW 1

Objectives: Setup and run simple queries on a shared nothing DB service, Amazon Redshift.

Assignment tools: Amazon Redshift on Amazon web services.

Due date: Tuesday, Oct 17 by 11:00pm. Turn in txt file or PDF.

What to turn in: You will turn in SQL for the queries, run time for each query, number of rows returned and first ten rows from the result set (or all rows if a query returns fewer than 10 rows).

Assignment Details

In this Assignment you will be required to deploy a Redshift cluster, ingest data and run some queries on this data.

A. Setting up Amazon Redshift (0 points)

We will cover this in Section 1 and 2.
- Set up AWS account
- Download SQLWorkbenchJ
- Create myRedshift IAM role
- Create Redshift security group
- Deploy Redshift cluster
- Connect to the Redshift cluster

B. Ingest Data into Amazon Redshift(0 points)

Ingest data into Redshift

C. Run Queries: Run each query multiple times. Plot the average and either min/max or standard deviation. Use the warm cache timing, which means you discard the first time the query is run.

Write and run queries on the 1GB data dataset and a 2-node cluster.(10 points)
- What is the total number of parts offered by each supplier? The query should return the name of the supplier and the total number of parts.
- What is the cost of the most expensive part for each supplier? The query should return the name of the supplier and the cost of the most expensive part but you do not need to return the name of that part.
- What is the most expensive part by any supplier? The query should return only the price of that most expensive part. No need to return the name.
- What is the total number of customers per nation? The query should return the name of the nation and the number of unique customers.
- What is number of parts shipped between 10 oct, 1996 and 10 nov, 1996 for each supplier? The query should return the name of the supplier and the number of parts.
- Extra credit: what is the average delay (in days) between an item being ordered and shipped? (1 point)
Run queries from (1) on the 10 GB and 100GB data sets using the 2-node cluster. Remember to execute each query multiple times. Discuss the runtimes that you observe. (20 points)
Note: Before you take down your 10GB 2-node cluster, you may want to complete query (5) below.
Run queries from (1) on the 100 GB dataset but this time increase the cluster size from 2 to 16 nodes (try at least 2, 4, 8, and 16 nodes). Discuss the runtimes that you observe. (20 points)
Query data on s3 vs local data: re-run the queries from (1) on the 10GB data set on S3. S3 data is located in s3://uwdb/tpch/athena/ with a folder for each of the following tables: customer, supplier, orders, region, nation, part, partsupp & lineitem. The data is in textfile format and the delimiter is the pipe character(|). e.g. s3 location of the data for the lineitem relation would be s3://uwdb/tpch/athena/linitem/. (10 points)

Extra credit: Report the cold cache timings for queries from (1) for local data. To get the cold cache timing for each query you will have to restart the Redshift cluster before you run the query.(2 points)
A customer is considered a Gold customer if they have orders totalling more than $1,000,000.00. Customers with orders totalling between $1,000,000.00 and $500,000.00 are considered Silver. Write a SQL query to compute the number of customers in these two categories. Try different methods of writing the query (only SQL or use a UDF or a View to categorize a user). Discuss your experience with the various methods to carry out such analysis. Use the 10GB data set and the 2-node cluster. (10 points)