HW 1

Objectives: Setup and run simple queries on a shared nothing DB service, Amazon Redshift.

Assignment tools: Amazon Redshift on Amazon web services.

Due date: Tuesday, Oct 17 by 11:00pm. Turn in txt file or PDF.

What to turn in: You will turn in SQL for the queries, run time for each query, number of rows returned and first ten rows from the result set (or all rows if a query returns fewer than 10 rows).

Assignment Details

In this Assignment you will be required to deploy a Redshift cluster, ingest data and run some queries on this data.

A. Setting up Amazon Redshift (0 points)

B. Ingest Data into Amazon Redshift(0 points)

C. Run Queries: Run each query multiple times. Plot the average and either min/max or standard deviation. Use the warm cache timing, which means you discard the first time the query is run.

  1. Write and run queries on the 1GB data dataset and a 2-node cluster.(10 points)
  2. Run queries from (1) on the 10 GB and 100GB data sets using the 2-node cluster. Remember to execute each query multiple times. Discuss the runtimes that you observe. (20 points)

    Note: Before you take down your 10GB 2-node cluster, you may want to complete query (5) below.

  3. Run queries from (1) on the 100 GB dataset but this time increase the cluster size from 2 to 16 nodes (try at least 2, 4, 8, and 16 nodes). Discuss the runtimes that you observe. (20 points)

  4. Query data on s3 vs local data: re-run the queries from (1) on the 10GB data set on S3. S3 data is located in s3://uwdb/tpch/athena/ with a folder for each of the following tables: customer, supplier, orders, region, nation, part, partsupp & lineitem. The data is in textfile format and the delimiter is the pipe character(|). e.g. s3 location of the data for the lineitem relation would be s3://uwdb/tpch/athena/linitem/. (10 points)

    Extra credit: Report the cold cache timings for queries from (1) for local data. To get the cold cache timing for each query you will have to restart the Redshift cluster before you run the query.(2 points)

  5. A customer is considered a Gold customer if they have orders totalling more than $1,000,000.00. Customers with orders totalling between $1,000,000.00 and $500,000.00 are considered Silver. Write a SQL query to compute the number of customers in these two categories. Try different methods of writing the query (only SQL or use a UDF or a View to categorize a user). Discuss your experience with the various methods to carry out such analysis. Use the 10GB data set and the 2-node cluster. (10 points)