Objectives:

  • To practice advanced SQL.
  • To get familiar with commercial database management systems (SQL Server).
  • To get familiar with using a database management system in the cloud (Azure).
  • To practice physical tuning through the addition of appropriate indexes.

Assignment tools: Microsoft SQL Server and Azure

Due date: Tuesday, Noverber 7th, 2017 by 11:00pm.
Turn in your solution using the assignment drop box linked from the main course web page.

What to turn in:
One file, hw3-queries.sql, described below.

Overview

This homework is a continuation of homework 2, with three changes. We will use a larger dataset, the queries are more challenging, and you will have to use a commercial database system. SQLite simply cannot execute these queries in any reasonable amount of time; hence, we will use SQL Server, which has one of the most advanced query optimizers. SQL Server also has a very nice client application, SQL Server Management Studio, that you can use in this assignment.

Here is again the schema of the Flights database, for your reference:

  • Flights (fid, year, month_id, day_of_month, day_of_week_id, carrier_id, flight_num, origin_city, origin_state, dest_city, dest_state, departure_delay, taxi_out, arrival_delay, canceled, actual_time, distance)
  • Carriers (cid, name)
  • Months (mid, month)
  • Weekdays (did, day_of_week)

We leave it up to you to decide on the types to use for the various fields. However, note that SQL Server will complain if you try to import data and it needs to truncate it because VARCHAR fields are too small or if integer fields are used for real valued data.

In addition, however, you should impose the following constraints:

  • The primary key of the FLIGHTS table is fid.
  • The primary keys for the other tables are cid, mid, and did respectively.
  • Flights.carrier_id references Carrier.cid
  • Flights.month_id references Months.mid
  • Flights.day_of_week_id references Weekdays.did

In this homework, you will do four things. First, you will create a database in the SQL Server database management system running as a service on Azure. Second, you will write and test the SQL queries below. Keep in mind that the queries are quite challenging, both for you and for the database engine. Third, you will add indexes to speed up your queries. Finally, you will reflect on using a database management system running in a public cloud.

A. Setting up an Azure SQL Database [0 points]

In this assignment, we want you to learn how to use an Azure SQL database from scratch. Your first step will thus be to setup a database in the Azure service and importing your data. This step may seem tedious but it is crucially important. We want you to be able to continue using Azure after the class ends. For this, you need to know how to use the system starting from nothing.

NOTE: These steps will take some time to complete, so start early!

Step 1: Create an Azure account (use promo code)

If you have not already done so, follow the Azure setup instructions to create an Azure account using the promotional code you were sent in email.

Step 2: Learn about Azure SQL Server

Spend some time clicking around, reading documentation, watching tutorials, and generally familiarizing yourself with Azure and SQL Server.

Step 3: Create a database

From the Azure portal page, select "+ New" on the left, then select "Databases", then select "SQL Database". This will bring up a panel with configuration options for a new DB instance.

Perform the following configuration:

  • Choose a database name (e.g., "cse414-17au").
  • Choose a name for the resource group that will be created (e.g., "myresourcegroup").
  • Create a new server by clicking on "Server" (it will say "Configure required settings"). A second panel will appear to the right, which has "Create a new server" already selected. A third panel will also appear to the right of this. Fill in the form as follows:
    • Choose a name for the database (e.g., "zatsqlserver"). Unlike your database name, the server name must be unique across the universe of Azure SQL databases.
    • Choose an admin login and password. (You will need this when access your database using tools other than the portal.)
    • Set the location to "West US" or "West US 2".
    • Make sure "Allow azure services to access server" is checked.
  • Change to a cheaper pricing tier. (The default is currently "Standard S2", which is too expensive.) To do this, click on "Pricing tier", which will open a pane with "Standard" selected. Find the slider for "DTUs", and turn it all the way down to 10. It should now say the monthly cost is only $15/month.
  • Select "Pin to dashboard".

Click on the "Create" button to create this database. You will see a panel on the dashboard that says "Deploying SQL Database" while the database is being set up. Once that is done, it should open up a panel with details on your database. (If not, click on the panel for your SQL database to open it.)

Finally, click on the button near the top that says "Set server firewall". You will need to change the settings before you can upload data. The easiest option is to add a rule that allows connections from any client, which you can do as follows:

Be sure to click "Save" once you have added this rule.

Step 4: Try out the database

The simplest way to play with the database is using the built-in Query editor in the Azure portal.

To launch this, click on the "Tools" button on top (to the left of the firewall settings button you clicked before). Select "Query editor", and accept the warning that this is a preview feature (i.e., not necessarily what the final launched version will look like).

Click on the "Login" button and enter the username and password that you chose when you created your database in Step 3. Once you have done that, you can try entering SQL commands. Press the "Run" button to execute them.

Step 5: Use the database from SQL Server Management Studio

As convenient as the Azure portal Query editor is, you will need to use SQL Server Management Studio (SSMS) for the last part of this assignment, where you are asked to examine SQL Server's query execution plans.

If you have Windows, SSMS may already be installed or you should be able to download and install it from Microsoft's web site. (For detailed instructions on installing SSMS, see this document from CSE 344.) You can also use SSMS on the PC lab machines, where it is already installed, or on the VDI lab machines.

(If you have trouble connecting from any of those machines, it may be because there was a mistake setting up the firewall rules above.)

Another option is that you can launch a Windows VM directly in Azure. The process is similar to how we created a database instance above. Once your VM instance is running, open the panel for the instance and click on "Connect" to download a .rdp file. On a Mac, you can then connect to your Windows machine by using Microsoft Remote Desktop. Once connected, you can install SSMS using the link above.

When you launch SSMS, it will ask you to connect to a SQL server instance. Tell it the name of the server you created in Step 3 above (e.g., mine was zatsqlserver.database.windows.net). Then, select "SQL authentication" (rather than Windows authentiation) and give it your username and password from Step 3. At that point, you should be able to see the tables of your database in the panel on the left side. Buttons to create a new query and execute it are in the middle of the menu bar:

Just to the right of the "Execute" button is an option to display the query execution plan. (It's highlighted yellow in the middle of the image above.) You will need to enable this option for part D of this assignment. Once you do so, the query execution plan will appear in a panel on the bottom of SSMS, as shown here (click on the image for a full-sized version):

B. Importing Data [0 points]

Next, you will import all the data from HW2. Make sure that you execute your CREATE TABLE statements first so that the tables you will add tuples to already exist. Also, make sure that the types of the columns in the tables you created match the data.

To import data, you will need to use a utility called bcp, which should come with the command prompt on Windows. (Note that you can also use a Windows machine in the lab or a Windows VM instance on Azure.) If you are using a unix system, you can use the freebcp utility, which is a part of freetds. On a Mac, you can install freebcp using the homebrew package manager:

brew install freetds

Make sure that homebrew installs freetds version at least 0.95.79. (You may need to run brew update and brew upgrade to get the most recent versions.)

The command to load [file_name] into the table [table_name] with bcp is:

bcp dbo.[table_name] in [file_name] -U [user_name]@[server_name] -P [password] -S [server_name].database.windows.net -c -t [separator] -r [newline] -d [database]

Here is an example with server zatsqlserver.database.windows.net, user zat, and database name cse414-17au:

bcp dbo.Flights in flights-small.csv -U zat@zatsqlserver -S zatsqlserver.database.windows.net -P "some-complicated-password" -c -t "," -r "0x0a" -d cse414-17au

Usage of freebcp is similar to bcp except you need to use -D instead of -d when specifying the database and you need to use \n instead of 0x0a. The same example from above becomes:

freebcp dbo.Flights in flights-small.csv -U zat@zatsqlserver -S zatsqlserver.database.windows.net -P "some-complicated-password" -c -t "," -r "\n" -D cse414-17au

Note: When loading the flights-small.csv table, bcp may appear to hang toward the end; however, it probably did not hang. It is creating the index on the primary key, which will take what seems like an eternity but is approximately 20 min, and then it will complete.

As a sanity check, count the number of rows in the Flights table and make sure it matches the number of rows in the SQLite Flights table. If the count on SQL Server is too low, the data didn't all get imported. You will need to drop the table and try again.

If it continues to not work, you can try splitting the flights-small.csv file into multiple smaller files and importing them one at a time. On Mac, to split the file into 10 smaller files, run:

split -l 114868 flights-small.csv flights-small-part

On Windows, you can either use cygwin (to get the same command) or install something like HJ-Split.

C. SQL Queries [15 points]

For each question below, write a single SQL query to answer that question. This time, you can use subqueries. Add a comment to each query with the question number, the number of rows your query returns, how long the query took, and the first 20 rows of the result (if the result has less than 20 rows, output all of them). You can simply copy and paste the first rows into the comment. Please end each SQL query with a semicolon.

As in HW2, after each question, we have provided the number of rows that should be in your answer (written between [ ] brackets after the question).

Questions

  1. For each origin city, find the destination city (or cities) with the longest direct flight. By direct flight, we mean a flight with no intermediate stops. Judge the longest flight in time, not distance. Show the name of the origin city, destination city, and the flight time between them. Do not include duplicates of the same origin/destination city pair. Order the result by origin_city name then destination city name. [333 rows]
  2. Find all origin cities that only serve flights shorter than 3 hours. You can assume that flights with NULL actual_time are not 3 hours or more. Return only the names of the cities sorted by name. List each city only once in the result. [147 rows]
  3. For each origin city, find the percentage of departing flights shorter than 3 hours. For this question, treat flights with null actual_time values as longer than 3 hours. Return the name of the city and the percentage value. Order by percentage value. Be careful to handle cities without any flights shorter than 3 hours. We will accept both 0 and NULL as the result for those cities. [327 rows]
  4. List all cities that cannot be reached from Seattle though a direct flight but can be reached with one stop (i.e., with two flights). Do not include Seattle as one of these destinations (even though you could get back with two flights). Order results alphabetically. [256 rows]
  5. List all cities that cannot be reached from Seattle though a direct flight nor with one stop (i.e., with two flights). Do not forget to consider all cities that appear in a flight as an origin_city. Order results alphabetically. [3 or 4 rows]

D. Physical Tuning [11 points]

In this part you will be asked to examine the query plan of a SQL query. You do that in SQL Server Management Studio my clicking on the option to "Include actual execution plan" and looking at the execution plan tab after the query finishes. You can view the used query plan in the online interface by clicking the actual button in the new query interface and clicking on the Query Plan link near the results.

Questions

  1. Consider the following three queries:
    1. SELECT DISTINCT carrier_id
      FROM Flights
      WHERE origin_city = 'Seattle WA' AND actual_time <= 180;
    2. SELECT DISTINCT carrier_id
      FROM Flights
      WHERE origin_city = 'Gunnison CO' AND actual_time <= 180;
    3. SELECT DISTINCT carrier_id
      FROM Flights
      WHERE origin_city = 'Seattle WA' AND actual_time <= 30;
    Now, answer the following questions:
    1. Choose one single simple index (index on one attribute) that is most likely to speed up all three queries. Write down the CREATE INDEX statements in your .sql file and explain why you chose that index in a SQL comment.
    2. Add this index to your database. For each of the three queries, indicate if SQL Server used the index; if not, give a short explanation why not. As a SQL comment, list the query numbers, if the index was used, and, if applicable, an explanation why the index was not used. Note that SQL Server creates and uses a clustered index, usually on the primary key of the table, by default. We only care about the index you create.
  2. Consider this query:

    SELECT DISTINCT F2.origin_city
    FROM Flights F1, Flights F2
    WHERE F1.dest_city = F2.dest_city
      AND F1.origin_city='Gunnison CO'
      AND F1.actual_time <= 30;

    Choose one simple index (index on one attribute), different from the index for the question above, that is likely to speed up this query. Write down the CREATE INDEX statement in hw3-queries.sql and explain why you chose that index as a SQL comment.

  3. Add this second index to your database (you should already have the first one added), and check that SQL Server indeed uses the index. If not, give a brief explanation why not.
  4. See how the queries from part C perform now that you have two indexes. Record the runtimes of the queries with and without indexes as a comment in hw3-queries.sql. Note that different instances of SQL Server can have a very different performance and that creating an index takes time.
  5. Extra Credit [4 points]: Repeat the timing experiments by adding the same two indexes on a larger flights dataset: flights-more.tar.gz. Note that ingesting data into a table that has indexes is much slower than creating indexes later in bulk. So we recommend that you either drop your indexes or create a new table called Flights2 and ingest the data there. Time your queries without indexes. Then add indexes and re-run your queries. Feel free to use any subset of the data we provide. Indicate what data you used in your answer to this question.

    Record the dataset you ended up using and the timing results in a seperate ile called hw3-extracredit.txt

E. Using a cloud service [1 points]

The DBMS that we use in this assignment is running somewhere in one of Microsoft's data centers. Comment on your experience using this DBMS cloud service. What do you think about the idea of offering a DBMS as a service in the cloud?

F. Turning in your assignment

Put all the SQL code (both SELECT and CREATE INDEX statements) and comments in a file called hw3-queries.sql. If you do the extra credit, put that into a separate file called, hw3-extra-credit.txt. Submit your file(s) using the dropbox.