hw3-queries.sql
This homework is a continuation of homework 2, with three changes. We will use a larger dataset, the queries are more challenging, and you will have to use a commercial database system. SQLite simply cannot execute these queries in any reasonable amount of time; hence, we will use SQL Server, which has one of the most advanced query optimizers. SQL Server also has a very nice client application, SQL Server Management Studio, that you will get to use in this assignment.
Here is again the schema of the Flights database, for your reference:
FLIGHTS (fid, year, month_id, day_of_month, day_of_week_id, carrier_id, flight_num, origin_city, origin_state, dest_city, dest_state, departure_delay, taxi_out, arrival_delay, canceled, actual_time, distance)
CARRIERS (cid, name)
MONTHS (mid, month)
WEEKDAYS (did, day_of_week)
We leave it up to you to decide on the types to use for the various fields. Note that SQL Server will complain if you try to ingest data and it needs to truncate it because VARCHAR fields are too small.
In addition, impose the following constraints:
- The primary key of the FLIGHTS table is fid.
- The primary keys for the other tables are cid, mid, and did respectively.
- Flights.carrier_id references Carrier.cid.
- Flights.month_id references Months.mid.
- Flights.day_of_week_id references Weekdays.did.
In this homework, you will do four things. First, you will create a database in the SQL Server database management system running as a service on Windows Azure. Second, you will write and test the SQL queries below; keep in mind that the queries are quite challenging, both for you and for the database engine. Third, you will add indexes to speed up your queries. Finally, you will reflect on using a database management system running in a public cloud.
In this assignment, we want you to learn how to use SQL Azure from scratch. Your first step will thus be to setup a database in the SQL Azure service and ingest your data. This step may seem tedious but it is crucially important. We want you to be able to continue using SQL Azure after the class ends. For this, you need to know how to use the system starting from nothing.
Plan ahead: Expect these steps to take time to complete!
Step 0: Create an Account.
If you don't have a Microsoft Azure account, go to https://azure.microsoft.com/en-us/account/ and click on the "Portal" link in the upper right. It will take you to a sign in page where you can create a new account using any email address you own.
Step 1: Activate your promotion code.
Go to: http://www.microsoftazurepass.com and activate the code that was distributed to you.
Step 2: Learn about SQL Azure.
Spend some time clicking around, reading documentation, watching tutorials, and generally familiarizing yourself with SQL Azure.
Step 3: Setup a database.
From the main Azure portal (https://portal.azure.com/), select the SQL databases section on the left panel and click the add icon near the top left of the page. A panel with configuration options for the new DB instance will appear to the right. Use the following configuration options:
You should now be able to create your database tables. Before proceeding, go to the main page of your database by first clicking on "All resources" on the left and selecting your database. Take note of the server name because you will need it later.
Note: at any point in time, if you get the error "error connecting to the server", it is because of the firewall on the server. Go to http://manage.windowsazure.com → SQL DATABASES → SERVERS → your server name → CONFIGURE → add the rule: "all access" "0.0.0.0" "255.255.255.255" → press save at the bottom of the page.
Now use one of the following methods to create your tables. We recommend using SQL Server Management Studio on your own machine (you can get a copy from https://www.cs.washington.edu/lab/software) or one in the lab machines
If you have windows, it may already be installed or you should be able to install the appropriate software. Try the "Install Microsoft SQL Server Data Tools" (let us know how that works and see next point for more details on connecting).
You can also use SQL Server Management Studio already installed on the PC lab machines and also available on all VDI lab machines -- see: http://vdi.cs.washington.edu/vdi/). Before you can connect, you will need to change the firewall rules on the Azure server as explained above.
If neither of those works, you can spin up a Windows VM directly in Azure and follow the steps above. You may need to install bcp. It does not seem to be included by default.
Make sure homebrew installs freetds version at least 0.95.79. You may need to upgrade and update homebrew first using the command:brew install freetds
The command to load [file_name] into the table [table_name] with bcp isbrew upgrade brew update
bcp dbo.[table_name] in [file_name] -U [user_name]@[server_name] -P [password] -S [server_name].database.windows.net -c -t [separator] -r [newline] -d [database]
Here is an example with server pz4ur74rts.database.windows.net, user magda, and database name cse344-15au:
For flights, you need to use a different end-of-line character:bcp dbo.Carriers in carriers.csv -U magda@pz4ur74rts -S pz4ur74rts.database.windows.net -P "some-complicated-password" -c -t "," -r "\n" -d cse344-15au
Usage of freebcp is similar to bcp except you need to use -D instead of -d when specifying the database. Also, you may need to use single quotes instead of double quotes and may need to use \n instead of 0x0a or \r\n instead of \n.bcp dbo.Flights in flights-small.csv -U magda@pz4ur74rts -S pz4ur74rts.database.windows.net -P "some-complicated-password" -c -t "," -r "0x0a" -d cse344-15au
freebcp dbo.[table_name] in [file_name] -U [user_name]@[server_name] -P [password] -S [server_name].database.windows.net -c -t [separator] -r [newline] -D [database]
Important: When loading the flights-small.csv table, bcp will appear to hang toward the end. It did not hang. It is creating the index on the primary key. It will take what seems like an eternity but is approximately 20 min and then it will complete.
As a sanity check, count the number of rows in the Azure Flights table and make sure it matches the number of rows in the SQLite Flights table. If the count on Azure is too low, the data didn't all get transfered. You will need to drop the table and try again. If it continues to not work, split the flights-small.csv file into multiple smaller files and import one at a time. On mac, to split the file into 10 smaller files, run:
On windows, you can either use cygwin or install something like HJ-Split.split -l 114868 flights-small.csv flights-small-part
For each question below, write a single SQL query to answer that question (you can use subqueries). Add a comment to each query with the question number, the number of rows your query returns, how long the query took, and the first 20 rows of the result (if the result has less than 20 rows, output all of them). You can simply copy and paste the first rows into the comment. Please end each SQL query with a semicolon.
For each origin city, find the destination city (or cities) with the longest direct flight. By direct flight, we mean a flight with no intermediate stops. Judge the longest flight in time, not distance. Show the name of the origin city, destination city, and the flight time between them. Do not include duplicates of the same origin/destination city pair. Order the result by origin_city name then destination city name.
[Output relation cardinality: 333 rows]Find all origin cities that only serve flights shorter than 3 hours. You can assume that flights with NULL actual_time are not 3 hours or more. Return only the names of the cities sorted by name. List each city only once in the result.
[Output relation cardinality: 147]For each origin city, find the percentage of departing flights shorter than 3 hours. For this question, treat flights with null actual_time values as longer than 3 hours. Return the name of the city and the percentage value. Order by percentage value. Be careful to handle cities without any flights shorter than 3 hours. We will accept both 0 and NULL as the result for those cities.
[Output relation cardinality: 327]List all cities that cannot be reached from Seattle though a direct flight but can be reached with one stop (i.e., with two flights). Do not include Seattle as one of these destinations (even though you could get back with two flights). Order results alphabetically.
[Output relation cardinality: 256]List all cities that cannot be reached from Seattle though a direct flight nor with one stop (i.e., with two flights). Do not forget to consider all cities that appear in a flight as an origin_city. Order results alphabetically.
[Output relation cardinality: 3 or 4, depending on what you consider to be the set of all cities]In this part you will be asked to examine the query plan of a SQL query. You do that in SQL Server Management Studio my clicking on the option to "Include actual execution plan" and looking at the execution plan tab after the query finishes. You can view the used query plan in the online interface by clicking the actual button in the new query interface and clicking on the Query Plan link near the results.
Take the following 3 queries:
SELECT DISTINCT carrier_id FROM Flights WHERE origin_city = 'Seattle WA' AND actual_time <= 180;
SELECT DISTINCT carrier_id FROM Flights WHERE origin_city = 'Gunnison CO' AND actual_time <= 180;
SELECT DISTINCT carrier_id FROM Flights WHERE origin_city = 'Seattle WA' AND actual_time <= 30;
hw3-queries.sql
file and explain why you chose that index in a SQL comment.hw3-querires.sql
, list the query numbers, if the index was used, and, if applicable, an explanation why the index was not used. Note that Azure creates and uses a clustered index, usually on the primary key of the table, by default. We only care about the index you create.SELECT DISTINCT F2.origin_city FROM Flights F1, Flights F2 WHERE F1.dest_city = F2.dest_city AND F1.origin_city='Gunnison CO' AND F1.actual_time <= 30;Choose one simple index (index on one attribute), different from the index for the question above, that is likely to speed up this query. Write down the CREATE INDEX statement in
hw3-queries.sql
and explain why you chose that index as a SQL comment.hw3-queries.sql
. Note that different instances of SQL Azure can have a very different performance and that creating an index takes time.
Extra credit (4 points): Repeat the timing experiments by adding the same two indexes on a larger flights dataset: flights-more.tar.gz. Note that ingesting data into a table that has indexes is much slower than creating indexes later in bulk. So we recommend that you either drop your indexes or create a new table called Flights2 and ingest the data there. Time your queries without indexes. Then add indexes and re-run your queries. Feel free to use any subset of the data we provide. Indicate what data you used in your answer to this question.
Record the dataset you ended up using and the timing results in hw3-extracredit.txt
.
The DBMS that we use in this assignment is running somewhere in one of Microsoft's data centers. Comment on your experience using this DBMS cloud service. What do you think about the idea of offering a DBMS as a service in a public cloud?
Put all your code (SELECT-FROM-WHERE
and CREATE INDEX
code) and SQL comments in a file called hw3-queries.sql
. Put your extra credit results in hw3-extracredit.txt
.