Objectives:
Assignment tools: Microsoft SQL Server and Azure
Due date: Tuesday, April 25th by 11:00pm.
Turn in your solution using the assignment drop box linked from the main course
web page.
What to turn in:
One file, hw3-queries.sql, described below.
This homework is a continuation of homework 2, with three changes. We will use a larger dataset, the queries are more challenging, and you will have to use a commercial database system. SQLite simply cannot execute these queries in any reasonable amount of time; hence, we will use SQL Server, which has one of the most advanced query optimizers. SQL Server also has a very nice client application, SQL Server Management Studio, that you can use in this assignment.
Here is again the schema of the Flights database, for your reference:
We leave it up to you to decide on the types to use for the various fields. However, note that SQL Server will complain if you try to import data and it needs to truncate it because VARCHAR fields are too small or if integer fields are used for real valued data.
In addition, however, you should impose the following constraints:
In this homework, you will do four things. First, you will create a database in the SQL Server database management system running as a service on Azure. Second, you will write and test the SQL queries below. Keep in mind that the queries are quite challenging, both for you and for the database engine. Third, you will add indexes to speed up your queries. Finally, you will reflect on using a database management system running in a public cloud.
In this assignment, we want you to learn how to use an Azure SQL database from scratch. Your first step will thus be to setup a database in the Azure service and importing your data. This step may seem tedious but it is crucially important. We want you to be able to continue using Azure after the class ends. For this, you need to know how to use the system starting from nothing.
NOTE: These steps will take some time to complete, so start early!
If you have not already done so, follow the Azure setup instructions to create an Azure account using the promotional code you were sent in email.
Spend some time clicking around, reading documentation, watching tutorials, and generally familiarizing yourself with Azure and SQL Server.
From the Azure portal page, select "+ New" on the left, then select "Databases", then select "SQL Database". This will bring up a panel with configuration options for a new DB instance.
Perform the following configuration:
Click on the "Create" button to create this database. You will see a panel on the dashboard that says "Deploying SQL Database" while the database is being set up. Once that is done, it should open up a panel with details on your database. (If not, click on the panel for your SQL database to open it.)
Finally, click on the button near the top that says "Set server firewall". You will need to change the settings before you can upload data. The easiest option is to add a rule that allows connections from any client, which you can do as follows:
Be sure to click "Save" once you have added this rule.
The simplest way to play with the database is using the built-in Query editor in the Azure portal.
To launch this, click on the "Tools" button on top (to the left of the firewall settings button you clicked before). Select "Query editor", and accept the warning that this is a preview feature (i.e., not necessarily what the final launched version will look like).
Click on the "Login" button and enter the username and password that you chose when you created your database in Step 3. Once you have done that, you can try entering SQL commands. Press the "Run" button to execute them.
As convenient as the Azure portal Query editor is, you will need to use SQL Server Management Studio (SSMS) for the last part of this assignment, where you are asked to examine SQL Server's query execution plans.
If you have Windows, SSMS may already be installed or you should be able to download and install it from Microsoft's web site. (For detailed instructions on installing SSMS, see this document from CSE 344.) You can also use SSMS on the PC lab machines, where it is already installed, or on the VDI lab machines.
(If you have trouble connecting from any of those machines, it may be because there was a mistake setting up the firewall rules above.)
Another option is that you can launch a Windows VM directly in Azure. The process is similar to how we created a database instance above. Once your VM instance is running, open the panel for the instance and click on "Connect" to download a .rdp file. On a Mac, you can then connect to your Windows machine by using Microsoft Remote Desktop. Once connected, you can install SSMS using the link above.
When you launch SSMS, it will ask you to connect to a SQL server instance. Tell it the name of the server you created in Step 3 above (e.g., mine was zatsqlserver.database.windows.net). Then, select "SQL authentication" (rather than Windows authentiation) and give it your username and password from Step 3. At that point, you should be able to see the tables of your database in the panel on the left side. Buttons to create a new query and execute it are in the middle of the menu bar:
Just to the right of the "Execute" button is an option to display the query execution plan. (It's highlighted yellow in the middle of the image above.) You will need to enable this option for part D of this assignment. Once you do so, the query execution plan will appear in a panel on the bottom of SSMS, as shown here (click on the image for a full-sized version):
Next, you will import all the data from HW2. Make sure that you execute your CREATE TABLE statements first so that the tables you will add tuples to already exist. Also, make sure that the types of the columns in the tables you created match the data.
To import data, you will need to use a utility called bcp, which should come with the command prompt on Windows. (Note that you can also use a Windows machine in the lab or a Windows VM instance on Azure.) If you are using a unix system, you can use the freebcp utility, which is a part of freetds. On a Mac, you can install freebcp using the homebrew package manager:
brew install freetds
Make sure that homebrew installs freetds version at least 0.95.79. (You may need to run brew update and brew upgrade to get the most recent versions.)
The command to load [file_name] into the table [table_name] with bcp is:
bcp dbo.[table_name] in [file_name] -U [user_name]@[server_name] -P [password] -S [server_name].database.windows.net -c -t [separator] -r [newline] -d [database]
Here is an example with server zatsqlserver.database.windows.net, user zat, and database name cse414-17sp:
bcp dbo.Flights in flights-small.csv -U zat@zatsqlserver -S zatsqlserver.database.windows.net -P "some-complicated-password" -c -t "," -r "0x0a" -d cse414-17sp
Usage of freebcp is similar to bcp except you need to use -D instead of -d when specifying the database and you need to use \n instead of 0x0a. The same example from above becomes:
freebcp dbo.Flights in flights-small.csv -U zat@zatsqlserver -S zatsqlserver.database.windows.net -P "some-complicated-password" -c -t "," -r "\n" -D cse414-17sp
Note: When loading the flights-small.csv table, bcp may appear to hang toward the end; however, it probably did not hang. It is creating the index on the primary key, which will take what seems like an eternity but is approximately 20 min, and then it will complete.
As a sanity check, count the number of rows in the Flights table and make sure it matches the number of rows in the SQLite Flights table. If the count on SQL Server is too low, the data didn't all get imported. You will need to drop the table and try again.
If it continues to not work, you can try splitting the flights-small.csv file into multiple smaller files and importing them one at a time. On Mac, to split the file into 10 smaller files, run:
split -l 114868 flights-small.csv flights-small-part
On Windows, you can either use cygwin (to get the same command) or install something like HJ-Split.
For each question below, write a single SQL query to answer that question. This time, you can use subqueries. Add a comment to each query with the question number, the number of rows your query returns, how long the query took, and the first 20 rows of the result (if the result has less than 20 rows, output all of them). You can simply copy and paste the first rows into the comment. Please end each SQL query with a semicolon.
As in HW2, after each question, we have provided the number of rows that should be in your answer (written between [ ] brackets after the question).
In this part you will be asked to examine the query plan of a SQL query. You do that in SQL Server Management Studio my clicking on the option to "Include actual execution plan" and looking at the execution plan tab after the query finishes. You can view the used query plan in the online interface by clicking the actual button in the new query interface and clicking on the Query Plan link near the results.
Consider this query:
SELECT DISTINCT F2.origin_city
FROM Flights F1, Flights F2
WHERE F1.dest_city = F2.dest_city
AND F1.origin_city='Gunnison CO'
AND F1.actual_time <= 30;
Choose one simple index (index on one attribute), different from the index for the question above, that is likely to speed up this query. Write down the CREATE INDEX statement in hw3-queries.sql and explain why you chose that index as a SQL comment.
Extra Credit [4 points]: Repeat the timing experiments by adding the same two indexes on a larger flights dataset: flights-more.tar.gz. Note that ingesting data into a table that has indexes is much slower than creating indexes later in bulk. So we recommend that you either drop your indexes or create a new table called Flights2 and ingest the data there. Time your queries without indexes. Then add indexes and re-run your queries. Feel free to use any subset of the data we provide. Indicate what data you used in your answer to this question.
Record the dataset you ended up using and the timing results in a seperate ile called hw3-extracredit.txt
The DBMS that we use in this assignment is running somewhere in one of Microsoft's data centers. Comment on your experience using this DBMS cloud service. What do you think about the idea of offering a DBMS as a service in the cloud?
Put all the SQL code (both SELECT and CREATE INDEX
statements) and comments in a file called hw3-queries.sql
. If you
do the extra credit, put that into a separate file called,
hw3-extra-credit.txt
. Submit your file(s) using the dropbox.