HW2: Joins and Reports

Getting started

Get the dataset. Unzip it to find a folder containing 5 CSV files.

  • flights.csv
  • carriers.csv
  • cancellation_codes.csv
  • n_numbers.csv
  • aircraft_types.csv

Here are the columns in each table, their types, and a brief description.

The flights table has one row for every marketed domestic commercial flight in the United States in September, 2024. There are approximately 500,000 rows.

flights (
  year INT,
  month INT,                   -- 1-12
  day_of_month INT,            -- 1-31
  day_of_week INT,             -- 1-7, 1 = Mon
  op_unique_carrier TEXT,      -- the code of the airline, see carriers table
  tail_num TEXT,               -- the "license plate" of the aircraft
  op_carrier_fl_num INT,
  origin TEXT,                 -- airport code where flight started
  origin_city_name TEXT,
  origin_state_abr TEXT,
  dest TEXT,                   -- airport code of destination
  dest_city_name TEXT,
  dest_state_abr TEXT,
  crs_dep_time TEXT,           -- scheduled departure time
  dep_time TEXT,               -- actual departure time
  dep_delay REAL,
  crs_arr_time TEXT,           -- scheduled arrival time
  arr_time TEXT,               -- actual arrival time
  arr_delay REAL,
  cancelled REAL,              -- 1 if flight was cancelled, else 0
  cancellation_code TEXT,      -- reason for cancellation, or empty
                               -- see cancellation_codes table
  diverted REAL,               -- 1 if the flight was diverted, else 0
  actual_elapsed_time TEXT,    -- departure-to-arrival time in minutes
  distance REAL                -- miles between origin and destination
)

The carriers table is a table of airlines and their names, roughly 2000 records. Each carrier has a unique carrier code which is used by op_unique_carrier column of the flights table above to describe which airline flew the flight.

carriers(
  cid TEXT,                   -- the "code" of the airline
  cname TEXT                  -- the long name of the airline
)

The cancellation codes table includes four short codes for possible reasons a flight would be cancelled. Used by the cancellation_code column of the flights table.

cancellation_codes(
  cancellation_code TEXT,     -- the code
  description TEXT            -- why the flight was cancelled
)

The N-numbers table includes one row for every publicly registered aircraft in the United States as of January 2025, around 300,000 records. An "N-number" is like a license plate number for aircraft in the United States. All of them start with "N" (which is therefore omitted from the table -- it is implicit). The "N" is the unique country code assigned to the United States under international agreements regarding aircraft.

n_numbers (
  n_number TEXT,         -- the "license plate number" without the N
  serial_number TEXT,
  mfr_mdl_code TEXT,     -- manufacturer and model code, see aircraft_types
  eng_mfr_mdl TEXT,
  year_mfr TEXT,
  name TEXT,             -- name of entity that registered the aircraft
  street TEXT,           -- their address is stored in the rest of the columns
  street2 TEXT,
  city TEXT,
  state TEXT,
  zip_code TEXT,
  region TEXT,
  county TEXT,
  country TEXT
)

The aircraft types table contains one row for each model of aircraft ever registered, including the manufacturer. There are around 100,000 records. It is used by the mfr_mdl_code column of the N-number table to describe the type of each registered aircraft.

aircraft_types (
  code TEXT,            -- the code, used by the n_numbers table
  mfr TEXT,             -- name of the manufacturer
  model TEXT,           -- name of the model
  no_eng INT,           -- number of engines on this type
  no_seats INT,         -- maximum number of seats on this type
  ac_weight TEXT,       -- weight class (1 through 4)
  speed INT             -- average cruising speed of this type in miles per hour
)

The tables are subject to the following additional constraints, which you should enforce:

  • The primary keys are:
    • flights has no primary key (!)
    • cid for carriers
    • cancellation_code for cancellation_codes
    • n_number for n_numbers
    • code for aircraft_types
  • flights.op_unique_carrier references carriers.cid
  • n_numbers.mfr_mdl_code references aircraft_types.code
  • For reasons we will see below, flights.n_number is not a foreign key to the n_numbers table. Nor is flights.cancellation_code a foreign key to cancellation_codes.

The problems

  1. Import the data.

    To import the flights dataset into a SQLite database, you will first need to run sqlite3 with a new database file (eg, sqlite3 hw2.db). This will create an empty, file-backed database that you can load in future SQLite sessions.

    Once you have an empty database, you can run CREATE TABLE statements to create the tables while specifying all key constraints as described above.

    CREATE TABLE table_name ( ... );
    

    You will need one CREATE TABLE statement per CSV file in the provided data.

    Currently, SQLite does not enforce foreign keys by default. To enable foreign keys, use the following command.

    PRAGMA foreign_keys=ON;
    

    Next, use SQLite's .import command to read data from each downloaded text file. You will use one file per table, and you will need to set the input format to CSV (comma-separated value). To learn how to use the .import command, please read the documentation. Something like this should get you started:

    .mode csv
    .import filename tablename
    

    You will need one .import command per CSV file in the provided data.

    Put all the SQLite commands and statements for creating your tables and enforcing foreign keys into a file called create-tables.sql, and all the code for importing the data into these tables into a separate file called import-tables.sql. If done correctly, you should be able to open up a new, empty .db file in SQLite and setup the database using these two commands:

    .read create-tables.sql
    .read import-tables.sql
    
  2. (SQL queries with joins) Write SQL queries to answer the following questions. Your queries should work on any data with the same schema as the provided data. In other words, do not assume that your query will only be tested on the provided data. (For example, the provided data consists only of flights in September 2024, but your queries should not assume that.)

    Put each query in its own file called hw2-q2a.sql, hw2-q2b.sql, etc.

    For development purposes, we recommend writing all your queries in one file using a SQL IDE (link to software setup instructions for choosing an IDE). You can run, test, and debug your queries in the IDE after connecting the IDE to your database. Then, split your queries into separate text files files before submitting your homework.

    You should only use the parts of SQL we have covered in class (with the exception that you may use any operators/functions from the documentation of SQLite). Do not use grouping, aggregation, windowing, compound queries, subqueries, or WITH. (If you don't know what those are, stay tuned for future homeworks!)

    Remember that the topic of this homework is joins. If you're stuck, try figuring out how a join might help you.

    Unless otherwise specified, you may name the output columns however you like, and the rows may be in any order.

    1. List the distinct carrier names of all carriers that offer a flight from the Seattle airport (code SEA) to any airport in Chicago. (Warning: there is more than one airport in Chicago. Your query should work even if they build yet another airport in Chicago.) Name the output column cname.

    2. List all itineraries on September 15, 2024, from the Seattle airport (code SEA) to any airport in Chicago that have exactly one intermediate stop (i.e. SEA -> somewhere -> some Chicago airport).

      Both flights must depart on the same day and must be with the same carrier, and the first flight must depart (dep_time) before the second flight departs. (Do not worry about whether the first flight arrives before the second flight departs, because that is a bit tricky with this data due to overnight flights.) The total flight time (actual_elapsed_time) of the entire itinerary should be fewer than 480 minutes (i.e., 8 hours).

      For each itinerary, your query should return the name of the carrier (not the code but the name), the first flight number, the first flight time (actual_elapsed_time), the name of the intermediate city (not the airport code but the city name), the second flight number, the second flight time, and finally the total flight time. Only count flight times here; do not include any layover time.

      Name the output columns as follows: carrier_name, f1_flight_num, f1_time, intermediate_city, f2_flight_num, f2_time, and total time (with a space in its name!) in that order.

    3. List the distinct pairs of carrier name and origin city name where it is possible to take a flight of at least 3000 miles from that city on that carrier. Name the output columns cname and origin_city_name in that order.

    4. Find all aircraft (identified by tail_num) that departed the same airport (identified by the airport code) twice on the same day to different destination airports where the two flights totaled at least 5000 miles.

      Output columns tail_num and origin. Do not include duplicate rows.

    5. Find all registered aircraft (in the N-numbers table) owned by 'UNIVERSITY OF WASHINGTON' along with the type of aircraft (from the aircraft types table).

      Output columns n_number, name, mfr, and model.

    6. The tail_num column of the flights table (usually) starts with an N, while the n_number column of the n_numbers table (always) does not include the N. Joining these tables on these columns will be more challenging than usual. To make it simpler, your query may assume that tail_num in the flights table always starts with an N, even though this is not true.

      Find all aircraft that were manufactured in 2024 and took a flight from the city in which they are registered.

      Output columns n_number, name, and city from the N-numbers table. Do not include duplicates.

      Hint: to compare the flight city to the registration city, you may find the || operator to concatenate strings useful. (That link describes all operators/expressions. Search for "concat" on that page.) For example 'hello' || 'world' evaluates to 'helloworld'. You'll also potentially need to account for uppercase versus lowercase. Check the SQLite documentation to find useful functions to help you.

    7. Find all flights from 'Tampa, FL' on a 737 that were canceled due to weather.

      For the purposes of this problem, consider "a 737" to be any aircraft that has '737' as a substring of its aircraft type's model column. (You may want to look at the LIKE operator to find substrings.)

      As in the previous problem, you may assume that tail_num in the flights table always starts with an N, even though that is not true.

      Output columns year, month, day_of_month, cname (the carrier name), and flight_num.

    Turn in each subproblem in its own hw2-q2a.sql file named exactly according to this scheme.

  3. (Java) Convert your solution to problem 2c to a parameterized query using JDBC to connect to the SQLite database you constructed in problem 1.

    Get the starter code. Put your answer in HW2.java in the problem3 method. The argument to this method is the distance that replaces 3000 in problem 2c.

    Use SQL query parameters. Do not manipulate query strings using concatenation.

    Turn in only this file for this problem. The starter code should work out of the box with VSCode. We have also provided a Makefile to run the provided tests on the command line using make.