HW4: Subqueries

We will use the same dataset as HW2 and HW3. Set up your SQLite database as before.

(SQL) Write SQL queries to answer the following questions. As on previous homeworks, your queries should work on any data with the same schema as the provided data. Put each query in its own file called hw4-q1a.sql etc.

Unless otherwise stated, you should only use the parts of SQL we have covered in class, plus any other operators/functions from the SQLite documentation. Do not use windowing.

Remember that the topic of this homework is subqueries. If you're stuck, try figuring out how a subquery might help you.

Unless otherwise specified, you may name the output columns however you like, and the rows may be in any order.
1. Find the 10 longest distinct distances flown. Then, for each such distance, find all offered "carrier routes", i.e., distinct (carrier, origin, destination) triples, and report how many times that carrier route was flown. Identify routes by airport codes, not city names.
  
  Return columns carrier_name, origin, dest, distance, and num_flights. Sort the output in order of distance (highest distance first), breaking ties by origin (A-to-Z), then destination (A-to-Z), and then carrier name (A-to-Z).
  
  Note that a flight from, say SEA to ORD by Alaska Airlines, is considered different from the route ORD to Seattle by Alaska Airlines. Most carriers offer routes in both directions, which should appear as two separate rows in your output.
  
  Expected output relation cardinality: 22 rows. (You might expect it to be 20 rows, 2 directions for each of the top 10 distances, but some distances have multiple pairs of cities associated with them.)
2. A "route" is an ordered pair of origin airport code and destination airport code. (This is different from a "carrier route" which includes the carrier.)
  
  Find the 20 distinct most flown routes as well as every route that was flown exactly once. Show these results combined in one table, with columns origin, origin_city_name, dest, dest_city_name, and num_flights. Sort the output by num_flights from lowest to highest, and break ties with origin, and then dest.
  
  Hint: to combine two relations with the same schema, check out the UNION keyword. Note carefully the limitations of UNION with regards to its component queries and LIMIT and ORDER BY. You may need to work around these limitations using additional subqueries or using the WITH keyword.
  
  Expected output relation cardinality: 70 rows. First row is a route from ATL to ISP. Last row is a route from LAX to SFO.
3. The "competition factor" on a route is the number of distinct carriers who fly that route. For each distinct competition factor, find how many distinct routes have that competition factor.
  
  Report columns competition_factor and num_routes. Sort the output by competition factor from highest to lowest.
  
  Expected output cardinality: 7 rows, one each for competition factors 7, 6, ..., 1. The first row is (7, 8). In other words, there are 8 routes with competition factor 7.
4. For each route with competition factor 7, find the names of the 7 distinct carriers that fly that route.
  
  Report your results in three columns: the origin and dest airport codes of the routes, and a third column called competitors that is a string consisting of all the names of the competitors in alphabetical (A-to-Z) order separated by semicolons and spaces. (See the example output below.) Sort the output by origin airport code (A-to-Z) and break ties with destination airport code (A-to-Z).
  
  Hint: To compute the competitors column, use the string_agg function discussed in lecture and refer to its documentation. Pay special attention to getting the carriers within the string in the right order using the weird ORDER BY syntax inside of the string_agg function. (string_agg was added to SQLite in 2023. Older versions of SQLite support another identical function group_conat. Use group_concat if you are on an older version. I prefer the name string_agg because it also works in Postgres, but both are fine for this HW. Also, older versions that only support group_concat do not support the ORDER BY syntax inside of aggregate functions. If you are using an older version, feel free to ignore the requirement that airlines are listed in alphabetical order.)
  
  You are allowed/expected to hard code the number 7 in your query.
  
  Expected output cardinality: 8 rows. The first row is from ATL to PHL with this competitors string:
  
  American Airlines Inc.; Delta Air Lines Inc.; Frontier Airlines Inc.; PSA Airlines Inc.; Republic Airline; Southwest Airlines Co.; Spirit Air Lines
5. Find all carriers that operated a route where no other carrier operated the same route. (This is equivalent to finding all carriers who operate a route with competition factor 1.)
  
  Your solution to this question may not use grouping or aggregation. Therefore, it may not be easy to compute the competition factor directly for this question. Instead, use NOT EXISTS (or any other strategy you want that is not based on grouping or aggregation) to translate the "no other carrier..." part of the question.
  
  Output one column carrier_name with the name of all such carriers. Do not include duplicates.
  
  Expected output cardinality: 15 rows.
  
  Hint: it's a bit difficult to test your answer to this one because it is expected to return all carriers with flights in the provided data, because it turns out that every carrier has such a route. One way to double check your answer is to run the following statements to insert bogus carriers that both operate a single bogus route. Then rerun your query. It should not include the bogus carriers.
```
INSERT INTO carriers 
  VALUES ('foo', 'foo'), ('bar', 'bar');
INSERT INTO flights(op_unique_carrier, origin, dest) 
  VALUES ('foo', 'A', 'B'), ('bar', 'A', 'B');
```
  After you are done, carefully undo the bogus changes:
```
DELETE FROM flights 
  WHERE op_unique_carrier IN ('foo', 'bar');
DELETE FROM carriers 
  WHERE cid IN ('foo', 'bar');
```
6. We consider a flight to be "on time" if it has an arrival delay of at most 15 minutes. The daily on-time percentage for a carrier is the percentage of that carrier's flights that arrived on time for that day. Find all carriers that flew at least one flight and have a daily on-time percentage of at least 75% on every day that they flew a flight.
  
  Be careful not to assume the data only consists of one month of flights. Use (year, month, day_of_month) to identify a day, not just day_of_month.
  
  Output one column carrier_name with the names of the carriers in any order but without duplicates.
  
  Hint: "Every day" is a for-all statement.
  
  Expected output cardinality: 4 rows.
7. Starting at some origin airport, a "one-hop destination" is an airport reachable in a single flight from the origin airport. The "connection factor" of an origin airport is how many distinct one-hop destinations one can reach from that origin.
  
  Find airports where all of their one-hop destinations have a strictly higher connection factor.
  
  Output a single column airport with the airport codes of all such airports in any order but without duplicates.
  
  Expected output cardinality: 252 rows
(Java) Get the starter code. Implement a Java program with the same specification as question 1c. See the comments in that file for more details.

Turn in all your SQL files and single Java file to the HW4 assignment on Gradescope.