HW3: Aggregation and Grouping

We will use the same dataset as HW2. Set up your SQLite database the same way as in problem 1 from HW2.

(SQL) Write SQL queries to answer the following questions. As on HW2, your queries should work on any data with the same schema as the provided data. Put each query in its own file called hw3-q1a.sql etc.

Unless otherwise stated, you should only use the parts of SQL we have covered in class, plus any other operators/functions from the SQLite documentation. Do not use windowing, subqueries, or WITH.

Remember that the topic of this homework is grouping and aggregation. If you're stuck, try figuring out how those might help you. Also, most of the harder problems require joins that are then aggregated, so don't be afraid to join stuff. Also be on the lookout for outer joins.

Unless otherwise specified, you may name the output columns however you like, and the rows may be in any order.
1. Write a query without aggregation or grouping to find all tail numbers in the flights data set that flew flights for more than one carrier. Output the tail_num and the carrier code carrier. Do not include duplicate rows.
  
  As in HW2, your query may assume that tail_num in the flights table always starts with an N, even though this is not true.
  
  Hint: Use a join.
  
  (Output relation cardinality: 7 rows)
  
  Hint: If you get 7 rows but something looks weird, you're on the right track. Check out the next subproblem.
2. Write a query without joins to find how many different carriers each tail number flew. Output tail_num and carrier_count (the number of different carriers that tail number flew for). Only include rows where the number of carriers is 2 or more.
  
  Hint: use grouping and aggregation.
  
  (Output relation cardinality: 1 row)
  
  Then, investigate the tail number(s) in the query output. Look at their flights in the original flights table. What do you notice about all of them? Write a comment in your hw3-q1b.sql file with your answer.
3. Write a query to count how many non-canceled flights each tail number made. Output columns tail_num, carrier, mfr, and model, and count. (carrier is the name of the carrier from the carriers table. mfr and model are from the aircraft types table. count refers to the number of rows this tail number has in the flights table.) Sort the output by count from highest to lowest, and return only the 20 rows with the highest counts.
  
  (Output relation cardinality: 20 rows. Max value of count is 294.)
  
  Then, write a comment in your hw3-q1c.sql file explaining why you chose the GROUP BY clause that you chose. (There is more than one reasonable choice.)
4. For each possible cancellation reason (description in cancellation_codes), find how many flights were canceled for that reason. Include cancellation reasons that are never used, displayed with 0 flights. Output columns cancellation_reason and num_flights.
  
  (Output relation cardinality: 4 rows. Min value of num_flights is 0.)
5. The column actual_elapsed_time represents the flight duration in minutes. Write a comment at the top of your file hw3-q1e.sql that explains (1) what is surprising about the order of the rows in the output of this query and (2) why SQLite orders it that way. (Note the ORDER BY clause. Look carefully at the actual output.)
```
SELECT DISTINCT actual_elapsed_time 
FROM flights 
ORDER BY actual_elapsed_time
```
  Then rewrite the ORDER BY clause of the query to put the rows in the "right" order that the original query author probably intended. Do not change any other clauses. Put your query in the same file, hw3-q1e.sql.
  
  Hint: You may find the SQLite data types documentation page useful again.
  
  (Output relation cardinality: 614 rows. Second row of corrected output has actual_elapsed_time 16.00).
6. For each origin city, determine the destination city with the longest flight, based on flight duration (i.e., actual_elapsed_time). Use what you learned in the previous problem to make sure you are handling actual_elapsed_time correctly. If there is more than one destination city that achieves the maximum flight time for a given origin city, pick one to include arbitrarily.
  
  Output columns as origin_city, longest_dest_city, and longest_time.
  
  Order the output from highest longest_time to lowest. Break ties with origin_city alphabetically A-to-Z.
  
  Hint: Use SQLite's special handling of bare columns in aggregate queries with max/min.
  
  (Output relation cardinality: 334 rows. First row of output has longest_time 690.0.)
7. For each airline, calculate the percentage of its non-canceled flights that arrived on time or early (i.e., arrival delay <= 0). Sort the output in descending order of this percentage, and break ties in ascending alphabetical order by airline name. Report percentages as percentages, not decimals (e.g., report 75.2534... rather than 0.7525...). Do not round the percentages. Name the output columns carrier_name and on_time_pct. In case a carrier has no flights, include a row for that carrier with percentage 0.0.
  
  Hint: Use CASE expressions (link to SQLite documentation) to implement conditionals in SQL (kind of like if statements or the ternary operator in other languages).
  
  Hint: You can put a CASE expression as the argument to an aggregate function.
  
  (Output relation cardinality: 1743 rows because there are 1743 carriers in the carriers table. First row of output has carrier_name "Endeavor Air Inc." and on_time_pct 80.46...)
(Theory) Let \(\mathbb{Z}\) be the set of integers, \(\mathbb{S}\) be the set of strings, and \(\mathbb{B}\) be the set of booleans. Let \(R\) be an arbitrary finite relation on \(\mathbb{S} \times \mathbb{Z}\), and let \(T\) be an arbitrary finite relation on \(\mathbb{Z} \times \mathbb{B}\).

Define the inner join of \(R\) and \(T\), denoted by \(R\bowtie T\) to be

\[ R \bowtie T = \{ (s, z, b) \mid (s, z) \in R \text{ and } (z, b) \in T \} \]

As usual, we denote the cardinality of a set \(A\) by \(|A|\).

In each of the following problems, you are asked to give an example of \(R\) and \(T\) satisfying some condition. Since these are both finite sets of pairs, you should simply list out their elements between curly braces.

For each answer, also explain in one sentence why your example satisfies the condition.
1. Give an example of \(R\) and \(T\) such that both \(R\) and \(T\) have at least two elements, but \(R\bowtie T\) is empty.
2. Give an example of \(R\) and \(T\) such that \(|R| = 3\) and \(|T| = 2\) and \(|R\bowtie T| = 6\).
3. Give an example where \(R\) and \(T\) each have at least two elements, and for every \((s, z) \in R\), there is exactly one tuple in \(R\bowtie T\) with that value of \(s\) in its first component.
4. Give an example where \(R\) and \(T\) each have at least two elements, and there is at least one \((s_1, z_1) \in R\) with more than one tuple in \(R\bowtie T\) with the value of \(s_1\) in its first component and at least one \((s_2, z_2) \in R\) such that no tuple in \(R\bowtie T\) has the value \(s_2\) in its first component.
Turn in your answers to this problem as a PDF to the HW3 Written assignment on Gradescope.
(Java) Get the starter code. Implement a Java program with this specification (similar to problem 1c but simpler).

Find how many non-canceled flights each tail number made. Sort the output by the count from highest to lowest. (Breaking ties between equal counts arbitrarily.) If a tail number has canceled flights but no non-canceled flights, do not include it.

Unlike the Java problems on previous homeworks, for this problem we want you to think like a Java programmer and forget what you know about SQL/databases. How would you solve this problem in Java? Try to write simple code, but also code of reasonable efficiency, as if your code would be run on the full data set of all months and not just September 2024. The full dataset has hundreds of millions of rows, but you may assume the data and all intermediate results fit in main memory.

Turn in your HW3.java file along with all your hw3-q1....sql files to the HW3 Programming assignment on Gradescope.