We will use the same dataset as HW2. Set up your SQLite database the same way as in problem 1 from HW2.
(SQL) Write SQL queries to answer the following questions. As on HW2, your queries
should work on any data with the same schema as the provided data. Put each
query in its own file called hw3-q1a.sql
etc.
Unless otherwise stated, you should only use the parts of SQL we have
covered in class, plus any other operators/functions from the SQLite
documentation. Do not use windowing, subqueries, or WITH
.
Remember that the topic of this homework is grouping and aggregation. If you're stuck, try figuring out how those might help you. Also, most of the harder problems require joins that are then aggregated, so don't be afraid to join stuff. Also be on the lookout for outer joins.
Unless otherwise specified, you may name the output columns however you like, and the rows may be in any order.
Write a query without aggregation or grouping to find all tail numbers
in the flights data set that flew flights for more than one carrier.
Output the tail_num
and the carrier code carrier
.
Do not include duplicate rows.
As in HW2, your query may assume that tail_num in the flights table always starts with an N, even though this is not true.
Hint: Use a join.
(Output relation cardinality: 7 rows)
Hint: If you get 7 rows but something looks weird, you're on the right track. Check out the next subproblem.
Write a query without joins to find how many different carriers each
tail number flew. Output tail_num
and carrier_count
(the number of
different carriers that tail number flew for). Only include rows where
the number of carriers is 2 or more.
Hint: use grouping and aggregation.
(Output relation cardinality: 1 row)
Then, investigate the tail number(s) in the query output. Look at their
flights in the original flights table. What do you notice about all of
them? Write a comment in your hw3-q1b.sql
file with your answer.
Write a query to count how many non-canceled flights each tail number
made. Output columns tail_num
, carrier
, mfr
, and model
, and
count
. (carrier
is the name of the carrier from the carriers table.
mfr
and model
are from the aircraft types table. count
refers to
the number of rows this tail number has in the flights table.) Sort the
output by count
from highest to lowest, and return only the 20 rows
with the highest counts.
(Output relation cardinality: 20 rows. Max value of count
is 294.)
Then, write a comment in your hw3-q1c.sql
file explaining why you chose the
GROUP BY
clause that you chose. (There is more than one reasonable
choice.)
For each possible cancellation reason (description
in cancellation_codes
),
find how many flights were canceled for that reason. Include cancellation
reasons that are never used, displayed with 0 flights. Output columns
cancellation_reason
and num_flights
.
(Output relation cardinality: 4 rows. Min value of num_flights
is 0.)
The column actual_elapsed_time
represents the flight duration in
minutes. Write a comment at the top of your file hw3-q1e.sql
that
explains (1) what is surprising about the order of the rows in the
output of this query and (2) why SQLite orders it that way. (Note the
ORDER BY
clause. Look carefully at the actual output.)
SELECT DISTINCT actual_elapsed_time
FROM flights
ORDER BY actual_elapsed_time
Then rewrite the ORDER BY
clause of the query to put the rows in the
"right" order that the original query author probably intended. Do not
change any other clauses. Put your query in the same file,
hw3-q1e.sql
.
Hint: You may find the SQLite data types documentation page useful again.
(Output relation cardinality: 614 rows. Second row of corrected output
has actual_elapsed_time
16.00).
For each origin city, determine the destination city with the longest
flight, based on flight duration (i.e., actual_elapsed_time
). Use what
you learned in the previous problem to make sure you are handling
actual_elapsed_time
correctly. If there is more than one destination
city that achieves the maximum flight time for a given origin city, pick
one to include arbitrarily.
Output columns as origin_city
, longest_dest_city
, and longest_time
.
Order the output from highest longest_time
to lowest. Break ties with
origin_city
alphabetically A-to-Z.
Hint: Use SQLite's special handling of bare columns in aggregate
queries with
max
/min
.
(Output relation cardinality: 334 rows. First row of output has
longest_time
690.0.)
For each airline, calculate the percentage of its non-canceled flights
that arrived on time or early (i.e., arrival delay <= 0). Sort the
output in descending order of this percentage, and break ties in
ascending alphabetical order by airline name. Report percentages as
percentages, not decimals (e.g., report 75.2534... rather than
0.7525...). Do not round the percentages. Name the output columns
carrier_name
and on_time_pct
. In case a carrier has no flights,
include a row for that carrier with percentage 0.0.
Hint: Use CASE
expressions
(link to SQLite documentation) to implement conditionals in SQL (kind of
like if statements or the ternary operator in other languages).
Hint: You can put a CASE
expression as the argument to an aggregate
function.
(Output relation cardinality: 1743 rows because there are 1743 carriers
in the carriers table. First row of output has carrier_name
"Endeavor
Air Inc." and on_time_pct
80.46...)
(Theory) Let \(\mathbb{Z}\) be the set of integers, \(\mathbb{S}\) be the set of strings, and \(\mathbb{B}\) be the set of booleans. Let \(R\) be an arbitrary finite relation on \(\mathbb{S} \times \mathbb{Z}\), and let \(T\) be an arbitrary finite relation on \(\mathbb{Z} \times \mathbb{B}\).
Define the inner join of \(R\) and \(T\), denoted by \(R\bowtie T\) to be
As usual, we denote the cardinality of a set \(A\) by \(|A|\).
In each of the following problems, you are asked to give an example of \(R\) and \(T\) satisfying some condition. Since these are both finite sets of pairs, you should simply list out their elements between curly braces.
For each answer, also explain in one sentence why your example satisfies the condition.
Give an example of \(R\) and \(T\) such that both \(R\) and \(T\) have at least two elements, but \(R\bowtie T\) is empty.
Give an example of \(R\) and \(T\) such that \(|R| = 3\) and \(|T| = 2\) and \(|R\bowtie T| = 6\).
Give an example where \(R\) and \(T\) each have at least two elements, and for every \((s, z) \in R\), there is exactly one tuple in \(R\bowtie T\) with that value of \(s\) in its first component.
Give an example where \(R\) and \(T\) each have at least two elements, and there is at least one \((s_1, z_1) \in R\) with more than one tuple in \(R\bowtie T\) with the value of \(s_1\) in its first component and at least one \((s_2, z_2) \in R\) such that no tuple in \(R\bowtie T\) has the value \(s_2\) in its first component.
Turn in your answers to this problem as a PDF to the HW3 Written assignment on Gradescope.
(Java) Get the starter code. Implement a Java program with this specification (similar to problem 1c but simpler).
Find how many non-canceled flights each tail number made. Sort the output by the count from highest to lowest. (Breaking ties between equal counts arbitrarily.) If a tail number has canceled flights but no non-canceled flights, do not include it.
Unlike the Java problems on previous homeworks, for this problem we want you to think like a Java programmer and forget what you know about SQL/databases. How would you solve this problem in Java? Try to write simple code, but also code of reasonable efficiency, as if your code would be run on the full data set of all months and not just September 2024. The full dataset has hundreds of millions of rows, but you may assume the data and all intermediate results fit in main memory.
Turn in your HW3.java
file along with all your hw3-q1....sql
files to
the HW3 Programming assignment on Gradescope.