The assignment is due by 12pm noon on January 24th. Please submit the assignment here.

Corrections

Assignment Goal

The goal of this assignment is to ensure that you are comfortable using a relational database management system (RDBMS), that you are comfortable with SQL. SQL, the Structured Query Language, is the most widely used relational database language today. It allows you to specify the data you are interested in by using logical expressions.

For the purpose of the class, we will be using the PostgreSQL open-source DBMS. PostgreSQL provides a fairly standard implementation of SQL. Please keep in mind, however, that when you change to a different DBMS, you may need to use different variants of SQL. The differences normally affect only advanced features.

Please let us know if you encounter any problems with this assignment. We will post clarifications and fixes here as necessary.

Turn in instructions

There are 6 parts to this assignment. For all parts except part 3, please write down your solution in a file problem<part number>.txt. For example, the file for part 1 should be named problem1.txt. For part 3, a txt file might not be convenient. If so, please use a pdf file named problem3.pdf.

To hand-in your assignment, issue the following command under the directory where you saved the above files:

After executing this command, you should see a file named cse544-11wi-youraccountname.tar.gz. Please submit the file to this Catalyst dropbox.

PostgreSQL

You will have access to postgres on cubist.cs.washington.edu. Log in through ssh and use your unix workstation password at the prompt (this should be the same password as the one that you use to check email). However, we strongly encourage you to download and install postgres on your own machine and use it locally. The reason is that you need to do this once in your lifetime, and now it is the best of times. Once you install it locally, you are the administrator and the sole user: you are in full control of the database system.

Overview

In this assignment you will download a cool dataset from the Web, store it in your database, then use the database system to ask interesting queries over the data. The dataset is NELL, generated by Tom Mitchell's “never ending learning project”.

Database Content

Physical Database Schema.

We removed the header from the file used in this assignment. The header consisted (in order) of the seven attribute names in the list above. Also, it might be needed look at the dataset to figure out what the attribute types should be. For the sizes of the varchars we used a script was used to determine the maximum sizes. But you do not need to run this script, we have already computed the lengths of the fields.

Import the data

Use postgres 'copy’ command. To find out more about this command, type this in postgres: '\h copy’.

Logical Database Schema

We will use only the following four attributes in this homework: entity, relation, value, and p. Create a new table: Nell(entity, relation, value, p). We call each record in Nell a “triple”. This is because we view it as a triple (entity, relation, value) which has some probability p of being correct. This question asks you to populate Nell with the all the triples derived from NellRaw: in other words, materialize the view Nell.

Application Schema

For some questions below you need to create virtual views for specific relations. For example, you will create a view:

that consists of all triples of the form Nell(entity, 'concept:producesproduct’, value, p). We will refer to these as the “application schema”.

Probabilities

Indexes

In question 5 you will be asked to create indexes, but you may want to create them much earlier, or your homework will take forever to complete

Turn in the answer to the following questions

Solutions Overview The SQL queries for the individual subparts can be found with the subparts.

1. SQL Queries On The Logical Schema. Solution: Part 1
Write a SQL query to answer each of the following questions:

1.2 how many distinct entities and how many distinct values are there in Nell? (Note here you may write two SQL queries.)

1.3 how many distinct values are both “entities” and “values”? (These are candidates for computing joins.)

1.4 what fraction of triples in Nell have probability < 1.0? (Note: you should write a single SQL query.)

1.5 find the top 10 most frequent relations, and the number of times they occur. Hint: the SELECT command in SQL has a syntax for restricting the set of answers to the top k; use that in your answer.

2. The Application Schema. Solution: Part 2
Define virtual views for the following relations:

3. SQL Queries On The Application Schema. Solutions: Part 3 (sql
For each of the queries below, (a) write a query in the relational calculus, (b) write an equivalent relational algebra expression, and (c) write and execute a SQL query

3.3: find all products manufactured by companies headquartered in San Jose (concept:city:san_jose)

3.4: find all products manufactured only by companies s.t. all their competitors make that product

3.5: find all products manufactured only by companies that have some competitor making that product

4. Integrity Constraints. Solution: Part 4
Consider the full application schema, where each relation value is transformed into a new relation: R(entity, value, p). (There are a few hundreds such: you will have to find the exact number in question 1. But you do not have to actually define these hundreds of views in order to answer the questions below.)

4.3 find all relation R1 s.t. the inclusion constraint R1.value ==> R2.entity holds where R2='concept:stadiumhometoleague’. That is, you should report all relations R1 s.t. every value in R1.value occurs in R2.entity.

Hint: in order to speed up all queries 4.1-4.3you may want to create a materialized view consisting of all relation names R.

5. Indexes. Solution: Part 5
Databases typically contain large amounts of data. Because this data must persist even if the DBMS goes down, it is stored on disk. To answer a query, the DBMS must thus read data from disk, which can be slow if a lot of data must be read. Indexes are data structures that can speed-up this process by enabling the DBMS to read only a subset of all data from disk when answering a query.

SELECT *
FROM NellRaw
WHERE entity='concept:food:kefir’;

5.2 Now modify the schema of NellRaw to indicate that (entity, relation, value) is a key. (Hint: use the ALTER TABLE command.) Execute the above mentioned query again. How long does it take now? Why is that?

Note that PostgreSQL automatically creates an index on the attributes that serve as the primary key of a relation. In the case of PostgreSQL, this index is also the mechanism that enforces the primary key constraint.

5.3 An index on the primary key speeds-up only specific types of queries that may not correspond to the query workload of the system. Consider the following query:

SELECT count(*)
FROM NellRaw
WHERE iteration = 171;

5.4 Create an index on NellRaw that will speed-up the query above. Please provide the SQL statement that you use to create the index and an explanation for your choice of index. How long does the query take now?

6. Probabilities. Solution: Part 6
Assume now that each triple in Nell is an independent probabilistic event: the triple may be present or not, and p gives the probability. Write SQL queries for the questions below; your queries should return both the answers (tuples) and their probabilities, and order the tuples in decreasing order of the probabilities.

For example, assume the question is “Retrieve all products manufactured by a company headquartered in 'san_jose’, and their manufacturer”, which can be written as a conjunctive query:

SELECT u.value, v.entity, u.p*v.p as prob
FROM nell u, nell v
WHERE u.relation = 'concept:producesproduct’
AND v.relation = 'concept:headquarteredin’
AND u.entity = v.entity
AND v.value = 'concept:city:san_jose’
ORDER by prob desc;

because the probability of an answer (x,y) is the product of the probability of the producesproduct(x,y) tuples and the probability of the headquartered(y,'concept:city:san_jose’) tuple.

On the other hand, consider the following example: “find all products manufactured by at least one company”. As a conjunctive query this is:

SELECT value, or_aggregate(p) as p
FROM nell where relation='concept:producesproduct’
GROUP by value
ORDER by p desc;

Here or_aggregate is an aggregate operator defined in this file that, when applied to a set {p1, p2, …, pn}, computes the value 1-(1-p1)*(1-p2)*…*(1-pn). That is, the probability of an output y is the “or” of the probabilities of all tuples of the form producesproduct(x,y).

6.1 find all products manufactured by companies headquartered in 'concept:city:san_jose’. Use the relations 'concept:producesproduct’ and 'concept:headquarteredin’. Note: in general there may be more than one company in san_jose manufacturing a given product.

6.2 Find all products made by companies without competitors. Use the relations 'concept:producesproduct’ and 'concept:competeswith’.

6.3 Find all cities that are headquarters of some company that have at least one competitor and manufacture at least one product. Use the relations 'concept:headquarteredin’, 'concept:producesproduct’, 'concept:competeswith’.

Note. In all these four queries you need to be careful to manipulate probabilities correctly. You can multiply probabilities only if they are independent: otherwise you cannot multiply them.

CSE 544: Assignment 1