CSE 414: Homework 5

Objectives:

To practice using a NoSQL database system (AsterixDB).
To practice writing queries using the semistructured data model.
To practice manipulating semistructured data in JSON.

Assignment tools: VMWare and AsterixDB (see Setup below) along with these additional files (in a tar.gz archive).

Due date: Thursday, May 11th by 11:00pm.
Turn in your solution using the assignment drop box linked from the main course web page.

What to turn in:
A single text file, hw5-queries.sqlp. It should contain SQL++ commands executable by AsterixDB, along with comments indicating the problem numbers and any other answers asked for below (for example, where you are asked to report the running time).

AsterixDB Setup

AsterixDB does not work well on Windows, so options 1 and 2 below discuss how to run AsterixDB on a Linux machine. Option 1 is likely to be easiest, so that approach is recommended for Windows users. For Mac users, option 3 is likely to be easiest, so that approach is recommended for Mac users.

Option 1 (Recommended Windows)

One easy way to get access to a Linux machine when you have Windows is to use VMware. VMware allows you to run Linux (a competing operating system to Windows) as an ordinary Windows program. VMware does this by running the real Linux operating system in a virtual environment, where all of the I/O devices (disks, network, keyboard, etc.) are simulated by VMware. For example, what Linux thinks is a hard disk drive is actually just a file that VMware creates in Windows.

As a student in CSE 414, you should be able to download and install VMware from this web site (using your UW email address) at no cost. For Windows users, install the most recent version of "VMware Workstation". For Mac users, install the most recent version of "VMware Fusion". If you run into any difficulties getting access to this software, send email to support@cs.washington.edu.

In order to launch Linux in VMware, you will need a virtual machine image to run. Grab the one from this page. In addition to Linux, it also has AsterixDB already installed for you! Once you download and unzip the virtual machine image file, you can run it in VMware by clicking File > "Open and Run" on the menu. (You may see an error message about an ide disk not being found. Just ignore that.)

Once the VM instance launches, you should see a login window like this one:

Click on "A. Happy User" to log in. Once you do so, you will see a Linux desktop, which looks like this:

At the bottom of the screen, there is a row of icons. The second from the left is the terminal program. The fourth from the left is the web browser. Those are the only two that you will need for this assignment.

Open the terminal program, and execute this command:

ssh-keygen -t rsa -P ""

It will ask you if you want to name the file it creates id_rsa. Just press ENTER and accept the default name.

Now run the follwowing commands. They should each run without error.

cp .ssh/id_rsa.pub .ssh/authorized_keys
sudo chmod 777 /usr/local/asterixdb/clusters/local
sudo chmod 666 /usr/local/asterixdb/clusters/local/*.xml
sudo chmod 666 /usr/local/asterixdb/clusters/demo/*.xml
sudo chmod 666 /usr/local/asterixdb/conf/managix-conf.xml 
managix configure
managix validate
managix create -n my_asterix -c /usr/local/asterixdb/clusters/local/local.xml

That last command may take some time to complete. After it runs, you should see Status: ACTIVE, which tells you that AsterixDB is now running in your Linux virtual machine.

Finally, click on the browser window icon at the bottom of the screen. And navigate to the URL http://127.0.0.1:19001. You should see a page like this one where you can type in queries.

If you want to test that it works, you can try running the following query:

SELECT VALUE ds FROM Metadata.`Dataset` ds;

This query will print out information about all of the datasets that already exist in your AsterixDB instance. You will add your own datasets below.

If you later shut down and restart VMware, you can use the following commands to restart AsterixDB:

managix stop -n my_asterix
managix start -n my_asterix

Option 2 (Advanced)

Another way to get access to a Linux machine is to create one in Azure. Start by following these instructions to create a linux machine that you can connect to using SSH. (See the linked instructions on that page, or click here, to see how to install SSH on a Windows machine. On a Mac, SSH is already available in the Terminal program.)

Once you have connected to your Linux machine using SSH, you can follow these instructions to install AsterixDB on the Linux machine.

Note that you will need to configure the the firewall settings for your Linux instance to allow connections both via SSH and via HTTP on port 19001.

Option 3 (Recommended Mac)

For Mac users, it is also possible to install AsterixDB directly on your machine. To do so, start by downloading AsterixDB from this page. Click on the downloaded file to unzip it in your Downloads directory.

Next, run the following command in the Terminal to start your Asterix cluster:

cd Downloads/asterix-server-0.9.0-binary-assembly
samples/local/bin/start-sample-cluster.sh

Now, navigate in your browser to http://localhost:19001, and you should see a web page where you can type in SQL++ queries.

If you want to test that it works, you can try running the following query:

SELECT VALUE ds FROM Metadata.`Dataset` ds;

This query will print out information about all of the datasets that already exist in your AsterixDB instance. You will add your own datasets below.

When you want to stop your AsterixDB cluster, run this command:

samples/local/bin/stop-sample-cluster.sh

Loading Data

Now that you have AsterixDB running, we need to load some data. Start by downloading the data in the browser on your Linux machine. Navigate to this page (hw5.html) in that browser and click the link at the top of the page to download the additional files.

The browser will store the hw5.tar.gz file in the Downloads subdirectory. Run this command to unpack the archive:

tar xfz Downloads/hw5.tar.gz

If you run the ls command, you will now see four files ending with .adm that store the data we will use in this assignment. (Recall that ADM is an extension of JSON that supports additional types of data.)

Create a new dataverse "hw5", which includes a dataset "world", and load the mondial.adm data into that dataset using the following commands:

CREATE DATAVERSE hw5;
USE hw5;

CREATE TYPE worldType AS {auto_id: uuid};

CREATE DATASET world(worldType) PRIMARY KEY auto_id AUTOGENERATED;

LOAD DATASET world using localfs(("path"="127.0.0.1:///home/auser/mondial.adm"),("format"="adm"));

Test Queries

Run each of the following queries and examine the results. These queries contain useful templates for the questions on the homework, so make sure that you understand them.

Return the set of countries:

USE hw5;
SELECT x.mondial.country FROM world x;

Return each country, one by one (see the difference?):

USE hw5;
SELECT y as country FROM world x, x.mondial.country y;

Return just their codes and their names, orderd alphabetically. (Note that -car_code is not a legal field name since it includes a -, which means minus, so we must enclose the name in back quotes `...`.)
```
USE hw5;
SELECT y.`-car_code` as code, y.name as name
FROM world x, x.mondial.country y
ORDER BY y.name;
```

Here is a query that will produce an error because some provinces have a single city while others have lists of cities:

USE hw5;
SELECT z.name as province_name, u.name as city_name
FROM world x, x.mondial.country y, y.province z, z.city u
WHERE y.name = 'Hungary';

Try this version that fixes all cities to be lists:

USE hw5;
SELECT z.name as province_name, u.name as city_name
FROM world x, x.mondial.country y, y.province z, 
     CASE WHEN is_array(z.city) THEN z.city ELSE [z.city] END u 
WHERE y.name = 'Hungary';

Finally, try this query that includes the LET construct, which does not valid SQL but can be useful. (See the SQL++ documentation for more details on what is allowed.)

USE hw5;
SELECT z.name as province_name, (SELECT u.name FROM cities u)
FROM world x, x.mondial.country y, y.province z
LET cities = CASE WHEN z.city is missing THEN []
                  WHEN is_array(z.city) THEN z.city
                  ELSE  [z.city] END
WHERE y.name = 'Hungary';

Problems [24 points]

Give a SQL++ command to perform each of the following queries. A correct answer for each question is worth 2 points.

Retrieve all the names of all cities located in Peru, sorted alphabetically. [30 rows]
For each country return its name, its population, and the number of religions, sorted alphabetically by country. [238 rows]
For each religion return the number of countries where it occurs. Order them in decreasing number of countries. [37 rows]
For each ethnic group, return the number of countries where it occurs, as well as the total population world-wide of that group. (Hint: you need to multiply the ethnicity’s percentage with the country’s population. Use the functions float(x) and/or int(x) to convert a string x to a float or to an int.) [262 rows]
Compute the list of all mountains, their heights, and the countries where they are located. Here you will join the "mountain" collection with the "country" collection, on the country code. You should return a list consisting of the mountain name, its height, the country code, and country name, in descending order of the height. [272 rows]
Compute a list of countries with all their mountains. This is similar to the previous problem, but now you will group the mountains for each country. Your query should return a list where each element consists of the country code, country name, and a list of mountain names and heights. Order the countries by the number of mountains they contain. [238 rows]
Find all countries bordering two or more seas. Here you need to join the "sea" collection with the "country" collection. For each country in your list, return its code, its name, and its list of bordering seas, in decreasing order of the number of seas. [74 rows]
Return all landlocked countries. A country is landlocked if it borders no sea. Order your answers in decreasing order of the country's area. (Note: this should be an easy query to derive from the previous one.) [45 rows]
Find all pairs of countries that share both a mountain and a sea. Your query should return a list of pairs of country names. Avoid including a country with itself, like in (France,France). Also avoid listing both orderings of each pair, e.g., (France,Korea) and (Korea,France) (not a real answer). [7 rows]
For this query, you should also measure and report the runtime. It may be approximate. (Expect it around 10-30 minutes.)
Create a new dataverse called hw5index. Then, run the following commands:
```
USE hw5index;
CREATE TYPE countryType AS OPEN {
        `-car_code`: string,
	`-area`: string,
	population: string
    };
CREATE DATASET country(countryType)
   PRIMARY KEY `-car_code`;
CREATE INDEX countryID ON country(`-car_code`) TYPE BTREE;
LOAD DATASET country using localfs(("path"="127.0.0.1:///home/auser/country.adm"),("format"="adm"));
```
These command created the country dataset and a BTREE index on the attribute `-car_code`. (Once again, we have to put the name in back quotes because it is not otherwise a legal column name as it starts with a "-".) The country dataset stores data of type countryType, which has three required fields: `-car_code`, `-area` and population. Since the type is declared OPEN, the rows can contain additional fields beyond these required ones.
Create two new types, mountainType and seaType, and two new datasets, mountain and sea. Both types should have required fields `-id` and `-country`. Their primary keys should be autogenerated UUIDs (just as with mondial above). For each new dataset, create an index of type KEYWORD (rather than BTREE) on the `-country` field.

Include your commands for creating all three types, datasets, and indexes. (Note that there is no world dataset in hw5index.)
Rewrite the query from problem 9 to work in the new dataverse hw5index. Run it and report the new runtime. [7 rows]
Modify the query from problem 11 to return, for each pair of countries, the list of common mountains and the list of common seas. [7 rows]

Homework 5: JSON, NoSQL, and AsterixDB