CSE 414 Homework 5: JSON, NoSQL, CouchBase

Objectives:

To be able to manipulate semistructured data in JSON and practice using a NoSQL database system (CouchBase)

Due date:

Wednesday, May 11, at 11:00 pm. Turn in your solution using the assignment drop box linked from the main course web page.

What to turn in:

A single file with comments indicating which problem each query answers called hw5-queries.n1ql. Important: make sure you can run your submission file all at once by running

cat hw5-queries.n1ql | cbq

This should produce a JSON object for each query result without errors.

Assignment Tools:

CouchBase

Download and install Couchbase (enterprise edition) from here http://www.couchbase.com/nosql-databases/downloads on your pc. Complete the registration as required, then download and unzip. For more information, go to http://developer.couchbase.com/documentation/server/4.0/getting-started/installing.html.

Mac OS/X: after download do not open!. First copy Couchbase Server to /Applications, then open. If Mac OS/X refuses to open, could be security settings: systems-preferences/security-and-privacy/allow-apps-downloaded-from=Anywhere. If you need to re-install it later, delete both the Couchbase program and the directory ~/Library/Application Support/Couchbase. Add this to your PATH variable via export PATH="$PATH:/Applications/Couchbase Server.app/Contents/Resources/couchbase-core/bin"
Windows: Open the executable setup launcher after download. Follow the instructions to install Couchbase.
Linux: Install the package. The server should start automatically. Add to your PATH variable via export PATH="$PATH:/opt/couchbase/bin"

Start the server by opening couchbase (click on the icon). It should start automatically if on Linux. Setup procedure:
- Open http://127.0.0.1:8091/index.html (normally your browser will open it automatically when you start the server).
- Keep all default settings; when asked for sample buckets do whatever you want (let's say you select all of them); complete the registration; choose a password for your Administrator account. Once finished, give it a few seconds to start it's server.
- Go to "Data buckets" and choose a bucket to delete (say, the "default data bucket"): click on the arrow / edit / delete. Confirm: yes, delete. (Why delete? because Couchbase allocates all RAM in the cluster to the existing buckets. To create a bucket, we must delete a bucket.)
- Click on Create New Data Bucket: name = mondial (leave all options default)
Download the Mondial JSON dataset from here. Store the file mondial.json in your work directory for this homework.
Upload mondial.json to Couchbase. From a terminal window type:
```
cbdocloader -u Administrator -p [your-password]  -n 127.0.0.1:8091 -b mondial -s 100 mondial.json
```
Make sure your PATH variable is correctly (so it can find cbdocloader), or your admin password is correct, and that the file mondial.json is accessible (cd or type the full path). If you are on a Mac and get an error relating to a snappy-c.h file, you will need to install the snappy library. If you have Homebrew, the command is brew install snappy.
Start the query interface. From a terminal window type:
```
cbq
```
On Windows: you can find file cbq.exe in C:\Program Files\Couchbase\Server\bin\cbq.exe. Open this executable file instead.

Now you can try your first queries, here are some examples:

select y from mondial x unnest x.mondial.country y where y.name='Greece';
select y.["-car_code"] as car_code, y.name, ARRAY_LENGTH(y.province) as no_provinces  from mondial x unnest x.mondial.country y where y.name='Greece';
select y.name, count(y.province) as no_provinces from mondial x unnest x.mondial.country y unnest y.province group by y.name;
select y.name, y.height from mondial x unnest x.mondial.mountain y;

There is no 'help': to exit, type CTRL/D. If you get the error message "No primary index on keyspace mondial. Use CREATE PRIMARY INDEX to create one." then you need to create one manually:

CREATE PRIMARY INDEX mondial_primary ON mondial USING GSI;

mondial.json is small enough that you can browse it with a text editor to explore it structure.

More Resources:

CouchBase Tutorials: http://developer.couchbase.com/documentation/server/4.0/getting-started/first-n1ql-query.html
CheatSheet http://docs.couchbase.com/files/Couchbase-N1QL-CheatSheet.pdf

Problems (2 points each):

Retrieve all the names of all cities located in Peru, sorted alphabetically.
[Result Size: 30 rows]

For each country return its name, its population, and the number of religions, sorted alphabetically by country. We are flexible about how you handle countries without the religion attribute, but try to include countries that don't have it. For this, you will need the ARRAY_LENGTH function.
[Result Size: 238 rows]

For each country that has provinces return its name, the number of provinces, and the number of cities, sorted alphabetically by country. Notice that for countries that have provinces all cities are listed inside the provinces.
[Result Size: 59 rows]

Find all countries with 20 provinces or more. Return the country name and the number of provinces. Order descending by the number of provinces and then ascending by the country name.
[Result Size: 26 rows]

For each province(state) in the United States, compute the ratio of its population to area, and return each province's name, its computed ratio, and order them descending by the ratio. You need the function TONUMBER(...) to convert a string to a number.
[Result Size: 51 rows]

For each country return its name and the number of mountains in that country, sorted alphabetically by country. Notice that here you need to join the country collection with the mountain collection.
[Result Size: 43 rows or 84 rows (depending on how you write the query)]

For each country return its name and the names and heights of all mountains over 2000m high. Return objects of the form country name, mountain name, mountain height where the country name is repeated for each of its mountains. Sort your results by country name and then mountain name. Note: the height attribute is in meters, so you don't have to do any conversions.
[Result Size: 149 rows or 168 rows (depending on how you write the query)]

For each river which crosses two more more countries, return its name, and the full names (not country code) of the countries that it crosses. Return objects of the form rive name, country name where the river name is repeated for each country it crosses. Sort your answers by the river name and then country name(s). Hint: all rivers that cross two more more countries have a 'located' collection, but some rivers that cross only one country also have a 'located' collection.
[Result Size: 144 rows]

Extra Credit (4 points): In a separate file called hw5-extracredit.n1ql, write a query using the NEST operator that answers question 7 but returns a list of mountain names and mountain heights for each country name. Do not do it using ARRAY_AGG. The staff isn't even sure how to do this.