CSEP 544 Homework 6

Objectives:
To be able to manipulate XML: query it with XQuery.
Reading Assignments:
Lecture notes on XML and XQuery.
Number of points:
100, 10 for each sub-question.
Due date:
Wednesday, Dec 1st, 2010 - 12:01 am
Turn in format:
You should turn in your answers in a single text file with the solutions to the questions in numeric order. The header of the file (commented) should contain your name, course, and which XQuery engine you used (Zorba or Saxon). Each query should be preceded by a comment giving the problem number. Below the XQuery, in a comment, should be the first 3 items it returned (if there are fewer than 3, list them all). XQuery comments look like (: this :). For example, the first question is: (1) Retrieve all the names of all cities located in Peru, sorted alphabetically.

Your file should contain
      (: Name
         CSEP 544
         Other metadata...
         XQuery Engine: Saxon
      :)

      (: Problem 1. :)

      (Insert your XQuery here)

      (: Results
          <result>
              <country>
                  <name>Peru</name>
                  <city>
                      <name>Abancay</name>
                  </city>
                  <city>
                      <name>Arequipa</name>
                  </city>
                  <city>
                      <name>Ayacucho</name>
                  </city>
                  ...
              </country>
          </result>
      :)
We should be able to extract your answer for any of the problems and place it into a separate file that can be run to verify your solution. For instance, if one of your answers is placed in a file named X.xq and the following command is run,
zorba -i -f -q X.xq
or
java -cp saxon9he.jar net.sf.saxon.Query X.xq
the correct query result for that problem should be printed to standard output. Note: the '-i' option tells zorba to indent the final output.
If you keep all your queries in one file and use the comment syntax above to mark the problems, you can use the runProblem.pl script to run a single problem. Usage: perl runProblem.pl hw6.xq <problem#>. The script extracts the text between (: Problem #1. :) and the next (: Problem #2. :) (or EOF), dumps it to tmp.xq, and runs it. One drawback is that errors are reported with respect to tmp.xq (i.e., line numbers are off).
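The extraction step described above can be sketched in Python. This is not the runProblem.pl script itself, only an illustration of the idea; the function name extract_problem and the exact marker pattern (an optional '#' before the number, followed by a period) are assumptions.

```python
import re


def extract_problem(source: str, number: int) -> str:
    """Return the text between the '(: Problem <number>. :)' comment
    and the next problem comment, or the end of the file.

    Hypothetical helper: the marker pattern is an assumption based on
    the comment syntax shown in the turn-in format.
    """
    start = re.search(rf"\(:\s*Problem\s+#?{number}\.\s*:\)", source)
    if start is None:
        raise ValueError(f"problem {number} not found")
    rest = source[start.end():]
    # Any following problem marker ends the current problem.
    following = re.search(r"\(:\s*Problem\s+#?\d+\.\s*:\)", rest)
    body = rest[:following.start()] if following else rest
    return body.strip()
```

Writing the returned text to a file such as tmp.xq and passing that file to zorba or Saxon would then run just that one problem. Note that the results comment stays in the extracted text, which is harmless because it is an XQuery comment.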
 
Turn in link:
Please turn in your assignment in the dropbox.
Assignment Tools:
XQuery via the Zorba Command Line Client or Saxon (which has both Java and .NET versions).
  1. Please first install Zorba or Saxon on your PC (the Saxon Java version only requires unzipping the jar file).
  2. Download Mondial XML dataset and Mondial DTD.
  3. Follow the brief tutorial to get started (both Zorba and Saxon with Java+Linux).

Problems

[100 points, 10 pts for each sub-question: 8 pts for a correct answer, 2 pts for following the DTD] Consider the XML data instance Mondial, available here (about 1.8 MB). Write XQueries to answer the following questions. To formulate your queries, you need to understand how the various elements are nested, e.g., what is under a country, or under which element a city appears. It helps to inspect the DTD (ignore the warning that the data is not valid) or to inspect the data directly.

Moreover, the output of each XQuery should follow the DTD that is specified immediately after the problem statement. Furthermore, the output should be well-formed XML once the standard XML headers (<?xml version="1.0" encoding="UTF-8" ?>, etc.) have been added. That is, the output of the first question should be along the lines of:

    <result>
        <country>
            <name>Peru</name>
            <city>
                <name>Abancay</name>
            </city>
            <city>
                <name>Arequipa</name>
            </city>
            <city>
                <name>Ayacucho</name>
            </city>
            ...
        </country>
    </result>
Note: The amount of white space does not matter.
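One way to sanity-check the well-formedness requirement before turning in is sketched below in Python. It is only an illustration, not part of the assignment tools: the function name is_well_formed is hypothetical, and the check covers well-formedness only, not conformance to the DTD.

```python
import xml.etree.ElementTree as ET

XML_HEADER = '<?xml version="1.0" encoding="UTF-8" ?>'


def is_well_formed(query_output: str) -> bool:
    """Return True if the query output parses as a single XML document
    once the standard XML header has been added.

    Hypothetical helper: checks well-formedness only, not the DTD.
    """
    text = query_output.strip()
    # Assumption: some engines already print an XML declaration;
    # add the header only when it is missing.
    if not text.startswith("<?xml"):
        text = XML_HEADER + "\n" + text
    try:
        ET.fromstring(text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True
```

A query that returns a bare sequence of elements without a single enclosing root (for example, several <name> elements in a row) fails this check, which is why each DTD above starts from one top-level element.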

  1. Retrieve all the names of all cities located in Peru, sorted alphabetically.
    <!ELEMENT result (country)>
    <!ELEMENT country (name, city+)>
    <!ELEMENT city (name)>
    <!ELEMENT name (#PCDATA)>
    
  2. For each province of China, return its capital. Order the result by province name.
    <!ELEMENT result (country)>
    <!ELEMENT country (name, province+)>
    <!ELEMENT province (name, capital)>
    <!ELEMENT capital (name)>
    <!ELEMENT name (#PCDATA)>
    
  3. Find all countries with more than 20 provinces. Order by the number of provinces.
    <!ELEMENT result (country*)>
    <!ELEMENT country (name)>
    <!ATTLIST country num_provinces CDATA #REQUIRED>
    <!ELEMENT name (#PCDATA)>
    
    
  4. For each province (state) in the United States, compute the ratio of its population to its area, and return each province's name and its computed ratio, ordered by ratio.
    <!ELEMENT result (country)>
    <!ELEMENT country (name, state+)>
    <!ELEMENT state (name, population_density)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT population_density (#PCDATA)>
    
  5. Find all ethnic groups that live in more than 10 countries.
    <!ELEMENT result (ethnicgroups+)>
    <!ELEMENT ethnicgroups (name)>
    <!ATTLIST ethnicgroups num_countries CDATA #REQUIRED>
    <!ELEMENT name (#PCDATA)>
    
  6. Find the countries adjacent to the 'Pacific Ocean' (sea).
    <!ELEMENT result (waterbody)>
    <!ELEMENT waterbody (name, adjacent_countries+)>
    <!ELEMENT adjacent_countries (country+)>
    <!ELEMENT country (name)>
    <!ELEMENT name (#PCDATA)>
    
    Note: Use the country attribute for the tag <sea> to find the respective countries.
  7. Find all the provinces (states) of the United States with a population of more than 11,000,000. Compute the ratio of each qualifying state's population to the total population of the country. Return each state's name and the ratio. Order by the ratio in descending order.
    <!ELEMENT result (country)>
    <!ELEMENT country (name, state+)>
    <!ELEMENT state (name, population_ratio)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT population_ratio (#PCDATA)>
    
  8. For each river that crosses at least 2 countries, return its name and the names of the countries it crosses. Order by the number of countries crossed.
    <!ELEMENT html (head, body)>
    <!ELEMENT head (title)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT body (h1, ul)>
    <!ELEMENT h1 (#PCDATA)>
    <!ELEMENT ul (li+)>
    <!ELEMENT li (#PCDATA | font | ol)*>
    <!ELEMENT ol (li+)>
    <!ELEMENT font (#PCDATA)>
    
    The <li> contains a font tag and an ol tag so that the output looks roughly like:
    ...
    <ul>
      <li>
        <font>River name</font>
        <ol>
          <li>Country crossed #1</li>
          <li>Country crossed #2</li>
          ...
        </ol>
      </li>
      ...
    </ul>
    
    Note: Use the country attribute for the tag <river> to find the respective countries.
  9. Find the names of all countries that have at least 3 mountains over 2000m high, and list the names and heights of all mountains in these countries (regardless of their height). Note: the height attribute is in meters, so you don't have to do any conversions.
    <!ELEMENT result (country+)>
    <!ELEMENT country (name, mountains+)>
    <!ELEMENT mountains (name, height)>
    <!ELEMENT height (#PCDATA)>
    <!ELEMENT name (#PCDATA)>
    
  10. One user is interested in long rivers. Produce the following view of the data, containing only rivers longer than 2000 km (all lengths are in km), in the format described below:
    <!ELEMENT user (river+)>
    <!ELEMENT river (name, country*)>
    <!ELEMENT country (name)>
    <!ELEMENT name (#PCDATA)>
    
    Note 1: Some rivers may not have any country, due to noise in the data. It is OK to include these rivers, even if they look as if they flow through no country at all. Note 2: Use the country attribute for the tag <river> to find the respective countries.
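The notes on problems 6, 8, and 10 ask you to follow the country attribute of <sea> or <river> to the corresponding countries. The Python sketch below illustrates that lookup on a tiny made-up document; it is not a solution in XQuery. The sample data, the function name countries_of, and the assumptions that the attribute holds whitespace-separated country codes matching a car_code attribute on <country> should all be checked against the DTD and the data.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document, NOT real Mondial data. The attribute
# names (car_code, country) are assumptions to verify against the DTD.
SAMPLE = """
<mondial>
  <country car_code="X1"><name>Exampleland</name></country>
  <country car_code="X2"><name>Samplestan</name></country>
  <river id="river-demo" country="X1 X2"><name>Demo River</name></river>
  <river id="river-lost"><name>Lost River</name></river>
</mondial>
"""


def countries_of(element, root):
    """Resolve the whitespace-separated codes in the element's
    'country' attribute to country names (hypothetical helper)."""
    names_by_code = {
        c.get("car_code"): c.findtext("name") for c in root.findall("country")
    }
    codes = element.get("country", "").split()
    # Codes with no matching country are skipped rather than failing.
    return [names_by_code[code] for code in codes if code in names_by_code]
```

In this sample, the river without a country attribute resolves to an empty list, which mirrors the case described in Note 1 for problem 10.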