CSE 373 homework 0

CSE373—Data Structures and Algorithms for Nonmajors

Programming Assignment #5

due: Wednesday, 3/12/08, 10:00 PM

Introduction:

You may have heard terms such as ontologies, OWL, the Semantic Web, XML, etc. The World Wide Web and HTML are great for sharing content, but the content is meant to be readable by a person. On the Semantic Web, the content is meant to be readable by a computer. Already there is much XML content on the web as part of RSS feeds, accumulated by mashups, etc. However a challenge is that the information provided by one site is largely independent of information provided by another site; there exists no meta-knowledge unifying their information. The goal of the Semantic Web is for everything to be represented in a form which a computer can reason about. Obviously, we are nowhere close, but great strides are being made in biology on smaller domains focusing on anatomy, biochemistry, genetics, etc.

Why is it so important for computers to be able to reason? Here’s a simple example in the domain of anatomy. Suppose a doctor has a patient with a heart defect and wants to research it. A query on Google or PubMed for “heart defect” might turn up relevant information. Suppose another doctor had a patient with a heart defect in the left ventricle and wrote a paper on it, but the paper got indexed under “left ventricle defect.” The first doctor would never find that paper using the standard dumb search algorithms. But if the computer knew that the left ventricle was part of the heart, then that paper could be found!

In this project, you will be working with the Foundational Model of Anatomy (FMA), an acclaimed knowledge base developed at the University of Washington. The FMA is more specifically called an ontology, which is a controlled vocabulary of terms with formal relationships between the terms. Each term can be composed of other terms, and the rules governing the relationships can be defined in the ontology as well. Often the terms will have a taxonomy that allows traits to be inherited. At a minimum, an ontology can be represented crudely in the resource description framework (RDF) as subject-predicate-object triples, or with full rules and inferences in the web ontology language (OWL). An ontology enables the data it describes by supporting reasoning and data-mining, as well as standardizing nomenclatures across the world.

The current release of the Foundational Model of Anatomy has roughly 78,000 entities, 200 types of predicates, in excess of a hundred thousand instances (many just synonyms but some reified relations), and over a million subject-predicate-object triples. Its subclass taxonomy has a depth of 19 (Dorsal digital vein of left big toe, along with 17 other veins of the feet, are of that depth). We’ll be accessing only a few of the predicates.

Subject-Predicate-Object triples:

Your access to the FMA is through working with triples that read much like a sentence. For example, the FMA contains the triples (“Human body”,HAS_SUBCLASS,”Female human body”) and (“Human body”,HAS_SUBCLASS,”Male human body”). The FMA also contains the triple (“Heart”, IS_PART_OF,”Cardiovascular system”). You work with these triples by giving a subject s and a predicate p, and it will return each object o (stored in a list) such that (s,p,o) is a triple in the ontology. So a query of getObjects(“Human body”,HAS_SUBCLASS) will return all direct subclasses of “Human body.”

This representation might remind you of a graph. In fact, an ontology is a graph. Each item in the ontology is a labeled node, and a predicate is a directed edge (with a label) connecting a subject to an object (the source vertex to the target vertex). The getObjects method in effect is acting as an adjacency list.

The FMA has been defined robustly such that every predicate we’ll use has an inverse, and if (s,p,o) is in the ontology, then (o,inv(p),s) is also in the ontology. Note that the FMA is not yet complete, so don’t be alarmed if you think something should be there or be connected somehow and it’s not!

Programming (80 points):

You should use good style, whitespace, naming conventions, etc., and comment your code.

You may use ANYTHING in java.util.*. You may add as many helper methods as you want. Do not add any other instance or class variables. Feel free to change main... we will test your methods on the same and other input parameters.

The predicates to use are defined as constants at the top of the code. Desired outputs are in main. When the output is large, a checksum (64-bit hash) is shown instead as a way of authenticating/validating your results. (Checksums are frequently used to make sure that downloads haven’t been corrupted/tampered.) However you still should look at the outputs to see what they are and if they make sense. You may even learn something about anatomy!

To compile and run, you’ll need to add “fma.jar” to the classpath. For example on the command-line,
javac –cp fma.jar;. FMAQueries.java and java –cp fma.jar;. FMAQueries

Implement the following methods however you want:

0) directSubclasses(String s). Returns a Collection of Strings containing the direct subclasses of s. The Collection must iterate in alphabetical order. Already implemented for you!

1) transitiveParts(String s). Returns a Collection of Strings containing items that are transitively part of s, in other words, all items obtained from recursively asking for the parts of s. The Collection must iterate in alphabetical order.

2) certainOrgansInPart(String s). Returns a Collection of Strings containing items that are transitively part of s and are descendents of (transitively subclasses of) “Organ with organ cavity” or “Parenchymatous organ”. The Collection must iterate in alphabetical order.

3) commonPartsOfKids(String s). Returns a Collection of Strings containing items that are transitively part of each subclass of s but are not transitively part of s. A way to interpret this list is that it is the list of all items that are redundant and should be defined more abstractly in the common parent. For example, when “Human body” is passed in, it computes everything that is recursively part of both “Male human body” and “Female human body” but is not part of “Human body.” The Collection returned must iterate in alphabetical order.

4) shortestPath(String s, String t). Returns a Collection of Strings that when iterated, produces any shortest path from s to t (including the endpoints) via the IS_CONTINUOUS_WITH predicate. There is no weighting of edges, i.e. each edge costs 1. For this problem, we will use just the different myocardial zones as possible input. The numbers for the different zones are 1..20 plus 13a..17a. Cardiologists have demarcated the zones based on what coronary arteries supply blood to where.

5) allShortestPaths(String s, String t). Returns a Collection of Collections of Strings that when iterated, produces ALL shortest paths from s to t (including the endpoints) via the IS_CONTINUOUS_WITH predicate. The paths should appear in lexicographical order, that is, one Collection is before another if the first differing element between them is smaller in the former (see the sample output in main). For this problem, we will use just the different myocardial zones as possible input.

Turn in your work electronically from the “assignments” link on the class webpage. If you did not receive an email receipt, you did not turn in your project.