Marvel Comics #1

Handout H5

Contents:

Introduction

In this assignment you will put the graph you designed in Homework 4 to use by modeling the Marvel Comics universe! By trying out your ADT in a client application, you will be able to test the usability of your design as well as the correctness, efficiency, and scalability of your implementation.

This application builds a graph containing thousands of nodes and edges. At this size, you may discover performance issues that weren't revealed by your unit tests on smaller graphs. With a well-designed implementation, your program will run in a matter of seconds. Bugs or less ideal choices of data structures can increase the runtime to anywhere from several minutes to 30 minutes or more.

This Marvel dataset was also used by researchers at Cornell University, who published a research paper showing that the graph is strikingly similar to "real-life" social networks.

The MarvelPaths Application

In this application, your graph models a social network among characters in Marvel comic books. Each node in the graph represents a character, and an edge ⟨Char1,Char2⟩ indicates that Char1 appeared in a comic book that Char2 also appeared in. There should be a separate edge for every comic book. For example, if Zeus and Hercules appeared in (say) five issues of a given series, then Zeus would have five edges to Hercules, and Hercules would have five edges to Zeus.

You will write a class hw5.MarvelPaths (in file MarvelPaths.java) that reads the Marvel data from a file, builds a graph, and finds paths between characters in the graph. Because we will only test your class through a test script file format similar to the one in Homework 4, you are not required to write a main method as a driver for your application; nevertheless, we encourage you to do so, both for your own convenience and for the satisfaction that comes with creating a complete, standalone application.

As you complete this assignment, you may need to modify the implementation and perhaps the public interface of your ADT. Briefly document any changes you made and why in hw5/answers/changes.txt (no more than 1-2 sentences per change). If you made no changes, state that explicitly. You don't need to track and document cosmetic and other minor changes, such as renaming a variable; we are interested in substantial changes to your API or implementation, such as adding a public method or using a different data structure. Describe logical changes rather than precisely how each line of your code was altered. For example, "I switched the data structure for storing all the nodes from a ___ to a ___ because ___" is more helpful than "I changed line 27 from nodes = new ___(); to nodes = new ____();."

Leave your graph in the hw4 package where it was originally written, even if you modify it for this assignment. There is no need to copy files nor duplicate code! You can just import and use it in HW5.

Problem 0: Getting the Marvel Universe Data

Before you get started, obtain the Marvel Universe dataset. For legal reasons, we have not added the file to your repositories. Instead, download the data from Infochimps, unzip the download, and copy and rename the labeled_edges.tsv file inside to hw5/test/marvel.tsv. (Note that although the download page encourages you to create an account, it does allow you to bypass that step.)

Take a moment to inspect the file. Each line in marvel.tsv is of the form

    "character"	"book"

where character is the name of a character, book is the title of a comic book that the character appeared in, and the two fields are separated by a tab.

Problem 1: Building the Graph

The first step in your program is to construct your graph of the Marvel universe from a data file. We provide you with a class MarvelParser to help you. MarvelParser has one static method, parseData(), which reads data from marvel.tsv, or any file structured in the same format. parseData() creates in-memory data structures: a Set of all characters and a Map from each book to a Set of the characters in that book. These are not the data structures you want, however; you want a Graph.

You may modify MarvelParser however you wish to fit your implementation. You might change the method signature (parameters and return value) of parseData(), or you might leave parseData() as is and write code that processes its output. The only constraint is that your code needs to take a filename as a parameter so the parser can be reused with any file.

At this point, it's a good idea to test the parsing and graph-building operation in isolation. Verify that your program builds the graph correctly before you go on. The assignment formally requires this in Problem 3.

Problem 2: Finding Paths

The real meat (or tofu) of MarvelPaths is the ability to find paths between two characters in the graph. Given the name of two characters, MarvelPaths searches for and returns a path through the graph connecting them. How the path is subsequently used, or the format in which it is printed out, depends on the requirements of the particular application using MarvelPaths, such as your test driver.

Your program should return the shortest path found via breadth-first search (BFS). A BFS from node u to node v visits all of n's neighbors first, then all of n's neighbors' neighbors', then all of n's neighbors' neighbors', and so on until v is found or all nodes with a path from u have been visited. Below is a general BFS pseudocode algorithm to find the shortest path between two nodes in a graph G:

    start = starting node
    dest = destination node
    Q = queue, or "worklist", of nodes to visit: initially empty
    M = map from nodes to paths: initially empty.
        // Each key in M is a visited node.
        // Each value is a path from start to that node.
        // A path is a list; you decide whether it is a list of nodes, or edges,
        // or node data, or edge data, or nodes and edges, or something else.
    
    Add start to Q
    Add start->[] to M (start mapped to an empty list)
    while Q is not empty:
        dequeue next node n
        if n is dest
            return the path associated with n in M
        for each edge e=⟨n,m⟩:
            if m is not in M, i.e. m has not been visited:
                let p be the path n maps to in M
                let p' be the path formed by appending e to p
                add m->p' to M
                add m to Q
            
    If the loop terminates, then no path exists from start to dest.
    The implementation should indicate this to the client.

(Note: for code readability, we recommend using more descriptive variable names in the actual code than are needed in the pseudocode.)

Here are some facts about the algorithm.

Many character pairs will have multiple paths. For grading purposes, your program should return the lexicographically (alphabetically) least path. More precisely, it should pick the lexicographically first character at each next step in the path, and if those characters appear in several comic books together, it should print the lexicographically lowest title of a comic book that they both appear in. The BFS algorithm above can be easily modified to support this ordering: in the for-each loop, visit edges in increasing order of m's character name, with edges to the same character visited in increasing order of comic book title. This is not meant to imply that your graph should store data in this order; it is merely a convenience for grading. Because of this application-specific behavior, we recommend that you implement your BFS algorithm in MarvelPaths rather than directly in your graph.

Using the full Marvel dataset, your program must be able to construct the graph and find a path in less than 30 seconds on attu. ScriptFileTests, the provided class used to run your specification tests, is set to put a 3000 ms (30 second) timeout for each test. (As a point of reference, the staff solution took around 5 seconds to run on attu).

Problem 3: Testing Your Solution

Because the Marvel graph contains literally thousands of nodes and hundreds of thousands of edges, using it for correctness testing is probably a bad idea. By contrast, using it for scalability testing is a great idea, but should come after correctness testing has been completed using much smaller graphs. In addition, it is important to be able to test your parsing/graph-building and BFS operations in isolation, separately from each other. For these reasons, you will use a test driver to verify the correctness of both your parser and BFS algorithm on much smaller graphs, as you did in Homework 4.

Specification Tests

You will write specification tests using test scripts similar to those you wrote in Homework 4. The format is defined in the test file format section below. In addition to writing *.test and *.expected files as before, you will write *.tsv files in the format defined for marvel.tsv to test your parser. All these files will go in the hw5/test directory.

You should create a class named HW5TestDriver that runs tests in the same way that HW4TestDriver did. We have provided a starter file in test/ with method stubs and a small amount of starter code, but you can replace it. You have two choices as to your approach:

  1. As one option, you may copy your hw4/test/HW4TestDriver.java source file to hw5/test/HW5TestDriver.java, change it to package hw5.test, and revise it to match this assignment's file format specification. This will cause duplicated code between the two problem sets, but avoids some of the tricky issues with subclassing. This is probably the easier option.
  2. Alternatively, you may choose to write a new HW5TestDriver class that extends the HW4TestDriver and reuses some of its code. This approach may or may not be easier, but it is more elegant. You will need to go back and edit your HW4TestDriver to make it more extensible — in particular, by making some of its methods and/or fields protected instead of private. Also, static methods may not have the inheritance properties you want. You may find it easiest to just copy the main method from HW4TestDriver and change the references from HW4 to HW5. However you decide to change HW4TestDriver, it must continue to work correctly for Homework 4.

We may run each student's SpecificationTests against each other student's assignment; how you fare will determine a small part of your grade. (It would not make sense for us to do this cross-checking with ImplementationTests. If you don't know why, please review the difference between the two varieties of tests until you understand this point.)

Implementation Tests

Your specification tests will most likely cover the bulk of your testing. You may need to test additional behaviors specific to your implementation, such as handling of edge cases. If so, write these tests in a regular JUnit test class and add them to ImplementationTests.java. If you specify data files to load in your implementation tests, see the Hints section for information about specifying filenames. Be sure to run your tests on attu with ant validate to verify that the data files are still found successfully under the configuration in which we will run your tests.

Reflection [1 point]

Please answer the following questions in a file named reflection.txt in your answers/ directory.

  1. How many hours did you spend on each problem of this assignment? (Only include time when you are completely focused on CSE 331. For example, if you worked an hour in your noisy dorm lounge, or you were periodically reading email or IMs, estimate the amount of time you actually spent on CSE 331 during that hour.)
  2. In retrospect, what could you have done better to reduce the time you spent solving this assignment?
  3. What could the CSE 331 staff have done better to improve your learning experience in this assignment?
  4. What do you know now that you wish you had known before beginning the assignment?

Collaboration [1 point]

Please answer the following questions in a file named collaboration.txt in your answers/ directory.

The standard collaboration policy applies to this assignment.

State whether or not you collaborated with other students. If you did collaborate with other students, state their names and a brief description of how you collaborated.

Returnin [25 points]

After you receive your corrected assignment back from your TA, you will need to resubmit a corrected version of the code. You will have until Friday, May 18 at 11pm to returnin your code. The returnin script will be enabled a week earlier, at 11pm Friday, May 11.

You will only receive credit if you pass all the tests, and you have a fixed number of trials without penalty. See the returnin331 documentation for details.

Test script file format

The format of the test file is similar to the format described in Homework 4. As before, the test driver manages a collection of named graphs and accepts commands for creating and operating on those graphs.

Each input file has one command per line, and each command consists of whitespace-separated words, the first of which is the command name and the remainder of which are arguments. Lines starting with # are considered comment lines and should be echoed to the output when running the test script. Lines that are blank should cause a blank line to be printed to the output.

The behavior of the testing driver on malformed input command files is not defined; you may assume the input files are well-formed.

In addition to all the same commands (and the same output) as in Homework 4, the test driver accepts the following new commands:

LoadGraph graphName file.tsv

Creates a new graph named graphName from file.tsv, where file.tsv is a data file of the format defined for marvel.tsv and is located in the cse331/bin/hw5/test directory. (Files you place in cse331/src/hw5/test are automatically copied over when your code is compiled.) Filenames should be simple, meaning they do not include the directory in which they are located. For example, marvel.tsv is a simple filename; hw5/test/marvel.tsv is not. To automatically generate the full path from the simple filename so the file can be found by MarvelParser, call the provided getAbsoluteFilename(String filename) helper method in the HW5TestDriver starter code.

The command's output is

loaded graph graphName

You may assume file.tsv is well-formed; the behavior for malformed input files is not defined.

FindPath graphName node_1 node_n

Find the shortest path from node_1 to node_n in the graph using your breadth-first search algorithm. For this command only, underscores in node names should be converted to spaces before being passed into any methods external to the test driver. For example, "node_1" would become "node 1". This is to allow the test scripts to work with the full Marvel dataset, where many character names contain whitespace (but none contain underscores). Anywhere a node is printed in the output for this command, the underscores should be replaced by spaces, as shown below.

Paths should be chosen using the lexicographic ordering described in Problem 2. If a path is found, the command prints the path in the format:

path from CHAR 1 to CHAR N:
CHAR 1 to CHAR 2 via BOOK 1
CHAR 2 to CHAR 3 via BOOK 2
...
CHAR N-1 to CHAR N via BOOK N-1

where CHAR 1 is the first node listed in the arguments to FindPath, CHAR N is the second node listed in the arguments, and BOOK K is the title of a book that CHAR K-1 and CHAR K appeared in.

Not all characters may have a path between them. If the user gives two valid node arguments CHAR_1 and CHAR_2 that have no path in the specified graph, print:

path from CHAR 1 to CHAR N:
no path found

If a character name CHAR was not in the original dataset, print:

unknown character CHAR

where, as before, any underscores in the name are replaced by spaces. If neither character is in the original dataset, print the line twice: first for CHAR 1, then for CHAR N. These should be the only lines your program prints in this case — i.e., do not print the regular "path not found" output too.

Sample testing files

Several sample test files demonstrating the new commands are provided in hw5/test.

Hints

Performance

If your program takes an excessively long time to construct the graph for the full Marvel dataset, first make sure that it parses the data and builds the graph correctly for a very small dataset. If it does, here are a few questions to consider:

Paths to Files

When specifying filenames outside of the test file scripts, such as in a main() method, you will need to include the full directory name in the filename; another way of saying this is that you will use an "absolute pathname". (In the test file scripts, you only need to specify the simple filename because HW5TestDriver should use getAbsoluteFilename() to append the path at runtime.)

If you are running your program from the command line in the bin directory, you might use a pathname such as hw5/test/marvel.tsv, which will look for the file under the current (bin) directory. If you are running your program through Eclipse, it will look for files starting in the root of your cse331 directory, so you might supply the filename as src/hw5/test/marvel.tsv.

If you specify files in your implementation tests, make sure that the files can be found and the tests pass when you run ant validate on attu.

What to Turn In

You should add and commit the following files to SVN:

Additionally, be sure to commit any updates you may have made to the following files:

Finally, remember to run ant validate to verify that you have submitted all files and that your code compiles and runs correctly when compiled with javac on attu (which sometimes behaves differently from the Eclipse compiler).

Errata

None yet!

Q & A

Q: How should the test driver print a path from a character to itself?

A: A trivially empty path is different from no path at all, so the "no path found" output isn't appropriate here. But there are no edges to print, either. So your test driver should print the header line

path from CHAR 1 to CHAR 1:
but nothing else.