Contents:
- Introduction
- The MarvelPaths Application
- Part 0: Understanding the Marvel Universe Data
- Part 1: Parsing the Data
- Part 2: Building the Graph
- Part 3: Finding Paths
- Part 4: Testing Your Solution
- Part 5: Running Your Solution From the Command Line
- Test script file format
- Paths to Files
- Hints
- How to Turn In Your Homework
Introduction
In this assignment, you will put the graph you designed in the graph homework to use by modeling the Marvel Comics universe. By trying out your Graph ADT in a client application, you will be able to test the usability of your design as well as the correctness, efficiency, and scalability of your implementation.
This application builds a graph containing thousands of nodes and edges. The size of the graph might expose performance issues that weren't revealed by your unit tests on smaller graphs. With a well-designed implementation, your program will run in a matter of seconds. Bugs or suboptimal data structures can increase the runtime to anywhere from several minutes to 30 minutes or more. If this is the case, you may need to go back and revisit your graph implementation from earlier. Remember that different graph representations have widely varying time complexities for various operations and this, not a coding bug, may explain the slowness.
This Marvel dataset was also used by researchers at Cornell University, who published a research paper showing that the graph is strikingly similar to “real-life” social networks.
The MarvelPaths Application
In this application, your graph models a social network among characters in Marvel
comic books. Each node in the graph represents a character, and an edge
〈Char1,Char2〉
indicates that Char1
appeared in
a comic book that Char2
also appeared in. There should be a separate
edge for every comic book any two characters appear in, labeled with the name of
the book. For example, if Zeus and Hercules appeared in five issues of a given
series, then Zeus would have five edges to Hercules, and Hercules would have
five edges to Zeus.
You will write a class MarvelPaths
(create the file yourself in:
hw-marvel/src/main/java/marvel/
named MarvelPaths.java
)
that reads the Marvel data from a file (marvel.csv
), builds a graph,
and finds paths between characters in the graph.
As you complete this assignment, you may need to modify the implementation and/or the specification of your Graph ADT. You are welcome to do this - you're not "locked in" to any of your previous decisions. We encourage you to think carefully about any changes you might make, and use the same thought process you used when originally creating your design to evaluate any potential changes. If you make changes that modify the behavior of methods or adds new methods, please update your test suite accordingly (add tests or modify expected output).
Briefly document any changes you make to your graph ADT documentation,
implemnetation or corresponding test suite in the provided
changes.txt
file. There's no need to include simple cosmetic changes
like new vairable names, but big changes like new methods, changes to your
representation, or substancial improvements to your test suite or
specifications should be clearly stated.
Leave your graph in the graph
package where it was originally written,
even if you modify it for this assignment. There is no need to copy files nor
duplicate code! You can just import
and use it in this assignment. If
you do modify your graph
code, be sure to commit those changes your
repository.
Part 0: Understanding the Marvel Universe Data
Before you get started, look at the file with the Marvel data:
hw-marvel/src/main/resources/data/marvel.csv
.
A CSV (“comma-separated values”) file consists of human-readable
data delineated by commas, and can be opened in your favorite text editor,
spreadsheet, or IntelliJ. Each line in marvel.csv
is of the
form
character,book
where character is the name of a character, book is the title of
a comic book that the character appeared in, and the two fields are separated by a
comma (,
).
Part 1: Parsing the Data
The first step in your program is to parse the provided csv files. We have provided
a starter file MarvelParser.java
and you must complete the
implementation.
We've provided a helper method readLines
that returns a
List<String>
where each element in the list is one line of the
file that has been opened. You can use this method to read in the raw data in the
file and then process it according to the format of the file. You'll need to
complete the implementation of parseData
to parse the data from the
file into some useful data structure(s) and return them to your main program (in
MarvelPaths.java) to be assembled into your graph.
Note: When the readLines
method is called (such as in the provided
code), we can just pass it the simple filename of the file and it'll take care of
looking in the correct directory (src/resources/data/). For example:
readLines("marvel.csv")
. If you write any parsing code elsewhere in
this course (this is unlikely for most students), you should write
code similar to the implementation of readLines
here to ensure that it
works across projects/assignments and in testing. See the comments in
readLines
for more info.
Part 2: Building the Graph
The next step in your program is to construct your graph of the Marvel universe
from a data file. Once you have completed MarvelParser.parseData
,
you need to convert the data from it into a Graph. This will be specific to your
implementation of a Graph, but most likely will involve looping through all the
data returned by the parser and calling methods on your Graph class(es) to
construct nodes/edges and link them together appropriately. Remember that, for each
comic book relationship between Character A and Character B, there should be two
edges: one from Character A to Character B, and one the other way around, from
Character B to Character A.
Warning: You should not create edges from a character to themself as part of this process.
At this point, it's a good idea to test the parsing and graph-building operation in isolation. Verify that your program builds the graph correctly before continuing by completing the relevant parts of Part 4.
Part 3: Finding Paths
The real meat of MarvelPaths
is the ability to find paths between two
characters in the graph. Given the name of two characters,
MarvelPaths
should search for and return a path through the graph
connecting them. How the path is subsequently used, or the format in which it is
printed out, depends on the requirements of the particular application using
MarvelPaths
, such as your
test driver (see below).
Your program should return the shortest path found via breadth-first search (BFS). A BFS from node u to node v visits all of u's neighbors first, then all of u's neighbors' neighbors, then all of u's neighbors' neighbors' neighbors, and so on until v is found or all nodes with a path from u have been visited. Below is a general BFS pseudocode algorithm to find the shortest path between two nodes in a graph G. For readability, you should use more descriptive variable names in your actual code:
start = starting node dest = destination node Q = queue, or "worklist", of nodes to visit: initially empty M = map from nodes to paths: initially empty. // Each key in M is a visited node. // Each value is a path from start to that node. // A path is a list; generally, this is a list of node and edge data together (in // whatever data structure makes sense for you) because we want to include // both the nodes along the path and the specific edges you take to get to them. Add start to Q Add start->[] to M (start mapped to an empty list/path) while Q is not empty: dequeue next node n if n is dest return the path associated with n in M for each edge e=〈n,m〉: if m is not in M, i.e. m has not been visited: let p be the path n maps to in M let p' be the path formed by appending e to p add m->p' to M add m to Q // If the loop terminates, then no path exists from start to dest. // The implementation should indicate this to the client. Note that // BFS returns the path with the fewest number of edges.
Many character pairs will be connected by multiple shortest paths with the same length. For grading purposes, your program should return the lexicographically (alphabetically) least path. More precisely, it should pick the lexicographically first character at each next step in the path, and if those characters appear in several comic books together, it should print the lexicographically least title of a comic book that they both appear in. The BFS algorithm above can be easily modified to support this ordering: in the for-each loop, visit edges in increasing order of m's character name, with edges to the same character visited in increasing order of comic book title. This is not meant to imply that your graph should store data in this order; it is merely a convenience for grading. Note that normally it wouldn't be a great idea to "specialize" BFS like this, where it requires a sort-able input, but it's fine to modify the BFS itself to achieve this sorting behavior for this assignment and our use-case.
Because this search is specific to this particular application, you should
implement your BFS algorithm in MarvelPaths
rather than directly in
your graph, as other hypothetical applications that might need BFS probably would
not need this special ordering. Furthermore, other applications using the graph
ADT might need to use a different search algorithm, so we don't want to hard-code
a particular search algorithm in the graph ADT itself.
Using the full Marvel dataset, your program must be able to construct the graph
and find a path in less than 30 seconds on attu. ScriptFileTests
, the
provided class used to run your script tests, is set to put a 30000 ms (30 second)
timeout for each test. (For reference, the staff solution took around 5 seconds to
run on attu).
If your Graph ADT has a particularly thorough or expensive checkRep
function, it is possible that it could take too long to load the Marvel dataset
when all checks are performed. A good checkRep
function should include
some way to disable particularly expensive checks by setting a compile-time or
run-time variable. See the week # section
slides
as an example.
You must disable the expensive checkRep
tests in the version
you submit for grading so it will pass all tests in the allotted time.
Part 4: Testing Your Solution
Because the Marvel graph contains thousands of nodes and hundreds of thousands of edges, you should not use it for correctness testing (such as ensuring that your code computes the correct path between two characters), because it will be too hard to determine the correct path in most cases. By contrast, the full data set is useful for scalability testing or to detect corner cases, after you have performed correctness testing using much smaller graphs that you create. Additionally, it is important to be able to test your parsing/graph-building and BFS operations in isolation and separately from each other. For these reasons, you should use a test driver to verify the correctness of both your parser and BFS algorithm on much smaller graphs.
Note: all of your hw-graph tests still exist, so there's no need to re-create any of those tests for hw-marvel. Your hw-marvel tests should be focused on the behavior your MarvelPaths (and related) code has.
Script Tests
You should write script tests using test scripts similar to those you wrote in the
graph homework. The format is defined in the
test file format section below. In addition to
writing *.test
and *.expected
files as before, you will
write *.csv
files in the format defined for
marvel.csv
to test your parser. All the *.csv
files used
for testing should be placed in the hw-marvel/src/main/resources/data
directory. Place your test scripts in
hw-marvel/src/test/resources/testScripts/
.
You should create a class named MarvelTestDriver
that runs tests in
the same way that GraphTestDriver
did. We have provided a starter file
in hw-marvel/src/test/java/marvel/scriptTestRunner
with method stubs
and a small amount of starter code, but you can replace it. You may copy your
graph/src/test/java/graph/scriptTestRunner/GraphTestDriver.java
source
file to hw-marvel/src/test/java/marvel/scriptTestRunner/MarvelTestDriver.java
,
change it to package marvel.scriptTestRunner
, and revise it to match
this assignment's file format specification. This will cause duplicated code
between the two assignments, but avoids some tricky issues with subclassing
(especially when trying to subclass test classes).
JUnit Tests
Your script tests will most likely cover the bulk of your testing. You may need to
test additional behaviors specific to your implementation, such as handling of edge
cases. If so, write these tests in a regular JUnit test class or classes in the
marvel.junitTests
package. If your data fails to load in your JUnit
tests, see the File Paths section for information about
resource files. Be sure to run validation on attu with
./gradlew hw-marvel:validate
to verify that the data files are still
found successfully under the configuration in which we will run your tests. (And to
make sure that your test run in a reasonable amount of time.)
Running Tests
To run your tests, we recommend using the IntelliJ JUnit testing integration. Make
sure you've set up your IntelliJ to properly run tests using gradle - see the
Editing/Compiling/Running/Testing handout
for more setup instructions. Once you've done so, you can right-click on any of
your JUnit tests and choose "Run <filename>" to run the tests in that file.
To run all the tests in a folder, you can right click a folder and choose "Run
Tests in <folder>". The script tests (.test and .expected files) are run by
ScriptFileTests
, so right-click and run that file to run all of your
script tests.
You can also always use the gradle hw-marvel:test
target in the
gradle window to run all tests. (Under cse331 > hw-marvel > Tasks > verification >
test in the IntelliJ gradle window).
Please be extra sure that your IntelliJ is properly configured to use the gradle test runner before trying to run tests using any of these methods. (See the editing/compiling handout if you haven't already done this.) Running tests with the IntelliJ runner instead of the Gradle runner will likely result in "no tests found" when trying to run your script tests. This requires extra work to undo beyond just fixing the settings after the first test run, as IntelliJ will remember the original settings you had when you ran the tests the first time, even if you change them later. If you find yourself in this situation, please contact the staff for instructions on fixing IntelliJ so it will always use the Gradle runner.
Part 5: Running Your Solution From the Command Line
A surprising number of students forget to do this part of the assignment. Please don't let that be you!
Add a main
method to MarvelPaths
that allows a user to
interact with your program from the command line (i.e., your code should read user
input from System.in
). The program should read pairs of character
names and then print to the console the path between those two characters if any,
or else print a suitable message (your code should write to
System.out
). There is no rigid specification here for input or output
formats, but especially creative ones may receive a small amount of extra credit.
At a minimum, your program should make it clear to the user how they are supposed to input the character names, and there should be some reasonable way to quit the program when they are done (that doesn't involve just crashing).
To run your program, a user would run the hw-marvel:runMarvel
gradle
task. If you're having trouble running the task, try the following command in the
command line:
./gradlew hw-marvel:runMarvel
Test script file format
The format of the test file is similar to the format described in the graph assignment . As before, the test driver manages a collection of named graphs and accepts commands for creating and operating on those graphs.
Each input file has one command per line, and each command consists of whitespace-separated words, the first of which is the command name and the remainder of which are arguments. Lines starting with # are considered comment lines and should be echoed (copied) to the output when running the test script. Lines that are blank should cause a blank line to be printed to the output.
The behavior of the testing driver on malformed input command files is not defined; you may assume the input files are well-formed.
In addition to all the same commands (and the same output) as in the graph assignment, the test driver for this homework accepts the following new commands:
LoadGraph graphName file.csv
Creates a new graph named graphName from file.csv, where
file.csv is a data file of the format defined for
marvel.csv
and is located in the
src/main/resources/data/
directory of your project. The
command's output is
loaded graph graphName
You may assume file.csv is well-formed; the behavior for malformed input files is not defined.
Filenames supplied to the LoadGraph command should be simple, meaning they
do not include the directory in which they are located. For example,
marvel.csv
is a simple filename;
src/main/resources/data/marvel.csv
is not.
FindPath graphName node_a node_b
Find the shortest path from node_a to node_b in the graph using your breadth-first search algorithm.
Paths should be chosen using the lexicographic ordering described in Part 3. If a path is found, the command prints the path in the format:
path from CHAR 1 to CHAR N: CHAR 1 to CHAR 2 via BOOK 1 CHAR 2 to CHAR 3 via BOOK 2 ... CHAR N-1 to CHAR N via BOOK N-1
where CHAR 1 is the first node listed in the arguments to FindPath, CHAR N is the second node listed in the arguments, and BOOK K is the title of a book that CHAR K and CHAR K+1 appeared in.
It is possible that given two character names there is no path in the data between them. If the user gives two valid node arguments CHAR 1 and CHAR 2 that have no path in the specified graph, print:
path from CHAR 1 to CHAR 2: no path found
If a character name CHAR was not in the original dataset, print:
unknown: CHAR
If neither character is in the original dataset, print the line twice: first for CHAR 1, then for CHAR N (even if they are the same character). These should be the only lines your program prints in this case — i.e., do not print the regular "path not found" output, and do not print the "path from CHAR 1 to CHAR N" output.
What if the user asks for the path from a character in the dataset to itself? Print:
path from C to C:
but nothing else. (Hint: a well-written solution requires no special handling of this case.)
This applies only to characters in the dataset: a request for a path from a character that is not in the dataset to itself should print the the usual "unknown: C" output, including printing it twice, since "both" characters could not be found.
Sample testing files
Several sample test files demonstrating the new commands are provided in
hw-marvel/src/test/resources/testScripts/
. Your .test
and
.expected
files should also go in this directory alongside the
examples.
Paths to Files
Since your parsing code will likely always be using the CSV reader in
MarvelParser
, and that class is declared in src/main
, it
will always look for data files in src/main/resources/data/
instead
of src/test/resources/data/
. This might seem a little odd, since some
of your testing files (the data files) will be in 'main' instead of 'test.' We've
done this to greatly simplify the infrastructure required for getting these files
to work across projects (as you start using your parsing code for future
assignments.)
In an industry setting, you'd have a separate parsing system capable of finding
files only in 'src/test', but for 331 you should have all your data files in
src/main/resources/data
.
Note that your .test
and .expected
files should be in
'src/test/...' as usual, it's just any .csv
files you create that
should go in 'src/main/...'.
Hints
Performance
If your program takes an excessively long time to construct the graph for the full Marvel dataset, first make sure that it parses the data and builds the graph correctly for a very small dataset. If it does, here are a few questions to consider:
- What data structures are you using in your graph? What is their "big-O" runtime? Are there others that are better suited to the purpose?
-
Did you remember to correctly override
hashCode
if you overrodeequals
? -
What is the "big-O" runtime of your
checkRep
method? Does performance improve if you disable expensive checks?
Be aware that machine specs affect not only how fast your program runs but also how much memory it is allowed to use (more precisely, the maximum heap size ). For this reason, it is very important that you test your code for this assignment on attu (where we will be running the autograder on your assignment) to verify that the tests run at a reasonable speed on that configuration.
Miscellaneous
As always, remember to:
- Use descriptive variables names (especially in the BFS algorithm) and inline comments as appropriate.
- Include an abstraction function, representation invariant, and checkRep in all classes that represent an ADT. If a class does not represent an ADT, place a comment that explicitly says so where the AF and RI would normally go. (For example, classes that contain only static methods and are never constructed usually do not represent an ADT.) Please come to office hours if you feel unsure about what counts as an ADT and what doesn't.
How to Turn In Your Homework
Refer to the Assignment Submission Handout and closely follow the steps listed to submit your assignment. Do not forget to double check your submission as described in that handout - you are responsible for any issues if your code does not run when we try to grade it.
Use the tag name hw6-final for this assignment. To verify your
assignment on attu, use the gradle task: hw-marvel:runMarvel
.
Your TA should be able to find the following in your repository:
-
hw-graph/src/main/java/graph/*.java
[Changes you made to your graph class, if any.] hw-marvel/src/main/java/marvel/MarvelPaths.java
hw-marvel/src/main/java/marvel/MarvelParser.java
-
hw-marvel/src/main/java/marvel/*.java
[Other classes you create, if any (there may be none!).] hw-marvel/src/main/java/marvel/changes.txt
-
hw-marvel/src/main/resources/data/*.csv
[Any csv files you create for testing or running.] hw-marvel/src/test/java/marvel/scriptTestRunner/MarvelTestDriver.java
hw-marvel/src/test/resources/testScripts/*.test
hw-marvel/src/test/resources/testScripts/*.expected
-
hw-marvel/src/test/java/marvel/junitTests/*Test.java
[Other JUnit test classes you create, if any.]