DNA Indexing

Comparing tree search structures against array search algorithms.

Learning objectives
Design
Analyze
1. Asymptotic analysis
2. Experimental analysis
Critique
Submission

Deoxyribonucleic acid (DNA) is a molecule responsible for representing genetic information for living organisms. DNA consists of smaller building blocks called nitrogenous bases. There are four different nitrogenous bases present in DNA: adenine (A), cytosine (C), thymine (T), and guanine (G). The sequence of these bases within a DNA molecule determines the biological instructions for a given organism. For example, AGGATC may encode for curly hair, while GGACTA may encode for straight hair. In the modern day, DNA has seen further application outside of biological instruction, such as criminal justice (confirming/ruling-out suspects) and data storage (encoding images in DNA). Since DNA can store an immense amount of information—millions or billions of bases long—searching through DNA data is a complicated task. Data structures and algorithms can organize efficient access to DNA data.

Autocomplete is a useful program that helps a user select valid search results by showing possible inputs as they type. For example, in a map app, autocomplete might allow the user to enter a prefix such as Sea and automatically suggest the city, Seattle. It turns out that autocomplete can be a useful abstraction for implementing DNA search too. Instead of indexing a list of all of the city names, a DNA data structure indexes a list of all suffixes of a given DNA sequence. Autocompleting from the list of suffixes will return every DNA substring match.

In this project, we will compare 4 implementations for autocomplete. To get started, join your project team on Canvas. One team member should Fork the starter workspace to create a new workspace and Share it with other team members’ UW emails.

Learning objectives

Design and implement tree-based and array-based search data structures.
Analyze and compare runtimes using asymptotic and experimental analysis.
Critique and connect real-world issues around DNA and social genomics.

Design

Interface

Interfaces are a useful way to indicate the minimum functionality provided by implementations of the interface. Any client of an Autocomplete implementation will be able to use it knowing only the methods declared in the interface.

void addAll(Collection<? extends CharSequence> terms): Adds all of the terms to the autocompletion dataset. Each term is a CharSequence that represents a potential autocompletion option. The order of the terms in the collection is arbitrary. Behavior is undefined if duplicate terms are added to the dataset.
List<CharSequence> allMatches(CharSequence prefix): Returns a list of all terms that begin with the same characters as the given prefix. Given the terms [alpha, delta, do, cats, dodgy, pilot, dog], allMatches("do") should return [do, dodgy, dog] in any order.

Collection is the parent interface to all of the lists, sets, and maps in Java. Iterate over a collection using a for-each loop.

for (CharSequence term : terms) {
    // process the term
}

Instead of representing terms with the String data type, Autocomplete uses the CharSequence interface. CharSequence is a generalization of String representing an ordered (comparable) sequence of characters and cannot be modified once created. By relying on the more general CharSequence interface, clients can design and use more specialized and efficient computer representations for DNA.

The CharSequence.compare static method takes two CharSequence instances and returns an int representing the relationship between cs1 and cs2.

If cs1 < cs2, then CharSequence.compare(cs1, cs2) < 0.
If cs1 == cs2, then CharSequence.compare(cs1, cs2) == 0, i.e. contents are exactly the same.
If cs1 > cs2, then CharSequence.compare(cs1, cs2) > 0.

Reference implementation

Unlike your prior programming courses, the focus of this course is not only to build programs that behave according to specifications. In fact, a complete and correct reference solution is given. Rather, the goal of this course is to analyze the trade-offs between different approaches.

TreeSetAutocomplete: Terms are added to a self-balancing red-black search tree. Since the search tree is sorted, matches for a given prefix will be next to each other. Using binary search, the algorithm finds the first matching term and then performs an in-order tree traversal to collect all matching terms in the vicinity.

TreeSetAutocomplete relies on TreeSet.ceiling to find the first matching term and TreeSet.tailSet to perform an in-order tree traversal from the first matching term onwards. To check that a term begins with the given prefix, TreeSetAutocomplete uses the following code snippet.

if (prefix.length() <= term.length()) {
    CharSequence part = term.subSequence(0, prefix.length());
    if (CharSequence.compare(prefix, part) == 0) {
        // term begins with the given prefix
    }
}

Implementation task

Design and implement 3 other implementations of the Autocomplete interface.

LinearSearchAutocomplete: Terms are added to the list in an arbitrary order. The algorithm scans across the entire dataset and checks every term to see if it could be a valid option and returns a list of all matching terms.
BinarySearchAutocomplete: When terms are added to the autocompletion dataset, the entire dataset is sorted using Collections.sort / Arrays.sort. Since the dataset is always sorted, matches for a given prefix will be next to each other. Using binary search, the algorithm finds a match within that region and collects all matching terms in the vicinity. Modify the binary search algorithm presented in the lesson, or use Collections.binarySearch / Arrays.binarySearch.; The Java sort and binarySearch methods accept a Comparator parameter: pass it the argument CharSequence::compare. The Java binarySearch methods return “the index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1).” To obtain the index when the prefix itself is not in the dataset, add 1 to the index and negate the result to get the expected index.
TernarySearchTreeAutocomplete: Terms are added to a ternary search tree. Identify relevant portions of the TST class presented in the lesson and modify the code to implement the Autocomplete interface. Note that TST associates keys with values, which is not necessary for Autocomplete.
Video: Explain one part of the project implementation that you’re particularly proud of developing.

Run, check, debug

Open Terminal and enter the following command to compile and run the SimpleExample class. To stop program execution at any time, use the keyboard shortcut Ctrl+C. Once the SimpleExample class has finished execution, the rm command removes the compiled class files to clean-up your workspace.

javac SimpleExample.java && java SimpleExample; rm *.class

By default, the SimpleExample class runs the TreeSetAutocomplete reference implementation with the simple example provided above. Modify the class to test your own implementation for debugging purposes and check that it lines up with the matches returned by TreeSetAutocomplete.

For debugging, use print statements to output the state of important variables. Study this 4-minute video on debugging with print statements. Note how they spend almost all of the time diagnosing the problem and developing a thorough explanation for why the program is not working as expected. The actual bugfix only takes a few seconds of typing. This is normal practice for all programmers, though programmers who have seen similar bugs before will be able to identify and resolve problems more quickly.

To check an implementation against a larger dataset, modify, compile, and run Cities or DNA. It might be necessary to increase the amount of memory allocated to Java: the following command runs DNA while requesting up to 200MB of memory for data structures with the -Xmx flag and up to 4MB of memory for recursive calls with the -Xss flag.

javac DNA.java && java -Xmx200m -Xss4m DNA; rm *.class

To compare several implementations at once, run the provided CitiesMultiTest, which takes a query and validates the implementations against TreeSetAutocomplete. If some of your implementations are not ready for testing, comment them out of the implementations map.

javac CitiesMultiTest.java && java CitiesMultiTest; rm *.class

Analyze

Asymptotic analysis

Video: Give a big-theta bound for the worst-case runtime of the addAll and allMatches methods of each implementation, including TreeSetAutocomplete, with respect to N, the size of the autocompletion dataset. Assume all strings, including the prefix for allMatches, have a constant length. For addAll, assume a constant number of terms are added to the dataset. Explain the runtime of each implementation in 1 or 2 sentences while referencing the code.

When identifying the worst-case analysis for allMatches, consider the relationship between the autocompletion dataset and the prefix. What happens if all of the terms in the dataset begin with A and the prefix is A? What happens if all of the terms in the dataset begin with B and the prefix is A?

Collections.sort / Arrays.sort uses Timsort, a sorting algorithm with a runtime in O(Nlog N) where N is the size of the dataset.

Experimental analysis

Run the provided CitiesInputSizeExperiments to compare the real-world runtime of each implementation. For each implementation, CitiesInputSizeExperiments constructs an empty instance and records the number of seconds that it takes to add N terms to the dataset and call allMatches with the TEST_PREFIX (“Sea” by default).

javac CitiesInputSizeExperiments.java && java CitiesInputSizeExperiments; rm *.class

The benchmarking script will create a folder named after the TEST_PREFIX and output a CSV file for each implementation.

The first column denotes N, the size of the autocompletion dataset.
The second column denotes the average runtime for addAll in seconds.
The third column denotes the average runtime for allMatches in seconds.

Open each CSV file and copy-paste the text into plotting software such as Desmos. Plot the runtimes of all 4 implementations on addAll and allMatches.

Video: Compare the runtimes across different implementations. Are certain algorithms faster than others? Describe how the assumptions in CitiesInputSizeExperiments might explain any disagreements between the asymptotic analysis and experimental analysis. For allMatches, describe how the default prefix affects the experimental analysis.

Critique

Connect DNA search algorithms to discussions and opinions occurring in real-world contexts. Search algorithms are one component of DNA databases, so consider including professional discussions on DNA databases as well. Here are a few articles and essays to begin your research.

DNA Databases and Human Rights: What is the role of DNA in the criminal justice system? How do DNA searches produce false matches? How do DNA databases encode social bias?
Data storage and DNA banking for biomedical research: informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective: What issues arise when using DNA banks in biomedical research? What are some of the ethical questions around using genetic data? How is consent involved in such processes?
Social genomics can combat inequality or be used to justify it: How is social genomics research used to justify political agendas? How are scientific agendas, in turn, shaped by the scientists who determine which research questions to study?
Video: Synthesize at least one of the above articles together with a couple additional sources. Examine the positionality of the various articles by considering some of the following questions (and any of your own questions). How does the author’s lived experiences lead them to their point of view? How do the distribution of benefits and harms affect different people and society?

Submission

The final deliverable for your project is a video no longer than about 10 minutes that addresses all of the tasks labeled Video

All project team members must be present in the video, which you can do by recording it over a Zoom meeting. Submit your video as an mp4 file to the Canvas assignment and make sure your implementations are up-to-date in your team’s Ed Workspace.