DNA Indexing
Comparing tree search structures against array search algorithms.
Table of contents
Deoxyribonucleic acid (DNA) is a molecule responsible for representing genetic information for living organisms. DNA consists of smaller building blocks called nitrogenous bases. There are four different nitrogenous bases present in DNA: adenine (A), cytosine (C), thymine (T), and guanine (G). The sequence of these bases within a DNA molecule determines the biological instructions for a given organism. For example, AGGATC may encode for curly hair, while GGACTA may encode for straight hair. In the modern day, DNA has seen further application outside of biological instruction, such as criminal justice (confirming/ruling-out suspects) and data storage (encoding images in DNA). Since DNA can store an immense amount of information—millions or billions of bases long—searching through DNA data is a complicated task. Data structures and algorithms can organize efficient access to DNA data.
Autocomplete is a useful program that helps a user select valid search results by showing possible inputs as they type. For example, in a map app, autocomplete might allow the user to enter a prefix such as Sea and automatically suggest the city, Seattle. It turns out that autocomplete can be a useful abstraction for implementing DNA search too. Instead of indexing a list of all of the city names, a DNA data structure indexes a list of all suffixes of a given DNA sequence. Autocompleting from the list of suffixes will return every DNA substring match.
In this project, we will compare 4 implementations for autocomplete. To get started, join your project team on Canvas. One team member should Fork
Learning objectives
- Design and implement tree-based and array-based search data structures.
- Analyze and compare runtimes using asymptotic and experimental analysis.
- Critique and connect real-world issues around DNA and social genomics.
Design
Interface
Interfaces are a useful way to indicate the minimum functionality provided by implementations of the interface. Any client of an Autocomplete
implementation will be able to use it knowing only the methods declared in the interface.
void addAll(Collection<? extends CharSequence> terms)
- Adds all of the terms to the autocompletion dataset. Each term is a
CharSequence
that represents a potential autocompletion option. The order of the terms in the collection is arbitrary. Behavior is undefined if duplicate terms are added to the dataset. List<CharSequence> allMatches(CharSequence prefix)
- Returns a list of all terms that begin with the same characters as the given prefix. Given the terms [alpha, delta, do, cats, dodgy, pilot, dog],
allMatches("do")
should return [do, dodgy, dog] in any order.
Collection
is the parent interface to all of the lists, sets, and maps in Java. Iterate over a collection using a for-each loop.
for (CharSequence term : terms) {
// process the term
}
Instead of representing terms with the String
data type, Autocomplete
uses the CharSequence
interface. CharSequence
is a generalization of String
representing an ordered (comparable) sequence of characters and cannot be modified once created. By relying on the more general CharSequence
interface, clients can design and use more specialized and efficient computer representations for DNA.
The CharSequence.compare
static method takes two CharSequence
instances and returns an int
representing the relationship between cs1
and cs2
.
- If
cs1 < cs2
, thenCharSequence.compare(cs1, cs2) < 0
. - If
cs1 == cs2
, thenCharSequence.compare(cs1, cs2) == 0
, i.e. contents are exactly the same. - If
cs1 > cs2
, thenCharSequence.compare(cs1, cs2) > 0
.
Reference implementation
Unlike your prior programming courses, the focus of this course is not only to build programs that behave according to specifications. In fact, a complete and correct reference solution is given. Rather, the goal of this course is to analyze the trade-offs between different approaches.
TreeSetAutocomplete
- Terms are added to a self-balancing red-black search tree. Since the search tree is sorted, matches for a given prefix will be next to each other. Using binary search, the algorithm finds the first matching term and then performs an in-order tree traversal to collect all matching terms in the vicinity.
TreeSetAutocomplete
relies on TreeSet.ceiling
to find the first matching term and TreeSet.tailSet
to perform an in-order tree traversal from the first matching term onwards. To check that a term
begins with the given prefix
, TreeSetAutocomplete
uses the following code snippet.
if (prefix.length() <= term.length()) {
CharSequence part = term.subSequence(0, prefix.length());
if (CharSequence.compare(prefix, part) == 0) {
// term begins with the given prefix
}
}
Implementation task
Design and implement 3 other implementations of the Autocomplete
interface.
LinearSearchAutocomplete
- Terms are added to the list in an arbitrary order. The algorithm scans across the entire dataset and checks every term to see if it could be a valid option and returns a list of all matching terms.
BinarySearchAutocomplete
- When terms are added to the autocompletion dataset, the entire dataset is sorted using
Collections.sort
/Arrays.sort
. Since the dataset is always sorted, matches for a given prefix will be next to each other. Using binary search, the algorithm finds a match within that region and collects all matching terms in the vicinity. Modify the binary search algorithm presented in the lesson, or useCollections.binarySearch
/Arrays.binarySearch
.- The Java
sort
andbinarySearch
methods accept aComparator
parameter: pass it the argumentCharSequence::compare
. The JavabinarySearch
methods return “the index of the search key, if it is contained in the list; otherwise,(-(insertion point) - 1)
.” To obtain the index when the prefix itself is not in the dataset, add 1 to the index and negate the result to get the expected index. - The Java
TernarySearchTreeAutocomplete
- Terms are added to a ternary search tree. Identify relevant portions of the
TST
class presented in the lesson and modify the code to implement theAutocomplete
interface. Note thatTST
associates keys with values, which is not necessary forAutocomplete
. - Video
- Explain one part of the project implementation that you’re particularly proud of developing.
Run, check, debug
Open Terminal SimpleExample
class. To stop program execution at any time, use the keyboard shortcut Ctrl+C. Once the SimpleExample
class has finished execution, the rm
command removes the compiled class files to clean-up your workspace.
javac SimpleExample.java && java SimpleExample; rm *.class
By default, the SimpleExample
class runs the TreeSetAutocomplete
reference implementation with the simple example provided above. Modify the class to test your own implementation for debugging purposes and check that it lines up with the matches returned by TreeSetAutocomplete
.
For debugging, use print statements to output the state of important variables. Study this 4-minute video on debugging with print statements. Note how they spend almost all of the time diagnosing the problem and developing a thorough explanation for why the program is not working as expected. The actual bugfix only takes a few seconds of typing. This is normal practice for all programmers, though programmers who have seen similar bugs before will be able to identify and resolve problems more quickly.
To check an implementation against a larger dataset, modify, compile, and run Cities
or DNA
. It might be necessary to increase the amount of memory allocated to Java: the following command runs DNA
while requesting up to 200MB of memory for data structures with the -Xmx
flag and up to 4MB of memory for recursive calls with the -Xss
flag.
javac DNA.java && java -Xmx200m -Xss4m DNA; rm *.class
To compare several implementations at once, run the provided CitiesMultiTest
, which takes a query and validates the implementations against TreeSetAutocomplete
. If some of your implementations are not ready for testing, comment them out of the implementations
map.
javac CitiesMultiTest.java && java CitiesMultiTest; rm *.class
Analyze
Asymptotic analysis
- Video
- Give a big-theta bound for the worst-case runtime of the
addAll
andallMatches
methods of each implementation, includingTreeSetAutocomplete
, with respect to N, the size of the autocompletion dataset. Assume all strings, including theprefix
forallMatches
, have a constant length. ForaddAll
, assume a constant number of terms are added to the dataset. Explain the runtime of each implementation in 1 or 2 sentences while referencing the code.
When identifying the worst-case analysis for allMatches
, consider the relationship between the autocompletion dataset and the prefix. What happens if all of the terms in the dataset begin with A and the prefix is A? What happens if all of the terms in the dataset begin with B and the prefix is A?
Collections.sort
/ Arrays.sort
uses Timsort, a sorting algorithm with a runtime in O(Nlog N) where N is the size of the dataset.
Experimental analysis
Run the provided CitiesInputSizeExperiments
to compare the real-world runtime of each implementation. For each implementation, CitiesInputSizeExperiments
constructs an empty instance and records the number of seconds that it takes to add N terms to the dataset and call allMatches
with the TEST_PREFIX
(“Sea” by default).
javac CitiesInputSizeExperiments.java && java CitiesInputSizeExperiments; rm *.class
The benchmarking script will create a folder named after the TEST_PREFIX
and output a CSV file for each implementation.
- The first column denotes N, the size of the autocompletion dataset.
- The second column denotes the average runtime for
addAll
in seconds. - The third column denotes the average runtime for
allMatches
in seconds.
Open each CSV file and copy-paste the text into plotting software such as Desmos. Plot the runtimes of all 4 implementations on addAll
and allMatches
.
- Video
- Compare the runtimes across different implementations. Are certain algorithms faster than others? Describe how the assumptions in
CitiesInputSizeExperiments
might explain any disagreements between the asymptotic analysis and experimental analysis. ForallMatches
, describe how the default prefix affects the experimental analysis.
Critique
Connect DNA search algorithms to discussions and opinions occurring in real-world contexts. Search algorithms are one component of DNA databases, so consider including professional discussions on DNA databases as well. Here are a few articles and essays to begin your research.
- DNA Databases and Human Rights
- What is the role of DNA in the criminal justice system? How do DNA searches produce false matches? How do DNA databases encode social bias?
- Data storage and DNA banking for biomedical research: informed consent, confidentiality, quality issues, ownership, return of benefits. A professional perspective
- What issues arise when using DNA banks in biomedical research? What are some of the ethical questions around using genetic data? How is consent involved in such processes?
- Social genomics can combat inequality or be used to justify it
- How is social genomics research used to justify political agendas? How are scientific agendas, in turn, shaped by the scientists who determine which research questions to study?
- Video
- Synthesize at least one of the above articles together with a couple additional sources. Examine the positionality of the various articles by considering some of the following questions (and any of your own questions). How does the author’s lived experiences lead them to their point of view? How do the distribution of benefits and harms affect different people and society?
Submission
The final deliverable for your project is a video no longer than about 10 minutes that addresses all of the tasks labeled Video
All project team members must be present in the video, which you can do by recording it over a Zoom meeting. Submit your video as an mp4 file to the Canvas assignment and make sure your implementations are up-to-date in your team’s Ed Workspace.