Exploring possibilities and limits of data structures and algorithms.
- Design and analysis
Throughout the course, we’ve learned how to design, analyze, and critique data structures and algorithms (and applications thereof). The research project is an opportunity to extend and apply these ideas to bigger problems and solutions.
There are two phases for the research project: Design and analysis and Critique. In both phases, your project team will be producing written documents that demonstrate your research and learning. Review the Student Academic Responsibility expectations document to avoid unintentional plagiarism. You’re welcome and encouraged to publish your work online after the course is over.
Done well and with the right mindset, the topic you learn about in your project might be the topic you remember the most from the course. The project is also a product that you can use to show potential employers, graduate schools, etc. to demonstrate your skill at learning technical subjects independently and communicating complex ideas in a way that people can understand.1
Since the research project represents an opportunity to extend knowledge beyond the expectations of the course, assistance from course staff will emphasize guidance rather than direct support. Focus on satisfactory completion of everything else before beginning the research project. Each phase of the research project will be assessed on an ESNU grading scale.
- Exemplary (E)
- Satisfies all requirements, offers exceptional insight into the algorithms involved as they relate to course concepts, presents ideas in such a way that another student could easily understand it, and communicates the value or importance of the topic to a broader computer science audience.
- Satisfactory (S)
- Satisfies all requirements.
- Not yet (N)
- Satisfies some requirements.
- Unassessable (U)
- Minimal or no submission.
The requirements are bolded below.
Select one of the following design and analysis themes. Create a blog post: an informal write-up (such as How gzip uses Huffman coding) that explains your selected theme. You may organize the blog post however you prefer and write in your own voice, though it should address the following points in roughly this logical order.
- Motivate the problem that you’re solving. Why are you passionate about this problem? How does the problem and its solutions expand on concepts introduced in this course? Consider how our study of data structures and algorithms has been motivated by different kinds of analysis. Balanced search trees helped us improve asymptotic runtime bounds over (unbalanced) binary search trees. Binary heaps can share the same asymptotic runtime bounds as balanced search trees, but are often faster in experimental runtime. And we introduced affordance analysis as a tool for solving problems beyond sets, maps, and priority queues. How does your problem fit into the rest of the course?
- Explain the design and analysis of each data structure and algorithm. Each design and analysis theme introduces 3 algorithms: discuss all 3 algorithms. Your should include synthesize all of your learning materials by including your own, original explanations and diagrams. Diagrams can be either hand-drawn or generated in a graphics program such as Google Drawings. Your explanation will be evaluated based on how thoroughly it covers the key ideas behind each data structure or algorithm.
- Address the learning target. Each design and analysis theme has a learning target that relates to course concepts. Successful completion of the learning target is required to demonstrate mastery of the big data structures and algorithms design idea.
Submit your blog post as a document to the Canvas assignment.
Space partitioning algorithms represent a category of specialized data structures often used to solve two key problems in computational geometry: nearest neighbor search and range search. These operations provide efficient algorithms for swarm intelligence, N-body simulation, and non-parametric classification.
Space partitioning often assumes multi-dimensional data: our input data is not just a set of values, but a set of datapoints in a multi-dimensional space. For example, a set of datapoints in 3-dimensional space could be represented by datapoints with x, y, and z values. For an introduction to space partitioning algorithms, watch the UC Berkeley lecture on Multidimensional Data (slides) and study the Princeton slides on Geometric Applications of BSTs.
- Learning target
- Analyze two approaches for balanced k-d tree construction. First, show that using merge sort as the median-finding algorithm results in an overall construction runtime in Θ(Nlog2N) where N is the size of the tree. Then, describe a more efficient approach that can construct a balanced k-d tree in Θ(Nlog N) time without using a linear-time selection algorithm.
Data structure isometry is the formal name for the 1-1 correspondence that we observed between 2-3 trees and left-leaning red-black trees. Where else have we seen isometry in algorithms?
Quicksort is an example of a randomized algorithm that can make random choices during the algorithm (picking a random pivot element) in order to improve running time. If randomness can be used to improve runtime for algorithms, can it also be used to improve runtime for data structures? Instead of implementing sets and maps with balanced search trees, consider implementing with one of the following randomized data structures.
- Learning target
- Draw a 5-way correspondence between skip lists, 2-3-4 trees, zip trees, red-black trees, and triple-pivot quicksort. First, recreate/redraw figure 3 from the Deterministic Skip Lists paper. Then, extend the figure to show the corresponding zip tree and red-black tree. Lastly, extend the figure once again to show the corresponding triple-pivot quicksort, which doesn’t actually exist but generalizes the idea of dual-pivot quicksort.
Some probability theory may help with understanding. For a formal introduction to randomization algorithms, study the UIUC lecture notes on Discrete Probability (lecture video) and Nuts and Bolts (lecture video).
Efficient use of contiguous memory structures such as arrays are critical to high-performance applications. Across Google datacenters, more than 4% of fleetwide memory is owned by a hash table—and this is only counting code written in C++. As demand for computation continues to grow, it will be increasingly important to design efficient algorithms that reduce system memory usage and improve running time efficiency. What kinds of techniques can we employ to design more memory-friendly and cache-efficient data structures?
- Swiss tables (documentation)
- Bloom filters (Stanford lecture notes)
- Suffix arrays (UCSD lecture video)
Each of these data structures solves a different problem. Swiss tables implement sets and maps. Bloom filters work like sets, but may return false positives: a
contains query returns either “possibly in the set” or “definitely not in the set”. Suffix arrays efficiently solve our DNA Indexing problem: given a long DNA sequence, find every occurrence of a (short) target sequence in the long DNA sequence.
- Learning target
- Bloom filters use multiple hash functions. Suppose we use multiple hash functions to resolve collisions in a separate chaining hash table by making them recursive: each bucket is itself a separate chaining hash table! The resulting data structure is a tree composed of nested hash tables, where all of the hash tables on each level of the tree uses a different hash function. If every hash function uniformly distributes the N elements but every hash table has a fixed number of buckets M, give an asymptotic runtime analysis for this recursive hash table data structure in terms of N and M.
Some familiarity with the concept of bits and bytes may help with understanding.
For students already engaged in a computational research project outside of the course, the research project can be related to your existing research if it relates to the themes of the course. Your research project proposal must be part of a real problem that you’re already working on that cannot be solved using standard approaches in the undergraduate computer science curricula such as CSE 417: Algorithms and Complexity. For example, a bioinformatician may be interested in learning more about efficient data structures and algorithms for building genome databases.
Submit a proposal as an Ed Discussion board question addressing the following questions including relevant references.
- What is your research context? Describe the abstract data type or computational problem.
- Describe a simple baseline algorithm for solving the problem. Outline 3 other approaches that you suspect would be more efficient.
- Propose a learning target that connects one or more approach to course concepts.
Suppose you’re working at a company that is planning to build a cutting-edge DNA Indexing, Content Moderation, or Image Processing system. Your team is asked to expand a previous project critique into a memo that explores the ethical considerations around the system, examines arguments both for and against the system, and presents a final recommendation to company leadership either for or against the system. Apply your understanding of affordance analysis to evaluate the potential benefits and dangers of this technology, particularly on civil and human rights. Your memo will be evaluated based on how effectively it addresses these requirements.
Start by revisiting your project critique. Rob Reich and Joshua Cohen have developed Some Guidelines on Writing a Good Political Philosophy Paper. While the memo is not intended to be a formal philosophy paper, an effective memo will present a cogent and cohesive argument for your argument by synthesizing examples and references. In particular, it should focus on your own analysis of references beyond just direct quotation or paraphrasing by integrating references as supporting evidence for your argument.
To scope expectations for this task, we think it should be possible to achieve all of these goals in fewer than 1000 words, though this is not a strict requirement. There are many resources for further research on the philosophy of technology. The Stanford Encyclopedia of Philosophy includes comprehensive reviews of topics such as the Ethics of AI and Informed Consent. The Choices and Consequences in Computing course at Cornell lists a selection of relevant readings. The instructors for the Stanford course on ethics, public policy, and technology also developed a few relevant case studies.
Submit your memo as a document to the Canvas assignment.