CSE 552 Spring 2019
Project Ideas

Spiciness levels:

★★An easy project. You must be very busy with your other classes.
★★★A solid project. You’ll be busy.
★★★★A very serious project. You’re shooting for a conference paper.

Every project should have an evaluation component.

Feel free to propose your own idea; these ideas should give you a reference for the kind of work we expect.

Benchmark EPaxos ★★

A “benchmark” project is practice at writing a good Evaluation section of a paper. Not the design and implementation, although by the time you’ve evaluated, you may be inspired.

First, drive EPaxos from a single client offering a back-to-back causally-dependent workload with varying "think time" delays between requests. Plot a latency-throughput curve.

Then explore the same curve as the number of clients goes up to saturation.

Inject failures. How does performance change in the presence of failure? Are all failure costs momentary, or are there interesting “brown failures” that are particularly bad?

EPaxos can isolate commutative operations. What happens to performance when the client workloads have non-commutative overlaps of {0%, 5%, 50%, 100%}? What happens with different kinds of overlap? For example,

  • Read-Modify-Write vs. Read-Modify-Write. Here, the operations are reading and writing the same cell of the stored state.
  • Blind-Write vs. Blind-Write. Two operations emit writes that don’t depend on reading any state, hence they can occur in any order. They shouldn’t conflict.
  • Blind increments. Each operation increments a counter. The increment operation is not conditional on the prior state (hence "blind"), but if 6 increment operations complete, a later read should show the state value to be 6.
  • Each operation decrements a value (allocates a resource). If the prior value was zero, the operation fails and returns a failure code.
Enhance EPaxos ★★★

EPaxos’s observation that two blind writes are conflict-free is a special case of the general idea of Conflict-Freedom. Consider the family of CRDTs (Conflict-free Replicated Data Types). Are there ways to map CRDTs into EPaxos operations that would expose commutativity in ways that applications might find useful (as evidenced by the fact that applications can use CRDTs)? Which ones get a performance boost “for free” in EPaxos? For those that don’t, can you change EPaxos to exploit their type of commutativity?

The project should evaluate the performance improvement of introducing the CRDTs into EPaxos versus the baseline implementation using EPaxos’s existing primitives.

Benchmark Corfu/Tango ★★

Drive Tango from a single client offering a back-to-back causally-dependent workload with varying "think time" delays between requests. Plot a latency-throughput curve. Explore the same curve as the number of clients goes up to saturation.

Break costs down into sequencer vs. storage. What happens when you use (or simulate) different types of storage? What costs are associated with failure and recovery?

Evaluate costs of TCP vs. UDP + protocol-level reliability ★★★

Some systems, like Mencius, use TCP instead of UDP and then design their protocol to take advantage of the ordering guarantees it provides. Others, like EPaxos and IronFleet, use UDP and take care of reliability via protocol design. What are the trade-offs here? Evaluate this question by updating various systems that use TCP to use UDP and vice-versa.

Make TrInc-BFT more efficient ★★★★

The TrInc paper describes one way to implement a Byzantine-fault-tolerant state machine that needs only $2f+1$ nodes, each with a trinket: by using TrInc to implement A2M, and A2M-PBFT-EA to implement PBFT. This is likely far more inefficient than necessary. For instance, the presence of a trinket on each machine means that one probably doesn’t need all the phases (pre-prepare, prepare, and commit) that PBFT uses. Implement BFT directly on TrInc, using a streamlined protocol, prove it safe and live, and evaluate its improved performance resulting from its increased efficiency.

Reconfigure blockchain peer-to-peer network using the blockchain ★★★★

Bitcoin uses a peer-to-peer network to disseminate transaction proposals and new blocks. This network relies on fixed seeds in the Bitcoin client code and other centralized mechanisms to bootstrap the process of joining. Devise a way to eliminate, or at least reduce the importance of, this centralized mechanism by having the blockchain itself contain the seeds, or perhaps servers, used to disseminate information. This will be trickier than in Paxos, where there’s an assumption of a central authority that can be used to approve proper reconfigurations; you need to have an argument for why miners and other potential malefactors can’t subvert the reconfiguration process to undermine the security properties required for dissemination. Implement your approach so you can demonstrate, by performance evaluation, some quantitative benefit of it. For instance, if you replace the peer-to-peer network with a set of well-provisioned servers, you may save a lot of network bandwidth and reduce the latency of information dissemination, and thereby perhaps reduce the probability of forks.

Use programmable switches to implement network spies ★★★★

In the paper, “Detecting failures in distributed systems with the FALCON spy network”, the authors describe a system in which spies exist at multiple levels, but stop at the network switch. That’s because it has no way of detecting the failure of switches themselves. Performing network failure localization with spies is also tricky because there may be multiple routes from a failed node and they must all be shut down to truly kill a switch (or even a multi-homed machine). Leverage software-defined networking (SDN) to program all the switches in a network to implement a network of spies. Evaluate the overhead of your implementation, as well as the resulting improvement in failure detection time.

Verify a Chain Replication implementation ★★★

Implement Chain Replication in a verified language that provides a runnable implementation, either via explicit code (Dafny) or via extraction (Ivy, Coq). Verify that the system provides serializable semantics.

Evaluate the resulting implementation. Compare it with a non-verified implementation (ideally one you didn’t write). Is the performance comparable? Where it isn’t, why isn’t it? Do those reasons relate to limitations of the verification framework?

This isn’t a verification class, but verification is a good way to be sure you really understand how invariants hold a safety property together. It’s also useful to understand why refinement is a good way to rigorously state behavioral specification.