CSE 490E1 potential projects

These are projects you could choose as your proposed project for CSE 490E1. Any of them would be appropriate. You would still need to do some work to flesh out the project, based on the notes here.

You are also allowed to propose a different project. Feel free to meet with the instructor, before the assignment due date, to discuss your choice. Doing so is welcome but not required.

Dynamic analysis

Improved test generation via GRT techniques

GRT (Guided Random Testing) is a variant of the Randoop test generator that adds some features to Randoop. In a recent competition among testing tools, GRT won.

Programmers who want to use GRT are out of luck, because GRT's source code isn't available. Furthermore, some of the features that GRT claims as novel were already part of Randoop, which casts doubt on the experimental methodology. Some of GRT's ideas are certainly good ones, but we don't know which ones.

GRT's techniques are:

Constant mining (static)
Extract constants from the program under test, both for global use (seeding the main object pool) and for local use (as inputs for specific methods).
Impurity (static + dynamic)
Perform static purity analysis for all methods under test. At run-time, fuzz selected input from the object pool based on a Gaussian distribution and purity results.
Elephant brain (dynamic)
Manage method sequences (to create inputs) in the object pool with the exact types obtained at run-time.
Detective (static + dynamic)
Analyze the method input and return type dependency statically, and construct missing input data on demand at run-time.
Orienteering (dynamic)
Favor method sequences that require lower cost to execute, which accelerates testing for other components.
Bloodhound (dynamic)
Guide method selection and sequence generation by coverage information at run-time.
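As a concrete starting point, the global half of constant mining can be sketched with reflection. All names here are hypothetical; GRT itself analyzes bytecode and also mines method-local constants, which this sketch omits:

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of constant mining: harvest static final primitive and
// String constants from a class under test, to seed a test generator's
// object pool.
public class ConstantMiner {

    public static List<Object> mineConstants(Class<?> clazz) {
        List<Object> seeds = new ArrayList<>();
        for (Field f : clazz.getDeclaredFields()) {
            int mods = f.getModifiers();
            if (Modifier.isStatic(mods) && Modifier.isFinal(mods)) {
                try {
                    f.setAccessible(true);
                    Object value = f.get(null);
                    if (value instanceof Number || value instanceof String
                            || value instanceof Character || value instanceof Boolean) {
                        seeds.add(value);
                    }
                } catch (ReflectiveOperationException e) {
                    // Skip constants we cannot read.
                }
            }
        }
        return seeds;
    }

    // Hypothetical class under test with constants worth seeding.
    static class CommandLine {
        static final int MAX_LENGTH = 8191;
        static final String SEPARATOR = " ";
    }
}
```

A re-implementation inside Randoop would feed these seeds into its existing pool of candidate input values.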

The goal of this project is to re-implement some or all of the GRT techniques in the open-source Randoop implementation, and then to evaluate the benefits of each one.

Improved test generation via side effect analysis

Enhance a test generation tool such as Randoop or EvoSuite to utilize a side effect analysis that indicates which methods are pure. A pure method has no side effects and always returns the same result when called on the same arguments. Here are two distinct benefits, which are independent of the “Impurity” analysis of GRT:

  1. Benefit 1: avoiding useless calls in tests.

    Ordinarily, Randoop assumes that a method call may change any object that is passed to it. For example, Randoop considers x to be different after

      x = foo();
    
    than after
      x = foo();
      y = bar(x);
    
    and Randoop will try to test the x values produced by both sequences. By ignoring the x value produced by the second sequence, Randoop could spend half as long to produce equally good tests. Also, Randoop's tests would be easier to understand because they would be shorter and would not contain mysterious, useless calls to pure methods.
  2. Benefit 2: more test assertions.

    When Randoop creates regression tests, the tests contain assertions about the objects that the test creates. For instance, Randoop might output

      assertTrue(someObject.itsField == 22);

    In addition to reading fields of an object, Randoop could invoke observer methods. For instance, it could output

      assertTrue(someObject.observerMethod() == 22);

    This is only possible for pure methods — methods that have no side effects and return the same value every time they are called on the same arguments. If Randoop is told which methods are pure, it can use them as observers and can generate much richer and more sensitive assertions in its regression tests.

The goal of this project is to enable Randoop to utilize pure methods for both of the purposes outlined above.
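Benefit 2 can be sketched concretely: given a set of methods known to be pure (here just a set of names, standing in for a real side effect analysis), reflectively invoke each zero-argument observer and emit one regression assertion per observed value:

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of observer-based assertion generation. The pureMethods set is a
// stand-in for the output of a real purity analysis.
public class ObserverAssertions {

    public static List<String> generateAssertions(Object obj, Set<String> pureMethods) {
        List<String> assertions = new ArrayList<>();
        for (Method m : obj.getClass().getMethods()) {
            if (m.getParameterCount() == 0 && pureMethods.contains(m.getName())) {
                try {
                    Object value = m.invoke(obj);
                    assertions.add("assertEquals(" + literal(value)
                            + ", someObject." + m.getName() + "());");
                } catch (ReflectiveOperationException e) {
                    // Skip observers that throw.
                }
            }
        }
        return assertions;
    }

    private static String literal(Object value) {
        return (value instanceof String) ? "\"" + value + "\"" : String.valueOf(value);
    }
}
```

For example, with the object `"hi"` and the pure set `{"length"}`, this produces the single assertion string `assertEquals(2, someObject.length());`.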

There are some other enhancements that would make this project even more compelling, or could be projects of their own.

Test prioritization based on failure probability

It is not uncommon for a test suite to take hours or longer to run. Quicker notification of errors saves time, because the developer remembers what he or she was thinking, and the developer can fix errors rather than building on a faulty foundation.

Suppose a developer is running a test suite that will eventually fail — that is, one of its tests will fail. It is most useful to know this information as soon as possible — that is, it is most useful to run the failing test first. This gives the developer the key information quickly, without waiting for the entire test suite to complete.

Of course, the developer doesn't know which tests will fail — if that were known, there would be no need to run them. But a test prioritization strategy reorders a test suite to first run the tests that are most likely to fail.

Here is a very simple strategy: a test is more likely to fail in the future if it has failed in the past. There is evidence that this works extremely well. However, conditional probability has not been exploited. For example, suppose that tests 101-110 tend to fail together (perhaps they all exercise the same component), while test 201 fails independently of them.

After test 101 fails, running tests 102-110 yields little new information, so the best order is to run test 101, then test 201. (If test 101 fails, then a human must be involved, and humans are so slow that there will be plenty of time to run tests 102-110.)

The goal of this project is to build an extension to a testing framework (such as JUnit) that records previous test execution results and reorders tests according to their likelihood of failure. Then, measure how big an effect this has in practice.
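The basic (non-conditional) strategy can be sketched in a few lines. The names and the in-memory maps are hypothetical; a real JUnit extension would persist failure counts between runs:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of failure-probability prioritization: order tests by their
// historical failure rate (failures / runs), highest first.
public class TestPrioritizer {

    public static List<String> prioritize(List<String> tests,
                                          Map<String, Integer> failures,
                                          Map<String, Integer> runs) {
        List<String> ordered = new ArrayList<>(tests);
        ordered.sort(Comparator.comparingDouble(
                (String t) -> failureRate(t, failures, runs)).reversed());
        return ordered;
    }

    private static double failureRate(String test,
                                      Map<String, Integer> failures,
                                      Map<String, Integer> runs) {
        int r = runs.getOrDefault(test, 0);
        // Never-run tests get a moderate prior so they are not starved.
        return r == 0 ? 0.5 : failures.getOrDefault(test, 0) / (double) r;
    }
}
```

Exploiting conditional probability would replace the per-test rate with an estimate conditioned on which tests have already failed in the current run.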

Minimizing bug fixes

Ideally, every commit in a version control repository has a single purpose, such as fixing one bug, adding one feature, or performing one refactoring.

However, in practice programmers often make a commit that mixes together multiple distinct changes. There might be multiple new features, or a feature plus some refactorings that were needed in order to implement it plus some bugs that the programmer noticed along the way.

A commit that serves multiple purposes is harder for programmers and tools to interpret.

The goal of this project is to create a tool that minimizes a patch with respect to some particular goal. For example, minimizing a patch with respect to a bug fix means finding the smallest part of the patch that fixes the bug. The patch would not include documentation changes, variable renaming, and other refactorings.

Static analysis

Tree merge for version control

Merge conflicts are the bane of software development teams. Current version control systems are line-based: when two programmers change the same line of code, the version control system reports a conflict and gives up on the merge, leaving it to the programmer to manually edit the files to resolve the conflict.

In many cases, the conflict is not a real one. For example, suppose one person added a line break, and the other changed the name of a variable. Because these affect the same line of code, they would be considered a version control conflict and the programmer would have to resolve them. As another example, if a programmer surrounds a block of code by an if statement, then the entire block of code gets indented more, and this change conflicts with any change within the block of code. A related problem that is not a conflict but should be considered one is when one programmer changes every call to a given method, and another programmer adds a new call to the method that still uses the old, wrong calling convention.

The goal of this project is to produce a tool that automatically resolves merge conflicts, relieving the programmer of the burden of manually doing so.

In the simplest case, the tool could convert the textual edits into AST (abstract syntax tree) edits and see whether the edits are still in conflict. If not, both could be applied: either applied directly to the code, or applied to the AST and the AST then converted back to code.
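The AST-level conflict test in that simplest case can be sketched as follows. Representing each edit by the path of child indices from the root to the edited node is an assumption of this sketch; a real implementation would obtain the edits from a tree-differencing tool:

```java
import java.util.List;

// Sketch of conflict detection at the AST level: two edits conflict only if
// one edited node is an ancestor of (or equal to) the other. Edits in
// sibling subtrees are independent, so a line-break change and a rename
// that touch the same *line* need not conflict at the tree level.
public class AstMerge {

    /** An edit is identified by the child-index path from the root to its node. */
    public static boolean conflicts(List<Integer> pathA, List<Integer> pathB) {
        return isPrefix(pathA, pathB) || isPrefix(pathB, pathA);
    }

    private static boolean isPrefix(List<Integer> shorter, List<Integer> longer) {
        if (shorter.size() > longer.size()) return false;
        return shorter.equals(longer.subList(0, shorter.size()));
    }
}
```

Under this model, edits at paths [0, 1] and [0, 2] (two different children of the same statement) merge cleanly, while an edit at [0] conflicts with any edit inside that subtree.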

A more ambitious goal would be to generalize the edits, such as “this edit changes every occurrence of foo(a, b) into foo(a, b.next)”, and then to apply that edit to the entire codebase, including changes that were made in parallel by another developer. There exist tools that will do this kind of generalization, so you would not need to build that component from scratch.

To evaluate the tool, it could be applied to historical data from GitHub repositories to see how often it would have resolved a conflict without bothering the programmer.

Type systems

Whole-program type inference

A type system is useful because it prevents certain errors. The downside of a type system is the effort required to write the types. Type inference is the process of determining a program's types automatically, relieving the programmer of that effort.

Type-checking is a modular or local analysis. For example, given a procedure in which types have been written, a type-checker can verify the procedure's types without examining the implementation of any other procedure.

By contrast, type inference is a non-local, whole-program analysis. For example, to determine what type should be written for a procedure's formal parameter, it is necessary to examine the type of the argument at every call to that procedure. But, to determine the type of some argument A, it may be necessary to know the types of the formal parameters to the procedure that contains A, and so forth. It is possible to resolve this seemingly-infinite regress, but only by examining the entire program in the worst case.

The differences between type checking and type inference mean that they are usually implemented in very different ways. Type inference is usually done by first collecting all of the constraints for the entire program, then passing them to a specialized solver. Writing a type inference tool is harder than writing a type checker, and it's also very annoying to encode all the type rules twice in different ways: once for the type checker and once for the type inference tool.

As a result, many type systems have a type checker but no type inference tool. This makes programmers reluctant to use these type systems, denying them the benefits of type-checking.

The goal of this project is to automatically create type inference tools from type-checking tools, so that it is not necessary for the type system designer to implement the type system twice in different ways.

A key insight is that the type-checker already encodes all knowledge about what is a legal, well-typed program. How can we exploit that for the purpose of type inference as well as type-checking? The idea is to iteratively run the type-checker, multiple times, observing what types are passed around the program and what errors occur. Each iteration refines the inferred types, and the process repeats until the types stop changing.

This approach has some disadvantages: it is theoretically slower, and theoretically less accurate, than a purpose-built type inference tool for each type system. However, it has the major advantage that it requires no extra work to implement a type inference tool. Furthermore, maybe it works well enough in practice.
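Despite those caveats, the core loop is small. The following toy sketch (all names hypothetical; a two-element nullness lattice stands in for an arbitrary type system) repeats a "checker run" over a list of assignments until the inferred qualifiers reach a fixed point:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of iterative whole-program inference: widen each variable's
// qualifier along assignments until nothing changes.
public class IterativeInference {

    enum Qual { NONNULL, NULLABLE }  // NULLABLE is the top of the lattice.

    /** assignments: pairs (target, source) meaning target = source. */
    public static Map<String, Qual> infer(Map<String, Qual> initial,
                                          List<String[]> assignments) {
        Map<String, Qual> types = new HashMap<>(initial);
        boolean changed = true;
        while (changed) {  // one "checker run" per iteration
            changed = false;
            for (String[] a : assignments) {
                Qual target = types.getOrDefault(a[0], Qual.NONNULL);
                Qual source = types.getOrDefault(a[1], Qual.NONNULL);
                Qual joined = (target == Qual.NULLABLE || source == Qual.NULLABLE)
                        ? Qual.NULLABLE : Qual.NONNULL;
                if (joined != target) {
                    types.put(a[0], joined);
                    changed = true;
                }
            }
        }
        return types;
    }
}
```

The real prototype drives an actual type-checker rather than a hand-written transfer function, but the termination argument is the same: qualifiers only move up a finite lattice.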

A prototype implementation of this idea already exists, but it needs to be evaluated in order to discover its flaws, improve its design, and discover how accurate it is in practice.

Erroneous use of Java 8's Optional class

Java 8 introduced the Optional class, a container that is either empty or contains a non-null value. The Web and blogosphere are full of incorrect claims that the Optional class solves the problem of null pointer exceptions. In fact, changing your code to use the Optional class transforms a NullPointerException into a NoSuchElementException, which still crashes your program, and it creates new problems that were not a danger before.

You can learn more about the dangers of the Optional class here: http://homes.cs.washington.edu/~mernst/advice/nothing-is-better-than-optional.html

The goal of this project is to create a tool that warns programmers who have misused Optional. It would enforce as many style rules as possible for the Optional type, from among those listed here: https://stuartmarks.wordpress.com/2016/09/27/vjug24-session-on-optional/

The major problem is using Optional.get() when the Optional might be empty. This is exactly like the problem of performing a field access on a value that might be null. Therefore, it is likely that an existing solution can be directly adopted: the logic of the Nullness Checker of the Checker Framework, or of some other nullness-checking tool, can be adapted and simplified to the context of Optional.
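The core misuse and its fix look like this (the class and map are hypothetical illustrations):

```java
import java.util.Map;
import java.util.Optional;

// The pattern the tool would flag: Optional.get() without establishing
// presence merely trades NullPointerException for NoSuchElementException.
public class OptionalStyle {

    static final Map<String, Integer> PORTS = Map.of("http", 80);

    public static int portUnsafe(String scheme) {
        Optional<Integer> port = Optional.ofNullable(PORTS.get(scheme));
        return port.get();  // flagged: may throw NoSuchElementException
    }

    public static int portSafe(String scheme) {
        Optional<Integer> port = Optional.ofNullable(PORTS.get(scheme));
        return port.orElse(-1);  // empty case handled explicitly
    }
}
```

Note that the "safe" version is exactly the null-handling discipline the Optional class was supposed to make unnecessary, which is part of the argument that Optional is oversold.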

I suspect that the existence of such a tool would help to expose many latent errors. It might convince programmers that the Optional type isn't that safe after all, but more importantly it would convince them to use it in appropriate ways.

Case study of preventing index-out-of-bounds errors

An index-out-of-bounds error occurs when a programmer provides an illegal index to an array or list, as in a[i] or a.get(i) where i is less than 0 or greater than or equal to the length of a. In languages like C, this is disastrous: buffer overflows lead to about 1/6 of all security vulnerabilities. In languages like Java, the result is “merely” that the program crashes. In both languages, it is desirable to prevent programmers from making this error and to prevent users from suffering the bad consequences.

A group of students has recently created a static analysis tool that prevents array index-out-of-bounds exceptions in Java programs. It remains unknown how effective this tool is. Does it scale up to big, interesting programs? Are there common, important code patterns that it fails to handle? Does it produce too many false positive warnings? Does it place too heavy a burden on the user, either in terms of annotations or in terms of complexity of the error messages? Worst of all, are there unknown unsoundnesses in the tool?

The goal of this project is to do a substantial case study with the prototype index analysis tool, to identify its merits and limitations, and to improve it enough to make it usable by real-world programmers.

Make programs deterministic

Programs are easier to use and debug if their output is deterministic. For example, it is easier to test a deterministic program, because nondeterminism can lead to flaky tests that sometimes succeed and sometimes fail. As another example, it is easier for a user or programmer to compare two deterministic executions than two nondeterministic executions.

A number of Java methods return nondeterministic results, making any program that uses them potentially nondeterministic. Here are a few examples:

iteration over HashMaps and HashSets
solution: sort, or iterate over LinkedHashMaps and LinkedHashSets
File.list()
solution: sort the resulting list
Object.toString(), Object.hashCode()
solution: use overriding implementations
new Random()
solution: call the constructor with a fixed seed

You can find more examples of non-deterministic specifications in the standard Java library, and suggestions for how to avoid them, in the Randoop manual and in the ICST 2016 paper Detecting assumptions on deterministic implementations of non-deterministic specifications by A. Shi, A. Gyori, O. Legunsen, and D. Marinov, which presents the NonDex tool.
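The HashSet case above, made deterministic (a minimal illustration; the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Iteration order over a HashSet is unspecified, so any output derived from
// it should first be sorted (or the code should use a LinkedHashSet or
// TreeSet instead).
public class DeterministicIteration {

    public static List<String> names(Set<String> raw) {
        // Sorting fixes the order regardless of hash seeds or insertion order.
        return new ArrayList<>(new TreeSet<>(raw));
    }

    public static void main(String[] args) {
        Set<String> raw = new HashSet<>(List.of("gamma", "alpha", "beta"));
        System.out.println(names(raw));  // always [alpha, beta, gamma]
    }
}
```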

The NonDex tool works dynamically, which means that it cannot detect all user-visible nondeterminism nor give a guarantee of correctness — a guarantee that the program is deterministic from the user's point of view.

The goal of this project is to create a tool, based on a type system, that gives such a guarantee. The tool would report to the user all possible nondeterminism in a program, so that the user can fix the program before it causes problems during testing or in the field.

More concretely, this problem can be handled by creating two simple type systems that indicate whether a given value is deterministic.

    @PossiblyNonDeterministic
          |
    @Deterministic

    @PossiblyNonDeterministicOrder
          |
    @DeterministicOrder

The programmer would annotate routines that are expected to take deterministic inputs. (An example could be all printing routines.) Then, the type system would issue a warning whenever one of those routines is called on a possibly non-deterministic value.

The standard library would have annotations for methods whose results are nondeterministic, such as those listed above.

Bounded-size strings

Windows cannot run command lines longer than 8191 characters. Creating a too-long command line causes failures when the program is run on Windows. These failures are irritating when discovered during testing, and embarrassing or worse when discovered during deployment. The same command line would work on Unix, which has longer command-line limits, and as a result developers may not realize that their change to a command can cause such a problem.

Programmers would like to enforce that they don't accidentally pass a too-long string to the exec() routine. The goal of this project is to give a compile-time tool that provides such a guarantee.

Here are two possible solutions.

Simple solution: For each array and list, determine whether its length is known at compile time. The routines that build a command line are only allowed to take such constant-length lists, on the assumption that if the number of arguments is constant, the resulting command line is probably short enough.

More complex solution: For each String, have a compile-time estimate of its maximum length. Only permit exec() to be called on strings whose estimate is no more than 8191. String concatenation would return a string whose estimated size is the sum of the maximums of its arguments, and likewise for concatenating an array or list of strings.
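The bookkeeping of the more complex solution can be sketched as a wrapper type. Here a runtime field and check stand in for what would really be a compile-time qualifier and a checker error; all names are hypothetical:

```java
// Sketch of the "more complex solution": carry an upper bound on string
// length, so a checker can reject exec() calls whose bound may exceed the
// Windows command-line limit of 8191 characters.
public class BoundedString {

    static final int WINDOWS_LIMIT = 8191;

    final String value;
    final int maxLen;  // conservative upper bound, not the actual length

    BoundedString(String value, int maxLen) {
        this.value = value;
        this.maxLen = maxLen;
    }

    /** Concatenation's bound is the sum of its arguments' bounds. */
    BoundedString concat(BoundedString other) {
        return new BoundedString(value + other.value, maxLen + other.maxLen);
    }

    /** Stand-in for the check a compile-time tool would perform at exec(). */
    boolean safeToExec() {
        return maxLen <= WINDOWS_LIMIT;
    }
}
```

As a type system, this would be a flow of integer bounds through string operations, with string literals getting their exact length as the bound.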

Avoiding exponential blowup when processing DAGs

Google's Bazel open-source project is a publicly-released version of their build system, Blaze. Blaze builds every line of source code that is written by any Google programmer — all of that source code appears in a single repository! Therefore, Blaze needs to be fast. Blaze represents all of the source code and its dependencies as a large DAG (a directed acyclic graph). It needs to manipulate these DAGs efficiently.

One of the biggest problems that the Blaze developers face is exponential blowup of DAG sizes and therefore of run time. Periodically, one of the Blaze developers introduces such a blowup, and Blaze becomes unusable until they can diagnose and fix the problem.

Here are two different ways to view the problem.

  1. In a DAG, multiple nodes may have the same child. Traversing the DAG naively would visit the child multiple times — in the worst case, exponentially many times. It is necessary to avoid doing so.
  2. Blaze contains a function that takes a DAG as input and generates a DAG as output (the output is an object graph). The Blaze developers want to ensure that the size of the output DAG is O(|input DAG|). The input DAG is processed bottom up (we ensure that each input node is visited once) and Blaze stores the results of intermediate computations that construct the output DAG with nodes of the input DAG. The key thing the Blaze developers want to avoid is copying intermediate subgraphs that have unbounded size.
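View 1 can be made concrete with a toy graph. On a "ladder" DAG in which every node has two edges to the same child, a naive traversal does exponential work while a memoized traversal is linear in the number of nodes (the representation below is an illustration, not Blaze's):

```java
import java.util.List;
import java.util.Map;

// Naive vs. memoized traversal of a DAG with shared children.
public class DagTraversal {

    /** children.get(n) lists node n's children; nodes without entries are leaves. */
    public static long naiveCount(int node, Map<Integer, List<Integer>> children) {
        long visits = 1;
        for (int child : children.getOrDefault(node, List.of())) {
            visits += naiveCount(child, children);  // revisits shared children
        }
        return visits;
    }

    public static long memoizedCount(int node, Map<Integer, List<Integer>> children,
                                     Map<Integer, Long> memo) {
        if (memo.containsKey(node)) return 0;  // already visited: no new work
        long visits = 1;
        for (int child : children.getOrDefault(node, List.of())) {
            visits += memoizedCount(child, children, memo);
        }
        memo.put(node, visits);
        return visits;
    }
}
```

On an 11-node ladder (nodes 0-10, node i with two edges to node i+1), the naive traversal performs 2047 visits while the memoized one performs 11. A static analysis for this project would flag traversals, such as flatten(), that lack the memoization.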

More concretely, there is only one Java type for all DAGs, and there is a method flatten(). It's a mistake to call flatten() on certain DAGs, because doing so may cause exponential blowup.

The goal of this project would be to better understand the problem with Blaze/Bazel, to formalize it, and to create a program analysis that solves the problem. You would evaluate your work by running it on the Blaze codebase to discover latent problems, or by providing it to the Blaze developers to run each time they propose a code change. The Blaze/Bazel team is interested in collaborating by evaluating a tool.

Case study of nullness checking

This project is also related to the Bazel/Blaze build system, and was proposed by its development manager.

The Bazel codebase contains 1586 occurrences of the @Nullable annotation. This annotation indicates that a variable may hold a null value. This is valuable documentation and helps programmers avoid null pointer exceptions that would crash Bazel. However, these annotations are not checked by any tool. Instead, programmers have to do their best to obey the @Nullable specifications in the source code. This is a lost opportunity, since documentation is most useful when it is automatically processed and verified. (For several years, Google tried using FindBugs, but they eventually abandoned it: its analysis is too weak, suffering too many false positives and false negatives.)

Despite the programmers' best efforts, null pointer exceptions do still creep into the code, impacting users. The Bazel developers would like to prevent these. They want a guarantee, at compile time, that no null pointer exceptions will occur at run time.

Such a tool already exists: the Nullness Checker of the Checker Framework. It runs as a compiler plug-in, and it issues a warning at every possible null pointer dereference. If it issues no warnings, the code is guaranteed not to throw a NullPointerException at run time.
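A hypothetical example of the pattern the Nullness Checker flags. The annotation below is a local stand-in for the Checker Framework's org.checkerframework.checker.nullness.qual.Nullable, so the block is self-contained:

```java
import java.util.Map;

// A @Nullable value dereferenced without a check is exactly what the
// Nullness Checker warns about at compile time.
public class NullnessExample {

    @interface Nullable {}  // stand-in annotation for this sketch

    // Map.get returns null for missing keys, so the result is @Nullable.
    public static int labelLengthUnsafe(Map<String, String> labels, String key) {
        @Nullable String label = labels.get(key);
        return label.length();  // checker warning: possible NullPointerException
    }

    public static int labelLengthSafe(Map<String, String> labels, String key) {
        @Nullable String label = labels.get(key);
        return (label == null) ? 0 : label.length();  // check refines the type
    }
}
```

The case study would involve writing such annotations (and the occasional null check) throughout Bazel until the checker issues no warnings, then counting the real bugs exposed along the way.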

The goal of this project is to do a large-scale case study of the Nullness Checker on Bazel. The main goal is to understand how the Nullness Checker can be used on a large-scale industrial codebase. How many lurking bugs does it find? What @Nullable annotations are missing from the codebase because the developers failed to write them? What are its limitations, such as code patterns that it cannot recognize as safe? (You might create new analyses and incorporate them into the Nullness Checker, or you might just report bugs to the Nullness Checker developers for fixing.) What burdens does it place on users? Is the cost-benefit tradeoff worth the effort — that is, should Google adopt this tool more broadly? How should it be improved? Are the most needed improvements in the precision of the analysis, or in the UI of the tooling?