CSE 403 Project Ideas

These are projects you could choose as your proposed project for CSE 403. Any of them would be appropriate. You will need to do work to flesh out the project based on the notes here.

There are a lot of potential projects listed here. Don't panic! You can skim them all and only carefully read the ones that are most interesting to you. Be sure to carefully read at least 10 of them, though, or you may miss out on something that would have been a great fit for you.

You are also allowed to propose a different project, either for the project pitch or for the project proposal. If you wish to do so, you must write a draft proposal and discuss it with the instructor, before the assignment due date. Your project should be similar to those here, in that it is addressing a real difficulty that people have when programming, and it is of a size that your team can complete in a quarter.

Contents:

Testing
Debugging
Build systems
- Mapping physical to logical dependencies
- Build file inference
Machine learning
Dynamic analysis
- Run-time checker for Java specifications
Static analysis
Proving that programs are safe
New type systems
Type system enhancements and tools
Mining specifications
Test generation
Code synthesis

Testing

Do you love to write tests? If not, then you should use a tool that writes tests for you. We are looking to improve the automation and coverage of test generation tools, and there are many projects involving static analysis, dynamic analysis, and selection algorithms. Machine learning could be applied to make these systems more efficient.

Generating test setup code

The goal of the project is to automatically generate data structures that satisfy a given property.

A test can be thought of as setup, an action, and an oracle/assertion. In the case of a Java unit test, the setup code might create data structures with particular properties, the action might call a method, and the oracle might assert that the return value is as expected.

It is tedious to write setup code. It would be better for the programmer to indicate the desired property of the created data structures, and they would be created automatically for the programmer.

For example, the programmer might write

  void myTest() {
    List lst = testSetup();          // missing setup code; will be implemented
    assert lst.size() == 2;          // property of the setup code

    int result = testedMethod(lst);  // action
    assert result == 4;              // oracle
  }

The tool could create the incomplete setup code:

  List testSetup() {
    List lst = new ArrayList();
    lst.add("hello world");
    lst.add("other string");
    return lst;
  }

The best evaluation would be whether developers wish to use the tool. Prior to that, a preliminary evaluation could use the long test method names that programmers write. The evaluation could parse these into setup properties and determine how many of them could have been automatically satisfied by a test input generator.

Mining programs to generate better tests

Random generation of tests is surprisingly effective: a machine can generate millions of potential tests, showing to a human only the ones that (probably) indicate a bug. However, the search space of potential tests is vast: all possible inputs to the program. This makes the process inefficient, especially if a human must examine the inputs. It also makes the process less likely to find the needles in the haystack: the really useful tests.

This project aims to use a variety of information to guide the test generation process. Some of this information is available in the program itself. For example, if two different methods contain the same identifiers, then it makes sense to test them together. Other sources of information are code comments, bug reports, version control systems, etc. Other existing analyses such as pointer analysis or abstract type inference could also be useful. And, it is possible to use feedback from executing the tests to guide the test generation process to avoid redundancy and explore new types of tests.

In addition to being useful for generating tests, these techniques could be applied to a variety of other software engineering tasks.

Test prioritization based on failure probability

The faster your test suite tells you about an error, the better. Quicker notification of errors saves time, because the developer remembers what he or she was thinking, and the developer can fix errors rather than building on a faulty foundation.

Test prioritization reorders the tests with the goal of making the test suite fail more quickly. (If the test suite passes, it will take the same amount of time whether or not it has been reordered.)

Here is a very simple strategy: a test is more likely to fail in the future if it has failed in the past. There is evidence that this works extremely well. However, sorting test cases according to their likelihood of failure is not the best approach! This counterintuitive fact is because many tests may have correlated effects. You can do better by exploiting conditional probability.

Frequently, some tests are correlated: they fail in (mostly) the same situations. For example, suppose that:

tests 101 through 110 all fail 10% of the time, but in the past they always have the same outcome: either all 10 pass or all 10 fail.
test 201 fails 5% of the time, but its failure is not correlated with any other tests.

A bad order is to run tests 101, 102, ... 110, then test 201. A better order is to run test 101, then test 201 before test 102, even though 102 is more likely to fail than test 201. Although test 102 has a higher a priori likelihood of failure, it has a lower likelihood of failure conditioned on the fact that test 101 succeeded.

The goal of this project is to build an extension to a testing framework (such as JUnit) that records previous test execution results and reorders tests according to their likelihood of failure. Then, measure how big an effect this has in practice.

Debugging

Distributed system debugging

DVIZ is a graphical, interactive debugger for distributed systems. It enables engineers to explore and control the execution of their system, including both normal operation and edge cases — message drops, node failures, and delays. DVIZ supports time-travel, allowing engineers to navigate a branching history of possible executions. This enables users to backtrack and make different choices about the order in which messages and timeouts are delivered, allowing the exploration of many message orderings in a single debugging session.

DVIZ's visualization currently emphasizes the current state of the system, and enables users to easily explore this state. An alternative visualization of the execution of a distributed system is the space-time diagram (see the figures in this paper). In this view, a programmer can see the whole execution of a system at once. On the other hand, viewing the current state of the system is slightly more awkward.

The goal of this project is to make DVIZ support space-time diagrams as well as its current visualization. Programmers should be able to switch between them on the fly. There are several design decisions to be made: how should timeouts, and messages currently in flight, be represented? How should users interact with the display in order to inspect nodes and messages and to control delivery? DVIZ consists of a web server and a browser-based frontend; both are written in Clojure (using Clojurescript for the frontend). For this project, it should only be necessary to modify the frontend, but it may end up being useful to make server-side changes as well.

Does fault localization help programmers?

It would be useful to have a tool that tells you where in your program the bugs are. This tool could either take as input a program for which you can reproduce a bug (but you don't know where to fix the bug), or a program for which you don't know where the bugs are (in which case it would both find bugs and tell you where to fix them). Many people have proposed such tools.

However, no one knows whether these tools actually work! In fact, it's likely that some of them do not work.

The goal of this project is to evaluate a proposed approach to fault localization. Alternately, devise a more precise and accurate analysis. The evaluation should use real errors and real programmers to see what works in practice. The evaluation is likely to raise new research issues and to inspire ideas for improved tools.

Minimizing bug fixes

Ideally, every commit in a version control repository has a single purpose:

add a feature
fix a bug
refactor

However, in practice programmers often make a commit that mixes together multiple distinct changes. There might be multiple new features, or a feature plus some refactorings that were needed in order to implement it plus some bugs that the programmer noticed along the way.

A commit that serves multiple purposes is harder for programmers and tools to interpret.

The goal of this project is to create a tool that minimizes a patch with respect to some particular goal. For example, minimizing a patch with respect to a bug fix means finding the smallest part of the patch that fixes the bug. The patch would not include documentation changes, variable renaming, and other refactorings.

One approach is to infer and undo refactorings. Another approach is to experimentally remove lines and see whether the tests still work. Other approaches are possible.

Future-proofing issue trackers

Typically, a programmer views a bug report by clicking on a URL to an issue tracker. However, sometimes a service shuts down; for example, Google Code is discontinued and its issue tracker entries are only accessible as JSON files through an API. In addition, it may be desirable, or faster, to view issue tracker information while offline.

The goal of this project is to create an offline interface to issue trackers. You will need to define a file format (maybe reusing or adapting the one from Google Code), create tools to convert current issue tracker webpages into the format (SourceForge, GitHub, etc.), and create rudimentary tools, such as a website, to view the information, similarly to an issue tracker website permits.

Build systems

Mapping physical to logical dependencies

Modern build tools, such as Maven, Gradle, and Bazel, require the programmer to specify a list of dependencies (that is, libraries that the program uses), and the required version of each one. When converting an old project to use a modern build tool like this, it is desirable to automatically translate the physical dependencies (i.e. the files literally present on the developer's computer) to logical dependencies (dependency-version pairs in the build system's format). The goal of this project is to build a tool to do this translation.

This translation tool would also be useful to Dux, which eliminates certain kinds of dependency errors. Dux outputs a list of the physical dependencies, and currently a programmer must convert this list to a build system's logical dependencies.

Build file inference

Suppose that you want to run an analysis on your program. The analysis needs to fit into your build system. It needs to know where the source code is, where compiled files are output, the names of the build file targets that compile and test the program, and so forth.

A user can provide this information manually, but doing so is tedious, time-consuming, and error-prone. The goal of this project is to build a tool that automatically determines this information, for Maven build files.

The tool might work statically, by examining the text of the build file. The tool might work dynamically, by running the build file and observing its file system and build system operations. The tool might work heuristically, by running the build and looking for changed files.

Machine learning

For the natural-language projects, the paper Natural language is a programming language gives a high-level overview of the vision.

Generating bash code from natural-language specifications

One of the great successes of natural language processing is translation, such as converting the English sentence "My hovercraft is full of eels" into the Spanish sentence "Mi aerodeslizador está lleno de anguilas." This is done by recurrent neural networks (RNNs) that are trained on known correct data (English-Spanish sentence pairs).

If this approach works well for natural language, why shouldn't it work for programming languages? In other words, why can't we create a program from natural language?

We have created a system, Tellina, that converts English specifications of file system operations into bash commands. For example, when given the input "find text files in the current folder", it outputs find . -name '*.txt'.

Tellina's output is correct 40% of the time, and its output is correct except for errors in constants (such as '*.txt' above) 70% of the time. In the machine learning community, these numbers would be considered too small. However, people are remarkably good at extracting meaning from partially-correct data. In a preliminary user study, Tellina users solved bash programming problems more quickly and more correctly. Even when Tellina's output was not perfect, it often gave the programmers ideas, such as informing them about a command-line flag.

The goal of this project is to improve and replicate the user study, using the current version of Tellina which is more accurate than the one used in the previous user study. The study would also involve more people, including the entire CSE 391 software tools class.

Validating crowdsourced data

NL2Bash is a dataset that contains pairs of ⟩English, bash⟩, where the English is a description of the effect of the bash command. It was used for training Tellina. The data was collected via crowdsourcing, and we know that some of the data is incorrect. This leads to a worse model (since it was trained on some incorrect data) and misleading experiments (since the "ground truth" is wrong).

The goal of this project is to devise an approach to clean up the dataset and to obtain better data. A successful approach will probably combine static analysis, dynamic analysis, crowd-sourcing, and natural language processing.

Generating test assertions from Javadoc comments

Test assertions indicate the expected behavior of a program. They are important, but programmers are reluctant to write them. They are also redundant with procedure documentation — programmers should not have to provide the same information multiple times.

Toradocu is a tool that translates Javadoc comments into test assertions. For example, here are examples of its input and output:

Input: @param fpp the desired false positive probability (must be positive and less than 1.0)
Output: fpp > 0 && fpp < 1.0
Input: @return an empty Bag
Output: result.equals(Bag.EMPTY_BAG)
Input: @return true if this graph did not already contain the specified edge
boolean addEdge(V sourceVertex, V targetVertex, E e) { ... }
Output: result == !this.containsEdge(sourceVertex, targetVertex)

Toradocu matches text in the Javadoc comment (such as "contain the specified edge") with method and variable names (such as containsEdge). However, it can be extended. One example is it could consider the documentation of the method or variable.

The goal of this project is to analyze Toradocu's weaknesses and devise new techniques, possibly using new data sources, to make it more effective.

Dynamic analysis

Run-time checker for Java specifications

This project can be viewed in two ways.

Writing a better assertion checker for Java.
Most procedures have preconditions and postconditions. Checking preconditions at run time is easy: just write an assert statement at the beginning of the routine. Checking postconditions is more troublesome: one must remember the original values of the arguments, give a name to the return value, and check at every exit point, including return statements, fallthrough returns, and exceptional exits such as those caused by throwing an exception. Object invariants are similarly problematic: they must be duplicated for every procedure entry and exit in a class.

Run-time checking of specifications seems deceptively easy, but has proved a difficult task. We would like to perform the research and implementation that will enable creation of a robust specification (pre- and post-condition, and object invariant) checker — both for use in our own programming, and because it will enable interesting additional research.
Implementing combined static and dynamic typing. (Sometimes called hybrid typing, partial typing, soft typing, statically discharging assertions/obligations, etc.)
If a property can be proved at compile time, it does not need to be checked at run time. But if it cannot be proved at compile time, then it should be checked at run time. (Alternately, compilation could be terminated, but then the program cannot be run at all.) This is useful for both assertions and for type systems.

Static analysis

Verification games: verifying a program via a crowdsourced game

Program verification is the only way to ensure that software is free of errors. However, formal verification is too costly to apply beyond small, critical software components, because it is time-consuming and is done by specially-trained engineers.

Our goal is to make verification more cost-effective by reducing the skill set required for program verification. We have created a system that takes as input a program and a property, and produces as output a game that you can play in a browser. If a player solves the game, then the final configuration of the game board can be translated into a proof of correctness for the program. Our vision is that while ordinary people are waiting for the bus, they are proving programs correct, because it is more fun than waiting for the bus. You can view a video of an early proof-of-concept version of the game.

The goal of this project is to scale the system up so that it can run on large programs and be played by thousands of people.

Tree merge for version control

Merge conflicts are the bane of software development teams. Current version control systems are line-based: when two programmers change the same line of code, the version control system reports a conflict and gives up on the merge, leaving it to the programmer to manually edit the files to resolve the conflict.

In many cases, the conflict is not a real one. For example, suppose one person added a line break, and the other changed the name of a variable. Because these affect the same line of code, they would be considered a version control conflict and the programmer would have to resolve them. As another example, if a programmer surrounds a block of code by an if statement, then the entire block of code gets indented more, and this change conflicts with any change within the block of code. A related problem that is not a conflict but should be considered one is when one programmer changes every call to a given method, and another programmer adds a new call to the method that still uses the old, wrong calling convention.

The goal of this project is to produce a tool that automatically resolves merge conflicts, relieving the programmer of the burden of manually doing so.

In the simplest case, the tool could convert the textual edits into AST (abstract syntax tree) edits and see whether the edits are still in conflict. If not, both could be applied: either applied directly to the code, or applied to the AST and the AST then converted back to code.

A more ambitious goal would be to generalize the edits, such as “this edit changes every occurrence of foo(a, b) into foo(a, b.next)”, and then to apply that edit to the entire codebase, including changes that were made in parallel by another developer. There exist tools that will do this kind of generalization, so you would not need to build that component from scratch.

To evaluate the tool, it could be applied to historical data from GitHub repositories to see how often it would have resolved a conflict without bothering the programmer.

As related work, see the FASE 2017 paper "Precise version control of trees with line-based version control systems".

Tool for analysis diffs

Many program analyses are too verbose for a person to read their entire output. However, after a program change, the analysis results may change only slightly. An "analysis diff" tool could show the difference between the analysis run on the old code and the analysis run on the new code.

The analysis diffs may help the programmer to better understand the changes.
Bug detection tools such as FindBugs or the Checker Framework, have extremely verbose output when first run on a program. Programmers could examine and fix only the warnings about code they have changed (and that they are currently thinking about).
Tools that always have large output, such as inference tools, could become manageable to users if output is shown in small doses.
You can probably think of other uses.

The analysis diff tool would take as input two analysis results (the previous and the current one). It would output only the new parts of its second input. (It could optionally output a complete diff between two analysis results.)

One challenge is dealing with changed line numbers and other analysis output differences between runs.

It would be nice to integrate the tool with git pre-commit hooks or GitHub pull requests, to enable either of the following functionality (for either commits to master or for pull requests):

Permit only those commits/pulls that do not add any new analysis warnings.
Permit only those commits/pulls that are "clean" — the analysis issues no warnings for any changed line.

A concrete example of an analysis diff tool is checklink-persistent-errors; see the documentation at the top of the file. That tool only works for one particular analysis, the W3C Link Checker. An analysis diff tool also appears to be built into FindBugs. The goal of this project is to build a general-purpose tool that is easy to apply to new analyses.

Proving that programs are safe

The Checker Framework is an innovative programming tool that helps you prevent bugs at development time, before they escape to production.

Java's type system prevents some bugs, such as int count = "hello";. However, it does not prevent other bugs, such as null pointer dereferences, concurrency errors, disclosure of private information, incorrect internationalization, out-of-bounds indices, and so forth. Pluggable type-checking replaces a programming language's built-in type system with a more powerful, expressive one.

We have created around 20 new type systems, and other people have created many more. The more powerful type system is not just a bug-finding tool: it is a verification tool that gives a guarantee that no errors (of certain types) exist in your program. Even though it is powerful, it is easy to use. It follows the standard typing rules that programmers already know, and it fits into their workflow.

The Checker Framework is popular: it is used daily at Google, Amazon, Uber, on Wall Street, and in other companies from big to small. It it attractive to programmers who care about their craft and the quality of their code. The Checker Framework is the motivation for Java's type annotations feature. It has received multiple awards. With this widespread use, there is a need for people to help with the project: everything from bug fixes, to new features, to case studies, to integration with other tools. We welcome your contribution!

Prerequisites: You should be very comfortable with the Java programming language and its type system. You should know how a type system helps you and where it can hinder you. You should be willing to dive into and understand a moderately-sized codebase. You should understand fundamental object-oriented programming concepts, such as behavioral subtyping: subtyping theory permits argument types to change contravariantly (even though Java forbids it for reasons related to overloading), whereas return types may change covariantly both in theory and in Java.

These projects take an existing type-checker, apply it to a codebase (you can choose your favorite one, or you can ask for suggestions), and determine whether the type system is easy to use and whether it is effective in revealing or preventing defects. Case studies are our most important source of new ideas and improvements: our most useful features have arisen as a result of an observation made during a case study. Many people have started out “just” doing a case study but have ended up making deep, fundamental contributions and even publishing scientific papers about their discoveries.

A set of large case studies is one possible project. The most common choice is case studies of a recently-written type system, to determine its usability. Another choice is to annotate popular libraries for an existing type system, to make it more usable.

Here are a few suggestions, but a case study of any type system distributed with the Checker Framework is of value.

Signature strings

Show that the ASM library, or the BCEL library, properly handles signature strings (or find bugs in them).

To get started:

Fork https://github.com/typetools/commons-bcel.git
Clone your new fork
git checkout typecheck-signature
mvn verify

Some challenging aspects of this case study are:

The Signature String Checker needs to be enhanced to precisely handle code like someString.replace('.', '/') which converts from @ClassGetName to @FieldDescriptor. It also converts from @FullyQualifiedName to @BinaryName, but only for non-anonymous classes. The full rules for that, and for other calls such as someString.replace('/', '.'), need to be worked out and implemented.
ASM and BCEL define their own new signature string formats (!), which need to be defined. If they are not defined, then warnings need to be suppressed and the guarantee is weaker. (Defining them could be a second part of the case study.)
Some of ASM's and BCEL's documentation is incorrect, and in other cases the string format is not defined.
The BCEL codebase has been partially annotated, but we are not sure the annotations are 100% correct.

Array indexing (index-out-of-bounds errors)

An index-out-of-bounds error occurs when a programmer provides an illegal index to an array or list, as in a[i] or a.get(i) where i is less than 0 or greater than the length of a. In languages like C, this is disastrous: buffer overflows lead to about 1/6 of all security vulnerabilities. In languages like Java, the result is “merely” that the program crashes. In both languages, it is desirable to prevent programmers from making this error and to prevent users from suffering the bad consequences.

We have recently created a static analysis tool that prevents illegal index array exceptions in Java programs. We don't yet know how effective this tool is. Does it scale up to big, interesting programs? Are there common, important code patterns that it fails to handle? Does it produce too many false positive warnings? Does it place too heavy a burden on the user, either in terms of annotations or in terms of complexity of the error messages? Worst of all, are there unknown unsoundnesses in the tool?

This project will be a substantial case study with the Index Checker. The first goal is to identify its merits and limitations. The second goal is to improve its precision enough to make it usable by real-world programmers.

Android support annotations

Implement support for Android Studio support annotations, including @NonNull, @IntRange, @IntDef, and others. Then, do a case study to show the utility (or not) of pluggable type-checking.

Signed and unsigned numbers

The Signedness Checker ensures that you do not misuse unsigned values, such as by mixing signed and unsigned values in a computation or by performing a meaningless operation.

Perform a case study of the Signedness Checker, in order to detect errors or guarantee that code is correct.

You will need to find Java packages that use unsigned arithmetic, or that could use unsigned arithmetic but do not.

Here are some possibilities (or, search for code that uses signedness-sensitive routines in Integer and Long, namely compareUnsigned divideUnsigned parseUnsignedInt remainderUnsigned, and toUnsignedLong):

Jake2
bc-java
Project Nayuki: CRC, crypt, hash.

Your case studies will show the need for enhancements to the Signedness Checker. For example, it does not currently handle boxed integers and BigInteger; these haven't yet come up in case studies but could be worthwhile enhancements.

Java's Optional class

Java 8 introduced the Optional class, a container that is either empty or contains a non-null value. It is intended to solve the the problem of null pointer exceptions. However, Optional has its own problems.

Because of Optional's problems, many commentators advise programmers to use Optional only in limited ways.

The goal of this project is to evaluate the Optional Checker, which warns programmers who have misused Optional. Another goal is to extend the Optional Checker to make it more precise or to detect other mis-uses of Optional.

Sound checking by default

By default, the Checker Framework is unsound in several circumstances. “Unsound” means that the Checker Framework may report no warning even though the program can misbehave at run time.

The reason that the Checker Framework is unsound is that we believe that enabling these checks would cause too many false positive warnings: warnings that the Checker Framework issues because it cannot prove that the code is safe (even though a human can see that the code is safe). Having too many false positive warnings would irritate users and lead them not to use the checker at all, or would force them to simply disable those checks.

We would like to do studies of these command-line options to see whether our guess is right. Is it prohibitive to enable sound checking? Or can we think of enhancements that would let us turn on those checks that are currently disabled by default?

Improving error messages

Compiler writers have come to realize that clarity of error messages is as important as the speed of the executable (1, 2, 3 4). This is especially true when the language or type system has rich features.

The goal of this project is to improve a compiler's error messages. One example (not the only possible one) is the Checker Framework. Here are some distinct challenges:

Some type errors can be more concisely or clearly expressed than the standard "found type A, expected type B" message.
Some types are complex. The error message could explain them, or link to the manual, or give suggested fixes.
Compiler messages currently show the effective type, which may be different than what the user wrote due to defaulting, inference, and syntactic sugar. For example, a user-written @IndexFor("a") annotation is syntactic sugar for @NonNegative @LTLengthOf("a"), and those types are the ones that currently appear in error messages. It would be good to show simpler types or ones that the user wrote.
Some checkers combine multiple cooperating type systems; the Nullness Checker and the Index Checker are examples. If there is a problem with a variable's lower bound type, then its upper bound type should not be shown in the error message. This will make the message shorter and more specific, and avoid distracting the user with irrelevant information.
When a checker has multiple type systems, a type error or the lack of one may depend on facts from multiple type systems, and this should be expressed to the user.

It would be reasonable to start by improving the Index Checker's error messages, which frequently stymie users. Then, generalize the results to other type systems.

Comparison to other tools

Many other tools exist for prevention of programming errors, such as Error Prone, NullAway, FindBugs, JLint, PMD, and IDEs such as Eclipse and IntelliJ. These tools are not as powerful as the Checker Framework (some are bug finders rather than verification tools, and some perform a shallower analysis), but they may be easier to use. Programmers who use these tools wonder, "Is it worth my time to switch to using the Checker Framework?"

The goal of this project is to perform a head-to-head comparison of as many different tools as possible. You will quantify:

the number of annotations that need to be written
the number of bugs detected
the number of bugs missed
the number of false positive warnings

This project will help programmers to choose among the different tools — it will show when a programmer should or should not use the Checker Framework. This project will also indicate how each tool should be improved.

One place to start would be with an old version of a program that is known to contain bugs. Or, start with the latest version of the program and re-introduce fixed bugs. (Either of these is more realistic than introducing artificial bugs into the program.) A possibility would be to use the Lookup program that has been used in previous case studies.

New type systems

The Checker Framework is shipped with about 20 type-checkers. Users can create a new checker of their own. However, some users don't want to go to that trouble. They would like to have more type-checkers packaged with the Checker Framework for easy use.

Each of these projects requires you to design a new type system, implement it, and perform case studies to demonstrate that it is both usable and effective in finding/preventing bugs.

Make programs deterministic

Programs are easier to use and debug if their output is deterministic. For example, it is easier to test a deterministic program, because nondeterminism can lead to flaky tests that sometimes succeed and sometimes fail. As another example, it is easier for a user or programmer to compare two deterministic executions than two nondeterministic executions.

A number of Java methods return nondeterministic results, making any program that uses them potentially nondeterministic. Here are a few examples:

iteration over HashMaps and HashSets: solution: sort, or iterate over LinkedHashMaps and LinkedHashSets
File.list(): solution: sort the resulting list
Object.toString(), Object.hashCode(): solution: use overriding implementations
new Random(): solution: call it with a fixed seed
time and date functions: solution: don't make the program result depend on these

You can find more examples of non-deterministic specifications, and suggestions for how to avoid them, in the Randoop manual and in the ICST 2016 paper Detecting assumptions on deterministic implementations of non-deterministic specifications by A. Shi, A. Gyori, O. Legunsen, and D. Marinov, which presents the NonDex tool.

The NonDex tool works dynamically, which means that it cannot detect all user-visible nondeterminism nor give a guarantee of correctness — a guarantee that the the program is deterministic from the user's point of view.

The goal of this project is to create a tool, based on a type system, that gives a guarantee. The tool would report to the user all possible nondeterminism in a program, so that the user can fix the program before it causes problems during testing or in the field.

More concretely, this problem can be handled by creating two simple type systems that indicate whether a given value is deterministic. In each diagram, the supertype appears above the subtype.

    @PossiblyNonDeterministic            @PossiblyNonDeterministicOrder
              |                                       |
        @Deterministic                       @DeterministicOrder

The programmer would annotate routines that are expected to take deterministic inputs. (An example could be all printing routines.) Then, the type system would issue a warning whenever one of those routines is called on a possibly non-deterministic value.

The standard library would have annotations for

methods that return non-deterministic results
methods like sort that convert nondeterministically-ordered collections into deterministically-ordered ones

You can find a draft manual chapter that documents a possible design for a Determinism Checker. It differs slightly from the above proposal, for instance by having a single type hierarchy instead of two. That type system is nearly implemented, so your best choice for this project might be to do a case study of it, which could lead to design work to improve it.

Custom tainting checking

The Checker Framework comes with a Tainting Checker that is so general that it is not good for much of anything. In order to be useful in a particular domain, a user must customize it:

rename the @Tainted and @Untainted qualifiers to something more specific (such as @Private or @PaymentDetails or @HtmlQuoted), and
annotate libraries.

The first part of this project is to make this customization easier to do — preferably, a user does not have to change any code in the Checker Framework, as is currently the case for the Subtyping Checker. As part of making customization easier, a user should be able to specify multiple levels of taint — many information classification hierarchies have more than two levels (for example, the US government separates classified information into three categories: Confidential, Secret, and Top Secret).

The second part of this project is to provide several examples, and do case studies showing the utility of compile-time taint checking.

Possible examples include:

SQL injection
OS command injection
the @PrivacySource and @PrivacySink annotations used by the Facebook Infer static analyzer.
information flow
many of the CWE/SANS most dangerous software programming errors (and the "on the cusp" ones too)

For some microbenchmarks, see the Juliette test suite for Java from CWE.

Bounded-size strings

Windows cannot run command lines longer than 8191 characters. Creating a too-long command line causes failures when the program is run on Windows. These failures are irritating when discovered during testing, and embarrassing or worse when discovered during deployment. The same command line would work on Unix, which has longer command-line limits, and as a result developers may not realize that their change to a command can cause such a problem.

Programmers would like to enforce that they don't accidentally pass a too-long string to the exec() routine. The goal of this project is to give a compile-time tool that provides such a guarantee.

Here are two possible solutions.

Simple solution: For each array and list, determine whether its length is known at compile time. The routines that build a command line are only allowed to take such constant-length lists, on the assumption that if the length is constant, its concatenation is probably short enough.

More complex solution: For each String, have a compile-time estimate of its maximum length. Only permit exec() to be called on strings whose estimate is no more than 8191. String concatenation would return a string whose estimated size is the sum of the maximums of its arguments, and likewise for concatenating an array or list of strings.

Preventing more index-out-of-bounds errors

The Index Checker detects all possible out-of-bounds array accesses at compile-time. Sometimes it warns about an array access that is actually safe, but the Index Checker is not powerful enough to recognize that fact.

The goal of this project is to improve the Index Checker by making its analysis more powerful. You would start by examining Index Checker warnings that users have marked as false positive warnings. These are the occurrences of @SuppressWarnings("index") in Google Guava and JFreeChart. In some cases the extensions of the Index Checker may be specific to a particular program (such as Guava or JFreeChart).

This project will help to improve open-source software such as Guava and JFreeChart, improve the Index Checker, and validate or disprove the hypothesis than it is easy to extend the Index Checker to handle new code patterns.

Here are specific examples of enhancements that could be made:

Guava's primitives package implements custom collections of primitive types. Each is backed by an array of the primitive type and maintains immutable start and end fields. The class enforces an invariant that the Index Checker cannot express: if i is an index for this, then start+i is an index for the backing array. This causes 63 false positive warnings in the implementations of functions such as:


/** Returns the {@code long} value present at the given index. */
public long get(@IndexFor("this") int index) {
    return array[start + index]; // false positive
}

JFreeChart commonly uses an object's index to fetch it from another object that, by construction, must contain it. In these cases, JFreeChart does not need to check for a -1 return when calling indexOf methods. An example follows:


public class DefaultPolarItemRenderer {
/** The plot the renderer is assigned to. */
private PolarPlot plot;

public LegendItem getLegendItem(...) {
XYDataset dataset = plot.getDataset(plot.getIndexOf(this)); //f.pos
...

Because this is a field in plot, getIndexOf cannot return -1 here even though its documentation indicates that it could. This causes 46 false positive warnings, because the Index Checker reasons that getIndexOf could return -1.

Overflow checking

Overflow is when 32-bit arithmetic differs from ideal arithmetic. For example, in Java the int computation 2,147,483,647 + 1 yields a negative number, -2,147,483,648. The goal of this project is to detect and prevent problems such as these.

As a concrete application, the Index Checker is currently unsound in the presence of integer overflow. If an integer i is known to be @Positive, and 1 is added to it, then the Index Checker believes that its type remains @Positive. If i was already Integer.MAX_VALUE, then this type would be false.

This project involves removing this unsoundness by implementing a type system to track when an integer value might overflow. Implementing such a type system would permit the Index Checker to extend its guarantees even to programs that might overflow. We have noticed that this is important for some indexing bugs in practice (using the Index Checker, we found 5 bugs in Google Guava related to overflow). A key challenge will be keeping the number of false positives in this new type system low: previous attempts to build static analyses for integer overflows have either been unsound or had high false positive rates. Because we are using this analysis for a specialized purpose (i.e. the types are only important when the Index Checker's analysis might become unsound), false positives may be less of a concern.

Lock ordering

The Lock Checker prevents race conditions by ensuring that locks are held when they need to be. It does not prevent deadlocks that can result from locks being acquired in the wrong order. This project would extend the Lock Checker to address deadlocks, or create a new checker to do so.

Suppose that a program contains two different locks. Suppose that one thread tries to acquire lockA then lockB, and another thread tries to acquire lockB then lockA, and each thread acquires its first lock. Then both locks will wait forever for the other lock to become available. The program will not make any more progress and is said to be deadlocked.

If all threads acquire locks in the same order — in our example, say lockA then lockB — then deadlocks do not happen. You will extend the Lock Checker to verify this property.

Index checking for mutable length data structures

The Index Checker is currently restricted to fixed-size data structures. A fixed-size data structure is one whose length cannot be changed once it is created; examples of fixed-size data structures are arrays and Strings. This limitation prevents the Index Checker from verifying indexing operations on mutable-size data structures, like Lists, that have add or remove methods. Since these kind of collections are common in practice, this is a severe limitation for the Index Checker.

The limitation is caused by the Index Checker's use of types that are dependent on the length of data structures, like @LTLengthOf("data_structure"). If data_structure's length could change, then the correctness of this type might change.

A naive solution would be to invalidate these types any time a method is called on data_structure. Unfortunately, aliasing makes this still unsound. Even more, a great solution to this problem would keep the information in the type when a method like add or remove is called on data_structure. A more complete solution might involve some special annotations on List that permit the information to be persisted.

This project would involve designing and implementing a solution to this problem.

Nullness bug detector

Verifying a program to be free of errors can be a daunting task. When starting out, a user may be more interested in bug-finding than verification. The goal of this project is to create a nullness bug detector that uses the powerful analysis of the Checker Framework and its Nullness Checker, but omits some of its more confusing or expensive features. The goal is to create a fast, easy-to-use bug detector. It would enable users to start small and advance to full verification in the future, rather than having to start out doing full verification.

This could be structured as a new NullnessLight Checker, or as a command-line argument to the current Nullness Checker. Here are some differences from the real Nullness checker:

No initialization analysis; the checker assumes that every value is initialized.
No map key analysis; assume that, at every call to Map.get, the given key appears in the map.
No invalidation of dataflow facts. Assume all method calls are pure, so method calls do not invalidate dataflow facts. Assume there is no aliasing, so field updates do not invalidate dataflow facts.
Assume that boxing of primitives is @Pure: it returns the same value on every call.
If the Checker Framework cannot infer a type argument, assume that the type argument is @NonNull.

Each of these behaviors should be controlled by its own command-line argument, as well as being enabled in the NullnessLight Checker.

The implementation may be relatively straightforward, since in most cases the behavior is just to disable some functionality of existing checkers.

It will be interesting to compare this NullnessLight Checker to the regular Nullness Checker:

How much easier is it to use? For example, how many fewer annotations need to be written?
How many fewer true positives does it report — in other words, how many more false negatives does it suffer?
How many fewer false positives does it report?

Uber's NullAway tool may be an implementation of this idea (that is, a fast, but incomplete and unsound, nullness checker). Does Uber's tool provide users a good introduction to the ideas that a user can use to transition to a nullness type system later?

Stateful type systems

This project is to improve support for typestate checking

Ordinarily, a program variable has the same type throughout its lifetime from when the variable is declared until it goes out of scope. “Typestate” permits the type of an object or variable to change in a controlled way. Essentially, it is a combination of standard type systems with dataflow analysis. For instance, a file object changes from unopened, to opened, to closed; certain operations such as writing to the file are only permitted when the file is in the opened typestate. Another way of saying this is that write is permitted after open, but not after close. Typestate is applicable to many other types of software properties as well.

Two typestate checking frameworks exist for the Checker Framework. Neither is being maintained; a new one needs to be written.

Invent your own new type system

We also welcome your ideas for new type systems. For example, any run-time failure can probably be prevented at compile time with the right analysis. Can you come up with a way to fix your pet peeve?

It is easiest, but not required, to choose an existing type system from the literature, since that means you can skip the design stage and go right to implementation.

This task can be simple or very challenging, depending on how ambitious the type system is. Remember to focus on what helps a software developer most!

Type system enhancements and tools

Flow expression parser

A number of type annotations take, as an argument, a Java expression. The parser for these is a hack. The goal of this project is to replace it by calls to JavaParser. This should be straightforward, since JavaParser is already used in other parts of the Checker Framework.

Upgrade to a newer version of ASM

The Annotation File Utilities, or AFU, insert annotations into, and extract annotations from, .java files, .class files, and text files. These programs were written before the ASM bytecode library supported Java 8's type annotations. Therefore, the AFU has its own custom version of ASM that supports type annotations. Now that ASM 5 has been released and it supports type annotations, the AFU needs to be slightly changed to use the official ASM 5 library instead of its own custom ASM variant.

This project is a good way to learn about .class files and Java bytecodes: how they are stored, and how to manipulate them.

Whole-program type inference

A type system is useful because it prevents certain errors. The downside of a type system is the effort required to write the types. Type inference is the process of writing the types for a program.

Type-checking is a modular, or local, analysis. For example, given a procedure in which types have been written, a type-checker can verify the procedure's types without examining the implementation of any other procedure.

By contrast, type inference is a non-local, whole-program analysis. For example, to determine what type should be written for a procedure's formal parameter, it is necessary to examine the type of the argument at every call to that procedure. At every call, to determine the type of some argument A, it may be necessary to know the types of the formal parameters to the procedure that contains A, and so forth. It is possible to resolve this seemingly-infinite regress, but only by examining the entire program in the worst case.

The differences between type checking and type inference means that they are usually written in very different ways. Type inference is usually done by first collecting all of the constraints for the entire program, then passing them to a specialized solver. Writing a type inference tool is harder. Worst of all, it's annoying to encode all the type rules twice in different ways: once for the type checker and once for the type inference.

As a result, many type systems have a type checker but no type inference tool. This makes programmers reluctant to use these type systems, which denies programmers the benefits of type-checking.

The goal of this project is to automatically create type inference tools from type-checking tools, so that it is not necessary for the type system designer to implement the type system twice in different ways.

A key insight is that the type-checker already encodes all knowledge about what is a legal, well-typed program. How can we exploit that for the purpose of type inference as well as type-checking? The idea is to iteratively run the type-checker, multiple times, observing what types are passed around the program and what errors occur. Each iteration collects more information, until there is nothing more to learn.

This approach has some disadvantages: it is theoretically slower, and theoretically less accurate, than a purpose-built type inference tool for each type system. However, it has the major advantage that it requires no extra work to implement a type inference tool. Furthermore, maybe it works well enough in practice.

A prototype implementation of this idea already exists, but it needs to be evaluated in order to discover its flaws, improve its design, and discover how accurate it is in practice.

Dataflow enhancements

The Checker Framework's dataflow framework (manual here) implements flow-sensitive type refinement (local type inference) and other features. It is used in the Checker Framework and also in Error Prone, NullAway, and elsewhere.

There are a number of open issues — both bugs and feature requests — related to the dataflow framework. The goal of this project is to address as many of those issues as possible, which will directly improve all the tools that use it.

Purity (side effect) analysis

A program analysis technique makes estimates about the current values of expressions. When a method call occurs, the analysis has to throw away most of its estimates, because the method call might change any variable. If the method is known to have no side effects, then the analysis doesn't need to throw away its estimates, and the analysis is more precise.

For example, the Checker Framework unsoundly trusts but does not check purity annotations. This makes the system vulnerable to programmer mistakes when writing annotations. The Checker Framework contains a sound checker for immutability annotations, but it suffers too many false positive warnings and thus is not usable. A better checker is necessary. It will also incorporate aspects of an escape analysis.

Choosing an algorithm from the literature is the best choice, but there still might be research work to do: in the past, when implementing algorithms from research papers, we have sometimes found that they did not work as well as claimed, and we have had to enhance them. One challenge is that any technique used by pluggable type-checking to verify immutability must be modular, but many side effect analyses require examining the whole program. The system should require few or no method annotations within method bodies. I'm not sure whether such a system already exists or we need to design a new one.

Javadoc support

Currently, type annotations are only displayed in Javadoc if they are explicitly written by the programmer. However, the Checker Framework provides flexible defaulting mechanisms, reducing the annotation overhead. This project will integrate the Checker Framework defaulting phase with Javadoc, showing the signatures after applying defaulting rules.

There are other type-annotation-related improvements to Javadoc that can be explored, e.g. using JavaScript to show or hide only the type annotations currently of interest.

Performance improvements

The Checker Framework runs much slower than the standard javac compiler — often 20 times slower! This is not acceptable as part of a developer's regular process, so we need to speed up the Checker Framework. This project involves determining the cause of slowness in the Checker Framework, and correcting those problems.

This is a good way to learn about performance tuning for Java applications.

Some concrete tasks include:

Profile the Checker Framework. Run a profiler such as YourKit to determine which parts of the Checker Framework consume the most CPU time.
Perhaps compare a profile of the Checker Framework against a profile of regular javac. This probably is not necessary because the Checker Framework is so much slower than regular javac.
Consider interning string values. The Checker Framework does a fair amount of string manipulation, in part because it reads from resources such as stub files that do not produce Elements. Interning could save time when doing comparisons. You can verify the correctness of the optimization by running the Interning Checker on the Checker Framework code. Compare the run time of the Checker Framework before and after this optimization.
Based on profiling results, devise other optimizations, implement them, and evaluate them.

Run-time checking

Implement run-time checking to complement compile-time checking. This will let users combine the power of static checking with that of dynamic checking.

Every type system is too strict: it rejects some programs that never go wrong at run time. A human must insert a type loophole to make such a program type-check. For example, Java takes this approach with its cast operation (and in some other places).

When doing type-checking, it is desirable to automatically insert run-time checks at each operation that the static checker was unable to verify. (Again, Java takes exactly this approach.) This guards against mistakes by the human who inserted the type loopholes. A nice property of this approach is that it enables you to prevent errors in a program with no type annotations: whenever the static checker is unable to verify an operation, it would insert a dynamic check. Run-time checking would also be useful in verifying whether the suppressed warnings are correct — whether the programmer made a mistake when writing them.

The annotation processor (the pluggable type-checker) should automatically insert the checks, as part of the compilation process.

There should be various modes for the run-time checks:

fail immediately.
logging, to permit post-mortem debugging without crashing the program.

The run-time penalty should be small: a run-time check is necessary only at the location of each cast or suppressed warning. Everywhere that the compile-time checker reports no possible error, there is no need to insert a check. But, it will be an interesting project to determine how to minimize the run-time cost.

Another interesting, and more challenging, design question is whether you need to add and maintain a run-time representation of the property being tested. It's easy to test whether a particular value is null, but how do you test whether it is tainted, or should be treated as immutable? For a more concrete example, see the discussion of the (not yet implemented) [Javari run-time checker](http://pag.csail.mit.edu/pubs/ref-immutability-oopsla2005-abstract.html). Adding this run-time support would be an interesting and challenging project.

We developed a prototype for the EnerJ runtime system. That code could be used as starting point, or you could start afresh.

In the short term, this could be prototyped as a source- or bytecode-rewriting approach; but integrating it into the type checker is a better long-term implementation strategy.

IDE and build system support

The Checker Framework comes with support for external tools, including both IDEs (such as an Eclipse plug-in) and build tools (instructions for Maven, etc.).

These plug-ins and other integration should be improved. We have a number of concrete ideas, but you will also probably come up with some after a few minutes of using the existing IDE plugins!

This is only a task for someone who is already an expert, such as someone who has built IDE plugins before or is very familiar with the build system. One reason is that these tools tend to be complex, which can lead to subtle problems. Another reason is that we don't want to be stuck maintaining code written by someone who is just learning how to write an IDE plugin.

Rather than modifying the Checker Framework's existing support or building new support from scratch, it may be better to adapt some other project's support for build systems and IDEs. For instance, you might make coala support the Checker Framework, or you might adapt the tool integration provided by Error Prone.

Model checking of a type system

Design and implement an algorithm to check type soundness of a type system by exhaustively verifying the type checker on all programs up to a certain size. The challenge lies in efficient enumeration of all programs and avoiding redundant checks, and in knowing the expected outcome of the tests. This approach is related to bounded exhaustive testing and model checking; for a reference, see [Efficient Software Model Checking of Soundness of Type Systems](http://www.eecs.umich.edu/~bchandra/publications/oopsla08.pdf).

Mining specifications

Daikon is an implementation of dynamic detection of likely invariants; that is, the Daikon invariant detector reports likely program invariants. An invariant is a property that holds at a certain point or points in a program; these are often seen in assert statements, documentation, and formal specifications. Invariants can be useful in program understanding and a host of other applications. Examples include “x.field > abs(y)”; “y = 2*x+3”; “array a is sorted”; “for all list objects lst, lst.next.prev = lst”; “for all treenode objects n, n.left.value < n.right.value”; “p != null => p.content in myArray”; and many more.

Dynamic invariant detection runs a program, observes the values that the program computes, and then reports properties that were true over the observed executions. Daikon can detect properties in C, C++, C#, Eiffel, F#, Java, Perl, and Visual Basic programs; in spreadsheet files; and in other data sources. (Dynamic invariant detection is a machine learning technique that can be applied to arbitrary data.) It is easy to extend Daikon to other applications.

Improved output

Infer invariants over exceptional executions

The Daikon invariant detector is a machine-learning tool that infers specifications for programs. It automatically reports facts about your program, such as

"x.field > abs(y)";
"y = 2*x+3";
"array a is sorted";
"for all list objects lst, lst.next.prev = lst";
"for all treenode objects n, n.left.value < n.right.value";
"p ≠ null ⇒ p.content in myArray";

However, it ignores any procedure call that throws an exception. Thus, it cannot report useful information about exceptional executions, such as what causes a program to throws an exception or what properties are true after the exception is thrown.

This project would extend Daikon to handle exceptional execution.

The project has these components:

Defining the feature. What is the relationship between normal and exceptional executions? Should summary information about *all* executions ever be output, or should they be kept strictly separate? What about information about different types of exceptions? Defining the feature is an interesting challenge. At least three designs have been proposed, but each has limitations.
Adjust Daikon's implementation. This may be as easy as defining new “exception exit points” and adding them to Daikon's existing program point hierarchy.
Change front ends that create Daikon input files, so that they output information about exceptions.
Evaluate the results.

Finer-grained handling of arrays

Sometimes, an expression's value is "missing". For example, if x is null, then there is no value for the expression x.field.

Daikon does not currently allow nonsensical elements to appear in arrays. Change it so that it does. (This probably requires an overhaul of Daikon's data structures and algorithms.

Pure (observer) methods that take parameters

Daikon can report properties in terms of methods defined in a class. For example, it can report that arg.isEmpty() is always true on entry to a method.

However, Daikon cannot currently handle pure (observer) methods that take parameters. For example, the HashMap.add() invariants should include:

  pre: !containsKey(key)
  post: containsKey(key)

Ignore unused variables

Ignore unused variables. Currently Daikon outputs data about all variables, but it would speed Daikon up to perform output only for variables that a given procedure may read or write. As a significant side benefit, this could also eliminate some irrelevant output.

Performance

Evaluate performance and usability costs of abstraction over types

A programmer often needs to write code that is polymorphic: it operates over multiple datatypes. For instance, a graph algorithm might need to work whether the data in the graph is integers or floating-point values.

Programming languages provide abstraction mechanisms for creating polymorphic code, such as C++'s templates and Java's generics. These mechanisms have the benefit of reducing code duplication. They also have costs, such as increasing indirection (making the code harder to understand) and reducing performance (due to creation of wrapper objects).

As a result, programmers still frequently perform manual specialization: they maintain multiple copies of code. Oracle recognizes these tradeoffs. They wish to “Stop making users choose between performance and abstraction”. Project Valhalla will improve the situation and is planned for a future version of Java, but is not yet available. Likewise, the book Java Generics and Collections says, “you should measure performance to justify such a design” (of specializing or not). In a 2007 experiment, specializing Quicksort improved performance by 25% (https://dl.acm.org/citation.cfm?id=1297909). However, the JIT compiler has improved in the meanwhile, and there have been few recent studies of the performance impacts.

The goal of this project is to evaluate the costs and benefits of multiple ways of writing code, such as:

maintaining multiple copies of the code that are specialized to different datatypes
using Java abstractions (classes, generics) to avoid code duplication
using a preprocessor to generate Java code, which avoids duplication and achieves good performance but complicates the build system and makes it harder to use an IDE

A good case study for this is the Daikon invariant detector, a machine-learning tool that infers specifications for programs. It is actively maintained and the maintainers are interested in improving it. It currently uses a preprocessor, which makes it easy for you to compare all the different approaches.

Parallelize Daikon

Daikon runs in a single thread. However, its algorithms process each program point (such as a method entry or exit) in isolation. Because most computers have multiple CPUs, Daikon could be made to run faster by using multiple threads, each of which processes just some program points.

This is an interesting exercise for someone who is interested in concurrency and parallelism.

Compact trace file format

Daikon's input and output are large ASCII files. Writing and parsing these files can take 50% of the total run time of the system! A binary format would be much smaller on disk. It would also speed up IO, and it would reduce computation time (by eliminating time to format data into ASCII and then parse it back out).

Avoid recomputation over unchanged variables

Don't do computation over final statics and enums. These constants never change, so comparisons over them are wasted effort. It might be both more efficient and provide better output to examine the one-ofs and other sample dependent invariants at the end of the run and substitute in constants where they match.

Improved program structure and abstractions

Array invariants

Daikon can compute properties both over individual variables, and over all the elements of an array (via so-called "elt invariants"). Each elt invariant is individually written, which is error-prone. Find a way to enable this code to be defined just once, without degrading Daikon's structure nor its performance. For example, elt invariants could be computed from the implementation for each normal (non-array) invariant.

Libraries

Change Daikon's command-line processing to use plume.Options instead.
Replace Daikon's use of the JTB library by http://javaparser.org/.

Test generation

Do you love writing unit tests to find bugs in your programs? Or, would you prefer that the tests were written for you automatically?

The Randoop tool writes tests for you.

Writing tests is a difficult and time-consuming activity, and yet it is a crucial part of good software engineering. If you are interested in improving the state of the art in test generation and bug finding, read on.

Our system, Randoop, automatically generates unit tests for Java classes. For example, here is a test case generated by Randoop that reveals an error in the Sun JDK 1.5:

public static void test1() {
  LinkedList l1 = new LinkedList();
  Object o1 = new Object();
  l1.addFirst(o1);
  TreeSet t1 = new TreeSet(l1);
  Set s1 = Collections.unmodifiableSet(t1);
  // This assertion fails
  Assert.assertTrue(s1.equals(s1));
}

The ultimate goal is a testing system that at the push of a button creates a solid JUnit test suite providing good coverage of arbitrary classes, with no manual effort on your part. Randoop is a step in the direction of this vision and it already does a very good job, especially with library classes (such as java.util). Randoop is already used at companies like ABB and Microsoft, and on open-source projects.

Some of the projects are interesting and useful engineering that will make Randoop more effective for developers. Some projects are publishable research that extends scientific knowledge about test generation approaches. And many projects include both aspects.

Required skills: These projects require facility with Java, because Randoop is written in Java. You should be very comfortable with the idea of unit testing. You should know how unit testing helps you and where it can hinder you, and you should be passionate about improving the tradeoff. You should be willing to dive into and understand a moderate-sized codebase.

Starter project: fix an issue

One way to get familiar with the Randoop codebase is to fix some issue in the issue tracker. Particularly good choices are issues labeled as "help wanted".

Better choice of methods

Randoop selects a method to test at random. Uniform random selection is the simplest, but not always the best, option. This project would augment Randoop with different, and potentially novel strategies for selecting the next method to test. Here are a few example strategies; the project would involve fleshing out the details.

Track code coverage information and use it to guide the generation process, by choosing methods that have not been covered so far.
Let the user specify methods that are critical to test and Randoop should select more often.
Combine random selection with non-random selection strategies (for example, exhaustive testing).

Mimic real call sequences

Randoop's generated test sequences are random and may not be characteristic of real use. One problem is that there may be restrictions on usage of a class: close should only be called after open, or arguments must have particular relationships, such as the non-array argument to binary search must appear in the array argument. Another problem is that random invocations may not create interesting data structures, such as very large ones or ones with particular qualities such as duplicate elements.

Randoop does a good job of generating tests for library classes (such as java.util.*) that have few restrictions on their calling sequences and parameters. Classes that make up programs often have a number of restrictions on both the sequences of calls and their parameters.

We want to extend the system to provide better coverage for such classes. Randoop could infer which sequences of calls are valid and invalid. For example, Randoop could observe what call sequences real code makes, perhaps using a tool like Palulu. Then, bias could be introduced into the generator to make call sequences that are more realistic, and therefore better test the most commonly-used and important parts of the code.

A possible downside is reduced diversity in tests. A research question is whether mimicking real executions is worthwhile overall.

Better choice of arguments

Suppose that Randoop wants to create a call to a method with a parameter of type T1. Currently, Randoop selects an object of type T1 at random. Randoop can do better. Here is one idea; other approaches are possible.

Randoop could determine what fields a method will read, and then chose an object for which those fields have been recently side-effected. As an example, when obtaining a Point to pass to the getX() method, a Point produced by the setX() or move() method is more interesting than one produced by the setY() method.

As another example, Randoop could choose objects that are most likely to achieve additional coverage in the called method, perhaps because the values will cause particular branches to be taken.

Creating independent tests

The goal of this project is to make Randoop output independent unit tests, rather than tests that must be run in a specific order.

Randoop makes test sequences out of methods that may have side effects. This can cause running one test to affect the outcome of a subsequent, putatively independent, test.

For example, suppose that method m1 modifies a global variable, and method m2 reads it. Further suppose that test t1 includes a call to m1, and test t2 includes a call to m2. The result of m2 (and, thus, whether test t2 succeeds) depends on whether test t1 has already been run in the same JVM. All of the tests are entangled, which makes it difficult to run a subset of them, to understand them, or to debug failures.

Ideally, all of the tests in a test suite should be independent, in that they can be run in any order.

Flakiness

Randoop can call APIs whose value changes from run to run of tests, and this can cause flaky tests (tests that spuriously fail). Examples of such APIs are ones that retrieve the current date, the username, etc. Two ways to mitigate such flakiness are:

identify all Java API calls that can lead to flaky information and replace them with mocks
identify information that is "tainted" by flaky data and avoid assertions on them

EvoSuite follows approach #1 so far, and you can see its source code for the APIs it mocks. Approach #2 would avoid the need for a dedicated runtime library.

Robustness

Randoop executes methods under test with random inputs, and this can cause usability problems. Code under test may exhaust memory or other resources to the point that impedes Randoop from making progress, or even causes Randoop itself to crash. Making Randoop more robust in the face of these kinds of failures would greatly improve the tool's usability and attract more users.

One approach is to create a wrapper around Randoop: a process that monitors test generation and can terminate/restart the generator if it fails to make progress or aborts unexpectedly. For more details on the wrapper idea, see Finding Errors in .NET with Feedback-Directed Random Testing (Section 3).

Another approach is to use a safe security manager that can easily be extended by the user. Currently, the security manager has to be specified by the user in the normal fashion on the command line.

Parallelizing Randoop

This project would modify Randoop so that it can run on multiple machines. This would greatly increase the efficiency of the system, by distributing the generation effort across multiple processors. The test generation techniques that Randoop uses are amenable to parallelization: Randoop could spawn multiple test generation engine instances, each with a different random seed.

There are a number of challenges to overcome in parallelizing Randoop. Two examples:

Coordinating multiple instances. Making Randoop parallel requires writing code that can spawn, monitor, and combine the results of multiple processes running in separate machines or processors.
Managing quantity of output. A parallel version of Randoop might produce many repetitive test cases. Avoiding repetitive tests is a general challenge for a tool like Randoop, but is even more important to manage this challenge in a parallel version of the tool.

A downside of this project is that generation speed is not the most important problem for Randoop; effort would probably be better invested elsewhere.

Additional required skills: Knowledge of Java APIs for spawning processes, creating streams, reading/writing files, would also be helpful.

Side effect analysis

Enhance Randoop to utilize a side effect analysis that indicates which methods are pure (perform no side effects). Here are two distinct benefits:

Benefit 1: avoiding useless calls in tests.
Ordinarily, Randoop assumes that a method call may change any object that is passed to it. For example, Randoop considers x to be different after
```
  x = foo();
```
than after
```
  x = foo();
  y = bar(x);
```
and Randoop will try to test the x values produced by both sequences. By ignoring the x value produced by the second sequence, Randoop could spend half as long to produce equally good tests. Also, Randoop's tests would be easier to understand because they would be shorter and would not contain mysterious, useless calls to pure methods.
Benefit 2: more test assertions.
When Randoop creates regression tests, the tests contain assertions about the objects that the test creates. For instance, Randoop might output
```
  assertTrue(someObject.itsField == 22);
```
In addition to reading fields of an object, Randoop could invoke observer methods. For instance, it could output
```
  assertTrue(someObject.observerMethod() == 22);
```
This is only possible for pure methods — methods that have no side effect and return the same value every time they are called on the same arguments. If Randoop is told which methods are pure, it can use them as observers and can generate much richer and more sensitive assertions in its regression tests.

The goal of this project is to enable Randoop to utilize pure methods for both of the purposes outlined above.

There are some other enhancements that would make this project even more compelling.

Randoop takes optional inputs such as a list of pure methods. Create a wrapper around Randoop that automatically invokes a purity analysis and does other pre-processing, so that a user doesn't have to do those steps and Randoop is more automated.
Although side effect and purity analysis is an old research topic, the available tools aren't great. Enhance an existing purity analysis, or create a new one.
As a start, make Randoop support manually-written null and reference immutability annotations This could both cut down the search space and possibly allow more actual errors to be found (such as null dereference errors when nulls are allowed). It would indicate whether the rest of the project is worth pursing.

User interface and user experience projects

Test cases with debugging clues

Debugging is difficult. When a test fails, it can be very time-consuming to understand the source of the failure in order to fix the code (or the test). For example, consider the following test generated by Randoop:

public static void test1() {

  ArrayList l = new ArrayList(7);
  Object o = new Object();
  l.add(o);
  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);
  // This assertion fails
  Assert.assertTrue(s.equals(s));
}

This test reveals an error in Sun's JDK (version 1.5). It shows a short sequence of calls leading up to the creation of an object that is not equal to itself! Unfortunately, it is difficult to tell simply by looking at the test what is the source of the failure.

This project would augment Randoop with the ability to provide debugging clues: comments within a failing test that provide potentially-useful facts about the failure and can help the user with debugging. For example, an improved version of the above test may look something like this:

public static void test1() {

  // Test fails for any capacity >= 0.
  // Test throws IllegalArgumentException for any capacity <0.
  ArrayList l = new ArrayList(7);

  Object o = new Object();

  // Test passes when o is a Comparable.
  l.add(o);

  TreeSet t = new TreeSet(l);
  Set s = Collections.synchronizedSet(t);

  // This assertion fails
  Assert.assertTrue(s.equals(s));
}

The new test provides several clues that can help the user avoid time debugging. The key observation is that the test passes if we use a Comparable object in the call l.add(o). Further investigation would reveal that the TreeSet constructor should not accept a list with a non-comparable object in it but does, and this leads to the error.

To output tests with debugging comments, Randoop would have to be modified in two ways:

Re-execute different variations of a failing test, trying out different method parameters, and keeping track of when the test fails and passes
Synthesize the results of the above executions as debugging comments

In theory this sounds simple, but there are several issues that make the problem interesting: creating a set of variations of a test is non-trivial, and synthesizing test executions into useful comments, are both non-trivial tasks.

Other projects

These projects (some of which are small, and others of which are quite ambitious and would make research contributions) would be a good way to get started with Randoop development and make a contribution.

Make Randoop able to test private methods: generate tests that directly call non-public methods and constructors (this is sometimes useful when testing certain code). Currently, Randoop can create and execute such tests (via the unpublicized option --public-only=boolean), but the resulting JUnit tests do not compile. This task involves modifying Randoop's JUnit output format so that non-public methods are called via reflection, making them accessible for execution.
Combine Randoop and EvoSuite. EvoSuite has multiple components: test generation, test evaluation, test minimization. If EvoSuite's test generation component were replaced by Randoop, what would the effect be? Another way of asking this is how much of EvoSuite's good performance is due to its different components, and how much is due to its core search-based algorithm.
Compare Randoop to QuickCheck, which is another popular test generation tool. Once you see the ways in which each of them is better than the other, add features to each to make each one better.
A key difference is that QuickCheck requires the user to provide generators for values, whereas Randoop creates them automatically. Which is more effective in practice?

Randoop could be extended to make it easier to specify a routine that returns a set of legal values for a parameter (that Randoop wouldn't otherwise guess).
Better classification. Given a test that throws an exception (or does not!), Randoop guesses whether the behavior is expected or unexpected. Evaluate its heuristics and try to improve them.
Run on open source. Randoop doesn't just generate tests: it detects bugs. Run Randoop on open-source programs and report bugs to their maintainers.
Read specifications about method behavior, such as whether a method is pure or whether a particular parameter may be null, and make Randoop respect those specifications. The specifications could be expressed as Java annotations or as Javadoc comments (see the Toradocu project).
Create your own enhancement suggestion. Run Randoop on a benchmark such as Defects4J, and work to improve Randoop's performance such as its coverage or mutation score. Identify a reason for poor performance, propose a way to improve Randoop, and evaluate it.

Code synthesis

Cozy automatically synthesizes efficient data structure implementations from high-level specifications. It can help developers write complex software modules that are 100% bug-free, with orders of magnitude less code. Here is a blog post about Cozy. Here are some ways Cozy can be extended to be more useful.

Additional data structures

Cozy can use lists and hash maps to write data structures. You would extend Cozy so that it also knows about binary search trees, tries, etc. That would allow Cozy to generate more expressive and efficient data structures.

Workload tuning

Cozy generates many data structure implementations, then makes a static guess about how fast each one will run data structure implementation will run. This is good for finding a data structure with good asymptotic cost, but bad for choosing one with the best constant factors, such as choosing between linked or array lists, choosing hash map load factors, and choosing where to insert caches. You would extend Cozy to do tuning, given a benchmark to optimize.

Persistence

Cozy generates in-memory data structures. It would be incredible to turn Cozy into a database synthesizer. This would involve writing a new code generator that produces on-disk data structures—perhaps building on an existing database system.