These notes are in addition to the feedback you received during your in-class presentations.

A general comment is to make your problem concrete by giving simple code examples. You know what you are talking about, but the other people in the room probably don't. When you present in generalities, that robs others of understanding and robs you of feedback. Every presentation would have been strengthened by code examples.

Indices Gone Wild (case study of Index-Out-Of-Bounds Checker)

There are a lot of interesting issues to be explored, but your presentation was vague about them. When you propose to perform a task, you should state why the task is valuable to perform. For example, for a case study, you should express research questions or specific concerns. Why is this case study worth doing, and what do you expect to learn? When you do the case study, do you expect to find more problems in the code you are type-checking, in its documentation, in the functionality of the Index Checker, or in the usability of the Index Checker? You could at least try the tool yourself to see whether you have concerns about it or to see where you should focus your work. (A tiny example of the kind of code the Index Checker reasons about appears after the Git Merge notes below.)

You said you would evaluate how much programmer time is saved. This can only be determined via a controlled experiment, and I recommend against a controlled experiment: controlled experiments are very difficult to do, and you don't have time for one in a quarter. A case study, of the sort that you plan to do, is appropriate and conveys other information, but it doesn't let you measure how much programmer time is saved.

Git Merge

It's great that you estimated whether this is a problem in practice, by looking at forums and discussions and by surveying existing tools that address the problem. Considering related work is essential not just for a research project, but for any project. Your work is most valuable if it does something new; if you don't know what others have done, then you will waste your time re-inventing the wheel.

You expressed concern about performance. One type of performance is run-time CPU consumption. A much more important type of performance is human time consumption. Electricity to run a computer is very cheap, and computers run very fast; by contrast, human time is expensive. Whenever you require human intervention, CPU performance becomes irrelevant, and it is worthwhile to spend quite a bit of computation to save human time. If your tool can sometimes resolve a conflict without involving a human, that is a big win, even if it takes a little while to complete. Furthermore, building an AST is a very fast computation.

The technique is general, but AST building is different for each programming language, so you should start out with just one particular language.

What metrics will you use to determine the success of your project? That is, what will you measure?

You should discuss the possibility of semantic conflicts even when there is no textual conflict. AST merge doesn't actually make this problem any worse than it already is, but some people might believe that it does, because it converts some textual conflicts into clean merges.
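For concreteness, here is a made-up illustration of such a semantic conflict (the class, method, and values are invented for this sketch): the two branches edit different lines, so both textual merge and AST merge complete cleanly, yet the merged program is wrong.

    // Common ancestor: discount is a percentage in the range 0-100.
    class Price {
        static double applyDiscount(double price, double discount) {
            return price * (1 - discount / 100);
        }
    }

    // Branch A changes the contract: discount is now a fraction in the range 0-1.
    static double applyDiscount(double price, double discount) {
        return price * (1 - discount);
    }

    // Branch B, in a different file, adds a caller that assumes the old contract:
    double sale = Price.applyDiscount(80.0, 25);   // intends "25% off", i.e., 60.0

    // Neither textual merge nor AST merge reports a conflict, but the merged
    // program computes 80 * (1 - 25) = -1920 instead of 60.

One common way to catch this kind of conflict is to run the test suite on the merged program.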
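Returning to the Index Checker case study above: here is the kind of tiny example I had in mind. The class and method names are invented; the annotation is from the Checker Framework's Index Checker, which verifies at compile time that array accesses are in bounds.

    import org.checkerframework.checker.index.qual.IndexFor;

    class Demo {
        // @IndexFor("#1") requires the argument to be a valid index into the first parameter.
        static int get(int[] a, @IndexFor("#1") int i) {
            return a[i];        // verified: cannot throw ArrayIndexOutOfBoundsException
        }

        static void client() {
            int[] a = {1, 2, 3};
            get(a, 2);          // accepted
            // get(a, 3);       // the Index Checker reports a compile-time error here
        }
    }

Showing an example like this on a slide would make clear what the tool checks and what annotations the case study will require you to write.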
Dig Dog: GRT Randoop Enhancements

Since you only have to re-implement what is well described in a published paper, you may be able to complete more than two of GRT's six enhancements. It's good to prioritize them into a particular order, though, and it's more important to do a complete, solid job on fewer enhancements (which gives useful information) than a partial or incomplete job on an ambitiously-scoped project (which gives no useful, trusted information).

Some of the enhancements that are claimed in the GRT paper are already in Randoop, and the GRT authors were sloppy in claiming that they had invented them. As an example, Randoop already reads constant values from the program under test; see https://randoop.github.io/randoop/manual/index.html#option:literals-level . There still may be enhancements that you can make based on what is actually new in the GRT paper, though.

It was good that you explained how you plan to evaluate your work: on Defects4J. There are multiple things you can measure: whether Randoop's error-revealing tests find more errors, and whether Randoop's regression tests achieve more coverage. You should do both, rather than just measuring the coverage of regression tests.

Evaluating against QuickCheck may be a bit tricky. One reason is that QuickCheck is not a fully automated system: for each type in the program, it expects the user to write a little bit of code that expresses how to create random values of that type. I suspect that QuickCheck won't do well without this guidance (I might be wrong about this). Therefore, your evaluation may be time-consuming for you to do, and the results might depend too much on your skill in writing those generators.

Optional Type

Several of the comments above apply to your project, so I won't repeat them here.

Your presentation would have been improved by concrete code examples (you showed method names at least, but that's not enough for people who may not know about this new Java feature); a minimal example of the kind I have in mind appears at the end of these notes. You should also explain what your study goals are and how you expect to evaluate them. You should motivate that this is a real problem, for example by finding occurrences of the exception name in bug reports and forum posts.

Prioritization based on failure probability

If you are going to use JUnit, then it may be easier to use the JUnit 5 codebase rather than JUnit 4; I think JUnit 5 is backward-compatible with codebases that use JUnit 4.

Your presentation emphasized putting the developer in the loop, with the developer expected to view a dashboard, prioritize, decide how long tests will be allowed to run, and make other choices. It's good to divide work, with people doing work that requires insight and computers taking over tedious tasks, but I think this project will be more successful if you automate to the greatest extent possible. The developer shouldn't have to devote extra time and thought to testing, but should just run the test suite, which your tool will automatically make fail faster. The developer doesn't have to say "run for 1 hour"; rather, your tool will automatically run the most useful tests first, and the developer can stop it whenever the developer wants (such as when returning from lunch).

Your tool should not run the test suite extra times. Just use historical data, from when the developer runs the test suite or from overnight runs. Your tool will both collect test failure data and reorder tests so that failures occur more quickly in practice.
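To make the intended automation concrete, here is a minimal sketch of the reordering step. The history format (runs, failures, average duration per test) is hypothetical, and the sketch is deliberately independent of JUnit's APIs; the idea is simply to run the tests with the highest expected failures-per-second first.

    import java.util.Comparator;
    import java.util.List;

    class TestHistory {
        String testName;
        int runs;           // recorded executions of this test
        int failures;       // how many of those executions failed
        double avgSeconds;  // average wall-clock time of one execution

        // Estimated probability that the next execution fails.
        double failureProbability() {
            return runs == 0 ? 0.5 : (double) failures / runs;  // never-run tests get a middling prior
        }

        // Expected failures revealed per second of test time.
        double priority() {
            return failureProbability() / Math.max(avgSeconds, 0.001);
        }
    }

    class Prioritizer {
        // Sorts the tests into the order in which they should be run.
        static void reorder(List<TestHistory> history) {
            history.sort(Comparator.comparingDouble((TestHistory t) -> t.priority()).reversed());
        }
    }

The data for TestHistory would come from the historical runs mentioned above (the developer's own runs or overnight runs), not from extra executions of the suite.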
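Finally, for the Optional Type study, here is the sort of minimal concrete example I meant (the values are made up). The relevant exception is presumably java.util.NoSuchElementException, which Optional.get() throws when the Optional is empty; that is the name to search for in bug reports and forum posts.

    import java.util.Optional;

    class OptionalDemo {
        public static void main(String[] args) {
            Optional<String> name = Optional.empty();

            // Safe uses: supply a default, or act only when a value is present.
            System.out.println(name.orElse("unknown"));
            name.ifPresent(System.out::println);

            // Misuse: calling get() without first checking isPresent().
            // This line throws java.util.NoSuchElementException at run time.
            System.out.println(name.get());
        }
    }

A slide with an example like this would let the audience see both the feature and the failure mode your study targets.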