CSE 510 Project Report

Evaluating the User Interface of Software for Grading Code

Ken Yasuhara <yasuhara@cs.washington.edu>
Kent Unruh <ktunruh@u.washington.edu>
Miao Jin <miaojin@u.washington.edu>

4 June 2001
CSE 510, 2001 Spring

1 Project Overview

1.1 Software for Grading Code

In this project, we applied a variety of user interface (UI) evaluation methods to a prototype software system designed for grading and annotating homework code in introductory programming courses. The system allows a human grader to navigate and annotate electronically submitted homework code. (See Figure 1.) Note that the software is not designed to automate the evaluation of the code, a much more difficult problem which is consciously excluded from the system's goals.

When fully implemented, this system will have at least two distinct UIs. The first is the grader's interface, i.e. what the grader uses for reading and grading code. The second is the student's interface, since there must be a mechanism for delivering the graded homework and feedback back to the students. In addition to these primary UIs, there might be an interface for an instructor or head grader to review archived, graded homeworks. This project focuses on a prototype of the first (i.e. grader's) interface.

Figure 1: The prototype system is shown here with four annotations made on a short sample homework submission. Code is shown in the left pane of the window, and annotations are shown in the right pane.

The system is not yet ready for testing in the context of a real course offering, but the hope is that trial use in our university's Department of Computer Science & Engineering (CSE) can begin sometime in the 2001-2002 academic year. At this point, only the essential elements of the grader's interface is implemented. For rapid development, fast execution, and platform independence (at least with Linux and Windows), the prototype is implemented in Perl/Tk. Eventually, we hope that the system can be used at other institutions' computer science departments as well, so while the current prototype is being developed to fit the way grading is done in UW CSE's introductory courses, we are conscious of the importance of flexibility and extensibility, as well.

1.2 Context: UW CSE's Intro. Programming Courses

UW CSE has a two-quarter programming introduction sequence, consisting of courses CSE 142, currently taught in C, and CSE 143, taught in C++. (Both courses will be transitioning to using Java during the next academic year.) Traditionally, students are assigned five homework assignments in each course. Typically, the primary component of the homework is completion of a program, for which students are given up to two weeks to accomplish. For some assignments, in order to achieve more complex results without overburdening the students, the given homework materials include a partially completed program or code framework.

Homework submission and media. Each TA's responsibilities include grading homeworks for a section of about thirty students. (Although grading is only one of the CSE 142/143 TA's many responsibilities, we use the more general "grader" for the remainder of this report.) Currently, students submit each assignment in duplicate, electronically via the web and printed on paper. The submission web server automatically builds an executable program from the electronically submitted copy of the source code.

Grading process. Currently, most of the grading process is done by hand, i.e. the grader handwrites feedback on the printed submission. The process primarily entails reading the code on paper but usually also includes running the program a few times. The electronically submitted source code is also available to the grader. However, in Section 2.1, we observe this code is seldom read on-line. We discuss the current process in more detail with the observations presented in that section.

Grading guideline. Within each course, the instructor(s) and/or head TA usually produce the guideline for all of the graders. However, there are no explicit, standing, course-specific standards on grading guideline format and content. As a result, factors such as criteria, point scale, points per criteria, special case penalties (e.g. late submission) vary widely, sometimes even within a quarter.

Submission scale. Submission length and complexity vary widely across homeworks, usually increasing as the quarter progresses. Early in CSE 142, submissions are typically under five pages of code. By the end of CSE 143, however, submissions can include several files of code, totalling upwards of thirty pages. Files can include up to dozens of short function definitions each.

1.3 Methodology Overview

1.3.1 Background Studies

Before conducting studies focusing on the prototype system, we conducted two background studies: observations of and a survey on conventional grading, i.e. grading programs on paper by hand. A variety of design philosophies, including contextual design (Beyer & Holtzblatt, 1998), are centered on the belief that the success of a system intended to support or augment a work process rests on the extent to which its design is informed by a carefully investigation of the process as it exists before the system's introduction. In this project's case, for instance, the system must be designed to match and accommodate the way graders currently do their work to offer any promise of wide adoption or to improve feedback and/or efficiency.

One of the investigators has extensive experience as a CSE 142/143 TA and head TA. The hope was that these background studies would be helpful for the other two investigators as an opportunity to familiarize themselves with the current process, context, and subject matter of grading. For the investigator with grading experience, the studies would be just as valuable, largely because the grading process is not standardized (beyond a common grading guideline), and each grader has his/her own style of grading. In general, the goal of the background studies was to inform the choice and design of the more focused studies that would follow.

1.3.2 Usability Studies

To evaluate the prototype grader's interface, we applied the "think aloud" usability testing and Nielsen's heuristic evaluation (Nielsen & Molich, 1990, Nielsen, 1993, 1994) methods. Think aloud protocols can let the testers hear and record the participants reactions to the product that are being tested. People do not normally think out loud while they are performing the tasks. Although studies show that retrospective thinking out loud can get more information from the participants, we still choose to let the participants think aloud while performing the tasks due to time constraints.

We chose Nielsen's heuristic evaluation method in our second usability study because (1) it is well-suited for the limited time and resources of this project, (2) it is flexible in allowing evaluator<->observer conversation to clarify the task scenario and available functionality, (3) it should be very effective with evaluators who have extensive experience with software user interfaces.

Each participant in these studies was also asked to complete the survey from the background studies introduced above., and although it was made clear that this was optional, nearly all of them submitted surveys.

1.3.3 Participants

Target Audience. The target audience consists of future CSE 142/143 TAs. We chose to focus specifically on current TAs of CSE 142/143 for the following reasons:

They are familiar with the grading procedures and are experienced graders at this level.
They might be more willing to participate in studies than others, because they know the amount of work that they spend grading and might like to see a tool that will improve their grading efficiency and quality.

Recruited Audience. We successfully recruited several participants who match the above profile. All of the participants in the studies described here are current or former grading TAs for these courses. Most of them are CSE graduate students, with the exceptions being undergraduate CSE majors.

2 Background Studies

2.1 Grading Observations

2.1.1 Objectives

Due to time constraints and preparatory nature of these observations, we chose not to perform a full, formal contextual inquiry. Although we do not present our results in a work model framework, the underlying study objectives are the same.

2.1.2 Procedure

Each of three current CSE 142/143 graders participated in two styles of observation as they graded three actual students' submissions from their current sections. First, we asked the participant to grade two submissions as he/she normally would while we observed their actions as unobtrusively as possible, i.e. silently and without interrupting the participant. The specific goal of employing this style of observation was to identify the major subtasks of the grading process, the amount of time spent on each, and how they were ordered in the grading process. It is common for a grader to move from one subtask to the next very rapidly, spending as little as a few seconds on each. To facilitate rapid recording of subtask timings and aggregation of results, the list of common subtasks and abbreviations shown in Table 1 was used. Timings were taken approximately in multiples of five seconds.

abbreviation	subtask
rsp	read submission, on paper
rsc	read submission, on computer
rgg	read grading guideline
cs	handwrite comment, specific to section of code
cg	handwrite comment, general
mrc	mentally run code, i.e. deduce what code would do by simulating computer
pc	use computer, e.g. for running program, related e-mail

Table 1: standardized list of grading subtasks and abbreviations used in unobtrusive observation

Immediately following the unobtrusive observation, the participant was asked to grade a third submission, this time, using a "think aloud" protocol. Rather than focus on timings, the goal was to discover the goals underlying the subtasks and actions, as well as to learn more about the participant's personal grading philosophy and style, which we expected would not be as apparent in the unobtrusive observations. After grading the third submission, the participant was asked some questions as necessary to clarify notes taken by the investigators.

At the conclusion of the observations, each of the participants was asked to complete the survey described in Section 2.2.

2.1.3 Results and Discussion

"Unobtrusive" Observation. Average total grading time varied widely from 4 to 11 minutes per submission, but this is not surprising, given that the participants graded different homework assignments. Among the subtasks listed in Table 1, for all three participants, four accounted for over 90% of the grading time and in the same order by percentage of total time: read submission on paper (rsp), handwrite specific comment (cs), use computer (pc), and read grading guideline (rgg). All participants spent about 40% of their time reading code on paper, with writing specific comments following close behind at 20 to 35%. The computer usage, about 10 to 20% of the total time, was almost always for the purpose of running the submitted program. Complete results are below in Table 2.

subtask	avg. fraction of total time	sample std. dev.
read submission, on paper	0.41	0.0071
handwrite comment, specific to section of code	0.30	0.072
use computer	0.17	0.057
read grading guideline	0.06	0.029

Table 2: fraction of total grading time spent on top four subtasks; averages across three participants

"Think Aloud" Observation. The timing data above at least suggests some commonalities in ostensible grading procedure. Equally interesting and in contrast, while thinking aloud, the participants revealed very different grading philosophies and methods, as well as actions that did not fit in the prepared subtask categorization. Two of the participants paused to take notes on commonly seen mistakes and later explained that they intended to discuss them with their section as a whole, either in person or in a message to the section e-mail list. Two participants began the grading process by running the submitted program before reading any code, in order to quickly determine what parts were working properly and allow them to focus code reading on perceived trouble spots. This and other remarks strongly implied that some sections of code were merely skimmed, while others were read more closely. Finally, the minimal time spent on reading grading guideline is consistent with remarks indicating that the grader "internalizes" the guideline after applying it to two or three papers, and he/she subsequently only occasionally references it for point values.

Although these observations were conducted with only three participants, they proved useful in providing some evidence for two key conjectures and design recommendations:

There are certain aspects of the process that are common to most graders. In designing the grader's interface, accommodating these aspects should be high priority.
Grading varies in philosophy and method, and for a single interface to support a majority of graders, it must offer features matching this variety and, more generally, constrain the grader as little as possible with few assumptions built into the design.

2.2 Survey

2.2.1 Objectives

The second background study was a survey of current CSE 142/143 graders. The survey fulfilled several key goals and complemented the observations described in Section 2.1. Through a series of questions about grading, some of which with responses on Likert (numerical) scales, the survey provided a more comprehensive quantitative characterization of current grading practice (at the possible expense of inaccurate self-assessment and reflection). Additional hypothetical questions allowed participants to suggest changes to the current grading process, providing insights into what graders consider the essential characteristics of the ideal grading process. Further questions revealed the perceived advantages of paper and on-line grading, respectively. Finally, since participants could complete the surveys on their own time, they were much easier to administer and less burdensome for the participants (possibly facilitating recruitment of volunteers) than the observation sessions. In the absence of artificial time limits, participants might have provided more feedback than in a timed situation.

The survey questions were designed to gather the the graders' perceptions of the following:

which subtasks of the current grading process are more important and more time-consuming
the relative advantages of grading on paper and on the computer

See Appendix A for the complete survey document.

2.2.2 Procedure

As mentioned earlier, all participants of the other studies (observations, usability testing, heuristic evaluation) also completed the survey, but always after completing the study participation. In addition to these participants, a number of curent CSE 142/143 TAs who did not participate in the other studies also completed surveys. The survey was designed to take about fifteen minutes, but the instructions clearly welcomed as much additional feedback as the participant was willing to provide.

2.2.3 Results and Discussion

A total of 9 surveys were submitted, and almost all of the participants had at least two quarters of experience as TAs for CSE 142/143. It was interesting to see that the primary subtasks from the timed, unobtrusive observation (Table 2) match the top four grading subtasks when sorted by how time-consuming grader's thought they were (Table 3). This result was encouraging in two ways. First, the agreement of results from two different inquiry methods invites (informally) more confidence in them. Second, since the survey results are inherently self-reflective, it suggests some capacity on the participants' part to accurately report on their actions.

Focusing on the subtask of providing feedback, graders prioritizations on different kinds of feedback was roughly equal, except for providing code examples, which was given significantly lower priority on average. When asked to rate how helpful these same kinds of feedback are to students for learning homework concepts and skills, however, the results diverged notably from the corresponding prioritizations. Providing code examples received rating about as high as the other kinds of feedback, except for indicating point penalties. Since providing code examples can be much more time-consuming than the other kinds of feedback, we conjecture that this reflects a compromise in the current grading process between pedagogical quality of feedback and time spent on grading, at least on a per student basis. The rating of the value of point penalties received the lowest average rating but was the most varied.

The short answer responses suggested key weaknesses of the current grading process:

no mechanism for measuring or otherwise ensuring consistency of grading across different graders' sections (1 participant)
grading guideline credit breakdowns are too fine-grained and don't sufficiently emphasize concepts, style, functionality (2 participants)
poor accessibility of submission files, i.e. difficult to access files from outside CSE's local network, difficult to use Linux (instead of Windows) to access (2 participants)

They also highlighted what graders perceived to be the primary advantages of paper- and computer-based grading. This is directly relevant and useful data in the sense that, if possible, the grading system should achieve the advantages of paper grading, and system design might focus first on the advantages of computer-based grading cited by these participants. A selection of the remarks are paraphrased in Table 5.

Finally, the survey also gathered perceived feedback valence (i.e. how positive or negative the feedback is). On a scale from 1 (mostly negative, i.e. correcting, identifying errors) to 10 (mostly recognizing good work), the average was 4.4 with a high standard deviation of 2.2. This suggests that most of the graders perceive their feedback to be near balanced. During grading observations, however, the vast majority of annotations for the three participants had negative valence. Further, more focused study with a larger number of participants might reveal the extent to which grader's perceptions of feedback valence balance match reality, a key factor in pedagogical quality of feedback.

subtask	how time-consuming (1 = less time-consuming, 10 = more time-consuming)
subtask	avg.	sample std. dev.
writing comments about the code	8.6	0.88
reading the code	8.0	1.1
running tests with the executable	6.8	2.0
referring to the grading guideline	4.9	1.8
calculating, recording the overall grade	3.3	1.0
preparing papers, files, computer for grading	2.9	0.93

Table 3: a broader breakdown of grading into subtasks, listed in decreasing order by how time-consuming graders perceive them to be

subtask	priority when grading (1 = lowest priority, 10 = highest priority)		perceived helpfulness to students in learning skills and concepts relevant to homework (1 = least helpful, 10 = most helpful)
subtask	avg.	sample std. dev.	avg.	sample std. dev.
providing comments about general sections of code	7.8	1.5	7.6	0.73
indicating in the code where points are taken off	7.1	2.6	4.8	2.4
providing specific comments to students at the exact line number of the relevant code	6.4	2.5	7.2	1.4
providing summary comments about the program as a whole	6.9	2.8	5.9	1.7
providing your own samples of code, demonstrating better alternatives	3.8	1.7	7.8	1.5

Table 4: feedback subtasks, sorted by priority

paper-based

computer-based

can grade anywhere; pleasant to get away from computer (5 participants)
no constraints on writing feedback, i.e. location, diagrams, arrows, etc. (4 participants)

less paper to deal with (6 participants); less monetary and environmental cost, no need to carry submissions to section to return to students, harder to lose submissions
typing faster than writing, so more/better feedback can be provided (3 participants)
could archive graded submissions (2 participants)
easier to read code with syntax-highlighting and code navigation features (2 participants)
could apply same comment to multiple students' papers (2 participants)

Table 5: perceived advantages of paper- and computer-based grading

3 Usability Studies

3.1 "Think Aloud" Usability Testing

3.1.1 Summary

The purpose of this usability test was to uncover usability problems related to using the prototype grader's interface. The two primary areas of concern were as follows:

Annotation. Can users successfully use the annotation features?
Navigation. Can users use the tool to locate the section of code associated with an annotation and search the code for a particular string?

We gathered data on the usability of the prototype through observations of the participants as they used the interface, a brief post-test survey, and a post-test interview.

3.1.2 Key Findings

The three participants had minimal difficulty with the following tasks: annotating a section of code within one line, annotating a section of code spanning more than one line, deleting an existing annotation.

All participants showed difficulty annotating two separate sections of code with a single annotation. One participant showed difficulty editing an existing annotation, but the other two participants found out how to do it easily. Two participants showed difficulty searching the code for a particular string, while one participant did it without any problems.

3.1.3 Key Recommendations

(See Section 3.1.8 for more detailed discussion.)

Easier highlighting of separate sections of code. e.g. matching Windows standard Ctrl-Drag behavior
Visual feedback for annotation location. e.g. blinking highlight, highlighted line numbers
Keyboard shortcuts throughout. e.g. accepting and cancelling actions in dialog boxes, highlighting, annotating, and other common operations
Clearer display of code associated with multiple, overlapping annotations.

3.1.4 Overview of the Test

The overall aim of the grading system is to improve homework feedback in UW CSE's introductory computer programming courses. It will be used by the future CSE 142/143 TAs to improve their grading quality and efficiency. Our study focused on current CSE 142/143 TAs and the two main categories of functionality: annotating and navigating code.

3.1.5 Test Design

The pilot study was exploratory in nature. Participants were asked to complete nine tasks (Appendix B.2), complete a short questionnaire (Appendix B.5), and answer some final interview questions (Appendix B.6).

3.1.6 Procedure

The tests, which lasted approximately 40 minutes each, took place at a CSE graduate student's desk using a PC running Linux. An actual homework assignment from a past offering of CSE 142 was used, although its grading guideline was simplified in the interest of time. All of the participants worked with the same sample submission file, which was not an actual student's submission but a simulated one created by heavily modifying a sample solution with mistakes observed in student submissions. To supplement handwritten notes, the participants' verbalizations were recorded on audio microcassettes.

The investigator's duties were as follows:

Confirmed that computer was set up and ready for test using the pretest and ready for test using the pretest checklist
Introduced the investigators and test to the participant with the initial meeting script; had the participant sign the consent form.
Described the test to the participant, including information about tasks
Recorded general observations from test
Gave participant the post-test questionnaire
Administered post-test interview and recorded comments

3.1.7 Results

The pilot tests provided much information useful for improving the design of the grader's interface. The results for each task are as follows:

Skim the Homework Assignment

All of the three participants skimmed the Homework Assignment on paper. The reason is that we did not prepare an on-line version.
Skim Grading Guideline

All of the participants skimmed the Grading Guideline on paper, although an onaline version is available. One participant commented that he would like to use the on-line version of the Grading Guideline while he is grading to avoid switching from screen to paper.
Annotate a section of code within one line

All of the participants were able to finish this task easily. One participant said, "Easy," after he was done.
Annotate a section of code spanning more than one line

All of the participants finished this task without any difficulties.
Edit an existing annotation

All of the participants finished this task. The first participant tried a few ways including right clicking and editing the annotation directly and finally noticed the Edit button. He commented that, "It is still nice," but would prefer to edit the annotation directly in the annotation area, without using the pop-up windows. Both of the other two participants found out after trying to right click with the mouse.
Annotate two separate sections of code with a single annotation

All of the participants finished this task with different degree of difficulty. None of them found the Highlight button easily. All of them tried to right click, one of them tried to drag the mouse.
Delete an existing annotation

Participants quickly deleted an existing annotation.
Locate the section of code associated with an annotation

At first, none of them figured out that by just left clicking the annotation, the section of code associated with the annotation appears in the code pane. One of them thought he could locate the code by using the line number contained in the annotation, and the other two participants figured out the feature after some struggle.
Search the code for a particular string

All of them finished the task successfully.

In the Search dialog box, two of them tried to hit the Enter key to start the search instead of clicking the Search button.

One of them tried to use Control-S to start the search.

3.1.8 Recommendations

The following suggestions for change are based on observation of the three individuals in the pilot study. Before implementing any major changes, we recommend that a full usability study, with a minimum of five or six participants, be conducted. This will help confirm whether or not the findings mentioned above are significant and reoccurring problems.

We also recommend that any significant changes to the user interface be retested in a usability study. Testing will assure the developers that their changes really did improve the usability of the interface.

Easier highlighting of separate sections of code. Make slight changes to the highlight button to make annotation of separate sections of code more intuitive. Consider matching Windows standard Ctrl-Drag behavior for adding a new section of selection when other text is already selected.

Visual feedback for annotation location. In the current version of the prototype, when an annotation is clicked in the right pane, the code in the left pane scrolls to ensure that the associated code is visible. However, if the associated code is already visible, there is no visual feedback indicating that it has been located. Change this functionality such that when the user clicks the annotation, the associated code will blink for a second, appear at exactly the top of the code pane, or the line numbers in the normally gray left margin are highlighted for those lines containing the associated code.

Keyboard shortcuts throughout. For example, make a slight change to the Edit Annotation and Find dialog boxes so that, in addition to clicking buttons, the user has the option of pressing Enter to accept annotation changes and start the search, respectively.

Clearer display of code associated with multiple, overlapping annotations. When annotations are made on sections of code already associated with another annotation, sometimes, the highlight for the larger annotation obscures that of the contained annotation, i.e. overlapping annotation highlights obscure each other and are, at this point, arbitrarily ordered in depth. One suggestion is to indicate overlapping annotations by simulating translucent highlighting, mixing the colors where highlighting overlaps, but this might get confusing quickly. Another idea is to bring an annotation's highlight to the front when it is selected, i.e. clicked in the annotation pane.

Proper screen resizing for larger view. The current version does not properly resize the code and annotation panes vertically when the window is resized.

Meaning associated with highlight color. The current version does not support user choice in highlight color very well. For example, highlight color could be used to convey seriousness of the highlighted error.

Support for program execution. Two participants said they would have liked to be able to run the submitted program from within the interface, e.g. by clicking a Run button. As observed in the "think aloud" observation (Section 2.1.3), these participants prefer to start the grading process by running the code.

3.1.9 Evaluation

The tasks are very appropriate and well suited for our study. The participants were able to understand and complete the tasks. Although due to the implementation problems, some functionality could not be tested, we still got valuable data on the existing functionality. We got a lot of valuable information from the post-test interview and post-test questionnaire, which were an opportunity to directly ask for ratings of task difficulty and helped to supplement observations from the think aloud process.

Future usability tests could be improved based on our experience. Due to resource constraints, we only worked with three participants, but more could have led to more confidence in recommendations. The sample submission file provided for use with the tool should be improved to provide a more compelling context for some of the functionality tested in the tasks. For example, locating the code associated with an annotation is not very useful with a small code file that can be fully viewed with little or no scrolling. Finally, gathering data might have been simplified and more effective if the sessions were videotaped, at the possible cost of turning away potential volunteers who might be more comfortable off camera. We tried to compromise by using audio microcassette, but at least one participant spoke too quietly to yield a useful recording. Some combination of better equipment and more prompting for louder speaking might remedy this.

3.2 Heuristic Evaluation

3.2.1 Overview of Method

The technique of heuristic evaluation has been developed to assist in the identification of design problems, particularly in the examination of user interfaces. "Heuristic evaluation is a usability engineering methods for finding the usability problems in a user interface design so that they can be attended to as part of an iterative design process (Nielsen, "How")". Heuristic evaluations were developed as an alternative to complex testing involving a large number of usability issues to be identified and analyzed. "Current user interface standards and collections of usability guidelines typically have on the order of one thousand rules to follow and are therefore seen as intimidating by developers. For the discount method I advocate cutting the complexity by two orders of magnitudes and instead rely on a small set of heuristics such as the ten basic usability principles (Nielsen, 1994)." The list of ten basic usability heuristics as developed by Nielsen include:

visibility of system status
match between the system and the real world
user control and freedom
consistency and standards
error prevention
recognition rather than recall
flexibility and efficiency of use
aesthetic and minimalist design
help users recognize, diagnose, and recover from errors
help and documentation

In addition to distilling these ten usability heuristics that describe common properties of usable interfaces, Nielsen proposes a methodology to examine them. First, an evaluator, the person who is actually testing the system, is given both a task (or set of tasks) to complete using the interface under design and the list of ten usability heuristics. As the evaluator goes through the interface, they identify usability issues that appear to be absent, violate one of the properties listed in the ten usability heuristics, and/or otherwise inhibit the use of the interface/system. The role of the observer(s) (one or more project team members) is to (a) answer any questions that the evaluators have about the interface and (b) ensure that the evaluators cover all areas of the interface/system under evaluation. In addition to identifying usability issues, evaluators also rate their severity. The severity of a usability problem is a combination of three factors: frequency, which indicates the likelihood that a user will encounter the problem; impact which determines how easy or difficult it will be for the user to overcome the problem; and persistence which notes the extent to which the problem will be repeatedly bothersome (Nielsen, "Severity").

Heuristic evaluation was developed by Nielsen to balance two important issues in the design process: (a) the need to study users and (b) the constant push to constrain the costs of development. "It is certainly true that one should study user needs before implementing

solutions to those problems. Even so, the perception that anybody touching usability will come down with a bad case of budget overruns is keeping many software projects from achieving the level of usability their users deserve (Nielsen, 1994)." Nielsen's (1993) own analysis confirms that even industry rarely uses recommended usability engineering techniques adequately in the design of systems. Because heuristic evaluation can be effectively used identify problems at multiple iterations of the design process and possibly used in conjunction with other traditional usability testing approaches, Nielsen kept the basic methodology simple. "These principles can be presented in a single lecture and can be used to explain a very large proportion of the problems one observes in user interface designs (Nielsen, "How")." Heuristic evaluation methodology can effectively be employed with 3-5 evaluators, who can be usability specialists, perhaps even people from other development teams who are easy to recruit (Nielsen, 1995). The end-result of a heuristic evaluation method "is a list of usability problems in the interface with references to those usability principles that were violated by the design in each case in the opinion of the evaluator (Nielsen, "How")."

3.2.2 Rationale

Nielsen's version of heuristic evaluations were chosen as a primary methodology to evaluate the prototype grading tool for three reasons. First, because the entire project timeline was only 10 weeks in duration and the initial version of the prototype grading tool took the majority of this time to be developed, the project team needed a methodology that would yield quality results in a relatively short amount of time. In addition to being easy to implement, Nielsen ("Characteristics") has shown that heuristic evaluation is a good method for finding both major and minor problems in a user interface. When the testing version of the prototype grading tool was nearing completion, the project team was able to develop the scenarios and design the heuristic evaluations so that they would be ready to be implemented as soon as the software was ready. This is not unlike the research and development context in other industry settings. For instance, Nielsen (1995) mentions that "in planning for technology transfer of new usability methods, we have seen that the first requirement is to make sure that the method provides information that is useful in making user interfaces better. Equally important [italics inserted], however, is to make the method cheap and fast to use and to make it easy to learn." Nielsen's heuristic evaluations met all these requirements.

The second reason that Nielsen's heuristic evaluation was chosen to be implemented is the flexibility of the methodology in regard to evaluator<->observer interaction. Grading university-level homework assignments even from introductory courses such as CSE 142 and CSE 143 is a highly complex process. It requires that graders adequately understand the general context for the homework including assignment instructions, the grading guidelines developed by the head TA as well as the specific assignment of individual students which include both executable and descriptive copies of code. For the evaluators, these intricacies of the grading process would be combined with the additional complexities of using a software tool that was still under development. The project team was concerned about minimizing the potential 'frustration level' of evaluators. Because the goal was to get as much quality feedback as possible about the usability aspects of the prototype grading tool, the project team wanted the evaluators to spend their time and energies identifying, classifying, and rating usability issues and not in blind attempts to learn system functionality. To focus on the appropriate issues, the project team wanted to minimize the amount of time evaluators needed to understand the scenario and utilize the functionality of the software to complete the tasks. Nielsen's approach to heuristic evaluation provided the project team an opportunity to answer any questions about either the scenario or the prototype grading tool, thereby maximizing the time of the evaluators and reducing the complexity of the evaluation sessions.

Finally, the use of Nielsen's heuristics seemed especially appropriate for the available pool of potential evaluators easily accessible given the project timeline. The project team had multiple personal contacts of other CSE graduate students who had experience grading CSE 142 or CSE 143 homework assignments in the past. In addition to their availability to the project team, the CSE graduate students provided a group of evaluators who are already generally familiar with the idea of user testing. Again, in the interests of time, it was expected that the CSE graduate students would easily be able to understand both the reasons why the project team was evaluating the prototype grading tool and the ten primary usability heuristics that would be used to classify the usability issues identified during the evaluation. This combination of being "domain experts" as well as being relatively familiar with the context for the evaluation made Nielsen's heuristic evaluation methodology an adequate choice for the testing of the prototype grading tool. Having knowledgeable evaluators is particularly important because Nielsen's methodology places the responsibility for analyzing the user interface on the evaluator, not the observer (Nielsen, "How").

3.2.3 Procedure

The Heuristic Evaluations took place at a TA desk on the fourth floor of Sieg Hall. Five evaluators who responded to email requests for participants from the project team were recruited. Each evaluator was scheduled for a one-hour evaluation over two days in the same week. All of the evaluators were CSE graduate students who had experience grading either CSE 142 or CSE 143 courses in the past.

When the evaluators arrived, they were greeted by the observers which consisted of either two or three project team members. After the evaluator was made comfortable in front of the terminal on the TA desk, the project team explained the project with the following statement which also appeared at the top of the task sheet:

"We are conducting a heuristic evaluation of a prototype currently under development to support the grading of homework for CSE 142 and CSE 143. Your responses will be kept anonymous and handled only by the three researchers on this project. Individual responses will be aggregated with other participating evaluators before being reported. We very much appreciate your time and assistance."

The observers reiterated that the project team was conducting a heuristic evaluation as discussed by Nielsen. The project team also explained to the evaluators that because this was a Nielsen heuristic evaluation, the evaluator was encouraged to ask questions about how to use the system to accomplish tasks and that the developer of the prototype grading tool was available specifically to answer any questions about software functionality. At this point, the evaluators received a single sheet of paper (Appendix C.2) listing the ten usability heuristics developed by Nielsen. The observers explained that these heuristics represent categories of usability issues that the project team is particularly interested in examining. Although the evaluator was encouraged to read over the heuristic list, they were specifically instructed not to limit their comments to these categories and that they would be given a chance at a later time to revisit their feedback in light of these usability heuristics.

The evaluators were also informed that the evaluation process would involve three stages. In stage one, the evaluators were given a task list (Appendix C.1) reiterating the goals of the project and a list of tasks to complete while grading the assignment. The evaluators also received a copy of the homework assignment and the grading guideline. The observers assisted the evaluator in opening the actual assignment into the prototype grading tool. The evaluators were informed that the assignment was a modified homework created by the project team to model common grading issues that arise from introductory computer science courses such as CSE 142 and CSE 143. During the first step of the evaluation, the evaluators communicated the usability issues they encountered orally to the observers who recorded the observations using a form (Appendix C.3). Following the completion of the basic grading tasks in stage one, the observers explained stage two which involved rating the usability issues on a 0-4 scale. In addition to explaining the severity ratings, the evaluators were also give n the usability scale on a piece of paper for visual reference (Appendix C.4). The scale was adapted from Nielsen's severity ratings (Nielsen, "Severity"). The observers then repeated each usability issue that had been identified by the evaluator and asked for a rating using the severity scale. In the third stage, evaluators were asked to re-visit each usability issue previously identified and rated, and classify it as relating to one or more of the ten usability heuristics developed by Nielsen. Throughout the three stages, evaluators visited each usability issue three times: once to identify a usability issue, a second time to rate it, and a third time to classify it according to Nielsen's ten usability heuristics.

3.2.4 Results

The results from the heuristic evaluations are listed below in three primary tables. Each table is a compilation of data received from all five evaluators. The data is presented in three columns. The first column indicates the usability issues, the second the severity ratings, and the third the classification of the usability issue into one or more of Nielsen's ten usability heuristics. Data from all five heuristic evaluations were analyzed and clustered around usability concepts and described with a label. This intermediate-level analysis provided for a more succinct presentation of the data and also allows readers to clearly identify areas of overlap. In cases where an evaluator identified an issue but did not rate it or classify it according to Nielsen's ten usability heuristics, the column contains the notation "n/r" and "not classified" respectively.

Table 6 below provides a collective list of specific usability issues identified by the evaluators. Duplicate entries indicate that more than one evaluator identified the issue during their evaluator.

Table 6 Data organized by Usability Issues Identified
ability to "over annotate" for separate problems in same line/s of code	3	<user control and freedom> <error prevention>
ability to add general comments	1	<not classified>
ability to link annotations together into themes	3	<user control and freedom>
ability to see comments & annotations listed in a separate window	4	<visibility of system status>
allow for different styles of points (1/4 +6 -3)	4	<match between system & real world>
annotate output of code	4	<user control and freedom>
annotation window match editors (vi)	2	<flexibility and efficiency of use> and <error prevention>
assign colors according to "type" of annotation	3	<visibility of system status>
associate metadata or duplicate annotations b/t homeworks	4	<match between system & real world>
avoid OK button by activating another window w/ mouse	2	<not classified>
checklist for major items in the grading guideline	2	<not classified>
checklist for major items in the grading guideline	4	<flexibility and efficiency of use> <error prevention>
concern about contrast of adjacent colors	2	<flexibility and efficiency of use>
concern about contrast of adjacent colors	3	<not classified>
concern about contrast of adjacent colors	4	<aesthetic and minimalist design>
draw circle to mark text rather than highlighting	not rated	<not classified>
duplicate and edit a line of code and incorporate that into an annotation	3	<match between system & real world>
edit annotation directly in right pane	2	<user control and freedom>
edit annotation directly in right pane	3	<flexibility and efficiency of use> <aesthetic & minimalist design> <visibility of system status>
edit annotation directly in right pane	n/r	<not classified>
edit annotations by right clicking	2	<flexibility and efficiency of use>
effective pg-up & pg down	3	<flexibility and efficiency of use>
extend an annotation	2	<flexibility and efficiency of use>
extend an annotation	3	<user control and freedom>
extend an annotation	4	<user control and freedom>
extend an annotation	n/r	<not classified>
find feature provide feedback if nothing is found	4	<flexibility and efficiency of use> <user control and freedom>
grading tool should add up points	2	<flexibility and efficiency of use>
grading tool should add up points	2	<user control and freedom>
grading tool should add up points	3	<match between system & real world>
highlight non-text spaces	3	<user control and freedom>
incorporate output and append to annotations	3	<user control and freedom>
incorporate screen shots of output	n/r	<not classified>
integration of the grading guideline	0	<flexibility and efficiency of use> <match between system & real world>
integration of the grading guideline	2	<match between system & real world>
integration of the grading guideline	2	<not classified>
invoke search function w/ right click	4	<flexibility and efficiency of use> <user control and freedom>
lack of visual cue in code when clicking on annotation in right screen	2	<recognition rather than recall>
lack of visual cue in code when clicking on annotation in right screen	2	<visibility of system status>
left bar too distracting, should be white	1	<aesthetic and minimalist design>
line numbers don't contrast enough	0	<flexibility and efficiency of use>
must click OK button/no keyboard alternate	2	<flexibility & efficiency of use>
must click OK button/no keyboard alternate	3	<flexibility and efficiency of use>
must click OK button/no keyboard alternate	not rated	<not classified>
run code inside annotation tool	1	<flexibility and efficiency of use>
run code inside annotation tool	2	<match between system and the real world>
scroll easily w/in annotation window	4	<flexibility and efficiency of use>
scroll easily w/in annotation window with keyboard	2	<flexibility and efficiency of use>
selectively undo highlighted text w/o losing other multiple line highlights	1	<flexibility and efficiency of use>
use shift/control to highlight w/o mouse	4	<flexibility and efficiency of use>
use shift/control to highlight w/o mouse	n/r	<not classified>
warning box based on checklist	3	<not classified>
when highlighting, grab a whole word	1	<flexibility and efficiency of use>
when highlighting, use ctrl-shift not button in button bar	3	<flexibility and efficiency of use>

Table 7 below provides a collective list of specific usability issues identified by the evaluators. They are organized alphabetically by the Nielsen category in which the evaluators associated the usability issue. Because the five evaluator classified the usability issues they identified differently, the same usability issue may appear in different categories. Duplicate entries indicate that more than one evaluator identified the issue during their evaluator.

Table 7 Usability Issues organized by Nielsen's Categories
concern about contrast of adjacent colors	4	<aesthetic and minimalist design>
left bar too distracting, should be white	1	<aesthetic and minimalist design>
must click OK button/no keyboard alternate	2	<flexibility & efficiency of use>
concern about contrast of adjacent colors	2	<flexibility and efficiency of use>
edit annotations by right clicking	2	<flexibility and efficiency of use>
effective pg-up & pg down	3	<flexibility and efficiency of use>
extend an annotation	2	<flexibility and efficiency of use>
grading tool should add up points	2	<flexibility and efficiency of use>
line numbers don't contrast enough	0	<flexibility and efficiency of use>
must click OK button/no keyboard alternate	3	<flexibility and efficiency of use>
run code inside annotation tool	1	<flexibility and efficiency of use>
scroll easily w/in annotation window	4	<flexibility and efficiency of use>
scroll easily w/in annotation window with keyboard	2	<flexibility and efficiency of use>
selectively undo highlighted text w/o losing other multiple line highlights	1	<flexibility and efficiency of use>
use shift/control to highlight w/o mouse	4	<flexibility and efficiency of use>
when highlighting, grab a whole word	1	<flexibility and efficiency of use>
when highlighting, use ctrl-shift not button in button bar	3	<flexibility and efficiency of use>
edit annotation directly in right pane	3	<flexibility and efficiency of use> <aesthetic & minimalist design> <visibility of system status>
checklist for major items in the grading guideline	4	<flexibility and efficiency of use> <error prevention>
integration of the grading guideline	0	<flexibility and efficiency of use> <match between system & real world>
find feature provide feedback if nothing is found	4	<flexibility and efficiency of use> <user control and freedom>
invoke search function w/ right click	4	<flexibility and efficiency of use> <user control and freedom>
annotation window match editors (vi)	2	<flexibility and efficiency of use> and <error prevention>
allow for different styles of points (1/4 +6 -3)	4	<match between system & real world>
associate metadata or duplicate annotations b/t homeworks	4	<match between system & real world>
duplicate and edit a line of code and incorporate that into an annotation	3	<match between system & real world>
grading tool should add up points	3	<match between system & real world>
integration of the grading guideline	2	<match between system & real world>
run code inside annotation tool	2	<match between system and the real world>
ability to add general comments	1	<not classified>
avoid OK button by activating another window w/ mouse	2	<not classified>
checklist for major items in the grading guideline	2	<not classified>
concern about contrast of adjacent colors	3	<not classified>
draw circle to mark text rather than highlighting	not rated	<not classified>
edit annotation directly in right pane	n/r	<not classified>
extend an annotation	n/r	<not classified>
incorporate screen shots of output	n/r	<not classified>
integration of the grading guideline	2	<not classified>
must click OK button/no keyboard alternate	not rated	<not classified>
use shift/control to highlight w/o mouse	n/r	<not classified>
warning box based on checklist	3	<not classified>
lack of visual cue in code when clicking on annotation in right screen	2	<recognition rather than recall>
ability to link annotations together into themes	3	<user control and freedom>
annotate output of code	4	<user control and freedom>
edit annotation directly in right pane	2	<user control and freedom>
extend an annotation	3	<user control and freedom>
extend an annotation	4	<user control and freedom>
grading tool should add up points	2	<user control and freedom>
highlight non-text spaces	3	<user control and freedom>
incorporate output and append to annotations	3	<user control and freedom>
ability to "over annotate" for separate problems in same line/s of code	3	<user control and freedom> <error prevention>
ability to see comments & annotations listed in a separate window	4	<visibility of system status>
assign colors according to "type" of annotation	3	<visibility of system status>
lack of visual cue in code when clicking on annotation in right screen	2	<visibility of system status>

Table 8 below provides a collective list of specific usability issues identified by the evaluators. The usability issues are organized according to the priority assigned by the five evaluators. Because the five evaluators prioritized the usability issues differently, the same usability issue may appear in different places in the table. Duplicate entries indicate that more than one evaluator identified the issue during their evaluator.

Table 8 Usability Issues Organized By Nielsen's Severity Ratings
ability to see comments & annotations listed in a separate window	4	<visibility of system status>
allow for different styles of points (1/4 +6 -3)	4	<match between system & real world>
annotate output of code	4	<user control and freedom>
associate metadata or duplicate annotations b/t homeworks	4	<match between system & real world>
checklist for major items in the grading guideline	4	<flexibility and efficiency of use> <error prevention>
concern about contrast of adjacent colors	4	<aesthetic and minimalist design>
extend an annotation	4	<user control and freedom>
find feature provide feedback if nothing is found	4	<flexibility and efficiency of use> <user control and freedom>
invoke search function w/ right click	4	<flexibility and efficiency of use> <user control and freedom>
scroll easily w/in annotation window	4	<flexibility and efficiency of use>
use shift/control to highlight w/o mouse	4	<flexibility and efficiency of use>
ability to "over annotate for separate problems in same line/s of code	3	<user control and freedom> <error prevention>
ability to link annotations together into themes	3	<user control and freedom>
assign colors according to "type" of annotation	3	<visibility of system status>
concern about contrast of adjacent colors	3	<not classified>
duplicate and edit a line of code and incorporate that into an annotation	3	<match between system & real world>
edit annotation directly in right pane	3	<flexibility and efficiency of use> <aesthetic & minimalist design> <visibility of system status>
effective pg-up & pg down	3	<flexibility and efficiency of use>
extend an annotation	3	<user control and freedom>
grading tool should add up points	3	<match between system & real world>
highlight non-text spaces	3	<user control and freedom>
incorporate output and append to annotations	3	<user control and freedom>
must click OK button/no keyboard alternate	3	<flexibility and efficiency of use>
warning box based on checklist	3	<not classified>
when highlighting, use ctrl-shift not button in button bar	3	<flexibility and efficiency of use>
annotation window match editors (vi)	2	<flexibility and efficiency of use> and <error prevention>
avoid OK button by activating another window w/ mouse	2	<not classified>
checklist for major items in the grading guideline	2	<not classified>
concern about contrast of adjacent colors	2	<flexibility and efficiency of use>
edit annotation directly in right pane	2	<user control and freedom>
edit annotations by right clicking	2	<flexibility and efficiency of use>
extend an annotation	2	<flexibility and efficiency of use>
grading tool should add up points	2	<flexibility and efficiency of use>
grading tool should add up points	2	<user control and freedom>
integration of the grading guideline	2	<match between system & real world>
integration of the grading guideline	2	<not classified>
lack of visual cue in code when clicking on annotation in right screen	2	<recognition rather than recall>
lack of visual cue in code when clicking on annotation in right screen	2	<visibility of system status>
must click OK button/no keyboard alternate	2	<flexibility & efficiency of use>
run code inside annotation tool	2	<match between system and the real world>
scroll easily w/in annotation window with keyboard	2	<flexibility and efficiency of use>
ability to add general comments	1	<not classified>
left bar too distracting, should be white	1	<aesthetic and minimalist design>
run code inside annotation tool	1	<flexibility and efficiency of use>
selectively undo highlighted text w/o losing other multiple line highlights	1	<flexibility and efficiency of use>
when highlighting, grab a whole word	1	<flexibility and efficiency of use>
integration of the grading guideline	0	<flexibility and efficiency of use> <match between system & real world>
line numbers don't contrast enough	0	<flexibility and efficiency of use>
draw circle to mark text rather than highlighting	not rated	<not classified>
must click OK button/no keyboard alternate	not rated	<not classified>
edit annotation directly in right pane	n/r	<not classified>
extend an annotation	n/r	<not classified>
incorporate screen shots of output	n/r	<not classified>
use shift/control to highlight w/o mouse	n/r	<not classified>

3.2.5 Discussion

The five heuristic evaluations provided excellent feedback about a number of specific usability issues that the developers of a prototype grading tool must consider. A number of key issues were identified, and each evaluator discovered some usability issues that were unique and some that were also identified by at least one other evaluator. The individual usability issues are identified above in each of the three tables above. In addition to these individual issues, however, the data revealed several significant patterns.

First, one of the big issues appears to be allowing full functionality from the keyboard in addition to the mouse. All five evaluators mentioned wanting to be able to use the keyboard to perform specific tasks that the current grading tool prototype only allows via direct manipulation of buttons, task bars, or text insertion of the mouse. Table 6 illustrates the depth of this concern. Evaluators collectively identified keyboard-related usability issues eight times. Evaluators described their concern about keyboard functionality in a variety of ways including:

1 reference to wanting the annotation window to match keyboard-based editors like (vi)
3 references to being forced to use the mouse to click OK with no obvious keyboard alternate
1 reference to be able to scroll easily in the annotation window using the arrow keys
2 references wanting the ability to use control-key combinations to highlight without the use of the mouse
1 reference to wanting the ability to use a control-key combination in conjunction with the mouse to highlight to avoid leaving the code-text to hit a button on the menu-bar with the mouse when highlighting multiple lines of text.

One evaluator explicitly stated at the beginning of the heuristic evaluation that they were "keyboard oriented" and another evaluator reiterated several times that keyboard functionality is very important for graders' ability to complete their work. Clearly, added keyboard functionality to the prototype grading tool will need to be addressed.

A second pattern in the data noted difficulties in the ways the prototype grading tool allows graders to apply annotations to multiple sections of code. Four out of five evaluators indicated that they would like to be able to go back and apply an additional segment of highlighted code to an existing annotation already applied to a previous section of code. One evaluator emphasized the importance of this functionality by stating that "when reading the program [code] sequentially, you might get halfway down, recognize a pattern, and then go back and link previous annotation to different lines of code." While the current version of the prototype grading tool allows the user to apply different sections of code to the same annotation, the separate sections of code must be highlighted together when the annotation is created. The prototype grading tool will need to provide the functionality to "extend an annotation" to different pieces of code after the annotation is created. Such additional functionality will also have the benefit of supporting users with a wider range of grading styles supporting not only those that read the code sequentially, but also those graders that skip around the code skimming for particular grading issues.

A principle of 'direct manipulation' represents a third pattern present in the data. Multiple evaluators identified the forced use of the annotation window for editing purposes as a potentially inhibiting feature of the prototype grading tool. To edit an annotation, users are forced to double click on the annotation, which then brings up an annotation window in which the user can modify the text of the annotation. Four out of five evaluators expressed a preference to edit an annotation directly in the right pane, where the annotations are displayed on the screen. There was not agreement, however, on either the severity or the classification of this usability issues. In Table 8, this usability issue appears in three different places: two evaluators rating its severity as a "2", one evaluator rating its severity as a "3" and another evaluator who did not rate the usability issue. The usability was classified in several places in Table 7 including: flexibility & efficiency of use, aesthetic & minimalist design, visibility of system status, and a "not classified" category. This diversity of severity ratings and classifications into Nielsen heuristic categories also appeared with other usability issues mentioned by multiple evaluators (see Table 7 and Table 8).

Other important usability issues identified by multiple evaluators included:

concern about the lack of contrast of adjacent colors used to associate a highlighted piece of code to an annotation.
additional visual cues associating a particular annotation with the code to which it is referring (especially when editing or clicking on a specific annotation)
a desire for some integration of the grading guideline
the ability to annotate output of program code
providing additional shortcuts in the form of hot-keys or right clicks or some combination thereof.
functionality to tabulate scores/points

Complementing these patterns present in the data discussed above, some evaluators offered additional feedback that emphasized not only a desire for a particular feature but also provided insight into the larger issues affecting graders of CSE 142 and CSE 143 homework assignments. For example, two evaluators mention that they would like to be able to run the actual code of an assignment within the grading system. Initial design decisions for the prototype grading tool emphasized keeping the tool as simple as possible. This meant providing necessary functionality without duplicating any processes or tasks that are available in an operating system or the general user environment. Observations of graders grading actual assignments conducted earlier in the research process revealed that graders routinely use multiple windows in their user environment. The graders also indicated during those initial observations that having multiple windows was not a significant barrier to their grading tasks and that they were experienced at manipulating multiple windows. However, one evaluator made a follow-up comment to the desire to be able to run student code within the grading system. "...some degree of integration would be nice...I definitely don't want to get out of synch when I am grading a piece of code but running someone else's code [in another window that did not get updated]." This evaluator appears to feel that some form of integration of the execution of student code was appropriate in the grading tool. The evaluator referred to this integration as a "safety mechanism."

This desire for a "safety mechanism" seemed to dovetail well with another general statement made by a different evaluator. "The most tedious thing about grading an assignment is that we do 50, we never do just one." The implication here is that graders of CSE 142 and CSE 143 student homework assignments are constantly juggling a number of homework assignments in their section. This indicates some desire on the part of graders to have "help" managing the multiple components (e.g. grading guideline, homework assignment instructions, student executable code, uncompiled student code, grading comments/annotations) involved in the grading process.

Lastly, two evaluators expressed a desire to be able to annotate the output of the executable program code. One evaluator mentioned that there are times when the grader wants to show relations between code and output. Another mentioned that they could provide annotations to different test cases that they were using to illustrate nuances to the design of the code for particular data sets. Both of the evaluators classified this usability issue as related to "user control and freedom" and rated it very high with severity ratings of level "3" and level "4" respectively.

4 Future System Design and Development

We believe that the following points of emphasis for future system design and development emerge from the wide range of results obtained in our studies:

An continuing, overall focus on annotation and navigation. To date, the prototype developer chose to focus on what he considered essential functionality for the grader's interface: annotating and navigating code. The observations and survey data (Sections 2.1, 2.2) independently suggest that reading and annotating code are the grader's primary concerns, and continuing to focus on functionality to support these subtasks seems advisable.

More complete support of the work process. In addition to the focuses discussed above, new categories of functionality might include the ability to run submitted programs from the tool and possibly annotate output, since both the usability testing and heuristic evaluation results indicated demand for this. The results also suggest experimenting with some level of grading guideline integration, minimally allowing per annotation point value entry and grade calculation.

Balancing flexibility with utility. As was discussed in the overview (Section 1.1), there are a variety of reasons why flexibility is an essential aspect of the system's design. There are many factors which impact the system's usefulness, and even if the system does not address all of them at once, it should be flexible enough to facilitate adapting or extending the system to do so. Examples of such factors and how the tool might address them are as follows:

Department-specific factors, e.g. programming language. Even within UW CSE, the upcoming transition to Java will mean that the tool must support working with Java code submissions, requiring additional functionality to syntax highlight and navigate code in this language.
Instructor/head TA-specific factors, e.g. grading guideline style. If the system is to directly integrate a grading guideline, it might require that course staff produce them in a certain format, but the system must accommodate a wide range of grading philosophies.
Grader-specific factors, e.g. keyboard/mouse preference and keyboard shortcut standards. This was a significant issue, judging from the results of the usability testing and heuristic evaluation. The primary goal of supporting keyboard-based operation whereever possible is not as difficult as supporting the wide variety of keyboard shortcut standards in our participant pool, e.g. the vi editor, the emacs editor, and Windows standards. The fact that certain participants requested certain hybrids of the standards raises further questions of how much a system can accommodate these preferences before becoming too umanageable for any one user.

Looking further ahead, still, a somewhat analogous but separate set of usability studies is warranted for the other major interface for this system, the student's view of graded homework and feedback. Unlike the grader's interface, which is inherently constrained to one media (computer), designing the student's interface is complicated by the possibility that students will prefer a choice from a range of media for viewing graded homework, e.g. printed on paper, as a web page, through a specialized navigator application. Concerns such as privacy will become more apparent as the grader's and student's interfaces are integrated.

Acknowledgements

The authors would like to thank the CSE 142/143 TAs who generously volunteered their time to participate in these studies. Ken Yasuhara would also like to acknowledge the support and feedback provided by his advisor Richard Anderson and still more 142/143 TAs who shared their ideas and advice.

Appendices

The documents linked here are in HTML or Microsoft Word 97 format.

Appendix A: Survey
Appendix B.1: Usability Testing Pre-Test Checklist
Appendix B.2: Usability Testing Instructions
Appendix B.3: Usability Testing Task List
Appendix B.4: Usability Testing Data Recording Form
Appendix B.5: Usability Testing Post-Test Questionnaire
Appendix B.6: Usability Testing Post-Test Interview
Appendix C.1: Heuristic Evaluation Task Sheet
Appendix C.2: Heuristic Evaluation Heuristics List
Appendix C.3: Heuristic Evaluation Data Recording Form
Appendix C.4: Heuristic Evaluation Severity Rating Scale

References

Beyer, H. & Holtzblatt K. (1998). Contextual Design. Morgan Kaufmann: San Francisco.

Nielsen, J. (1995). Technology Transfer of Heuristic Evaluation and Usability Inspection. Keynote Address: International Conference on Human-Computer Interaction (Lillehammer, Norway), June 27.

Nielsen, J. (1993). Usability Engineering. Academic Press: Boston.

Nielsen, J. (1994). Heuristic Evaluation. In Usability Inspection Methods. Nielsen, J. & Mack, R. L. (Eds). John Wiley & Sons, New York. pp. 25-62.

Nielsen, J. & Molich, R. (1990). Heuristic Evaluation of User Interfaces. Proceeds ACM CHI (Seattle, WA, April 1-5). pp. 249-256.

Nielsen, J. How To Conduct A Heuristic Evaluation [http://www.useit.com/papers/heuristic/heuristic_evaluation.html] Accessed 4/1/01.

Nielsen, J. Characteristics Of Usability Problems Found By Heuristic Evaluation. [http://www.useit.com/papers/heuristic/usability_problems.html] Accessed 4/1/01.

Nielsen, J. Ten Usability Heuristics. [http://www.useit.com/papers/heuristic/heuristic_list.html] Accessed 4/1/01.

Nielsen, J. Severity Ratings for Usability Problems. [http://www.useit.com/papers/heuristic/severityrating.html] Accessed 4/1/01.