CSE403 Software Engineering, Autumn 1999
Lecture #17 Notes
Thursday November 4th, oral progress reports
Friday November 5th, written progress reports
Monday November 8th we’ll do another round of TAG 2000 group reviews. The idea is that you again have one hour of consulting time to get immediate feedback on your project. Please come prepared to discuss in-depth design details.
Who tests the tests, especially a large complicated test? For example, if your triangle test program generates random data, who confirms the results? Another example is testing trig functions.
Testing the error cases can involve a wider set of inputs. You have two problems: (1) making sure you have proper test coverage and (2) making sure the results are correct.
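One way to approach the "who confirms the results" question is to check the code under test against something simpler and independent. The sketch below is only an illustration, not the course's triangle program: classify_triangle, oracle, and both driver functions are hypothetical names. The random triangle data is checked against a second classifier written a different way, and the trig functions are checked against an identity that must hold for every input.

```python
import math
import random

def classify_triangle(a, b, c):
    """Code under test (hypothetical): classify by sorted side lengths."""
    x, y, z = sorted((a, b, c))
    if x <= 0 or x + y <= z:
        return "invalid"
    if x == z:
        return "equilateral"
    if x == y or y == z:
        return "isosceles"
    return "scalene"

def oracle(a, b, c):
    """Independent, deliberately simple check written a different way."""
    if min(a, b, c) <= 0:
        return "invalid"
    if a + b <= c or a + c <= b or b + c <= a:
        return "invalid"
    distinct = len({a, b, c})
    return {1: "equilateral", 2: "isosceles", 3: "scalene"}[distinct]

def run_random_triangle_test(trials=10000, seed=403):
    """Return every random input on which the two classifiers disagree."""
    rng = random.Random(seed)  # fixed seed so any failure reproduces
    mismatches = []
    for _ in range(trials):
        sides = tuple(rng.randint(-2, 12) for _ in range(3))
        if classify_triangle(*sides) != oracle(*sides):
            mismatches.append(sides)
    return mismatches

def check_trig_identity(trials=10000, seed=403, tol=1e-12):
    """Check sin^2(x) + cos^2(x) = 1 on random inputs; return (ok, worst error)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = rng.uniform(-100.0, 100.0)
        err = abs(math.sin(x) ** 2 + math.cos(x) ** 2 - 1.0)
        worst = max(worst, err)
    return worst <= tol, worst
```

Note that the oracle can be wrong too. A disagreement only tells you that one of the two is wrong, and an identity check only catches errors that break the identity.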
Fault injection is another way of testing systems. For example, injecting I/O failures in a disk controller can test the error cases for the disk driver and file system. Another example is injecting memory allocation errors, to see how programs behave when they run out of memory.
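As a rough sketch of the idea, assume a toy in-memory "disk" that can be told to fail a chosen write. FaultyDisk, save_records, and sweep_fault_points are made-up names for illustration, not a real driver interface. The test injects an I/O error at every possible write position and checks that the error path leaves the disk unchanged.

```python
import errno

class FaultyDisk:
    """Hypothetical in-memory 'disk' that fails the Nth write on request."""
    def __init__(self, fail_on_write=None):
        self.blocks = []
        self.writes_seen = 0
        self.fail_on_write = fail_on_write  # None means never fail

    def write(self, data):
        self.writes_seen += 1
        if self.writes_seen == self.fail_on_write:
            raise OSError(errno.EIO, "injected I/O error")
        self.blocks.append(data)

def save_records(disk, records):
    """Code under test: write all records or leave the disk unchanged."""
    checkpoint = len(disk.blocks)
    try:
        for record in records:
            disk.write(record)
    except OSError:
        del disk.blocks[checkpoint:]  # roll back the partial write
        return False
    return True

def sweep_fault_points(records):
    """Inject a failure at every write position and record what happened."""
    results = []
    for n in range(1, len(records) + 1):
        disk = FaultyDisk(fail_on_write=n)
        ok = save_records(disk, records)
        results.append((n, ok, list(disk.blocks)))
    return results
```

The same pattern applies to allocation errors: wrap the allocator, fail the Nth request, and sweep N across the run.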
A bug can be anything from odd or unexpected behavior to system crashes or wrong answers.
Not everything is a bug and people often disagree about particular bugs. Not only do they disagree about whether a particular behavior is a bug, they also disagree fervently about where and how the problems should be fixed.
All large system projects ship with bugs. It cannot be avoided. After a product ships all the bugs become features!
To paraphrase Dave Cutler, “if you don’t put them in then you don’t have to take them out.”
We discover bugs in many ways, from normal usage, to regression testing, to stress and guerilla testing. We need to find and remove bugs early; otherwise other code can come to depend on the bugs’ behavior.
By the time the product gets into the hands of the customer it is often too late to fix problems. But you should still expect customers to help find bugs; that’s one reason why there are beta programs.
A simple bug matrix
Stress and guerilla testing
Odd or unexpected behavior (including performance)
System crash or the wrong answer
Bugs as defined earlier can be errors, defects, faults, or failures
Some bugs are easy to find while a lot of bugs are hard to find
First we have to demonstrate the existence of a particular bug. This can happen through:
1) System crashes usually highlight a bug someplace
2) Application or APIs that return the wrong or unexpected answer
3) Normal usage (“dogfood”) can illustrate bugs
4) Code reviews can uncover bugs
5) Various test scenarios can expose bugs
a) Usage tests
b) Stress tests
c) Validation tests
d) Regression tests
e) Black box and white box testing
f) Fault injection (for example, I/O errors, allocation errors)
Trying to get the problem to reproduce is sometimes the hardest part. There are a lot of jokes about software engineers always wanting to see if the problem reproduces.
The main problem with finding bugs is that it is not always obvious from the crash or incorrect behavior where the bug is located
Narrowing down the problem is often very hard. Sometimes a private build is all that is needed to find the problem, but other times we need to add special catcher code to the public build.
Once we have an idea of a bug’s existence we have various ways to find or catch it
1) Checked builds with extra sanity checking such as asserts. The good thing about doing this is that when you originally write the program you can add consistency checks. The bad part of doing this is that you wind up with two systems, almost the same but not really
2) Procedure call tracing and PC tracing can be used to see where programs are spending their time and subsequently help in understanding the behavior of the program
3) Watch points and break points are a great debugging aid when available. Sometimes they are not available.
4) Filling freed and uninitialized memory with a known bit pattern can help identify code that touches memory after it is freed or uses uninitialized memory. Patterns such as deadbeef and baadf00d are usable
5) Timing problems are very hard to find, especially on an MP machine. Here are some examples I’ve seen
a) Cache and MM problems: resolving and paging in files has a lot of timing issues
b) Memory leaks in a single threaded application are hard enough to locate, but add a multi-threaded MP application and it becomes even harder. One technique is to keep tracing information for each allocated and freed item
c) Probing problems are doubly hard to tackle when you can have an application that remaps memory used by another thread
d) On NT, RtlZeroMemory and RtlCopyMemory on a MIPS architecture used the floating point registers to speed them up. However, the floating point registers were not being saved on an interrupt, so if either operation was called in an interrupt handler problems could and did occur. We stumbled upon this case when RtlZeroMemory was being interrupted and the interrupt handler also called RtlCopyMemory. This was not an MP problem but really an interrupt problem.
6) Static deadlocks are usually easy to identify (fixing is another matter) but dynamic deadlocks and priority inversion are a lot harder to identify. This is really part of another topic: “Is the system hung or just slow?”
7) Pointer bias on RISC machines can catch code going through pointers it shouldn’t be using
8) Keeping page zero invalid also makes referencing through NULL essentially a runtime error, but not always, because a large offset from NULL can land past page zero.
9) On problems that are not easily reproducible we sometimes need to add code in the dogfood system that runs on everyone’s machine
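The fill-pattern technique from item 4 can be sketched with a toy heap. This is a simulation over a byte array, assuming fixed-size blocks and 4-byte little-endian words; ToyHeap and looks_suspicious are hypothetical names, not a real allocator. Freed blocks are stamped with deadbeef and freshly allocated blocks with baadf00d, so a stale or uninitialized read returns a value that stands out in a debugger.

```python
import struct

FREED_PATTERN = 0xDEADBEEF    # written over memory when it is freed
UNINIT_PATTERN = 0xBAADF00D   # written over memory when it is allocated
WORD = 4                      # bytes per word in this toy model

class ToyHeap:
    """Hypothetical fixed-size-block heap backed by a bytearray."""
    def __init__(self, blocks, words_per_block=4):
        self.block_size = words_per_block * WORD
        self.memory = bytearray(blocks * self.block_size)
        self.free_list = list(range(blocks))
        for block in self.free_list:
            self._fill(block, FREED_PATTERN)

    def _fill(self, block, pattern):
        start = block * self.block_size
        words = self.block_size // WORD
        self.memory[start:start + self.block_size] = struct.pack("<I", pattern) * words

    def alloc(self):
        block = self.free_list.pop()
        self._fill(block, UNINIT_PATTERN)  # caller has not initialized it yet
        return block

    def free(self, block):
        self._fill(block, FREED_PATTERN)   # any later read sees the pattern
        self.free_list.append(block)

    def read_word(self, block, index):
        offset = block * self.block_size + index * WORD
        return struct.unpack_from("<I", self.memory, offset)[0]

    def write_word(self, block, index, value):
        offset = block * self.block_size + index * WORD
        struct.pack_into("<I", self.memory, offset, value)

def looks_suspicious(value):
    """True if a value read from the heap matches one of the fill patterns."""
    return value in (FREED_PATTERN, UNINIT_PATTERN)
```

In a checked build the fill happens on every allocation and free. That extra work is part of why the checked and free builds are almost, but not quite, the same system.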
Here are some of the things I’ve done or seen done to find the bug
1) Keep a history of every allocation and freeing of memory
2) Keep a history of every file operation
3) Add special case code to see if a particular file name is ever used. In one particular MP cache problem we had to add code to check the contents of the buffer we were writing out
4) Every time an internal data structure is used or altered we do pre and post examinations of the structures
5) Once had the debugger dump all of physical memory, searching for a pattern
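Item 1, keeping a history of every allocation and free, might look something like the following. This is a single-threaded sketch with hypothetical names (AllocationHistory, leaky_worker); it hands out integer handles instead of real memory, and an MP version would need a lock around the bookkeeping. Each operation is logged with the function and line that made the call, so whatever is still live at the end points back to where it was allocated.

```python
import traceback

class AllocationHistory:
    """Hypothetical tracker: logs every alloc/free and who called it."""
    def __init__(self):
        self.live = {}        # handle -> call site that allocated it
        self.log = []         # (operation, handle, call site) in order
        self.next_handle = 1

    def _call_site(self):
        # Innermost three frames are [caller, alloc-or-free, _call_site];
        # index 0 is the code that asked for the allocation or free.
        frame = traceback.extract_stack(limit=3)[0]
        return f"{frame.name}:{frame.lineno}"

    def alloc(self):
        handle = self.next_handle
        self.next_handle += 1
        site = self._call_site()
        self.live[handle] = site
        self.log.append(("alloc", handle, site))
        return handle

    def free(self, handle):
        site = self._call_site()
        self.log.append(("free", handle, site))
        if handle not in self.live:
            raise ValueError(f"free of unallocated handle {handle} at {site}")
        del self.live[handle]

    def leaks(self):
        """Handles that were allocated and never freed, with their call sites."""
        return dict(self.live)

def leaky_worker(history):
    """Example client that deliberately forgets to free one allocation."""
    kept = history.alloc()
    dropped = history.alloc()   # never freed: the deliberate leak
    history.free(kept)
    return dropped
```

The same log also catches double frees, since the second free finds the handle already gone from the live table.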
Sometimes it’s a compiler bug. Ugh!
Sometimes it’s an operator or hardware error
1) Unplugging the SCSI disk in the middle of an operation is typically pretty bad
2) I’ve seen improperly terminated buses cause unreported I/O errors
3) Non-parity memory does go bad without warning
4) One MIPS hardware error had to do with branch instructions at the end of a page