Fixing the bugs

1) Programming is simple, getting it to work is hard

Whereas good programming practices and standards have gained some legitimacy and helped popularize writing computer programs, debugging and fixing computer programs is still very much an art. There are a host of different “tricks” to use, but probably the best way to learn to discover and fix bugs is to do it a lot and to be around people who have done it a lot.

2) How do we fix bugs (and avoid bugs with defensive programming)

Besides being easy or hard to find, some bugs are easy to fix and some are very hard to fix.

How hard a bug is to find does not necessarily relate to how easy or hard it is to fix.

Often fixing a bug exposes other bugs because:

a) The system can now run longer

b) The system behavior is changed by the fix

Bugs need to be prioritized. That is, we sometimes need to choose which bugs to fix and which to ignore. We need to weigh the likelihood of the bug being encountered against the risk of the fix.

Sometimes the only thing we can do is document and warn users about the bug.

Whether you should fix a bug is largely a matter of risk analysis. The severity of the bug (both how bad it is and how likely it is to be encountered) must be weighed against the risk of doing the fix. Some classic risks are:

a) Breaking something else

b) Slipping the schedule

c) The need to retest the system

The closer you get to shipping the product, the choosier you are about which bugs to fix. Even the simplest fix can have unforeseen consequences. For example, a simple fix to allow larger heaps can cause problems by chewing up too much memory.

3) Some categories of bug fixes I have known

None of this is meant to be exhaustive. It is merely meant as some examples of debugging and code fixing techniques I’ve used or seen used.

Some easy-to-fix bugs are simple off-by-one errors or a failure to account for some end condition. The fix is usually localized and easily understood. I’ve sometimes looked at code that had been running for a few years without problems, only to see the bug and wonder how the system ever ran as long or as well as it did.
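A minimal sketch of both kinds of bug, in C. The function names are made up for illustration and do not come from any real system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum the first n elements of a[]. */
long sum_first_n(const int *a, size_t n)
{
    long total = 0;

    /* The classic off-by-one is writing `i <= n` here, which reads a[n],
       one element past the end. It can go unnoticed for years if that
       slot happens to hold zero. */
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/* Hypothetical helper: return the last element, or fallback if empty.
   The end condition is the empty array: with size_t, `n - 1` wraps
   around to a huge index when n == 0. */
int last_element_or(const int *a, size_t n, int fallback)
{
    return (n == 0) ? fallback : a[n - 1];
}
```

In both cases the fix is a one-line change confined to a single function, which is what makes this category easy.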

Some hard-to-fix bugs span multiple modules, sometimes in rather fragile code, where the fix needs to be thoroughly considered because its ramifications are not always well understood. For example, security fixes often expose other problems.

Speaking of security, security holes are bugs. A common security bug is not properly capturing and probing parameters. This can cause problems with both naive application programming errors and malicious applications. Simply capturing pointers and probing user buffers may not be enough to stop users from re-mapping memory.
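The idea of capturing a parameter can be sketched in ordinary user-mode C. Everything here (the request structure, MAX_REQUEST, handle_request) is hypothetical. A real kernel would also probe the caller's addresses with its own routines and guard the access against faults, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REQUEST 64  /* hypothetical size limit */

/* Hypothetical request block that lives in caller-owned memory. */
struct request {
    size_t length;
    char   data[MAX_REQUEST];
};

/* Returns 0 on success, -1 if the request is rejected. */
int handle_request(const struct request *user, char *out, size_t out_size)
{
    /* Capture: read the caller's length exactly once into a local. */
    size_t length = user->length;

    /* Validate the captured copy, never the caller's copy. */
    if (length > MAX_REQUEST || length > out_size) {
        return -1;
    }

    /* Use only the captured value from here on. Re-reading user->length
       would let the caller change it between the check and the copy. */
    memcpy(out, user->data, length);
    return 0;
}
```

The point of the local copy is that the value checked and the value used are guaranteed to be the same one.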

Sometimes a bug is located in very old code that everyone is reluctant to touch. For example, the bitmap package on NT was written back in 1989 and optimized for MM and file system allocation. Just recently some people wanted to use the bitmap in a different way when searching for zero bits. They really need to think long and hard before altering this code. Besides outright breaking things, a change could have serious performance implications.

Deadlock problems, once identified, should in principle be easy to fix. Having a set order for acquiring locks is important. For example, mutex levels are a great help if using mutexes. But a set order of acquisition is not always practical. For example, page fault recursions can cause MM resources and Cache manager resources to be acquired recursively and out of order.
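One way to picture mutex levels is as a check made at acquire time. The sketch below shows only the level bookkeeping, with hypothetical names. It assumes locks are released in the reverse of the order they were taken, and it leaves out the underlying mutex entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HELD 8  /* hypothetical limit on nested locks */

/* Hypothetical per-thread record of the lock levels currently held,
   in acquisition order. */
struct held_levels {
    int levels[MAX_HELD];
    int count;
};

/* Rule: a lock may be taken only if its level is strictly greater than
   the level of every lock already held. Because every acquisition obeys
   the rule, only the most recent level needs to be checked. */
bool level_acquire(struct held_levels *h, int level)
{
    if (h->count == MAX_HELD) {
        return false;                      /* too many nested locks */
    }
    if (h->count > 0 && level <= h->levels[h->count - 1]) {
        return false;                      /* would break the set order */
    }
    h->levels[h->count++] = level;
    /* A real implementation would acquire the underlying mutex here. */
    return true;
}

void level_release(struct held_levels *h)
{
    assert(h->count > 0);                  /* releasing a lock never taken */
    h->count--;
    /* ...and release the underlying mutex here. */
}
```

In a debug build a failed check would more likely stop in the debugger than return false, so the out-of-order acquisition is caught the first time it happens rather than the first time it deadlocks.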

To fix a priority inversion we sometimes need to identify the low priority thread holding a resource and give it a quantum priority boost.

Know the range of your parameters. Underflow, overflow, and loss of precision can cause unanticipated problems. For example, the triangle identification program mentioned in a previous lecture could, if not written carefully, suffer from an overflow problem.
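As a sketch of how such an overflow might arise, assume the triangle program takes three int side lengths and checks the triangle inequality. The function name is hypothetical and this is not the code from the earlier lecture.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical check: can a, b, c be the sides of a triangle? */
bool is_triangle(int a, int b, int c)
{
    if (a <= 0 || b <= 0 || c <= 0) {
        return false;
    }

    /* The obvious test, a + b > c, overflows when a and b are large,
       and signed overflow is undefined behavior in C. Once all three
       sides are known to be positive, a difference of two sides cannot
       overflow, so each inequality is rearranged into a subtraction. */
    return a > c - b    /* a + b > c */
        && b > a - c    /* b + c > a */
        && c > b - a;   /* c + a > b */
}
```

With this form, three sides of INT_MAX are correctly accepted, where the additive test would overflow on them.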

4) Sometimes we break all the rules to fix a bug

We might need to look into some hidden data structures to make things work. For example, the NT kernel does not formally export its wait queue structures, but calling wait can sometimes have a disastrous effect. So every so often we have modules that glance at the queue behind the kernel’s back.

Fixes in this category are usually put in near the end of the project, when the risk of applying a more appropriate fix is too great. A good example of this is when simple “fix-up” code is added to readjust a data structure that has gone awry.
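Fix-up code of this kind might look like the following sketch, which assumes a hypothetical list whose cached element count has drifted away from the real one.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
};

/* Hypothetical list that caches its element count. */
struct list {
    struct node *head;
    size_t       count;
};

/* Recount the list and repair the cached count if it is wrong.
   Returns 1 if a repair was needed, 0 otherwise. */
int list_fixup_count(struct list *l)
{
    size_t actual = 0;

    for (const struct node *n = l->head; n != NULL; n = n->next) {
        actual++;
    }
    if (actual == l->count) {
        return 0;
    }

    /* This hides the symptom only. Whatever corrupted the count is
       still in the code somewhere. */
    l->count = actual;
    return 1;
}
```

Returning whether a repair happened at least lets the caller log it, so the real bug is not forgotten after the product ships.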

5) Sometimes the fix is simply to mask over someone else’s problem

We might also break a clean design just to add special purpose code. Running legacy applications causes this to happen a lot. For example, allocating zero bytes should be illegal, but if legacy applications do it, the allocator may have to tolerate it.
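The zero-byte case can be sketched as a thin wrapper over the C library allocator. The name compat_alloc is hypothetical. Rounding the request up to one byte is just one possible policy, chosen here because standard C leaves the result of malloc(0) up to the implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical compatibility wrapper around malloc. */
void *compat_alloc(size_t size)
{
    /* A clean design would reject size == 0 as a caller bug. The
       special purpose code below exists only so that legacy callers
       get back a valid pointer they can later free. */
    if (size == 0) {
        size = 1;
    }
    return malloc(size);
}
```

The cost is that the allocator now masks a bug in every caller, new ones included.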

6) Defensive programming

Defensive programming can fix or hide a lot of faults, and can also identify problems. For example, setting pointers to null after freeing them will catch a lot of problems while the program is being debugged. Note that this needs a shift in the usual programming API for freeing memory: you need to pass the address of the pointer to the memory being freed. So instead of

Free( ptr );

it is

Free( &ptr );

Macros can help with this. After the product ships, this could be turned into a defensive mechanism. Another example is encoding pointers so that errant code causes a fault.
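A minimal sketch of both ideas in standard C follows. All of the names (FreeAndNull, FREE_AND_NULL, EncodePtr, DecodePtr, PTR_KEY) are hypothetical, and the fixed XOR key is an assumption made for illustration. A real system would more likely pick the key at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Function form: the caller passes the address of its pointer. */
void FreeAndNull(void **pp)
{
    if (pp != NULL) {
        free(*pp);      /* free(NULL) is a no-op, so a second call is safe */
        *pp = NULL;
    }
}

/* Macro form: works on a pointer of any type without a cast to void **.
   Note that it evaluates p twice, so p must not have side effects. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Pointer encoding: store the pointer XORed with a key. Code that uses
   the stored value without decoding it gets a garbage address, which is
   likely to fault instead of quietly touching valid memory. */
#define PTR_KEY ((uintptr_t)0xA5A5A5A5u)  /* assumed fixed key */

static inline uintptr_t EncodePtr(void *p)
{
    return (uintptr_t)p ^ PTR_KEY;
}

static inline void *DecodePtr(uintptr_t encoded)
{
    return (void *)(encoded ^ PTR_KEY);
}
```

With the macro, a call site reads FREE_AND_NULL(ptr); and a later stray use of ptr dereferences null, which faults immediately rather than silently reusing freed memory.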