Software Experimentation

accept that it’s probably your code’s fault
reproduce your bug quickly
start doing experiments
change one thing at a time
check your assumptions
write your code so it’s easier to debug
error messages are better than silently failing
understand what the error messages mean

What does debugging a program look like?

This reading reproduces most of Julia Evans’ blog post, What does debugging a program look like?¹

There will probably be some jargon you’re not familiar with since she discusses debugging in a variety of programming contexts.² That’s okay. Here’s a glossary of the particularly pertinent concepts that we’ll be exploring in class.

Unit Test: An automated software test that checks that a single behavior works as expected. We might write one unit test to check that add behaves correctly, another unit test to check that remove behaves correctly, and so forth.
Library: A collection of resources used to support software development. For example, ArrayList is part of the Java standard library.
Debugger: A tool that can pause a program at any point during execution, allowing the programmer to inspect the exact values of different variables.

accept that it’s probably your code’s fault

Sometimes I see a problem and I’m like “oh, library X has a bug”, “oh, it’s DNS”, “oh, SOME OTHER THING THAT IS NOT MY CODE is broken”. And sometimes it’s not my code! But in general between an established library and my code that I wrote last month, usually it’s my code that I wrote last month that’s the problem :).

reproduce your bug quickly

Everybody agrees that being able to consistently reproduce a bug is important if you want to figure out what’s going on. I have an intuitive sense for how to do this but I’m not sure how to explain how to go from “I saw this bug twice” to “I can consistently reproduce this bug on demand on my laptop”, and I wonder whether the techniques you use to do this depend on the domain (backend web dev, frontend, mobile, games, C++ programs, embedded etc).

Everybody also agrees that it’s extremely useful to be able to reproduce the bug quickly (if it takes you 3 minutes to check whether each change helped, iterating is VERY SLOW).

A suggested approach: write a unit test that reproduces the bug (if you can). Bonus: you can add this test to your test suite later if it makes sense.
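As a minimal sketch of that approach in Python, suppose a hypothetical `average` function used to crash on an empty list. The function, its name, and the decision that an empty list should average to 0.0 are all assumptions made up for illustration; the point is only that the test reproduces the bug report in a fraction of a second and can stay in the test suite afterwards.

```python
import unittest


def average(numbers):
    """Return the mean of numbers, or 0.0 for an empty list (assumed spec)."""
    if not numbers:
        # The fix. The earlier (hypothetical) version divided by len(numbers)
        # unconditionally and raised ZeroDivisionError for an empty list.
        return 0.0
    return sum(numbers) / len(numbers)


class AverageTest(unittest.TestCase):
    def test_empty_list_regression(self):
        # Reproduces the original bug report with the smallest possible input.
        self.assertEqual(average([]), 0.0)

    def test_typical_list(self):
        # A known-good case, so we notice if the fix breaks normal behavior.
        self.assertEqual(average([1, 2, 3]), 2.0)


# Run just this test case and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AverageTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Before the fix, `test_empty_list_regression` would fail on every run, which is exactly the fast, repeatable reproduction the section asks for.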

start doing experiments

@act_gardner gave a nice, short explanation of what you have to do after you reproduce your bug:

I try to encourage people to first fully understand the bug - What’s happening? What do you expect to happen? When does it happen? When does it not happen? Then apply their mental model of the system to guess at what could be breaking and come up with experiments.
Experiments could be changing or removing code, making API calls from a REPL, trying new inputs, poking at memory values with a debugger or print statements.

I think the loop here may be:

make a guess about one aspect of what might be happening (“this variable is set to X where it should be Y”, “this code is never running at all”)
do experiment to check that guess
repeat until you understand what’s going on
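One pass through that loop might look like the following sketch. The `apply_discount` function and the coupon code are hypothetical, invented only to have something to investigate. The guess is "the discount branch is never running for the reported input", and the experiment compares the reported input against a control that differs in exactly one way.

```python
def apply_discount(price_cents, code):
    """Hypothetical function under investigation: 10% off with the right code."""
    if code == "SAVE10":
        return price_cents * 90 // 100
    return price_cents


# Guess: "the discount branch is never running for the input in the bug report".
# Experiment: run the reported input next to a control input that we expect
# to take the branch. Only the letter case of the code differs between them.
reported = apply_discount(10000, "save10")  # input copied from the bug report
control = apply_discount(10000, "SAVE10")   # input we expect to work

print("reported:", reported)
print("control: ", control)
```

If the two results differ, the guess is confirmed and the next guess can be narrower (here, that the comparison is case-sensitive). If they match, the guess was wrong and the loop starts again with a different one.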

change one thing at a time

Everybody definitely agrees that it is important to change one thing at a time when doing an experiment to verify an assumption.

check your assumptions

A lot of debugging is realizing that something you were sure was true (“wait this request is going to the new server, right, not the old one???”) is actually… not true. I made an attempt to list some common incorrect assumptions. Here are some examples:

this variable is set to X (“that filename is definitely right”)
that variable’s value can’t possibly have changed between X and Y
this code was doing the right thing before
this function does X
I’m editing the right file
there can’t be any typos in that line I wrote, it’s just 1 line of code
the documentation is correct
the code I’m looking at is being executed at some point
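Several of these assumptions can be checked directly in code rather than by staring at it. The sketch below is one way to do that in Python; the `normalize` helper and the filename are hypothetical stand-ins for whatever you are sure about in your own program.

```python
def normalize(name):
    """Hypothetical helper that we assume strips whitespace and lowercases."""
    return name.strip().lower()


filename = " Report.TXT\n"

# Assumption: "this variable is set to X".
# repr() shows hidden characters such as trailing newlines or spaces.
print("filename is", repr(filename))

# Assumption: "this function does X".
# The assert fails loudly, showing the actual value, if the belief is wrong.
assert normalize(filename) == "report.txt", repr(normalize(filename))

# Assumption: "I'm editing the right file".
# Every function records the path of the file it was compiled from.
print("normalize is defined in", normalize.__code__.co_filename)

# Assumption: "the code I'm looking at is being executed at some point".
# A print at the top of the suspect code path settles it either way.
print("reached the end of the checks")
```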

write your code so it’s easier to debug

Another point a few people brought up is that you can improve your program to make it easier to debug. tef has a nice post about this: Write code that’s easy to delete, and easy to debug too. I thought this was very true:

Debuggable code isn’t necessarily clean, and code that’s littered with checks or error handling rarely makes for pleasant reading.

I think one interpretation of “easy to debug” is “every single time there’s an error, the program reports to you exactly what happened in an easy to understand way”. Whenever my program has a problem and says something like “error: failure to connect to SOME_IP port 443: connection timeout” I’m like THANK YOU THAT IS THE KIND OF THING I WANTED TO KNOW and I can check if I need to fix a firewall thing or if I got the wrong IP for some reason or what.
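A message like that usually comes from code that catches a low-level failure and re-raises it with the details the reader needs: what was being attempted, and on which resource. The sketch below shows the idea with a file instead of a network connection so that it runs anywhere; `read_settings` and the path are hypothetical.

```python
def read_settings(path):
    """Read a settings file, reporting exactly which file failed and why."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as err:
        # Say what we were doing and which path was involved, and chain the
        # original exception so its traceback is still available.
        raise RuntimeError(
            f"failed to read settings file {path!r}: {err.strerror}"
        ) from err


try:
    read_settings("/no/such/dir/settings.txt")  # hypothetical missing file
except RuntimeError as err:
    print(f"error: {err}")
```

The `from err` clause keeps the underlying exception attached as `__cause__`, so the friendlier message does not hide the original failure.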

error messages are better than silently failing

To get closer to the dream of “every single time there’s an error, the program reports to you exactly what happened in an easy to understand way” you also need to be disciplined about immediately returning an error message instead of silently writing incorrect data / passing a nonsense value to another function which will do WHO KNOWS WHAT with it and cause you a gigantic headache. This means adding code like this:

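# UNEXPECTED_THING is a placeholder for any condition that should never be true.
if UNEXPECTED_THING:
    # Python requires an exception object here; raising a bare string is a TypeError.
    raise RuntimeError("oh no THING happened")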

This isn’t easy to get right (it’s not always obvious where you should be raising errors!) but it really helps a lot.
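As a slightly fuller sketch of the same habit, here is a hypothetical `parse_age` function that refuses bad input at the point where it arrives, instead of returning a placeholder such as -1 that some later function would have to cope with. The function name and the 0 to 150 range are assumptions chosen for illustration.

```python
def parse_age(text):
    """Parse an age in years, raising immediately on bad input."""
    try:
        age = int(text)
    except ValueError:
        # Replace the generic int() message with one that names the field.
        raise ValueError(f"age must be a whole number, got {text!r}") from None
    if not 0 <= age <= 150:
        # Assumed valid range; a nonsense value stops here rather than
        # being written to a file or passed to another function.
        raise ValueError(f"age must be between 0 and 150, got {age}")
    return age


try:
    parse_age("forty")
except ValueError as err:
    print(f"error: {err}")  # error: age must be a whole number, got 'forty'
```

The cost is a few extra lines at the boundary. The benefit is that the failure is reported where the bad value entered the program, not several calls later.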

understand what the error messages mean

One debugging sub-skill that I take for granted a lot of the time is understanding what error messages mean! I came across this nice graphic explaining common Python errors and what they mean, which breaks down things like NameError, IOError, etc.

I think a reason interpreting error messages is hard is that understanding a new error message might mean learning a new concept: NameError can mean “Your code uses a variable outside the scope where it’s defined”, but to really understand that you need to understand what variable scope is! I ran into this a lot when learning Rust. The Rust compiler would be like “you have a weird lifetime error” and I’d be like “ugh ok Rust I get it I will go actually learn about how lifetimes work now!”
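To make the NameError case concrete, here is a small Python sketch. The names `compute_total` and `subtotal` are made up for the example. The variable exists only while the function is running, so using it outside the function triggers the error.

```python
def compute_total():
    subtotal = 40 + 2  # subtotal is local: it exists only inside this function
    return subtotal


compute_total()

try:
    # Outside the function there is no variable called subtotal, even though
    # the function that defines it has already run.
    print(subtotal)
except NameError as err:
    message = str(err)
    print(f"NameError: {message}")
```

The message names the missing variable, but it says nothing about scope. That gap between the text of the error and the concept behind it is what makes the message hard to interpret the first time.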

And a lot of the time error messages are caused by a problem very different from the text of the message, like how “upstream connect error or disconnect/reset before headers” might mean “julia, your server crashed!”. The skill of understanding what error messages mean is often not transferable when you switch to a new area (if I started writing a lot of React or something tomorrow, I would probably have no idea what any of the error messages meant!). So this definitely isn’t just an issue for beginner programmers.

¹ Evans, Julia. 2019. What does debugging a program look like? https://jvns.ca/blog/2019/06/23/a-few-debugging-resources/
² Typically, bugs in this course are easy to reproduce since our programs have only a limited number of moving parts. In the real world, we often work with data that comes from many different sources all at once, leading to programs interacting with each other in very unexpected ways. Large software is challenging to debug because the symptoms can appear totally unrelated to the true source of the bug. Towards the end of the course, we’ll get a taste of some of this as we put together larger programs for ourselves.