CSE 351, section 1: homework 0, gcc, and gdb

Michael Ratanapintha

Homework 0 review

In section, I showed two histograms displaying your distribution of ratios of column-major (j then i) runtime to row-major (i then j) runtime. I used ratios rather than absolute values both to make the results more comparable between your varied computers, and to better show you how much slower - and how much more variable in slowness - an apparently equivalent program can be thanks to the hardware.

Here are the same histograms below, modified to break out those ratios strictly less than 1 (indicating the column-major traversal was faster), and the ones between 30 and 35 (click to see larger PDF versions):

You shouldn't trust these numbers, though. Here are some reasons:

For the larger array sizes, the Java programs could suffer out of memory (OOM) errors. I discussed that in my Wednesday email and in section, as did John in Friday's class. Even with the changes I suggested in the Wednesday mail, it's possible you could still suffer OOMs.
More generally, some of the really short runtimes for both loop orders, which result in slowdown ratios of about 1 or less, may have been caused by runtime crashes in the C or Java programs.
The hardware differences between computers can affect performance of different programs in different ways. That is, if computer A is twice as fast as computer B at running a row-major array loop, it does not follow that computer A is also twice as fast as computer B on a column-major loop. Hence, simply dividing one case's runtime by the other's may get rid of the absolute units of time, but it doesn't truly cancel out all the hardware differences.
Many of you ran these experiments in the Linux virtual machines we provide for this course. For reasons we'll discuss later in the course, the fact that you are running the experimental programs under an operating system, instead of directly on the hardware, introduces extra variability in the timing results. In a virtual machine, you have two OSes between your program and the real hardware - the guest OS running on the virtual hardware provided by the VM monitor, and the host OS the VM monitor is itself running on. This compounds the variability effect of the OS significantly, making your results even more likely to vary from run to run.
It is possible that when you enlarged the sizes of the arrays, you might have inadvertently forgotten to change the loop bounds so the loops range over the larger arrays.

What did you find surprising? Here are just a few of the things we spotted in your solutions, or that we discussed in section:

As suggested above, a lot of you were surprised to find that the ordering of the outer and inner loop greatly affects the speed of the array copy. In section, we discussed why row-major loop order (i then j) is often so much faster than column-major order (j then i).

As some of you said, this is due to the hardware being designed to optimize scenarios where data access is localized to a particular small region of memory. To exploit this spatial locality, computer architects design CPUs with on-chip caches, which are small memories that replicate parts of the main memory that are likely to be used in the near future. Caches are faster than main memory because they are physically closer to the CPU, and because they often use faster static random access memory (SRAM) circuits, whereas main memory uses cheaper and simpler, but slower dynamic RAM (DRAM) circuits. We will talk more about caches and memory system performance later in the course.
Some of you found that in some cases, the optimized C code actually ran more slowly than the unoptimzed code. What you've found is that compiler optimization is a misnomer - the compiler makes improvements that it believes will make things better, but it's not always right and can never be, either in practice or in theory.

I mentioned the halting problem in section AB - the idea is that it is provably impossible for a computer to determine whether a computer program will finish given some input sequence. Many analyses a compiler can perform to improve the output code can be mathematically reduced to the halting problem; these analyses can only be performed approximately by any compiler already written or to be written.
There was some confusion about the kinds of code optimizations that compilers can perform. Some of you said that code optimization always transforms a slow chunk of source code to a faster chunk of source code. This is not so. Optimizations can also be done on assembly language, on the machine code of a fully linked and compiled program, or in an internal intermediate representation used by the compiler. In fact, many compilers use several intermediate representations of the program code, each at a different level of abstraction, so that each new program form exposes different opportunities for optimization.

The loop-unrolling example in class also confused some of you. Unrolling a loop so that every loop iteration is a separate piece of code is not the only loop optimization a compiler can do. There are many other useful loop optimizations, including rearranging the loop iteration indices, unrolling only a few iterations, and remvoing checks for an overflow in the loop counter (though this last one does not apply to C and Java, which never have such checks anyway).

We hoped that HW0 would show you that the hardware implementation can affect your program's performance and correctness in ways that are not obvious from your source code. From the wide variety of results you reported and your discussions of these results, it's clear that almost all of you got that point, though some of you were perhaps a bit too fixated on the idea of caches and hence missed the larger point.

A note on the extra credit question

If you run the original C code in the "related to that" point, you'd've found that gcc did make the original code run fast when optimizing (-O2), while after making the change we suggested to actually use the loop-computed value, the code ran slowly whether or not it was optimized. On the surface, this C program seems to exhibit the same optimization opportunity as the programs you timed earlier - a useless loop.

So why won't gcc or the Java VM optimize away the loop in the timed programs? This is a question that even we don't know the answer to. Here are some of your speculations:

Some of you argued that the timed programs' array copy loop must have its effects on the program's memory preserved, so it cannot be optimized away. This is actually exactly our own speculated answer, but you should be suspicious of answers of that kind (including ours!). The reason is that Linux tries very hard to keep the memory being used by a program private to that program, and to prevent other programs from reading or writing that memory until the program exits.

It's generally difficult for a program to predict which other programs, if any, previously used the memory it is using, so it is unlikely that any program that could benefit from the effects on memory of our program's array copy could even know whether it was getting this benefit.
Others of you argued that the fact that the various programs take different amounts of time is itself part of the program results, and that optimizing away all of the programs would erase those differences. This argument doesn't generalize very well, however - indeed these programs are a rather poor test of the CPU and memory system of the computer, because they are too short (the time used by the Linux kernel can be signficant), and because they are simple and predictable enough that a crafty architect can game the benchmark by optimizing the CPU and memory specifically for these programs and not for any others.

For the case of gcc optimizing the hw0.c program, we speculated that the structure of the program prevents gcc from realizing that the array copy loop is useless. That is, we have defined the source and destination arrays outside the main() function, so these are visible to the entire program. We think that the global accessibility of the arrays forces gcc to assume that they may be modified by other threads (pieces of the program executed simultaneously with main()), or even other programs. In this hypothesis, gcc simply doesn't realize that, in fact, the src and dst arrays are modified only by main, because it doesn't know when compiling the hw0.c source code file that that file constitutes the entire program.

Using C on Linux

You've compiled and run C programs in homework 0, but probably you didn't do more than copy and paste the commands we gave you. Now it's time to look more closely at how to work with C on Linux.

In the past, there was a whole class that taught you everything you needed to know about Linux and C. Regrettably, that class no longer exists. Hence, it's likely you'll need to learn more about C and Linux than we teach you, either in section or as asides in lecture. If you'd like to learn more, I suggest reading the material for CSE 303, CSE 374, or the new seminar CSE 390-A.

Hello, world!

The first program you ever wrote was probably the Hello world program. Here's a slightly expanded one in C:

/* hello.c */
#include <stdio.h>

int main (void)
{
	int n = 3;
	printf("Hello, world!\n");
	printf("n = %d\n", n);
	n++;
	printf("Now n = %d\n", n);

	return (0); 
}

This main() function is not too far from the corresponding Java code for the main() method, if you were to write it outside of a class. The main difference is that instead of calling the println() method of the PrintStream referenced by System.out, you call the printf() function, which is not stored in any object or part of any class. Instead of getting printf() implicitly, you must ask for it explicitly, by including the source code file stdio.h where it is defined.

Also, you don't build up output by concatenating strings and values, because C doesn't support catenating two strings together using built-in operators, let alone a string and an int. Instead, you call printf() with varying-length parameter lists, and strings with "format specifiers" like %d that say where to interpolate the extra parameters.

In general, everything that you can do with Java primitive values, you can do almost the same in C. (There are some differences in edge cases, such as arithmetic overflow, where Java precisely defines behavior that C leaves unspecified. However, in both C and Java the edge case behavior is almost never what you want, unless you prefer a program that produces the wrong answers to one that blows up.) But C does not have Java objects - C supports complex data structures called structs, but these cannot have methods or functions within them. Put in CSE 143 language, where Java combines state and behavior, C separates them.

Making it run

Remember gcc? It's back! As you did in the homework, bring up the Linux terminal window, and then type:

$ emacs hello.c
[type in the code, save and exit...]
$ gcc -g -Wall hello.c
$ ./a.out
Hello, world!
n = 3
Now n = 4
$

Some things to note here:

The dollar sign $ represents the Linux command line prompt, where you can type your next command. Your prompt will probably end in a dollar sign, but may have some stuff before it.
The gcc command takes a list of options marked by dashes, and then the source code filenames. The -Wall options, which you saw in the homework, means to generate most (not all) compiler warnings. The -g option means to insert information in the output file to match certain sequences of machine instructions to lines of code in the hello.c file. This is useful when running the program under a debugger (next).
By default, gcc writes the compiled program to a.out . You can specify a different program filename by giving the -o option, which is followed by the output filename, for example: -o hello .
To run the compiled program, simply type its filename as if it were a command, but precede the filename by a dot and slash (as above). That tells Linux to look for the program to run in the current folder.

Debugging with gdb

Like the programs you worked with in homework 0, this program does nothing useful. More interestingly, it is probably going to crash - it does on my machine:

/* crash.c */
#include <stdio.h>

int main (void)
{
	/* creates a 10-element array */
	int ns[10];
	int i;
	
	for (i = 0; i < 10; i--) {
		ns[i] = i;
	}

	return (0); 
}

Due to a slip of the fingers, we are decreasing the array subscript i, rather than increasing it. Unlike Java, C does not check that array subscripts are in bounds for the array; in fact, it is not even possible to compute how long an array is at runtime. We may never detect the error, but if we do, it will be Linux that detects the error and violently terminates your program, not C:

$ gcc -g -Wall crash.c -o crash
$ ./crash
Segmentation fault

Since we are too lazy to reread our code, we'll use the computer to help us figure out what went wrong. This calls for a debugger - a tool that lets you run the program one step at a time, handling crashes by keeping the wounded program alive as long as you need to find out what killed it. Linux's debugger is called gdb . Let's run our program crash under gdb:

$ gdb crash
GNU gdb (GDB) Fedora (7.1-34.fc13)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /tmp/crash...done.
(gdb)

Here are some things you can do with gdb (we'll see more later in the quarter):

list

You can run the list command to show the source code that corresponds to the original program.

(gdb) list

By default, list will display the ten lines around the current execution point in the program. In this case, we haven't started our program yet, so we are at the beginning, in the main function, which is the program entry point.

1       #include <stdio.h>
2
3       int main (void)
4       {
5               /* creates a 10-element array */
6               int ns[10];
7               int i;
8
9               for (i = 0; i < 10; i--) {
10                      ns[i] = i;

start

Let's start running the program, since we can't really inspect its state before we do that.

(gdb) start
Temporary breakpoint 1 at 0x804839a: file crash.c, line 9.
Starting program: /tmp/crash

Temporary breakpoint 1, main () at crash.c:9
9               for (i = 0; i < 10; i--) {
Missing separate debuginfos, use: debuginfo-install glibc-2.12.2-1.i686
(gdb)

As you can see, gdb sets an automatic (temporary) breakpoint just before the first executable line in the main function. A breakpoint interrupts normal program execution, freezing it in time, so you can inspect the internal state of your program, dissecting it on your laboratory table (don't worry, programs can't feel pain, and you put it all back together before anyone notices!). Note that when your program reaches a breakpoint, gdb shows the next line to be executed after resuming from the breakpoint.

If you don't need to stop before running main, then use run instead of start .

step

Let's step through the program line by line.

(gdb) step

This will execute the current line, then show the next executable line:

10                      ns[i] = i;

When you call a function, step will jump into that function's source code. This is not so good when you end up looking at code you didn't write, such as printf(). For that, you can use next, which treats function calls as a single operation.

cont

Our program hasn't crashed yet - use cont to run it until it finishes normally or crashes:

(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x080483a9 in main () at crash.c:10
10                      ns[i] = i;
(gdb)

The program crashes just as before, but because it is now running under gdb, Linux simply suspends the program and hands control back to gdb just as if it had hit a breakpoint. Now you can diagnose what went wrong by examining the crashing program's state using print .

print and info locals

One of the most useful gdb commands is print, which lets you print out different expressions, including (but certainly not only!) source code variables. In fact, print can print an almost arbitrary C expression.

Now that our program has crashed, let's use print to find out why:

(gdb) print ns[i]
Cannot access memory at address 0xbf7ffffc
(gdb)

Well, that's no good - Linux won't let us access the array element ns[i] . (This is what Segmentation fault means - Linux will not let you access that point in memory.) But why can't we access that region of memory? Let's take a look at our local variables - maybe ns or i is funky. For that we use info locals :

(gdb) info locals
ns = {0, -1208217120, 134513173, -1208206112, -1208209420, 134513616, 134513376, 134513627, -1208209420, 134513616}
i = -2096336
(gdb)

i shouldn't be negative! Let's use list again to check that we didn't botch the update of i somehow:

(gdb) list
5               /* creates a 10-element array */
6               int ns[10];
7               int i;
8
9               for (i = 0; i < 10; i--) {
10                      ns[i] = i;
11              }
12
13              return (0);
14      }
(gdb)

Oops, I guess we did.

help

Print has a lot of features. To see what they are and how to use them, use help:

(gdb) help print
Print value of expression EXP.
Variables accessible are those of the lexical environment of the selected
stack frame, plus all those whose scope is global or an entire file.

$NUM gets previous value number NUM.  $ and $$ are the last two values.
$$NUM refers to NUM'th value back from the last one.
Names starting with $ refer to registers (with the values they would have
if the program were to return to the stack frame now selected, restoring
all registers saved by frames farther in) or else to debugger
"convenience" variables (any such name not a known register).
Use assignment expressions to give values to convenience variables.

{TYPE}ADREXP refers to a datum of data type TYPE, located at address ADREXP.
@ is a binary operator for treating consecutive data objects
anywhere in memory as an array.  FOO@NUM gives an array whose first
element is FOO, whose second element is stored in the space following
where FOO is stored, etc.  FOO must be an expression whose value
resides in memory.

EXP may be preceded with /FMT, where FMT is a format letter
but no count or size letter (see "x" command).

Typing help by itself will show you a list of more general topics you can use to navigate down to a specific command, if you don't know exactly what you are looking for.

Command history

You can press the up and down arrow keys to go backwards and forwards through your command history. This may save you typing if you want to reuse or slightly modify a previous command.