CSE 351, section 8: Topics for homework 6

Michael Ratanapintha

A note on memory-mapping files

In class you learned about the concept of virtual memory - where each running program does not reference memory using the actual addresses the processor passes to the memory circuitry, but rather uses "virtual" addresses that are translated into real or "physical" ones before they are used. The CPU and OS do the translation using a set of page tables to map small blocks or pages of virtual addresses to the actual memory locations.

The new homework assignment explores another use of virtual addressing - having virtual addresses refer to data that isn't normally stored in main memory at all, a technique sometimes called memory mapping. In particular, you will use virtual addressing to memory-map small blocks from a large file. That way, you can read and write data to the file by simply doing memory loads and stores, rather than having to wait for the hard disk to do the read and write operations directly.

For example, the freedb-tiny.dat file you'll use in homework 6 is organized into 64-byte binary records. Suppose you want to read every 20th record from this file, modify each record, and then write them back into the file in a different order from when you read them. You could use Linux's read() system call to read one record at a time into memory, but then you will have to suffer the overhead of calling read() once for every single record you want, as well as lseek() to skip the 19 records following each record you want.1 A similar combination of lseek() and write() would be needed for every single write-back of a record.

Instead, you can just read the whole file into memory with one read() call, modify and reorder every 20th record in memory, and then when you're about to exit, write back the whole file with one write() call. This causes a lot less overhead from calls into Linux, saving a measurable amount of time. The savings would be far greater still if the records were far larger, say about 1 MB in size. Then each record would likely be in a different block on disk, so that writing the modified records in a non-sequential order would force the disk to "seek" its plates for the next block to write, which takes far more time than simply moving the writer head to the next block in sequence. In essence, by using memory mapping to avoid these repeated, costly disk operations, we are using main memory as a cache for the hard disk. Unlike the caches we have seen in class, this cache is managed by the programmer, but it is a cache nonetheless.

Relevant book reading: Section 9.8 of your textbook (pages 807-812) briefly discusses memory mapping. It examines the first motivation for memory mapping - loading programs' instructions into memory from a program executable file. It also talks about the Linux mmap() system call, which memory-maps files in a similar way to that described above, but also provides conveniences like not actually reading in the whole file, but only reading in blocks of the file when the corresponding memory addresses are read.

Makefiles

To compile the hw6file program for homework 6, you'll use the make command. Make was designed to compile complex programs quickly, by only compiling those source code files that changed since the last compilation. It uses a file called makefile to guide the compile and build process.

The compiling-linking life cycle

Before talking about the syntax of a makefile, let's review how Linux turns a set of C source code (.c) files into an program you can run with ./command. We discussed the compilation process a few weeks back in lecture, but to jog your memory, here is a simplified special case:

  1. You write a source code file, /a/main.c. It uses functions declared in several header files. One such header file is /a/program.h , which you wrote earlier, and request in main.c with #include "program.h" . Another is the standard I/O library header, /usr/include/stdio.h , requested with #include <stdio.h> .
  2. You run the command gcc main.c -o program to create an executable program /a/program from main.c. This actually does several things:
    1. gcc runs the C preprocessor, cpp, which interprets and then removes all lines in main.c that starts with #. Your two #include directives cause cpp to insert the contents of /a/program.h and /usr/include/stdio.h into your program. These files contain the declarations (names, argument, and return types) of the functions you use, but not their bodies.
    2. gcc then runs GCC's C compiler cc1 to compile main.c, turning it into an assembly language file main.s (not kept). Functions that are declared but not used in the pre-processed main.c are quietly dropped; this may include functions declared in /usr/include/stdio.h or /a/program.h which you didn't use. Functions that are declared and used in main.c, but whose bodies are not provided, are included as textual labels in main.s, with special assembler commands to mark them as referring to instructions outside main.s.
    3. gcc then runs the GNU assembler as to turn main.s into the compiled, machine-language object file main.o. As in main.s, functions used but not defined in main.o are marked as "external," to be defined by other object code.
    4. Finally, gcc runs the Linux linker ld to turn main.o into the executable file program. Ld does this by linking main.o to the C library, stored in the shared object file /lib/libc.so.62. This is done by copying the contents of main.o to program, and then storing the path to libc.so.6 in program so that when the program calls functions that are not stored in program, Linux knows to look in /lib/libc.so.6 for the code of these functions. By doing this, program can access C library functions like printf(), declared in the header file stdio.h, without having to include all the code of the C library, wasting space on extraneous and/or duplicated functions.

Not all of this process has to be done at once. In fact, when you write makefile files, you will often want to break up this process into pieces, because otherwise you will lose the compilation speedup benefit of make. For example, you might split linking from the rest of program building, so that changing a single source code file doesn't force you to recompile and relink your whole program.

When you do homework 6, you'll need to extend this base compilation process by linking in libraries other than the C library libc.so.6 . For example, suppose your program calls functions from the ALSA Linux sound library, called libasound.so . For this, you would add the option -l asound (conventionally written -lasound) to your gcc command line. This tells the linker to search a set of system-defined directories3, including /lib and /usr/lib , for any shared object file called libasound.so or libasound.so.N for some number N, and link that into your program, so calls on ALSA functions are directed to libasound.so's code.

Using make for homework 6

As noted above, you compile hw6file by running make:

$ make
cc -g -Wall   -c -o artistCDs.o artistCDs.c
cc -g -Wall   -c -o listArtists.o listArtists.c
cc -g -Wall   -c -o main.o main.c
cc -g -Wall   -c -o printTree.o printTree.c
cc -g -Wall   -c -o RecordStore.o RecordStore.c
cc -g -Wall   -c -o FileFileReader.o FileFileReader.c
gcc artistCDs.o listArtists.o main.o printTree.o RecordStore.o FileFileReader.o -l readline -o hw6file
$ 

Make works by running a series of Linux commands; each command is shown before it is executed.

To delete hw6file and other compilation cruft, possibly in preparation for starting over, you can say make clean:

$ make clean
rm -f *.o *~ hw6file
$ 

clean is a make target - a subcommand you can give to make to build only part of your program. (In this case, you are not actually "building" anything, but the concept is the same.) make by itself builds the default target (below).

The homework 6 makefile

Make uses the following file, called makefile, to guide compilation of hw6file:

CFLAGS=-g -Wall
BASEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o
FILEOBJS=$(BASEOBJS) FileFileReader.o

file: $(FILEOBJS)
	gcc $(FILEOBJS) -o hw6file

clean:
	rm -f *.o *~ hw6file

Symbol expansion

The first three lines define symbols. Those symbols are "expanded" in commands issued later in the file, using the $ operator. For example, the line

FILEOBJS=$(BASEOBJS) FileFileReader.o

is expanded to

FILEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o FileFileReader.o

because of the previous symbol definition

BASEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o

In this way, makefile symbol definitions are much like the C preprocessor's #define directives, which also define names that get textually expanded wherever they appear later in the source file.

Defining targets

The next non-whitespace line is:

file: $(FILEOBJS)

After symbol expansion, this line says that to build the target file, make must first build artistCDs.o, listArtists.o, and so on. Make already knows how to build .o files from .c files, so it looks for .c files with those names to compile. In the original makefile, one .c source file is missing, so make fails. However, if make were to succeed, it would leave all the object files in the directory, so that if only one .c file were modified, only that .o file has to be recompiled.

Following the line beginning file: is the Linux command to execute to build target file, preceded by a single tab stop (and only a tab stop, not a series of spaces):

	gcc $(FILEOBJS) -o hw6file

(When gcc is given a series of object files as input rather than source files, it knows that all you want is to link the object files together, so it doesn't bother with compiling or assembling.)

In general, each makefile target can be followed by any number of Linux commands to "build" that target. Every command line must begin with a single tab stop. Not 4 spaces, not 8 spaces, but what you get when you type [Tab] in a simple, dumb editor (i.e. not Vim or Emacs). Luckily, when it comes to makefiles, the smartest editors work just like dumb ones.

The default target, and clean

Because file is the first target in the makefile, it's the default target - the one we said make tries to build if you just say make without an explicit target argument.

To build a specific target, you specify its name after make; so for example, make clean builds the clean target, deleting files created by the build process. The clean target doesn't have any listed prerequisites (nothing following the colon), so it can be "built" at any time, without having to build any other targets.

More GDB topics

Way back in the distant past, I talked briefly about using the debugger GDB to inspect your C programs. That section's notes contain most of what you might need, but here are some topics I neglected to cover then. I didn't get time to cover these topics today either, but here they are for your reference:

Creating your own breakpoints

I said before that when you start a program under GDB with the start command, the debugger sets a breakpoint at the first line of code of main(), so execution is halted just before your code starts running. You can set your own breakpoints with the break command. To set a breakpoint at the current point of execution, do this:

(gdb) break

To set a breakpoint when you enter the main() function, give the name of main() to break:

(gdb) break main

This represents the first source line in the body of main(), after the prologue.

To set a breakpoint at line 50 of main.c, do this:

(gdb) break 'main.c:50

Note the leading apostrophe before main.c, which is not matched by another close quote. You can omit the source file name if you are executing in the same file as the one you want to break in. (This same syntax for specifying source lines also works with list.)

Deleting, disabling, and enabling breakpoints

If you set a breakpoint in the wrong place, you can delete it. To do this, you need to know the ID number for the breakpoint, which is shown when you set it:

(gdb) break main
Breakpoint 1 at 0x100000f18: file crash.c, line 9.

You can also see all breakpoints with ID numbers with info break:

(gdb) info break
Num Type           Disp Enb Address            What
1   breakpoint     keep y   0x0000000100000f18 in main at crash.c:9
2   breakpoint     keep y   0x0000000100000f21 in main at crash.c:10

Once you have the breakpoint ID, you can delete the breakpoint with delete:

(gdb) delete 1
(gdb) info break
Num Type           Disp Enb Address            What
2   breakpoint     keep y   0x0000000100000f21 in main at crash.c:10
(gdb) 

You can also temporarily disable breakpoints without having GDB forget about them altogether, then enable them when you want them again:

(gdb) disable 2
(gdb) info break
Num Type           Disp Enb Address            What
2   breakpoint     keep n   0x0000000100000f21 in main at crash.c:10
(gdb) enable 2
(gdb) info break
Num Type           Disp Enb Address            What
2   breakpoint     keep y   0x0000000100000f21 in main at crash.c:10
(gdb) 

More information about GDB

Here is a lecture tutorial on GDB from the University of Maryland. UW's CSE 303 also spent some time discussing GDB. If you feel up to it, you could also read the actual GDB manual.

Footnotes

  1. Actually, you could just call read() 19 times while throwing all of the results away, but this would cause a lot of wasted memory copies into your address space.
  2. We're describing dynamic linking here. The alternative, that of actually taking all the code from libc and inserting it into the executable program, is called static linking. On desktop Linux, static linking is rarely used for large libraries like the C library, but on embedded platforms where dynamic linking is difficult or impossible, static linking is favored.
  3. You can tell the linker to search other directories, such as the current one, with the -L (big-L, not little-l) option to gcc . In that case, you'll probably also want to change the path the preprocessor searches for header files (since you'll be including the headers for the extra libraries you want); this is the -I (big letter Eye) option to gcc .