In class you learned about the concept of virtual memory - where each running program does not reference memory using the actual addresses the processor passes to the memory circuitry, but rather uses "virtual" addresses that are translated into real or "physical" ones before they are used. The CPU and OS do the translation using a set of page tables to map small blocks or pages of virtual addresses to the actual memory locations.
The new homework assignment explores another use of virtual addressing - having virtual addresses refer to data that isn't normally stored in main memory at all, a technique sometimes called memory mapping. In particular, you will use virtual addressing to memory-map small blocks from a large file. That way, you can read and write data to the file by simply doing memory loads and stores, rather than having to wait for the hard disk to do the read and write operations directly.
For example, the freedb-tiny.dat
file you'll use
in homework 6 is organized into 64-byte binary records. Suppose you want
to read every 20th record from this file, modify each record, and then
write them back into the file in a different order from when you read them.
You could use Linux's read()
system call to read one record at a time
into memory, but then you will have to suffer the overhead of calling
read()
once for every single record you want, as well as
lseek()
to skip the 19 records following each record you
want.1
A similar combination of lseek()
and write()
would be
needed for every single write-back of a record.
Instead, you can just read the whole file into memory with one
read()
call, modify and reorder every 20th record
in memory, and then when you're about to exit, write back the whole
file with one write()
call. This causes a lot less overhead
from calls into Linux, saving a measurable amount of time. The savings
would be far greater still if the records were far larger, say about 1 MB
in size. Then each record would likely be in a different block on disk,
so that writing the modified records in a non-sequential order
would force the disk to "seek" its plates for the next block to write,
which takes far more time than simply moving the writer head to the
next block in sequence. In essence, by using memory mapping to avoid these
repeated, costly disk operations, we are using main memory as a cache for
the hard disk. Unlike the caches we have seen in class, this cache is managed
by the programmer, but it is a cache nonetheless.
Relevant book reading:
Section 9.8 of your textbook (pages 807-812) briefly discusses memory mapping.
It examines the first motivation for memory mapping - loading programs'
instructions into memory from a program executable file. It also talks about
the Linux mmap()
system call, which memory-maps files
in a similar way to that described above, but also provides conveniences
like not actually reading in the whole file, but only reading in blocks
of the file when the corresponding memory addresses are read.
To compile the hw6file
program for homework 6, you'll use the
make
command. Make
was designed to compile complex
programs quickly, by only compiling those source code files that changed since
the last compilation. It uses a file called makefile
to guide
the compile and build process.
Before talking about the syntax of a makefile
, let's review
how Linux turns a set of C source code (.c
) files into an
program you can run with ./command
.
We discussed the compilation process a few weeks back in lecture,
but to jog your memory, here is a simplified special case:
/a/main.c
. It uses
functions declared in several header files. One such header file is
/a/program.h
, which you wrote earlier, and request
in main.c
with #include "program.h"
. Another
is the standard I/O library header, /usr/include/stdio.h
,
requested with #include <stdio.h>
.
gcc main.c -o program
to create
an executable program /a/program
from main.c
.
This actually does several things:
gcc
runs the C preprocessor, cpp
, which
interprets and then removes all lines in main.c
that
starts with #
. Your two #include
directives
cause cpp
to insert the contents of /a/program.h
and /usr/include/stdio.h
into your program. These files
contain the declarations (names, argument, and return types)
of the functions you use, but not their bodies.
gcc
then runs GCC's C compiler cc1
to
compile main.c
, turning it into an assembly language file
main.s
(not kept). Functions that are declared but not used
in the pre-processed main.c
are quietly dropped; this may
include functions declared in /usr/include/stdio.h
or
/a/program.h
which you didn't use. Functions that are
declared and used in main.c
, but whose bodies are not
provided, are included as textual labels in main.s
,
with special assembler commands to mark them as referring to instructions
outside main.s
.
gcc
then runs the GNU assembler as
to turn
main.s
into the compiled, machine-language object file
main.o
. As in main.s
, functions used but not
defined in main.o
are marked as "external," to be defined
by other object code.
gcc
runs the Linux linker ld
to
turn main.o
into the executable file program
.
Ld
does this by linking main.o
to
the C library, stored in the shared object file
/lib/libc.so.6
2.
This is done by
copying the contents of main.o
to program
,
and then storing the path to libc.so.6
in program
so that when the program calls functions that are not stored in program
,
Linux knows to look in /lib/libc.so.6
for the code of these functions.
By doing this, program
can access C library functions like
printf()
, declared in the header file stdio.h
,
without having to include all the code of the C library, wasting space
on extraneous and/or duplicated functions.
Not all of this process has to be done at once. In fact,
when you write makefile
files, you will often want
to break up this process into pieces, because otherwise you will lose
the compilation speedup benefit of make
. For example, you
might split linking from the rest of program building, so that changing
a single source code file doesn't force you to recompile and relink your
whole program.
When you do homework 6, you'll need to extend this base compilation process
by linking in libraries other than the C library libc.so.6
.
For example, suppose your program calls functions from the ALSA Linux sound
library, called libasound.so
. For this, you would add
the option -l asound
(conventionally written -lasound
)
to your gcc
command line. This tells the linker to search
a set of system-defined directories3,
including /lib
and /usr/lib
,
for any shared object file called libasound.so
or libasound.so.N
for some number N, and link that into your program,
so calls on ALSA functions are directed to libasound.so
's code.
make
for homework 6
As noted above, you compile hw6file
by running make
:
$ make
cc -g -Wall -c -o artistCDs.o artistCDs.c
cc -g -Wall -c -o listArtists.o listArtists.c
cc -g -Wall -c -o main.o main.c
cc -g -Wall -c -o printTree.o printTree.c
cc -g -Wall -c -o RecordStore.o RecordStore.c
cc -g -Wall -c -o FileFileReader.o FileFileReader.c
gcc artistCDs.o listArtists.o main.o printTree.o RecordStore.o FileFileReader.o -l readline -o hw6file
$
Make
works by running a series of Linux commands; each command is
shown before it is executed.
To delete hw6file
and other compilation cruft, possibly in
preparation for starting over, you can say make clean
:
$ make clean
rm -f *.o *~ hw6file
$
clean
is a make
target - a subcommand
you can give to make
to build only part of your program.
(In this case, you are not actually "building" anything, but the concept
is the same.) make
by itself builds the default target (below).
makefile
Make
uses the following file, called makefile
,
to guide compilation of hw6file
:
CFLAGS=-g -Wall BASEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o FILEOBJS=$(BASEOBJS) FileFileReader.o file: $(FILEOBJS) gcc $(FILEOBJS) -o hw6file clean: rm -f *.o *~ hw6file
The first three lines define symbols. Those symbols are "expanded" in commands issued later in the file, using the $ operator. For example, the line
FILEOBJS=$(BASEOBJS) FileFileReader.o
is expanded to
FILEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o FileFileReader.o
because of the previous symbol definition
BASEOBJS=artistCDs.o listArtists.o main.o printTree.o RecordStore.o
In this way, makefile
symbol definitions are much like the
C preprocessor's #define
directives, which also define names
that get textually expanded wherever they appear later in the source file.
The next non-whitespace line is:
file: $(FILEOBJS)
After symbol expansion, this line says that to build the target
file
, make
must first build artistCDs.o
, listArtists.o
, and so on.
Make
already knows
how to build .o
files from .c
files, so it looks for .c
files with those names to compile. In the original makefile
,
one .c
source file is missing, so make
fails.
However, if make
were to succeed, it would leave all the object
files in the directory, so that if only one .c
file were modified,
only that .o
file has to be recompiled.
Following the line beginning file:
is the Linux command
to execute to build target file
, preceded by a single tab stop
(and only a tab stop, not a series of spaces):
gcc $(FILEOBJS) -o hw6file
(When gcc
is given a series of object files as input rather
than source files, it knows that all you want is to link the object files together,
so it doesn't bother with compiling or assembling.)
In general, each makefile
target can be followed by any number
of Linux commands to "build" that target. Every command line must begin
with a single tab stop. Not 4 spaces, not 8 spaces, but what you get when
you type [Tab] in a simple, dumb editor (i.e. not Vim or Emacs).
Luckily, when it comes to makefile
s, the smartest editors
work just like dumb ones.
clean
Because file
is the first target in the makefile
,
it's the default target - the one we said make
tries to build if you just say make
without an explicit target argument.
To build a specific target, you specify its name after make
;
so for example, make clean
builds the clean
target,
deleting files created by the build process.
The clean
target doesn't have any listed prerequisites
(nothing following the colon), so it can be "built" at any time, without
having to build any other targets.
Way back in the distant past, I talked briefly about using the debugger GDB to inspect your C programs. That section's notes contain most of what you might need, but here are some topics I neglected to cover then. I didn't get time to cover these topics today either, but here they are for your reference:
I said before that when you start a program under GDB with
the start
command, the debugger sets a breakpoint at the
first line of code of main()
, so execution is halted just
before your code starts running. You can set your own breakpoints
with the break
command. To set a breakpoint at
the current point of execution, do this:
(gdb) break
To set a breakpoint when you enter the main()
function,
give the name of main()
to break
:
(gdb) break main
This represents the first source line in the body of main()
,
after the prologue.
To set a breakpoint at line 50 of main.c
, do this:
(gdb) break 'main.c:50
Note the leading apostrophe before main.c
, which is not
matched by another close quote. You can omit the source file name if
you are executing in the same file as the one you want to break in.
(This same syntax for specifying source lines also works with list
.)
If you set a breakpoint in the wrong place, you can delete it. To do this, you need to know the ID number for the breakpoint, which is shown when you set it:
(gdb) break main
Breakpoint 1 at 0x100000f18: file crash.c, line 9.
You can also see all breakpoints with ID numbers with info break
:
(gdb) info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x0000000100000f18 in main at crash.c:9
2 breakpoint keep y 0x0000000100000f21 in main at crash.c:10
Once you have the breakpoint ID, you can delete the breakpoint with delete
:
(gdb) delete 1 (gdb) info break Num Type Disp Enb Address What 2 breakpoint keep y 0x0000000100000f21 in main at crash.c:10 (gdb)
You can also temporarily disable breakpoints without having GDB forget about them altogether, then enable them when you want them again:
(gdb) disable 2 (gdb) info break Num Type Disp Enb Address What 2 breakpoint keep n 0x0000000100000f21 in main at crash.c:10 (gdb) enable 2 (gdb) info break Num Type Disp Enb Address What 2 breakpoint keep y 0x0000000100000f21 in main at crash.c:10 (gdb)
Here is a lecture tutorial on GDB from the University of Maryland. UW's CSE 303 also spent some time discussing GDB. If you feel up to it, you could also read the actual GDB manual.
read()
19 times while throwing
all of the results away, but this would cause a lot of wasted memory copies
into your address space.
libc
and inserting it into
the executable program
, is called static linking.
On desktop Linux, static linking is rarely used for large libraries like the
C library, but on embedded platforms where dynamic linking is difficult
or impossible, static linking is favored.
-L
(big-L, not little-l)
option to gcc
. In that case, you'll probably also want
to change the path the preprocessor searches for header files (since you'll be
including the headers for the extra libraries you want); this is
the -I
(big letter Eye) option to gcc
.