The Kernel

UNIX-like systems have a kernel, which contains device drivers, file systems, networking stacks, memory management, and CPU scheduling code. The kernel also provides support for user-level processes, each of which runs an arbitrary program.

Processes

A process is a single, active invocation of a program. There are usually dozens of processes on a UNIX machine, each with a specific purpose. For example, login(1) is responsible for authenticating users (by checking a username/password combination) and starting a shell. A single kernel may be running several different processes that all execute the same program (e.g. we are all running emacs(1) on spinlock).

What data makes up a process?

A process consists of all the state needed to run a UNIX program. Some of this state is stored in the CPU registers and the allocated memory. The kernel also maintains a Process Control Block (PCB) for each process (Linux calls this a task_struct; it's defined in include/linux/sched.h). The PCB contains all the internal state that the kernel uses to provide the process abstraction, including a list of open files, the user ID that the process is running as, and scheduling information. The PCB is not directly accessible to user programs. We might draw this all as:

The text segment is the UNIX term for the program's code in machine language. Note that, while I've included arrows for the SP (stack pointer) and PC (program counter) on the memory drawing, the SP and PC are actually stored in CPU registers.

How does a process get created?

Any process can create a new process by making a request to the kernel. For example, I might request that my shell [1] create a new mail(1) process by typing the text 'mail\n' at a prompt. The shell translates this command into a series of system calls, which, as we'll see, wind up creating a new process running the mail program. (The kernel creates init(8), the first process, at boot. The kernel never spontaneously creates any other processes.)

Under UNIX, two main calls are used by the shell to execute my command: first, the shell clones itself using the fork(2) syscall; then, this clone replaces itself with the mail program using one of the exec(3) functions, which are library wrappers around the execve(2) syscall. Kind of strange, but it turns out to be quite useful.

fork(2)

The function of fork(2) is to create an almost exact duplicate of the process that calls it. So, if we had the above diagram before the process invoked the fork syscall, then the diagram afterward would look like:

Most fields of the PCB are copied from the original to the newly created PCB; they are starred and shown in blue above (we'll call the new process the child and the old the parent from now on). The parent's memory is also copied [2]. Note that the PC of both processes is exactly the same.

Eventually, the kernel will get around to running these two processes. (Which one first? That's undefined; either could run first.) When this happens, the process (be it child or parent) will continue from its saved PC. Since the process just executed the fork syscall, the PC will point to whatever instruction immediately follows it.

Readers may have noticed a problem at this point. The processes are both exactly the same. When the kernel runs one, it'll do some stuff (whatever the instructions after the fork tell it to do). Then, when the other one runs, it'll do...the exact same thing. That's not too useful.

To get around this, the processes differ in a very slight but important way. The return value of the fork syscall will be 0 in the child process, and will be the child's process ID (always greater than 0) in the parent. If the fork fails, no child is created and the caller gets -1 instead. We can use this to have the parent take one code-path while sending the (almost identical) child down a different path.

To summarize, immediately after executing a successful fork:

There are 2 processes that are exactly the same, except for the differences described in the fork(2) man page.
Both processes are at the same line of code (the line immediately after the fork).
In the child process, the return value of the fork is 0.
In the parent process, the return value of the fork is the child's process ID, which is greater than 0.
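
The branching described above can be sketched in C. This is a minimal illustration rather than anything from a real shell: the function name fork_and_branch and the exit status 7 are made up for the example, and waitpid(2) is used only so that the parent can observe which path the child took.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks once and sends parent and child down different code paths.
 * Returns the child's exit status as seen by the parent, or -1 on error.
 * (fork_and_branch and the status 7 are hypothetical, for illustration.) */
int fork_and_branch(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;          /* fork failed: no child was created */

    if (pid == 0) {
        /* Child: fork returned 0. */
        _exit(7);           /* arbitrary status so the parent can tell */
    }

    /* Parent: fork returned the child's PID, which is > 0. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Both processes return from the same fork() call; only the value of pid tells them apart.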

exec(3)

So we now have two copies of the shell. But they are still both running the shell program; we want the child to run the mail(1) program. The child uses another call, exec(3), to replace itself with the mail program. exec does not create a new process; it just changes the program file that an existing process is running.

exec first wipes out the memory state of the calling process. It then goes to the filesystem to find the program file requested. exec copies this file into the process's memory and initializes register state, including the PC.

exec doesn't alter most of the other fields in the PCB - this is important, because it means the process calling exec(3) (the child copy of the shell, in this case) can set things up first if it wants to, for example by changing the open files or the user ID[3].
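
As one example of this kind of setup, here is a sketch of how a child might point its standard output at a file before calling exec, roughly what a shell could do for a command like 'prog > file'. The function name run_redirected and the exit codes 126 and 127 are assumptions made for this illustration (the codes follow a common shell convention); open(2), dup2(2), execvp(3) and waitpid(2) are the real calls.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] with its standard output redirected to `path`.
 * Returns the program's exit status, or -1 on error.
 * (run_redirected is a hypothetical helper, not a standard function.) */
int run_redirected(char *const argv[], const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: change the open files, then exec. The new program
         * inherits descriptor 1 pointing at the file. */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            _exit(126);
        if (dup2(fd, STDOUT_FILENO) < 0)
            _exit(126);
        if (fd != STDOUT_FILENO)
            close(fd);          /* the duplicate on fd 1 is enough */

        execvp(argv[0], argv);
        _exit(127);             /* reached only if exec failed */
    }

    /* Parent: its own open files are untouched by the child's changes. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The program being run needs no knowledge of the redirection; it simply writes to its standard output as usual.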

At this point we've still got two processes, but now one (the parent) is the shell, and the other (the child) is mail(1). The kernel will run both of them, making sure each gets a bit of processing time. The shell usually wants to wait for the mail (child) process to finish before doing anything else; it can tell the kernel this using the wait(2) syscall.
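
Putting the three calls together, the core of what a shell does to run a command might look like the sketch below. It is a simplified illustration, not the code of any real shell: run_and_wait is a hypothetical name, execvp(3) is picked from the exec family because it searches PATH, and the exit code 127 for a failed exec is a convention assumed here.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs the program named by argv[0] and waits for it to finish.
 * Returns its exit status, or -1 if fork or waitpid fails.
 * (run_and_wait is a hypothetical helper, for illustration only.) */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();             /* step 1: clone the caller */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: step 2, replace this process's program. */
        execvp(argv[0], argv);
        _exit(127);                 /* reached only if exec failed */
    }

    /* Parent: step 3, block until the child terminates. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For the example in the text, the shell would call something like this with an argument vector of { "mail", NULL }.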

Footnotes

[1] The shell is itself a process. What process created it? That depends on how I'm logged in. If I'm at the console (that is, sitting at the machine itself) then the login(1) process creates a shell for me after I supply a valid username/password (the login(1) man page contains a nice description of this procedure). If I'm logged in remotely -- say using ssh(1) -- a similar procedure would happen, but sshd(8) (the server process that accepts ssh connections from remote clients) would create the shell for me after I connected and authenticated. What program is launched when either login(1) or sshd(8) needs to make a new shell is determined by the user's entry in the /etc/passwd file (take a look - the last field on each line is the user's shell).

[2] Most modern systems do not actually copy all the memory when fork is called. Instead, they play a little trick to be lazy (remember, lazy is always good). They have the mapping functions for both the child and parent process point to the same memory until such time as one of them makes a change. Only when a change is made does the system copy the memory.
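
Whether the copy is made eagerly or lazily, a program sees the same result: a change made by the child never shows up in the parent. A small sketch of this, with a made-up function name (parent_value_after_child_write) and made-up values:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's value of `counter` after the child has modified
 * its own copy and exited, or -1 if fork or waitpid fails.
 * (Hypothetical helper, for illustration only.) */
int parent_value_after_child_write(void)
{
    volatile int counter = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        counter = 100;      /* touches the child's copy of the memory only */
        _exit(0);
    }

    /* Wait so that the child's write has definitely happened. */
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return counter;         /* the parent's copy is unchanged */
}
```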

[3] The PCB is private to the kernel, so the shell couldn't actually directly set the values. It can, however, use a variety of syscalls to alter them in controlled ways.