CSE 451 Autumn 2000

CSE 451 Autumn 2000 - Project Assignment 3

Fair Share Scheduling

Version 1.1: Updated 10/31 at 10:00AM
Replaced "effective UID" with "UID" in description of the spawn utility.

Out: 30 October, 2000
Part A Due:6 November, 2000
Part B Due:13 November, 2000

Assignment Goals

Look under the covers of time-sharing schedulers

Some exposure to such schedulers on multi-processors

Some exposure to the interaction between the scheduler and other parts of the operating system (e.g., memory management)

Background

Linux actually contains three scheduling policies, one for "general use" and two for a kind of real-time processing. We'll be concerned with the general use policy, which is a descendant of the multi-level feedback approach discussed in the book. (However, the other two have some penetration into the code you'll end up modifying, so you need to understand at least a tiny bit about them.)

Quoting from "man sched_setscheduler":

SCHED_OTHER is the default universal time-sharing scheduler policy used by most processes, SCHED_FIFO and SCHED_RR are intended for special time-critical applications that need precise control over the way in which runnable processes are selected for execution. Processes scheduled with SCHED_OTHER must be assigned the static priority 0, processes scheduled under SCHED_FIFO or SCHED_RR can have a static priority in the range 1 to 99. Only processes with superuser privileges can get a static priority higher than 0 and can therefore be scheduled under SCHED_FIFO or SCHED_RR. The system calls sched_get_priority_min and sched_get_priority_max can be used to find out the valid priority range for a scheduling policy in a portable way on all POSIX.1b conforming systems.

and

SCHED_OTHER: Default Linux time-sharing scheduling
SCHED_OTHER can only be used at static priority 0. SCHED_OTHER is the standard Linux time-sharing scheduler that is intended for all processes that do not require special static priority real-time mechanisms. The process to run is chosen from the static priority 0 list based on a dynamic priority that is determined only inside this list. The dynamic priority is based on the nice level (set by the nice or setpriority system call) and increased for each time quantum the process is ready to run, but denied to run by the scheduler. This ensures fair progress among all SCHED_OTHER processes.

Note that these describe the logical operation of the scheduler; the actual implementation might (and probably will) not be quite what you'd expect from reading these descriptions.

As with many schedulers, one of the key concepts in the Linux scheduler is round-robin (RR) scheduling. Part of the point of RR scheduling is fairness. In particular, there is fairness among processes, in the sense that each process (of a given priority) receives an equal allocation of CPU time over an appropriately measured interval.

One problem with RR scheduling is that it is per process fairness, not per user fairness. What this means is that a user who starts many processes will receive, in total, a much larger fraction of system resources than will a user who has only a few processes running.

Fair-share schedulers address this problem, using the RR mechanism, but with the goal of providing equal allocation per user (that is, summing over all processes being run for that user), not per process. As an example, if user A has four times as many processes running as user B, each of user A's process would receive only one fourth of the allocation per second, say, given to processes for user B, so that users A and B split the CPU resource equally. Under the default scheduler, user A would receive 80% of the CPU.

Complications

The examples above assume that all processes are in tight CPU loops. That isn't the typical case. Instead, processes perform IO and other blocking operations. This complicates the definition of "fair share." (A more detailed discussion of this than you need for this assignment is given in A Coarse Grained Fair Share Scheduler, as well as in other references available on the web.)

Part of this assignment (to be completed in Part A) is for you to decide just what the objectives of your scheduler are. The objectives are distinct from the mechanisms used to actually perform the scheduling. Typically, the mechanisms do not achieve the objectives, except in extreme cases (e.g., all running processes are in tight CPU loops). In fact, typically it's difficult or impossible to give a convincing argument about just what "should happen" for a given set of real processes.

If it were me doing this assignment, I'd try to adopt the mechanisms and objectives of the current Linux scheduler, subject to the change to fair-share as the overall flavor of the scheduler, on the theory (which is true) that the current scheduler represents a survival of the fittest evolution of the algorithm based on observations of its performance in actual use over a number of years.

In any case, though, you're free to make whatever decisions you think are appropriate, but you should try to be complete in your Part A write-up about what those decisions are.

The Assignment

There are two parts to the assignment. The first is to read the existing scheduler code (and whatever other materials you find helpful), and to design modifications to convert it to fair-share scheduling, as defined above. The second part is to implement, debug, and test the modifications.

Your changes should be "the minimal changes to Linux that effect fair-share scheduling." You'll have to use your judgment in making detailed decisions (you can ask the staff if you're not sure), but at the very least all applications that run correctly on the unmodified Linux should run on your system. This means that applications that use the real-time schedulers should work. It also means that other things the scheduler is involved in (e.g., signals, "bottom half handlers", alarms, etc.), that the text book does not talk about, need to continue working. Finally, the notion of priorities ("man renice") should persist in your modified system, although exactly what happens to renice'd processes will undoubtedly be affected by your changes.

What to hand in for each part of the assignment is discussed below.

Not the Assignment

It should be obvious that if you can find an implementation of fair-share scheduling for Linux on the web, we can find that same implementation.

Linux Source and Configuration

The most directly relevant code is in .../kernel/sched.c. It is certain that you will need to look at other source files as well.

You must make sure that you have configured Linux to compile for multiprocessors. Some portions of the scheduler code have separate versions for uni-processors and multiprocessors. This is controlled by the symbol CONFIG_SMP (as in #ifdef CONFIG_SMP). You should make sure this symbol is defined, and if it isn't, run make config to get it turned on.

(How can you tell if it is defined? The easiest way, I think, is to find a block of code that will be compiled only if it is defined and edit some nonsense in there (e.g., add "this should fail" as if it were a line of C code). Now do a make. If you get an error message on this line, the symbol is defined. If not, either the symbol isn't defined or that file isn't part of the project, or something else is wrong.)

You should make decisions appropriate to execution on multi-processors when writing your code. Of course, our machines are not multi-processors, so this is a bit of pretending. We MAY have an actual multi-processor available, on loan, for testing by those who are interested. However, the bottom line is that you should write code that would be sensible for execution on multi-processors, but your code has to be tested only on uni-processors. This means that you must worry about race conditions. However, I don't believe that you need to create any new policy or mechanism specific to multiprocessor scheduling.

Recommended Procedure

It is very inconvenient to have a hard bug in the scheduler - the machine simply won't run your copy of the kernel, and you'll spend days rebooting.

For that reason, I recommend that you (a) think hard about your code before you recompile and install, and (b) make modifications incrementally, not all at once. If the machine fails to boot, or to run, it's a good guess that the last set of changes you made are the problem. (It's not certain, of course.) Making sure that set of changes is small can help find the problem.

Additionally, you might find that instrumenting the code (printk's, or similar sorts of things) helps you, and even that building some additional tools (e.g., a new system call, and a user-level application that invokes it) are worth the effort.

We've already built some of the latter, as described next.

Utilities Overview

You'll be working on the scheduler. If the systems hangs or crashes, you'll know your changes haven't worked. But, how will you know that they are working?

We've written some utility programs to help with this. These might be of some use in looking at how the existing scheduler behaves, but are intended primarily to help evaluate your changes. The source to the utilities is available, but they were written with the intention that you'd never have to look at it. The source can be used to install the utilities on Linux machines outside of the lab (they're already installed in the lab), or to customize the utilities if you want them to do things they don't already do.

There are two major utilities, one that shows a full-screen display of the fraction of CPU allocated to each user and each user process, and one to help launch a set of applications as if they were launched by a set of distinct users.

Monitor Utility

Command: monitor [refresh interval in seconds]

The monitor prints a full screen display showing CPU usage by all user-level (not root, and not a few daemon) processes. The display is updated every "refresh interval" seconds, defaulting to 2 seconds if no parameter is given. (2 seconds is also the minimum interval allowed.)

Here is some sample output:

Elapsed time: 2.13 seconds

zahorjan: 100% ( 99%, 0%)

( tcsh) 18122 0% ( 0%, 0%)

( piglet) 18168 99% ( 99%, 0%)

( monitor) 18170 0% ( 0%, 0%)

The first line shows the amount of real time that passed between screen updates. The next line shows that user zahorjan consumed 100% of CPU the during those 2.13 seconds. zahorjan was running three processes, a shell (with PID 18122), a process called piglet, and the monitor itself. The piglet process used the CPU for 99% of the 2.13 seconds, of which 99% of the 2.13 seconds was in user mode and 0% was system mode. (Conversion to integers in the monitor program, as well as the Heisenberg principle, cause some loss of precision.)

The monitor program accepts two keyboard inputs. The inputs are acted upon only the next time the screen is updated (for reasons that you can guess from the original client code in project assignment two). The inputs are:

'q' Quit the program

<space> Toggle between real-time and CPU time modes

The sample output explained above is real-time mode: fractions are given relative to the actual elapsed time between refreshes of the screen. In CPU time mode fractions are relative to the total amount of CPU used during that interval. So, if the refresh interval were exactly 2 seconds and some one process were the only one that ran during that time, and it used 1 second of CPU, it would show 50% consumption in real-time mode and 100% consumption in CPU mode.

Spawn Utility and Its Cronies

Command: spawn [configuration file]

The spawn command reads a configuration file, each line of which specifies a user ID (UID) and the name of an executable. It forks a new process for each line and runs the named executable in it. It also sets the UID of the process to the value given in the line of the configuration file. This allows you to launch a set of processes as if they had been launched by a set of distinct users. Because spawn sets the UID of processes, you must be logged in as root to use it.

Two sample executables are also provided, piglet and io. Piglet is just a tight CPU loop. Io is a program that reads every file on the system (that is accessible to the UID of the process). The io program is thus very IO bound, and not at all CPU bound.

A sample configuration file, spawn.conf, has also been provided. It's contents are:

9000 piglet
9000 piglet
9000 piglet
9001 piglet
9001 piglet

When spawn reads this file it will launch five copies of the piglet program, three as user 9000 and two as user 9001. (These UIDs were chosen because they do not conflict with any real users on the SPL machines.) Here is output from the monitor program after launching spawn with this configuration file:

Elapsed time: 2.12 seconds

zahorjan: 0% ( 0%, 0%)

( csh) 7760 0% ( 0%, 0%)

( csh) 7847 0% ( 0%, 0%)

( monitor) 7868 0% ( 0%, 0%)

User 9000: 60% ( 60%, 0%)

( piglet) 7840 19% ( 19%, 0%)

( piglet) 7841 20% ( 20%, 0%)

( piglet) 7842 20% ( 20%, 0%)

User 9001: 39% ( 39%, 0%)

( piglet) 7843 19% ( 19%, 0%)

( piglet) 7844 19% ( 19%, 0%)

(The two new user names show up as "User UID" because there is no actual user name for those UIDs on that system.)

Exactly what spawn does is a tiny bit complicated, but it's designed to work almost whatever it is you're trying to do. So, you don't have to understand the next paragraph - it should just work.

The complication has to do with where (a) the configuration file is, and (b) where the executables named in the configuration file are. The rules are this:

If no configuration file is given on the command line, the file is assumed to exist in the same directory as the spawn executable, and to be named spawn.conf.

If a configuration file name is given, if the path is absolute (e.g., ~/FairShareProject/myspawn.conf), that's used. If the path is not absolute (e.g., myspawn.conf), it's relative to the current working directory.

Executable file names in the configuration file can be either absolute path names or relative names. If relative, they are relative to the directory in which the configuration file was found.

If you're working in the SPL, you should create your own copy of the configuration file; do not edit the copy in /root/bin as that will confuse other people who are doing the same wrong thing. You will then need to name your configuration file on the command line when invoking spawn.

One last thing. Because we had quite a few problems with server processes started by people but never killed on the last assignment, the spawn program includes one additional function: it kills everything it has spawned after 2 minutes. This is intended to be a convenience, in case you forget to kill something that is either in a tight CPU loop or reading the entire file system. It is easy to change the 2 minute interval to something else, if you want, or to defeat this entirely.

One other last thing. There is absolutely no reason to run piglet or io on any machines other than those in the SPL (or your own, if you're working at home), and you should never run them on any shared machine. This includes all four general purpose instructional machines (ceylon, fiji, sumatra, and tahiti), as well as greer and baughm.

Where to Find the Utilities

The utilities have already been installed on all machines in the SPL in /root/bin.

Additionally, the source is available from /cse/courses/cse451/00au/FairShareProject/FairShare.tar. The source is there primarily for those people working on machines outside the lab, but you should feel free to use it to customize the tools as you like. The tar file expands into a number of directories. There is a README.txt file, but, to recapitulate, there is a makefile in each subdirectory that will deposit an executable in the parent directory of the subdirectories.

What to Hand In

Note: Part A could well be the more difficult of the two parts. DO NOT wait until the weekend to start on this.

Part A, Due November 6

You should hand in a write-up that explains three things:

What does the current Linux scheduler do?

What are the objectives of your fair-share scheduler?

What are your plans to modify it to achieve fair-share scheduling?

For the first part, you should answer (at least) these questions:

What is the current scheduling mechanism? Be specific and thorough. (It should be possible to re-implement the behavior of the current scheduler from your description, which should be a kind of specification.)

Is starvation possible with the current mechanism? If so, give an example of how it might occur.

Suppose that all processes on the system are scheduled using SCHED_OTHER, and that all are in tight CPU loops. Give an expression that indicates the fraction of CPU time that will be allocated to a single process, as a function of its static ("renice") priority. (Note: The answer to this depends on at least one Linux configuration parameter.)

Is there aging? That is, are the priorities of processes that have low recent CPU consumption raised to avoid effective starvation?

Are IO bound processes (those whose ratio of IO to CPU use is high) given any kind of favoritism?

For the second part, tell us how you want to allocate CPU among processes (a) that all belong to a single user, and (separately) (b) that are not in tight CPU loops. That is, tell us what you want "fair-share" to mean. It is okay, expected even, that the answers to some of these questions can be answered best by referring to the implementation (just as exactly what the current scheduler does can be defined fully only by its implementation).

For the third part, your description should be "more than a specification": it should be a specification (sufficient for some other trained person to implement from), plus it should identify specific source files to be modified.

We will try to read at least the second part of all reports as quickly as possible, and to provide feedback in time to affect what you do in Part B of this assignment. However, this cannot be guaranteed. If you are anxious to have feedback, submit your write-up (as a Word file, say), by e-mail, before the weekend. There are still no guarantees about turnaround time, though.

Note: I could be wrong. Maybe Part B is the more difficult of the two parts. DO NOT wait until the weekend to start on this.

Part B, Due November 13

Implement your fair-share scheduler. Hand in:

A copy of your answer to question 2 from Part A of this assignment.

A short write-up that describes how what you ended up doing differed from your plans from Part A.

A short description that indicates why you think your scheduler is doing what you want. (This should include the results of some experiments.)

A copy of any source code (.c and .h files) that you changed. To keep this as small as possible, just give us complete routines (rather than files) for executable code changes, and lists of additions/deletions from .h files. (Be sure to identify the file that each such excerpt comes from.)

Make sure to keep a copy of your final bzImage somewhere THAT WE CAN GET TO. (Make sure we, who aren't you, have permission to read the file.) Let us know where it is.

zahorjan:		100%	( 99%, 0%)
( tcsh)	18122	0%	( 0%, 0%)
( piglet)	18168	99%	( 99%, 0%)
( monitor)	18170	0%	( 0%, 0%)

'q'	Quit the program
<space>	Toggle between real-time and CPU time modes

zahorjan:		0%	( 0%, 0%)
( csh)	7760	0%	( 0%, 0%)
( csh)	7847	0%	( 0%, 0%)
( monitor)	7868	0%	( 0%, 0%)

User 9000:		60%	( 60%, 0%)
( piglet)	7840	19%	( 19%, 0%)
( piglet)	7841	20%	( 20%, 0%)
( piglet)	7842	20%	( 20%, 0%)

User 9001:		39%	( 39%, 0%)
( piglet)	7843	19%	( 19%, 0%)
( piglet)	7844	19%	( 19%, 0%)