Section Notes - 1/23/03

Section Outline:
  • Project 1 questions answered
  • pthreads discussion: key functions, discussion of pthread_join; check man -k pthreads (a minimal example follows this outline)
  • Introduction to synchronization
  • Synchronization exercises
  • Atomic instructions
  • Producer-Consumer example
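
Since the outline mentions the key pthreads functions, here is a minimal
pthread_create/pthread_join sketch (not from the original notes; compile with
gcc ... -lpthread and see the man pages for the full signatures):

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
   printf("hello from thread %d\n", *(int *)arg);
   return NULL;
}

int main(void) {
   pthread_t tid;
   int id = 0;
   /* pthread_create(&thread, attr, start_routine, arg) starts the thread */
   pthread_create(&tid, NULL, worker, &id);
   /* pthread_join blocks until that thread finishes */
   pthread_join(tid, NULL);
   return 0;
}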

Introduction to synchronization

   In past discussions, we've seen that resource sharing (e.g. shared application-level data structures, shared memory) is one of the key benefits of threading.  We've also seen the potential for a single misbehaving thread to corrupt the shared resources for every thread.  The same is true for cooperating processes, that is, processes that can affect, or be affected by, other processes in the system.  To ensure correct and (hopefully) efficient cooperation between threads and/or processes, we introduce the idea of synchronization.  We'll now briefly introduce a few formalisms, which are also covered in Ch. 7 and which will come up in Friday's lecture.

race condition

   Suppose multiple processes/threads are executing concurrently and accessing/modifying some shared variable (or data structure, cache, etc.); call the output of this execution 'X'.  We have a race condition when the final value of X depends on the order in which the concurrent processes/threads executed.  Race conditions make our programs unpredictable and potentially incorrect, so we'd like to avoid them.
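
Here is a hypothetical sketch of a race (not from the original notes): two
threads increment a shared counter with no synchronization.  counter++ is
really a read-modify-write, so increments can be lost depending on how the
threads interleave, and the final value changes from run to run.

#include <pthread.h>
#include <stdio.h>

int counter = 0;                      /* shared variable */

void *increment(void *arg) {
   int n;
   for (n = 0; n < 1000000; n++)
      counter++;                      /* read, add 1, write back: not atomic */
   return NULL;
}

int main(void) {
   pthread_t t1, t2;
   pthread_create(&t1, NULL, increment, NULL);
   pthread_create(&t2, NULL, increment, NULL);
   pthread_join(t1, NULL);
   pthread_join(t2, NULL);
   printf("counter = %d\n", counter); /* rarely the expected 2000000 */
   return 0;
}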

critical section

   A section of code in a process where a shared resource is accessed or modified.  We want to 'protect' critical sections (CSs), so execution of a CS must occur in a mutually exclusive fashion.  We have 3 requirements for how we deal with critical sections:

     - Mutual exclusion: at most one process may be executing in its CS at a time
     - Progress: if no process is in its CS, a process that wants to enter must eventually be allowed in
     - Bounded wait: once a process asks to enter its CS, there is a bound on how many times other processes can enter theirs first

A fourth requirement that we sometimes include is performance - it would be nice if our solution was fast as well as correct.


Synchronization exercises

   If the solution to the "critical section" problem is logically structured as follows:

<entry section>
   <critical section>
<exit section>

   Then what are some methods for ensuring that we meet our requirements?

(answers from class):
   - disable interrupts in the entry section
   - use an 'atomic kernel call' to modify the shared resource and/or disable interrupts (in practice, something like a mutex lock; see the sketch after this list)
   - use atomic instructions
   - use a special algorithm
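
As a concrete illustration of the entry/exit structure above, here is how it
looks with a pthread mutex (a hypothetical sketch; roughly the 'atomic kernel
call' answer in practice):

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int shared = 0;                   /* the shared resource */

void update_shared(void) {
   pthread_mutex_lock(&lock);     /* <entry section> */
   shared++;                      /* <critical section> */
   pthread_mutex_unlock(&lock);   /* <exit section> */
}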

As an exercise, we'll look at a few algorithms for synchronization:
In each case, suppose that we have two processes P0 and P1, with id's 0 and 1 respectively, and that 'turn' is initialized to 0.


First algorithm:

int turn = 0;      /* shared: whose turn it is (initialized to 0, as above) */

void P(int id){
   while (true){
      while (turn != id)
         ;               /* Spin until it's our turn */
      <CS>
      turn = 1 - id;     /* hand the turn to the other process */
      <rest of code>
   }
}

Does this solution meet our 3 goals?
   mutual exclusion: yes
   progress: no
   bounded waiting: no

Bounded wait problem:

   Suppose P1 terminates.  Either turn is already 1, or it will be set to 1 soon (when P0 finishes its CS and sets it).  Since P1 will never set turn back to 0, P0 is blocked forever: P1 "takes the turn with it" when it goes.

Progress problem:

   Suppose P0 is really active in the CS, say it wants to be in there 99% of the time, and suppose P1 needs the critical section only 1% of the time.  Then, since P0 and P1 must strictly alternate their access to the critical section (i.e. P0 in CS, P1 in CS, P0 in CS, P1 in CS, ...), P0 must wait much longer than is necessary to use the CS.  In particular, P0 can get into the CS only as often as P1 wants to.


Second algorithm (Hyman's algorithm):

bool blocked[2];
int turn;
void P(int id){
   while(true){
      blocked[id] = true;
      while(turn != id){
         while(blocked[1-id])
            ; /* Spin */
         turn = id;
      }
      <CS>
      blocked[id] = false;
      <rest of code>
   }
}

Does this algorithm meet our 3 goals?
   Perhaps surprisingly, no.  Hyman's algorithm is well-known because even though it is incorrect, its error was subtle enough to fool the Communications of the ACM in 1966 when it was proposed.

Exercise: See if you can construct a counterexample to show that Hyman's algorithm is incorrect
(it should take only a few minutes of thinking).


Atomic Operations

When dealing with any sort of asynchronous code (be it threads, hardware, or some other such system), you need the concept of an atomic operation for shared data. Atomic operations are operations that cannot be interrupted part way through execution. Without them, it is pretty near impossible to ensure consistency inside a data structure (let alone correctness).

Case Study 1: Assignment to a global variable

Assume we have a global variable declared:

int i;

Assume we also have 2 threads, A and B, that execute the following lines of code, A first, then B:

Thread A: i = 3;
Thread B: i = 4;

What is the value of i if read in a third thread C?

The problem here is that you're looking at the problem from "too high a level." When you are working with asynchronous code, you need to know which operations can be assumed to be atomic when accessing an otherwise unprotected (not in a critical section) piece of shared data. In this case, you want to assume that the line "i=3" is atomic: immediately after that line executes, you want to be able to count on the fact that i holds the value 3 and that all other concurrent lines of execution will see 3 when they read i. Unfortunately, on a computer, the granularity at which operations can be considered atomic is a machine instruction, not a line of C source code. Given the C source code line "i=3", you don't know which instructions this generates. So, the answer to the question above is: "You don't know. It could be 3, 4 or some other value."

The problem in depth, or "why the above doesn't necessarily work in a consistent manner"

In short, the compiler may generate code that caches the value of i in a register.

When we think of i as a shared resource, we think of it as a location in memory that two lines of execution may at any point read data from or write data to. If it were true that, when we said i=3, the compiler generated instructions that wrote 3 directly to that memory location every time, and that when we read i, the compiler generated instructions that read from that memory location every time, this model would be valid.

However, let's say the code looked something like this:

i=3;
a = i+2;
if (i == 5) {
...

Reading from memory is really slow compared to reading from a register. So, the compiler may generate code that loads the value of i into a register and just refer to the value in the register for the subsequent instructions. This means we now have 2 versions of i, one in memory, and one in a register.

If we have 2 threads accessing i, then we want the changes made by one thread to be immediately visible to the other. That means we can't let the compiler cache the value of i in a register. If we do let it, thread A may not see the changes that thread B makes, since those changes are stuck in thread B's register context, which is saved away when the scheduler switches to thread A. This means you never really know what the value of i is, and that just sucks when programming. How do we prevent the caching? The best answer is: learn about the machine architecture, and write the code for reading and writing shared variables in assembly. However, using gcc on an x86 architecture, you have another option.
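
Here is a small hypothetical example of the hazard (not from the original
notes).  Compiled with optimization, the waiting loop may never re-read memory:

#include <stddef.h>

int done = 0;                     /* shared flag */

void *waiter(void *arg) {
   /* The compiler is free to load 'done' into a register once and test the
    * register forever, so this loop can spin even after the other thread
    * has written 1 to the memory location. */
   while (done == 0)
      ;                           /* spin */
   return NULL;
}

void *setter(void *arg) {
   done = 1;                      /* the store goes to memory, but the waiter
                                     may never look at memory again */
   return NULL;
}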

Atomic reads and assignments: volatile to the rescue? In the case of gcc on x86, yes.

There are very few keywords that force the compiler to listen to you; volatile is one of them. This keyword tells the compiler to keep its grubby little hands off (in terms of optimization) the variable it describes. That means it can't assume the following if-block will execute.

i=3;
if (i == 3) {
... }

On x86, volatile also has the side effect that values never get cached in registers, since the compiler always tries to fetch the most recent version of i that it can. Sounds ideal? Kind of. The problem with using keywords like this and making assumptions about low-level code generation is that these assumptions may or may not be correct depending on the architecture and the compiler implementation. So, using gcc on an x86 processor, you should be safe to assume that reads and writes of a word-sized volatile variable are atomic. For other architectures and other compilers, I don't know. Thus, looking at the first case study, if the declaration for i were:

volatile int i;
and Thread A executed first and then Thread B, you could know that i was 4.
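
A short sketch of that claim (hypothetical code, under the same gcc-on-x86
assumption):

volatile int i;                   /* word-sized, so each read/write goes
                                     straight to memory */

void thread_A(void) { i = 3; }
void thread_B(void) { i = 4; }    /* runs after thread_A */

/* A third thread reading i after B's store completes sees 4. */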

Case Study 2: One producer and one consumer

Given a single producer and a single consumer that produce and consume data (respectively) on a shared bounded-array queue structure, it is possible to ensure that the producer and consumer do not corrupt each other using atomic operations only (no locking). How can this be done? Hint: try to design the enqueue and dequeue semantics so that each operation only requires updating one variable, plus knowledge of the possibly outdated state of another variable. One possible design is sketched below.
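
Here is one possible sketch (an illustration under the notes' gcc-on-x86
assumptions about word-sized volatile accesses; on other architectures you
would also have to worry about memory barriers).  The producer is the only
writer of 'in', the consumer is the only writer of 'out', and a stale view of
the other side's index can only make the queue look more full or more empty
than it really is, which is safe.

#define QSIZE 16                  /* holds at most QSIZE-1 items; one slot stays empty */

volatile int buf[QSIZE];
volatile int in  = 0;             /* next slot to fill; written only by the producer */
volatile int out = 0;             /* next slot to drain; written only by the consumer */

int enqueue(int item) {           /* called only by the producer */
   int next = (in + 1) % QSIZE;
   if (next == out)               /* looks full; a stale 'out' only delays us */
      return 0;
   buf[in] = item;
   in = next;                     /* a single word-sized write publishes the item */
   return 1;
}

int dequeue(int *item) {          /* called only by the consumer */
   if (in == out)                 /* looks empty; a stale 'in' only delays us */
      return 0;
   *item = buf[out];
   out = (out + 1) % QSIZE;       /* a single word-sized write frees the slot */
   return 1;
}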

Atomic test and set

Having atomic reads and writes only allows you to react to possibly outdated values of an atomic variable in a read-modify-write cycle. This is often not enough, since many times you want to conditionally assign to a variable based on the variable's current value. Such operations are called atomic Test-And-Set (TAS) operations. To be able to perform such an operation, you basically need hardware support. A test and set instruction usually has an API like:

bool test_and_set(compare, value, atomic_value)
where atomic_value gets assigned value if atomic_value is currently equal to compare. The function returns true or false depending on whether or not atomic_value was assigned "value".
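
Those semantics could be sketched in C with GCC's __sync_bool_compare_and_swap
builtin (a builtin from GCC versions newer than these notes, used here purely
for illustration):

#include <stdbool.h>

/* Atomically: if (*atomic_value == compare) { *atomic_value = value; return true; }
 *             else return false;                                                    */
bool test_and_set(int compare, int value, volatile int *atomic_value) {
   return __sync_bool_compare_and_swap(atomic_value, compare, value);
}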

There are 2 methods that are generally used in hardware for such an operation. In one method, there is a single instruction that executes atomically on the processor (the Intel Pentium architecture does this). In the other method, there is a pair of coupled load and store instructions, so that if the memory bus is used to modify a particular address between the load and the store, the store instruction fails (the PowerPC does this).

On i586, they have a cmpxchg (compare and exchange) instruction that takes a location in memory (atomic_value), compares it with a number (compare) stored in the register EAX, and if they are equal, assigns atomic_value <- value.

On PPC they have 2 instructions: lwarx (Load Word And Reserve Indexed) and stwcx (Store Word Conditional Indexed). If a value is loaded from memory using lwarx, the memory bus is instructed to "snoop" for accesses to that memory location, so the processor knows if that value gets changed by someone else. When stwcx executes, it checks whether the location has been modified since the lwarx; if it has not, the store succeeds, otherwise it fails.

Atomic TAS operations are very powerful for synchronization. However, keep in mind that atomic operations are generally more complex than their non-atomic counterparts, so they are often slower (different instructions can take different amounts of time to execute). It may be worth your time to think about which classes of functions/operations can be implemented with which atomic operations, and which classes can get away without using them. As one illustration, a spin lock built from a TAS operation is sketched below.
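
For example, here is a minimal spin lock built from an atomic test-and-set,
sketched with GCC's __sync_lock_test_and_set and __sync_lock_release builtins
(again, these builtins postdate these notes and are just one way to reach the
hardware instruction):

volatile int lock = 0;                    /* 0 = free, 1 = held */

void acquire(volatile int *l) {
   /* Atomically write 1 and get the previous value.  If the previous value
    * was already 1, someone else holds the lock, so spin and retry. */
   while (__sync_lock_test_and_set(l, 1) == 1)
      ;                                   /* spin */
}

void release(volatile int *l) {
   __sync_lock_release(l);                /* atomically write 0: lock is free */
}

acquire() and release() then serve as the <entry section> and <exit section>
from the critical-section discussion earlier.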