Lecture: locking

Preparation

Read OSPP §5, Synchronizing Access to Shared Objects.
Exercise: memory models due next Monday morning & lab 4 out

Multi-threaded hash table

ph.c (html): put(), get()
- recall 333 lab: divide work for parallel speedup
- why missing keys
  - data races: concurrent put()s
  - example: all try to set table[0]->next - one winner, others lose
detection using ThreadSanitizer
- add -fsanitize=thread to gcc/clang compile options
- see the ThreadSanitizer paper and the Eraser paper
use lock(s) to protect put()
- coarse-grained: one lock for the entire table
- fine-grained: per-bucket locks, or even per-entry
- trade-off
why atomic_fetch_add and atomic_load
- would done += 1; while (done < nthread); work?
- gcc -O2: infinite loop - why?
how about get(), or del()?
other approaches?

Locks

mutual exclusion: only one core can hold a given lock
- concurrent access to the same memory location, at least one write
- example: acquire(l); x = x + 1; release(l);
“serialize” critical section: hide intermediate state
- another example: transfer money from account A to B
- put(a + 100) and put(b - 100) must be both effective, or neither

Dead locks

assume per-bucket lock
- acquire the lock for bucket 1 and then the lock for bucket 2
- write two values
- release both blocks
deadlock
- thread 1: lock bucket 1; lock bucket 2
- thread 2: lock bucket 2; lock bucket 1
approach
- programmers enforce partial order over locks
- always grab locks in pre-defined order

Lock implementation

strawman

struct lock { int locked; };

void acquire(struct lock *l)
{
  for (;;) {
    if (l->locked == 0) { // A: test
      l->locked = 1;      // B: set
      return;
    }
  }
}

void release(struct lock *l)
{
  l->locked = 0;
}

hw: draw cores, caches, bus, RAM
try it on ph.c: what can go wrong
atomic exchange: combine test and set into one atomic step

struct lock { _Atomic int locked; };

void acquire(struct lock *l)
{
  for (;;) {
    if (atomic_exchange(&l->locked, 1) == 0)
      return;
  }
}

show assembly code: xchg instruction
- if l->locked was 1, set it to 1 again & return 1
- if l->locked was 0, at most one xchg would see & return 0
the problem is pushed down to hardware
- guess how xchg is implemented
- understand the performance overhead
- how would you design a scalable hashtable
xv6/JOS: see spinlock.c