Lecture: Concurrency and scheduling
preparation
read the xv6 book: §6, Locking and §7, Scheduling
take a look at lab lock
overview
multicore CPUs
how to use multicore CPUs correctly and efficiently
much of systems correctness and performance comes down to this
this class focuses on mechanisms: locks, threads, and scheduling
locks
review pthread from CSE 333
example: multi-threaded hash table (ph.c)
why are some keys missing?
data races: concurrent put()s
all try to set table[0]->next
one winner; the other insertions are lost
race detection using ThreadSanitizer
how to fix the bug - use lock(s) to protect put()
coarse-grained: one lock for the entire table
fine-grained: per-bucket locks, or even per-entry
trade-off
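The fix above can be sketched with pthreads. The table layout below is an assumption modeled on the lab's ph.c (NBUCKET and struct entry are illustrative, not the lab's exact code); it shows the fine-grained, per-bucket variant:

```c
#include <pthread.h>
#include <stdlib.h>

#define NBUCKET 5

struct entry {
    int key, value;
    struct entry *next;
};

static struct entry *table[NBUCKET];
static pthread_mutex_t bucket_lock[NBUCKET];   // fine-grained: one lock per bucket

void ht_init(void)
{
    for (int i = 0; i < NBUCKET; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

void put(int key, int value)
{
    int b = key % NBUCKET;
    struct entry *e = malloc(sizeof *e);
    e->key = key;
    e->value = value;
    pthread_mutex_lock(&bucket_lock[b]);   // serialize puts to this bucket only
    e->next = table[b];                    // the racy read-modify-write, now protected
    table[b] = e;
    pthread_mutex_unlock(&bucket_lock[b]);
}
```

A coarse-grained variant replaces the lock array with a single mutex for the whole table: simpler, but puts to different buckets then serialize needlessly.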
why __sync_fetch_and_add and __sync_bool_compare_and_swap for done?
would done += 1; while (done < nthread); work?
gcc -O2: infinite loop - why?
how about get()?
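The done-counter questions above can be sketched with host pthreads (a sketch, not the lab harness). The plain version fails twice over: done += 1 is a non-atomic read-modify-write, so increments can be lost, and under gcc -O2 the compiler may hoist the load of done out of the loop and spin forever. The __sync builtins fix both:

```c
#include <pthread.h>

#define NTHREAD 4

static int done;   // shared barrier counter

// broken alternative: done += 1; while (done < NTHREAD);
//   - the increment can be lost (non-atomic read-modify-write)
//   - gcc -O2 may load `done` once and never re-read it: infinite loop
static void *worker(void *arg)
{
    (void)arg;
    __sync_fetch_and_add(&done, 1);   // atomic increment: no lost updates
    // the compare-and-swap is an atomic access, forcing a fresh load each time
    while (__sync_bool_compare_and_swap(&done, NTHREAD, NTHREAD) == 0)
        ;
    return 0;
}
```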
locks
mutual exclusion: only one core can hold a given lock
“serialize” critical section: hide intermediate state
example: transfer money from account A to B
put(a - 100) and put(b + 100) must both take effect, or neither
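A sketch with a plain mutex (the account variables and amounts are illustrative, not from the lab code): the critical section hides the intermediate state where the money has left one account but not yet arrived in the other.

```c
#include <pthread.h>

static int balance_a = 1000, balance_b = 1000;
static pthread_mutex_t accounts_lock = PTHREAD_MUTEX_INITIALIZER;

// move `amount` from A to B; to any other thread, both writes
// happen together or not at all
void transfer(int amount)
{
    pthread_mutex_lock(&accounts_lock);
    balance_a -= amount;   // intermediate state: the total looks wrong here,
    balance_b += amount;   // but no one can observe it while we hold the lock
    pthread_mutex_unlock(&accounts_lock);
}
```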
lock implementation
strawman
try it on ph.c: what can go wrong?
struct lock { int locked; };

void acquire(struct lock *l)
{
  for (;;) {
    if (l->locked == 0) {   // A: test
      l->locked = 1;        // B: set
      return;
    }
  }
}

void release(struct lock *l)
{
  l->locked = 0;
}
atomic exchange: combine test and set into one atomic step
gcc’s __sync builtins
__sync_lock_test_and_set(ptr, value): atomically write value to ptr and return the old value
__sync_lock_release(ptr): write 0 to ptr
show assembly code of acquire(): the xchg instruction
if l->locked was 1, xchg sets it to 1 again & returns 1
if l->locked was 0, at most one xchg would see & return 0
alternatives
RISC-V: the amoswap.w instruction
C11 atomics, assembly
the problem is pushed down to hardware
guess how xchg is implemented
understand the performance overhead
void acquire(struct lock *l)
{
  while (__sync_lock_test_and_set(&l->locked, 1) != 0)
    ;
}

void release(struct lock *l)
{
  __sync_lock_release(&l->locked);
}
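The test-and-set lock above can be exercised on the host with pthreads (repeated here so the sketch is self-contained). Without the lock, the concurrent counter++ would lose updates:

```c
#include <pthread.h>

struct lock { int locked; };

void acquire(struct lock *l)
{
    // loop until the returned old value is 0, i.e., we made the 0 -> 1 transition
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
        ;
}

void release(struct lock *l)
{
    __sync_lock_release(&l->locked);   // write 0, with release semantics
}

static struct lock lk;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        acquire(&lk);
        counter++;      // critical section: one thread at a time
        release(&lk);
    }
    return 0;
}
```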
threads
why do we want to run multiple tasks?
review processes and context switching from CSE 351
time sharing among multiple users/programs/subroutines
better utilization of multicore CPUs
this can be achieved with several techniques
buy more CPUs, each running a different task, assuming # of tasks <= # of CPUs
state machines / event-driven programming
thread: an abstraction of execution context (CPU registers)
each thread thinks it has a dedicated, virtual CPU
a physical CPU switches between threads, running one at a time
main challenges:
save & restore thread state
scheduling: choose which thread to run next
prevent threads from hogging the CPU
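The save & restore challenge can be illustrated at user level with POSIX ucontext (a sketch for intuition only; xv6 does this job in kernel/swtch.S with far less machinery, and ucontext is deprecated on some platforms, e.g. macOS, though available on Linux/glibc). swapcontext() saves the current registers into one context and loads another, which is exactly a thread switch:

```c
#include <string.h>
#include <ucontext.h>

static ucontext_t main_ctx, thr_ctx;
static char thr_stack[64 * 1024];   // the "thread" needs its own stack
static char log_buf[16];

static void worker(void)
{
    strcat(log_buf, "W1 ");
    swapcontext(&thr_ctx, &main_ctx);   // save our state, resume main
    strcat(log_buf, "W2 ");             // we continue here when switched back
}

// runs worker, interleaving with the "main thread"; returns the event log
const char *run_demo(void)
{
    getcontext(&thr_ctx);
    thr_ctx.uc_stack.ss_sp = thr_stack;
    thr_ctx.uc_stack.ss_size = sizeof thr_stack;
    thr_ctx.uc_link = &main_ctx;        // where to go when worker() returns
    makecontext(&thr_ctx, worker, 0);

    swapcontext(&main_ctx, &thr_ctx);   // run worker until it yields
    strcat(log_buf, "M1 ");
    swapcontext(&main_ctx, &thr_ctx);   // resume worker until it returns
    strcat(log_buf, "M2");
    return log_buf;                     // "W1 M1 W2 M2"
}
```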
thread plan in xv6
2 threads per process
1 user thread and 1 kernel thread
each process is either running its user thread or handling traps in its kernel thread
each thread is either running on exactly one core (cannot run on two cores at the same time),
or is not running (saved)
1 scheduler thread per core
each core is either running its scheduler thread, or some kernel thread
scan the process table to find a RUNNABLE thread, or idle if none
(e.g., scheduler() in kernel/proc.c)
memory isolation & sharing
kernel threads share kernel memory
user threads use their own page tables
we often use “process” and “thread” interchangeably
thread (context) switches in xv6
user → kernel
kernel thread → scheduler thread
scheduler thread → kernel thread
kernel → user
+-----+ +-----+
| sh | | cat |
+-----+ +-----+
| ^
user | |
==================================
kernel | |
v |
+-+ +-+ +-+
| | swtch | | swtch | |
| | ----> | | ----> | |
| | | | | |
+-+ +-+ +-+
kstack kstack kstack
sh scheduler cat
struct proc (kernel/proc.h)
p->trapframe: holds the saved user thread’s registers
p->context: holds the saved kernel thread’s registers
p->kstack: points to the thread’s kernel stack
p->state: RUNNING, RUNNABLE, SLEEPING
p->lock: protects p->state (and other things)
struct cpu (kernel/proc.h)
c->proc: the process running on this CPU (null if none)
c->context: holds the saved scheduler thread’s registers
demo: thread switches using sleep (lab util)
window 1: make qemu-gdb CPUS=1
window 2: make gdb
window 2: type b sys_sleep
window 2: press c to continue until window 1 runs to shell
window 1: run sleep 10 in the xv6 shell
window 2: hit breakpoint at sys_sleep (kernel/sysproc.c)
window 2: use layout split to display both assembly and C source (if available)
window 2: use n/s to run into sleep() (kernel/proc.c)
window 2: use n/s to run into sched() (kernel/proc.c)
window 2: set a breakpoint at swtch using b swtch
window 2: use n to run into swtch() (kernel/swtch.S) and si to single-step
what are the arguments to this function? a0 (old context to save to) and a1 (new context to switch to)
where will ret return to? scheduler() (kernel/proc.c)
kernel thread → scheduler thread:
we entered from the swtch call in sched() and now emerged after the swtch call in scheduler()
window 2: run to the line p->state = RUNNING; using until proc.c:<lineno>
window 2: again run into swtch() (kernel/swtch.S) and si to single-step
where will ret return to this time? sched() (kernel/proc.c)
scheduler thread → kernel thread
questions
swtch() doesn’t save or restore the program counter. how does it know where to resume execution?
swtch() saves only a subset of RISC-V registers (ra, sp, s0–s11). what about the other registers?
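Both questions are answered by what swtch() saves. The layout below follows struct context in xv6-riscv's kernel/proc.h (with a host typedef standing in for xv6's uint64): ra is a saved return address, so swtch() "restores the PC" simply by returning through the new context's ra; and only callee-saved registers appear because the C calling convention already obliges the callers of swtch(), sched() and scheduler(), to save any caller-saved registers they still need.

```c
#include <stdint.h>

typedef uint64_t uint64;   // stand-in for xv6's kernel/types.h typedef

// saved registers for kernel context switches (cf. kernel/proc.h)
struct context {
  uint64 ra;    // return address: swtch() returns through the *new* ra
  uint64 sp;    // stack pointer: switching sp switches kernel stacks
  // callee-saved registers only; caller-saved registers were already
  // saved by swtch()'s caller, per the RISC-V calling convention
  uint64 s0;  uint64 s1;  uint64 s2;  uint64 s3;
  uint64 s4;  uint64 s5;  uint64 s6;  uint64 s7;
  uint64 s8;  uint64 s9;  uint64 s10; uint64 s11;
};
```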
scheduling
cooperative scheduling for kernel threads
threads give up control by yielding
two switches: thread 1 → scheduler → thread 2
scheduler() (kernel/proc.c)
swtch() (kernel/swtch.S)
device interrupts
workflow
an interrupt tells kernel that the device wants attention (e.g., a key is pressed)
kernel handles the interrupt (e.g., reading characters)
SiFive U54 Manual (see readings), Figure 2: U54 Interrupt Architecture Block Diagram
CLINT (Core-Local Interruptor): timer interrupts, software (interprocessor) interrupts
PLIC (Platform-Level Interrupt Controller): external interrupts
see also: Intel 82093AA, Figure 2, I/O and Local APIC Units
preemptive scheduling for user threads
how to force a thread to give up control?
timer interrupt (every 100 ms): user → kernel
usertrap()/kerneltrap() (kernel/trap.c) → yield() (kernel/proc.c) → sched() (kernel/proc.c)
switch to a different thread, then kernel → user
demo: thread switches using spin (see C code below; show assembly code)
window 1: make qemu-gdb CPUS=1
window 2: make gdb
window 2: press c to continue until window 1 runs to shell
window 1: run spin in the xv6 shell
window 2: pause using Ctrl-c
window 2: set a breakpoint at the line clockintr(); in devintr() using b trap.c:<lineno>
window 2: press c to resume until it hits the breakpoint
window 2: use layout split to display both assembly and C source (if available)
window 2: use finish to return to usertrap() (kernel/trap.c)
window 2: show the current PID using p p->pid
window 2: use n/s to run into yield() (kernel/proc.c)
window 2: use n/s to run into sched() (kernel/proc.c)
the rest is similar to the sleep demo
now you have a complete picture of how the kernel works
user space is running process A
process A traps into the kernel upon timer (preemptive)
the kernel switches from process A to the scheduler (cooperative)
the scheduler switches to process B
process B returns to user space to resume execution
discussion
what’s the scheduling policy of xv6 (i.e., how does the scheduler decide which process to run given multiple runnable ones)?
is this a good policy? if not, how would you improve it?
can timer interrupts fire in the xv6 kernel? is this a good design choice?
// user/spin.c
#include "kernel/types.h"
#include "kernel/stat.h"
#include "user/user.h"

int
main(int argc, char *argv[])
{
  int pid;

  pid = fork();
  if (pid < 0) {
    printf("fork failed\n");
    exit(1);
  }
  while (1) {}
  exit(0);
}
+-------+ +-------+
| spin | | spin |
| pid=3 | | pid=4 |
+-------+ +-------+
| ^
user | |
==================================
kernel | |
v |
+-+ +-+ +-+
| | swtch | | swtch | |
| | ----> | | ----> | |
| | | | | |
+-+ +-+ +-+
kstack kstack kstack
spin(3) scheduler spin(4)
Q&A
spinlocks revisited
hold for very short times; don’t yield the CPU while holding a lock
xv6: kernel/spinlock.h, kernel/spinlock.c
spin on locked using an atomic instruction
what are push_off() and pop_off()?
Linux kernel’s spinlocks
alternative: ticket spinlocks
struct lock { _Atomic int next, now_serving; };

void acquire(struct lock *l)
{
  int ticket = atomic_fetch_add(&l->next, 1);
  while (l->now_serving != ticket) {}
}

void release(struct lock *l)
{
  ++l->now_serving;
}
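The ticket lock can be exercised on the host with C11 atomics and pthreads (the tlock_* names are mine, to avoid clashing with the snippet above). Unlike the test-and-set spinlock, tickets are served FIFO, so no thread can starve:

```c
#include <pthread.h>
#include <stdatomic.h>

struct tlock { _Atomic int next, now_serving; };

void tlock_acquire(struct tlock *l)
{
    int ticket = atomic_fetch_add(&l->next, 1);   // take the next ticket
    while (atomic_load(&l->now_serving) != ticket)
        ;                                         // spin until it's our turn
}

void tlock_release(struct tlock *l)
{
    atomic_fetch_add(&l->now_serving, 1);         // serve the next ticket
}

static struct tlock lk;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        tlock_acquire(&lk);
        counter++;          // protected, FIFO-ordered critical section
        tlock_release(&lk);
    }
    return 0;
}
```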
other concerns & approaches