% Replay

# this week

- today: tool
- Wednesday: system-tool co-design
- "On building systems that will fail" (CACM'91)

# project history

- WiDS 1: P2P testing and simulation (HotOS'05)
- WiDS 2: replay-based predicate checking (NSDI'07)
- WiDS 3: R2 (OSDI'08)
- online checking (D3S, NSDI'08)
- model checking (MoDist, NSDI'09)
- MPI replay (MPIWiz, PPoPP'09)
- interface cut (Altair, FSE'09)
- min log cut (iTarget, FSE'10)
- query language (G2, ATC'11)
- graph cut for datacenter apps (PeriSCOPE, OSDI'12)

# overview

- something goes wrong: want to know why
- reconstruct past execution: from crash point to root cause
- often use logs during normal run
- goal 1: make replay run identical to normal run
- goal 2: minimize logging overhead (size/slowdown)
- easy for purely functional programs; hard in general
- non-determinism: time, IO, scheduling, code, ...

# example applications

- time-traveling debugging
- auditing & intrusion detection
- primary-backup replication

# approaches

- security camera, flight recorder (black box)
- hardware: perf counters, FDR1 (ISCA'03)
- Windows Error Reporting (SOSP'09)
- event tracing: Windows ETW, LTTng, SystemTap
- app/protocol: `/var/log/httpd/*log`, Replayer (CCS'06)
- survey on earlier work: "A taxonomy of distributed debuggers based on execution replay"

# approaches (cont)

- virtual machine: ReVirt (OSDI'02), IntroVirt (SOSP'05)
- process: ptrace, rr, Flashback (ATC'04)
- library: liblog (ATC'06), Jockey (AADEBUG'05)
- distributed systems: Friday/Mace/WiDS (NSDI'07)
- symbolic execution: bbr (FSE'09), ODR (SOSP'09)
- optimistic retry: PRES (SOSP'09), Eve (OSDI'12)
- language: DejaVu (Java, OOPSLA'00), Mugshot (JS, NSDI'10)
- misc: G2 (ATC'11), Arnold (OSDI'14)

# R2: flexible replay interfaces

- single/multiple processes, multiple nodes (MPIWiz, PPoPP'09)
- apply kernel concepts: isolation, transfer (down-up calls), scheduling
- correctness? logging overhead? programming effort?
- program partitioning: iTarget (FSE'10), Wedge (NSDI'10)

# security

- integrity: attacker's interference?
- privacy: log sensitive data?
- see also: Castro et al. (ASPLOS'08)

# notes on Windows

- microkernel & async IO
- [draw architecture]
- `printf("hello world!")`
    - on Linux: blocking `write` syscall
    - on Windows: WriteFile → WriteConsole → (IPC) → CSRSS.exe
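
# sketch: record and replay

The two goals from the overview (replay identical to the normal run, log as little as possible) can be illustrated with a minimal Python sketch. It is not taken from R2 or any other system listed in these slides; `Recorder`, `Replayer`, `app`, and the `env.call` interface are hypothetical names chosen for illustration. The assumption is that the application routes every non-deterministic call (here: time and randomness) through `env`, so only those results need to be logged.

```python
import random
import time


class Recorder:
    """Runs non-deterministic calls for real and appends each result to a log."""

    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        value = fn(*args)
        self.log.append((name, value))
        return value


class Replayer:
    """Returns logged results in order instead of running the calls again."""

    def __init__(self, log):
        self.log = list(log)
        self.pos = 0

    def call(self, name, fn, *args):
        logged_name, value = self.log[self.pos]
        if logged_name != name:
            # The replay run asked for something the recorded run did not.
            raise RuntimeError(f"replay diverged: expected {logged_name}, got {name}")
        self.pos += 1
        return value


def app(env):
    """Toy application: all non-determinism goes through env.call."""
    start = env.call("time", time.time)
    roll = env.call("randint", random.randint, 1, 6)
    return f"started at {start:.3f}, rolled {roll}"


recorder = Recorder()
original = app(recorder)                 # normal run, with logging
replayed = app(Replayer(recorder.log))   # replay run, fed from the log
print(original == replayed)
```

The deterministic parts of `app` (the string formatting) are simply re-executed, which is what keeps the log small; where to draw that line is the interface question raised on the R2 slide.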