mmap/munmap/msync
fsync revisited
Corruption: block checksum mismatch or missing start of fragmented record(2)
use gdb to find the bug?
LevelDB: key-value store
by Jeff Dean & Sanjay Ghemawat at Google
used in Bitcoin, Chrome, Facebook (RocksDB), …
operations: put(“alice”, 10); put(“bob”, 20); put(“alice”, 30)
in-memory:
on-disk (simplified):
void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t len);
int msync(void *addr, size_t len, int flags);
map a file into virtual address space
pointer read/write → file read/write
int fd = ...; // open a file
size_t len = ...; // file length
char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
printf("%c\n", p[len - 1]); // read the last byte
*p = 'c'; // write the first byte, "dirty"
munmap(p, len);
print file content to screen using mmap
how about using read instead of mmap?
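One possible answer to the exercise above, as a minimal sketch rather than a reference solution. The helper names cat_mmap, cat_read and write_all are hypothetical; both variants copy a file to an output descriptor (pass STDOUT_FILENO to print to the screen), and EINTR handling is left out for brevity.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to fd, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy the file at path to out_fd through a read-only mapping. */
int cat_mmap(const char *path, int out_fd) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    int rc = 0;
    if (st.st_size > 0) {               /* mmap rejects a zero length */
        size_t len = (size_t)st.st_size;
        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }
        rc = write_all(out_fd, p, len); /* the mapping is read like an array */
        munmap(p, len);
    }
    close(fd);
    return rc;
}

/* Same result with read() into a fixed-size buffer. */
int cat_read(const char *path, int out_fd) {
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    char buf[4096];
    ssize_t n;
    int rc = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        if (write_all(out_fd, buf, (size_t)n) < 0) { rc = -1; break; }
    }
    if (n < 0) rc = -1;
    close(fd);
    return rc;
}
```

usage: cat_mmap("notes", STDOUT_FILENO); the read version needs a loop and a buffer, while the mmap version needs the file size up front.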
leveldb uses mmap/munmap to append data to file
OS X’s munmap didn’t flush the tail to disk: corruption
blog author: msync to force kernel to flush dirty pages to disk
leveldb authors: avoid using mmap/munmap altogether
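The blog author's suggestion can be sketched as follows. This is not LevelDB's actual code: append_mmap is a hypothetical helper, and remapping the whole file on each append is a simplification chosen to keep the offset page-aligned. The point is the msync call before munmap.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append len bytes to the file at path through a shared mapping,
 * calling msync() before munmap() so dirty pages are written back. */
int append_mmap(const char *path, const char *data, size_t len) {
    assert(path != NULL && data != NULL);
    if (len == 0) return 0;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    off_t old_size = st.st_size;
    off_t new_size = old_size + (off_t)len;
    /* Grow the file first: a mapping cannot extend the file by itself. */
    if (ftruncate(fd, new_size) < 0) { close(fd); return -1; }
    /* Map from offset 0 so the offset is trivially page-aligned. */
    char *p = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }
    memcpy(p + old_size, data, len);    /* the "dirty" write */
    /* Do not rely on munmap() to flush: ask for a synchronous write-back. */
    int rc = msync(p, (size_t)new_size, MS_SYNC);
    if (munmap(p, (size_t)new_size) < 0) rc = -1;
    if (close(fd) < 0) rc = -1;
    return rc;
}
```

usage: append_mmap("log", "abc", 3); returns 0 on success, -1 on error.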
q: why no problem on Linux?
MAP_SHARED . . . The file may not actually be updated until msync(2) or munmap() is called.
– Linux “man mmap”
Bitcoin / leveldb / Linux / OS X / POSIX / others?
If the mapping maps data from a file (MAP_SHARED), then the memory will eventually be written back to disk if it’s dirty. This will happen automatically at some point in the future (implementation dependent). Note: to force the memory to be written back to the disk, use msync(2).
– OS X manpage
The root cause of the bitcoin data corruption bug was that different operating systems made different promises to callers of the same syscalls. While Linux (which leveldb was probably developed and tested on) made a contractual promise to never discard unmapped dirty pages - which the user was not currently writing to, but were not synced/flushed with the disk yet (hence the name ‘dirty’), Mac OS X never made that promise. Therefore, the same code that run correctly on Linux will sometimes not work on Mac OS X … And in this case, unfortunately, POSIX does not make a promise to not discard pages after munmap is called on them.
– Roee Avnon
multiple threads caused the bug?
didn’t invoke the fwrite syscall? fsynch? fsync? sync?
user-space cache not flushed to kernel?
neither I nor Alice has a Mac.
fd = create “notes.tmp” under “/home/alice”
write(fd, content, length)
close(fd)
rename(“notes.tmp”, “notes”)
q: do we need fsync(fd) before close?
see also: ext4 and data loss
FYI: google “Eat My Data: How everybody gets file I/O wrong”
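The write-then-rename steps above, with an fsync before close, might look like this in C. It is a sketch under assumptions: save_file and its parameters are hypothetical names, and it does not fsync the containing directory, which some setups may also need for the rename itself to be durable.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>    /* rename */
#include <unistd.h>

/* Write content to tmp_path, force it to disk, then rename it over
 * final_path, so final_path holds either the old or the new content. */
int save_file(const char *tmp_path, const char *final_path,
              const char *content, size_t length) {
    assert(tmp_path != NULL && final_path != NULL && content != NULL);
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    while (length > 0) {                /* handle short writes */
        ssize_t n = write(fd, content, length);
        if (n < 0) { close(fd); unlink(tmp_path); return -1; }
        content += n;
        length -= (size_t)n;
    }
    /* Without this, the rename may reach disk before the data does. */
    if (fsync(fd) < 0) { close(fd); unlink(tmp_path); return -1; }
    if (close(fd) < 0) { unlink(tmp_path); return -1; }
    if (rename(tmp_path, final_path) < 0) { unlink(tmp_path); return -1; }
    return 0;
}
```

usage: save_file("/home/alice/notes.tmp", "/home/alice/notes", content, length);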
q: call fsync after every write?
works, but has performance issues.
C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much much easier to generate total and utter crap with it.