Persistent Storage: Hard Drive & Solid State Drive
Storage Devices
- persistent/nonvolatile, retains data after power down
- Hard Drive (HDD) / Spinning Disk
- large capacity at low cost, block level access, not byte addressable
- physical motion needed to read and write, millisecond access latency
- Solid State Drive (SSD)
- large capacity at intermediate cost (~3x HDD), block level access, not byte addressable
- no physical moving parts, microsecond access latency
- file system abstraction is built on top of storage devices
Disk Anatomy
- What do spinning disks look like?
- Anatomy of spinning disks
- head: flies just above the platter surface (~3 nm gap), reads data from and writes data to disk sectors
- sector: unit of reads and writes on disk, 512 bytes, contains error correcting code
- track: length varies across disk, outer tracks have more sectors
- separated by unused guard regions to reduce likelihood of neighboring corruption
- typically only the outer half of the radius is used, since most sectors are on the outer tracks
- What happens on a disk read?
- a read request is sent to disk
- find the right platter and surface, arm moves to the right track
- head reads while the disk spins (desired sector will spin under the head)
- transfer read data back to the host
- Disk Performance
- total time = seek time + rotation time + transfer time
- seek time: time to move disk arm over the desired track (1-20ms)
- rotation time: time for the desired sector to rotate under the disk head (based on RPM, 4-15ms)
- eg. 7200 RPM = 120 RPS = 0.12 rotations per ms = 8.3 ms per rotation
- reasonable to assume it takes half a rotation on average to get to the desired sector, so 8.3/2 = ~4.2 ms
- transfer time: time to transfer data onto/off the disk (based on disk bandwidth, often a few us per sector)
- eg. 100 MB/s bandwidth = 100 * 1000000 B/s = 100 B/us
- sector = 512 bytes, 512/100 = 5.12us per sector = 0.005 ms per sector
- how many seeks would we need to do if we read two consecutive sectors?
- how many seeks would we need to do if we read two sectors on different tracks?
- sequential vs random access
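The arithmetic above can be sketched in a few lines of Python. This is a rough model, not a measurement: the function names and the 4 ms seek time are assumptions picked for illustration, and the rotational delay is taken as half a rotation. In this model, two consecutive sectors pay for one seek and one rotational delay, while two sectors on different tracks pay for two of each.

```python
SECTOR_BYTES = 512

def rotation_ms(rpm: float) -> float:
    """Average rotational delay in ms, assuming half a rotation on average."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2

def transfer_ms(num_bytes: int, bandwidth_mb_s: float) -> float:
    """Time in ms to move num_bytes at the given bandwidth (1 MB = 1,000,000 B)."""
    bytes_per_ms = bandwidth_mb_s * 1_000_000 / 1000
    return num_bytes / bytes_per_ms

def access_ms(num_bytes: int, seek_ms: float, rpm: float, bandwidth_mb_s: float) -> float:
    """total time = seek time + rotation time + transfer time"""
    return seek_ms + rotation_ms(rpm) + transfer_ms(num_bytes, bandwidth_mb_s)

# Assumed drive: 4 ms seek (hypothetical), 7200 RPM, 100 MB/s.
one_sector = access_ms(SECTOR_BYTES, seek_ms=4.0, rpm=7200, bandwidth_mb_s=100)

# Two consecutive sectors: one seek, one rotational delay, two transfers.
two_sequential = access_ms(2 * SECTOR_BYTES, seek_ms=4.0, rpm=7200, bandwidth_mb_s=100)

# Two sectors on different tracks: two full accesses.
two_random = 2 * one_sector

print(f"rotational delay: {rotation_ms(7200):.2f} ms")            # about 4.17 ms
print(f"transfer, 1 sector: {transfer_ms(512, 100) * 1000:.2f} us")  # 5.12 us
print(f"two sequential sectors: {two_sequential:.3f} ms")
print(f"two random sectors:     {two_random:.3f} ms")
```

Seek and rotation dominate; the transfer itself is almost free, which is why sequential access is so much faster than random access on a spinning disk.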
Disk Scheduling
since seek time is large, OS can reorder I/O requests to minimize seek time
disk scheduling = deciding on the order in which I/O requests are served
goal: minimize latency per request
- Shortest Seek Time First (SSTF):
- serve the request with the shortest seek time from the current head position
- any problem with this?
- SCAN, CSCAN, RCSCAN:
- acts like an elevator, when it goes up, stops at any desired floors on the way up, and same on the way down
- SCAN
- disk arm moves from inner to outer track, serves all requests in between
- then moves from outer to inner, serves all requests in between
- CSCAN
- disk arm moves from inner to outer track, serves all requests in between
- once the last request near the outer track is finished, seeks to the innermost track (or the first request near the innermost track)
- only serves requests in one direction (why? any benefits to this?)
- RCSCAN
- rotation-aware CSCAN: rotational delay is nontrivial, so take it into account when picking the next request
- might be faster to seek to a different track if that seek takes less time than the rotational delay
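A minimal sketch of SSTF, SCAN, and CSCAN service orders, assuming requests are plain track numbers, the sweep runs toward higher track numbers first, and cost is measured in tracks traveled. The function names and the request list are made up for illustration; rotation (as in RCSCAN) is ignored.

```python
def sstf(head: int, requests: list[int]) -> list[int]:
    """Repeatedly serve the pending request closest to the current head position."""
    pending = list(requests)
    order = []
    while pending:
        nearest = min(pending, key=lambda track: abs(track - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest
    return order

def scan(head: int, requests: list[int]) -> list[int]:
    """Sweep upward serving requests, then reverse and serve the rest on the way back."""
    up = sorted(t for t in requests if t >= head)
    down = sorted((t for t in requests if t < head), reverse=True)
    return up + down

def cscan(head: int, requests: list[int]) -> list[int]:
    """Sweep upward serving requests, then jump back and sweep upward again."""
    up = sorted(t for t in requests if t >= head)
    wrapped = sorted(t for t in requests if t < head)
    return up + wrapped

def seek_distance(head: int, order: list[int]) -> int:
    """Total number of tracks the arm travels when serving requests in this order."""
    total = 0
    for track in order:
        total += abs(track - head)
        head = track
    return total

requests = [82, 17, 43, 140, 24, 16, 190]  # hypothetical pending requests
for name, policy in [("SSTF", sstf), ("SCAN", scan), ("CSCAN", cscan)]:
    order = policy(50, requests)
    print(name, order, "distance:", seek_distance(50, order))
```

SSTF travels the least here, but with a steady stream of requests near the head, far-away requests can wait indefinitely (starvation). SCAN and CSCAN bound the wait, and CSCAN treats inner and outer tracks more evenly because every track is visited once per sweep.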
Solid State Drive
- Device Characteristics
- no moving parts, NAND-based flash
- units:
- page: unit of read and write, 2-4 KB, not VM pages!
- block: unit of erasure, 1-8 MB, span hundreds of pages
- in order to write to a page within a block, we first have to erase the entire block
- operations:
- read (a page): can read any page, fast (~10us) sequential and random access
- erase (a block): erase a block by setting all bits in the block to 1, slow (a few ms)
- what about data that was in the block?
- once a block is erased, it's ready to be programmed
- program (a page): program a page in an erased block by setting certain bits to 0 to write data, ~100us
- performance
- sequential access still faster than random access, but much closer than hard drive
- metric: I/O Operations Per Second (IOPS)
- only meaningful together with latency: if you batch a lot of I/O requests, you can have high IOPS but also high latency
- similarly, if you have to complete each request within a certain amount of time, you may get relatively low IOPS
- reliability
- a block becomes unusable after a certain number (roughly 10K-100K) of program/erase operations
- repeated writes to the same page are bad for endurance
- wear leveling: try to spread writes across the blocks as evenly as possible
- SSD needs a way to flexibly remap addresses
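The page/block rules above can be modeled with a toy flash block. This is only a sketch: the class, the 4-page block size, and the `overwrite_in_place` helper are hypothetical and chosen to keep the example small. It shows why updating a single page in place is expensive: live data in the block has to be saved, the whole block erased, and every live page programmed again.

```python
PAGES_PER_BLOCK = 4   # tiny on purpose; real blocks span hundreds of pages
ERASED = None         # marker for a page that is erased and ready to program

class FlashBlock:
    def __init__(self):
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.erase_count = 0   # program/erase cycles used so far

    def read(self, page: int):
        return self.pages[page]

    def program(self, page: int, data) -> None:
        # A page can only be programmed while it is in the erased state.
        if self.pages[page] is not ERASED:
            raise ValueError("page already programmed; erase the block first")
        self.pages[page] = data

    def erase(self) -> None:
        # Erasure works on the whole block and consumes one endurance cycle.
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.erase_count += 1

def overwrite_in_place(block: FlashBlock, page: int, data) -> None:
    """Naive update: copy out live pages, erase the block, reprogram everything."""
    saved = list(block.pages)
    saved[page] = data
    block.erase()
    for index, contents in enumerate(saved):
        if contents is not ERASED:
            block.program(index, contents)

block = FlashBlock()
block.program(0, "a")
block.program(1, "b")
overwrite_in_place(block, 0, "a2")
print(block.pages, "erases:", block.erase_count)
```

Every in-place update costs one erase of the same block, so a hot page would wear its block out quickly. That is the motivation for remapping writes to other blocks.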
- Flash Translation Layer
- uses logical blocks/pages to communicate with its client (OS)
- maps logical blocks/pages to physical blocks/pages
- makes SSD internal management easier (wear leveling, garbage collection, see OSTEP: SSD)
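One common way to build such a mapping is a log-structured, page-mapped FTL: every write goes to the next erased physical page, and the map is updated to point at the new location. The sketch below assumes that design; the class name, sizes, and fields are hypothetical, and garbage collection is left out entirely.

```python
class SimpleFTL:
    """Toy page-mapped FTL: logical page -> (physical block, physical page)."""

    def __init__(self, num_blocks: int = 4, pages_per_block: int = 4):
        self.pages_per_block = pages_per_block
        # flash[block][page] holds data, or None while the page is erased
        self.flash = [[None] * pages_per_block for _ in range(num_blocks)]
        self.mapping = {}     # logical page number -> (block, page)
        self.invalid = set()  # physical pages that hold stale data
        self.next_free = 0    # next erased physical page, in log order

    def write(self, logical_page: int, data) -> None:
        total_pages = len(self.flash) * self.pages_per_block
        if self.next_free >= total_pages:
            # A real FTL would garbage-collect blocks full of invalid pages here.
            raise RuntimeError("out of erased pages; garbage collection needed")
        block, page = divmod(self.next_free, self.pages_per_block)
        self.next_free += 1
        self.flash[block][page] = data       # program an erased page
        old = self.mapping.get(logical_page)
        if old is not None:
            self.invalid.add(old)            # old copy is now stale, not erased
        self.mapping[logical_page] = (block, page)

    def read(self, logical_page: int):
        block, page = self.mapping[logical_page]
        return self.flash[block][page]

ftl = SimpleFTL()
ftl.write(7, "v1")
ftl.write(7, "v2")   # same logical page, different physical page
print(ftl.read(7), ftl.mapping[7], ftl.invalid)
```

The client keeps using logical page 7, while the device is free to place each new version wherever it likes. Repeated writes to one logical page therefore spread across physical pages, and stale pages are reclaimed later by garbage collection.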