The point of Monday's lecture: paging is good in that it provides an arbitrary translation function that allows one to map any virtual address to any physical address. This allows a lot of flexibility. Other methods (e.g., pure segmentation) are constrained by problems such as exposing the underlying hardware to the programmer, internal/external fragmentation, etc.
Today we talked about paged segmentation. A virtual address is split into a segment number, a virtual frame number (VFN), and an offset:

     -------------------------------
     | Segment |   VFN   |  Offset |
     -------------------------------
          |         |         |
          v         |         |
     Segment Table  |         |
    ----------------    |         |
    | Code segment |---+       |
    |--------------|   v       v
    | Data segment |  Page Table ---> Physical Memory
    |--------------|
    |Other segments|
    ----------------

The segment field indexes the segment table, whose entry points to that segment's page table; the VFN indexes the page table to find the physical frame; the offset is appended to form the physical address.
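The translation path above can be sketched in code. This is a minimal illustration only: the field widths (2-bit segment, 4-bit VFN, 12-bit offset) and table sizes are assumptions, not those of any real architecture.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE     4096u   /* 12-bit offset (assumed) */
#define PAGES_PER_SEG 16u     /* 4-bit VFN (assumed) */
#define NUM_SEGS      4u      /* 2-bit segment field (assumed) */

typedef struct {
    uint32_t *page_table;     /* maps VFN -> physical frame number */
} seg_entry;

/* Assumed virtual address layout: | seg:2 | vfn:4 | offset:12 | */
uint32_t translate(seg_entry *seg_table, uint32_t vaddr) {
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    uint32_t vfn    = (vaddr >> 12) & (PAGES_PER_SEG - 1);
    uint32_t seg    = (vaddr >> 16) & (NUM_SEGS - 1);
    /* Two indirections: segment table, then that segment's page table. */
    uint32_t pfn    = seg_table[seg].page_table[vfn];
    return pfn * PAGE_SIZE + offset;
}
```

Note the two table lookups per access; this is the indirection (and the cost) that the TLB, discussed below, is designed to hide.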
Example: with pure paging, sharing a long extent of memory between processes is slow, because the OS must copy a page table entry into each process's page table for every shared page.
With paging plus segmentation, only one set of page table entries is needed for each shared extent: the shared pages can all be contained within a single shared segment.
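A sketch of the sharing idea, using a hypothetical `seg_entry` type and `map_shared` helper: each process's segment table holds only a pointer to the shared page table, so the PTEs exist once rather than once per process.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t *page_table;   /* segment entry points at a page table */
} seg_entry;

/* Map a shared extent into a process by pointing one segment table
 * entry at the shared page table -- no PTEs are copied. */
void map_shared(seg_entry *seg_table, unsigned seg, uint32_t *shared_pt) {
    seg_table[seg].page_table = shared_pt;
}
```

Both processes can even map the shared extent at different segment numbers; the single set of PTEs is still shared.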
There are a couple of downsides to this approach. First, this scheme is more complicated.
The problem with indirection can be shown in this simple instruction:
ld r0, 296
When the human brain sees this, it does not think about segments, pages, etc. Instead, the human brain likes to view memory as one big memory space that can be used however it sees fit. This straightforward and simple approach conflicts with all the indirection of paging + segmentation; it's confusing.
The other downside is that this approach once again exposes hardware to the programmer (the existence of segments). Various operating systems take various approaches to the paging + segmentation scheme. UNIX provides 2 segments: the operating system segment and the program segment. Other incarnations of UNIX may provide separate data and code segments. The x86 architecture provides code, data, stack, global, other, and OS segments.
Improving paging: How to make it faster
We would like to make paging faster. How? Here is a typical user program:
for(i=1; i<=100; ++i) x[i] = x[i-1]+1;
This is a good model of how programs work: it exhibits both temporal and spatial locality.
Temporal locality: If you just accessed a piece of memory, chances are you'll access it again sometime in the near future.
Spatial locality: If you just accessed a piece of memory, chances are you'll access the memory nearby (with respect to the virtual address space) sometime in the near future.
In the above code sample, each iteration through the loop accesses the array elements x[i] and x[i-1]. Accessing x[i-1] exhibits temporal locality: the same element was accessed in the previous iteration. Accessing x[i] exhibits spatial locality: this element is only one offset away from the item accessed in the previous iteration (x[i-1]).
Why do programs exhibit locality?
Programs tend to use a lot of loops which exhibit both types of locality. It's easier to write out a for loop that loops a hundred times than it is to explicitly write out each iteration a hundred times.
Where do we see spatial locality? Some examples that come to mind are sequential instruction fetching and arrays.
Programmers are fond of simple data structures like arrays, and if they can get away with them, they will use them.
This locality is exploitable!
Each time we access a virtual memory address, we can remember (i.e., cache) its translation into physical memory so that we can look up this information the next time around instead of re-doing the translation each time. This avoids all the overhead involved in address translation. We can also make this cache completely transparent to the programmer. This cache is the translation lookaside buffer.
Translation Lookaside Buffer (TLB): a very small associative memory that contains a set of TLB entries, each of which holds a virtual frame number and its corresponding physical frame number. Whenever a new virtual memory address is accessed, the TLB simultaneously compares every TLB entry with the given virtual frame number, in parallel. If there is a hit, it spits out the physical address; if not, the new virtual frame number is written into the TLB along with its corresponding physical frame number, which has to be looked up in the page table itself.
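The lookup-or-fill behavior can be sketched in software (a real TLB does the comparison in parallel hardware). The entry count, the FIFO replacement policy, and the `page_table_lookup` stand-in are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t vfn;   /* virtual frame number */
    uint32_t pfn;   /* physical frame number */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];
static int next_victim = 0;   /* FIFO replacement pointer (assumed policy) */
static int tlb_misses  = 0;

/* Stand-in for walking the full page table (assumed mapping). */
static uint32_t page_table_lookup(uint32_t vfn) { return vfn + 100; }

uint32_t tlb_translate(uint32_t vfn) {
    /* Hardware compares all entries at once; we loop. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vfn == vfn)
            return tlb[i].pfn;                         /* hit */
    tlb_misses++;                                      /* miss: walk the table */
    uint32_t pfn = page_table_lookup(vfn);
    tlb[next_victim] = (tlb_entry){ true, vfn, pfn };  /* fill an entry */
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return pfn;
}
```

The first access to a frame misses and fills the TLB; repeated accesses to the same frame (temporal locality) then hit without touching the page table.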
Why does a program incur a TLB miss?
Either there was a change in locality (i.e., the program jumps to a different location in the program) or the locality is too large (the contiguous extent of memory being accessed is too big for the TLB to hold all the translations at once, i.e., the loop touches too many pages).
On a TLB hit, it appears to the program as if the virtual address is directly mapped by the CPU to the physical memory.
A TLB miss is about 5 times slower than a TLB hit, because the translation has to be looked up in the large, complicated page table.
Why do we care? When your programs exhibit referential locality they will run really fast. If they don't, the performance will degrade.
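As a sketch of how locality shows up in code: both functions below compute the same sum, but the row-major version walks memory sequentially (good spatial locality, so consecutive accesses stay on the same page and hit in the TLB), while the column-major version strides across a full row, and hence potentially a new page, on every access. The matrix size is illustrative.

```c
#include <assert.h>

#define N 512

static int m[N][N];   /* C stores this row by row */

long sum_row_major(void) {   /* sequential accesses: good locality */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

long sum_col_major(void) {   /* strided accesses: poor locality */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Both return the same answer; only the order of memory accesses, and therefore the TLB and cache behavior, differs.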
How can we improve the coverage of the TLB?
We can find the optimal TLB and page size by creating and running benchmarks or representative programs and, based on how they access memory, determining the average memory access time under various schemes. We can then use this information to pick the best TLB and page size.
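For example, the effective (average) access time is just a weighted average of the hit and miss costs. A minimal sketch, using the rough "5 times slower" figure from above and an assumed 1-cycle hit cost:

```c
#include <assert.h>

/* Effective memory access time as a function of TLB hit rate.
 * hit_cost and miss_cost are in arbitrary units (e.g., cycles);
 * the specific values used below are assumptions for illustration. */
double effective_access_time(double hit_rate, double hit_cost,
                             double miss_cost) {
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
}
```

With a 99% hit rate, a 1-cycle hit, and a 5-cycle miss, the average access costs 0.99 * 1 + 0.01 * 5 = 1.04 cycles, which is why good locality makes programs run nearly as fast as if there were no translation at all.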