Memory
Contents
Virtual address & Translation
Address
- Most modern computers are byte-addressable. Each address identifies a single byte (eight bits) of storage. Data larger than a single byte may be stored in a sequence of consecutive addresses.
Page
- A process executes in its private virtual address space, composed of pages, each representing a contiguous range of addresses. The typical page size is 4 KiB, although processors also support larger pages, 2 MiB and 1 GiB on the ubiquitous 64-bit x86 (“x64”) processor architecture.
Page Table
- Translations from virtual pages to physical frames are stored in page tables.
TLB
- Processors cache recently used page table entries in the translation look-aside buffer (TLB).
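To make the translation concrete, here is a minimal sketch in C, assuming 4 KiB pages and a toy one-level page table (real x86-64 page tables are multi-level radix trees with permission bits; the array and its contents here are purely illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4 KiB pages: log_2(4096) = 12 */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16                 /* toy address space of 16 pages */

/* Toy one-level page table: virtual page number -> physical frame number. */
static const uint64_t page_table[NUM_PAGES] = { 7, 3, 0, 5 };  /* rest are 0 */

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* byte offset within the page */
    uint64_t pfn    = page_table[vpn];           /* the "page-table walk" */
    return (pfn << PAGE_SHIFT) | offset;         /* physical address */
}

int main(void) {
    /* 0x1a2c = page 1, offset 0xa2c -> frame 3 -> physical 0x3a2c */
    printf("0x%llx\n", (unsigned long long)translate(0x1a2c));
    return 0;
}
```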
Cache
Cache Hierarchy
- Each core typically has two private top-level caches, one each for data and instructions, called level-1 (L1) caches. A typical L1 cache size is 32 KiB with a 4-cycle access time.
- Modern x86 processors typically also support core-private, unified L2 caches of intermediate size and latency.
- The LLC is shared among all the cores of a multi-core chip and is a unified cache, i.e., it holds both data and instructions.
Cache line
- Caches are organized in fixed-size lines, which are the units of allocation and transfer down the cache hierarchy. A typical line size B is 64 bytes.
- The log_2(B) lowest-order bits of the address, called the line offset, are used to locate a datum in the cache line.
Cache Associativity
- Caches today are usually set-associative, i.e., organized as S sets of W lines each, called a W-way set-associative cache. When the cache is accessed, the set index field of the address, the log_2(S) consecutive bits starting from bit log_2(B), is used to locate a cache set. The remaining high-order bits are used as a tag for each cache line. After locating the cache set, the tag field of the address is matched against the tags of the W lines in the set to determine whether the access is a cache hit.
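As an illustration, the following sketch decomposes an address for a hypothetical 32 KiB, 8-way cache with 64-byte lines (so B = 64 and S = 64, matching the field layout described above):

```c
#include <stdint.h>

/* Field layout for a hypothetical 32 KiB, 8-way cache with 64-byte lines:
 * S = 32768 / (64 * 8) = 64 sets, B = 64, so 6 offset bits and 6 index bits. */
#define LINE_BITS 6   /* log_2(B) */
#define SET_BITS  6   /* log_2(S) */

static inline uint64_t line_offset(uint64_t addr) {
    return addr & ((1u << LINE_BITS) - 1);                /* bits [5:0]   */
}
static inline uint64_t set_index(uint64_t addr) {
    return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);  /* bits [11:6]  */
}
static inline uint64_t tag(uint64_t addr) {
    return addr >> (LINE_BITS + SET_BITS);                /* bits [63:12] */
}
/* A lookup selects set set_index(addr) and compares tag(addr) against the
 * tags of the W = 8 lines in that set; a match is a cache hit. */
```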
Latency
- The L1 cache latency is typically within a few cycles.
- The L2 cache latency is around 10-20 cycles.
- The LLC has a latency of 40-60 cycles.
- The main memory access latency is on the order of 200-300 cycles.
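- These figures combine into an average memory access time (AMAT). As a worked example with assumed, purely illustrative miss rates (10% of accesses miss the L1, 50% of those miss the L2, 30% of those miss the LLC) and representative latencies from the ranges above: AMAT = 4 + 0.10 * (15 + 0.50 * (50 + 0.30 * 250)) = 4 + 0.10 * 77.5 = 11.75 cycles.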
Modern LLC
- The LLC is divided into per-core slices, which are connected by a ring bus. Slices can be accessed concurrently and are effectively separate caches, although the bus ensures that each core can access the full LLC (with higher latency for remote slices).
- Intel uses a carefully-designed but undocumented hash function to map the address (excluding the line offset bits) to a slice id.
- Hund et al. found that on Sandy Bridge, only the tag field is used to compute the hash, but Liu et al. found that this is only true if the number of cores is a power of two. For other core counts, the full address (minus the line offset) is used.
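The reverse-engineered hashes have the following general shape: each bit of the slice id is the parity (XOR) of a fixed subset of address bits. A sketch for a hypothetical 4-slice part follows; the bit masks are invented for illustration and are not Intel's actual function:

```c
#include <stdint.h>

/* HYPOTHETICAL slice-selection hash; Intel's real function is undocumented.
 * The reverse-engineering results cited above have this general shape: each
 * bit of the slice id is the parity (XOR) of a fixed subset of address bits.
 * The masks below are invented for illustration and exclude the line-offset
 * bits (bits 5:0 of each mask are zero). */
static inline int parity64(uint64_t x) {
    return __builtin_parityll(x);   /* GCC/Clang built-in: XOR of all bits */
}

int slice_id(uint64_t paddr) {      /* hypothetical 4-slice part */
    const uint64_t mask0 = 0x1b5f575440ull;   /* illustrative only */
    const uint64_t mask1 = 0x2eb5faa880ull;   /* illustrative only */
    return (parity64(paddr & mask1) << 1) | parity64(paddr & mask0);
}
```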
Inclusiveness
- The L_{i+1} cache holds a strict superset of the contents of the L_i cache.
Merits
- If the lower-level cache (e.g., L3) is inclusive of the higher-level caches and a lookup (e.g., a coherence snoop) misses in it, the higher-level caches (e.g., L2) need not be searched, since inclusion guarantees they cannot hold the line. Thus an inclusive cache has a shorter miss latency than exclusive and NINE (Non-Inclusive Non-Exclusive) designs.
- Evicting a line from a higher-level cache requires no action in the lower-level cache, since inclusion guarantees the line is already present there (unlike an exclusive hierarchy, where the evicted victim must be placed into the lower level).
Index and Tag
- The L1 is typically indexed by virtual address (and tagged by physical address), while all other caches are indexed by physical address.
- A cache consists of the data array, which stores the data or code, and the tag array, which stores the high-order bits of the addresses of the data or code.
Cache Coherence
2 categories
- Snoopy Bus
- Directory
Snoopy Bus
- Idea: every cache observes (snoops) a shared bus and updates or invalidates its own copies in response to the transactions of other caches.
- As the number of cores increases, the centralized bus quickly becomes the bottleneck.
VI protocol
- Invalid
- Valid
MSI
- M (odified): cache line is the only cached copy and is dirty
- S (hared): cache line is potentially one of several cached copies
- I (nvalid): cache line is not present in this cache
MESI
- E (xclusive): cache line is the only cached copy and is clean
MOESI
- O (wned): when another core reads a Modified line, the data is supplied over the bus without writing back to memory; the supplying cache keeps the line in the Owned state and remains responsible for eventually writing the latest data back to memory, while the other copies become Shared.
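A minimal sketch of the per-line MESI state machine in C, assuming the usual BusRd/BusRdX snoop events and omitting the bus transactions, data movement, and write-backs (the function and event names are illustrative):

```c
/* Per-line MESI state machine (sketch). Local events are the core's own
 * reads and writes; BusRd/BusRdX are snooped transactions from other cores. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

mesi_t on_local_read(mesi_t s, int other_caches_have_line) {
    if (s == INVALID)   /* miss: fill the line (BusRd), then E if unshared */
        return other_caches_have_line ? SHARED : EXCLUSIVE;
    return s;           /* M/E/S hit: state unchanged */
}

mesi_t on_local_write(mesi_t s) {
    /* From I or S, other copies must first be invalidated (BusRdX/upgrade);
     * from E, the write is silent. All paths end in Modified. */
    (void)s;
    return MODIFIED;
}

mesi_t on_snoop_busrd(mesi_t s) {
    /* Another core reads: a Modified line is written back (omitted);
     * M and E downgrade to Shared. */
    return (s == INVALID) ? INVALID : SHARED;
}

mesi_t on_snoop_busrdx(mesi_t s) {
    /* Another core writes: a Modified line is written back (omitted);
     * every local copy is invalidated. */
    (void)s;
    return INVALID;
}
```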
Directory
- Idea: A logically-central directory keeps track of where the copies of each cache line reside and whether the line is dirty or clean. Caches consult this directory to ensure coherence.
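A sketch of one entry of a full-map (bit-vector) directory in C, assuming at most 64 cores; the types, names, and the invalidation step are illustrative:

```c
#include <stdint.h>

/* One entry of a full-map (bit-vector) directory, assuming at most 64 cores.
 * Per cache line, the directory records which cores hold a copy and whether
 * one of them holds it dirty; message sending is omitted. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;    /* uncached, clean-shared, or dirty in one cache */
    uint64_t    sharers;  /* bit i set => core i holds a copy of the line  */
} dir_entry_t;

/* On a write by core w, the directory invalidates every other sharer and
 * records core w as the sole (dirty) holder. */
void dir_on_write(dir_entry_t *e, int w) {
    /* send invalidations to all cores in e->sharers except w (omitted) */
    e->sharers = 1ull << w;
    e->state   = DIR_MODIFIED;
}
```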
Pipeline
Out-of-order Execution
- Dynamically-scheduled processors execute data-independent instructions in parallel, out of program order, and thereby exploit instruction-level parallelism to improve performance. Instructions are issued (enter the scheduling system) in program order, complete (execute and produce their results) possibly out of program order, and finally retire (irrevocably modify the architected system state) in program order. In-order retirement is implemented by queueing instructions in program order in a reorder buffer (ROB), and removing a completed instruction from the ROB only once it reaches the ROB head, i.e., after all prior instructions have retired.
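A minimal sketch of this discipline as a circular buffer in C, assuming an illustrative 192-entry ROB: allocation (issue) and retirement proceed in program order at the tail and head respectively, while completion may happen at any entry:

```c
#include <stdbool.h>

#define ROB_SIZE 192   /* illustrative capacity */

typedef struct { bool completed; /* ...dest reg, result, PC, ... */ } rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int head, tail, count;   /* a FIFO kept in program order */

/* Issue: allocate the next entry at the tail, in program order. */
int rob_issue(void) {
    if (count == ROB_SIZE) return -1;   /* ROB full: stall the front end */
    int idx = tail;
    rob[idx].completed = false;
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return idx;
}

/* Complete: execution units mark entries done, possibly out of order. */
void rob_complete(int idx) { rob[idx].completed = true; }

/* Retire: only the entry at the head may irrevocably update architected
 * state, so retirement is in program order even though completion is not. */
bool rob_retire(void) {
    if (count == 0 || !rob[head].completed) return false;
    /* commit architected state here */
    head = (head + 1) % ROB_SIZE;
    count--;
    return true;
}
```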
Memory
Memory Consistency
Total Store Order (TSO)
- Total Store Order (TSO) is the memory model of the x86 architecture. TSO forbids all observable load and store reorderings except store→load reordering, which is when a load bypasses an earlier store to a different address. Implementations prevent observable load→load reordering by ensuring that the value a load reads when it is performed remains valid when the load retires. This guarantee is maintained by squashing a load that has performed, but not yet retired, if the core receives a cache invalidation request for (or suffers a cache eviction of) the line read by the load. Store→store reordering is prevented by using a FIFO write buffer, ensuring that stores perform in program order. If desired, store→load reordering can be prevented by separating the store and the load with a fence instruction, which does not complete until all prior accesses are performed. Atomic instructions have fence semantics.
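The classic store-buffering (Dekker-style) litmus test illustrates this in C11 atomics. On x86 the relaxed accesses compile to plain moves, so with the fences removed the outcome r0 == r1 == 0 can appear over many runs (store→load reordering); atomic_thread_fence(memory_order_seq_cst), which compiles to an mfence or a locked instruction, rules it out:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;
int r0, r1;

void *t0(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* drains the prior store */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t1(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r0=%d r1=%d\n", r0, r1);  /* with the fences, 0/0 is impossible */
    return 0;
}
```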
Release Consistency (RC)
- Release Consistency (RC) allows any reordering, except across synchronization instructions. Loads and stores may not be reordered with a prior acquire or with a subsequent release. Therefore, RC implementations squash performed loads upon receiving an invalidation of their cache line only if there is a prior non-retired acquire, and have a non-FIFO write buffer.
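The corresponding message-passing idiom in C11 acquire/release atomics (a sketch; the variable names are illustrative):

```c
#include <stdatomic.h>

/* Message passing with acquire/release. The release store cannot be
 * reordered with the prior write of data, and the acquire load cannot be
 * reordered with the subsequent read, so a consumer that observes
 * flag == 1 is guaranteed to read data == 42. */
int data;          /* plain, non-atomic payload */
atomic_int flag;

void producer(void) {
    data = 42;                                              /* ordinary store */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* publish */
}

int consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* spin until published */
    return data;                                            /* reads 42 */
}
```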