Memory
Contents
Virtual address & Translation
Address
- Most modern computers are byte-addressable. Each address identifies a single byte (eight bits) of storage. Data larger than a single byte may be stored in a sequence of consecutive addresses.
Page
- A process executes in its private virtual address space, composed of pages, each representing a contiguous range of addresses. The typical page size is 4 KiB, although processors also support larger pages, 2 MiB and 1 GiB on the ubiquitous 64-bit x86 (“x64”) processor architecture.
Page Table
- Translations from virtual pages to physical frames are stored in page tables.
TLB
- Processors cache recently used page table entries in the translation look-aside buffer (TLB).
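To make the translation concrete, here is a minimal sketch in C, assuming 4 KiB pages and a toy one-level page table (real x86-64 page tables are multi-level radix trees with permission bits; the array and its contents here are purely illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4 KiB pages: log_2(4096) = 12 */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16                 /* toy address space of 16 pages */

/* Toy one-level page table: virtual page number -> physical frame number. */
static const uint64_t page_table[NUM_PAGES] = { 7, 3, 0, 5 };  /* rest are 0 */

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* byte offset within the page */
    uint64_t pfn    = page_table[vpn];           /* the "page-table walk" */
    return (pfn << PAGE_SHIFT) | offset;         /* physical address */
}

int main(void) {
    /* 0x1a2c = page 1, offset 0xa2c -> frame 3 -> physical 0x3a2c */
    printf("0x%llx\n", (unsigned long long)translate(0x1a2c));
    return 0;
}
```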
Cache
Cache Hierarchy
- Each core typically has two private top-level caches, one each for data and instructions, called level-1 (L1) caches. A typical L1 cache size is 32 KiB with a 4-cycle access time.
- Modern x86 processors typically also support core-private, unified L2 caches of intermediate size and latency.
- The LLC is shared among all the cores of a multi-core chip and is a unified cache, i.e., it holds both data and instructions.
Cache line
- Caches are organized in fixed-size lines, which are the units of allocation and transfer down the cache hierarchy. A typical line size B is 64 bytes.
- The log_2(B) lowest-order bits of the address, called the line offset, are used to locate a datum in the cache line.
Cache Associativity
- Caches today are usually set-associative, i.e., organized as S sets of W lines each, called a W-way set-associative cache. When the cache is accessed, the set index field of the address, the log_2(S) consecutive bits starting from bit log_2(B), is used to locate a cache set. The remaining high-order bits are used as a tag for each cache line. After locating the cache set, the tag field of the address is matched against the tags of the W lines in the set to determine whether the access is a cache hit.
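As an illustration, the following sketch decomposes an address for a hypothetical 32 KiB, 8-way cache with 64-byte lines (so B = 64 and S = 64, matching the field layout described above):

```c
#include <stdint.h>

/* Field layout for a hypothetical 32 KiB, 8-way cache with 64-byte lines:
 * S = 32768 / (64 * 8) = 64 sets, B = 64, so 6 offset bits and 6 index bits. */
#define LINE_BITS 6   /* log_2(B) */
#define SET_BITS  6   /* log_2(S) */

static inline uint64_t line_offset(uint64_t addr) {
    return addr & ((1u << LINE_BITS) - 1);                /* bits [5:0]   */
}
static inline uint64_t set_index(uint64_t addr) {
    return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);  /* bits [11:6]  */
}
static inline uint64_t tag(uint64_t addr) {
    return addr >> (LINE_BITS + SET_BITS);                /* bits [63:12] */
}
/* A lookup selects set set_index(addr) and compares tag(addr) against the
 * tags of the W = 8 lines in that set; a match is a cache hit. */
```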
Latency
- The L1 cache latency is typically within a few cycles.
- The L2 cache latency is around 10-20 cycles.
- The LLC has a latency of 40-60 cycles.
- The main memory access latency is on the order of 200-300 cycles.
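- These figures combine into an average memory access time (AMAT). As a worked example with assumed, purely illustrative miss rates (10% of accesses miss the L1, 50% of those miss the L2, 30% of those miss the LLC) and representative latencies from the ranges above: AMAT = 4 + 0.10 * (15 + 0.50 * (50 + 0.30 * 250)) = 4 + 0.10 * 77.5 = 11.75 cycles.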
Modern LLC
- The LLC is divided into per-core slices, which are connected by a ring bus. Slices can be accessed concurrently and are effectively separate caches, although the bus ensures that each core can access the full LLC (with higher latency for remote slices).
- Intel uses a carefully-designed but undocumented hash function to map the address (excluding the line offset bits) to a slice id.
- Hund et al. found that on Sandy Bridge, only the tag field is used to compute the hash, but Liu et al. found that this is only true if the number of cores is a power of two. For other core counts, the full address (minus the line offset) is used.
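The reverse-engineered hashes have the following general shape: each bit of the slice id is the parity (XOR) of a fixed subset of address bits. A sketch for a hypothetical 4-slice part follows; the bit masks are invented for illustration and are not Intel's actual function:

```c
#include <stdint.h>

/* HYPOTHETICAL slice-selection hash; Intel's real function is undocumented.
 * The reverse-engineering results cited above have this general shape: each
 * bit of the slice id is the parity (XOR) of a fixed subset of address bits.
 * The masks below are invented for illustration and exclude the line-offset
 * bits (bits 5:0 of each mask are zero). */
static inline int parity64(uint64_t x) {
    return __builtin_parityll(x);   /* GCC/Clang built-in: XOR of all bits */
}

int slice_id(uint64_t paddr) {      /* hypothetical 4-slice part */
    const uint64_t mask0 = 0x1b5f575440ull;   /* illustrative only */
    const uint64_t mask1 = 0x2eb5faa880ull;   /* illustrative only */
    return (parity64(paddr & mask1) << 1) | parity64(paddr & mask0);
}
```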
Inclusiveness
- The L_{i+1} cache holds a strict superset of the contents of the L_i cache.
Merits
- If the lower-level cache (e.g., L3) is inclusive of the higher-level caches and a lookup (e.g., a coherence snoop) misses in it, the higher-level caches (e.g., L2) need not be searched, since inclusion guarantees they cannot hold the line. Thus an inclusive cache has a shorter miss latency than exclusive and NINE (Non-Inclusive Non-Exclusive) designs.
- Evicting a line from a higher-level cache requires no action in the lower-level cache, since inclusion guarantees the line is already present there (unlike an exclusive hierarchy, where the evicted victim must be placed into the lower level).
Index and Tag
- The L1 is typically indexed by virtual address (and tagged by physical address), while all other caches are indexed by physical address.
- A cache consists of the data array, which stores the data or code, and the tag array, which stores the high-order bits of the addresses of the data or code.
Cache Coherence
2 categories
- Snoopy Bus
- Directory
Snoopy Bus
- Idea: every cache observes (snoops) a shared bus and updates or invalidates its own copies in response to the transactions of other caches.
- As the number of cores increases, the centralized bus quickly becomes the bottleneck.
VI protocol
- Invalid
- Valid
MSI
- M (odified): cache line is the only cached copy and is dirty
- S (hared): cache line is potentially one of several cached copies
- I (nvalid): cache line is not present in this cache
MESI
- E (xclusive): cache line is the only cached copy and is clean
MOESI
- O (wned): when another core reads a Modified line, the data is supplied over the bus without writing back to memory; the supplying cache keeps the line in the Owned state and remains responsible for eventually writing the latest data back to memory, while the other copies become Shared.
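A minimal sketch of the per-line MESI state machine in C, assuming the usual BusRd/BusRdX snoop events and omitting the bus transactions, data movement, and write-backs (the function and event names are illustrative):

```c
/* Per-line MESI state machine (sketch). Local events are the core's own
 * reads and writes; BusRd/BusRdX are snooped transactions from other cores. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

mesi_t on_local_read(mesi_t s, int other_caches_have_line) {
    if (s == INVALID)   /* miss: fill the line (BusRd), then E if unshared */
        return other_caches_have_line ? SHARED : EXCLUSIVE;
    return s;           /* M/E/S hit: state unchanged */
}

mesi_t on_local_write(mesi_t s) {
    /* From I or S, other copies must first be invalidated (BusRdX/upgrade);
     * from E, the write is silent. All paths end in Modified. */
    (void)s;
    return MODIFIED;
}

mesi_t on_snoop_busrd(mesi_t s) {
    /* Another core reads: a Modified line is written back (omitted);
     * M and E downgrade to Shared. */
    return (s == INVALID) ? INVALID : SHARED;
}

mesi_t on_snoop_busrdx(mesi_t s) {
    /* Another core writes: a Modified line is written back (omitted);
     * every local copy is invalidated. */
    (void)s;
    return INVALID;
}
```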
Directory
- Idea: A logically-central directory keeps track of where the copies of each cache line reside and whether the line is dirty or clean. Caches consult this directory to ensure coherence.
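A sketch of one entry of a full-map (bit-vector) directory in C, assuming at most 64 cores; the types, names, and the invalidation step are illustrative:

```c
#include <stdint.h>

/* One entry of a full-map (bit-vector) directory, assuming at most 64 cores.
 * Per cache line, the directory records which cores hold a copy and whether
 * one of them holds it dirty; message sending is omitted. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;    /* uncached, clean-shared, or dirty in one cache */
    uint64_t    sharers;  /* bit i set => core i holds a copy of the line  */
} dir_entry_t;

/* On a write by core w, the directory invalidates every other sharer and
 * records core w as the sole (dirty) holder. */
void dir_on_write(dir_entry_t *e, int w) {
    /* send invalidations to all cores in e->sharers except w (omitted) */
    e->sharers = 1ull << w;
    e->state   = DIR_MODIFIED;
}
```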
Pipeline
Out-of-order Execution
- Dynamically-scheduled processors execute data-independent instructions in parallel, out of program order, and thereby exploit instruction-level parallelism to improve performance. Instructions are issued (enter the scheduling system) in program order, complete (execute and produce their results) possibly out of program order, and finally retire (irrevocably modify the architected system state) in program order. In-order retirement is implemented by queueing instructions in program order in a reorder buffer (ROB), and removing a completed instruction from the ROB only once it reaches the ROB head, i.e., after all prior instructions have retired.
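A minimal sketch of this discipline as a circular buffer in C, assuming an illustrative 192-entry ROB: allocation (issue) and retirement proceed in program order at the tail and head respectively, while completion may happen at any entry:

```c
#include <stdbool.h>

#define ROB_SIZE 192   /* illustrative capacity */

typedef struct { bool completed; /* ...dest reg, result, PC, ... */ } rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static int head, tail, count;   /* a FIFO kept in program order */

/* Issue: allocate the next entry at the tail, in program order. */
int rob_issue(void) {
    if (count == ROB_SIZE) return -1;   /* ROB full: stall the front end */
    int idx = tail;
    rob[idx].completed = false;
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return idx;
}

/* Complete: execution units mark entries done, possibly out of order. */
void rob_complete(int idx) { rob[idx].completed = true; }

/* Retire: only the entry at the head may irrevocably update architected
 * state, so retirement is in program order even though completion is not. */
bool rob_retire(void) {
    if (count == 0 || !rob[head].completed) return false;
    /* commit architected state here */
    head = (head + 1) % ROB_SIZE;
    count--;
    return true;
}
```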
Memory
Memory Consistency
Total Store Order (TSO)
- Total Store Order (TSO) is the memory model of the x86 architecture. TSO forbids all observable load and store reorderings except store→load reordering, which is when a load bypasses an earlier store to a different address. Implementations prevent observable load→load reordering by ensuring that the value a load reads when it is performed remains valid when the load retires. This guarantee is maintained by squashing a load that has performed, but not yet retired, if the core receives a cache invalidation request for (or suffers a cache eviction of) the line read by the load. Store→store reordering is prevented by using a FIFO write buffer, ensuring that stores perform in program order. If desired, store→load reordering can be prevented by separating the store and the load with a fence instruction, which does not complete until all prior accesses are performed. Atomic instructions have fence semantics.
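The classic store-buffering (Dekker-style) litmus test illustrates this in C11 atomics. On x86 the relaxed accesses compile to plain moves, so with the fences removed the outcome r0 == r1 == 0 can appear over many runs (store→load reordering); atomic_thread_fence(memory_order_seq_cst), which compiles to an mfence or a locked instruction, rules it out:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;
int r0, r1;

void *t0(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* drains the prior store */
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t1(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r0=%d r1=%d\n", r0, r1);  /* with the fences, 0/0 is impossible */
    return 0;
}
```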
Release Consistency (RC)
- Release Consistency (RC) allows any reordering, except across synchronization instructions. Loads and stores may not be reordered with a prior acquire or with a subsequent release. Therefore, RC implementations squash performed loads upon receiving an invalidation of their cache line only if there is a prior non-retired acquire, and have a non-FIFO write buffer.
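The corresponding message-passing idiom in C11 acquire/release atomics (a sketch; the variable names are illustrative):

```c
#include <stdatomic.h>

/* Message passing with acquire/release. The release store cannot be
 * reordered with the prior write of data, and the acquire load cannot be
 * reordered with the subsequent read, so a consumer that observes
 * flag == 1 is guaranteed to read data == 42. */
int data;          /* plain, non-atomic payload */
atomic_int flag;

void producer(void) {
    data = 42;                                              /* ordinary store */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* publish */
}

int consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* spin until published */
    return data;                                            /* reads 42 */
}
```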