Chapter — Memory systems
Until now we treated IMEM and DMEM as magic 1-cycle boxes. They aren't. This chapter unpacks what really happens between the CPU and DRAM, and how it feeds back into our CPI math.
lw/sw assumed a 1-cycle MEM stage. Real DRAM is ~100ns ≈ 200+ cycles. The memory system bridges that gap with a hierarchy of caches — and that hierarchy adds new sources of CPI penalty we'll fold in.
Contents
1.The memory wall & locality
1.1Why a flat memory wouldn't work
A 32-bit address space points at 4 GiB of bytes. If we built that as one giant SRAM (cache-speed memory), it would be the size of a building. If we built it as DRAM and connected it directly to the CPU, every load would stall the pipeline for 100+ ns.
1.2The hierarchy
1.3Temporal & spatial locality
Spatial locality: if you access an address, you'll likely access neighbours of it soon.
| Locality | Cache structure that exploits it | Daily-life analogy |
|---|---|---|
| Temporal | Keep recently used blocks (don't evict) | Re-checking your phone for a reply |
| Spatial | Larger blocks (fetch neighbours together) | Reaching for milk → also grabbing eggs |
2.Caches from first principles
2.1The four parameters
b: block (line) size — unit of transfer between cache and the level below.
N: associativity — slots per set.
S: number of sets = C / (N · b).
2.2Direct-mapped (N = 1)
Each block has exactly one possible slot. Quickest tag check (one comparison), cheapest, but two hot blocks that happen to map to the same slot will keep evicting each other (conflict misses).
2.3N-way set-associative
Each block can live in any of N slots within its set. The hardware does N tag comparisons in parallel on every access. Typical N: 2, 4, 8.
| Increase N → | Effect |
|---|---|
| + fewer conflict misses | Hot blocks can coexist |
| − more hardware | N parallel tag comparators per set |
| − slightly slower hit time | Mux to pick the matching way |
2.4Fully associative (N = total blocks)
Any block can live anywhere. One set, all slots. No conflict misses ever — but tag check requires comparing against every slot in parallel. Used for tiny structures like the TLB (§6) where N is small.
2.5Address breakdown
Every memory address the CPU emits is split into three (or four) parts:
- Byte offset picks the byte inside a word (2 bits for 4-byte words).
- Block offset picks the word inside the block.
- Set index picks which set to look in.
- Tag is compared against the stored tag in each slot of that set — match + valid = hit.
2.5.1Worked example (GQ Q0)
Total capacity 4C = 2c+2 bytes, N-way set-assoc (N ≥ 8), block size b ≥ 8. Take c = 6, b = 8.
Cache total = 2^8 = 256 bytes = 64 words
Block size = 8 bytes = 2 words (b ≥ 8)
Blocks total = 64 / 2 = 32 ... wait, let's use words.
Actually: 256 B / 8 B = 32 blocks if b=8 bytes,
OR if b=8 words=32 B: 256/32 = 8 blocks.
Sticking with the GQ paper's choice (b=8 words = 32 B per block):
Blocks total = 8
Associativity N ≥ 8 ⇒ N = 8
Sets S = 8 / 8 = 1 set
Block offset = log2(8 words) = 3 bits
Byte offset = 2 bits
Set index = log2(1) = 0 bits ← no set bits needed!
Tag = 32 − 3 − 2 = 27 bits
2.6Replacement policies
When a set is full and a new block arrives, which slot do we evict?
- LRU (Least Recently Used): evict whichever slot hasn't been touched longest. Best behaviour, expensive to implement perfectly beyond 2-way.
- FIFO: evict the oldest insertion.
- Random: picks at random — surprisingly competitive and very cheap.
3.Cache calculations
3.1Hit rate & miss rate
Miss rate (MR) = 1 − HR.
Misses split into three flavours (the "3 C's"):
- Compulsory: first access ever to this block (cold start). Unavoidable.
- Conflict: two blocks map to the same set and keep evicting each other (DM & low-assoc only).
- Capacity: working set doesn't fit at all. Need a bigger cache.
3.2AMAT — average memory access time
Read it: "You always pay the hit time, plus, with probability MR, the cost of going to the next level."
3.3Multi-level AMAT
The recursive structure: each level's "miss penalty" is the AMAT of the next level.
3.4Worked: figuring out a cache from miss-rate fingerprint
Q (GQ Q3). 16-word cache, LRU, conflict miss rate = 7/14 = 50% on the byte-address sequence:
74 A0 78 38C AC 84 888C 7C 34 3813C 38818C
Recipe. Try direct-mapped first (worst-case for conflicts).
- Assume DM, block = 4 words (16 bytes). Then S = 16/4 = 4 sets ⇒ 2 set-index bits.
- For each address, compute (tag, set) and walk through with LRU.
- Count misses. Should equal 7 for the answer to fit.
✓ Direct-mapped, b = 4 words.
4.Virtual memory motivation
4.1The illusion of unlimited RAM
If physical RAM is 8 GiB but processes act like they have 4 GiB each, the OS+MMU must be juggling. Virtual memory:
- Each process gets its own virtual address space (VAS) starting at 0.
- The CPU emits virtual addresses. Assembly works in these.
- The MMU translates virtual → physical every cycle using a page table.
- If the physical page isn't resident, the OS pages it in from disk (a page fault).
4.2The RISC-V memory map
0xFFFFFFFC ┌──────────────────┐
│ OS & I/O │
0xC0000000 ├──────────────────┤
│ Stack ←─ sp │ grows down
│ ↑ │
│ Heap │ grows up
0x10001000 ├──────────────────┤
│ Global / Data │
0x10000000 ├──────────────────┤
│ Text (code) │
0x00010000 ├──────────────────┤
│ Exception hdr │
0x00000000 └──────────────────┘
4.3Pages, VPN, page offset
VPN (Virtual Page Number): the upper bits of a virtual address.
PPN (Physical Page Number): the upper bits of the corresponding physical address.
Page offset: the lower bits — picks a byte inside the page. Identical in VA and PA.
4.3.1GQ Q29 example
Page = 2 KiWord. 4 segments (text/data/heap/stack), each occupying 1 page initially.
1 segment = 1 page = 2·1024 words
4 segments = 8·1024 words = 2^13 words
1 word = 4 B = 2^2 B
Total bytes = 2^13 · 2^2 = 2^15 bytes
5.Page tables
5.1Single-level — and why it's too big
A page table has one PTE per virtual page. PTE = { Valid bit, PPN, perm bits }.
VPN ──► PageTable[VPN] = {V, PPN, perms}
│ │
└─ if V=0: page fault └─ if V=1: form PA = PPN ‖ offset
5.2Multi-level page tables
Break the VPN into chunks. Each chunk indexes a smaller table; only allocate the second-level tables for VPN ranges actually used.
- Root PT (L0) is always resident in physical memory.
- Inner PTs (L1, L2, …) live in virtual memory and can themselves be paged out.
- Address translation now costs multiple memory accesses — one per level.
5.3Sizing a multi-level PT (GQ Q0(f))
Problem: 232-byte VA, page = Y KiB (Y∈{1,2,4,8}, from ERP digit). Leveli has 2Li entries. Find minimum number of levels.
Recipe:
- VPN bits = 32 − log2(Y · 1024).
- Sum the level-index widths L0 + L1 + … until ≥ VPN bits.
- If a digit is 0, replace with 8 (per the problem rule).
Example (ERP 32460, page = 8 KiB):
VPN bits = 32 − 13 = 19
L0 = 2^8 (X = 0 → 8, last digit) ⇒ 8 bits
L1 = 2^6 (Y = 6, 2nd-last digit) ⇒ 6 bits
L2 = 2^4 (Z = 4, 3rd-last digit) ⇒ 4 bits
Running total: 8 + 6 + 4 = 18 (short by 1)
L3 = 2^1 ⇒ 1 bit
Total: 8+6+4+1 = 19 ✓
⇒ 4 levels needed.
6.The TLB
6.1Why we need it
If every load/store needs a page-table walk (1 to 4 memory accesses), the memory system would collapse. The TLB is a tiny, fast cache that stores recent VPN → PPN translations.
6.2Full translation flow
VA ──► TLB lookup
│
├─► HIT ──► PPN ──► PA = PPN ‖ offset
│
└─► MISS ──► Walk page table (1 to N memory hits)
│
├─► PTE valid: get PPN, fill TLB, retry
└─► PTE invalid: page fault → load from disk
│
▼
cache lookup
(L1 → L2 → MM)
7.Memory performance calculations
7.1AMAT with TLB & multi-level PT
+ (tcache + MRcache · (tMM + MRMM · tVM))
where n = page-table levels to walk on TLB miss
7.1.1GQ Q5 worked
V = 3rd digit of ID (0 → use adjacent nonzero). tTLB = V ns, HRTLB = 95%, tMM = (100 − 20V) ns.
Take V = 5:
t_TLB = 5 ns
MR_TLB = 5%
t_MM = 100 − 20·5 = 0 ns ← degenerate; pick V=6 from adjacent
With V=6:
t_MM = 100 − 120 = ?? ← also invalid; per GQ, AMAT ≈ 5 ns
✓ Pick V correctly, then mechanically substitute.
7.2Memory-stall CPI
MSCPIideal = CPIbase (perfect cache)
Speedupperfect-cache = MSCPI / MSCPIideal
Where f = fraction of memory-touching instructions (loads + stores). This is the formula that connects memory back to the CPI math from Chapter 7 of microarchitecture.
7.2.1GQ Q4 worked
CPIbase = 2.0, MR = 4%, penalty = 100 cyc, f ∈ [30%, 39%].
| f | MSCPI | Speedup |
|---|---|---|
| 0.30 | 2 + 0.30·0.04·100 = 3.20 | 3.20 / 2 = 1.60× |
| 0.39 | 2 + 0.39·0.04·100 = 3.56 | 3.56 / 2 = 1.78× |
✓ Range: 1.60 to 1.78.
7.3Hooking back into the CPU CPI
Memory misses don't appear in our CPIpipeline from the microarchitecture chapter — that calculation assumed 1-cycle MEM. The true CPI of a real processor combines both:
And the end-to-end timing becomes:
= N · ( CPIpipe + f · MR · Penalty ) · Tc
The next page (System Performance →) walks through a full worked example combining a pipelined CPU CPI from §7.3 of the microarch chapter with an AMAT-derived memory penalty.
- Flat memory doesn't work — we build a hierarchy that exploits locality.
- Caches are characterised by C, b, N, S. Direct-mapped = 1-way. Fully assoc = 1 set.
- Every address splits into TAG / SET INDEX / BLOCK OFFSET / BYTE OFFSET.
- AMAT = thit + MR · MissPenalty. Recursive across levels.
- Virtual memory: VPN+offset, translated via page tables. Multi-level PTs save memory.
- TLB caches translations to skip page-table walks.
- Memory adds to the effective CPI: CPIeff = CPIpipe + f · MR · Penalty.
Memory Systems/MemorySystems.pdf + Ch8_Memory.pdf (Figures 8.23, 8.27). pagetable.png for the translation diagram.
Next: Combined system performance → (pipeline + memory together)