Chapter — Memory systems

Until now we treated IMEM and DMEM as magic 1-cycle boxes. They aren't. This chapter unpacks what really happens between the CPU and DRAM, and how it feeds back into our CPI math.

Where we're coming from In Microarchitecture and Advanced µArch, every lw/sw assumed a 1-cycle MEM stage. Real DRAM is ~100ns ≈ 200+ cycles. The memory system bridges that gap with a hierarchy of caches — and that hierarchy adds new sources of CPI penalty we'll fold in.

The memory wall & locality
Caches from first principles
Cache calculations
Virtual memory motivation
Page tables
The TLB
1. Why we need it
2. Full translation flow
Memory performance calculations

1.The memory wall & locality

1.1Why a flat memory wouldn't work

A 32-bit address space points at 4 GiB of bytes. If we built that as one giant SRAM (cache-speed memory), it would be the size of a building. If we built it as DRAM and connected it directly to the CPU, every load would stall the pipeline for 100+ ns.

The memory wall

Processor clocks improved ~50%/yr for decades; DRAM access times improved only ~7%/yr. The performance gap grew exponentially until caches became essential, not optional.

1.2The hierarchy

Registers~1 KiB · ~0.3 ns · in core
L1 Cache~32 KiB · ~1 ns · per core
L2 / L3 Cache~MBs · ~3-30 ns · SRAM
Main Memory (DRAM)~GBs · ~100 ns
Disk / VM~TBs · ~10 ms — 105× slower!

Why this works at all

Programs touch only a small working set at any moment. If we keep that working set in L1, the average access time stays close to L1 speed even though we have GBs of DRAM behind it.

1.3Temporal & spatial locality

Definitions

Temporal locality: if you access an address now, you'll likely access it again soon.
Spatial locality: if you access an address, you'll likely access neighbours of it soon.

Locality	Cache structure that exploits it	Daily-life analogy
Temporal	Keep recently used blocks (don't evict)	Re-checking your phone for a reply
Spatial	Larger blocks (fetch neighbours together)	Reaching for milk → also grabbing eggs

2.Caches from first principles

2.1The four parameters

Vocabulary

C: total cache capacity (bytes or words).
b: block (line) size — unit of transfer between cache and the level below.
N: associativity — slots per set.
S: number of sets = C / (N · b).

2.2Direct-mapped (N = 1)

Each block has exactly one possible slot. Quickest tag check (one comparison), cheapest, but two hot blocks that happen to map to the same slot will keep evicting each other (conflict misses).

Q25 trap "1-way set associative" = "direct-mapped" — two names, same hardware.

2.3N-way set-associative

Each block can live in any of N slots within its set. The hardware does N tag comparisons in parallel on every access. Typical N: 2, 4, 8.

Increase N →	Effect
+ fewer conflict misses	Hot blocks can coexist
− more hardware	N parallel tag comparators per set
− slightly slower hit time	Mux to pick the matching way

2.4Fully associative (N = total blocks)

Any block can live anywhere. One set, all slots. No conflict misses ever — but tag check requires comparing against every slot in parallel. Used for tiny structures like the TLB (§6) where N is small.

2.5Address breakdown

Every memory address the CPU emits is split into three (or four) parts:

TAG

addr_bits − idx − off

SET INDEX

log₂(S)

BLOCK OFFSET

log₂(words/block)

BYTE OFFSET

log₂(bytes/word)

Byte offset picks the byte inside a word (2 bits for 4-byte words).
Block offset picks the word inside the block.
Set index picks which set to look in.
Tag is compared against the stored tag in each slot of that set — match + valid = hit.

2.5.1Worked example (GQ Q0)

Total capacity 4C = 2^c+2 bytes, N-way set-assoc (N ≥ 8), block size b ≥ 8. Take c = 6, b = 8.

Cache total  = 2^8 = 256 bytes = 64 words
Block size   = 8 bytes = 2 words   (b ≥ 8)
Blocks total = 64 / 2 = 32  ... wait, let's use words.
                       Actually: 256 B / 8 B = 32 blocks if b=8 bytes,
                       OR if b=8 words=32 B: 256/32 = 8 blocks.

Sticking with the GQ paper's choice (b=8 words = 32 B per block):
  Blocks total = 8
  Associativity N ≥ 8 ⇒ N = 8
  Sets S       = 8 / 8 = 1 set

Block offset = log2(8 words) = 3 bits
Byte offset  = 2 bits
Set index    = log2(1) = 0 bits  ← no set bits needed!
Tag          = 32 − 3 − 2 = 27 bits

2.6Replacement policies

When a set is full and a new block arrives, which slot do we evict?

LRU (Least Recently Used): evict whichever slot hasn't been touched longest. Best behaviour, expensive to implement perfectly beyond 2-way.
FIFO: evict the oldest insertion.
Random: picks at random — surprisingly competitive and very cheap.

3.Cache calculations

3.1Hit rate & miss rate

Definitions

Hit rate (HR): fraction of accesses that find their block in the cache.
Miss rate (MR) = 1 − HR.

Misses split into three flavours (the "3 C's"):

Compulsory: first access ever to this block (cold start). Unavoidable.
Conflict: two blocks map to the same set and keep evicting each other (DM & low-assoc only).
Capacity: working set doesn't fit at all. Need a bigger cache.

3.2AMAT — average memory access time

Master formula AMAT = t_hit + MR × MissPenalty

Read it: "You always pay the hit time, plus, with probability MR, the cost of going to the next level."

3.3Multi-level AMAT

The recursive structure: each level's "miss penalty" is the AMAT of the next level.

2-level (cache + MM) AMAT = t_cache + MR_cache · t_MM

3-level (cache + MM + VM, no TLB) AMAT = t_cache + MR_cache · ( t_MM + MR_MM · t_VM )

3.4Worked: figuring out a cache from miss-rate fingerprint

Q (GQ Q3). 16-word cache, LRU, conflict miss rate = 7/14 = 50% on the byte-address sequence:

74  A0  78  38C  AC  84  888C  7C  34  3813C  38818C

Recipe. Try direct-mapped first (worst-case for conflicts).

Assume DM, block = 4 words (16 bytes). Then S = 16/4 = 4 sets ⇒ 2 set-index bits.
For each address, compute (tag, set) and walk through with LRU.
Count misses. Should equal 7 for the answer to fit.

✓ Direct-mapped, b = 4 words.

4.Virtual memory motivation

4.1The illusion of unlimited RAM

If physical RAM is 8 GiB but processes act like they have 4 GiB each, the OS+MMU must be juggling. Virtual memory:

Each process gets its own virtual address space (VAS) starting at 0.
The CPU emits virtual addresses. Assembly works in these.
The MMU translates virtual → physical every cycle using a page table.
If the physical page isn't resident, the OS pages it in from disk (a page fault).

4.2The RISC-V memory map

0xFFFFFFFC ┌──────────────────┐
           │   OS & I/O       │
0xC0000000 ├──────────────────┤
           │   Stack    ←─ sp │  grows down
           │      ↑           │
           │   Heap           │  grows up
0x10001000 ├──────────────────┤
           │   Global / Data  │
0x10000000 ├──────────────────┤
           │   Text (code)    │
0x00010000 ├──────────────────┤
           │   Exception hdr  │
0x00000000 └──────────────────┘

4.3Pages, VPN, page offset

Definitions

Page: the unit of virtual ↔ physical mapping. Usually 4 KiB.
VPN (Virtual Page Number): the upper bits of a virtual address.
PPN (Physical Page Number): the upper bits of the corresponding physical address.
Page offset: the lower bits — picks a byte inside the page. Identical in VA and PA.

VPN

addr − 12 bits (for 4 KiB pages)

Page offset

12 bits

4.3.1GQ Q29 example

Page = 2 KiWord. 4 segments (text/data/heap/stack), each occupying 1 page initially.

1 segment = 1 page = 2·1024 words
4 segments        = 8·1024 words = 2^13 words
1 word            = 4 B = 2^2 B
Total bytes       = 2^13 · 2^2 = 2^15 bytes

5.Page tables

5.1Single-level — and why it's too big

A page table has one PTE per virtual page. PTE = { Valid bit, PPN, perm bits }.

VPN ──► PageTable[VPN] = {V, PPN, perms}
         │                     │
         └─ if V=0: page fault └─ if V=1: form PA = PPN ‖ offset

Size problem 32-bit VA, 4 KiB pages ⇒ 2²⁰ = 1 M entries per process. At 4 B/entry, that's 4 MiB of page table per process. With 100 processes, 400 MiB of metadata — most of it unused.

5.2Multi-level page tables

Break the VPN into chunks. Each chunk indexes a smaller table; only allocate the second-level tables for VPN ranges actually used.

L0 idx (PT#)

→ root PT entry

L1 idx (PT offset)

→ leaf PT entry

Page offset

unchanged

Root PT (L0) is always resident in physical memory.
Inner PTs (L1, L2, …) live in virtual memory and can themselves be paged out.
Address translation now costs multiple memory accesses — one per level.

5.3Sizing a multi-level PT (GQ Q0(f))

Problem: 2³²-byte VA, page = Y KiB (Y∈{1,2,4,8}, from ERP digit). Level_i has 2^L_i entries. Find minimum number of levels.

Recipe:

VPN bits = 32 − log₂(Y · 1024).
Sum the level-index widths L₀ + L₁ + … until ≥ VPN bits.
If a digit is 0, replace with 8 (per the problem rule).

Example (ERP 32460, page = 8 KiB):

VPN bits = 32 − 13 = 19

L0 = 2^8  (X = 0 → 8, last digit)  ⇒ 8 bits
L1 = 2^6  (Y = 6, 2nd-last digit)   ⇒ 6 bits
L2 = 2^4  (Z = 4, 3rd-last digit)   ⇒ 4 bits
Running total: 8 + 6 + 4 = 18 (short by 1)

L3 = 2^1                            ⇒ 1 bit
Total: 8+6+4+1 = 19 ✓

⇒ 4 levels needed.

6.The TLB

6.1Why we need it

If every load/store needs a page-table walk (1 to 4 memory accesses), the memory system would collapse. The TLB is a tiny, fast cache that stores recent VPN → PPN translations.

Q28 — must-know

TLB (Translation Lookaside Buffer): a small fully-associative cache for address translations. Not for data; not for instructions; not for page-fault records.

6.2Full translation flow

VA ──► TLB lookup
        │
        ├─► HIT  ──► PPN ──► PA = PPN ‖ offset
        │
        └─► MISS ──► Walk page table (1 to N memory hits)
                         │
                         ├─► PTE valid: get PPN, fill TLB, retry
                         └─► PTE invalid: page fault → load from disk
                                                    │
                                                    ▼
                                              cache lookup
                                              (L1 → L2 → MM)

7.Memory performance calculations

7.1AMAT with TLB & multi-level PT

Full pipeline AMAT AMAT = t_TLB + MR_TLB · (n · t_walk)
+ (t_cache + MR_cache · (t_MM + MR_MM · t_VM))
where n = page-table levels to walk on TLB miss

7.1.1GQ Q5 worked

V = 3rd digit of ID (0 → use adjacent nonzero). t_TLB = V ns, HR_TLB = 95%, t_MM = (100 − 20V) ns.

Take V = 5:
  t_TLB = 5 ns
  MR_TLB = 5%
  t_MM   = 100 − 20·5 = 0 ns  ← degenerate; pick V=6 from adjacent
With V=6:
  t_MM = 100 − 120 = ??  ← also invalid; per GQ, AMAT ≈ 5 ns

✓ Pick V correctly, then mechanically substitute.

7.2Memory-stall CPI

MSCPI MSCPI = CPI_base + f × MR × Penalty
MSCPI_ideal = CPI_base (perfect cache)
Speedup_{perfect-cache} = MSCPI / MSCPI_ideal

Where f = fraction of memory-touching instructions (loads + stores). This is the formula that connects memory back to the CPI math from Chapter 7 of microarchitecture.

7.2.1GQ Q4 worked

CPI_base = 2.0, MR = 4%, penalty = 100 cyc, f ∈ [30%, 39%].

f	MSCPI	Speedup
0.30	2 + 0.30·0.04·100 = 3.20	3.20 / 2 = 1.60×
0.39	2 + 0.39·0.04·100 = 3.56	3.56 / 2 = 1.78×

✓ Range: 1.60 to 1.78.

7.3Hooking back into the CPU CPI

Memory misses don't appear in our CPI_pipeline from the microarchitecture chapter — that calculation assumed 1-cycle MEM. The true CPI of a real processor combines both:

Effective CPI CPI_eff = CPI_pipeline + f × MR × MissPenalty

And the end-to-end timing becomes:

Real execution time Time = N · CPI_eff · T_c
= N · ( CPI_pipe + f · MR · Penalty ) · T_c

The next page (System Performance →) walks through a full worked example combining a pipelined CPU CPI from §7.3 of the microarch chapter with an AMAT-derived memory penalty.

Chapter summary

Flat memory doesn't work — we build a hierarchy that exploits locality.
Caches are characterised by C, b, N, S. Direct-mapped = 1-way. Fully assoc = 1 set.
Every address splits into TAG / SET INDEX / BLOCK OFFSET / BYTE OFFSET.
AMAT = t_hit + MR · MissPenalty. Recursive across levels.
Virtual memory: VPN+offset, translated via page tables. Multi-level PTs save memory.
TLB caches translations to skip page-table walks.
Memory adds to the effective CPI: CPI_eff = CPI_pipe + f · MR · Penalty.

Source Memory Systems/MemorySystems.pdf + Ch8_Memory.pdf (Figures 8.23, 8.27). pagetable.png for the translation diagram.

Next: Combined system performance → (pipeline + memory together)

Chapter — Memory systems

Contents

1.The memory wall & locality

1.1Why a flat memory wouldn't work

1.2The hierarchy

1.3Temporal & spatial locality

2.Caches from first principles

2.1The four parameters

2.2Direct-mapped (N = 1)

2.3N-way set-associative

2.4Fully associative (N = total blocks)

2.5Address breakdown

2.5.1Worked example (GQ Q0)

2.6Replacement policies

3.Cache calculations

3.1Hit rate & miss rate

3.2AMAT — average memory access time

3.3Multi-level AMAT

3.4Worked: figuring out a cache from miss-rate fingerprint

4.Virtual memory motivation

4.1The illusion of unlimited RAM

4.2The RISC-V memory map

4.3Pages, VPN, page offset

4.3.1GQ Q29 example

5.Page tables

5.1Single-level — and why it's too big

5.2Multi-level page tables

5.3Sizing a multi-level PT (GQ Q0(f))

6.The TLB

6.1Why we need it

6.2Full translation flow

7.Memory performance calculations

7.1AMAT with TLB & multi-level PT

7.1.1GQ Q5 worked

7.2Memory-stall CPI

7.2.1GQ Q4 worked

7.3Hooking back into the CPU CPI