Chapter — Memory systems

Until now we treated IMEM and DMEM as magic 1-cycle boxes. They aren't. This chapter unpacks what really happens between the CPU and DRAM, and how it feeds back into our CPI math.

Where we're coming from In Microarchitecture and Advanced µArch, every lw/sw assumed a 1-cycle MEM stage. Real DRAM is ~100ns ≈ 200+ cycles. The memory system bridges that gap with a hierarchy of caches — and that hierarchy adds new sources of CPI penalty we'll fold in.

Contents

  1. The memory wall & locality
    1. Why a flat memory wouldn't work
    2. The hierarchy
    3. Temporal & spatial locality
  2. Caches from first principles
    1. The four key parameters (C, b, N, S)
    2. Direct-mapped
    3. Set-associative
    4. Fully associative
    5. Address breakdown — tag/index/offset
    6. Replacement policies (LRU)
  3. Cache calculations
    1. Hit rate / miss rate
    2. AMAT — average memory access time
    3. Multi-level AMAT
    4. Cache configuration worked example
  4. Virtual memory motivation
    1. The illusion of unlimited RAM
    2. The RISC-V memory map
    3. Pages, VPN, page offset
  5. Page tables
    1. Single-level — and why it's too big
    2. Multi-level page tables
    3. Sizing a multi-level PT (worked)
  6. The TLB
    1. Why we need it
    2. Full translation flow
  7. Memory performance calculations
    1. AMAT with TLB + multi-level PT
    2. Memory-stall CPI & perfect-cache speedup
    3. Hook into the CPU CPI

1.The memory wall & locality

1.1Why a flat memory wouldn't work

A 32-bit address space points at 4 GiB of bytes. If we built that as one giant SRAM (cache-speed memory), it would be the size of a building. If we built it as DRAM and connected it directly to the CPU, every load would stall the pipeline for 100+ ns.

The memory wall
Processor clocks improved ~50%/yr for decades; DRAM access times improved only ~7%/yr. The performance gap grew exponentially until caches became essential, not optional.

1.2The hierarchy

Registers
~1 KiB · ~0.3 ns · in core
L1 Cache
~32 KiB · ~1 ns · per core
L2 / L3 Cache
~MBs · ~3-30 ns · SRAM
Main Memory (DRAM)
~GBs · ~100 ns
Disk / VM
~TBs · ~10 ms — 105× slower!
Why this works at all
Programs touch only a small working set at any moment. If we keep that working set in L1, the average access time stays close to L1 speed even though we have GBs of DRAM behind it.

1.3Temporal & spatial locality

Definitions
Temporal locality: if you access an address now, you'll likely access it again soon.
Spatial locality: if you access an address, you'll likely access neighbours of it soon.
LocalityCache structure that exploits itDaily-life analogy
TemporalKeep recently used blocks (don't evict)Re-checking your phone for a reply
SpatialLarger blocks (fetch neighbours together)Reaching for milk → also grabbing eggs

2.Caches from first principles

2.1The four parameters

Vocabulary
C: total cache capacity (bytes or words).
b: block (line) size — unit of transfer between cache and the level below.
N: associativity — slots per set.
S: number of sets = C / (N · b).

2.2Direct-mapped (N = 1)

Each block has exactly one possible slot. Quickest tag check (one comparison), cheapest, but two hot blocks that happen to map to the same slot will keep evicting each other (conflict misses).

Q25 trap "1-way set associative" = "direct-mapped" — two names, same hardware.

2.3N-way set-associative

Each block can live in any of N slots within its set. The hardware does N tag comparisons in parallel on every access. Typical N: 2, 4, 8.

Increase N →Effect
+ fewer conflict missesHot blocks can coexist
− more hardwareN parallel tag comparators per set
− slightly slower hit timeMux to pick the matching way

2.4Fully associative (N = total blocks)

Any block can live anywhere. One set, all slots. No conflict misses ever — but tag check requires comparing against every slot in parallel. Used for tiny structures like the TLB (§6) where N is small.

2.5Address breakdown

Every memory address the CPU emits is split into three (or four) parts:

TAG
addr_bits − idx − off
SET INDEX
log2(S)
BLOCK OFFSET
log2(words/block)
BYTE OFFSET
log2(bytes/word)
  1. Byte offset picks the byte inside a word (2 bits for 4-byte words).
  2. Block offset picks the word inside the block.
  3. Set index picks which set to look in.
  4. Tag is compared against the stored tag in each slot of that set — match + valid = hit.

2.5.1Worked example (GQ Q0)

Total capacity 4C = 2c+2 bytes, N-way set-assoc (N ≥ 8), block size b ≥ 8. Take c = 6, b = 8.

Cache total  = 2^8 = 256 bytes = 64 words
Block size   = 8 bytes = 2 words   (b ≥ 8)
Blocks total = 64 / 2 = 32  ... wait, let's use words.
                       Actually: 256 B / 8 B = 32 blocks if b=8 bytes,
                       OR if b=8 words=32 B: 256/32 = 8 blocks.

Sticking with the GQ paper's choice (b=8 words = 32 B per block):
  Blocks total = 8
  Associativity N ≥ 8 ⇒ N = 8
  Sets S       = 8 / 8 = 1 set

Block offset = log2(8 words) = 3 bits
Byte offset  = 2 bits
Set index    = log2(1) = 0 bits  ← no set bits needed!
Tag          = 32 − 3 − 2 = 27 bits

2.6Replacement policies

When a set is full and a new block arrives, which slot do we evict?

3.Cache calculations

3.1Hit rate & miss rate

Definitions
Hit rate (HR): fraction of accesses that find their block in the cache.
Miss rate (MR) = 1 − HR.

Misses split into three flavours (the "3 C's"):

3.2AMAT — average memory access time

Master formula AMAT = thit + MR × MissPenalty

Read it: "You always pay the hit time, plus, with probability MR, the cost of going to the next level."

3.3Multi-level AMAT

The recursive structure: each level's "miss penalty" is the AMAT of the next level.

2-level (cache + MM) AMAT = tcache + MRcache · tMM
3-level (cache + MM + VM, no TLB) AMAT = tcache + MRcache · ( tMM + MRMM · tVM )

3.4Worked: figuring out a cache from miss-rate fingerprint

Q (GQ Q3). 16-word cache, LRU, conflict miss rate = 7/14 = 50% on the byte-address sequence:

74  A0  78  38C  AC  84  888C  7C  34  3813C  38818C

Recipe. Try direct-mapped first (worst-case for conflicts).

  1. Assume DM, block = 4 words (16 bytes). Then S = 16/4 = 4 sets ⇒ 2 set-index bits.
  2. For each address, compute (tag, set) and walk through with LRU.
  3. Count misses. Should equal 7 for the answer to fit.

✓ Direct-mapped, b = 4 words.

4.Virtual memory motivation

4.1The illusion of unlimited RAM

If physical RAM is 8 GiB but processes act like they have 4 GiB each, the OS+MMU must be juggling. Virtual memory:

  1. Each process gets its own virtual address space (VAS) starting at 0.
  2. The CPU emits virtual addresses. Assembly works in these.
  3. The MMU translates virtual → physical every cycle using a page table.
  4. If the physical page isn't resident, the OS pages it in from disk (a page fault).

4.2The RISC-V memory map

0xFFFFFFFC ┌──────────────────┐
           │   OS & I/O       │
0xC0000000 ├──────────────────┤
           │   Stack    ←─ sp │  grows down
           │      ↑           │
           │   Heap           │  grows up
0x10001000 ├──────────────────┤
           │   Global / Data  │
0x10000000 ├──────────────────┤
           │   Text (code)    │
0x00010000 ├──────────────────┤
           │   Exception hdr  │
0x00000000 └──────────────────┘

4.3Pages, VPN, page offset

Definitions
Page: the unit of virtual ↔ physical mapping. Usually 4 KiB.
VPN (Virtual Page Number): the upper bits of a virtual address.
PPN (Physical Page Number): the upper bits of the corresponding physical address.
Page offset: the lower bits — picks a byte inside the page. Identical in VA and PA.
VPN
addr − 12 bits (for 4 KiB pages)
Page offset
12 bits

4.3.1GQ Q29 example

Page = 2 KiWord. 4 segments (text/data/heap/stack), each occupying 1 page initially.

1 segment = 1 page = 2·1024 words
4 segments        = 8·1024 words = 2^13 words
1 word            = 4 B = 2^2 B
Total bytes       = 2^13 · 2^2 = 2^15 bytes

5.Page tables

5.1Single-level — and why it's too big

A page table has one PTE per virtual page. PTE = { Valid bit, PPN, perm bits }.

VPN ──► PageTable[VPN] = {V, PPN, perms}
         │                     │
         └─ if V=0: page fault └─ if V=1: form PA = PPN ‖ offset
Size problem 32-bit VA, 4 KiB pages ⇒ 220 = 1 M entries per process. At 4 B/entry, that's 4 MiB of page table per process. With 100 processes, 400 MiB of metadata — most of it unused.

5.2Multi-level page tables

Break the VPN into chunks. Each chunk indexes a smaller table; only allocate the second-level tables for VPN ranges actually used.

L0 idx (PT#)
→ root PT entry
L1 idx (PT offset)
→ leaf PT entry
Page offset
unchanged

5.3Sizing a multi-level PT (GQ Q0(f))

Problem: 232-byte VA, page = Y KiB (Y∈{1,2,4,8}, from ERP digit). Leveli has 2Li entries. Find minimum number of levels.

Recipe:

  1. VPN bits = 32 − log2(Y · 1024).
  2. Sum the level-index widths L0 + L1 + … until ≥ VPN bits.
  3. If a digit is 0, replace with 8 (per the problem rule).

Example (ERP 32460, page = 8 KiB):

VPN bits = 32 − 13 = 19

L0 = 2^8  (X = 0 → 8, last digit)  ⇒ 8 bits
L1 = 2^6  (Y = 6, 2nd-last digit)   ⇒ 6 bits
L2 = 2^4  (Z = 4, 3rd-last digit)   ⇒ 4 bits
Running total: 8 + 6 + 4 = 18 (short by 1)

L3 = 2^1                            ⇒ 1 bit
Total: 8+6+4+1 = 19 ✓

⇒ 4 levels needed.

6.The TLB

6.1Why we need it

If every load/store needs a page-table walk (1 to 4 memory accesses), the memory system would collapse. The TLB is a tiny, fast cache that stores recent VPN → PPN translations.

Q28 — must-know
TLB (Translation Lookaside Buffer): a small fully-associative cache for address translations. Not for data; not for instructions; not for page-fault records.

6.2Full translation flow

VA ──► TLB lookup
        │
        ├─► HIT  ──► PPN ──► PA = PPN ‖ offset
        │
        └─► MISS ──► Walk page table (1 to N memory hits)
                         │
                         ├─► PTE valid: get PPN, fill TLB, retry
                         └─► PTE invalid: page fault → load from disk
                                                    │
                                                    ▼
                                              cache lookup
                                              (L1 → L2 → MM)

7.Memory performance calculations

7.1AMAT with TLB & multi-level PT

Full pipeline AMAT AMAT = tTLB + MRTLB · (n · twalk)
       + (tcache + MRcache · (tMM + MRMM · tVM))
where n = page-table levels to walk on TLB miss

7.1.1GQ Q5 worked

V = 3rd digit of ID (0 → use adjacent nonzero). tTLB = V ns, HRTLB = 95%, tMM = (100 − 20V) ns.

Take V = 5:
  t_TLB = 5 ns
  MR_TLB = 5%
  t_MM   = 100 − 20·5 = 0 ns  ← degenerate; pick V=6 from adjacent
With V=6:
  t_MM = 100 − 120 = ??  ← also invalid; per GQ, AMAT ≈ 5 ns

✓ Pick V correctly, then mechanically substitute.

7.2Memory-stall CPI

MSCPI MSCPI = CPIbase + f × MR × Penalty
MSCPIideal = CPIbase (perfect cache)
Speedupperfect-cache = MSCPI / MSCPIideal

Where f = fraction of memory-touching instructions (loads + stores). This is the formula that connects memory back to the CPI math from Chapter 7 of microarchitecture.

7.2.1GQ Q4 worked

CPIbase = 2.0, MR = 4%, penalty = 100 cyc, f ∈ [30%, 39%].

fMSCPISpeedup
0.302 + 0.30·0.04·100 = 3.203.20 / 2 = 1.60×
0.392 + 0.39·0.04·100 = 3.563.56 / 2 = 1.78×

✓ Range: 1.60 to 1.78.

7.3Hooking back into the CPU CPI

Memory misses don't appear in our CPIpipeline from the microarchitecture chapter — that calculation assumed 1-cycle MEM. The true CPI of a real processor combines both:

Effective CPI CPIeff = CPIpipeline + f × MR × MissPenalty

And the end-to-end timing becomes:

Real execution time Time = N · CPIeff · Tc
    = N · ( CPIpipe + f · MR · Penalty ) · Tc

The next page (System Performance →) walks through a full worked example combining a pipelined CPU CPI from §7.3 of the microarch chapter with an AMAT-derived memory penalty.

Chapter summary
  1. Flat memory doesn't work — we build a hierarchy that exploits locality.
  2. Caches are characterised by C, b, N, S. Direct-mapped = 1-way. Fully assoc = 1 set.
  3. Every address splits into TAG / SET INDEX / BLOCK OFFSET / BYTE OFFSET.
  4. AMAT = thit + MR · MissPenalty. Recursive across levels.
  5. Virtual memory: VPN+offset, translated via page tables. Multi-level PTs save memory.
  6. TLB caches translations to skip page-table walks.
  7. Memory adds to the effective CPI: CPIeff = CPIpipe + f · MR · Penalty.
Source Memory Systems/MemorySystems.pdf + Ch8_Memory.pdf (Figures 8.23, 8.27). pagetable.png for the translation diagram.

Next: Combined system performance → (pipeline + memory together)