Chapter — Combined system performance
Putting it all together: the real CPI, IPC, and execution time of a processor when the pipeline and the memory hierarchy are both in the picture.
- Microarchitecture gave us CPI_pipe: the cost of pipeline hazards (load-use, branch mispredict, etc.).
- Memory Systems gave us AMAT and the memory-stall CPI term.
1. Why we need a combined model
If you only count pipeline stalls, you get a CPI like 1.23 for SPECINT, which is the textbook number. No real program runs at that CPI, though, because the memory hierarchy has not been modelled: every access is assumed to hit in the cache.
If you only count memory stalls, you ignore the cost of branch mispredictions, load-use hazards, etc.
2. The effective CPI equation
2.1 Per-stall contribution accounting
Start from the ideal pipelined CPI of 1, then add the average penalty each instruction type contributes:
CPI_eff = 1
        + f_lw × P(load-use stall) × 1
        + f_br × P(mispredict) × 2
        + f_mem × MR_cache × Penalty_miss
Or, equivalently, start from the pipelined CPI you already computed and add only the memory term:
CPI_eff = CPI_pipe + f_mem × MR_cache × Penalty_miss
| Term | Where it comes from | Typical magnitude |
|---|---|---|
| 1 | Ideal pipeline | 1 |
| Load-use penalty | µArch §6.3 | ~0.1 (e.g. 0.25·0.4·1) |
| Branch mispredict | µArch §6.4 + Adv §3 | ~0.1 (e.g. 0.13·0.5·2) |
| Memory stall | Memory §7 | ~0.05–2.0 (huge range!) |
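The two accounting styles above can be checked against each other with a minimal sketch. The function and parameter names are illustrative assumptions; the numbers are the example values from the table and from the worked example in this chapter (f_mem = 0.35, miss rate 4%, 100-cycle penalty).

```python
def cpi_per_stall(f_lw, p_load_use, f_br, p_mispredict,
                  f_mem, miss_rate, miss_penalty,
                  load_use_cycles=1, mispredict_cycles=2):
    """Ideal CPI of 1 plus the average stall cycles from each source."""
    load_use = f_lw * p_load_use * load_use_cycles
    branch = f_br * p_mispredict * mispredict_cycles
    memory = f_mem * miss_rate * miss_penalty
    return 1.0 + load_use + branch + memory


def cpi_compact(cpi_pipe, f_mem, miss_rate, miss_penalty):
    """Pipelined CPI (hazards already included) plus the memory stall term."""
    return cpi_pipe + f_mem * miss_rate * miss_penalty


full = cpi_per_stall(0.25, 0.4, 0.13, 0.5, 0.35, 0.04, 100)
compact = cpi_compact(1.23, 0.35, 0.04, 100)
print(f"per-stall: {full:.2f}, compact: {compact:.2f}")  # both 2.63
```

Both forms agree because CPI_pipe is just the ideal CPI of 1 with the load-use and branch terms already folded in.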
2.2 From CPI_eff to time & IPC
Time = N_instructions × CPI_eff × T_c
IPC_eff = 1 / CPI_eff
Throughput = IPC_eff × frequency
3. End-to-end worked example
3.1 Step 1: pipeline CPI
From the µArch chapter (§7.3):
| Type | Freq | CPI |
|---|---|---|
| lw | 0.25 | 0.6·1 + 0.4·2 = 1.4 |
| sw | 0.10 | 1.0 |
| br | 0.13 | 0.5·1 + 0.5·3 = 2.0 |
| R | 0.52 | 1.0 |
CPI_pipe = 0.25·1.4 + 0.10·1.0 + 0.13·2.0 + 0.52·1.0 = 1.23
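As a quick check of the weighted sum, here is a small sketch that stores the instruction mix as (frequency, CPI) pairs and averages them. The dictionary layout and the `weighted_cpi` helper are assumptions made for illustration.

```python
# (frequency, CPI) per instruction type, copied from the table above.
mix = {
    "lw": (0.25, 0.6 * 1 + 0.4 * 2),  # 40% of loads take a 1-cycle load-use stall
    "sw": (0.10, 1.0),
    "br": (0.13, 0.5 * 1 + 0.5 * 3),  # 50% of branches pay 2 extra cycles
    "R":  (0.52, 1.0),
}


def weighted_cpi(mix):
    """Average CPI over the mix; the frequencies must sum to 1."""
    total_freq = sum(freq for freq, _ in mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError(f"frequencies sum to {total_freq}, expected 1.0")
    return sum(freq * cpi for freq, cpi in mix.values())


cpi_pipe = weighted_cpi(mix)
print(f"CPI_pipe = {cpi_pipe:.2f}")  # 1.23
```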
3.2 Step 2: memory penalty
From the Memory chapter (§7): assume L1 hit time = 1 cycle, L1 miss rate = 4%, miss penalty (to L2 or main memory) = 100 cycles. f_mem = loads + stores = 0.35.
Memory stall term = f_mem × MR × Penalty
= 0.35 × 0.04 × 100
= 1.40 cycles/instruction
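To see how sensitive this term is to the miss rate, the sketch below sweeps a few miss rates at the same f_mem and penalty. Only the 4% row comes from this example; the other miss rates are hypothetical values chosen for illustration.

```python
def memory_stall_cpi(f_mem, miss_rate, miss_penalty):
    """Average stall cycles per instruction caused by cache misses."""
    return f_mem * miss_rate * miss_penalty


F_MEM = 0.35    # loads + stores, from the example
PENALTY = 100   # cycles per miss, from the example

# 4% is the example's miss rate; the others are hypothetical.
for miss_rate in (0.01, 0.02, 0.04, 0.08):
    stall = memory_stall_cpi(F_MEM, miss_rate, PENALTY)
    print(f"MR = {miss_rate:.0%}: {stall:.2f} stall cycles/instruction")
```

The term is linear in each factor, so halving the miss rate or halving the penalty has the same effect on CPI.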
3.3 Step 3: combined effective CPI
CPI_eff = CPI_pipe + memory stall
= 1.23 + 1.40
= 2.63
Notice: memory contributes more than the entire pipeline did. That's why memory dominates real-world performance.
3.4 Step 4: IPC & total time
Using a cycle time T_c = 350 ps and a program of 10^11 instructions:
IPC_eff = 1 / 2.63 = 0.38
Frequency = 1 / 350 ps = 2.86 GHz
Throughput = 2.86 × 10^9 × 0.38 = 1.09 × 10^9 instructions/s
Time = 10^11 × 2.63 × 350 ps
= 92.05 seconds
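The same step can be written as a short sketch. The `performance` helper is an illustrative name, and the inputs (10^11 instructions, CPI 2.63, 350 ps) are the values used in this example. Because it works with unrounded intermediates, its throughput differs slightly in the later digits from the hand calculation, which multiplies the rounded 2.86 and 0.38.

```python
def performance(n_instructions, cpi_eff, t_cycle_s):
    """Return (IPC, frequency in Hz, instructions/s, run time in s)."""
    ipc = 1.0 / cpi_eff
    frequency_hz = 1.0 / t_cycle_s
    throughput = ipc * frequency_hz
    time_s = n_instructions * cpi_eff * t_cycle_s
    return ipc, frequency_hz, throughput, time_s


ipc, freq, throughput, time_s = performance(1e11, 2.63, 350e-12)
print(f"IPC = {ipc:.2f}")
print(f"f = {freq / 1e9:.2f} GHz")
print(f"throughput = {throughput:.3g} instructions/s")
print(f"time = {time_s:.2f} s")
```

Time and throughput are two views of the same quantity: throughput × time gives back the instruction count.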
| Model | CPI | Time |
|---|---|---|
| Single-cycle (µArch §3.3) | 1.0 | 75 s |
| Multicycle (µArch §4.4) | 4.12 | 103 s |
| Pipelined, perfect cache | 1.23 | 43 s |
| Pipelined + real cache | 2.63 | 92 s |
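A "real" pipelined processor with realistic memory (92 s) ends up slower than the single-cycle CPU modelled with no memory penalty (75 s), and more than twice as slow as the same pipeline with a perfect cache (43 s). The pipeline only pays off if you also tame the memory hierarchy.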
4. What to optimise: where is the bottleneck?
Break down CPI_eff = 2.63 by source:
| Source | Contribution | % of CPI |
|---|---|---|
| Ideal pipeline | 1.00 | 38% |
| Load-use stalls | 0.10 | 4% |
| Branch mispredicts | 0.13 | 5% |
| Memory misses | 1.40 | 53% |
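The breakdown can be recomputed and ranked with a few lines. This is a sketch; the dictionary keys simply mirror the table rows, and the values are the example's own terms.

```python
contributions = {
    "Ideal pipeline": 1.00,
    "Load-use stalls": 0.25 * 0.4 * 1,
    "Branch mispredicts": 0.13 * 0.5 * 2,
    "Memory misses": 0.35 * 0.04 * 100,
}

cpi_eff = sum(contributions.values())
shares = {source: cycles / cpi_eff for source, cycles in contributions.items()}

# Largest contributor first: that is where optimisation effort pays most.
for source, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source:<20}{contributions[source]:>6.2f}{share:>7.0%}")
```

Sorting by share makes the bottleneck explicit: memory misses come first, ahead of even the ideal-pipeline baseline.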
5. Speedup analyses you'll be asked about
5.1 Perfect cache (MR = 0)
CPI_eff_perfect = 1.23 (memory term → 0)
Speedup = CPI_eff_real / CPI_eff_perfect
= 2.63 / 1.23 = 2.14×
Equivalent formulation (GQ Q4-style):
MSCPI = CPI_base + f · MR · Penalty
MSCPI_ideal = CPI_base
Speedup = MSCPI / MSCPI_ideal
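A minimal sketch of the MSCPI formulation, using an assumed helper name and the numbers from this example:

```python
def perfect_cache_speedup(cpi_base, f_mem, miss_rate, miss_penalty):
    """Speedup from removing all miss stalls, at an unchanged clock."""
    mscpi = cpi_base + f_mem * miss_rate * miss_penalty
    mscpi_ideal = cpi_base
    return mscpi / mscpi_ideal


speedup = perfect_cache_speedup(1.23, 0.35, 0.04, 100)
print(f"perfect-cache speedup = {speedup:.2f}x")  # 2.14x
```

Since instruction count and cycle time are the same in both cases, the ratio of CPIs equals the ratio of execution times.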
5.2 Better branch predictor
Drop mispredict rate from 50% to 10%:
CPI_br new = 0.9·1 + 0.1·3 = 1.20 (vs 2.0 before)
CPI_pipe new = 0.25·1.4 + 0.10·1 + 0.13·1.20 + 0.52·1 = 1.126
CPI_eff new = 1.126 + 1.40 = 2.526
Speedup = 2.63 / 2.526 = 1.04×
Modest gain because memory dominates. With a perfect cache, the same change gives 1.23 / 1.126 = 1.09×, more than twice the improvement.
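The comparison between the real-cache and perfect-cache cases can be sketched as follows. The function names are illustrative assumptions, and the instruction mix and penalties are the ones from the worked example.

```python
MEMORY_STALL = 0.35 * 0.04 * 100  # 1.40 cycles/instruction, from Step 2


def cpi_pipe(mispredict_rate):
    """Pipeline CPI for the example mix at a given branch mispredict rate."""
    cpi_br = (1 - mispredict_rate) * 1 + mispredict_rate * 3
    return 0.25 * 1.4 + 0.10 * 1.0 + 0.13 * cpi_br + 0.52 * 1.0


def predictor_speedup(old_rate, new_rate, memory_stall):
    """Speedup from a better predictor, with a fixed memory stall term."""
    old_cpi = cpi_pipe(old_rate) + memory_stall
    new_cpi = cpi_pipe(new_rate) + memory_stall
    return old_cpi / new_cpi


real_cache = predictor_speedup(0.5, 0.1, MEMORY_STALL)
perfect_cache = predictor_speedup(0.5, 0.1, 0.0)
print(f"real cache: {real_cache:.2f}x, perfect cache: {perfect_cache:.2f}x")
```

The predictor saves the same number of cycles per instruction in both cases; the memory stall term only dilutes that saving as a fraction of the total CPI.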
5.3 Doubling clock frequency
Halve T_c from 350 ps to 175 ps. The miss latency is fixed in wall-clock time (100 cycles × 350 ps = 35 ns), so the same miss now costs 200 cycles at the new clock.
Memory term new = 0.35 · 0.04 · 200 = 2.80
CPI_eff new = 1.23 + 2.80 = 4.03
Time new = 10^11 × 4.03 × 175 ps = 70.5 s
Speedup vs original = 92 / 70.5 = 1.30×
Doubling the frequency buys only 1.30×, not the 2× you might expect, because memory latency is fixed in real time, not in cycles.
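One way to model this is to keep the miss latency in seconds and convert it to cycles for whatever clock is chosen. The sketch below does that under two assumptions that should be read as simplifications: CPI_pipe stays at 1.23 when the clock changes, and the penalty may be a fractional number of cycles. The names are illustrative.

```python
def run_time(n_instructions, cpi_pipe, f_mem, miss_rate,
             miss_latency_s, t_cycle_s):
    """Execution time when miss latency is fixed in seconds, not cycles."""
    penalty_cycles = miss_latency_s / t_cycle_s
    cpi_eff = cpi_pipe + f_mem * miss_rate * penalty_cycles
    return n_instructions * cpi_eff * t_cycle_s


MISS_LATENCY_S = 100 * 350e-12  # 35 ns: 100 cycles at the original clock

base = run_time(1e11, 1.23, 0.35, 0.04, MISS_LATENCY_S, 350e-12)
fast = run_time(1e11, 1.23, 0.35, 0.04, MISS_LATENCY_S, 175e-12)
print(f"{base:.2f} s -> {fast:.2f} s, speedup = {base / fast:.2f}x")
```

In this model the memory part of the run time, N × f_mem × MR × latency = 49 s, does not shrink at all as the clock gets faster, so no clock increase alone can push the speedup past 92.05 / 49 ≈ 1.88×.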
- Pipeline and memory penalties add in CPI_eff.
- Compact form: CPI_eff = CPI_pipe + f_mem · MR · Penalty.
- Real-world CPI is usually dominated by memory, not pipeline hazards.
- Time = N · CPI_eff · T_c. IPC_eff = 1 / CPI_eff.
- Doubling clock frequency rarely doubles performance, because memory latency is fixed in real time.
- Perfect-cache speedup is 2.14× in our example: that's the prize for nailing the memory hierarchy.