Chapter — Microarchitecture (from assembly to pipelining)

A bottom-up build: RISC-V base assembly → instruction formats → single-cycle datapath → performance math → multicycle → pipelining → hazards → pipelined performance.

Contents

  1. RISC-V base — registers & instruction types
    1. Why we need a microarchitecture
    2. The register file
    3. Categories of instructions
    4. Six instruction formats (R, I, S, B, U, J)
  2. The single-cycle datapath
    1. Five universal tasks an instruction performs
    2. Datapath components
    3. R-type walkthrough
    4. lw walkthrough
    5. beq walkthrough
  3. Single-cycle performance
    1. Definitions: CPI, IPC, Tc, execution time
    2. The critical path
    3. Worked example
  4. Multicycle — a brief detour
  5. The pipelined datapath
    1. The five stages (IF · ID · EX · MEM · WB)
    2. Pipeline registers
    3. Steady-state throughput & speedup
  6. Pipeline hazards
    1. Data hazards: RAW, WAR, WAW
    2. Forwarding (bypassing)
    3. The load-use hazard (1 stall)
    4. Control hazards (branch flushes)
    5. Structural hazards
  7. Pipelined performance math
    1. CPI per instruction type
    2. Weighted-average CPI
    3. SPECINT2000 worked example
    4. Speedup over single-cycle & multicycle

1. RISC-V base: registers & instruction types

Before we can talk about how the hardware executes an instruction, we need to know what an instruction is. This section is the foundation everything else hangs on.

1.1 Why we need a microarchitecture

Definition
Architecture (ISA): the contract — what instructions exist, what they do, what registers there are. Visible to the programmer.
Microarchitecture: the implementation — wires, gates, registers, pipelines that actually run those instructions. Invisible to the programmer.

RISC-V is an architecture (an ISA). A single-cycle CPU, a pipelined CPU, and a fancy out-of-order CPU can all implement that same ISA — they're different microarchitectures with different performance/cost trade-offs.

Q26 trap: Verilog describes the microarchitecture, not the architecture. The ISA is specified in a manual; assembly is what compilers emit; Verilog wires up the actual hardware.

1.2 The RISC-V register file

RV32I gives you 32 general-purpose registers, each 32 bits wide, named x0 through x31. ABI names overlay them:

| Register | ABI name | Role |
|---|---|---|
| x0 | zero | Hardwired to 0; writes are discarded |
| x1 | ra | Return address |
| x2 | sp | Stack pointer |
| x3 | gp | Global pointer |
| x4 | tp | Thread pointer |
| x5-x7, x28-x31 | t0-t6 | Temporaries (caller-saved) |
| x8-x9, x18-x27 | s0-s11 | Saved registers (callee-saved) |
| x10-x17 | a0-a7 | Arguments / return values |

Key takeaway
Two reads + one write happen every cycle in a single-cycle CPU. That's why the register file in our diagrams shows two read ports + one write port.
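
A minimal behavioural sketch of that register file, assuming a Python model rather than real hardware (the class name `RegFile` and its method names are illustrative, not from the source):

```python
class RegFile:
    """32 x 32-bit register file: two reads and one write per cycle."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs1, rs2):
        # Both read ports are served in the same cycle.
        return self.regs[rs1], self.regs[rs2]

    def write(self, rd, value):
        # x0 is hardwired to zero, so writes to it are discarded.
        if rd != 0:
            self.regs[rd] = value & 0xFFFFFFFF  # keep 32 bits


rf = RegFile()
rf.write(1, 7)
rf.write(0, 99)        # dropped: x0 stays 0
print(rf.read(1, 0))   # (7, 0)
```

The single `read(rs1, rs2)` call mirrors the two read ports; `write` mirrors the one write port.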

1.3Categories of instructions

Every RISC-V instruction falls into one of four behavioural categories:

| Category | What it does | Examples |
|---|---|---|
| Compute | RegFile → ALU → RegFile | add, sub, and, or, slt, sll, addi |
| Memory | Move between RegFile and DMEM | lw, lb, lh, sw, sb, sh |
| Branch | Conditional jump if comparison holds | beq, bne, blt, bge |
| Jump | Unconditional jump (often with return-address save) | jal, jalr |

1.4 Six instruction formats

The 32-bit instruction word is sliced differently depending on what fields it needs. The category drives the format:

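| Format | Used by | Has rs1? | Has rs2? | Has rd? | Has immediate? |
|---|---|---|---|---|---|
| R-type | Reg-reg ALU (add, sub, …) | yes | yes | yes | no |
| I-type | addi, andi, lw, jalr | yes | no | yes | 12-bit |
| S-type | sw, sb, sh (stores) | yes (base) | yes (data) | no | 12-bit (split) |
| B-type | beq, bne, blt, bge | yes | yes | no | 12-bit (split) |
| U-type | lui, auipc | no | no | yes | 20-bit (upper) |
| J-type | jal | no | no | yes | 20-bit (offset) |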

1.4.1 Bit layouts

Each field is listed with the instruction bits it occupies, most significant first.

R-type: funct7 (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0)
I-type: imm[11:0] (31:20), rs1 (19:15), funct3 (14:12), rd (11:7), opcode (6:0)
S-type: imm[11:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:0] (11:7), opcode (6:0)
B-type: imm[12|10:5] (31:25), rs2 (24:20), rs1 (19:15), funct3 (14:12), imm[4:1|11] (11:7), opcode (6:0)
U-type: imm[31:12] (31:12), rd (11:7), opcode (6:0)
J-type: imm[20|10:1|11|19:12] (31:12), rd (11:7), opcode (6:0)

Why immediates are split in S and B: rs1 and rs2 occupy the same bit positions in every format that has them. That regularity lets the hardware read rs1/rs2 in parallel with deciding what type of instruction it is. The price: the immediate gets chopped into pieces.
Key takeaway
Six formats, but only one layout for the register fields. The immediate generator (Imm Gen) is the block that knows how to reassemble whichever immediate shape this instruction uses.
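
The field slicing and immediate reassembly described above can be sketched in Python. This is an illustrative model, assuming the standard RV32I bit layouts listed in this section; the function names (`bits`, `sign_extend`, `decode_fields`, `imm_gen`) are hypothetical helpers, not part of any toolchain.

```python
def bits(word, hi, lo):
    """Return word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as two's complement."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode_fields(instr):
    """Slice the fixed-position fields; not every format uses all of them."""
    return {
        "opcode": bits(instr, 6, 0),
        "rd": bits(instr, 11, 7),
        "funct3": bits(instr, 14, 12),
        "rs1": bits(instr, 19, 15),
        "rs2": bits(instr, 24, 20),
        "funct7": bits(instr, 31, 25),
    }


def imm_gen(instr, fmt):
    """Reassemble and sign-extend the immediate for format I, S, B, U or J."""
    if fmt == "I":
        return sign_extend(bits(instr, 31, 20), 12)
    if fmt == "S":
        return sign_extend(bits(instr, 31, 25) << 5 | bits(instr, 11, 7), 12)
    if fmt == "B":  # imm[0] is implicitly 0
        imm = (bits(instr, 31, 31) << 12 | bits(instr, 7, 7) << 11
               | bits(instr, 30, 25) << 5 | bits(instr, 11, 8) << 1)
        return sign_extend(imm, 13)
    if fmt == "U":
        return sign_extend(bits(instr, 31, 12) << 12, 32)
    if fmt == "J":  # imm[0] is implicitly 0
        imm = (bits(instr, 31, 31) << 20 | bits(instr, 19, 12) << 12
               | bits(instr, 20, 20) << 11 | bits(instr, 30, 21) << 1)
        return sign_extend(imm, 21)
    raise ValueError(f"no immediate for format {fmt!r}")


print(decode_fields(0x002081B3)["rd"])   # add x3, x1, x2 -> 3
print(imm_gen(0xFE000EE3, "B"))          # beq x0, x0, -4 -> -4
```

Note that `decode_fields` never looks at the format: the register fields sit in fixed positions, which is exactly the regularity the split immediates pay for.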

2. The single-cycle datapath

"Single-cycle" means every instruction begins and finishes within one clock period. We'll see this is conceptually simple but performance-limited.

2.1 The five universal tasks

Regardless of which instruction is executing, the hardware must (in order):

  1. Fetch the instruction from instruction memory (IMEM) at address PC.
  2. Decode it: pull the opcode, read up to 2 source registers, sign-extend the immediate.
  3. Execute in the ALU: add, sub, compare, or compute an address.
  4. Memory access (only for lw/sw): read or write data memory (DMEM).
  5. Writeback: store the result into the destination register, and update PC.
Important
In single-cycle, all five tasks happen in one clock period. In pipelining, each task gets its own stage and overlaps with neighbours' tasks. Same five tasks, just timed differently.

2.2 Datapath components

[Figure: high-level single-cycle datapath. Blocks: PC, IMEM (read-only), RegFile (32 × 32-bit, 2 read ports + 1 write port), Imm Gen, ALU, DMEM (R/W), WB mux. Dashed arrow = writeback into RegFile.]

| Block | Role |
|---|---|
| PC | Holds the address of the current instruction. |
| IMEM | Reads the 32-bit instruction at IMEM[PC]. |
| RegFile | Reads rs1 and rs2 in parallel. Writes rd on clock edge. |
| Imm Gen | Sign-extends/reassembles the immediate from the instruction bits. |
| ALU | Performs the arithmetic/logic operation (or address calc for lw/sw). |
| DMEM | Reads (lw) or writes (sw) data at the address the ALU computed. |
| Control | Decodes the opcode → sets all the mux selects (not shown). |

2.3 Walkthrough: R-type (add x3, x1, x2)

  1. PC drives IMEM → instruction fetched.
  2. RegFile reads x1 and x2 in parallel.
  3. ALU computes x1 + x2. DMEM is bypassed.
  4. The ALU result is muxed into the RegFile's write port; on the next clock edge x3 ← result.
  5. PC ← PC + 4.

2.4 Walkthrough: lw x3, 12(x1)

  1. Fetch the instruction.
  2. RegFile reads x1; Imm Gen extracts 12.
  3. ALU computes x1 + 12 = effective address.
  4. DMEM reads the word at that address.
  5. DMEM result → RegFile write port; x3 ← memory value on clock edge.
Notice lw is the only instruction that uses all five blocks in series: PC → IMEM → RegFile → ALU → DMEM → RegFile. That makes it the longest path through the datapath — keep that in mind for §3.

2.5 Walkthrough: beq x1, x2, LABEL

  1. Fetch + decode.
  2. RegFile reads x1, x2. Imm Gen builds the branch offset.
  3. ALU computes x1 − x2; zero-detect bit determines "taken".
  4. A second adder (or the same ALU) computes PC + offset.
  5. PC ← taken ? (PC + offset) : (PC + 4).
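
The three walkthroughs can be condensed into one step function. This is a simplified sketch, assuming instructions are already decoded into tuples and DMEM is a dictionary keyed by byte address; the name `step` and the tuple layout are illustrative choices, not the source's.

```python
def step(pc, regs, dmem, imem):
    """Run one pre-decoded instruction and return the next PC."""
    op, a, b, c = imem[pc // 4]                    # fetch (decode is trivial here)
    if op == "add":                                # add rd=a, rs1=b, rs2=c
        result = (regs[b] + regs[c]) & 0xFFFFFFFF  # execute in the ALU
        if a != 0:
            regs[a] = result                       # writeback (x0 stays 0)
        return pc + 4
    if op == "lw":                                 # lw rd=a, imm=b (rs1=c)
        addr = (regs[c] + b) & 0xFFFFFFFF          # effective address
        value = dmem[addr]                         # memory read
        if a != 0:
            regs[a] = value                        # writeback
        return pc + 4
    if op == "beq":                                # beq rs1=a, rs2=b, offset=c
        taken = ((regs[a] - regs[b]) & 0xFFFFFFFF) == 0   # zero-detect
        return pc + c if taken else pc + 4
    raise ValueError(f"unsupported op {op!r}")


regs = [0] * 32
regs[1], regs[2] = 100, 5
dmem = {112: 37}
imem = [
    ("lw", 3, 12, 1),    # x3 = DMEM[x1 + 12] = 37
    ("add", 4, 3, 2),    # x4 = x3 + x2 = 42
    ("beq", 4, 4, 8),    # always taken: skips the next instruction
    ("add", 4, 0, 0),    # skipped
    ("add", 5, 4, 0),    # x5 = x4
]
pc = 0
while pc < 4 * len(imem):
    pc = step(pc, regs, dmem, imem)
print(regs[3], regs[4], regs[5])   # 37 42 42
```

Each call to `step` corresponds to one clock period of the single-cycle machine: everything from fetch to PC update happens before it returns.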

3. Single-cycle performance

Now that we know what the hardware does, we can ask: how fast does it run?

3.1 Performance definitions

Definitions

Tc (clock period): seconds per clock cycle. f = 1/Tc: clock frequency.

CPI: Cycles Per Instruction — average number of clocks each instruction takes.

IPC: Instructions Per Cycle = 1 / CPI. The reciprocal — equally common.

Execution Time: the only metric that matters end-to-end.

Master formula: Execution Time = #Instructions × CPI × Tc

To go faster, attack one of the three factors. Each microarchitecture style optimises a different one:

| Style | CPI | Tc | Notes |
|---|---|---|---|
| Single-cycle | 1 | very long | One slow clock per instruction |
| Multicycle | 3-5 | short | Fast clock but many cycles per instruction |
| Pipelined | ≈ 1 | short | Throughput of the fast clock + CPI of single-cycle |

3.2 The critical path

Definition
Critical path: the longest combinational delay from one clocked element to the next. Tc must be ≥ this delay or the result won't latch correctly.

For single-cycle RISC-V, the critical path runs through lw (uses every block in series):

SC critical path: T_c ≥ t_PC + t_IMEM + t_RFread + t_ALU + t_DMEM + t_RFwrite

That's the sum of every block delay along the lw path. The whole machine pays for this worst case every cycle, even for an add that doesn't touch DMEM. This is why single-cycle is slow.

3.3 Worked example: single-cycle (Sarah Harris Ch. 7)

Given delays: IMEM = 250 ps, RegFile read = 150 ps, ALU = 200 ps, DMEM = 250 ps. RegFile write is taken as negligible, and the PC term is omitted.

T_c   ≥ 250 + 150 + 200 + 250 = 850 ps  (textbook values typically land between 750 and 1000 ps)
CPI   = 1
For 100 billion instructions, taking T_c = 750 ps (the value behind the 75 s baseline used in §7):
  Time = 10^11 × 1 × 750 ps = 75 seconds
With the 850 ps sum above, the same formula gives 85 seconds.

This 75 s is the benchmark we'll beat with pipelining (43 s, see §7).
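
The arithmetic can be checked with a few lines of Python. Note that the listed delays sum to 850 ps while the 75 s figure corresponds to a 750 ps clock, so this sketch prints both; the function name `execution_time_s` is an illustrative choice.

```python
def execution_time_s(n_instructions, cpi, tc_ps):
    """Execution time = #instructions x CPI x Tc, with Tc given in picoseconds."""
    return n_instructions * cpi * tc_ps * 1e-12


# Block delays from the worked example, in picoseconds.
delays_ps = {"IMEM": 250, "RF read": 150, "ALU": 200, "DMEM": 250}
tc_from_delays = sum(delays_ps.values())

N = 10**11  # 100 billion instructions
print(tc_from_delays)                                     # 850
print(round(execution_time_s(N, 1, 750), 1))              # 75.0
print(round(execution_time_s(N, 1, tc_from_delays), 1))   # 85.0
```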

4. From single-cycle to multicycle

Single-cycle's killer flaw: every cycle is sized for the worst instruction. Multicycle fixes Tc by giving each step its own short clock — but at the cost of multiple clocks per instruction.

4.1 The motivation: why split the cycle

In single-cycle, an add uses IMEM + RF + ALU + RF write (~600 ps) but the clock is sized for lw (~850 ps). The add wastes ~250 ps of every clock doing nothing.

Insight If we could let each task finish as soon as it's done — and only pay for the tasks an instruction actually needs — we'd get a much shorter average clock. That's the multicycle idea.

4.2 The multicycle datapath

Same blocks as single-cycle (PC, IMEM, RegFile, ALU, DMEM), but now one block is active per clock. Between clocks, an internal register holds intermediate values (the ALU output, the fetched instruction, the memory data, etc.).

| State | What happens | Active block |
|---|---|---|
| S1: Fetch | IR ← IMEM[PC]; PC ← PC+4 | IMEM |
| S2: Decode | Read rs1, rs2; sign-extend imm | RegFile |
| S3: Execute | ALU op or address compute | ALU |
| S4: Memory | DMEM read/write (lw/sw only) | DMEM |
| S5: Writeback | RegFile ← result | RegFile |

4.3 Per-instruction state count

Not every instruction needs all 5 states — that's the whole point. The control unit is a small FSM that walks through only the states each instruction needs:

| Instruction | States used | Cycles |
|---|---|---|
| R-type (add, etc.) | Fetch · Decode · Execute · Writeback | 4 |
| lw | Fetch · Decode · Execute · Memory · Writeback | 5 |
| sw | Fetch · Decode · Execute · Memory | 4 |
| beq | Fetch · Decode · Execute (compare + PC update) | 3 |
| jal | Fetch · Decode · Execute · Writeback | 4 |

[Figure: multicycle FSM. Fetch (S1) → Decode (S2) → Execute (S3) → Memory (S4, lw/sw only) → WB (S5). After WB (or after Memory for sw, or after Execute for beq), control returns to Fetch. Instructions skip the Memory state if they don't need DMEM.]
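
The FSM walk can be sketched as a table lookup plus a loop. This is an illustrative model of the state sequences listed above, not a description of the control unit's actual encoding; `STATES` and `fsm_trace` are hypothetical names.

```python
# State sequence per instruction kind, taken from the table above.
STATES = {
    "r-type": ["Fetch", "Decode", "Execute", "Writeback"],
    "lw":     ["Fetch", "Decode", "Execute", "Memory", "Writeback"],
    "sw":     ["Fetch", "Decode", "Execute", "Memory"],
    "beq":    ["Fetch", "Decode", "Execute"],
    "jal":    ["Fetch", "Decode", "Execute", "Writeback"],
}


def fsm_trace(program):
    """Return (cycle, instruction kind, state) for each clock, one state per clock."""
    trace = []
    for kind in program:
        for state in STATES[kind]:
            trace.append((len(trace) + 1, kind, state))
    return trace


trace = fsm_trace(["lw", "beq", "r-type"])
print(len(trace))    # 12 cycles: 5 + 3 + 4
print(trace[5])      # (6, 'beq', 'Fetch'): beq starts right after lw's Writeback
```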

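4.4 Performance: CPI rises, Tc falls

Multicycle Tc
T_c ≥ t_longest single block ≈ max(t_IMEM, t_ALU, t_DMEM, …)
At best about one fifth of single-cycle's Tc (if the five steps were perfectly balanced); with the delays used here, 250 ps against 850 ps, it is closer to 30%.
Multicycle CPI
CPI_multi = Σ_i frequency_i × states_i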

4.4.1 Worked example

Using the SPECINT2000 mix from §7.3: 25% lw (5 states), 10% sw (4), 13% branch (3), 52% R-type (4).

CPI_multi = 0.25·5 + 0.10·4 + 0.13·3 + 0.52·4
          = 1.25 + 0.40 + 0.39 + 2.08
          = 4.12

With Tc = 250 ps and 10^11 instructions:

Time_multi = 10^11 × 4.12 × 250 ps = 103 seconds

(Sarah Harris's textbook gets 155 s using slightly different delays — same shape.)
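
The same weighted sum in Python, as a quick check of the numbers above (the dictionary names are illustrative):

```python
# Instruction mix and states per instruction, from the worked example.
mix = {"lw": 0.25, "sw": 0.10, "branch": 0.13, "r-type": 0.52}
states = {"lw": 5, "sw": 4, "branch": 3, "r-type": 4}

# CPI_multi = sum over types of frequency x states.
cpi_multi = sum(mix[k] * states[k] for k in mix)

# Execution time with Tc = 250 ps and 10^11 instructions.
time_multi_s = 10**11 * cpi_multi * 250e-12

print(round(cpi_multi, 2))      # 4.12
print(round(time_multi_s, 1))   # 103.0
```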

4.5 Did we win or lose?

| | Single-cycle | Multicycle |
|---|---|---|
| CPI | 1 | ≈ 4 |
| Tc | ~850 ps | ~250 ps |
| Time per inst | 850 ps | ≈ 1000 ps |

Verdict: Multicycle is sometimes slower overall. The savings on simple instructions don't outweigh the FSM overhead.
Why we still study it
Multicycle teaches us how to chop an instruction into stages. Pipelining keeps those stages but overlaps them across instructions — that's where the speedup actually comes from. Multicycle alone = the wind-up. Pipelining = the pitch.

5. The pipelined datapath

Take the single-cycle datapath and split it into 5 sub-datapaths separated by clocked registers. Each sub-datapath is a stage. Once the pipeline is full, 5 different instructions are in flight at any moment.

5.1 The five stages

| Stage | Work |
|---|---|
| IF | Fetch (IMEM, PC+4) |
| ID | Decode + RF read |
| EX | ALU |
| MEM | DMEM access (lw/sw) |
| WB | RF write |

These map 1-to-1 onto the five universal tasks from §2.1.

5.2 Pipeline registers

Between every pair of stages we drop a clocked register: IF/ID, ID/EX, EX/MEM, MEM/WB. They carry forward whatever state the next stage needs (rs values, immediate, control signals, the destination register number, the ALU result, etc.).

Why they exist Without these latches, instruction A in EX would see B's wires flipping in IF. The latches freeze each stage's inputs at the clock edge so all 5 stages can operate in parallel without colliding.

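5.3 Steady-state throughput & speedup

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| I1 | IF | ID | EX | ME | WB | | | | |
| I2 | | IF | ID | EX | ME | WB | | | |
| I3 | | | IF | ID | EX | ME | WB | | |
| I4 | | | | IF | ID | EX | ME | WB | |
| I5 | | | | | IF | ID | EX | ME | WB |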
Key takeaway
Pipelining keeps the single-cycle CPI of 1 and the multicycle short Tc. Best of both worlds — modulo hazards, which we tackle next.
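
A hazard-free pipeline with k stages finishes n instructions in k + n − 1 cycles, which is why five instructions occupy cycles 1 to 9 in the diagram. The sketch below reproduces that schedule and the resulting speedup over running the stages back to back; it assumes an ideal pipeline with no stalls, and the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "ME", "WB"]


def schedule(n):
    """Map instruction index -> {cycle: stage} for an ideal, hazard-free pipeline."""
    return {i: {i + s + 1: STAGES[s] for s in range(len(STAGES))} for i in range(n)}


def total_cycles(n, k=len(STAGES)):
    """k cycles to fill the pipeline, then one instruction completes per cycle."""
    return k + n - 1


def speedup_vs_unpipelined(n, k=len(STAGES)):
    """Cycles without overlap (n * k) divided by cycles with overlap."""
    return (n * k) / total_cycles(n, k)


print(total_cycles(5))                           # 9
print(schedule(5)[4])                            # I5 occupies cycles 5..9
print(round(speedup_vs_unpipelined(10**6), 3))   # approaches 5 for long programs
```

For short programs the fill time dominates (5 instructions give 25/9 ≈ 2.8×); the 5× figure is only reached in steady state.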

6. Pipeline hazards

The price of overlap. Three families of hazard prevent the ideal CPI = 1.

6.1 Data hazards: RAW, WAR, WAW

| Type | Pattern | True dependency? | Visible in 5-stage in-order? |
|---|---|---|---|
| RAW (Read-After-Write) | I2 reads what I1 just wrote | yes (real) | yes (common) |
| WAR (anti-dependency) | I2 writes a reg I1 still reads | no (naming) | no (in-order) |
| WAW (output dep.) | I2 writes same reg as I1 | no (naming) | no (in-order) |

WAR and WAW only matter in out-of-order pipelines (see Advanced µArch). RAW is the one we deal with in the 5-stage in-order pipe.

6.2 Forwarding (bypassing)

Most RAW hazards can be solved without stalling. The ALU result is already computed at the end of EX — feed it directly to the next instruction's ALU input, instead of waiting for it to land in the RegFile two cycles later.
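
One common way to express the forwarding decision is a priority check per ALU operand: prefer the newest matching result. The sketch below assumes each pipeline register exposes a `reg_write` flag and a destination register `rd`; those field names and the function `forward_select` are hypothetical, chosen only for illustration.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where one ALU operand should come from.

    ex_mem / mem_wb: dicts with 'reg_write' (bool) and 'rd' (int) describing
    the instructions currently one and two stages ahead of EX.
    """
    if src_reg == 0:
        return "RegFile"     # x0 is always 0, never forwarded
    if ex_mem["reg_write"] and ex_mem["rd"] == src_reg:
        return "EX/MEM"      # newest value: result computed last cycle
    if mem_wb["reg_write"] and mem_wb["rd"] == src_reg:
        return "MEM/WB"      # older value, about to be written back
    return "RegFile"         # no hazard: the register file is up to date


# add x3, x1, x2 followed immediately by sub x5, x3, x4:
prev = {"reg_write": True, "rd": 3}
older = {"reg_write": False, "rd": 0}
print(forward_select(3, prev, older))   # EX/MEM
print(forward_select(4, prev, older))   # RegFile
```

Checking EX/MEM before MEM/WB matters when both hold the same destination register: the more recent instruction's result is the one the program order requires.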

6.3 The load-use hazard (one unavoidable stall)

Definition
Load-use hazard: a load is immediately followed by an instruction that consumes its result. Even with forwarding, one bubble is required — because the load doesn't produce its data until after MEM, which is one cycle too late for the dependent EX.
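
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| lw x1 | IF | ID | EX | ME | WB | | | |
| add x3,x1,x2 | | IF | ID | ** | EX | ME | WB | |

(** = bubble: the add waits one cycle for the loaded value.)

Load CPI: CPI_lw = P(no stall)·1 + P(stall)·2
e.g. 40% of loads stall ⇒ CPI_lw = 0.6·1 + 0.4·2 = 1.4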
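
A sketch of the stall condition and the resulting load CPI. The detection rule assumes the load sits in EX while its consumer is in ID; the field names (`mem_read`, `rd`, `rs1`, `rs2`) and both function names are hypothetical.

```python
def load_use_stall(id_ex, if_id):
    """True when the instruction in ID needs the result of a load now in EX."""
    return (id_ex["mem_read"]
            and id_ex["rd"] != 0
            and id_ex["rd"] in (if_id["rs1"], if_id["rs2"]))


def cpi_load(p_stall):
    """CPI_lw = P(no stall) * 1 + P(stall) * 2."""
    return (1 - p_stall) * 1 + p_stall * 2


lw_x1 = {"mem_read": True, "rd": 1}       # lw x1, ...
add_x3 = {"rs1": 1, "rs2": 2}             # add x3, x1, x2
print(load_use_stall(lw_x1, add_x3))      # True: insert one bubble
print(round(cpi_load(0.4), 2))            # 1.4
```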

6.4 Control hazards: branch flushes

A branch resolves in EX. By that point, two instructions are already in IF and ID. If the branch is taken (or mispredicted), both must be flushed (turned into NOPs) and the correct path re-fetched.

Branch CPI: CPI_br = P(correct)·1 + P(mispredict)·3
Misprediction cost = 1 (the branch itself) + 2 flushed instructions = 3 cycles.
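
The branch formula can be checked two ways: in closed form, and by counting cycles over a list of prediction outcomes. Both functions below are illustrative sketches that assume the 2-instruction flush penalty stated above.

```python
def cpi_branch(p_mispredict, penalty=2):
    """CPI_br = P(correct) * 1 + P(mispredict) * (1 + penalty)."""
    return (1 - p_mispredict) * 1 + p_mispredict * (1 + penalty)


def branch_cycles(outcomes, penalty=2):
    """Total cycles for a sequence of branches; True means predicted correctly."""
    return sum(1 if correct else 1 + penalty for correct in outcomes)


print(cpi_branch(0.5))                             # 2.0
print(branch_cycles([True, False, True, False]))   # 8 cycles for 4 branches
```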

Branch prediction (covered in Advanced µArch) is how we get P(correct) close to 100%.

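6.5 Structural hazards

Two stages want the same hardware in the same cycle. The classic example: IF and MEM both touching memory. The datapath in this chapter sidesteps that case by keeping instruction memory (IMEM) and data memory (DMEM) separate.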

7. Pipelined performance math

Putting per-type CPIs and instruction mix together to get a real-program CPI.

7.1 Per-type CPI

| Type | Best-case CPI | Stall scenario | Stall cost |
|---|---|---|---|
| R-type, store | 1 | | |
| Load | 1 | Load-use | +1 (bubble) |
| Branch | 1 | Misprediction | +2 (flushes) |
| Jump (jal) | 1 | Always: target known too late | +2 typically |

7.2 Weighted-average CPI

Formula: CPI_avg = Σ_i frequency_i × CPI_i

The recipe:

  1. For each instruction type, find its share of the program (e.g. 25% loads).
  2. For each type, compute its CPI: P(no stall)·1 + P(stall)·(1 + stallCost), where stallCost is the extra-cycle figure from §7.1.
  3. Sum the products. Done.

7.3 SPECINT2000 worked example

Given: 25% loads, 10% stores, 13% branches, 52% R-type. 40% of loads cause a load-use stall. 50% of branches are mispredicted.

| Type | Freq | CPI calculation | CPI |
|---|---|---|---|
| Loads | 0.25 | 0.6·1 + 0.4·2 = 1.4 | 1.4 |
| Stores | 0.10 | 1 | 1.0 |
| Branches | 0.13 | 0.5·1 + 0.5·3 = 2.0 | 2.0 |
| R-type | 0.52 | 1 | 1.0 |

CPI_avg = 0.25·1.4 + 0.10·1.0 + 0.13·2.0 + 0.52·1.0
        = 0.35 + 0.10 + 0.26 + 0.52
        = 1.23
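
The recipe applied to this mix, as a short Python check. Stall costs are the extra cycles from §7.1, so each type's CPI is 1 plus its expected stall cycles; the dictionary names are illustrative.

```python
# Instruction mix, stall probability and stall cost (extra cycles) per type.
mix = {"load": 0.25, "store": 0.10, "branch": 0.13, "r-type": 0.52}
p_stall = {"load": 0.40, "store": 0.0, "branch": 0.50, "r-type": 0.0}
stall_cost = {"load": 1, "store": 0, "branch": 2, "r-type": 0}

# Per-type CPI: P(no stall)*1 + P(stall)*(1 + cost) == 1 + P(stall)*cost.
cpi_type = {k: 1 + p_stall[k] * stall_cost[k] for k in mix}

# Weighted average over the mix.
cpi_avg = sum(mix[k] * cpi_type[k] for k in mix)

print(cpi_type["load"], cpi_type["branch"])   # 1.4 2.0
print(round(cpi_avg, 2))                      # 1.23
```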

7.4 Speedup recap

100 billion instructions, Tc = 350 ps:

Time_pipelined = 10^11 × 1.23 × 350 ps = 43 seconds

| Design | Time | Speedup |
|---|---|---|
| Single-cycle | 75 s | (baseline) |
| Multicycle | 155 s | 0.5× |
| Pipelined | 43 s | 1.7× |

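
The speedup column is each design's time divided into the single-cycle baseline. A short sketch using the times quoted in this chapter (the pipelined time is also recomputed from CPI = 1.23 and Tc = 350 ps as a cross-check):

```python
# Execution times in seconds, as quoted in this chapter.
times_s = {"single-cycle": 75, "multicycle": 155, "pipelined": 43}

baseline = times_s["single-cycle"]
speedup = {name: baseline / t for name, t in times_s.items()}

for name, t in times_s.items():
    print(f"{name}: {t} s, {speedup[name]:.2f}x")
# single-cycle: 75 s, 1.00x
# multicycle: 155 s, 0.48x
# pipelined: 43 s, 1.74x

# Cross-check of the pipelined time: 10^11 instructions x 1.23 x 350 ps.
time_pipelined_s = 10**11 * 1.23 * 350e-12
print(round(time_pipelined_s, 2))   # 43.05
```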
Chapter summary
  1. RISC-V code lives in 6 instruction formats (R, I, S, B, U, J).
  2. Every instruction does five tasks: Fetch, Decode, Execute, Memory, Writeback.
  3. Single-cycle = all five in one big clock (CPI=1, Tc huge).
  4. Multicycle = one task per clock (Tc small, CPI≈4).
  5. Pipelined = all five tasks in parallel, each working on a different instruction (CPI≈1, Tc small).
  6. Hazards push CPI above 1: load-use (+1), branch mispredict (+2). Forwarding kills most RAW penalties.
  7. Execution time = #Inst × CPI × Tc. Optimise any factor to go faster.
Source RiscV_Sarah_Harris.pdf Ch. 7. Slide deck mirrors at Microarchitecture/Ch7_MicroArch.pdf and Branch Prediction.pptx.pdf (examples 7.4–7.9).
