Chapter — Vector processing & RVV
A different way to get more than one operation per cycle: apply the same operation to many data items with a single instruction. The RISC-V Vector extension (RVV) is the modern, scalable take.
1. Three levels of parallelism
| Type | Example | Granularity |
|---|---|---|
| Instruction-Level | Pipelining, superscalar, OoO | Within one instruction stream |
| Thread-Level | Multicore, SMT | Multiple instruction streams |
| Data-Level | SIMD / Vector / GPU | One instruction, many data elements |
Data-level parallelism
Run the same operation on many data items at once. Examples: pixel-wise image filters, matrix add, dot products, AI tensor ops.
2. Scalar vs vector: the punchline
Scalar RISC-V (10-element vector add)
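# a0 = &x, a1 = &y, a2 = &z (float arrays)
li t0, 0                # i = 0
li t1, 10               # n = 10
loop:
bge t0, t1, done        # exit when i >= n
slli t2, t0, 2          # byte offset = i * 4
add t3, a0, t2          # t3 = &x[i]
flw fa0, 0(t3)          # fa0 = x[i]
add t3, a1, t2          # t3 = &y[i]
flw fa1, 0(t3)          # fa1 = y[i]
fadd.s fa2, fa0, fa1    # fa2 = x[i] + y[i]
add t3, a2, t2          # t3 = &z[i]
fsw fa2, 0(t3)          # z[i] = fa2
addi t0, t0, 1          # i++
j loop
done:
10 iterations × 11 instructions each ≈ 110 dynamic instructions (plus 2 for setup and 1 for the final exit check).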
RVV equivalent
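vsetvli t0, a3, e32, m1, ta, ma   # a3 = element count (10); t0 = VL
vle32.v v0, (a0)                  # load VL elements of x
vle32.v v1, (a1)                  # load VL elements of y
vfadd.vv v2, v0, v1               # element-wise float add
vse32.v v2, (a2)                  # store VL results to z
5 instructions, one pass through all 10 elements, provided one vector register can hold them (VLMAX ≥ 10, see section 4.2). On narrower hardware the same instructions are wrapped in the strip-mining loop of section 5. Roughly 20× fewer dynamic instructions.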
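The instruction counts above can be checked with a small model. The sketch below is Python rather than assembly and only counts instructions; the function name is made up for illustration, and the vector count assumes all 10 elements fit in a single pass.

```python
def scalar_dynamic_instructions(n):
    """Dynamic instruction count of the scalar loop above for n elements."""
    setup = 2            # li t0, 0 and li t1, 10
    per_iteration = 11   # bge, slli, 3x add, 2x flw, fadd.s, fsw, addi, j
    exit_check = 1       # the final bge that branches to done
    return setup + per_iteration * n + exit_check

VECTOR_SINGLE_PASS = 5   # vsetvli, 2x vle32.v, vfadd.vv, vse32.v

scalar = scalar_dynamic_instructions(10)
print(scalar, VECTOR_SINGLE_PASS, round(scalar / VECTOR_SINGLE_PASS, 1))
# prints: 113 5 22.6
```

The exact ratio depends on how the loop is written, but the order of magnitude is what matters: the vector version removes almost all of the per-element control overhead.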
Analogy
Scalar = pairing socks one at a time. Vector = a sock-pairing machine that swallows 8 pairs and spits them matched in one motion. Same total work, far less control overhead.
3. Traditional SIMD and its limits
| Family | Width | Vendor |
|---|---|---|
| MMX, SSE, SSE2…SSE4.2 | 64 → 128 bits | Intel/AMD |
| AVX, AVX2 | 256 bits | Intel/AMD |
| AVX-512 | 512 bits | Intel/AMD |
| NEON | 128 bits | ARM (mobile) |
| SVE / SVE2 | Scalable | ARM (server) |
Pain point
Traditional SIMD bakes the vector width into the opcode. A binary compiled for SSE (128 b) keeps issuing 128-bit operations even on a CPU with AVX-512 (512 b); using the wider registers means recompiling or shipping a second code path. You also have to write a separate "tail" loop for the leftover elements that don't fill a full vector.
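To make the tail-loop problem concrete, here is a minimal Python sketch of how fixed-width SIMD code is typically structured. WIDTH stands in for the lane count baked into the opcode (assumed here: 4 lanes of 32-bit elements in a 128-bit register), the function name is hypothetical, and list slices stand in for vector registers.

```python
WIDTH = 4  # lanes per vector op, fixed at compile time (assumption: 128 b / 32 b)

def fixed_width_add(x, y):
    """Element-wise add using full WIDTH-lane passes plus a scalar tail loop."""
    n = len(x)
    z = [0] * n
    main_end = n - n % WIDTH          # end of the part covered by full vectors
    # Main loop: each pass models one WIDTH-lane SIMD add.
    for i in range(0, main_end, WIDTH):
        z[i:i + WIDTH] = [a + b for a, b in zip(x[i:i + WIDTH], y[i:i + WIDTH])]
    # Tail loop: leftover elements are handled one at a time.
    for i in range(main_end, n):
        z[i] = x[i] + y[i]
    return z

print(fixed_width_add(list(range(10)), [10] * 10))
```

With 10 elements the main loop runs twice (8 elements) and the tail loop handles the last 2. Changing WIDTH means rebuilding the code, which is the portability problem RVV is designed to avoid.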
4. RVV programming model
4.1 VLEN, SEW, LMUL
Hardware fixed
VLEN: width of each vector register in bits (128, 256, 512, …). Set by chip designer.
Software-set per pass
SEW (Selected Element Width): bits per element (8, 16, 32, or 64).
LMUL (Length Multiplier): 1, 2, 4, 8 (or ½, ¼, ⅛). Values above 1 group multiple v-regs into one larger logical register; fractional values use only part of a single register.
4.2 AVL, VLMAX, VL
The runtime triad
AVL (Application Vector Length): how many elements the program wants to process.
VLMAX: hardware capacity = LMUL × VLEN / SEW.
VL: what the hardware will process this iteration = min(AVL, VLMAX).
4.2.1 Numeric example
VLEN = 256 bits, SEW = 32 bits, LMUL = 1, AVL = 5
VLMAX = 1 × 256 / 32 = 8 elements
VL = min(5, 8) = 5
[ e0 e1 e2 e3 e4 · · · ] ← lanes 5 to 7 are the 3 tail lanes
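The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of the two formulas as stated in this chapter; the function names are illustrative, and Fraction is used only so that fractional LMUL values such as ½ stay exact.

```python
from fractions import Fraction

def vlmax(vlen_bits, sew_bits, lmul=1):
    """VLMAX = LMUL * VLEN / SEW, where LMUL may be fractional."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)

def vl(avl, vlen_bits, sew_bits, lmul=1):
    """VL = min(AVL, VLMAX), the rule used in this chapter."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))

print(vlmax(256, 32, 1), vl(5, 256, 32, 1))   # 8 5  (the example above)
print(vlmax(256, 32, Fraction(1, 2)))         # 4    (half a register)
print(vlmax(128, 8, 8))                       # 128  (8 registers of bytes)
```

The last two lines show how SEW and LMUL trade element width against element count on the same hardware.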
4.3 The vsetvli instruction
vsetvli rd, rs1, vtypei
        │   │    └─ encoding of SEW, LMUL, tail/mask policy
        │   └─ AVL (elements remaining)
        └─ destination: VL is written here (use it as the loop step)
One instruction configures everything: element width, LMUL, tail/mask policy, and returns the runtime VL the hardware will use.
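As an illustration of what "encoding" means here, the sketch below packs the four settings into the vtypei immediate. The bit layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7) is taken from the RVV 1.0 specification rather than from these notes, so check it against the spec before relying on it; the helper name encode_vtypei is hypothetical.

```python
# Field encodings per the RVV 1.0 vtype layout (assumption: not from these notes).
VSEW = {8: 0b000, 16: 0b001, 32: 0b010, 64: 0b011}
VLMUL = {"m1": 0b000, "m2": 0b001, "m4": 0b010, "m8": 0b011,
         "mf8": 0b101, "mf4": 0b110, "mf2": 0b111}

def encode_vtypei(sew, lmul, ta, ma):
    """Pack SEW, LMUL and the tail/mask policy bits into one integer."""
    return (int(ma) << 7) | (int(ta) << 6) | (VSEW[sew] << 3) | VLMUL[lmul]

# The setting used throughout this chapter: e32, m1, ta, ma
print(hex(encode_vtypei(32, "m1", ta=True, ma=True)))  # 0xd0
```

In practice the assembler does this packing; you write e32, m1, ta, ma and never touch the bits directly.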
4.4 Tail & mask policies
| Policy | Behaviour for tail lanes (past VL) and masked-off lanes |
|---|---|
| Undisturbed (tu, mu) | Old values preserved. |
| Agnostic (ta, ma) | Hardware may either keep the old value or overwrite it with all 1s; do not rely on the value. |
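The difference between the two policies is easiest to see on the destination register. The following Python sketch models only the tail policy (masking is left out) for a 32-bit add; the function vadd and its fill_ones switch are illustrative inventions, with fill_ones selecting which of the two permitted agnostic outcomes the modeled hardware happens to produce.

```python
ALL_ONES = 0xFFFFFFFF  # a 32-bit element with every bit set

def vadd(dest, a, b, vl, tail_agnostic, fill_ones=False):
    """Model of a 32-bit vector add into dest (one list slot per lane)."""
    out = list(dest)
    for i in range(len(dest)):
        if i < vl:
            out[i] = (a[i] + b[i]) & ALL_ONES   # active lane: normal result
        elif tail_agnostic and fill_ones:
            out[i] = ALL_ONES                   # agnostic: all 1s is allowed
        # otherwise the tail lane keeps its old value
    return out

old = [9] * 8                      # previous contents of the destination
a, b = list(range(1, 9)), [10] * 8

print(vadd(old, a, b, vl=5, tail_agnostic=False))
# [11, 12, 13, 14, 15, 9, 9, 9]
print(vadd(old, a, b, vl=5, tail_agnostic=True, fill_ones=True)[5:])
# [4294967295, 4294967295, 4294967295]
```

Code that reads tail lanes after an agnostic operation is therefore only correct by accident; pick undisturbed when the old values matter.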
5. Strip-mining loop walkthrough
Add two arrays element-wise. a0 = count, a1/a2 = x/y bases, a3 = z base.
vvaddint32:
vsetvli t0, a0, e32, m1, ta, ma # VL = min(a0, VLMAX); t0 = VL
vle32.v v0, (a1) # load VL elements of x
sub a0, a0, t0 # remaining -= VL
slli t0, t0, 2 # VL words → bytes
add a1, a1, t0
vle32.v v1, (a2) # load VL elements of y
add a2, a2, t0
vadd.vv v2, v0, v1 # element-wise add
vse32.v v2, (a3) # store VL results
add a3, a3, t0
bnez a0, vvaddint32 # loop while remaining > 0
ret
Iterations with AVL = 6, VLMAX = 4
| Iter | vsetvli inputs | VL | elements processed | a0 after |
|---|---|---|---|---|
| 1 | AVL=6 | 4 | x[0..3], y[0..3] → z[0..3] | 2 |
| 2 | AVL=2 | 2 | x[4..5], y[4..5] → z[4..5] | 0 |
No tail loop, no recompile if VLMAX changes — the hardware just handles fewer/more elements per pass.
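The loop's control flow can be traced in Python. This is a behavioral sketch, not a replacement for the assembly: an index stands in for the three pointers, list slices stand in for vector registers, 32-bit wraparound is ignored, and vlmax is passed in as a parameter to play the role of the hardware.

```python
def vvaddint32(x, y, vlmax):
    """Python model of the strip-mining loop; returns (z, VL used on each pass)."""
    remaining = len(x)            # a0
    base = 0                      # stands in for the a1/a2/a3 pointers
    z = [0] * len(x)
    vls = []
    while True:
        vl = min(remaining, vlmax)                           # vsetvli
        xs = x[base:base + vl]                               # vle32.v v0
        ys = y[base:base + vl]                               # vle32.v v1
        z[base:base + vl] = [p + q for p, q in zip(xs, ys)]  # vadd.vv + vse32.v
        vls.append(vl)
        remaining -= vl                                      # sub a0, a0, t0
        base += vl                                           # pointer bumps
        if remaining == 0:                                   # bnez a0 falls through
            break
    return z, vls

print(vvaddint32([1, 2, 3, 4, 5, 6], [10] * 6, vlmax=4))
# ([11, 12, 13, 14, 15, 16], [4, 2])
print(vvaddint32([1, 2, 3, 4, 5, 6], [10] * 6, vlmax=8)[1])
# [6]
```

The first call reproduces the two rows of the table; the second shows the same code finishing in one pass when the hardware is wider.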
6. Why RVV beats SSE/AVX
| Feature | Traditional SIMD | RVV |
|---|---|---|
| Vector width | Fixed (opcode-encoded) | Variable (runtime) |
| Portability | Recompile per width | Same binary scales |
| Tail handling | Manual tail loop | Automatic via VL |
| Scalability | Limited (max width) | 128 b → 2048 b without code change |
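A rough way to quantify the scalability row: count how many passes the strip-mining loop needs for a fixed problem size as VLEN grows. The sketch below assumes 32-bit elements, an integer LMUL, and the VL = min(AVL, VLMAX) rule from section 4.2; the function name passes and the problem size of 1000 elements are chosen only for illustration.

```python
import math

def passes(n, vlen_bits, sew_bits=32, lmul=1):
    """Loop passes needed for n elements when each pass handles up to VLMAX."""
    vlmax = lmul * vlen_bits // sew_bits
    return math.ceil(n / vlmax)

for vlen in (128, 256, 512, 1024, 2048):
    print(vlen, passes(1000, vlen))
# 128 250
# 256 125
# 512 63
# 1024 32
# 2048 16
```

The instruction stream is identical in every row; only the VL returned by vsetvli changes.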
One-line summary
RVV makes vector length a runtime variable. The same RVV binary runs on a 128-bit phone CPU and a 2048-bit HPC chip, automatically using whatever width the hardware has.
Source
Vector Processing/Vector_concise.pdf — programming model on pp. 9-13, full strip-mining example pp. 17-24. Vector_longnotes.pdf for deep dive.