CALPU

Prelim

Summary of required specs and constraints

Project Overview 1

NCO+DAC and BPM Detector circuits will not be used.

Architecture

  • Implement RV32I base integer instruction set
  • Implement CSRRW, CSRRWI and 1 CSR register (tohost=0x51E)
  • Exactly 3 stages, one cycle per stage
  • No stalling for data hazards (must forward)
  • Register file with 2 async read ports, 1 sync write port
  • $F_{max} \geq 50$ MHz

RAMs

  • All synchronous read/write
  • Bios RAM with 12-bit address (16 KiB), for both data and hex loaded bios
  • IMEM with 14-bit address (16 KiB), needs to be writable by BIOS
  • DMEM with 14-bit address (16 KiB)
  • Memory-mapped IO for status and uart
  • UART for serial communication

Running

  • Xilinx PYNQ-Z1 FPGA (has 125 MHz clock)
  • Communicate with BIOS via serial port (CPU <-> UART <-> serial)
  • Load hex programs via BIOS into IMEM and jal to run

1. Design & Block Diagram

Designed the CPU layout, using the standard 5-stage pipeline as reference.

Pipeline Stages

The main factor in stage design is the synchronous memory: reads and writes only take effect in the next cycle (i.e. the next stage). These synchronous elements naturally sit at stage borders:

  • PC
  • imem, dmem
  • regfile write

There are also dependencies between the stages and elements. Considering the synchronous elements, the dependencies, and the standard 5-stage pipeline, the most straightforward (though not most efficient) 3-stage pipeline design would be:

  1. IF: Instruction Fetch
  2. DXM: Decode, Execute, Memory access
  3. WB: WriteBack

We can already tell that there will be an absurdly long critical path in the DXM stage, especially with forwarding. Some ideas for later optimization:

  • Separate target address adder parallel to ALU
    • Memory load/stores become parallel to ALU, cutting critical path
  • Move M to WB, becoming MWB
    • Need a forwarding path from MWB to EX, preceded by load extend unit and writeback mux
    • The branch compare unit can no longer live in D, so branch resolution moves to EX and may benefit from branch prediction
    • Will have to handle hazards differently
  • Combine the above two, with IF -> DX -> MWB and dedicated target address adder
  • (Not sure if this works) Make IF PC combinational, after ALU and align instruction memory with DXM-WB
    • IF still occupies a stage, but 1 cycle branch penalty is gone.
    • Seems that it shouldn’t increase critical path but need to confirm with implementation. However, I would expect worse fanout.

(The ideas may or may not be compatible with each other.)

Eventually I decided to start with implementing IF -> DXM -> WB as a baseline and optimize/restructure later.

Block Diagram

block_diagram_v1

MMIO is not shown, but it sits alongside the memory blocks after the ALU.

Hazards

Structural Hazards

Correct block design should let us avoid structural hazards completely.

Control Hazards (Branches)

Branch resolution happens in DXM with ALU outputting the target address.

Unconditional: (jal, jalr)

The PC register in IF already holds the previous PC + 4, so there is a 1-cycle penalty to kill the instruction entering DXM.

Conditional

We naively predict not-taken (always fetch PC + 4), and a mispredict costs 1 cycle, similar to jumps.

Branch prediction with the current design wouldn't make sense because of the fixed (max and min) 1-cycle penalty for fetching anything other than PC + 4, but it may become useful for other designs later when optimizing.

Data Hazards

WAR, WAW

Not an issue for an in-order single-issue pipeline; no handling needed.

RAW, Load-use

A forwarding path from the writeback mux to the regfile read outputs allows 0-cycle delay.
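As a sanity check, the forwarding decision can be modeled in a few lines of Python (signal names here are illustrative, not taken from the actual RTL):

```python
def read_operand(rs, regfile, wb_rd, wb_val, wb_regwrite):
    """Async regfile read with a forwarding override from the WB stage.

    If the instruction currently in WB writes the register we are
    reading (and it is not x0), bypass the not-yet-written regfile
    and use the WB value directly -- zero-cycle RAW resolution.
    """
    if wb_regwrite and wb_rd == rs and rs != 0:
        return wb_val
    return regfile[rs]

regs = [0] * 32
regs[5] = 111
# WB is about to write 999 into x5; the reader sees it immediately.
assert read_operand(5, regs, wb_rd=5, wb_val=999, wb_regwrite=True) == 999
# x0 is never forwarded.
assert read_operand(0, regs, wb_rd=0, wb_val=999, wb_regwrite=True) == 0
```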

2. Basic Implementation

  • Working Core
  • MMIO
  • 50 MHz (trivially met: synthesis optimized nearly everything away, since the top-level outputs depend on the still-unimplemented MMIO UART)
  • Pass all tests
    • unit tests
    • cpu_tb.v
    • assembly tests
    • isa tests
    • echo_tb.v
    • uart_parse_tb.v
    • bios_tb.v
  • FPGA
  • mmult benchmark

Datapath

I start by implementing self-contained combinational-logic elements.

ALU

While testing, I ran into a bug where AUIPC, LUI, and CSR instructions got unknown values from the ALU output. U-type and CSR instructions don't have full register address fields, so the decoded addresses accessed uninitialized registers. Allowing the ALU to pass through (mux between) A or B fixed that.
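As an illustration of the fix (the operation names here are made up for this sketch, not the actual ALU opcode encoding):

```python
def alu(op, a, b):
    """Tiny ALU sketch with explicit passthrough operations.

    "pass_b" covers LUI-style cases where only one operand is
    meaningful; "pass_a" is the symmetric case. Without these, the
    unused operand (read via a garbage register index) would
    propagate unknown values to the output.
    """
    ops = {
        "add":    lambda: (a + b) & 0xFFFFFFFF,
        "pass_a": lambda: a,
        "pass_b": lambda: b,
    }
    return ops[op]()

# LUI-style: only the immediate (B) matters; A is garbage.
assert alu("pass_b", 0xDEADBEEF, 0x12345000) == 0x12345000
```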

ImmGen

Pretty straightforward; just adhere to the spec.

LDX

Since RISC-V is byte-addressed but our block RAMs are word-addressed, we use the lower two bits of the ALU-calculated load address as a byte offset to select the correct portion of the 32-bit word returned by memory.
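A Python model of this selection logic might look like the following (a sketch of the behavior, not the actual Verilog):

```python
def load_extend(word, byte_offset, funct3):
    """Select and extend the addressed portion of a 32-bit memory word.

    word        -- 32-bit value returned by the word-addressed RAM
    byte_offset -- low two bits of the byte address (from the ALU)
    funct3      -- 0b000 lb, 0b001 lh, 0b010 lw, 0b100 lbu, 0b101 lhu
    """
    def sext(v, bits):
        return v - (1 << bits) if v & (1 << (bits - 1)) else v

    if funct3 in (0b000, 0b100):          # lb / lbu
        b = (word >> (8 * byte_offset)) & 0xFF
        return b if funct3 == 0b100 else sext(b, 8)
    if funct3 in (0b001, 0b101):          # lh / lhu
        h = (word >> (8 * byte_offset)) & 0xFFFF
        return h if funct3 == 0b101 else sext(h, 16)
    return word                            # lw

# Byte 1 of this word is 0xFF: lb sign-extends, lbu does not.
assert load_extend(0x1280FF34, 1, 0b000) == -1
assert load_extend(0x1280FF34, 1, 0b100) == 0xFF
assert load_extend(0x1280FF34, 0, 0b010) == 0x1280FF34
```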

Control

Control signals have different dependencies but are all generated in DXM.

Memory Map

| Address[31:28] | Address Type | Device             | Access     | Notes                        |
|----------------|--------------|--------------------|------------|------------------------------|
| 0001           | PC           | Instruction Memory | Read-only  |                              |
| 0100           | PC           | BIOS Memory        | Read-only  |                              |
| 001x           | Data         | Instruction Memory | Write-only | Only if PC[30] == 1          |
| 0100           | Data         | BIOS Memory        | Read-only  |                              |
| 00x1           | Data         | Data Memory        | Read/Write |                              |
| 1000           | Data         | MMIO               | Read/Write | Implement in next checkpoint |

1
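To make the overlapping patterns concrete, here is a hedged Python model of the data-side decode (the PC-side decode and the PC[30] write guard for IMEM are omitted for brevity; the function and device names are mine):

```python
def route_data_access(addr):
    """Decode addr[31:28] into the set of devices a *data* access hits.

    The patterns overlap deliberately: 0011 matches both 001x (IMEM
    write port) and 00x1 (DMEM), so a store to 0x3... lands in both.
    """
    top = (addr >> 28) & 0xF
    devices = set()
    if (top & 0b1110) == 0b0010:   # 001x -> IMEM (write-only)
        devices.add("imem")
    if top == 0b0100:              # 0100 -> BIOS (read-only)
        devices.add("bios")
    if (top & 0b1101) == 0b0001:   # 00x1 -> DMEM
        devices.add("dmem")
    if top == 0b1000:              # 1000 -> MMIO
        devices.add("mmio")
    return devices

assert route_data_access(0x10000000) == {"dmem"}
assert route_data_access(0x30000000) == {"imem", "dmem"}  # overlap
assert route_data_access(0x80000000) == {"mmio"}
```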

3. Fully Functional Implementation on FPGA

  • Working Core
  • MMIO
  • 50 MHz
  • Pass all tests
    • unit tests
    • cpu_tb.v
    • assembly tests
    • isa tests
    • echo_tb.v
    • uart_parse_tb.v
    • bios_tb.v
  • FPGA
  • mmult benchmark

Basically what’s left is MMIO to connect UART and testing on the FPGA board.

Memory Mapped IO (MMIO)

| Address    | Function              | Access | Data Encoding                                        |
|------------|-----------------------|--------|------------------------------------------------------|
| 0x80000000 | UART control          | Read   | {30'b0, uart_rx_data_out_valid, uart_tx_data_in_ready} |
| 0x80000004 | UART receiver data    | Read   | {24'b0, uart_rx_data_out}                            |
| 0x80000008 | UART transmitter data | Write  | {24'b0, uart_tx_data_in}                             |
| 0x80000010 | Cycle counter         | Read   | Clock cycles elapsed                                 |
| 0x80000014 | Instruction counter   | Read   | Number of instructions executed                      |
| 0x80000018 | Reset counters to 0   | Write  | N/A                                                  |

1
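A behavioral sketch of the MMIO dispatch in Python (state and field names are illustrative; only the registers in the table are modeled):

```python
class MMIO:
    """Sketch of the MMIO read/write dispatch from the table above."""
    def __init__(self):
        self.cycles = 0
        self.instret = 0
        self.rx_valid, self.tx_ready = 0, 1
        self.rx_data = 0

    def read(self, addr):
        if addr == 0x8000_0000:   # UART control: {30'b0, rx_valid, tx_ready}
            return (self.rx_valid << 1) | self.tx_ready
        if addr == 0x8000_0004:   # UART receiver data
            return self.rx_data & 0xFF
        if addr == 0x8000_0010:   # cycle counter
            return self.cycles
        if addr == 0x8000_0014:   # instruction counter
            return self.instret
        return 0

    def write(self, addr, data):
        if addr == 0x8000_0018:   # any write resets both counters
            self.cycles = self.instret = 0

m = MMIO()
m.cycles, m.instret = 100, 80
assert m.read(0x8000_0000) == 0b01   # nothing received, TX ready
m.write(0x8000_0018, 0)
assert m.read(0x8000_0010) == 0      # counters cleared
```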

Implementing this was pretty straightforward, though I did find bugs in my uart module that didn’t show up during simulation.

Testing (simulation)

Passing all tests:

Running make sim/cpu_tb.vpd: Passed
Running make sim/asm_tb.vpd: Passed
Running make isa-tests: Passed
Running make c-tests: Passed
Running make sim/echo_tb.vpd: Passed
Running make sim/uart_parse_tb.vpd: Passed
Running make sim/bios_tb.vpd: Passed
All tests passed!

Testing (FPGA)

I ran synthesis and place & route in Vivado, and programmed the bitstream onto the board. Then, using the BIOS over serial (screen), I loaded the mmult benchmark hexfile into IMEM:

151> jal 10000000
Result: 0001f800
Cycle Count: 00e501e8
Instruction Count: 00c0b1e8

CPI = 0x00e501e8 / 0x00c0b1e8 ≈ 1.19
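For reference, the CPI falls out of the two counters directly:

```python
# Counter values read back over serial, in hex.
cycles = 0x00E501E8   # cycle counter after mmult
instrs = 0x00C0B1E8   # instruction counter after mmult

cpi = cycles / instrs
print(f"CPI = {cpi:.2f}")  # -> CPI = 1.19
```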

4. Optimization

The goal is to minimize benchmark execution time (Iron Law) and maximize FOM (figure of merit):

$$ FOM = \frac{F_{max}}{CPI \cdot \sqrt{cost}} $$

To find $F_{max}$, we run Vivado synthesis & PAR repeatedly, increasing the CPU clock frequency until the design fails to meet timing. A lookup table mapping desired clock frequencies to PLL parameters was provided to us.

It's very tedious to change these manually, so I wrote a Tcl script, invoked by make targets, to specify a desired frequency and pass the parameters to Vivado. Later on I wanted more granular frequencies (e.g. 57 MHz), so I transitioned to a Python script that generates parameters adhering to the board clocking specification2.
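As a rough sketch of what such a generator has to do (this is not the actual script): 7-series clocking multiplies the input clock by M/D to get a VCO frequency, which must stay within a legal range, then divides by O per output. A brute-force integer search, with the VCO limits taken as an assumption (the exact limits depend on part and speed grade, per UG472), might look like:

```python
def mmcm_params(f_target_mhz, f_in_mhz=125.0):
    """Search integer MMCM settings for a requested CPU clock.

    Fvco = Fin * M / D must stay within the VCO range (600-1200 MHz
    assumed here; exact limits depend on part/speed grade, see UG472),
    and Fout = Fvco / O. Integer-only search; the real MMCM also
    supports fractional multiply, which this sketch ignores.
    """
    best = None
    for d in range(1, 11):
        for m in range(2, 65):
            fvco = f_in_mhz * m / d
            if not (600.0 <= fvco <= 1200.0):
                continue
            for o in range(1, 129):
                err = abs(fvco / o - f_target_mhz)
                if best is None or err < best[0]:
                    best = (err, m, d, o)
    _, m, d, o = best
    return m, d, o, f_in_mhz * m / d / o

m, d, o, f = mmcm_params(57.0)   # e.g. a granular target like 57 MHz
assert abs(f - 57.0) < 0.5
```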

We use the script fom.py to parse the post-synth/post-place&route utilization/timing reports and run benchmarks to calculate the FOM.
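The reported numbers are consistent with the FOM formula above, taking CPI as the geometric mean over the benchmarks and scaling the result by 10³ (the scaling is inferred from the script output, not from its source):

```python
from math import sqrt, prod

def fom(fmax_mhz, cycle_counts, instr_counts, cost):
    """FOM = Fmax / (CPI * sqrt(cost)), CPI = geomean of per-benchmark
    CPIs, scaled by 1000 to match the values printed by fom.py."""
    cpis = [c / i for c, i in zip(cycle_counts, instr_counts)]
    cpi = prod(cpis) ** (1 / len(cpis))          # geometric mean
    return 1000.0 * fmax_mhz / (cpi * sqrt(cost))

# Baseline design numbers from the report below.
f = fom(60.0, [26558, 19547], [24027, 16923], 1472632)
assert round(f, 2) == 43.76
```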

Although the project spec states that we could do anything (add stages, etc…) for this checkpoint, I still wanted to see how much I could get out of a 3-stage pipeline before moving on to more.

Baseline: Current Design

The current design as shown in the block diagram reached an $F_{max}$ of 60 MHz.

FOM:

$ ./scripts/fom.py
...
Cycle Counts: [26558, 19547]
Instruction Counts: [24027, 16923]
CPIs: [1.11, 1.16]
CPI (geomean): 1.13

Fmax: 60.0
CPI: 1.13
Cost: 1472632

FOM (estimate): 43.76

Inspecting the timing reports confirmed that the critical path is the forwarding path from WB, going through control modules, and ending up in MMIO.

Before playing around with Vivado synthesis/PAR directives (retime, replication, etc.) to optimize, I'd like to test different microarchitectural changes first.

Alt-1: Combinational PC_IF

The first thing I wanted to try was eliminating the 1-cycle branch delay penalty caused by PC being a register. I changed PC_IF to a combinational element driven by the ALU output, which is passed through to IMEM/BIOS in the same cycle.

Since there is now zero branch penalty, the pipeline-flush (DXM kill) mechanism can be removed. This achieved a constant 1.0 CPI, at a slightly lower $F_{max}$ of 59 MHz.

$ ./scripts/fom.py
...
Cycle Counts: [24026, 16922]
Instruction Counts: [24027, 16923]
CPIs: [1., 1.]
CPI (geomean): 1.00

Fmax: 59.0
CPI: 1.00
Cost: 1423981

FOM (estimate): 49.44

I expected the critical path to stay roughly the same, but the timing reports showed that increased routing delay pushed the design past its budget at 60 MHz. The reports also show higher fanout (expected), which is likely the main contributor to the routing delay. Still, this design is a clear improvement given the constant 1.0 CPI and lower cost.

Afterthoughts: in hindsight, it would have been better to keep all PCs as registers and simply mux the ALU output into the IMEM address.

Alt-2: Target Address Adder

This design uses a separate adder to compute target addresses for load/store and branch instructions. The idea is to cut down the critical path through regfile -> ALU -> memory. With the separate adder, memory access completely bypasses the ALU, whereas the ALU output goes straight into a pipeline register for the next stage without passing through any other element. We see substantial improvement, reaching an $F_{max}$ of 74.2 MHz.

$ ./scripts/fom.py
...
Cycle Counts: [26558, 19547]
Instruction Counts: [24027, 16923]
CPIs: [1.11, 1.16]
CPI (geomean): 1.13

Fmax: 74.219
CPI: 1.13
Cost: 1411490

FOM (estimate): 55.29

The reports show the critical path moving from the ALU path to the address-adder path feeding the PC mux.

Alt-3: Merge M into WB

After sketching out the design, I realized that it can't meet the spec requirements. Even with a forwarding path, forwarded data is only available 1 cycle after MWB, and the spec doesn't allow stalling for data hazards. Hence we abandon this alternative design.

Alt-4: Combining alt-1 & alt-2

Alt-1 and alt-2 designs are independent of each other so we could apply both at the same time.

$ ./scripts/fom.py
...
Cycle Counts: [24026, 16922]
Instruction Counts: [24027, 16923]
CPIs: [1., 1.]
CPI (geomean): 1.00

Fmax: 70.0
CPI: 1.00
Cost: 1398596

FOM (estimate): 59.19

The improvement is less dramatic than hoped, as the higher fanout and routing delay result in a lower $F_{max}$ (compared to alt-2). We still benefit from the better CPI (and slightly lower cost).

Directives

By exploring some of the directive options offered by Vivado, we managed to reach 72 MHz:

  • synth_design: -directive PerformanceOptimized
  • opt_design: -directive ExploreWithRemap
  • place_design: -directive ExtraTimingOpt
  • phys_opt_design: -directive AlternateFlowWithRetiming
  • route_design: -directive AggressiveExplore
$ ./scripts/fom.py
...
Cycle Counts: [24026, 16922]
Instruction Counts: [24027, 16923]
CPIs: [1., 1.]
CPI (geomean): 1.00

Fmax: 72.266
CPI: 1.00
Cost: 1389178

FOM (estimate): 61.32

I think we could squeeze out more performance by further simplifying logic (and muxes). We could also remove forwarding paths for an acceptable CPI penalty (it's only a 3-stage pipeline) and gain a much higher $F_{max}$. At this point, however, I want to move on to a 5-stage pipeline.

V5: 5 Stage Pipeline

I implemented the standard RISC-V pipeline IF -> ID -> EX -> MEM -> WB, fully bypassed. Branch compare and resolution sit in EX while memory control generation sits in MEM, but we can tweak that later after seeing how this baseline performs.

block_diagram_v5

Hazards

Structural Hazards

Same as before, correct block design (or adding more hardware) should let us avoid structural hazards completely.

Control Hazards (Branches)

Branch resolution happens in EX with the ALU outputting the correct target address.

Even though we recognize unconditional jumps in ID, the target is only available in EX. Hence both conditional branches and unconditional jumps cost a 2-cycle pipeline flush.

Data Hazards

WAR, WAW

Not an issue for an in-order single-issue pipeline; no handling needed.

RAW, Load-use

For simple Read-After-Write hazards, we have a forwarding path from the pipelined ALU output in MEM back to EX, enabling 0 (cycle) latency.

Regarding load-use hazards: with our synchronous memory, data needed in EX is only available in WB. Hence, aside from a forwarding path from the WB mux back to EX, we also need to stall (insert 1 NOP) for 1 cycle.
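The detection condition is small enough to state exactly; a Python sketch (signal names are illustrative, not from the RTL):

```python
def needs_stall(ex_is_load, ex_rd, id_rs1, id_rs2, id_uses_rs1, id_uses_rs2):
    """Load-use detection: the instruction in EX is a load whose
    destination is read by the instruction in ID -> insert one bubble,
    then forward from WB. x0 never triggers a stall."""
    if not ex_is_load or ex_rd == 0:
        return False
    return (id_uses_rs1 and id_rs1 == ex_rd) or \
           (id_uses_rs2 and id_rs2 == ex_rd)

# lw x5, 0(x1) followed by add x6, x5, x2 -> stall one cycle
assert needs_stall(True, 5, 5, 2, True, True)
# independent follower -> no stall
assert not needs_stall(True, 5, 3, 2, True, True)
```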

I also caught a new issue that didn't appear in the 3-stage pipeline: a register read in the same cycle it is written. A modification to the regfile took care of this.

Benchmarks

There were also some logic simplifications over the 3-stage design: for example, instead of keeping 3 copies of the 4-bit write-enable masks (imem, dmem, mmio), I keep a single mask and gate it with each device's 1-bit write enable.
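For reference, the shared 4-bit mask for stores can be computed once from the store width and byte offset, then gated per device (a sketch, not the actual RTL):

```python
def store_mask(funct3, byte_offset):
    """4-bit byte write-enable mask for sb/sh/sw, shifted to the byte
    offset within the addressed 32-bit word.
    funct3: 0b000 sb, 0b001 sh, 0b010 sw."""
    base = {0b000: 0b0001, 0b001: 0b0011, 0b010: 0b1111}[funct3]
    return (base << byte_offset) & 0b1111

def device_mask(mask, device_we):
    # The single shared mask, gated by a device's 1-bit write enable.
    return mask if device_we else 0b0000

assert store_mask(0b000, 3) == 0b1000   # sb to byte 3
assert store_mask(0b001, 2) == 0b1100   # sh to the upper halfword
assert store_mask(0b010, 0) == 0b1111   # sw
```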

Once I got it fully working (passing all tests), this design achieved an $F_{max}$ of 85 MHz.

$ ./scripts/fom.py
...
Cycle Counts: [30152, 22173]
Instruction Counts: [22965, 16922]
CPIs: [1.31, 1.31]
CPI (geomean): 1.31

Fmax: 85.0
CPI: 1.31
Cost: 1333270

FOM (estimate): 56.12

The critical path runs from WB along the forwarding path into EX branch resolution and hazard control. The reports also revealed hazard-detection combinational logic from EX leaking into the MEM-stage MMIO instruction counter. After fixing that and making a few other simplifications (e.g. to the CSR register and MMIO addressing), I achieved 89 MHz.

$ ./scripts/fom.py
...
Cycle Counts: [29802, 22132]
Instruction Counts: [23709, 16882]
CPIs: [1.26, 1.31]
CPI (geomean): 1.28

Fmax: 89.063
CPI: 1.28
Cost: 1381295

FOM (estimate): 59.03

Further improvements should be possible, but we are getting close to sacrificing readability & modularity.

Even with a much higher $F_{max}$, we get a lower FOM than the best 3-stage design, due to the higher CPI from branch and load latencies.

Branch Prediction

I implemented a BHT (branch history table) of 2-bit saturating counters, which brought the CPI down to 1.15. Then, applying GShare by XORing a global history register with the PC to form the BHT index, I cut the CPI further to 1.13. Branch prediction accuracy was around 80%. It's interesting to see directly how adding the GHR hurts mmult's CPI but improves bdd's; that would be a good motivation for a tournament predictor.
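A behavioral sketch of the predictor in Python (table size and hashing details are illustrative, not the exact RTL):

```python
class GsharePredictor:
    """2-bit saturating-counter BHT indexed by (PC ^ global history)."""
    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.bht = [0b01] * (1 << index_bits)   # init weakly not-taken
        self.ghr = 0                            # global history register

    def _index(self, pc):
        # Drop the two always-zero PC bits, then XOR with history.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.bht[self._index(pc)] >= 0b10   # MSB means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.bht[i]
        self.bht[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

p = GsharePredictor()
pc = 0x1000_0040
assert not p.predict(pc)          # starts weakly not-taken
for _ in range(12):               # an always-taken loop branch...
    p.update(pc, True)
assert p.predict(pc)              # ...trains the predictor to "taken"
```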

...
...simulation complete
Cycle Counts: [26812, 19118]
Instruction Counts: [23441, 16882]
CPIs: [1.14, 1.12]
CPI (geomean): 1.13

Fmax: 71.25
CPI: 1.13
Cost: 1392614

FOM (estimate): 53.43

Obviously we take a huge hit on $F_{max}$: my branch predictor in IF has inputs wired from the ID stage, which becomes the new critical path and no longer meets the original timing. Further optimization of this final version is possible but was not done due to "timing" constraints.

What’s Next

This is one of the most complicated projects I've completed, but there's still a lot I want to play around with. For example, applying more concepts from the advanced Computer Architecture class, such as superscalar out-of-order execution, caching, multithreading, and advanced branch prediction like Tournament/TAGE.

Another cool direction would be to explore implementing interrupts, exception handling, privilege levels, an MMU & TLB, or ISA extensions like atomic, vector, or floating-point instructions.

Also, this project was a pure FPGA experience, and I am currently revisiting it as an ASIC design (with tapeout)!

Footnotes

  1. https://inst.eecs.berkeley.edu/~eecs151/sp25/static/fpga/project/docs/checkpoint1/index

  2. https://docs.amd.com/v/u/en-US/ug472_7Series_Clocking