A 95MHz 32-bit Pipelined RISC-V Processor

This is a two-stage pipelined RISC-V processor with a limited instruction set. I created this as part of my Design of Computing Systems (ENGN1640) class, which I took in my junior spring at Brown.

Previously in the class, we had created a single-cycle RISC-V processor with no pipelining. That means the processor fully completes one instruction (reading it from instruction memory, copying the relevant memory out of the stack into a register, doing the calculation, and storing the result) before embarking on the next one. As one can imagine, this is fairly inefficient. The goal was to improve the processor by at least 15% using pipelining.

This document covers the general structure of the processor, the pipelining strategy I selected, and an example program (Sokoban). Sokoban was created in collaboration with my lab partner, Yash Vora.

Since this is still used as an assignment in this class, I will avoid talking about many of the low-level design decisions or actual RISC-V code! These are the broadstrokes.

General Remarks

This is a two-stage pipelined processor which handles the following RISC-V instructions:

InstructionMeaning
addAdds two registers and stores the result to a register.
addiAdds a register and a given constant and stores the result to a register.
subSubtracts one register from another and stores the result to a register.
andReturns bitwise AND of two registers.
orReturns bitwise OR of two registers.
slliShifts a register left by some constant.
mulMultiplies two registers together and stores the result to a register.
swStore a register to an address on the stack.
lwLoads an address from the stack to a register.
beqJump to a given instruction if two registers are equal.
bneJump to a given instruction if two registers are not equal.
jalJump to a different constant location in instruction memory.
jalrJump to a different location in instruction memory, dictated by a register (usually the stack pointer).
halt(Not native RISC-V) Halt execution.

Adding more instructions would have been fairly trivial; the limited instruction set was dictated by the professor.

Implementation of the processor was done through Altera’s Quartus, a programmable logic device design software. Functionality was tested through deployment on an FPGA board.

Timing Statistics

StatValue
Max. Operating Frequency 95MHz
Clock Period10.53ns
Duty Cycle20%
CPI1 (!)

Memory and Stack

The instruction memory can contain up to 232 - 1 instructions (although this can be readily extended). The first entry in instruction memory (address 0) is hardcoded to be 0 and should not be overwritten (or undefined behavior will be observed). The first instruction should be written to the instruction memory location 1.

The stack contains 256 32-bit entries, placed at intervals of 4 words. Stack access is word-aligned (that is, the stack pointer will access at multiples of 4.) If sp is not a multiple of 4, the value will get rounded down to the nearest multiple of 4 before access. The stack pointer (x2) is initialized to 1020, which accesses the last (255th) entry.

Pipelining and Branch Resolution

Branch resolution occurs in the first stage (IF-ID-EX), bypassing the need for any fancy branch-decision logic or stage flushing. The program counter is always supplied with the correct next instruction within the first cycle.

The CPI is so low because the processor is only two-stage; unlike many other pipelined processors, the chip never needs to stall to avoid load-use or compute-use hazards. Writing from data memory into the register occurs before register reads, bypassing load-use hazards, and forwarding from the ALU output bypasses compute-use hazards. By sacrificing some frequency, we get an insanely fast CPI (equivalent to a single-cycle processor).

This was a conscious decision early in the design process and not one I made trivially. By only dividing the CPU into two stages, one fails to compartmentalize some fairly chunky processes. If these had not been implemented efficiently, my overall frequency would not have been reduced enough, leading to an ineffective processor even with the low CPI. It would have been equally valid to, say, implement a four-stage processor, or even split the EX stage (since the ALU contributes the greatest delay here). But that would have added complexity with hazard detection and forwarding, which I ultimately opted to avoid. I ended up making some small optimizations to the ALU in order to boost the overall frequency.

Registers

Several registers are reserved for internal use or FPGA I/O:

Register  Use
x1register address (ra)
x2stack pointer (sp)
x27reserved for I/O (button 0 - “down”)
x28reserved for I/O (button 1 - “up”)
x29reserved for I/O (button 2 - “left”)
x30reserved for I/O (button 3 - “right”)
x31reserved for I/O (switch)

The last five registers are constantly updated by an I/O controller to reflect the status of the input devices, and can be referenced by the program.

Writing to the Screen

Writing to the screen was handled by a VGA buffer module and video controller (which were supplied). A specific RAM module (with larger addresses than the stack) can be written to with an ASCII character and 12-bit RGB information, which would then be rendered by the controller based on a set of stored bitmaps.

Floorplan

Modules

The processor contains eight total modules:

Including the PLL (which generates the 95MHz clock with a 20% duty cycle from the FPGA’s onboard clock), and the VGA buffer/video controller, this makes ten modules. There are also various MUXes to select inputs for different modules based on control bits (set by the control module).

Resource Consumption

ResourceDefinition#
ALMA collection of basic logic elements in the FPGA.1203
LABHigher-order category of a group of logic elements. Roughly equivalent to overall floorplan.73
ALUTA collection of basic logic elements in the FPGA.1482
Logic registerAny kind of register, including those used to handle pipelining.1304
PLLUnit to generate the non-standard clock frequency.1
Memory blocksBlocks of memory used for instruction and stack.4

Example Program: Sokoban

This is the example program written in RISC-V assembly code, in collaboration with a partner, Yash Vora. It’s an implementation of the classic Japanese puzzle game Sokoban (倉庫番, meaning “warehouse keeper”), where the player pushes blocks around a level to get them onto specific tiles (referred here as “pressure plates”). A level is completed when every pressure plate is covered by a box. A player can push a box as long as the space behind it is empty (i.e. not filled with a box or a wall). This leads to surprisingly challenging levels and emergent gameplay! Our Sokoban contains three implemented levels, as an example; the assembly code can be readily extended to include arbitrarily many.

Controls

Four buttons are used as movement; up, down, left, and right.

The switch is used a level reset; setting it to ON and then OFF again will reset the current level to its initial condition.

Movement and Rendering

When a player attempts to move in a direction, the outcome depends on the tiles which are located in that direction:

Only one sprite may be rendered on a tile. The sprites are rendered according to the following priority:

The top of the screen says SOKOBAN - LEVEL N, where N is the current level.

Winning and Levels

A player beats their current level when every pressure plate is covered by a block. As currently implemented, each level has the same number of plates and blocks; however, it is theoretically possible for a level to have more blocks than plates, and indeed the current setup supports this. In that case, all the game requires is for every plate to be covered; not for every block to be on a plate.

When a level is beaten, the game wipes the screen and then renders the next level for them to play. The player advances sequentially through levels (so level 1 first, then level 2, then level 3). A winscreen is displayed after the second level (to demonstrate functionality); but hitting the reset switch on the winscreen will display more levels!

Register Descriptions

Images

An image of the first level. An image of the third level (not in its starting configuration). An image of the winscreen.