Andrew Wang, Tim Gu
A bit-serial CPU processes one bit of a data word at a time using minimal logic - often reusing a small ALU and control unit across clock cycles. This is in contrast to a bit-parallel CPU, which processes entire data words (e.g., 8/16/32 bits) at once.
Processing a single bit at a time instead of in parallel means that the CPU is much slower, but it can be made much smaller. This makes the bit-serial architecture very suitable for a submission to TinyTapeout in which a chip area of 160 x 100 um is one of the primary constraints.
In this design, 16-bit-wide instructions are fed into the CPU over two clock cycles using the 8 TinyTapeout input signals. Each instruction is decoded, and the relevant operands (either immediates or stored values from a register file) are processed bit-serially from LSB to MSB and shifted into an accumulator register. The CPU also supports parallel loads into the accumulator and storing results from the accumulator to an addressable register file.
<img width="2021" height="1550" alt="gds_render" src="https://github.com/user-attachments/assets/af303a02-cf40-4cd0-9aa9-7db6f32abe07" />

Opcode | Mnemonic | C Operation | Description |
---|---|---|---|
0000 | ADDI | acc = rs1 + imm | Add Immediate |
0001 | SUBI | acc = rs1 - imm | Subtract Immediate |
0010 | SLLI | acc = rs1 << imm | Shift left Immediate |
0011 | SRLI | acc = rs1 >> imm | Shift right Immediate |
0100 | ORI | acc = rs1 \| imm | Bitwise OR Immediate |
0101 | ANDI | acc = rs1 & imm | Bitwise AND Immediate |
0110 | XORI | acc = rs1 ^ imm | Bitwise Exclusive OR Immediate |
0111 | LOADI | acc = imm | Load immediate into accumulator |
1000 | ADD | acc = rs1 + rs2 | Add Registers |
1001 | SUB | acc = rs1 - rs2 | Subtract Registers |
1010 | OR | acc = rs1 \| rs2 | Bitwise OR Registers |
1011 | AND | acc = rs1 & rs2 | Bitwise AND Registers |
1100 | XOR | acc = rs1 ^ rs2 | Bitwise Exclusive OR Registers |
1101 | LOAD | acc = rs1 | Load from register into accumulator |
1110 | STORE | rs1 = acc | Store from accumulator into register |

Pin Group | Type | Usage |
---|---|---|
ui_in[7:0] | Input | Instruction bit inputs |
uio_in[0] | Bidirectional | Pushbutton input for instruction loading |
uo_out[7:0] | Output | Parallel output |
clk | Clock | Clock input |
rst_n | Reset | Active low synchronous reset |
[15:8] | [7] | [6:4] | [3:0] |
---|---|---|---|
immediate[7:0] | unused | rs1_addr[2:0] | opcode[3:0] |

[15:11] | [10:8] | [7] | [6:4] | [3:0] |
---|---|---|---|---|
unused | rs2_addr[2:0] | unused | rs1_addr[2:0] | opcode[3:0] |
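
The two layouts above (immediate form first, register form second) can be packed with a few shifts. The following is a minimal sketch, not part of the project sources: the helper names `encode_i` and `encode_r` are hypothetical, and the register form assumes the 3-bit `rs2` address sits in bits [10:8], which is where the SUBI example later in this document finds the value 4 for immediate `0x2C`.

```python
# Opcode values copied from the instruction table in this README.
OPCODES = {
    "ADDI": 0x0, "SUBI": 0x1, "SLLI": 0x2, "SRLI": 0x3,
    "ORI": 0x4, "ANDI": 0x5, "XORI": 0x6, "LOADI": 0x7,
    "ADD": 0x8, "SUB": 0x9, "OR": 0xA, "AND": 0xB,
    "XOR": 0xC, "LOAD": 0xD, "STORE": 0xE,
}


def encode_i(mnemonic, rs1, imm):
    """Pack an immediate-form instruction: imm[15:8], rs1[6:4], opcode[3:0]."""
    if not (0 <= rs1 < 8 and 0 <= imm < 256):
        raise ValueError("rs1 must fit in 3 bits and imm in 8 bits")
    return (imm << 8) | (rs1 << 4) | OPCODES[mnemonic]


def encode_r(mnemonic, rs1, rs2):
    """Pack a register-form instruction.

    Assumption: rs2 occupies bits [10:8]; bit 7 and bits [15:11] stay 0.
    """
    if not (0 <= rs1 < 8 and 0 <= rs2 < 8):
        raise ValueError("register addresses must fit in 3 bits")
    return (rs2 << 8) | (rs1 << 4) | OPCODES[mnemonic]


print(hex(encode_i("SUBI", 3, 0x2C)))  # 0x2c31
print(hex(encode_r("XOR", 3, 4)))      # 0x43c
```

Under this assumption, bits [10:8] of the encoded `SUBI R3, 0x2C` word read as 4, which is why the operand multiplexer has to pick the immediate rather than `R4` for immediate-form instructions.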
Instructions are loaded manually over two clock cycles via button press:

First press:
- The lower four bits (`ui_in[3:0]`) are latched into `opcode[3:0]`
- The upper four bits (`ui_in[7:4]`) go into `instr[3:0]`
- `inst_done` is set high

Second press:
- `ui_in[7:0]` fills in `instr[11:4]`
- `inst_done` is cleared

An edge-detected pushbutton (connected via `uio_in[0]`) triggers instruction loading. Once loaded, the FSM executes the instruction, and the final result is output on `uo_out[7:0]`.
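
The two-press sequence can be modelled in a few lines of Python. This is an illustrative sketch only: `button_presses` and `reassemble` are hypothetical helper names, and it assumes that `instr[11:0]` holds bits [15:4] of the 16-bit instruction word, which is what the latching order above implies.

```python
def button_presses(instruction):
    """Split a 16-bit instruction into the two ui_in bytes, in press order."""
    first = instruction & 0xFF           # ui_in[3:0] -> opcode, ui_in[7:4] -> instr[3:0]
    second = (instruction >> 8) & 0xFF   # ui_in[7:0] -> instr[11:4]
    return first, second


def reassemble(first, second):
    """Mirror the latching on the CPU side; returns (opcode, instr)."""
    opcode = first & 0xF
    instr = ((first >> 4) & 0xF) | ((second & 0xFF) << 4)  # 12-bit instr[11:0]
    return opcode, instr


first, second = button_presses(0x2C31)
print(hex(first), hex(second))           # 0x31 0x2c
opcode, instr = reassemble(first, second)
print(hex(opcode), hex(instr))           # 0x1 0x2c3
```

With this mapping the low byte of the instruction is presented first, so the opcode is available to the decoder before the upper byte arrives.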
<img width="2719" height="1014" alt="image" src="https://github.com/user-attachments/assets/c0e4eb80-4685-4eb3-b876-7a634d39eb94" />
The `fsm_control` module orchestrates datapath sequencing using a 5-state FSM:
- `S_IDLE = 0x0`: Waits for button press and valid instruction
- `S_DECODE = 0x1`: Decodes opcode, issues control signals for load/store/ALU
- `S_SHIFT_REGS = 0x2`: Performs serial operations; enables register shifting and accumulator writes
- `S_WRITE_ACC = 0x3`: Special case state for direct writes (not commonly used)
- `S_OUTPUT = 0x4`: Signals end of execution and enables writing to output LEDs

The FSM generates control signals including `reg_shift_en`, `acc_write_en`, `alu_start`, `alu_op`, and `out_en` based on instruction type.
The `regfile_serial` module implements an 8x8 register file: eight registers, each 8 bits wide. It supports:
- For shift instructions (SLLI/SRLI), `rs1_bit` is offset by `shift_imm`, computed from `instr[6:4]`.

The register file outputs:
- `rs1_bit`: used as ALU operand 1
- `rs2_bit`: used as ALU operand 2 (only valid for R-type)
- `regfile_bits`: parallel content of the selected rs1 register, for LOAD/LOADI
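
One way to picture the shift offset is as a change in which `rs1` bit is read on each cycle while the accumulator fills from LSB to MSB. The sketch below is a behavioural model under stated assumptions, not the Verilog: `serial_shift` is a hypothetical name, and it assumes that bit positions falling outside the register read as 0.

```python
def serial_shift(rs1, shift_imm, left, width=8):
    """Model SLLI/SRLI by offsetting the rs1 bit index on each cycle."""
    acc = 0
    for i in range(width):                     # one result bit per cycle, LSB first
        src = i - shift_imm if left else i + shift_imm
        # Assumption: reads outside [0, width) return 0 (zero fill).
        bit = (rs1 >> src) & 1 if 0 <= src < width else 0
        acc |= bit << i                        # pass-through bit lands in the accumulator
    return acc


print(hex(serial_shift(0x12, 2, left=True)))   # 0x48
print(hex(serial_shift(0x12, 2, left=False)))  # 0x4
```

Because `shift_imm` comes from a 3-bit field, shift amounts in this model range from 0 to 7.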
The `alu_1bit` module performs a one-bit computation per cycle based on `alu_op`:
- Supported operations: `ADD`, `SUB`, `XOR`, `AND`, `OR`, pass-through (for shift ops).
- `carry_in` and `carry_out` are managed explicitly to support serial arithmetic.
- For `SUB`, `rs2` is inverted and an initial carry is injected on the first cycle.

The ALU receives `rs1`, `rs2`, `alu_op`, `alu_start`, and outputs a single-bit result to the accumulator.
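
The carry handling is easiest to see in a small software model. The following sketch is illustrative and not derived from the Verilog source: `alu_1bit_model` and `run_serial` are hypothetical names, and the string operation codes stand in for whatever encoding `alu_op` actually uses.

```python
def alu_1bit_model(a, b, carry_in, op):
    """One ALU step: returns (result_bit, carry_out) for single-bit operands."""
    if op == "SUB":
        b ^= 1                                  # invert rs2 bit for two's-complement subtract
    if op in ("ADD", "SUB"):
        result = a ^ b ^ carry_in               # full-adder sum
        carry_out = (a & b) | (carry_in & (a ^ b))
        return result, carry_out
    if op == "AND":
        return a & b, 0
    if op == "OR":
        return a | b, 0
    if op == "XOR":
        return a ^ b, 0
    raise ValueError(f"unsupported op: {op}")


def run_serial(op, rs1, rs2, width=8):
    """Feed operands LSB first and collect the result bits."""
    carry = 1 if op == "SUB" else 0             # initial carry injected on the first cycle
    acc = 0
    for i in range(width):
        bit, carry = alu_1bit_model((rs1 >> i) & 1, (rs2 >> i) & 1, carry, op)
        acc |= bit << i
    return acc


print(hex(run_serial("SUB", 0x73, 0x2C)))       # 0x47
print(hex(run_serial("XOR", 0x73, 0x2D)))       # 0x5e
```

Inverting `rs2` and starting with a carry of 1 computes `rs1 + ~rs2 + 1`, which is `rs1 - rs2` in two's complement, so the same adder serves both ADD and SUB.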
The accumulator is an 8-bit shift register that:
- Loads in parallel from `regfile_bits` or `instr[11:4]` (based on `opcode[3]`)
- Shifts in the serial `alu_result`
- Uses the `bit_index_d` signal to update the correct bit and signals completion when done

The accumulator provides the final output via `acc_bits`.
This project uses a black-box testing strategy to validate the behavior of the bit-serial CPU.
- Instructions are applied through `ui_in[7:0]` and a simulated pushbutton `uio_in[0]`.
- `uo_out[7:0]` is compared against the expected result to determine if the CPU gives the correct output.
- The only other signals driven are `clk` and `rst_n`.

This keeps the test logic a clean abstraction that reflects the real-world usage model of the CPU and is portable to gate-level simulations.
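
A black-box check needs an expected value for every instruction. One possible way to produce it is a word-level reference model that tracks the register file and accumulator in plain Python. The class below is a hypothetical sketch, not the project's testbench code; it assumes shift amounts are limited to 3 bits and that `uo_out` simply mirrors the accumulator after each instruction.

```python
MASK = 0xFF  # 8-bit datapath

BINARY_OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "SLL": lambda a, b: a << (b & 0x7),   # assumption: 3-bit shift amount
    "SRL": lambda a, b: a >> (b & 0x7),
    "OR": lambda a, b: a | b,
    "AND": lambda a, b: a & b,
    "XOR": lambda a, b: a ^ b,
}


class ReferenceCpu:
    """Word-level model: eight 8-bit registers plus an accumulator."""

    def __init__(self):
        self.regs = [0] * 8
        self.acc = 0

    def execute(self, mnemonic, rs1=0, operand=0):
        """operand is an immediate for *I mnemonics, a register index otherwise."""
        if mnemonic == "LOADI":
            self.acc = operand & MASK
        elif mnemonic == "LOAD":
            self.acc = self.regs[rs1]
        elif mnemonic == "STORE":
            self.regs[rs1] = self.acc
        elif mnemonic.endswith("I"):
            self.acc = BINARY_OPS[mnemonic[:-1]](self.regs[rs1], operand) & MASK
        else:
            self.acc = BINARY_OPS[mnemonic](self.regs[rs1], self.regs[operand]) & MASK
        return self.acc  # assumed to match uo_out


cpu = ReferenceCpu()
cpu.execute("LOADI", operand=0x73)
cpu.execute("STORE", rs1=3)
print(hex(cpu.execute("SUBI", rs1=3, operand=0x2C)))  # 0x47
```

Because the model never looks inside the design, the same expected values apply unchanged to RTL and gate-level runs.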
A cocotb testbench is used to run tests in Python. Each test uses the following structure:
- Check `uo_out` against the expected result using `assert_result(...)`. A test fails if the result mismatches or the CPU fails to update the output.

Example:
- Instruction: `LOADI 0x2D`
- Expected output: `0x2D`

<img width="1217" height="869" alt="image" src="https://github.com/user-attachments/assets/685b6ed6-9bf2-432a-b280-d606e5539934" />

Example:
- Initial state: `R3` contains `0x73`, `R4` contains `0x2D`
- Instruction: `XOR R3, R4`
- Expected output: `0x5E`

<img width="1684" height="856" alt="image" src="https://github.com/user-attachments/assets/41db3e04-fcba-4719-9a9c-b11cd0cde3a1" />

Example:
- Initial state: `R3` contains `0x73`, `R4` contains `0x2D`
- Instruction: `SUBI R3, 0x2C`
- Expected output: `0x47`

<img width="1684" height="901" alt="image" src="https://github.com/user-attachments/assets/f4138d6a-19f4-46eb-b4e2-71315d5c8499" />

Note that in this case, the bits in the I-type instruction that correspond to the `rs2` address hold the value 4. However, the mux logic correctly selects the immediate bits for use in the ALU rather than using `R4` as the second operand.

Example:
- Initial state: `R6` contains `0x12`
- Instruction: `SLLI R6, 0x02`
- Expected output: `0x48`

<img width="1717" height="888" alt="image" src="https://github.com/user-attachments/assets/9b64fd95-4092-4d5a-bcf6-ff4758816b37" />

- `top.v` and `cpu_core.v`
- `fsm_control.v` for instruction decode and control sequencing
- `alu_1bit.v` and `regfile_serial.v`, and developed the core features required for integration of these modules. This includes R- vs I-type operand multiplexing, regfile addressing, and serial arithmetic/logic operations.
- `accumulator.v`

# | Input | Output | Bidirectional |
---|---|---|---|
0 | INST_0 | LED_0 | PB_INST |
1 | INST_1 | LED_1 | |
2 | INST_2 | LED_2 | |
3 | INST_3 | LED_3 | |
4 | INST_4 | LED_4 | |
5 | INST_5 | LED_5 | |
6 | INST_6 | LED_6 | |
7 | INST_7 | LED_7 | |