
This project implements a 4×4 systolic array matrix multiply-accumulate (MAC) accelerator in SystemVerilog. It computes the product C = A × B for two 4×4 matrices of unsigned 8-bit integers, producing 16 results with 20-bit accumulator precision.

The design consists of five modules:

load_done when complete.

The systolic dataflow works as follows: matrix A elements flow left to right across rows, and matrix B elements flow top to bottom down columns. Each processing element (PE) multiplies its registered A and B values and adds the product to its running accumulator. After 2N−1 = 7 counting cycles plus 2 pipeline flush cycles (9 cycles in total), all 16 accumulators hold the correct dot-product results.
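
The dataflow described above can be sketched as a small cycle-by-cycle software model. This is a behavioural illustration only, not the project's RTL. The function name `systolic_matmul`, the input skew (row i of A delayed by i cycles, column j of B delayed by j cycles), and the loop length are assumptions of this simplified model, so its iteration count need not match the hardware's 7 + 2 cycles.

```python
def systolic_matmul(A, B):
    """Behavioural model of an N x N output-stationary systolic array.

    Hypothetical sketch: A values shift left to right, B values shift
    top to bottom, and every PE accumulates the product of the values
    it has just registered.
    """
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # A operand register in each PE
    b_reg = [[0] * n for _ in range(n)]  # B operand register in each PE
    acc = [[0] * n for _ in range(n)]    # running accumulators

    # In this model the last useful operand pair reaches PE(n-1, n-1)
    # at cycle 3n - 3, so 3n - 2 iterations are enough (assumption).
    for t in range(3 * n - 2):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    k = t - i  # row i is fed with an i-cycle delay
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # from left neighbour
                if i == 0:
                    k = t - j  # column j is fed with a j-cycle delay
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # from upper neighbour
                acc[i][j] += new_a[i][j] * new_b[i][j]
        a_reg, b_reg = new_a, new_b  # all registers update together
    return acc
```

Because both operands reach PE(i, j) with the same total delay of i + j cycles, the pair that meets there at cycle t is always A[i][k] and B[k][j] with k = t − i − j, which is why each accumulator ends up holding one dot product.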

Connect an SPI master (e.g., RP2040, Arduino, or FPGA) to the chip:

| Pin | Signal | Direction |
|---|---|---|
| ui[0] | SCK | Input |
| ui[1] | MOSI | Input |
| ui[2] | CS | Input |
| uo[0] | MISO | Output |
| uo[1] | comp_done | Output (status) |
| uo[2] | load_done | Output (status) |
| uo[3] | spi_done | Output (status) |

comp_done (uo[1]) to know when the result is ready.

To multiply:

    A = [[ 1,  2,  3,  4],      B = [[ 5,  6,  7,  8],
         [ 5,  6,  7,  8],           [ 9, 10, 11, 12],
         [ 9, 10, 11, 12],           [13, 14, 15, 16],
         [13, 14, 15, 16]]           [17, 18, 19, 20]]

send a leading 0x00 byte, then the 16 elements of A, then the 16 elements of B, each matrix in row-major order (33 bytes in total):

    [0x00, 1,2,3,4, 5,6,7,8, 9,10,11,12, 13,14,15,16, 5,6,7,8, 9,10,11,12, 13,14,15,16, 17,18,19,20]
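
The byte sequence above can be generated on the host rather than typed by hand. The following is a minimal sketch. The helper name `build_frame` and its `lead_byte` parameter are hypothetical, and the meaning of the leading 0x00 byte is not stated in this section, so it is simply passed through as given.

```python
def build_frame(A, B, lead_byte=0x00):
    """Return the lead byte followed by A and then B, both row-major.

    Hypothetical helper: it only reproduces the byte layout shown in
    the example and does not talk to any hardware.
    """
    frame = [lead_byte]
    for matrix in (A, B):
        for row in matrix:
            for value in row:
                # Operands are unsigned 8-bit integers.
                if not 0 <= value <= 0xFF:
                    raise ValueError(f"element {value} does not fit in 8 bits")
                frame.append(value)
    return bytes(frame)


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16], [17, 18, 19, 20]]
frame = build_frame(A, B)
print(len(frame))  # 33: one lead byte plus 16 + 16 matrix elements
```

The resulting `bytes` object can be handed to whatever SPI write routine the chosen master provides.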

Expected result matrix:

    C = [[130, 140, 150, 160],
         [306, 332, 358, 384],
         [482, 524, 566, 608],
         [658, 716, 774, 832]]
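
Expected values for any input pair can be recomputed on the host with a few lines of plain Python, which also makes it easy to confirm that a result fits in the 20-bit accumulators. This is a sketch under stated assumptions: `reference_matmul` and `fits_accumulator` are hypothetical helper names and are not taken from the included test script.

```python
ACC_BITS = 20  # accumulator width stated for this design


def reference_matmul(A, B):
    """Plain triple-loop matrix product, used as a software reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def fits_accumulator(C, bits=ACC_BITS):
    """True if every entry of C is representable as an unsigned `bits`-bit value."""
    return all(0 <= value < (1 << bits) for row in C for value in row)


# Worst case for unsigned 8-bit operands: every element is 255.
worst = reference_matmul([[255] * 4 for _ in range(4)],
                         [[255] * 4 for _ in range(4)])
print(worst[0][0])              # 260100 = 4 * 255 * 255
print(fits_accumulator(worst))  # True
```

The worst-case entry, 4 × 255 × 255 = 260100, is below 2^18 = 262144, so a 20-bit accumulator cannot overflow for 4×4 matrices of 8-bit operands.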

A Python test script (test_systolic.py) is included for testing over UART on a Basys3 FPGA. It sends matrices, reads back results, and compares them against NumPy reference values.

An SPI master is required to communicate with the chip. Any microcontroller with SPI support will work. The Basys3 FPGA test setup instead uses a USB-UART bridge together with a separate UART shim module (not included in the Tiny Tapeout (TT) submission).

| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | SCK | MISO | |
| 1 | MOSI | comp_done | |
| 2 | CS | load_done | |
| 3 | | spi_done | |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | | |