495 4x4 Systolic Matrix MAC Accelerator

495 : 4x4 Systolic Matrix MAC Accelerator

Design render

How it works

This project implements a 4×4 systolic array matrix multiply-accumulate (MAC) accelerator in SystemVerilog. It computes the product C = A × B for two 4×4 matrices of unsigned 8-bit integers, producing 16 results with 20-bit accumulator precision.

The design consists of five modules:

  • pe — Processing element. Each PE contains two 8-bit pass-through registers and a 20-bit accumulator. On each clock cycle (when enabled), it latches inputs into internal registers, passes them to neighboring PEs, and accumulates the product of the registered values. 16 PEs are arranged in a 4×4 grid.
  • systolic_array_4x4 — The array top-level. Contains the 16 PEs connected in a grid, boundary feeding logic that skews matrix inputs using a cycle counter (row i of A is delayed by i cycles, column j of B is delayed by j cycles), and a done signal that asserts when the computation is complete (after 10 counter cycles to account for pipeline flush).
  • control_fsm — A 5-state FSM (IDLE → CLEAR → LOAD → COMPUTE → DRAIN) that sequences the operation. It waits for the SPI slave to finish loading matrices, clears accumulators, starts the array, and then enables result transmission.
  • spi_slave — Deserializes 33 incoming SPI bytes (1 command byte + 16 bytes for matrix A + 16 bytes for matrix B) into two 4×4 unpacked arrays and asserts load_done when complete.
  • spi_tx — Serializes the 16 accumulator outputs (20 bits each, transmitted as 16-bit truncated values) back to the host over SPI MISO.

The systolic dataflow works as follows: matrix A elements flow left-to-right across rows, matrix B elements flow top-to-bottom down columns. Each PE multiplies its registered A and B values and adds the product to its running accumulator. After 2N−1 = 7 counting cycles plus 2 pipeline flush cycles, all 16 accumulators hold the correct dot product results.

How to test -- adding lines for gds to work :D

SPI interface

Connect an SPI master (e.g., RP2040, Arduino, or FPGA) to the chip:

Pin Signal Direction
ui[0] SCK Input
ui[1] MOSI Input
ui[2] CS Input
uo[0] MISO Output
uo[1] comp_done Output (status)
uo[2] load_done Output (status)
uo[3] spi_done Output (status)

Protocol

  1. Pull CS low to begin a transaction.
  2. Send 33 bytes over MOSI: 1 command byte (any value), then 16 bytes for matrix A (row-major: A[0][0], A[0][1], ..., A[3][3]), then 16 bytes for matrix B (row-major).
  3. The chip computes C = A × B. Monitor comp_done (uo[1]) to know when the result is ready.
  4. Clock out 32 result bytes over MISO (16 results × 2 bytes each, MSB first).
  5. Pull CS high to end the transaction.

Example

To multiply:

A = [[1, 2, 3, 4],        B = [[5,  6,  7,  8],
     [5, 6, 7, 8],             [9,  10, 11, 12],
     [9, 10, 11, 12],          [13, 14, 15, 16],
     [13, 14, 15, 16]]         [17, 18, 19, 20]]

Send bytes: [0x00, 1,2,3,4, 5,6,7,8, 9,10,11,12, 13,14,15,16, 5,6,7,8, 9,10,11,12, 13,14,15,16, 17,18,19,20]

Expected result matrix:

C = [[130,  140,  150,  160],
     [322,  348,  374,  400],
     [514,  556,  598,  640],
     [706,  764,  822,  880]]

Python test script

A Python test script (test_systolic.py) is included for testing over UART on a Basys3 FPGA. It sends matrices, reads back results, and compares against NumPy reference values.

External hardware

An SPI master is required to communicate with the chip. Any microcontroller with SPI support will work. For the Basys3 FPGA test setup, a USB-UART bridge is used instead with a separate UART shim module (not included in the TT submission).

IO

#InputOutputBidirectional
0SCKMISO
1MOSIcomp_done
2CSload_done
3spi_done
4
5
6
7

Chip location

Controller Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Analog Mux Mux Mux Mux Mux Mux Mux Mux Mux Mux Analog Mux Mux Mux Mux Analog Mux Mux Mux Mux Mux Mux tt_um_chip_rom (Chip ROM) tt_um_factory_test (Tiny Tapeout Factory Test) tt_um_oscillating_bones (Oscillating Bones) tt_um_wokwi_457142813149930497 (TinyTapeOut workshop) tt_um_wokwi_457311688017142785 (tiny tapeout test gates) tt_um_bfcpu (bfCPU) tt_um_rebeccargb_universal_decoder (Universal Binary to Segment Decoder) tt_um_rebeccargb_hardware_utf8 (Hardware UTF Encoder/Decoder) tt_um_rebeccargb_intercal_alu (INTERCAL ALU) tt_um_rebeccargb_vga_pride (VGA Pride) tt_um_wokwi_457215959798165505 (4-bit N frequency divider) tt_um_ppu_aebarthyi (simple_ppu) tt_um_riscyv02 (RISCY-V02) tt_um_pong (Pong) tt_um_LH_TapeoutMultiplier (tt_um_LH_TapeoutMultiplier) tt_um_wokwi_457571222315471873 (7 Seg C) tt_um_wokwi_457571216758012929 (Mikes Second) tt_um_wokwi_457569452934172673 (FirstTinyTapeoutWokwiProject) tt_um_wokwi_457571159626309633 (Tiny Tapeout V1) tt_um_wokwi_457577038845586433 (TinyTapeOut) tt_um_wokwi_457571280506256385 (Tiny tapeouts test gates) tt_um_wokwi_457576742418338817 (calculator) tt_um_wokwi_457571219715001345 (Ami's TT Logic Gates) tt_um_wokwi_457571067547656193 (Mikes First Wokwi design) tt_um_wokwi_457571453314827265 (Tiny tapeout one hot to seven segment display 1-8) tt_um_wokwi_457572218833202177 (4bit adder and hex converter) tt_um_wokwi_463741407580251137 (Lady's First Tapeout) tt_um_wokwi_457571138696714241 (jdisplayer) tt_um_wokwi_457570687900145665 (Tiny Tapeout Test Gates) tt_um_wokwi_457577511431565313 (Tiny Tapeout Test Gates) tt_um_wokwi_457571262875481089 (Tiny Tapeout) tt_um_wokwi_457571417674762241 (TamTries Tiny Tapeout) tt_um_wokwi_457571366985520129 (georgies wokwi design) tt_um_wokwi_457571701752981505 (WilfTT) tt_um_wokwi_457572875733692417 (First WOKWI Design) tt_um_wokwi_457571571887847425 (tiny tapeout gate test) tt_um_wokwi_457571352249873409 (First Wokwi design) tt_um_wokwi_457571405919170561 (Namo's first tapeout) tt_um_wokwi_457571339952163841 (OR Gate with NAND) tt_um_wokwi_457571188658258945 (Abishag's first Wokwi Design) tt_um_wokwi_457571426719781889 (Tiny) tt_um_wokwi_457571268900604929 (tiny tape GDS) tt_um_wokwi_457571949070179329 (Tom Haley Tiny Tape Out Design ) tt_um_alex_ha_192 (alex_ha_192) tt_um_wokwi_457577241913913345 (tiny tapeout test gates ) tt_um_wokwi_457571297367365633 (First Wokwi Attempt) tt_um_wokwi_457571363309211649 (idk yet) tt_um_wokwi_457571305740256257 (Work In progress title) tt_um_wokwi_457579594627462145 (TinyTapeoutProjectDefne) tt_um_wokwi_457571274041781249 (Tiny Tapeout Workshop by Kirsty Tan) tt_um_wokwi_457571233499594753 (Tiny Tapeout Workshop) tt_um_wokwi_457570205537212417 (Tiny Tapeout Test Project) tt_um_ojas_sharma_imperial_ttcpu (ttcpu 4-bit RISC microprocessor) tt_um_wokwi_457571271419289601 (chip one) tt_um_wokwi_457573490746716161 (Name Serial Printer) tt_um_wokwi_457569507958215681 (Tiny tapeout proj) tt_um_wokwi_457577929607958529 (Random 1st Attempt) tt_um_wokwi_457571438667259905 (PD+PFD+FreqDiv) tt_um_wokwi_457571602706552833 (Joe's first Wokwi design) tt_um_wokwi_457571471659666433 (Nicolas' first Wokwi design) tt_um_wokwi_457571148733696001 (Tiny Tapeout Workshop 1) tt_um_wokwi_457572520479222785 (Tiny Tapeout: Buenos días Mundo! ) tt_um_wokwi_457571494688497665 (First Chip) tt_um_wokwi_457571341266031617 (D-Type Flip Flop) tt_um_wokwi_457581344351934465 (WOKWI) tt_um_wokwi_457571462196267009 (Tiny Tapeout) tt_um_wokwi_457571359410603009 (TinyTapeout) tt_um_Terdoo_Osu (Spiking Pattern Recognition Core) tt_um_wokwi_457571319408448513 (Mani TinyTapeout) tt_um_wokwi_457571298662360065 (Tiny Tapeout Test Gates) tt_um_wokwi_457573015156590593 (Lil tapeout) tt_um_wokwi_457576363047649281 (Inverter) tt_um_wokwi_457571216488527873 (Tiny Tapeout Template Copy Paul 1) tt_um_wokwi_457571472208072705 (Tiny Tapeout Test design) tt_um_wokwi_457571381968631809 (Tiny tapeout test) tt_um_wokwi_457571314694049793 (Tiny Tapeout Test) tt_um_wokwi_457571368009979905 (Tiny Tapeout Test Gates) tt_um_wokwi_457571389542502401 (First thing) tt_um_wokwi_457570267471381505 (Tiny Tapeout) tt_um_wokwi_457571563051492353 (CS First Wokwi design) tt_um_wokwi_457577392775721985 ( tiny Tapeout Test Gate) tt_um_wokwi_457570279596067841 (Tiny Tapeout Workshop - AJJ) tt_um_wokwi_457571180646081537 (Alins Password) tt_um_wokwi_457572360568198145 (Tiny Tapeout) tt_um_wokwi_457571270578328577 (Tiny tapeout workshop) tt_um_wokwi_457581625098771457 (Tiny Tapeout First Test Run) tt_um_wokwi_442342513281875969 (First Design) tt_um_wokwi_457581848269362177 (Tiny Tapeout Brainf*ck?) tt_um_sap_alexanderholden (sap1) tt_um_wokwi_457571752214675457 (3bit_ALU) tt_um_wokwi_457571542558115841 (Tiny Tapeout") tt_um_wokwi_457573095390500865 (Tiny Tapeout Workshop Counter) tt_um_wokwi_457571511812802561 (Akash's first Wokwi design) tt_um_wokwi_457577563633889281 (Tiny Tapeouts gate tests) tt_um_wokwi_457576950671858689 (Hymns_GDS) tt_um_wokwi_457571371384299521 (Digital digit display circuit - TINYTAPEOUT) tt_um_rowantylerr_RC_TDC (Resistor Capacitor TDC) tt_um_wokwi_463662181299058689 (2 bit ALU) tt_um_chinghey (Hey FlexCompute-130) tt_um_8b10 (serdes8b10) tt_um_rom_vga_screensaver (VGA Screensaver with embedded bitmap ROM) tt_um_mayamelon_top (Tiny PI Controller) tt_um_JAIMEPRYOR0_VGA_YAY (VGA_YAY) tt_um_2048_vga_game (2048 sliding tile puzzle game (VGA)) tt_um_mng2_2ncos (A Tale of Two NCOs) tt_um_shimmydee_checkers (One-tile ADC) tt_um_urish_simon (Simon Says memory game) tt_um_dheeeraaj_sine_chirp_beacon (DDS Sine Chirp Beacon) tt_um_nicholas_ls194a (Universal Shift Register (SN74LS194A compatible)) tt_um_BellaB05_Hearts (Pink Hearts) tt_um_scottshuynh_ad_astra (ASIC Ad Astra) tt_um_liamolucko_vga (VGA demo) tt_um_lledoux_s3fdp_seqcomb (S3FDP Seq+Comb Stream Core) tt_um_5482582_cat_vga (Cat VGA) tt_um_vga_example_directional_toggle (Directional toggle of VGA playground example) tt_um_jimbok_ro_puf (Ring Oscillator PUF) tt_um_xxsahanaxx_hwsec_glitch (Hardware Security Glitching Attack) tt_um_NguyenHuuHenry_vga_project (VGA_Project) tt_um_irfantekin_analog (tt_um_irfantekin_analog) tt_um_chicagojones_sky26a_trng (Sky26a Advanced TRNG) tt_um_yen (YEN) tt_um_pedometer (Ultra Low Power Pedometer ASIC) tt_um_analog_atenfyr1 (Configurable Self-biasing Miller-compensated OTA) tt_um_aes_sbox (Formally-Verified Constant-Time AES S-Box) tt_um_tcpu_alienflip (tcpu) tt_um_nebula (Sierpinski Fractal Starfield) tt_um_zenith_tx26 (Zenith TX26) tt_um_odgrip_demoscene_ttsky26a (My first demoscene) tt_um_vighnesh_sawant_plane (Plane with a banner) tt_um_glyph_mode_hd (Glyph Mode HD) tt_um_TSARKA_TinyQV (TinyQV Wishbone SoC) tt_um_SimpleCounter (Simple Counter) tt_um_cfar_nobuzzer (CFAR Detector without Buzzer) tt_um_present (Present) tt_um_top (Approximate Logic Unit) tt_um_goose (OIIA-goose) tt_um_riscv_core (Tiny RISC-V) tt_um_dac_test3v3 (Design and Implementation of R-2R Ladder DAC for GPR Application) tt_um_tadc_its (Time Domain ADC) tt_um_algofoogle_vga_matrix_dac (Analog VGA CSDAC experiments) tt_um_jyblue1001_pll (Analog-PLL) tt_um_axi4lite2x2_top (AXI4-Lite 2M-2S Interconnect) tt_um_systolic_top (4x4 Systolic Matrix MAC Accelerator) tt_um_goose_game (Goose Game) tt_um_rongbin99_happyredmapleleaf_audio_chip (Audio Wave Generator Chip) tt_um_fp_id (FinSec-1: AS-68M Fingerprint Verification ASIC) tt_um_game_of_life (Demoscene: Game of Life) tt_um_ds_missile_command (Missile Command) tt_um_cmos_inverter (Reactive Plasma: CMOS Inverter) tt_um_nightplumeaki_tinypipcore (tinypipcore) tt_um_immrudul_w7khan (Mrudul and Wahhaj Demoscene F2025) tt_um_sohamgovande_transformer (Transformer) tt_um_isa084_uart_servo (UART Positioning PWM Interface) tt_um_wokwi_461265571826974721 (Bias Correction Filter) tt_um_8_bit_cpu (8-bit CPU) tt_um_richad (ADPPLS) tt_um_algofoogle_dottee (DOTTEE VGA demo) tt_um_sar_fms (SAR FSM) tt_um_kolontsov_journey (Journey) tt_um_fft_adityaamehra (64 Sample FFT ASIC) tt_um_lambda_clock (Lambda Clock) tt_um_ece298A_analog (ECE298A analog tile) tt_um_toivoh_demo (Orion Iron Ion [TTSKY26a demo competition]) tt_um_kilian_interference (Wave Lattice) tt_um_fabulous_sky_26a (Tiny FABulous FPGA) tt_um_Rats2012_WobblyBits (WobblyBits - A probabilistic computing chip) tt_um_rebelmike_asic_odyssey (2026: An ASIC Odyssey) tt_um_huyatieo_tinyqv_speck (Speck-V SoC) tt_um_mosbius (mini mosbius) tt_um_remedy_cpu (FFD16 cpu 16-bit) tt_um_vga_ocarina (Ocarina on VGA) tt_um_TinyGPU_v3 (Tiniest GPU V3) tt_um_santhosh_ring_osc (Ring Oscillator PVT Sensor & TRNG) tt_um_santhosh_xbar_ctrl (Memristive Crossbar Peripheral Controller) tt_um_santhosh_stdp_ctrl (Digital STDP Learning Controller) tt_um_santhosh_stoch_neuron (LFSR-Based Stochastic Neuron) tt_um_anweiteck_ldo (1V-LDO) tt_um_sriaxi4lite_top (Axi4_Lite) tt_um_bch_code_15_7_2 (Bose-Chaudhuri-Hocquenghem Code) tt_um_mastensg_ttsky26a_demo (Luz) tt_um_pakesson_vga_rocket (VGA Rocket) tt_um_adpll (ADPLL - All-Digital Phase-Locked Loop) tt_um_Bingyao_FCOTA (Self biased Single Ended Folded Cascoded OTA) tt_um_spacewar_top (Spacewar) tt_um_microlane_demo (microlane demo project) tt_um_NE567Mixer28 (ECG Front End) tt_um_wakita_mux8onehot_cap (Mux8onehot Pulldown Mosfet) tt_um_johshoff_metaballs (Metaballs v2) tt_um_tomvdsch_cyclonerunner (CycloneRunner) tt_um_lowprocess_wildcamping (PicoMIPS CPU) tt_um_canvas (Tiny Canvas) tt_um_snrlxd1068_MACs (Linear and Logarithmic MACs) tt_um_pakesson_simon64_128 (SIMON64/128) tt_um_AmitChen1415 (Tiny Blackjack) tt_um_ole_moller_double_dabble_SV (double_dabble_SV) tt_um_toivoh_demo_1tile (Single tile demo [TTSKY26a demo competition]) tt_um_shiho_space_invaders (Tiny Space Invaders) tt_um_analog_RO (Analog RO) tt_um_electron65_vga (VGA Clock Demo) tt_um_wokwi_457571266840151041 (3-Bit ALU) tt_um_katomata (Katomata - 1D Cellular Automata) tt_um_shimomi_analog (analog circuit) tt_um_toivoh_demo_4tile (Four tile demo [TTSKY26a demo competition]) tt_um_IEEE_open_silicon_FOSSEE (Ring oscillator VCO and Differential Amplifier) tt_um_lm_chip_top (Project Long Man: A Delay-Insensitive Interconnect) tt_um_AlephNaNsea_space_time_waves_and_filaments (Space-Time Waves and Filaments) tt_um_spacelizard_apu (Spacelizard APU) tt_um_wokwi_457569490272926721 (Letter S) tt_um_mau_top_4b (SIMD2 Math Accelerator Unit) tt_um_maze (Maze) tt_um_demoscenettsky (Algorithmic Pattern Generator) tt_um_wokwi_457572141968369665 (Arran's tinytapeout project) tt_um_maxluppe_ttsky26a_analog (Standard Digital Logic Cells Analog Comparator) tt_um_grammartile (GrammarTile) tt_um_bubble_sort (IEEE Bubble Sort Engine) tt_um_ahmed_nematallah_12_bit_adc (12-bit ADC) tt_um_bad_ode_plotter_vga (Bad VGA ODE Plotter) tt_um_wokwi_463706339714973697 (Demo 4-bit ALU 74181 variant) tt_um_wokwi_457569853853115393 (Jasper Tiny Tape Out Workshop) tt_um_wokwi_457560507752701953 (Osian Tiny Tapeout) tt_um_wokwi_457571501325987841 (Rola_Tiny Tapeout Template Workshop4Mar26) tt_um_wokwi_457571903121572865 (TT-wokwi-template) tt_um_wokwi_463380823859050497 (My_Name_on_7_Seg_display) tt_um_wokwi_457569584731832321 (Tiny Tapeout 9 Template Copy) tt_um_wokwi_457571826952995841 (Tiny Tapeout Novomorphic Design 1) tt_um_wokwi_457571349142937601 (Tiny Tapeout Secret First Letter Code) tt_um_wokwi_457571261877235713 (Tiny Tapeout Test) tt_um_wokwi_457582867322921985 (Tiny Tapeout Test GDS) tt_um_wokwi_457571135132600321 (Tiny Tapeout Test Gates) tt_um_wokwi_457571331577181185 (Tinytapeout_IA) tt_um_wokwi_457576779101727745 (tiny tapeout test gates) tt_um_wokwi_457571577702202369 (tj wowki) tt_um_wokwi_457572953060951041 (wokwi) tt_um_pettit_galton (Tiny Galton) tt_um_fountaincoder_top_abc (ABC Temporal Coincidence Detector) tt_um_prime_quine (Prime Quine) tt_um_ghtag_trinity_gf16 (Trinity GF16 Dot Product Accelerator) tt_um_LFSR (Configurable Galois LFSR) tt_um_Acrazt05_titan_proccesing_unit (Titan Proccesing Unit (TPU)) tt_um_essen (Digital) tt_um_alu_bns (6-bit Multi-Functional ALU) tt_um_gerardvt_spade_poc (Interactive XOR Plasma (Spade HDL)) tt_um_gerardvt_clash_poc (Interactive Triangle-Wave Plasma (Clash HDL)) tt_um_jackthoene_frogger (Frogger) tt_um_wokwi_463698873100105729 (IEEE Open Silicon 2026: UTB Logic Trivia Challenge: 8-bit Digital Lock) tt_um_wokwi_463666635153364993 (IEEE - Hex Counter and Logic Gate Validator) tt_um_ChristmasTree_MaligayangPasko (ChristmasTree_MaligayangPasko) tt_um_wokwi_463711763041599489 (IEEE Open Silicon 2026: UTB UART Transmitter basic) tt_um_tinytensorcore (TinyTensorCore) tt_um_uwasic_crypto (UWASIC Crypto) tt_um_topadi (time) tt_um_siliconimist (Siliconimist Demoscene) tt_um_neutern_0 (tt_um_neutern_0) tt_um_htfab_hsxo (HSXO) tt_um_madech_8bit_processor_vga (8-Bit Processor with VGA) tt_um_vga_clock (VGA clock) tt_um_usu_AXIS_MVMul (AXI-Stream Matrix Vector Multiplier) tt_um_weird_numbers (Weird Numbers) tt_um_bovi_cable_tester (Cable Tester) tt_um_libokuohai_asap_cpu_v2 (ASAP CPU v2) tt_um_LinusSkucas_pio (Tiny PIO) tt_um_thomas_ep_sensor (EP Sensor v7 (symmetric in-place thicken, Zhao-compliant)) tt_um_rakhanaufm_truerandom (Current-Starved Ring Oscillator Based True Random Number Generator) tt_um_parakeet (parakeet) tt_um_mcml_vco (MCML experiments) tt_um_tpu ( Tensor Processing Unit) tt_um_strasti (8-Bit ALU) tt_um_zed_analog (Analog design) tt_um_axi4lite_top (Axi4_Lite) tt_um_c4m_spsram_direct (TTSKY-SPSRAM-direct) tt_um_Onchip_Folded_Cascode_N_with_Bias (Folded Cascode N Type with Bias from Onchip Research Group) tt_um_htfab_hybrid (Telephone hybrid) tt_um_ilamparuthi_cfar (CFAR Radar Detector) tt_um_pakesson_glitcher (Glitcher) tt_um_advaittej_stopwatch (V-SPACE Demo: Command & Control Chronograph) tt_um_william_pll (Smartcard PLL Clock Generator) tt_um_Melody_Generator_JLANordhal (Melody Generator based on Markov Chains) tt_um_d_monteiro (Neuromorphic Processor (SNN)) tt_um_jacob_kebaso_4bit_cpu (Nibble - 4-bit CPU) tt_um_signal_detector (Signal_Detection_Processor) tt_um_catalinlazar_tinycore8 (TinyCore8) tt_um_chidam_secengine (Tiny Secure Telemetry Engine) tt_um_urish_usb_cdc (USB CDC (Serial) Device) tt_um_josenbm (9-Channel Frequency Counter with I2C + SPI DAC & ADC) tt_um_shalindra_vga_rings (Variable Speed and Colour Select VGA Rings) tt_um_dinukuk_MYVGA_GLIDER (DKTT01 - VGA Glider) tt_um_fibonacci_JoaoBortolace (Fibonacci Counter) tt_um_wokwi_461639934990157825 (4 bit unlock (IEEE)) tt_um_ctw_ldo (LDO Regulator Skywater 130nm)