5 2x2 CNN Accelerator PE Grid with UART

  • Author: Vidhyanandhan Sathishkumar
  • Description: UART-controlled 2x2 systolic processing-element grid for CNN inference. Send 3-byte packets over UART to drive activations/weights; receive 2-byte psum results.
  • Clock: 25 MHz (25000000 Hz)

How it works

This project implements a 2×2 systolic Processing Element (PE) grid for CNN inference, controlled entirely over UART from a host PC.

Architecture

The core is peg_1x — a 2×2 array of multiply-accumulate PEs arranged in a systolic fashion. Each PE has a 2-bit signed weight register and a 13-bit accumulator. The array supports three dataflow patterns:

Direction   Encoding   Description
H-zigzag    2'b10      Activations enter via h_in0, snake left→up→left→down
V-zigzag    2'b01      Activations enter via v_in0, snake down→right→up→left
Circular    2'b11      Internal ring; boundary inputs ignored
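
The per-PE arithmetic described above (a 2-bit signed weight and a 13-bit accumulator) can be sketched in software as follows. This is a behavioural illustration only, not the RTL: it assumes the weight is two's complement (range -2 to +1), that the accumulator wraps rather than saturates, and that the activation is a plain integer. It ignores the systolic data movement entirely. The names decode_weight and mac_step are hypothetical.

```python
def decode_weight(bits: int) -> int:
    """Interpret a 2-bit field as two's complement (assumption): -2..+1."""
    bits &= 0b11
    return bits - 4 if bits & 0b10 else bits


def mac_step(acc: int, activation: int, weight_bits: int, acc_width: int = 13) -> int:
    """One multiply-accumulate step, wrapped to acc_width bits (assumed wraparound)."""
    total = (acc + activation * decode_weight(weight_bits)) & ((1 << acc_width) - 1)
    # Re-interpret the wrapped bit pattern as a signed value.
    if total & (1 << (acc_width - 1)):
        total -= 1 << acc_width
    return total


acc = 0
for activation in (5, 7):
    acc = mac_step(acc, activation, 0b01)  # weight +1
print(acc)  # 12
```

With weight bits 0b11 (read as -1 under the two's-complement assumption) the same two activations accumulate to -12.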

The uart_peg_top wrapper sits on top and bridges a host PC to the PE grid over 8N1 UART at 115200 baud.
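
For a sense of the timing involved, the short calculation below relates the 25 MHz clock to the 115200 baud rate and estimates how many DUT cycles per second the link can carry. It assumes an integer clock divider (the actual divider in uart_peg_top is not stated here) and a host that waits for each 2-byte reply before sending the next 3-byte packet; host and adapter latency are ignored.

```python
CLOCK_HZ = 25_000_000  # project clock
BAUD = 115_200         # UART baud rate

cycles_per_bit = CLOCK_HZ / BAUD
divider = round(cycles_per_bit)            # assumed integer divider
actual_baud = CLOCK_HZ / divider
error_pct = 100 * (actual_baud - BAUD) / BAUD

# 8N1 framing: 1 start bit + 8 data bits + 1 stop bit = 10 bit times per byte.
BITS_PER_BYTE = 10
BYTES_PER_CYCLE = 3 + 2                    # command packet plus reply
max_cycles_per_s = BAUD / (BITS_PER_BYTE * BYTES_PER_CYCLE)

print(divider)                             # 217
print(f"baud error: {error_pct:.4f} %")
print(max_cycles_per_s)                    # 2304.0
```

Under these assumptions the UART link, not the 25 MHz clock, sets the rate at which DUT cycles can be applied.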

UART Protocol

PC → FPGA: 3 bytes per DUT clock cycle

Byte   Bits    Signal
B0     [7]     en
B0     [6]     move_en
B0     [5]     psum_shift_en
B0     [4]     psum_clr
B0     [3]     w_ld_en
B0     [2:1]   direction[1:0]
B1     [7:2]   h_in0[5:0]
B1     [1:0]   v_in0[5:4]
B2     [7:4]   v_in0[3:0]
B2     [3:2]   w_in[1][1:0]
B2     [1:0]   w_in[0][1:0]
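
The bit layout above can be packed on the host side roughly as shown below. This is a minimal sketch, not the code in uart_peg_host.py: the function name pack_packet and its keyword arguments are hypothetical, and B0 bit [0], which the table does not assign, is assumed to be sent as 0. The field values in the examples are arbitrary and only demonstrate bit placement.

```python
def pack_packet(en=0, move_en=0, psum_shift_en=0, psum_clr=0, w_ld_en=0,
                direction=0, h_in0=0, v_in0=0, w1=0, w0=0) -> bytes:
    """Pack one command into the 3-byte PC -> FPGA layout (B0, B1, B2)."""
    h = h_in0 & 0x3F   # 6-bit field
    v = v_in0 & 0x3F   # 6-bit field, split across B1 and B2
    b0 = (((en & 1) << 7) | ((move_en & 1) << 6) | ((psum_shift_en & 1) << 5)
          | ((psum_clr & 1) << 4) | ((w_ld_en & 1) << 3)
          | ((direction & 0b11) << 1))          # bit [0] left as 0 (assumption)
    b1 = (h << 2) | (v >> 4)                    # h_in0[5:0], v_in0[5:4]
    b2 = ((v & 0xF) << 4) | ((w1 & 0b11) << 2) | (w0 & 0b11)
    return bytes([b0, b1, b2])


print(pack_packet(en=1, w_ld_en=1, direction=0b10, w1=1, w0=1).hex())  # 8c0005
print(pack_packet(en=1, move_en=1, direction=0b10, h_in0=5).hex())     # c41400
```

Masking each field before shifting also lets negative Python integers be passed for signed fields, for example w0=-1 is sent as 2'b11.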

FPGA → PC: 2 bytes after each applied cycle

Byte   Bits    Signal
B0     [4:0]   psum_out0[12:8]
B1     [7:0]   psum_out0[7:0]
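
Decoding the 2-byte reply back into a partial sum could look like the sketch below. The function name unpack_psum is hypothetical. Reading the 13-bit value as two's complement is an assumption (the weights are signed, but the output encoding is not spelled out here), and B0 bits [7:5], which the table does not assign, are masked off.

```python
def unpack_psum(reply: bytes, signed: bool = True) -> int:
    """Rebuild psum_out0[12:0] from the 2-byte FPGA -> PC reply."""
    if len(reply) != 2:
        raise ValueError("expected exactly 2 reply bytes")
    raw = ((reply[0] & 0x1F) << 8) | reply[1]   # B0[4:0] -> [12:8], B1 -> [7:0]
    if signed and raw & 0x1000:                 # bit 12 set: negative (assumption)
        raw -= 0x2000
    return raw


print(unpack_psum(bytes([0x00, 0x0C])))                 # 12
print(unpack_psum(bytes([0x1F, 0xF4])))                 # -12
print(unpack_psum(bytes([0x1F, 0xF4]), signed=False))   # 8180
```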

Each 3-byte packet applies the encoded inputs for exactly one DUT clock cycle. Transient control signals (en, move_en, w_ld_en, etc.) are auto-cleared after each cycle. Sticky signals (direction, h_in0, v_in0, w_in) retain their last value.
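
The one-packet-per-cycle exchange described above can be sketched as a single request/reply function. This is an illustration under assumptions, not the implementation in uart_peg_host.py: apply_cycle and LoopbackPort are hypothetical names, and the port argument is any object with write(bytes) and read(size) methods. LoopbackPort is only a stand-in with a canned reply so the sketch runs without hardware.

```python
class LoopbackPort:
    """Hardware-free stand-in for a serial port; returns a canned reply."""

    def __init__(self, canned_reply: bytes):
        self.sent = bytearray()
        self._reply = canned_reply

    def write(self, data: bytes) -> int:
        self.sent += data
        return len(data)

    def read(self, size: int) -> bytes:
        out, self._reply = self._reply[:size], self._reply[size:]
        return out


def apply_cycle(port, packet: bytes) -> int:
    """Send one 3-byte packet (one DUT cycle) and return the raw 13-bit psum."""
    if len(packet) != 3:
        raise ValueError("a command packet is exactly 3 bytes")
    port.write(packet)
    reply = port.read(2)
    if len(reply) != 2:
        raise TimeoutError("did not receive the 2-byte reply")
    return ((reply[0] & 0x1F) << 8) | reply[1]


port = LoopbackPort(bytes([0x00, 0x0C]))
print(apply_cycle(port, bytes([0xC4, 0x14, 0x00])))  # 12
```

With real hardware, a pyserial port opened as serial.Serial("/dev/ttyUSB0", 115200, timeout=1) provides the same write and read methods and could be passed in place of LoopbackPort.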

Pins

  • ui[0] = UART RX (input from PC)
  • uo[0] = UART TX (output to PC)

How to test

Connect a USB-to-UART adapter: adapter TX → ui[0], adapter RX → uo[0], GND → GND.

Install the Python host script dependencies:

pip install pyserial

Run the full verification suite against the chip:

python uart_peg_host.py /dev/ttyUSB0

This runs 9 test cases matching the simulation testbench (h-zigzag, v-zigzag, circular shift, negative weights, psum_clr, multi-stream accumulation).

Example: load weights and compute one dot product

from uart_peg_host import PegController

ctrl = PegController("/dev/ttyUSB0", baud=115200)
ctrl.clear_psums()
ctrl.load_weights(w0=1, w1=1)          # +1 both columns
ctrl.stream_h([5, 7], extra_drain=6)   # stream activations
result = ctrl.drain_psum()             # read accumulated psum
print(result)                          # → 12
ctrl.close()

External hardware

  • USB-to-UART adapter (3.3V logic level, e.g. FTDI FT232 or CP2102)
  • Host PC running uart_peg_host.py (Python 3, pyserial)

IO

#   Input   Output   Bidirectional
0   rx      tx
1
2
3
4
5
6
7

Chip location

[Chip floorplan figure: shows the tile tt_um_peg_top (2x2 CNN Accelerator PE Grid with UART) among the controller, mux blocks, and the other projects on the chip.]