493 Tiny NPU: 4-Way Parallel INT8 Inference Engine

  • Author: Malik
  • Description: 4-way parallel INT8 neural network inference engine with SproutHDL arithmetic units and ReLU activation
  • Clock: 50 MHz (50000000 Hz)

How it works

A minimal neural processing unit with 4 parallel multiply-accumulate datapaths that computes a single fully-connected layer: y = ReLU(W · x + b).

The NPU integrates custom arithmetic units generated by SproutHDL through ML-guided design-space exploration:

  • 4× Han-Carlson multipliers (8-bit unsigned, structurally optimized)
  • 4× Sparse Kogge-Stone adders (24-bit two's complement, area-optimized)

Because the multipliers are unsigned but the NPU operates on signed INT8 values, a lightweight sign-management wrapper takes the absolute value of each operand, multiplies the magnitudes, and negates the product when the operand signs differ. A pipeline register between the multiply and accumulate stages shortens the combinational path to ease timing closure.
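
The sign-management idea can be sketched in software. This is a behavioural illustration only, not the project's RTL: the function names are hypothetical, and the unsigned multiplier is modelled with Python's `*` rather than the Han-Carlson structure.

```python
def unsigned_mul8(a: int, b: int) -> int:
    """Stand-in for an 8-bit unsigned multiplier (operands 0..255, 16-bit product)."""
    assert 0 <= a <= 0xFF and 0 <= b <= 0xFF
    return a * b


def signed_mul8(a: int, b: int) -> int:
    """Signed INT8 multiply built around an unsigned multiplier (hypothetical wrapper)."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    negate = (a < 0) != (b < 0)          # the product is negative only if the signs differ
    mag = unsigned_mul8(abs(a), abs(b))  # |-128| = 128 still fits in 8 unsigned bits
    return -mag if negate else mag
```

An exhaustive check over all 65,536 operand pairs against ordinary signed multiplication is cheap and is a natural testbench for a wrapper of this kind.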

Specifications:

  • 4 parallel datapaths (one per output neuron)
  • Weight storage: 32 × INT8 register file (4 outputs × 8 inputs)
  • Bias storage: 4 × INT16
  • Input buffer: 8 × INT8 activations
  • 24-bit accumulator precision per output
  • Configurable: 1–8 inputs, 1–4 outputs
  • ReLU activation with INT8 output saturation
  • Pipelined inference: N_IN + 3 cycles (8 inputs → 11 cycles @ 50 MHz = 220 ns)
  • Weights persist across inferences for batch processing
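
A small reference model can help when checking outputs against these specifications. The sketch below rests on assumptions not stated in the source: the bias is added to the accumulator without scaling, no shift or rescale happens before the output stage, and saturation clamps positive results to 127. The names `npu_layer` and `latency_ns` are illustrative.

```python
def npu_layer(weights, biases, acts):
    """Reference y = ReLU(W . x + b) with INT8 output saturation.

    weights: n_out rows of n_in INT8 values (row-major, one row per output)
    biases:  n_out INT16 values
    acts:    n_in INT8 activations
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(row, acts):
            acc += w * x
        # 8 products of at most 128*128 plus an INT16 bias stay well inside 24 bits.
        assert -(1 << 23) <= acc < (1 << 23)
        outputs.append(min(max(acc, 0), 127))  # ReLU, then clamp to the INT8 maximum (assumed)
    return outputs


def latency_ns(n_in: int, clock_hz: int = 50_000_000) -> float:
    """Inference latency from the N_IN + 3 cycle figure above."""
    return (n_in + 3) * 1e9 / clock_hz
```

For example, `latency_ns(8)` reproduces the 220 ns figure quoted for 8 inputs at 50 MHz.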

How to test

  1. Reset: rst_n low → high
  2. Config: cmd=0x1, data[5:4] = n_out-1, data[2:0] = n_in-1
  3. Load weights row-major: cmd=0x2, data=weight (auto-increments)
  4. Load biases: cmd=0x3, data=bias (auto-increments)
  5. Load activations: cmd=0x4, data=act (auto-increments)
  6. Run: cmd=0x5. Wait for busy→0.
  7. Reset pointers: cmd=0x7
  8. Read outputs: cmd=0x6 (auto-increments)
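
The steps above can be sketched as a generator of (cmd, data) pairs for a testbench to drive onto cmd[3:0] and data_in[7:0]. Several details are assumptions, not taken from the source: signed values are sent as two's-complement bytes, the data byte is treated as a don't-care (0) for run, pointer-reset and read commands, weights are written back to back for the configured n_out × n_in shape, and the caller supplies bias bytes because the byte order of an INT16 bias on the 8-bit bus is not specified here. All function and constant names are hypothetical.

```python
CMD_CONFIG, CMD_LOAD_WEIGHT, CMD_LOAD_BIAS, CMD_LOAD_ACT = 0x1, 0x2, 0x3, 0x4
CMD_RUN, CMD_READ_OUT, CMD_RESET_PTR = 0x5, 0x6, 0x7


def config_byte(n_in: int, n_out: int) -> int:
    """Pack n_out-1 into bits [5:4] and n_in-1 into bits [2:0]."""
    assert 1 <= n_in <= 8 and 1 <= n_out <= 4
    return ((n_out - 1) << 4) | (n_in - 1)


def to_byte(v: int) -> int:
    """Two's-complement encoding of an INT8 value (assumed bus format)."""
    assert -128 <= v <= 127
    return v & 0xFF


def load_and_run(weights, bias_bytes, acts):
    """Commands for steps 2-6; the caller then waits for busy to go low."""
    n_out, n_in = len(weights), len(acts)
    seq = [(CMD_CONFIG, config_byte(n_in, n_out))]
    seq += [(CMD_LOAD_WEIGHT, to_byte(w)) for row in weights for w in row]  # row-major
    seq += [(CMD_LOAD_BIAS, b) for b in bias_bytes]  # raw bytes, order left to the caller
    seq += [(CMD_LOAD_ACT, to_byte(x)) for x in acts]
    seq.append((CMD_RUN, 0))
    return seq


def read_back(n_out: int):
    """Commands for steps 7-8: reset pointers, then one read per output."""
    return [(CMD_RESET_PTR, 0)] + [(CMD_READ_OUT, 0)] * n_out
```

For a full 8-input, 4-output layer, `config_byte(8, 4)` evaluates to 0x37.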

Design context

The arithmetic units were produced by SproutHDL (github.com/huawei-csl/sprout-hdl) as part of a semester project on ML-guided design-space exploration of AI hardware architectures. The Han-Carlson multiplier and sparse Kogge-Stone adder represent Pareto-optimal points on the area-delay tradeoff frontier identified through automated exploration.

External hardware

None required.

IO

#  Input       Output       Bidirectional
0  data_in[0]  data_out[0]  cmd[0]
1  data_in[1]  data_out[1]  cmd[1]
2  data_in[2]  data_out[2]  cmd[2]
3  data_in[3]  data_out[3]  cmd[3]
4  data_in[4]  data_out[4]
5  data_in[5]  data_out[5]
6  data_in[6]  data_out[6]
7  data_in[7]  data_out[7]
