557 : Nano-Bnn-Accelerator

Author: Giorgos Fragkoulis

Description: A regular binary network accelerator for 1x1 tile Z2A

Clock: 50000000 Hz

How it works

This is a 1-bit Vector-Matrix Multiplier (VMM) coprocessor designed to accelerate Binary Neural Networks (BNNs). A 1x1 Sky130 tile holds only about 1,000 gates, so a full neural network cannot fit on the chip.

Instead, this design uses a bottom-up time-multiplexing architecture. The silicon contains 8 physical neurons, each with an 8-bit weight shift register and a 10-bit accumulator. The math is extremely cheap: each 1-bit multiplication is a single XNOR gate, and accumulation is done via a popcount adder tree.
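The XNOR and popcount arithmetic described above can be modelled in a few lines of host-side C, which is handy for checking hardware readings against expected values. This is a minimal sketch, not the chip's RTL: the names `popcount8` and `bnn_neuron_step` are hypothetical, and it assumes the accumulator adds the raw number of matching bits and wraps at 10 bits (the source does not say whether it wraps or saturates).

```c
#include <stdint.h>

/* Count the set bits in one byte (portable popcount). */
static unsigned popcount8(uint8_t x)
{
    unsigned n = 0;
    while (x) {
        x &= (uint8_t)(x - 1);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * One compute step for one neuron (software model).
 * XNOR yields 1 wherever a weight bit equals the input bit;
 * the popcount of that byte is added to the accumulator.
 * ASSUMPTION: the 10-bit accumulator wraps around on overflow.
 */
static uint16_t bnn_neuron_step(uint16_t acc, uint8_t weights, uint8_t inputs)
{
    uint8_t matches = (uint8_t)~(weights ^ inputs);  /* bitwise XNOR */
    return (uint16_t)((acc + popcount8(matches)) & 0x3FFu);
}
```

Under these assumptions, a 28x28 image contributes at most 784 matches per neuron, which fits in the 10-bit range (maximum 1023).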

There is no internal State Machine. The ASIC acts as a raw, ultra-fast math engine, while the looping, routing, and layer management are handled by a host processor in software.

How to test

The chip expects to be driven by a host microcontroller running a C/C++ driver. The testing flow goes like this:

1. Pulse rst_n LOW to clear the 10-bit accumulators.
2. Load 8 bytes of binarized weights into the shift registers via ui_in.
3. Stream your binarized image pixels (e.g., 784 pixels for a 28x28 image, packed 8 per byte) into ui_in, one byte at a time.
4. Pulse the compute pin (uio_in[4]) once per byte. The internal edge detector ensures exactly one addition per byte.
5. Read the bottom 8 bits of the accumulator from uo_out and the top 2 bits from uio_out[7:6].
6. Repeat for the next batch. The host software applies the folded Batch Norm thresholds.
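The host-side data handling in the flow above can be sketched as three small helpers: packing pixels into bytes for ui_in, reassembling the 10-bit accumulator from the two output ports, and applying a threshold. All function names are hypothetical. The bit order (pixel 0 in bit 0 of the first byte) and the `>=` comparison are assumptions, since the source specifies neither; a folded Batch Norm with a negative scale would flip the comparison.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pack n binarized pixels (each 0 or nonzero) into bytes for ui_in.
 * ASSUMPTION: pixel i goes to bit (i % 8) of byte (i / 8).
 * Returns the number of bytes written to out.
 */
static size_t pack_pixels(const uint8_t *pixels, size_t n, uint8_t *out)
{
    size_t nbytes = (n + 7) / 8;
    for (size_t i = 0; i < nbytes; i++)
        out[i] = 0;
    for (size_t i = 0; i < n; i++)
        if (pixels[i])
            out[i / 8] |= (uint8_t)(1u << (i % 8));
    return nbytes;
}

/* Rebuild the 10-bit accumulator: low 8 bits from uo_out,
 * top 2 bits from uio_out[7:6]. */
static uint16_t read_acc(uint8_t uo_out, uint8_t uio_out)
{
    return (uint16_t)(uo_out | ((uint16_t)(uio_out >> 6) << 8));
}

/* Binarize one accumulator value against a folded Batch Norm threshold.
 * ASSUMPTION: the neuron fires when acc >= threshold. */
static uint8_t apply_threshold(uint16_t acc, uint16_t threshold)
{
    return (uint8_t)(acc >= threshold);
}
```

With this packing, a 784-pixel image becomes 98 bytes, so one pass over the image takes 98 compute pulses.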

External hardware

To run this, you need a host microcontroller (Raspberry Pi Pico / RP2040, ESP32, etc.) or an FPGA SoC to hold the trained model weights, stream the 1-bit data, and compute the final Float32 Softmax output layer.

#	Input	Output	Bidirectional
0	data_in_0	acc_out_0	load_weights
1	data_in_1	acc_out_1	addr_0
2	data_in_2	acc_out_2	addr_1
3	data_in_3	acc_out_3	addr_2
4	data_in_4	acc_out_4	compute_en
5	data_in_5	acc_out_5	unused
6	data_in_6	acc_out_6	acc_out_8
7	data_in_7	acc_out_7	acc_out_9
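For a host driver, the bidirectional column of the pin table maps naturally onto bit masks. The sketch below is illustrative only: the macro names and the `uio_control` helper are hypothetical, and the source does not say what the addr pins select, so the helper simply places a 3-bit value on uio[3:1]. Bits 7:6 carry accumulator outputs from the chip, so the host should leave them undriven.

```c
#include <stdint.h>

/* Bit positions on the bidirectional port, taken from the pin table. */
#define UIO_LOAD_WEIGHTS   (1u << 0)    /* uio[0]: load_weights       */
#define UIO_ADDR_SHIFT     1            /* uio[3:1]: addr_0..addr_2   */
#define UIO_ADDR_MASK      (0x7u << 1)
#define UIO_COMPUTE_EN     (1u << 4)    /* uio[4]: compute_en         */
#define UIO_ACC_HI_MASK    (0x3u << 6)  /* uio[7:6]: acc_out_8..9     */
#define UIO_HOST_DRIVEN    0x1Fu        /* uio[4:0] are host-driven   */

/*
 * Build the control byte the host drives onto uio[4:0].
 * addr is truncated to 3 bits; bit 5 (unused) and bits 7:6
 * (chip outputs) are always left at 0.
 */
static uint8_t uio_control(unsigned addr, int load_weights, int compute_en)
{
    uint8_t v = (uint8_t)((addr & 0x7u) << UIO_ADDR_SHIFT);
    if (load_weights)
        v |= UIO_LOAD_WEIGHTS;
    if (compute_en)
        v |= UIO_COMPUTE_EN;
    return v;
}
```

For example, `uio_control(5, 1, 0)` yields 0x0B: address 5 on uio[3:1] with load_weights set.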
