
This design is a 3×3 unsigned 4-bit matrix multiplier. It computes
C = A · B where A and B are 3×3 matrices of 4-bit unsigned
integers (values 0–15). Each element of C is the sum of three 8-bit
products, so the result is 10 bits wide and cannot overflow:
3 · 15 · 15 = 675 < 2^10.
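
The width argument above can be checked with a small software reference model. This is only a sketch in plain Python, not the RTL; the function name `matmul3` and the constants are hypothetical and chosen for illustration.

```python
# Software reference model (a sketch, not the RTL): product of two
# 3x3 matrices of 4-bit unsigned integers.
N = 3          # matrix dimension
MAX_IN = 15    # largest 4-bit unsigned value


def matmul3(a, b):
    """Return C = A · B for 3x3 lists of ints in the range 0..15."""
    for m in (a, b):
        assert all(0 <= v <= MAX_IN for row in m for v in row), "inputs must be 4-bit"
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


# Worst case: every input is 15, so every element of C is 3 * 15 * 15.
worst = matmul3([[MAX_IN] * N] * N, [[MAX_IN] * N] * N)
assert worst[0][0] == 3 * 15 * 15 == 675
assert (675).bit_length() == 10   # fits in 10 bits, since 675 < 2**10
```
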

Because the chip only has 24 user I/O bits, the matrices cannot be loaded in parallel. Instead, the design exposes a simple nibble-stream protocol driven by a small finite-state machine:

S_LOAD_A → S_LOAD_B → S_WAIT → (S_COMPUTE → S_READ) × 9 → S_DONE

- Each LOAD_EN pulse latches the nibble on DATA_IN[3:0] into the next slot of the on-chip A or B memory (row-major, 9 nibbles each).
- When START rises after both matrices are loaded, the multiplier accumulates one 4-bit × 4-bit product per clock for three cycles, producing a 10-bit element of C.
- OUT_VALID is then asserted, and the user reads the two bytes of the result by pulsing READ_EN. Byte 0 contains bits [9:8], byte 1 contains bits [7:0]. After the second pulse the FSM automatically advances to the next (i, j) and computes again.
- Once all nine elements have been read, DONE is asserted and stays high until reset.

Status pins (BUSY, DONE, OUT_VALID) are exposed on the bidirectional bus so a host MCU can poll progress without bit-banging.

Reset (rst_n low) clears all state and returns to S_LOAD_A.

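
The state sequence described above can be sketched as a transaction-level model, where each method call stands for one host pulse rather than one clock. This is an illustration under assumptions, not the actual RTL: the class `MatMulModel`, its method names, the row-major order in which elements of C are produced, and the choice that the first READ_EN pulse switches the output from byte 0 to byte 1 are all assumptions made for the sketch.

```python
from enum import Enum, auto


class State(Enum):
    LOAD_A = auto()
    LOAD_B = auto()
    WAIT = auto()
    COMPUTE = auto()
    READ = auto()
    DONE = auto()


class MatMulModel:
    """Hypothetical transaction-level model: one call per host pulse."""

    def __init__(self):
        self.reset()

    def reset(self):
        """rst_n low: clear all state and return to LOAD_A."""
        self.state = State.LOAD_A
        self.a, self.b = [], []   # row-major nibble memories
        self.idx = 0              # element of C being served, 0..8 (assumed row-major)
        self.byte_sel = 0         # 0 -> bits [9:8], 1 -> bits [7:0]
        self.acc = 0

    def load_en(self, nibble):
        """LOAD_EN pulse: store DATA_IN[3:0] in the next free slot."""
        if self.state == State.LOAD_A:
            self.a.append(nibble & 0xF)
            if len(self.a) == 9:
                self.state = State.LOAD_B
        elif self.state == State.LOAD_B:
            self.b.append(nibble & 0xF)
            if len(self.b) == 9:
                self.state = State.WAIT

    def start(self):
        """START pulse: ignored until both matrices are loaded."""
        if self.state == State.WAIT:
            self._compute()

    def _compute(self):
        i, j = divmod(self.idx, 3)
        self.state = State.COMPUTE
        self.acc = 0
        for k in range(3):        # the hardware takes one clock per product
            self.acc += self.a[3 * i + k] * self.b[3 * k + j]
        self.state = State.READ
        self.byte_sel = 0

    @property
    def out_valid(self):
        return self.state == State.READ

    @property
    def done(self):
        return self.state == State.DONE

    def data_out(self):
        """Byte currently presented on the output bus."""
        return (self.acc >> 8) & 0x3 if self.byte_sel == 0 else self.acc & 0xFF

    def read_en(self):
        """READ_EN pulse: first pulse selects byte 1, second advances (assumed)."""
        if self.state != State.READ:
            return
        if self.byte_sel == 0:
            self.byte_sel = 1
        elif self.idx == 8:
            self.state = State.DONE
        else:
            self.idx += 1
            self._compute()
```

Driving the model with 18 `load_en` calls, one `start`, and then two `data_out`/`read_en` pairs per element walks through the same states as the protocol description; it says nothing about cycle-level timing.
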

The host (MCU, FPGA, USB-GPIO, etc.) drives the chip in this order:

1. Hold rst_n low for a few clocks, then release.
2. For each of the 18 nibbles (A first, then B, both row-major), place the nibble on ui_in[3:0], then pulse LOAD_EN high for one clock and low for one clock.
3. Pulse START high for one clock. BUSY stays high.
4. Wait for OUT_VALID = 1, then read two bytes on uo_out[7:0], pulsing READ_EN between bytes. The FSM transparently computes the next element while reading.
5. Repeat step 4 until DONE goes high.

Sanity-check vectors:

- A = I_3, any B → C == B.
- A = B = [[15]*3]*3 → every C[i][j] = 675 (0x02A3, byte0=0x02, byte1=0xA3).

The cocotb testbench in test/test.py covers identity, a small hand-computed case, the maximum-value case, and a randomized case.
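
On the host side, the data handling for the procedure above might look like the following sketch. It covers only packing and unpacking, not pin I/O, and it is not taken from test/test.py. The helper names `to_nibble_stream`, `decode_result`, and `expected` are hypothetical, and the sketch assumes the elements of C arrive in row-major order.

```python
def to_nibble_stream(a, b):
    """Flatten A then B, row-major, into the 18 nibbles sent on ui_in[3:0]."""
    stream = [v for m in (a, b) for row in m for v in row]
    assert len(stream) == 18 and all(0 <= v <= 15 for v in stream)
    return stream


def decode_result(raw_bytes):
    """Rebuild C from 18 bytes read on uo_out: (byte 0, byte 1) per element.

    Byte 0 carries bits [9:8] and byte 1 carries bits [7:0].
    Row-major element order is an assumption of this sketch.
    """
    assert len(raw_bytes) == 18
    flat = [((raw_bytes[n] & 0x3) << 8) | raw_bytes[n + 1]
            for n in range(0, 18, 2)]
    return [flat[r * 3:(r + 1) * 3] for r in range(3)]


def expected(a, b):
    """Software reference result used to check what the chip returns."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Sanity-check vectors from the text.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
some_b = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert expected(identity, some_b) == some_b

all_max = [[15] * 3 for _ in range(3)]
assert decode_result([0x02, 0xA3] * 9) == expected(all_max, all_max)
```
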

No external hardware is required beyond a host: a microcontroller, FPGA, or USB-GPIO interface is enough to drive the nibble-stream protocol. No buffers, level shifters, or analog parts are needed.

| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | DATA_IN[0] | DATA_OUT[0] | BUSY (out) |
| 1 | DATA_IN[1] | DATA_OUT[1] | DONE (out) |
| 2 | DATA_IN[2] | DATA_OUT[2] | OUT_VALID (out) |
| 3 | DATA_IN[3] | DATA_OUT[3] | LOAD_EN (in) |
| 4 | DATA_IN[4] | DATA_OUT[4] | START (in) |
| 5 | DATA_IN[5] | DATA_OUT[5] | READ_EN (in) |
| 6 | DATA_IN[6] | DATA_OUT[6] | |
| 7 | DATA_IN[7] | DATA_OUT[7] | |
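
For host software, the bidirectional-bus column of the table above can be written as bit masks. The bit positions are copied from the table; the constant and function names are hypothetical, and how the host actually samples or drives the bus depends on the interface in use.

```python
# Bit positions on the bidirectional bus, copied from the pin table.
BUSY = 1 << 0        # driven by the chip
DONE = 1 << 1        # driven by the chip
OUT_VALID = 1 << 2   # driven by the chip
LOAD_EN = 1 << 3     # driven by the host
START = 1 << 4       # driven by the host
READ_EN = 1 << 5     # driven by the host

STATUS_MASK = BUSY | DONE | OUT_VALID      # bits 0..2
CONTROL_MASK = LOAD_EN | START | READ_EN   # bits 3..5


def decode_status(bus_value):
    """Split a sampled bus byte into the three status flags."""
    return {
        "busy": bool(bus_value & BUSY),
        "done": bool(bus_value & DONE),
        "out_valid": bool(bus_value & OUT_VALID),
    }
```
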