
This project implements a 2×2 output-stationary systolic array that computes signed 4-bit matrix multiplication, producing 8-bit results.
The design has 5 modules:
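- A processing element (PE) that computes acc += a_in × weight_in every cycle when en=1.
- The 2×2 systolic array of PEs, in which a_row1 and weight_col1 are delayed by 1 cycle to align data correctly across the array.
- A controller that generates the en, clear, and valid signals automatically after start is pulsed.
- An output stage that places the four accumulators (acc_00, acc_01, acc_10, acc_11) onto data_out one per cycle when valid=1.
- A top-level wrapper that captures the matrix values from ui_in and uio_in, then auto-streams them to the array after start.

Load all 4 register pairs in 4 cycles using load=1 and sel:
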
| sel | ui_in[7:4] (data_c1) | uio_in[7:4] (data_c2) |
|---|---|---|
| 00 | A[0][0] | A[0][1] |
| 01 | A[1][0] | A[1][1] |
| 10 | B[0][0] | B[1][0] |
| 11 | B[0][1] | B[1][1] |
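
The load sequence in the table above can be sketched as a small Python helper that builds the `ui_in` and `uio_in` bytes for each of the four load cycles. This is only an illustration: the function names are hypothetical, the bit positions are taken from the pinout table in this document, and the two's-complement encoding of the signed 4-bit values is an assumption.

```python
def to_nibble(value):
    """Encode a signed 4-bit integer (-8..7) as a two's-complement nibble (assumed encoding)."""
    if not -8 <= value <= 7:
        raise ValueError("value does not fit in signed 4 bits")
    return value & 0xF


def load_words(A, B):
    """Return the four (ui_in, uio_in) byte pairs for sel = 00, 01, 10, 11."""
    pairs = [
        (A[0][0], A[0][1]),  # sel = 00
        (A[1][0], A[1][1]),  # sel = 01
        (B[0][0], B[1][0]),  # sel = 10
        (B[0][1], B[1][1]),  # sel = 11
    ]
    words = []
    for sel, (c1, c2) in enumerate(pairs):
        # ui_in: bit 0 = start (0), bit 1 = load (1), bits 3:2 = sel, bits 7:4 = data_c1
        ui_in = (to_nibble(c1) << 4) | (sel << 2) | (1 << 1)
        # uio_in: bits 7:4 = data_c2; the low bits are left at 0
        uio_in = to_nibble(c2) << 4
        words.append((ui_in, uio_in))
    return words


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, -8]]
for ui_in, uio_in in load_words(A, B):
    print(f"ui_in=0x{ui_in:02X} uio_in=0x{uio_in:02X}")
```

Each pair would be held on the pins for one clock cycle, with `load` returned to 0 afterwards.
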

After loading, pulse start=1 for 1 cycle. The wrapper streams cycle 1 values, then cycle 2 values, then zeros automatically. Results appear on uo_out one per cycle when uio_out[0] (valid) goes high.
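
To see why the one-cycle delay on a_row1 and weight_col1 lines the operands up, the streaming phase can be modeled cycle by cycle in Python. This is a behavioral sketch, not the RTL: it assumes each PE registers its operands and forwards them to its right and lower neighbors, that cycle 1 carries A[0][0], A[1][0], B[0][0], B[0][1] and cycle 2 carries A[0][1], A[1][1], B[1][0], B[1][1], and it ignores the 8-bit width of the accumulators. The function name is hypothetical.

```python
def systolic_2x2(A, B, cycles=4):
    """Cycle-level model of a 2x2 output-stationary array (assumed structure)."""
    # Values presented to the array edge: cycle 1, cycle 2, then zeros.
    a0_stream = [A[0][0], A[0][1]]
    a1_stream = [A[1][0], A[1][1]]
    w0_stream = [B[0][0], B[1][0]]
    w1_stream = [B[0][1], B[1][1]]

    acc = [[0, 0], [0, 0]]
    a1_skew = 0  # one-cycle delay register on a_row1
    w1_skew = 0  # one-cycle delay register on weight_col1
    a_reg = [[0, 0], [0, 0]]  # operand each PE forwards to the right
    w_reg = [[0, 0], [0, 0]]  # weight each PE forwards downward

    for t in range(cycles):
        a0 = a0_stream[t] if t < 2 else 0
        a1 = a1_stream[t] if t < 2 else 0
        w0 = w0_stream[t] if t < 2 else 0
        w1 = w1_stream[t] if t < 2 else 0

        # Inputs each PE sees during this cycle.
        a_in = [[a0, a_reg[0][0]], [a1_skew, a_reg[1][0]]]
        w_in = [[w0, w1_skew], [w_reg[0][0], w_reg[0][1]]]

        # Clock edge: every PE accumulates, then registers its operands.
        for i in range(2):
            for j in range(2):
                acc[i][j] += a_in[i][j] * w_in[i][j]
        a_reg = [row[:] for row in a_in]
        w_reg = [row[:] for row in w_in]
        a1_skew, w1_skew = a1, w1

    return acc


print(systolic_2x2([[1, 2], [3, 4]], [[5, 6], [7, -8]]))
```

Under these assumptions the last product, A[1][1] × B[1][1], reaches the bottom-right PE on the fourth cycle, which is why zeros are streamed after the two data cycles.
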
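
Computes C = A × B, where A and B are 2×2 matrices of signed 4-bit values and each entry of C is an 8-bit result.

1. Reset (rst_n=0 for 5 cycles, then rst_n=1).
2. Load A and B using load=1, sel, ui_in[7:4], uio_in[7:4].
3. Pulse start=1 for 1 cycle.
4. Wait for uio_out[0] (valid) to go high.
5. Read the results from uo_out.

| # | Input | Output | Bidirectional |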
|---|---|---|---|
| 0 | start | data_out[0] | valid |
| 1 | load | data_out[1] | |
| 2 | sel[0] | data_out[2] | |
| 3 | sel[1] | data_out[3] | |
| 4 | data_c1[0] | data_out[4] | data_c2[0] |
| 5 | data_c1[1] | data_out[5] | data_c2[1] |
| 6 | data_c1[2] | data_out[6] | data_c2[2] |
| 7 | data_c1[3] | data_out[7] | data_c2[3] |
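
When checking the chip or a simulation against the pinout above, a small reference model helps to predict the bytes that should appear on `uo_out`. The sketch below is a plain Python illustration with hypothetical function names. It assumes the results are streamed in the order acc_00, acc_01, acc_10, acc_11 and that each 8-bit result is a two's-complement value.

```python
def matmul_2x2(A, B):
    """Plain 2x2 integer matrix product C = A x B."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def expected_uo_out(A, B):
    """Bytes expected on uo_out, assuming the order acc_00, acc_01, acc_10, acc_11."""
    C = matmul_2x2(A, B)
    order = [C[0][0], C[0][1], C[1][0], C[1][1]]
    # Wrap to 8 bits. The only sum outside -128..127 is 128 (all four operands
    # equal to -8); wrapping it here is an assumption about the hardware.
    return [c & 0xFF for c in order]


def to_signed8(byte):
    """Interpret a byte read from uo_out as a signed 8-bit value."""
    return byte - 256 if byte & 0x80 else byte


def is_valid(uio_out):
    """Bit 0 of the bidirectional port carries the valid flag."""
    return bool(uio_out & 0x01)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, -8]]
print(expected_uo_out(A, B))                           # [19, 246, 43, 242]
print([to_signed8(b) for b in expected_uo_out(A, B)])  # [19, -10, 43, -14]
```

A test bench would sample `uo_out` only on cycles where `is_valid(uio_out)` is true and compare the four samples with this list.
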