
The OCP MXFP8 Streaming MAC Unit is a high-performance, area-optimized arithmetic core designed for AI inference acceleration. It implements the Open Compute Project (OCP) Microscaling Formats (MX) Specification v1.0, supporting a range of 8-bit and sub-8-bit floating-point and integer formats with hardware-accelerated shared scaling.
The unit is configured in its "Full" edition (2x2 tiles), featuring:
To maintain a minimal IO footprint (8-bit ports), the unit uses a 41-cycle streaming protocol to process a block of 32 elements ($k=32$).
| Cycle | Input ui_in[7:0] | Input uio_in[7:0] | Output uo_out[7:0] | Description |
|---|---|---|---|---|
| 0 | Metadata 0 | Metadata 1 | 0x00 | IDLE: Load MX+ / Debug or Start Fast Protocol. |
| 1 | Scale A | Format A / BM A | 0x00 | Load Scale A, Format A, and BM Index A. |
| 2 | Scale B | Format B / BM B | 0x00 | Load Scale B, Format B, and BM Index B. |
| 3-34 | Element $A_i$ | Element $B_i$ | 0x00 | Stream 32 pairs of elements.* |
| 35-36 | - | - | 0x00 | Pipeline flush & final scaling. |
| 37-40 | - | - | Result [31:0] | Serialized 32-bit result (MSB first). |
*Note: In Packed Mode (uio_in[6]=1 in Cycle 0), the STREAM phase is reduced to 16 cycles (Cycles 3-18).
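
The cycle table above can be turned into a small host-side helper. The following is a minimal sketch, not the project's own test code: the function names (`build_frame`, `assemble_result`, `fixed_to_float`) are hypothetical, driving zeros during cycles 35-40 is an assumption (the table marks the inputs as don't-care), and treating the result as a signed two's-complement word with 8 fractional bits is an assumption based on the "8-bit fractional format" mentioned in the basic test.

```python
def build_frame(meta0, meta1, scale_a, cfg_a, scale_b, cfg_b, elems_a, elems_b):
    """Return the 41 (ui_in, uio_in) byte pairs for one standard (unpacked) block."""
    if len(elems_a) != 32 or len(elems_b) != 32:
        raise ValueError("a standard block carries exactly 32 element pairs")
    frame = [(meta0, meta1), (scale_a, cfg_a), (scale_b, cfg_b)]  # cycles 0-2
    frame += list(zip(elems_a, elems_b))                          # cycles 3-34
    frame += [(0x00, 0x00)] * 6  # cycles 35-40: inputs unused (zeros are an assumption)
    return frame


def assemble_result(out_bytes):
    """Combine the four uo_out bytes from cycles 37-40 (MSB first) into one word."""
    if len(out_bytes) != 4:
        raise ValueError("expected exactly four result bytes")
    word = 0
    for byte in out_bytes:
        word = (word << 8) | (byte & 0xFF)
    return word


def fixed_to_float(word, frac_bits=8):
    """Interpret a 32-bit word as signed fixed point (assumed two's complement)."""
    if word & 0x80000000:
        word -= 1 << 32
    return word / (1 << frac_bits)


# Stimulus for the basic E4M3 test: unit scales, all elements 1.0 (0x38).
frame = build_frame(0x00, 0x00, 0x7F, 0x00, 0x7F, 0x00, [0x38] * 32, [0x38] * 32)
```

With the expected result bytes `00 00 20 00`, `assemble_result` yields `0x00002000`, which `fixed_to_float` reads as 32.0.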
The unit captures configuration and scaling data during the first three cycles of the protocol.

Cycle 0: Metadata 0 (ui_in)

- ui_in[7]: 1: Reuse previous scales/formats; immediately jump to Cycle 3.
- ui_in[6]: 1: Enable internal probing and metadata echo at the end of the block.
- ui_in[5]: 1: Direct input-to-output mapping for physical connectivity testing.
- ui_in[4:3]:
  - 0: Normal (Exact IEEE-like multiplication).
  - 1: LNS (Logarithmic Number System using Mitchell's Approximation).
  - 2: Hybrid (Standard for Block Max elements, LNS for all others).
- ui_in[2:0]: (Standard Start only) Exponent offset for non-Block Max elements in Operand A (MX++).

Cycle 0: Metadata 1 (uio_in)

- uio_in[7]: 1: Enable OCP MX+ extensions (Repurposed exponents and Block Max tracking).
- uio_in[6]: 1: Enable Vector Packing for 4-bit formats (2 elements per byte, Cycles 3-18).
- uio_in[5]: 0: SAT (Saturate to Max/Min), 1: WRAP (Modulo arithmetic).
- uio_in[4:3]:
  - 0: TRN (Truncate/Towards Zero).
  - 1: CEL (Ceil/Towards $+\infty$).
  - 2: FLR (Floor/Towards $-\infty$).
  - 3: RNE (Round-to-Nearest-Ties-to-Even).
- uio_in[2:0]:

Cycle 1: Scale A (ui_in) & Config A (uio_in)

- Scale A (ui_in[7:0]): Shared scale $X_A$ for Operand A; 0x7F corresponds to a scale of 1.0.
- Config A (uio_in[7:0]):
  - BM Index A (uio_in[7:3]): The index (0-31) of the "Block Max" element in Operand A (used in MX+ mode).
  - Format A (uio_in[2:0]): 0: E4M3, 1: E5M2, 2: E3M2, 3: E2M3, 4: E2M1, 5: INT8, 6: INT8_SYM.

Cycle 2: Scale B (ui_in) & Config B (uio_in)

- Scale B (ui_in[7:0]): Shared scale $X_B$ for Operand B; 0x7F corresponds to a scale of 1.0.
- Config B (uio_in[7:0]):
  - BM Index B (uio_in[7:3]): The index (0-31) of the "Block Max" element in Operand B.
  - Format B (uio_in[2:0]): Independent format for Operand B (Enabled if SUPPORT_MIXED_PRECISION=1).

Basic test sequence:

1. Reset: Pulse rst_n low, then set ena high.
2. Cycle 0: Send 0x00 on both ui_in and uio_in for standard E4M3 mode.
3. Cycle 1: Send 0x7F (1.0 scale) on ui_in and 0x00 (E4M3) on uio_in.
4. Cycle 2: Send 0x7F (1.0 scale) on ui_in and 0x00 (E4M3) on uio_in.
5. Cycles 3-34: Send 0x38 (1.0 in E4M3) on both ports.
6. Cycles 35-40: Wait for the pipeline flush, then read the serialized result from uo_out.
7. Expected result: 0x00002000 (representing 32.0 in the system's 8-bit fractional format).

Additional modes:

- Fast start: Set ui_in[7]=1 in Cycle 0 to bypass scale loading. Useful for weight-stationary kernels where scales and formats remain constant across blocks.
- Packed mode: Set uio_in[6]=1 in Cycle 0. Stream two 4-bit elements per byte (High nibble = Element $i+1$, Low nibble = Element $i$).
- A Python script (test/TT_MAC_RUN.PY) is available for quick prototyping.

| Port | Name | Description |
|---|---|---|
| ui_in[7:0] | Operand A / Scale A | Elements $A_i$ or Scale $X_A$. |
| uio_in[7:0] | Operand B / Scale B | Elements $B_i$ or Scale $X_B$. |
| uo_out[7:0] | Result Out | Serialized 32-bit dot product result. |
| clk | Clock | System clock (Target: 20 MHz). |
| rst_n | Reset | Active-low asynchronous reset. |
| ena | Enable | Clock enable. |

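
The Cycle 0-2 bit fields described earlier can be packed on the host with a few shifts and masks. This is an illustrative sketch only: the function and parameter names (`pack_metadata0`, `fast_start`, `mult_mode`, and so on) are hypothetical labels chosen here, not identifiers from the design, and Metadata 1 bits [2:0] are left at zero because their meaning is not documented above.

```python
FORMATS = {"E4M3": 0, "E5M2": 1, "E3M2": 2, "E2M3": 3,
           "E2M1": 4, "INT8": 5, "INT8_SYM": 6}


def pack_metadata0(fast_start=0, debug=0, loopback=0, mult_mode=0, nbm_offset_a=0):
    """Cycle 0 byte on ui_in: [7] fast start, [6] debug, [5] loopback,
    [4:3] multiplier mode (0 normal, 1 LNS, 2 hybrid), [2:0] NBM offset A."""
    return (((fast_start & 1) << 7) | ((debug & 1) << 6) | ((loopback & 1) << 5)
            | ((mult_mode & 0x3) << 3) | (nbm_offset_a & 0x7))


def pack_metadata1(mx_plus=0, packed=0, wrap=0, rounding=0):
    """Cycle 0 byte on uio_in: [7] MX+ enable, [6] packed mode, [5] overflow
    (0 SAT, 1 WRAP), [4:3] rounding (0 TRN, 1 CEL, 2 FLR, 3 RNE).
    Bits [2:0] are not described in the text and are left at zero here."""
    return (((mx_plus & 1) << 7) | ((packed & 1) << 6) | ((wrap & 1) << 5)
            | ((rounding & 0x3) << 3))


def pack_config(bm_index, fmt):
    """Cycle 1/2 byte on uio_in: [7:3] Block Max index, [2:0] format code."""
    if not 0 <= bm_index <= 31:
        raise ValueError("BM index must be in 0..31")
    return (bm_index << 3) | FORMATS[fmt]


def pack_nibbles(elems):
    """Packed mode: low nibble = element i, high nibble = element i+1."""
    if len(elems) % 2:
        raise ValueError("packed mode needs an even number of 4-bit elements")
    return [(elems[i] & 0xF) | ((elems[i + 1] & 0xF) << 4)
            for i in range(0, len(elems), 2)]
```

For example, `pack_config(31, "E2M1")` gives `0xFC`, and 32 four-bit elements pack into the 16 bytes streamed during Cycles 3-18.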
The OCP MX+ extension optimizes quantization by preserving high-precision "outliers" (Block Max elements) while maintaining a low bit-width for the rest of the block.
For a block of $k$ elements, the value of an element $A_i$ is given by: $V(A_i) = S \cdot M_i \cdot 2^{X_A - 127}$ Where:
When MX+ Enable is set, the Block Max (BM) element—identified by BM Index—repurposes its exponent bits as additional mantissa.
Normal Element ($i \neq BM$): Decoded as standard MXFP (e.g., E4M3).
Block Max Element ($i = BM$):
MX++ allows "Non-Block Max" (NBM) elements to use a finer quantization grid than the BM element by applying a secondary exponent offset.
$V(A_{i \neq BM}) = S \cdot M_i \cdot 2^{(X_A - 127) - \text{NBM\_Offset}_A}$
This effectively "zooms in" on the smaller values in the block, lowering the quantization noise floor that a single large outlier would otherwise impose.
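
The two value formulas above can be checked numerically with a short reference model. The sketch below is an assumption-laden illustration, not the hardware's decode path: it reads $S \cdot M_i$ as the signed value of a standard OCP E4M3 element (bias 7, no infinities), it does not model the Block Max element's repurposed exponent bits, and the names `decode_e4m3` and `element_value` are hypothetical.

```python
import math


def decode_e4m3(byte):
    """Decode an OCP FP8 E4M3 byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                        # E4M3 reserves S.1111.111 for NaN
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal: no implicit leading 1
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def element_value(byte, scale, nbm_offset=0):
    """V = S * M * 2^((X - 127) - NBM_Offset); nbm_offset=0 is the plain MX formula."""
    return decode_e4m3(byte) * 2.0 ** ((scale - 127) - nbm_offset)


# Basic-test block: 32 elements of 1.0 (0x38) against 32 elements of 1.0, unit scales.
dot = sum(element_value(0x38, 0x7F) * element_value(0x38, 0x7F) for _ in range(32))
```

Here `dot` evaluates to 32.0, matching the basic test's expected result, and `element_value(0x38, 0x7F, nbm_offset=2)` returns 0.25, showing how the offset shifts non-Block Max elements onto a finer grid.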
In LNS Mode, multiplication $P = A \times B$ is performed in the logarithmic domain: $\log_2(P) = \log_2(A) + \log_2(B)$
To avoid expensive Power/Log circuits, the unit uses Mitchell’s Approximation: $\log_2(1+m) \approx m, \quad m \in [0, 1)$
The product of two significands $(1+m_a)$ and $(1+m_b)$ is approximated as:
$$(1+m_a)(1+m_b) \approx \begin{cases} 1 + m_a + m_b & \text{if } m_a + m_b < 1 \\ 2(m_a + m_b) & \text{if } m_a + m_b \ge 1 \end{cases}$$
This allows the multiplier to be replaced by a simple adder and a shift, reducing hardware area by over 50%.
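
The piecewise approximation above is easy to compare against an exact multiply. This is a floating-point sketch of the arithmetic only, with hypothetical function names; it says nothing about the bit widths or rounding used in the actual datapath.

```python
def mitchell_significand_product(m_a, m_b):
    """Approximate (1 + m_a) * (1 + m_b) using Mitchell's log2(1 + m) ~ m."""
    if not (0.0 <= m_a < 1.0 and 0.0 <= m_b < 1.0):
        raise ValueError("mantissa fractions must lie in [0, 1)")
    total = m_a + m_b
    if total < 1.0:
        return 1.0 + total   # no carry into the exponent
    return 2.0 * total       # carry: result is 2 * (1 + (total - 1))


def relative_error(m_a, m_b):
    """Relative error of the approximation against the exact product."""
    exact = (1.0 + m_a) * (1.0 + m_b)
    return (exact - mitchell_significand_product(m_a, m_b)) / exact
```

For instance, `mitchell_significand_product(0.25, 0.25)` returns 1.5 against an exact 1.5625, and the error at `m_a = m_b = 0.5` is 1/9 (about 11.1%), the well-known worst case of Mitchell's method. The approximation never overestimates, since the dropped term $m_a m_b$ (or $(1-m_a)(1-m_b)$ in the carry case) is non-negative.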
A massive thank you to Matt Venn, Uri Shaked, Sophie, and the entire Tiny Tapeout / IHP community for making open-source silicon a reality. This project was built on the foundation of your incredible tools and dedication.
| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | data_in_a[0] | data_out[0] | data_in_b[0] |
| 1 | data_in_a[1] | data_out[1] | data_in_b[1] |
| 2 | data_in_a[2] | data_out[2] | data_in_b[2] |
| 3 | data_in_a[3] | data_out[3] | data_in_b[3] |
| 4 | data_in_a[4] | data_out[4] | data_in_b[4] |
| 5 | data_in_a[5] | data_out[5] | data_in_b[5] |
| 6 | data_in_a[6] | data_out[6] | data_in_b[6] |
| 7 | data_in_a[7] | data_out[7] | data_in_b[7] |