
This tile delays a bit's value by 502 cycles at speeds above 100 MHz (according to the synthesiser, to be tested). It is a baseline for storage packing density, as well as a test architecture for asynchronous shift registers that are not built from large DFF cells. This version packs 672 standard latches and a controller, filling 87% of the tile's surface.
As the name implies, it's a high-density shift register for deep digital delays. According to the IHP CMOS PDK datasheet at https://github.com/IHP-GmbH/IHP-Open-PDK/blob/main/ihp-sg13g2/libs.ref/sg13g2_stdcell/doc/sg13g2_stdcell_typ_1p20V_25C.pdf , MUX2 is almost 3× smaller than the DFF gate and could be used as a latch by feeding its output back to an input (just like with the old antifuse Actel FPGAs such as the A1xxx). The tools reject this trick, but in the same area I could also implement an SR latch with enable, using compact combined OR/AND gates. This is done in a different tile (see https://github.com/ygdes/ttihp-HDSISO8RS ), but first I need a reliable reference point.
This project, "tt_um_ygdes_hdsiso8_dlhq", uses the conventional transparent latch DLHQ, whose size lies between those of MUX2 and the DFF. The shift register uses 4 latches to store 3 bits at any given time, and 4 non-overlapping "clock" pulses perform the shifting. Slowly. Just like below, but with 8 parallel chains.
"Tranches" are provided with 16, 64 or 256 latches and must be used in "odd-even" pairs, so you get 24, 96 or 384 cycles of delay. You can chain them as long as the surface allows (to a degree). The controller adds another 20 cycles. The 16x tranches need 4 extra inverters if used alone.
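
The "4 latches store 3 bits" principle and the tranche figures above can be checked with a small behavioural model. This is only a sketch, not the tile's Verilog: the function names are hypothetical, and the pulse order (output side first, input side last) is an assumption chosen so that each latch is copied downstream before it is overwritten.

```python
def shift_round(latches, data_in):
    """Apply one round of 4 non-overlapping pulses to a latch chain.

    Assumption: latch i is driven by pulse (i mod 4) and the pulses fire
    from the output side (3) back to the input side (0).
    """
    for phase in (3, 2, 1, 0):
        for i in range(phase, len(latches), 4):
            # A transparent latch copies its predecessor (or the chain input).
            latches[i] = data_in if i == 0 else latches[i - 1]
    return latches[-1]


def rounds_of_delay(n_latches):
    """Count the rounds needed for the first input to reach the output."""
    latches = [None] * n_latches
    for t in range(2 * n_latches):
        # Each round feeds a distinct tag (the round number) into the chain.
        if shift_round(latches, t) == 0:
            return t
    raise RuntimeError("marker never reached the output")


def pair_delay(latches_per_tranche):
    """Cycles of delay for an odd-even pair, at 3 bits per 4 latches."""
    return 2 * latches_per_tranche * 3 // 4


print(rounds_of_delay(4), rounds_of_delay(8))    # 3 6
print([pair_delay(n) for n in (16, 64, 256)])    # [24, 96, 384]
```

In this model every group of 4 latches adds 3 rounds of delay, which is the same 3/4 ratio that turns pairs of 16, 64 or 256 latches into 24, 96 or 384 cycles.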

The apparent complexity comes from the 8-phase clock, which is brought to the "asynchronous" domain. Each of the 8 lanes is 8× slower (which relaxes timing constraints) but the overall throughput is preserved by an intricate demultiplexer and multiplexer. So it "should" work at "full speed", we'll see.
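
To illustrate why slower lanes do not reduce throughput, here is a minimal behavioural sketch of round-robin interleaving. It is not the actual demultiplexer/multiplexer logic: each lane is modelled as a plain FIFO, and the name `interleaved_delay` and the `depth` parameter are hypothetical.

```python
from collections import deque


def interleaved_delay(bits, lanes=8, depth=3):
    """Delay a bit stream through `lanes` round-robin FIFOs of `depth` bits."""
    fifos = [deque([0] * depth) for _ in range(lanes)]
    out = []
    for t, bit in enumerate(bits):
        lane = fifos[t % lanes]      # demultiplexer: pick the lane for this cycle
        out.append(lane.popleft())   # multiplexer: the same lane drives the output
        lane.append(bit)             # each lane takes one new bit every `lanes` cycles
    return out


stream = [1] + [0] * 99              # a single marker bit followed by zeros
delayed = interleaved_delay(stream)
print(delayed.index(1))              # 24
```

One bit enters and one bit leaves on every cycle, yet each lane only moves once every 8 cycles; the total delay is `lanes * depth` cycles.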
Compared to a shift register with normal DFF cells, it could store up to twice as many bits per unit of surface, without the need for full-custom cells, as the size of the controller (sequencer, mux and demux) becomes insignificant when the chain gets longer. Depths of several kilobits are possible without too much hassle (if the synth agrees) and without a mad clock network, which also reduces simultaneous switching noise. Not only are the pulses slower, their traces are also shorter: each pulse affects only 1/8th of the cells at any time.
Ideally, the 8 chains should be manually placed (or with a script), not thrown at random. For implementation, I use a "tuned" Verilog workflow and instantiate cells directly from https://github.com/IHP-GmbH/IHP-Open-PDK/blob/main/ihp-sg13g2/libs.ref/sg13g2_stdcell/verilog/sg13g2_stdcell.v . For simulation, parts of this file are copy-pasted to gate-specific files to remove some warnings (find them in /test, thank you Jeremy!).
You will get a "Synthesis warnings : Warning: There are XXX unclocked register/latch pins." This is normal.
Good to know:
Startup sequence:
Extra insight and observability:

Note in the diagram above that RESET forces all the outputs to 1, thus flushing the whole delay line in less than a microsecond.
An 8-bit LFSR is integrated to ease testing. Thus an oscilloscope and a variable frequency oscillator are enough to characterise the achievable speed. To use it,
Note 1: 8 bits give a period of 255, almost half of the SISO's depth of 502. The SISO therefore stores nearly two whole LFSR periods (2 × 255 = 510), so a small shift is expected: the SISO output precedes the LFSR by 8 cycles. The outputs should align anyway.
Note 2: The LFSR_PERIOD pulse should appear 193 clock cycles after the release of the RESET pin.
Note 3: The RESET signal clears the contents of the SISO. Give it a few cycles for the 0 to propagate through all the latches while it flushes after releasing the RESET.
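
The arithmetic behind Note 1 can be reproduced with a short sketch. The feedback taps below are an assumption (a common maximal-length choice for 8 bits); the taps actually used in the tile are not stated here, and the function names are hypothetical.

```python
def lfsr8_step(state):
    """Advance an 8-bit Fibonacci LFSR by one cycle.

    Assumption: feedback is the XOR of bits 7, 5, 4 and 3, a well-known
    maximal-length tap set; the tile may use different taps.
    """
    bit = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) | bit) & 0xFF


def lfsr8_period(seed=0xFF):
    """Count the cycles until the LFSR returns to its (nonzero) seed."""
    state, steps = lfsr8_step(seed), 1
    while state != seed:
        state = lfsr8_step(state)
        steps += 1
    return steps


LFSR_PERIOD = lfsr8_period()
SISO_DEPTH = 502                        # delay quoted for this tile
LEAD = 2 * LFSR_PERIOD - SISO_DEPTH     # cycles by which the SISO output leads

print(LFSR_PERIOD, LEAD)                # 255 8
```

Any maximal-length 8-bit LFSR gives the same period of 255, so the 8-cycle lead does not depend on the particular taps.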

A basic custom test board will be put together, to hook the variable frequency generator and the oscilloscope probes.
Optionally, if you only want to make a "light chaser", hook 8 LEDs to the IO port, select the external clock and add a 555. Or you can get a funkier pattern by displaying the LFSR's state, by setting SHOW_LFSR to 1.
| # | Input | Output | Bidirectional |
|---|---|---|---|
| 0 | CLK_SEL | D_OUT | PULSE #0 |
| 1 | EXT_CLK | CLK_OUT | PULSE #1 |
| 2 | EXT_RST | Johnson #0 | PULSE #2 |
| 3 | D_IN | Johnson #1 | PULSE #3 |
| 4 | | Johnson #2 | PULSE #4 |
| 5 | SHOW_LFSR | Johnson #3 | PULSE #5 |
| 6 | LFSR_EN | LFSR_PERIOD | PULSE #6 |
| 7 | DIN_SEL | LFSR_BIT | PULSE #7 |