Commit f5c3ef4
Rebase/cache refactoring onto main (#20)
* [RTL] Add folded data bank plumbing
- wire DataPartSplit/folded params through cluster/group/tile
- implement skewed folded data SRAM mapping in cachepool_tile
- adjust cluster wrapper tb for the new configuration
* [SW] Add scalar and vector cache tests with 128-bit aligned accesses
- add scalar cache tests that run basic and stress patterns without crossing 128-bit parts
- add vector cache tests that use RVV loads/stores on 128-bit chunks and verify data integrity
- integrate both tests into the test CMake and keep patterns aligned to folded-cache part size
* [RTL] prioritize writes in skewed folded bank selection
Select a single part per column/bank per cycle and prioritize write parts over read parts to avoid clobbering bank signals.
* cachepool folded-path integration and sim-time protocol hardening
- cachepool_tile: use EffectiveCoalFactor=1 in folded mode; pass to cache ctrl.
- cachepool_cc: size Spatz response FIFO with NumSpatzOutstandingLoads; add overflow assert.
- tcdm_cache_interco: add non-synthesis outstanding scoreboard/asserts for req/rsp matching.
* [SW] Add cache mix smoke and pressure tests for scalar and vector cache access interleaving.
* [RTL] plumb hash-way folded cache integration
Bump insitu-cache to the folded/hash-way revision, thread
UseHashWaySelect through cluster/tile, and queue Spatz memory
responses through the local response FIFO instead of bypassing
write acks.
* [RTL] Bump insitu cache dep.
* [RTL] Add Spatz<->TCDM id-indexed req/rsp scoreboard for debug
cachepool_cc: per-port sb_q[user.req_id] slot table for out-of-order
rsp matching; watchdog dumps stuck ids. Gated by parameter
(default off, +define+ENABLE_SPATZ_REQ_SCOREBOARD to enable).
* [RTL] tile: propagate skew-bank grant to prevent silent read drops
The skew-bank arbiter at (col, bank_sel) picks writes over reads
without exposing the loser; a hardwired l1_data_bank_gnt=1 caused the
upstream to consume stale rdata when another way wrote the same
column. Compute any_other_write_in_col (loop-free, depends only on
part_we) and gate gnt by it: writes always granted, reads granted iff
no OTHER way writes the same (col, bank_sel). Excludes own way's
writes so own idle words aren't spuriously stalled. Fixes multi-core
coherence in rlc-mimic and unlocks AllowReadDuringWrite=1 on data
banks.
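The grant rule described above can be modelled behaviourally in C. This is only a sketch under stated assumptions: the `way_req_t` struct and the function signatures are hypothetical stand-ins for the per-way `part_we` and (col, bank_sel) signals, not the RTL in cachepool_tile.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-way view of one cycle; field names are illustrative only. */
typedef struct {
    bool     we;       /* this way drives a write part this cycle */
    unsigned col;      /* target column */
    unsigned bank_sel; /* target bank within the column */
} way_req_t;

/* True if any way other than `self` writes the same (col, bank_sel).
 * Depends only on the write enables, so it adds no combinational loop. */
bool any_other_write_in_col(const way_req_t *req, size_t num_ways, size_t self)
{
    for (size_t w = 0; w < num_ways; w++) {
        if (w == self) {
            continue; /* own writes never stall own reads or idle words */
        }
        if (req[w].we && req[w].col == req[self].col &&
            req[w].bank_sel == req[self].bank_sel) {
            return true;
        }
    }
    return false;
}

/* Writes are always granted; reads are granted iff no OTHER way
 * writes the same (col, bank_sel) in this cycle. */
bool bank_gnt(const way_req_t *req, size_t num_ways, size_t self)
{
    return req[self].we || !any_other_write_in_col(req, num_ways, self);
}
```

With one way writing (col 1, bank 0), a second way reading the same slot is not granted and must retry, while a third way reading a different column is granted.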
* [SW] fix runtime/test bugs
- l1cache: flush+wait before xbar commit so the reconfig doesn't leave
dirty lines bound to the old hash layout.
- mcs-lock: move cluster barrier before the non-zero-core spin loop
(otherwise cores 1+ never barrier and core 0 deadlocks).
- load-store: print the correct buffer name (B/C, not A) in the B/C
error messages; add c_ptr to the pointer dump.
- idotp-32b: include got/expected in Check Failed! print.
* [SW] add cache-{coverage,coverage-min,line-rw-smoke,rlc-mimic,vector-rw} tests
Register five new cache-focused tests in CMakeLists.txt:
- cache-line-rw-smoke: single-core line-granular RW smoke
- cache-rlc-mimic: RLC traffic mimic (vector load/store)
- cache-vector-rw: multi-iteration vector load-store kernel
- cache-coverage: 12-phase multi-core cache stress / coverage
- cache-coverage-min: minimal phase-06 writeback-loss repro
* [VERIF] enable Spatz-SB + add tile-level memory-model VIP
- Bender.lock: bump insitu-cache to the rev with the wrapper/coalescer
SBs and the SYNC_CTRL_CHECK_PEND fix.
- Makefile: define ENABLE_SPATZ_REQ_SCOREBOARD so the in-RTL Spatz
req/rsp watchdog is on by default.
- cachepool_tile.sv: per-port pre-strip TCDM req tracer
(+sb_pretrace_addr_lo/hi) and byte-granular shadow-memory model
(+mm_enable) that $errors on DATA / TYPE / ORPHAN_RSP mismatches.
Both passive, off by default, sim-only.
* [Fix] axi_user_width sizing and group xbar req/rsp slot routing
- config.mk: derive axi_user_width as base + 2*(idx_width(num_tiles)-1).
Previous widths truncated bank_id MSB on the AXI loopback, routing
cache_ctrl refill responses to the icache bypass slot.
- cachepool_group.sv: use the source tile id `t` (not target_tile) for
the request destination slot, so the response (routed by user.tile_id
mod NumRemotePortCore) lands on the same xbar mst port as the request.
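The width formula above can be checked with a few lines of C. This sketch assumes that `idx_width` in config.mk follows the usual convention of ceil(log2(n)) with a minimum of 1; the `base` value is a placeholder, since the commit message does not state it.

```c
/* Assumed convention: ceil(log2(num_idx)), but never less than 1. */
unsigned idx_width(unsigned num_idx)
{
    unsigned bits = 0;
    while (bits < 31u && (1u << bits) < num_idx) {
        bits++;
    }
    return bits > 0u ? bits : 1u;
}

/* axi_user_width = base + 2 * (idx_width(num_tiles) - 1), as in the message.
 * `base` is whatever the single-tile user width is (not given in the source). */
unsigned axi_user_width(unsigned base, unsigned num_tiles)
{
    return base + 2u * (idx_width(num_tiles) - 1u);
}
```

Under that assumption, the tile term contributes nothing for one or two tiles, two extra bits for four tiles, and six extra bits for sixteen tiles.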
* [SW] cache-mix-pressure: keep per-core offset 4-byte aligned
The `win` offset combined `it * 64u + cid * 7u`. The `cid * 7u` term
is a multiple of 4 only when cid is, so `wp = (base + win + j * 4U)`
ended up unaligned for every core whose cid is not a multiple of 4
(all non-zero cores on a 4-core config). Snitch raises a misaligned
load/store exception for unaligned uint32_t accesses, and this
runtime has no exception handler installed, so cores 1+ entered a
trap loop at PC 0x800005fc while cores 0/2/3 stalled at the next
sync_all. Result: the test always timed out without printing UART.
Change the per-core stride to `cid * 28u` (= 7 * 4) so the offset
stays varied per core but is always 4-aligned, restoring the
original "varied window" intent. Test now passes with retval=0.
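The alignment argument can be reproduced in isolation. The sketch below assumes `win` is a byte offset added to a 4-byte-aligned `base`; the helper names are hypothetical and only the two offset expressions come from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old per-core window offset: cid * 7 is not a multiple of 4 for cid 1..3. */
uint32_t win_old(uint32_t it, uint32_t cid) { return it * 64u + cid * 7u; }

/* Fixed offset: cid * 28 (= 7 * 4) still varies per core but stays 4-aligned. */
uint32_t win_new(uint32_t it, uint32_t cid) { return it * 64u + cid * 28u; }

/* Byte address of word j in the window, mirroring (base + win + j * 4U). */
uintptr_t word_addr(uintptr_t base, uint32_t win, uint32_t j)
{
    return base + win + (uintptr_t)j * 4u;
}

/* A uint32_t access is naturally aligned when the low two address bits are 0. */
bool is_word_aligned(uintptr_t addr) { return (addr & 3u) == 0u; }
```

For an aligned base, `win_old` gives misaligned addresses for cid 1, 2 and 3 (offsets 7, 14 and 21), while `win_new` keeps every core aligned.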
* [SW] mcs-lock: drop deadlocking exit pattern
After the per-core stats printf, non-zero cores entered `while(1){}`
and were never able to reach the second `snrt_cluster_hw_barrier()`
below. Core 0 then waited forever at that barrier for cores 1+.
Result: the kernel never reached `return 0`, _snrt_exit was never
called, and EOC was never asserted -- the sim always timed out.
Removing the if/while-loop (and the now-pointless second barrier)
lets every core return cleanly; _snrt_exit only fires set_eoc on
core 0 anyway, and the other cores halt naturally.
mcs-lock now reaches EOC retval=0 cleanly.
* [SW] fft-32b: add 1024-point / 4-core variant
The existing fft-32b_M1024_N16 test is parameterized for 16 cores --
data_1024_16.h has active_cores=16 baked in and the kernel slices
the work by active_cores. On a 4-core config only 4/16 of the FFT
actually executes, so the output is uniformly wrong (r:1024,i:1024)
and the test self-fails with retval=1.
Add a 1024_4 variant alongside, generated via gen_data.py from a
new fft_1024_4.json config. Both variants now coexist; the N16
variant is appropriate for 4t/16c and the N4 variant for 1t/4c.
The new 1024_4 test passes cleanly (r:0, i:0, retval=0).
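The 4/16 figure follows directly from slicing by a baked-in core count. A rough sketch, assuming the work divides evenly across `active_cores` and is counted in points (the real kernel may slice butterflies or stages differently):

```c
/* Points actually processed when the data header assumes `active_cores`
 * workers but only `running_cores` cores exist. Names are illustrative. */
unsigned points_covered(unsigned n_points, unsigned active_cores,
                        unsigned running_cores)
{
    unsigned per_core = n_points / active_cores;
    unsigned workers  = running_cores < active_cores ? running_cores
                                                     : active_cores;
    return per_core * workers;
}
```

With 1024 points, `active_cores` = 16 and 4 running cores, only 256 points are processed; with `active_cores` = 4 all 1024 are covered.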
* [CFG] add cachepool_2t_fpu_512 (2 tile / 8 core) bisect config
Midpoint between cachepool_fpu_512 (1t/4c — passes) and
cachepool_4t_fpu_512 (4t/16c — broken). Used to isolate whether the
multi-tile cache failures are specific to 4 tiles or to any
configuration with NumTiles > 1. cache-line-rw-smoke fails at 2t/8c
with the same DATA-MISMATCH signature seen at 4t/16c, confirming
the bug is in the inter-tile / group-xbar path itself, not a
4-tile-only artefact.
* [SW] add minimal-tile0-repro for multi-tile bug isolation
Reduces cache-line-rw-smoke to the smallest pattern that still
triggers the multi-tile cache bug:
- only core 0 does work (1 store + 1 load to one cache line)
- all other cores immediately return 0
- no printf, no library calls
- 16 words written + read
On cachepool_fpu_512 (1 tile) this passes cleanly.
On cachepool_2t_fpu_512 and cachepool_4t_fpu_512 the SB still flags
RESP DATA MISMATCH on cache lines touched by the startup/exit
runtime path (not by the test data). The test's own data check
PASSES because the corrupted line is not the test's buf line, but
the underlying cache-state bug is reproduced.
Conclusion from this repro: the bug fires the moment NumTiles > 1
even on purely single-core local activity -- it is NOT a coherence
problem (no cross-tile sharing happens here) and NOT a remote-port
routing problem (no real remote traffic from cores 1+). The
suspect surface narrows to the multi-tile-conditional rotation
math in tcdm_cache_interco/cachepool_tile (bits_to_rotate widens
from CacheBankBits to CacheBankBits+TileBits at NumTiles>1) or the
remote-port muxing inside the local cache_ctrl when those ports
are wired in even though they carry no traffic.
Use for future waveform-level debug:
    make vsim config=cachepool_2t_fpu_512 -B
    ./sim/bin/cachepool_cluster.vsim \
        software/build/CachePoolTests/test-cachepool-minimal-tile0-repro
* [RTL] cluster/group: propagate UseHashWaySelect through multi-tile path
The multi-tile (NumTiles>1) cluster instantiates cachepool_group, which then
instantiates cachepool_tile -- but cachepool_group did not forward
UseHashWaySelect, so the tile fell back to its own 1'b0 default. This silently
disabled hash-way select on every multi-tile build and triggered the
forwarding-buffer / skewed-fold data-corruption path. Add the missing
parameter wiring; default to 1'b1 to match cachepool_cluster.sv.
* [Bender] bump insitu-cache to zexin/sync-flush-fixes
* [VERIF] cc/interco: add plusarg-gated write-ack + addr-watch probes
* [VERIF] cc: demote benign EOC write-ack FIFO tail to info
* [VERIF] cc/tile: guard debug probes with ifndef TARGET_SYNTHESIS
* [Bender] bump insitu-cache lock to tcdm_wrapper comb-loop fix (2710920)
* [VERIF] cc/tile/interco: wrap long lines + verible waivers for debug probes
* [Lint] cluster: fix W110 user_i width mismatch; waive W123 false-positives
- cachepool_cluster.sv: zero-extend refill_user_t to the AxiUserWidth user_i
port of reqrsp_to_axi + ASSERT_INIT(AxiUserWidth >= $bits(refill_user_t)).
Behavior-preserving (was an implicit zero-extend); closes the refill-misroute
hazard at elaboration.
- config.mk: correct stale axi_user_width comment (cache_info_t has no tile_id;
the tile term is over-provisioned headroom, not an exact fit).
- lint.tcl: DU-scoped W123 waivers for cachepool_cache_ctrl (coalescer_resp/
bypass_resp driven via i_bypass_xbar slv_rsp_o aggregate) + spatz_decoder.
* [TEST] multi_producer: atomic rlc_ctx updates for multi-consumer
Mark vtNext/pduWithoutPoll/byteWithoutPoll _Atomic and use atomic_fetch_add
(unique per-PDU SN + shared stats) instead of plain +=; add a release fence
before each lock unlock. Fixes the multi-consumer data race; K100 passes
eoc_clean with 0 scoreboard mismatches on 2t/4t x rp1/rp2.
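The pattern described above can be sketched with C11 atomics. The field names `vtNext`, `pduWithoutPoll` and `byteWithoutPoll` come from the message; their types, the `rlc_ctx_t` layout, the `atomic_flag` spinlock and all function names are assumptions for illustration, not the test's actual code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical context layout; only the three counter names are from the source. */
typedef struct {
    _Atomic uint32_t vtNext;          /* next sequence number to hand out */
    _Atomic uint32_t pduWithoutPoll;  /* shared statistics */
    _Atomic uint32_t byteWithoutPoll;
    atomic_flag      lock;            /* stand-in for the test's lock */
} rlc_ctx_t;

void rlc_ctx_init(rlc_ctx_t *ctx)
{
    atomic_init(&ctx->vtNext, 0u);
    atomic_init(&ctx->pduWithoutPoll, 0u);
    atomic_init(&ctx->byteWithoutPoll, 0u);
    atomic_flag_clear(&ctx->lock);
}

/* atomic_fetch_add returns the previous value, so each caller gets a
 * unique per-PDU SN even when several consumers race. */
uint32_t rlc_alloc_sn(rlc_ctx_t *ctx, uint32_t pdu_bytes)
{
    uint32_t sn = atomic_fetch_add(&ctx->vtNext, 1u);
    atomic_fetch_add(&ctx->pduWithoutPoll, 1u);
    atomic_fetch_add(&ctx->byteWithoutPoll, pdu_bytes);
    return sn;
}

void rlc_lock(rlc_ctx_t *ctx)
{
    while (atomic_flag_test_and_set_explicit(&ctx->lock, memory_order_acquire)) {
        /* spin until the holder clears the flag */
    }
}

/* Release fence before the unlock, so writes made under the lock are
 * visible to the next core that acquires it. */
void rlc_unlock(rlc_ctx_t *ctx)
{
    atomic_thread_fence(memory_order_release);
    atomic_flag_clear_explicit(&ctx->lock, memory_order_relaxed);
}
```

A plain `ctx->vtNext += 1` on a non-atomic field would let two consumers read the same value and emit duplicate SNs; the fetch-add version removes that window without requiring the lock around the counter update.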
* [Lint] address review: relocate tile verif, fix W123 root-cause, always_ff/rst_ni
- Move per-tile TCDM tracer + memory-model VIP out of cachepool_tile.sv into
hardware/src/verif/cachepool_tile_tcdm_checker.sv (bind-attached); keeps the
RTL body synthesis-clean.
- tcdm_cache_interco: put rst_ni in the Probe-D always_ff sensitivity list.
- Drop the cachepool_cache_ctrl W123 waiver: root-caused to '{}-as-lvalue on
the bypass_xbar output ports (fixed in insitu-cache to concatenation {});
lint confirms the cache_ctrl W123 is gone.
- Bump insitu-cache lock to 65940a3 (the concatenation fix).
* [CFG] make L1 folded/hash-way/fwd-buffer config-selectable
- Add l1d_use_folded / l1d_fold_way_group / l1d_use_hash_way / l1d_use_fwd_buf
knobs to cachepool_512.mk + cachepool_fpu_512.mk (default = production:
folded+hash+fwd); emit them as VLOG_DEFS macros.
- Thread UseForwardingBuffer through cluster->group->tile->cache_ctrl, and
macro-default UseFoldedDataBanks/FoldWayGroup/UseHashWaySelect/UseForwardingBuffer
at the wrapper from those macros.
- Fix dropped param paths: forward UseFoldedDataBanks/FoldWayGroup through the
cluster->group instantiation, and UseHashWaySelect/UseForwardingBuffer through
wrapper->cluster (previously stuck at module defaults).
- Bump insitu-cache lock to fbabd6a (fwd-buffer param + PartSplit=1 fixes).
Verified: folded default passes bit-identically; unfolded conventional configuration passes.
1 parent 51deb23 commit f5c3ef4
36 files changed
Lines changed: 3289 additions & 109 deletions
File tree
- config
- hardware
- src
- verif
- tb
- software
- snRuntime/src
- tests
- cache-coverage-min
- cache-coverage
- cache-line-rw-smoke
- cache-mix-pressure
- cache-mix-smoke
- cache-rlc-mimic
- cache-test-scalar
- cache-test-vector
- cache-vector-rw
- fft-32b
- data
- script
- idotp-32b
- mcs-lock
- minimal-tile0-repro
- multi_producer_single_consumer_double_linked_list/kernel
- util/lint/script