
Commit f5c3ef4

Rebase/cache refactoring onto main (#20)
* [RTL] Add folded data bank plumbing
  - wire DataPartSplit/folded params through cluster/group/tile
  - implement skewed folded data SRAM mapping in cachepool_tile
  - adjust cluster wrapper tb for the new configuration

* [SW] Add scalar and vector cache tests with 128-bit aligned accesses
  - add scalar cache tests that run basic and stress patterns without crossing 128-bit parts
  - add vector cache tests that use RVV loads/stores on 128-bit chunks and verify data integrity
  - integrate both tests into the test CMake and keep patterns aligned to folded-cache part size

* [RTL] prioritize writes in skewed folded bank selection
  Select a single part per column/bank per cycle and prioritize write parts over read parts to avoid clobbering bank signals.

* cachepool folded-path integration and sim-time protocol hardening
  - cachepool_tile: use EffectiveCoalFactor=1 in folded mode; pass to cache ctrl.
  - cachepool_cc: size Spatz response FIFO with NumSpatzOutstandingLoads; add overflow assert.
  - tcdm_cache_interco: add non-synthesis outstanding scoreboard/asserts for req/rsp matching.

* [SW] Add cache mix smoke and pressure tests for scalar and vector cache access interleaving.

* [RTL] plumb hash-way folded cache integration
  Bump insitu-cache to the folded/hash-way revision, thread UseHashWaySelect through cluster/tile, and queue Spatz memory responses through the local response FIFO instead of bypassing write acks.

* [RTL] Bump insitu cache dep.

* [RTL] Add Spatz<->TCDM id-indexed req/rsp scoreboard for debug
  cachepool_cc: per-port sb_q[user.req_id] slot table for out-of-order rsp matching; watchdog dumps stuck ids. Gated by parameter (default off, +define+ENABLE_SPATZ_REQ_SCOREBOARD to enable).

* [RTL] tile: propagate skew-bank grant to prevent silent read drops
  The skew-bank arbiter at (col, bank_sel) picks writes over reads without exposing the loser; a hardwired l1_data_bank_gnt=1 caused the upstream to consume stale rdata when another way wrote the same column.
  Compute any_other_write_in_col (loop-free, depends only on part_we) and gate gnt by it: writes always granted, reads granted iff no OTHER way writes the same (col, bank_sel). Excludes own way's writes so own idle words aren't spuriously stalled. Fixes multi-core coherence in rlc-mimic and unlocks AllowReadDuringWrite=1 on data banks.

* [SW] fix runtime/test bugs
  - l1cache: flush+wait before xbar commit so the reconfig doesn't leave dirty lines bound to the old hash layout.
  - mcs-lock: move cluster barrier before the non-zero-core spin loop (otherwise cores 1+ never barrier and core 0 deadlocks).
  - load-store: print the correct buffer name (B/C, not A) in the B/C error messages; add c_ptr to the pointer dump.
  - idotp-32b: include got/expected in Check Failed! print.

* [SW] add cache-{coverage,coverage-min,line-rw-smoke,rlc-mimic,vector-rw} tests
  Register five new cache-focused tests in CMakeLists.txt:
  - cache-line-rw-smoke: single-core line-granular RW smoke
  - cache-rlc-mimic: RLC traffic mimic (vector load/store)
  - cache-vector-rw: multi-iteration vector load-store kernel
  - cache-coverage: 12-phase multi-core cache stress / coverage
  - cache-coverage-min: minimal phase-06 writeback-loss repro

* [VERIF] enable Spatz-SB + add tile-level memory-model VIP
  - Bender.lock: bump insitu-cache to the rev with the wrapper/coalescer SBs and the SYNC_CTRL_CHECK_PEND fix.
  - Makefile: define ENABLE_SPATZ_REQ_SCOREBOARD so the in-RTL Spatz req/rsp watchdog is on by default.
  - cachepool_tile.sv: per-port pre-strip TCDM req tracer (+sb_pretrace_addr_lo/hi) and byte-granular shadow-memory model (+mm_enable) that $errors on DATA / TYPE / ORPHAN_RSP mismatches. Both passive, off by default, sim-only.

* [Fix] axi_user_width sizing and group xbar req/rsp slot routing
  - config.mk: derive axi_user_width as base + 2*(idx_width(num_tiles)-1). Previous widths truncated bank_id MSB on the AXI loopback, routing cache_ctrl refill responses to the icache bypass slot.
  - cachepool_group.sv: use the source tile id `t` (not target_tile) for the request destination slot, so the response (routed by user.tile_id mod NumRemotePortCore) lands on the same xbar mst port as the request.

* [SW] cache-mix-pressure: keep per-core offset 4-byte aligned
  The `win` offset combined `it * 64u + cid * 7u`. The `cid * 7u` term is odd for cid > 0, so `wp = (base + win + j * 4U)` ended up unaligned for any non-zero core. Snitch raises a misaligned load/store exception for unaligned uint32_t accesses, and this runtime has no exception handler installed, so cores 1+ entered a trap loop at PC 0x800005fc while cores 0/2/3 stalled at the next sync_all. Result: the test always timed out without printing UART.
  Change the per-core stride to `cid * 28u` (= 7 * 4) so the offset stays varied per core but is always 4-aligned, restoring the original "varied window" intent. Test now passes with retval=0.

* [SW] mcs-lock: drop deadlocking exit pattern
  After the per-core stats printf, non-zero cores entered `while(1){}` and were never able to reach the second `snrt_cluster_hw_barrier()` below. Core 0 then waited forever at that barrier for cores 1+. Result: the kernel never reached `return 0`, _snrt_exit was never called, and EOC was never asserted -- the sim always timed out.
  Removing the if/while-loop (and the now-pointless second barrier) lets every core return cleanly; _snrt_exit only fires set_eoc on core 0 anyway, and the other cores halt naturally. mcs-lock now reaches EOC retval=0 cleanly.

* [SW] fft-32b: add 1024-point / 4-core variant
  The existing fft-32b_M1024_N16 test is parameterized for 16 cores -- data_1024_16.h has active_cores=16 baked in and the kernel slices the work by active_cores. On a 4-core config only 4/16 of the FFT actually executes, so the output is uniformly wrong (r:1024,i:1024) and the test self-fails with retval=1.
  Add a 1024_4 variant alongside, generated via gen_data.py from a new fft_1024_4.json config. Both variants now coexist; the N16 variant is appropriate for 4t/16c and the N4 variant for 1t/4c. The new 1024_4 test passes cleanly (r:0, i:0, retval=0).

* [CFG] add cachepool_2t_fpu_512 (2 tile / 8 core) bisect config
  Midpoint between cachepool_fpu_512 (1t/4c: passes) and cachepool_4t_fpu_512 (4t/16c: broken). Used to isolate whether the multi-tile cache failures are specific to 4 tiles or to any configuration with NumTiles > 1.
  cache-line-rw-smoke fails at 2t/8c with the same DATA-MISMATCH signature seen at 4t/16c, confirming the bug is in the inter-tile / group-xbar path itself, not a 4-tile-only artefact.

* [SW] add minimal-tile0-repro for multi-tile bug isolation
  Reduces cache-line-rw-smoke to the smallest pattern that still triggers the multi-tile cache bug:
    * only core 0 does work (1 store + 1 load to one cache line)
    * all other cores immediately return 0
    * no printf, no library calls
    * 16 words written + read
  On cachepool_fpu_512 (1 tile) this passes cleanly. On cachepool_2t_fpu_512 and cachepool_4t_fpu_512 the SB still flags RESP DATA MISMATCH on cache lines touched by the startup/exit runtime path (not by the test data). The test's own data check PASSES because the corrupted line is not the test's buf line, but the underlying cache-state bug is reproduced.
  Conclusion from this repro: the bug fires the moment NumTiles > 1 even on purely single-core local activity -- it is NOT a coherence problem (no cross-tile sharing happens here) and NOT a remote-port routing problem (no real remote traffic from cores 1+). The suspect surface narrows to the multi-tile-conditional rotation math in tcdm_cache_interco/cachepool_tile (bits_to_rotate widens from CacheBankBits to CacheBankBits+TileBits at NumTiles>1) or the remote-port muxing inside the local cache_ctrl when those ports are wired in even though they carry no traffic.
  Use for future waveform-level debug:
    make vsim config=cachepool_2t_fpu_512 -B
    ./sim/bin/cachepool_cluster.vsim \
      software/build/CachePoolTests/test-cachepool-minimal-tile0-repro

* [RTL] cluster/group: propagate UseHashWaySelect through multi-tile path
  The multi-tile (NumTiles>1) cluster instantiates cachepool_group, which then instantiates cachepool_tile -- but cachepool_group did not forward UseHashWaySelect, so the tile fell back to its own 1'b0 default. This silently disabled hash-way select on every multi-tile build and triggered the forwarding-buffer / skewed-fold data-corruption path.
  Add the missing parameter wiring; default to 1'b1 to match cachepool_cluster.sv.

* [Bender] bump insitu-cache to zexin/sync-flush-fixes

* [VERIF] cc/interco: add plusarg-gated write-ack + addr-watch probes

* [VERIF] cc: demote benign EOC write-ack FIFO tail to info

* [VERIF] cc/tile: guard debug probes with ifndef TARGET_SYNTHESIS

* [Bender] bump insitu-cache lock to tcdm_wrapper comb-loop fix (2710920)

* [VERIF] cc/tile/interco: wrap long lines + verible waivers for debug probes

* [Lint] cluster: fix W110 user_i width mismatch; waive W123 false-positives
  - cachepool_cluster.sv: zero-extend refill_user_t to the AxiUserWidth user_i port of reqrsp_to_axi + ASSERT_INIT(AxiUserWidth >= $bits(refill_user_t)). Behavior-preserving (was an implicit zero-extend); closes the refill-misroute hazard at elaboration.
  - config.mk: correct stale axi_user_width comment (cache_info_t has no tile_id; the tile term is over-provisioned headroom, not an exact fit).
  - lint.tcl: DU-scoped W123 waivers for cachepool_cache_ctrl (coalescer_resp/bypass_resp driven via i_bypass_xbar slv_rsp_o aggregate) + spatz_decoder.

* [TEST] multi_producer: atomic rlc_ctx updates for multi-consumer
  Mark vtNext/pduWithoutPoll/byteWithoutPoll _Atomic and use atomic_fetch_add (unique per-PDU SN + shared stats) instead of plain +=; add a release fence before each lock unlock.
  Fixes the multi-consumer data race; K100 passes eoc_clean with 0 scoreboard mismatches on 2t/4t x rp1/rp2.

* [Lint] address review: relocate tile verif, fix W123 root-cause, always_ff/rst_ni
  - Move per-tile TCDM tracer + memory-model VIP out of cachepool_tile.sv into hardware/src/verif/cachepool_tile_tcdm_checker.sv (bind-attached); keeps the RTL body synthesis-clean.
  - tcdm_cache_interco: put rst_ni in the Probe-D always_ff sensitivity list.
  - Drop the cachepool_cache_ctrl W123 waiver: root-caused to '{}-as-lvalue on the bypass_xbar output ports (fixed in insitu-cache to concatenation {}); lint confirms the cache_ctrl W123 is gone.
  - Bump insitu-cache lock to 65940a3 (the concatenation fix).

* [CFG] make L1 folded/hash-way/fwd-buffer config-selectable
  - Add l1d_use_folded / l1d_fold_way_group / l1d_use_hash_way / l1d_use_fwd_buf knobs to cachepool_512.mk + cachepool_fpu_512.mk (default = production: folded+hash+fwd); emit them as VLOG_DEFS macros.
  - Thread UseForwardingBuffer through cluster->group->tile->cache_ctrl, and macro-default UseFoldedDataBanks/FoldWayGroup/UseHashWaySelect/UseForwardingBuffer at the wrapper from those macros.
  - Fix dropped param paths: forward UseFoldedDataBanks/FoldWayGroup through the cluster->group instantiation, and UseHashWaySelect/UseForwardingBuffer through wrapper->cluster (previously stuck at module defaults).
  - Bump insitu-cache lock to fbabd6a (fwd-buffer param + PartSplit=1 fixes).
  Verified: folded default bit-identical pass; unfolded conventional passes.
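
The alignment argument in the cache-mix-pressure entry can be checked with a small model. The following is a minimal Python sketch, not the test's actual C code: the function names, the base address `0x1000`, and the four-core default are assumptions made for illustration. Only the expression `it * 64 + cid * stride + j * 4` comes from the commit message.

```python
WORD_BYTES = 4  # size of a uint32_t access


def word_address(base, it, cid, j, per_core_stride):
    """Byte address of word j in the window for iteration `it` on core `cid`.

    Mirrors the expression quoted in the commit message:
    win = it * 64 + cid * stride; wp = base + win + j * 4.
    """
    win = it * 64 + cid * per_core_stride
    return base + win + j * WORD_BYTES


def misaligned_cores(per_core_stride, num_cores=4, base=0x1000):
    """Core ids whose first word address is not 4-byte aligned.

    `base` is a hypothetical 4-byte-aligned buffer address.
    """
    return [
        cid
        for cid in range(num_cores)
        if word_address(base, 0, cid, 0, per_core_stride) % WORD_BYTES != 0
    ]


if __name__ == "__main__":
    print("stride 7 :", misaligned_cores(7))
    print("stride 28:", misaligned_cores(28))
```

With a stride of 7 the per-core offsets are 7, 14, and 21 bytes for cores 1 to 3, none of which is a multiple of 4. A stride of 28 keeps the offsets distinct per core while every address stays word-aligned.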
1 parent 51deb23 commit f5c3ef4

36 files changed

Lines changed: 3289 additions & 109 deletions


Bender.lock

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@ packages:
 - common_verification
 - register_interface
 insitu-cache:
-revision: fa761ddebc946f9b46509d84945bf41ee1a9ec49
+revision: fbabd6a06fd801c960078517ad47f7994130b944
 version: null
 source:
 Git: https://github.com/pulp-platform/Insitu-Cache.git

Bender.yml

Lines changed: 5 additions & 1 deletion
@@ -14,7 +14,7 @@ dependencies:
 register_interface: { git: "https://github.com/pulp-platform/register_interface.git", version: 0.3.8 }
 riscv-dbg: { git: "https://github.com/pulp-platform/riscv-dbg.git", version: 0.7.0 }
 tech_cells_generic: { git: "https://github.com/pulp-platform/tech_cells_generic.git", version: 0.2.11 }
-Insitu-Cache: { git: "https://github.com/pulp-platform/Insitu-Cache.git", rev: zexin/cachepool_dev }
+Insitu-Cache: { git: "https://github.com/pulp-platform/Insitu-Cache.git", rev: zexin/sync-flush-fixes }
 spatz: { git: "https://github.com/pulp-platform/spatz.git", rev: cachepool-32b }
 dram_rtl_sim: { git: "https://github.com/pulp-platform/dram_rtl_sim.git", rev: cachepool }

@@ -49,6 +49,10 @@ sources:
 - hardware/src/cachepool_cluster.sv
 # Level 4
 - hardware/tb/cachepool_cluster_wrapper.sv
+# sim-only verification IP (bind-attached; excluded from synth/lint)
+- target: simulation
+files:
+- hardware/src/verif/cachepool_tile_tcdm_checker.sv
 # testbench
 - target: cachepool_test
 files:

Makefile

Lines changed: 8 additions & 0 deletions
@@ -251,6 +251,12 @@ VLOG_DEFS += -DL1D_TILE_SIZE=$(l1d_tile_size)
 VLOG_DEFS += -DL1D_TAG_DATA_WIDTH=$(l1d_tag_data_width)
 VLOG_DEFS += -DL1D_NUM_BANKS=$(l1d_num_banks)
 VLOG_DEFS += -DL1D_DEPTH=$(l1d_depth)
+# L1 data-bank micro-architecture knobs (1=on/0=off). Production = folded(1) +
+# hash-way(1) + fwd-buffer(1). Unfolded conventional = folded(0)+hash(0)+fwd(0).
+VLOG_DEFS += -DL1D_USE_FOLDED=$(l1d_use_folded)
+VLOG_DEFS += -DL1D_FOLD_WAY_GROUP=$(l1d_fold_way_group)
+VLOG_DEFS += -DL1D_USE_HASH_WAY=$(l1d_use_hash_way)
+VLOG_DEFS += -DL1D_USE_FWD_BUF=$(l1d_use_fwd_buf)

 # CachePool CC / core cluster
 VLOG_DEFS += -DSPATZ_FPU_EN=$(spatz_fpu_en)
@@ -277,6 +283,8 @@ VLOG_DEFS += -DPERIPH_START_ADDR=$(periph_start_addr)
 VLOG_DEFS += -DBOOT_ADDR=$(boot_addr)
 VLOG_DEFS += -DUART_ADDR=$(uart_addr)

+VLOG_DEFS += -DENABLE_SPATZ_REQ_SCOREBOARD
+
 ENABLE_CACHEPOOL_TESTS ?= 1

 # Bender targets
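
To see what the four new `VLOG_DEFS` lines hand to the simulator, the expansion can be mimicked outside of make. This is a rough Python sketch under the assumption that each knob `name` maps to `-D<NAME>=<value>` exactly as in the Makefile lines above; the `vlog_defs` helper and the two dictionaries are hypothetical names, and the values are the production and unfolded settings described in the Makefile comment.

```python
def vlog_defs(knobs):
    """Render config knobs as -D macro definitions, one per knob.

    Assumes the macro name is the upper-cased knob name, which matches
    the four L1D_* lines added to the Makefile in this commit.
    """
    return [f"-D{name.upper()}={value}" for name, value in knobs.items()]


# Production: folded(1) + hash-way(1) + fwd-buffer(1); way group 0 means "auto".
PRODUCTION = {
    "l1d_use_folded": 1,
    "l1d_fold_way_group": 0,
    "l1d_use_hash_way": 1,
    "l1d_use_fwd_buf": 1,
}

# Unfolded conventional: folded(0) + hash-way(0) + fwd-buffer(0).
UNFOLDED = dict(PRODUCTION, l1d_use_folded=0, l1d_use_hash_way=0, l1d_use_fwd_buf=0)

if __name__ == "__main__":
    print(" ".join(vlog_defs(PRODUCTION)))
    print(" ".join(vlog_defs(UNFOLDED)))
```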

config/cachepool_2t_fpu_512.mk

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+# Copyright 2026 ETH Zurich and University of Bologna.
+# Licensed under the Apache License, Version 2.0, see LICENSE for details.
+# SPDX-License-Identifier: Apache-2.0
+
+# 2-tile / 8-core FPU variant of cachepool_fpu_512.
+# Bisect point between 1t/4c (works) and 4t/16c (broken) to isolate
+# whether the bug is in the inter-tile (group-level) xbar at all, or
+# only at 4t.
+
+#########################
+## CachePool Cluster ##
+#########################
+
+# Number of tiles
+num_tiles ?= 2
+
+# Number of cores
+num_cores ?= 8
+
+# Core datawidth
+data_width ?= 32
+
+# Core addrwidth
+addr_width ?= 32
+
+num_remote_ports_per_tile ?= 1
+
+
+######################
+## CachePool Tile ##
+######################
+
+# Number of cores per CachePool tile
+num_cores_per_tile ?= 4
+
+# Refill interconnection data width
+refill_data_width ?= 128
+
+##### L1 Data Cache #####
+
+# L1 data cacheline width (in Bit)
+l1d_cacheline_width ?= 512
+
+# L1 data cache size (in KiB)
+l1d_size ?= 256
+
+# L1 data cache banking factor (how many banks per core?)
+l1d_bank_factor ?= 1
+
+# L1 coalescing window
+l1d_coal_window ?= 2
+
+# L1 data cache number of ways per
+l1d_num_way ?= 4
+
+# L1 data cache size per tile (KiB)
+l1d_tile_size ?= 256
+
+# L1 data cache tag width (TODO: should be calculated)
+l1d_tag_data_width ?= 92
+
+####################
+## CachePool CC ##
+####################
+# Spatz fpu support?
+spatz_fpu_en ?= 1
+
+# Spatz number of FPU
+spatz_num_fpu ?= 4
+
+# Spatz number of IPU
+spatz_num_ipu ?= 4
+
+# Spatz max outstanding transactions
+spatz_max_trans ?= 32
+
+# Snitch/FPU max outstanding transactions
+snitch_max_trans ?= 16
+
+
+#####################
+## L2 Main Memory ##
+#####################
+# L2 number of channels
+l2_channel ?= 4
+
+# L2 bank width (DRAM width, change with care)
+l2_bank_width ?= 512
+
+# L2 interleaving factor (in order of bank_width)
+l2_interleave ?= 16
+
+
+##################
+## Peripherals ##
+##################
+# Hardware stack size (in Byte)
+stack_hw_size ?= 1024
+
+# Stack size (total, including shared and private, 32'h800)
+stack_tot_size ?= 2048
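
The three configurations named in this commit (1t/4c, 2t/8c, 4t/16c) all use four cores per tile, which suggests `num_cores = num_tiles * num_cores_per_tile`. The sketch below checks that relationship in Python. The relationship is inferred from those values rather than stated in the source as an enforced rule, and `check_core_count` is a hypothetical helper, not part of the build.

```python
def check_core_count(num_tiles, num_cores, num_cores_per_tile):
    """Return the expected core count, or raise if the config disagrees.

    Assumption: the cluster holds num_tiles * num_cores_per_tile cores,
    as in the 1t/4c, 2t/8c and 4t/16c configurations.
    """
    expected = num_tiles * num_cores_per_tile
    if num_cores != expected:
        raise ValueError(
            f"num_cores={num_cores} but {num_tiles} tiles x "
            f"{num_cores_per_tile} cores/tile = {expected}"
        )
    return expected


if __name__ == "__main__":
    # Values from config/cachepool_2t_fpu_512.mk.
    print(check_core_count(num_tiles=2, num_cores=8, num_cores_per_tile=4))
```

A check like this could catch a bisect config in which `num_tiles` was edited but `num_cores` was left at the value of the config it was copied from.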

config/cachepool_512.mk

Lines changed: 14 additions & 2 deletions
@@ -9,10 +9,10 @@
 #########################

 # Number of tiles
-num_tiles ?= 4
+num_tiles ?= 1

 # Number of cores
-num_cores ?= 16
+num_cores ?= 4

 # Core datawidth
 data_width ?= 32
@@ -54,6 +54,18 @@ l1d_tile_size ?= 256
 # L1 data cache tag width (TODO: should be calculated)
 l1d_tag_data_width ?= 92

+# L1 data-bank micro-architecture (1=on, 0=off).
+# Production cache = folded(1) + hash-way(1) + fwd-buffer(1).
+# Unfolded "conventional" cache = folded(0) + hash-way(0) + fwd-buffer(0).
+# Constraints (enforced by RTL elaboration asserts):
+# - folded (l1d_use_folded=1) REQUIRES l1d_use_hash_way=1
+# - fwd-buffer (l1d_use_fwd_buf=1) REQUIRES l1d_use_hash_way=1
+# l1d_fold_way_group=0 => auto (min(4, ways)).
+l1d_use_folded ?= 1
+l1d_fold_way_group ?= 0
+l1d_use_hash_way ?= 1
+l1d_use_fwd_buf ?= 1
+
 ####################
 ## CachePool CC ##
 ####################
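
The constraints listed in the new config comment can be checked before a long elaboration run. The following Python sketch models the two "REQUIRES" rules and the `l1d_fold_way_group=0 => auto (min(4, ways))` rule from that comment. The function name `resolve_l1d_knobs` is hypothetical, the RTL asserts remain the real enforcement point, and resolving the auto value even when folding is off is an assumption of this sketch.

```python
def resolve_l1d_knobs(use_folded, fold_way_group, use_hash_way, use_fwd_buf, num_ways):
    """Validate the L1 data-bank knobs and resolve the 'auto' way group.

    Rules taken from the config comment:
      - folded requires hash-way
      - fwd-buffer requires hash-way
      - fold_way_group == 0 means auto, i.e. min(4, ways)
    """
    if use_folded and not use_hash_way:
        raise ValueError("l1d_use_folded=1 requires l1d_use_hash_way=1")
    if use_fwd_buf and not use_hash_way:
        raise ValueError("l1d_use_fwd_buf=1 requires l1d_use_hash_way=1")
    if fold_way_group == 0:
        fold_way_group = min(4, num_ways)
    return {
        "use_folded": use_folded,
        "fold_way_group": fold_way_group,
        "use_hash_way": use_hash_way,
        "use_fwd_buf": use_fwd_buf,
    }


if __name__ == "__main__":
    # Production defaults; 4 ways matches l1d_num_way in the 2t config.
    print(resolve_l1d_knobs(1, 0, 1, 1, num_ways=4))
```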

config/cachepool_fpu_512.mk

Lines changed: 12 additions & 0 deletions
@@ -56,6 +56,18 @@ l1d_tile_size ?= 256
 # L1 data cache tag width (TODO: should be calculated)
 l1d_tag_data_width ?= 92

+# L1 data-bank micro-architecture (1=on, 0=off).
+# Production cache = folded(1) + hash-way(1) + fwd-buffer(1).
+# Unfolded "conventional" cache = folded(0) + hash-way(0) + fwd-buffer(0).
+# Constraints (enforced by RTL elaboration asserts):
+# - folded (l1d_use_folded=1) REQUIRES l1d_use_hash_way=1
+# - fwd-buffer (l1d_use_fwd_buf=1) REQUIRES l1d_use_hash_way=1
+# l1d_fold_way_group=0 => auto (min(4, ways)).
+l1d_use_folded ?= 1
+l1d_fold_way_group ?= 0
+l1d_use_hash_way ?= 1
+l1d_use_fwd_buf ?= 1
+
 ####################
 ## CachePool CC ##
 ####################

config/config.mk

Lines changed: 46 additions & 3 deletions
@@ -105,14 +105,57 @@ snitch_max_trans ?= 16
 ## AXI configuration ##
 #########################

+# axi_user_width sets the AXI 'user' field that carries refill_user_t over the
+# cache->L2 AXI. Hard requirement: axi_user_width >= $bits(refill_user_t).
+# This floor is now enforced in RTL by ASSERT_INIT(CheckAxiUserFitsRefillUser)
+# in cachepool_cluster.sv, and the cluster zero-extends refill_user_t onto the
+# (wider) AXI user port at the reqrsp_to_axi call site.
+#
+# refill_user_t = bank_id(BankIDWidth) + tile_id(TileIDWidth)
+# + cache_info_t + burst_req_t
+# cache_info_t = {for_write_pend, depth, way} -- NOTE: NO tile_id inside.
+#
+# So refill_user_t carries TileIDWidth exactly ONCE. The values below are
+# deliberately conservative headroom, not an exact fit: the per-tile adjustment
+# adds 2*(idx_width(NumTiles)-1) bits, and cachepool_pkg.sv:163 then adds a
+# further clog2(NumTiles) (SpatzAxiUserWidth = AXI_USER_WIDTH + clog2(NumTiles)),
+# so the tile term ends up over-provisioned. This is harmless: no RTL reads any
+# AXI user bit above $bits(refill_user_t)-1.
+# (An earlier version of this comment claimed cache_info_t held a second
+# TileIDWidth copy and therefore doubled idx_width(NumTiles) -- it does not;
+# the 2x factor is just slack, kept as-is for headroom.)
+#
+# Do NOT let axi_user_width drop below $bits(refill_user_t): the MSB of bank_id
+# (or higher tile_id) would be truncated on the AXI loopback and refill
+# responses would route back to the wrong slv port (e.g. bank_id=4 aliases to
+# bank_id=0, sending cb=3's refill response to the icache bypass slot, making
+# cb=3 hang). The ASSERT_INIT above catches this at elaboration.
+
+ifeq ($(num_tiles),1)
+axi_user_tile_adj := 0
+else ifeq ($(num_tiles),2)
+axi_user_tile_adj := 0
+else ifeq ($(num_tiles),4)
+axi_user_tile_adj := 2
+else ifeq ($(num_tiles),8)
+axi_user_tile_adj := 4
+else ifeq ($(num_tiles),16)
+axi_user_tile_adj := 6
+else
+$(error num_tiles=$(num_tiles) not handled by axi_user_width formula; add a case in config.mk)
+endif
+
+# Base widths for NumTiles=1 (= reference values, verified working).
 ifeq ($(l1d_cacheline_width),512)
-axi_user_width := 17
+axi_user_base := 18
 else ifeq ($(l1d_cacheline_width),256)
-axi_user_width := 18
+axi_user_base := 19
 else ifeq ($(l1d_cacheline_width),128)
-axi_user_width := 21
+axi_user_base := 22
 endif

+axi_user_width := $(shell echo $$(( $(axi_user_base) + $(axi_user_tile_adj) )))
+
 #####################
 ## L2 Main Memory ##
 #####################
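
The lookup table in config.mk encodes the formula from the commit message, `base + 2*(idx_width(num_tiles)-1)`. The Python sketch below reproduces that arithmetic and the truncation hazard described in the comment. It is a model for checking the numbers, not part of the build: `idx_width` is assumed to behave like the usual PULP helper (`clog2(n)` for `n > 1`, else 1), and `truncate` is a hypothetical helper that drops the upper bits of a field.

```python
# Base widths for NumTiles=1, keyed by l1d_cacheline_width (from config.mk).
AXI_USER_BASE = {512: 18, 256: 19, 128: 22}


def idx_width(num_idx):
    """Bits needed to index num_idx items; assumed to be at least 1."""
    return (num_idx - 1).bit_length() if num_idx > 1 else 1


def axi_user_width(cacheline_width, num_tiles):
    """axi_user_width = base + 2 * (idx_width(num_tiles) - 1)."""
    if num_tiles not in (1, 2, 4, 8, 16):
        raise ValueError(f"num_tiles={num_tiles} not handled")
    return AXI_USER_BASE[cacheline_width] + 2 * (idx_width(num_tiles) - 1)


def truncate(value, width):
    """Keep only the low `width` bits, as a too-narrow user field would."""
    return value & ((1 << width) - 1)


if __name__ == "__main__":
    for tiles in (1, 2, 4, 8, 16):
        print(tiles, axi_user_width(512, tiles))
    # Losing the MSB of a 3-bit bank_id makes 4 alias to 0.
    print(truncate(4, 2))
```

The per-tile adjustments produced by the formula are 0, 0, 2, 4, and 6 for 1, 2, 4, 8, and 16 tiles, which matches the `axi_user_tile_adj` cases in the diff.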
