Support Fully Asynchronous DMAs #114
base: devel
Conversation
2025-09-19 (@Xeratec)

Added

Changed

Known Issues

The current code has several issues that need to be addressed first.
Example

The following traces are generated with GVSoC to visualize the new behavior and illustrate the changes introduced. The first transaction is the transfer of the output back to L2, followed by three transfers from L2 to L1 for the inputs.

Previous Double-Buffering

`python testRunner_tiled_siracusa.py -t Tests/testFloatGEMM --l1 10000 --doublebuffer`

I'm not fully confident that the trace is visualized correctly. The time spent transferring the input data appears unusually short, while the idle period before launching the kernel for the next tile is significantly longer. This suggests that the trace likely only shows the transfer of the first input blob, but not the subsequent inputs. This interpretation also aligns with the figures: the computation of the next tile in the figure above only starts after the time required to transfer all three input blobs in the figure below.

New Double-Buffering

`python testRunner_tiled_siracusa.py -t Tests/testFloatGEMM --l1 10000 --doublebuffer`

New Single-Buffering

`python testRunner_tiled_siracusa.py -t Tests/testFloatGEMM --l1 10000`
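To make the traced behavior concrete, here is a hypothetical sketch (not Deeploy's actual API; all names are illustrative) of the operation order a single-buffered versus a double-buffered tiling loop emits. In the double-buffered case the input DMA for tile i+1 is issued before the kernel for tile i, which is exactly the overlap the traces visualize:

```python
# Illustrative event schedules for single- vs double-buffered tiling.
# Names (dma_in, kernel, ...) are placeholders, not Deeploy identifiers.

def single_buffer_schedule(num_tiles: int) -> list:
    """Serialize transfer, wait, and compute for every tile."""
    events = []
    for i in range(num_tiles):
        events += [f"dma_in({i})", f"wait_in({i})", f"kernel({i})",
                   f"dma_out({i})", f"wait_out({i})"]
    return events

def double_buffer_schedule(num_tiles: int) -> list:
    """Prefetch the next tile's inputs while computing the current one."""
    events = ["dma_in(0)"]  # prologue: fetch tile 0 before the loop
    for i in range(num_tiles):
        events.append(f"wait_in({i})")
        if i + 1 < num_tiles:
            events.append(f"dma_in({i + 1})")  # issued early, overlaps kernel(i)
        events.append(f"kernel({i})")
        events.append(f"dma_out({i})")
    events.append(f"wait_out({num_tiles - 1})")  # epilogue: drain last output
    return events
```

The single-buffered loop pays the transfer latency on every iteration, while the double-buffered one only exposes it in the prologue and epilogue.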
Force-pushed d7cee2e to 737db19.
2025-09-20 (@Xeratec)

As discussed with @lukamac, I implemented a proper […]. Additionally, I implemented a […].

Added

Changed

Fixed

Known Issues

The current code has several issues that need to be addressed first.

Example

The following traces are generated with GVSoC to visualize the new behavior and illustrate the changes introduced. All images have the same timescale. The trace shows the input transfer of the next tile, followed by the kernel call and output transfer. The GEMM has three inputs and one output. All traces were generated by modifying the […]

`python testRunner_tiled_siracusa.py -t Tests/testFloatGEMM --l1 10000 --doublebuffer`

New Double-Buffering (…)
2025-09-24 (@Xeratec)

I optimized the implementation further and performed profiling. The results confirm that double-buffering (DB) provides clear benefits for memory-bound workloads, while its impact on compute-bound workloads remains limited.

Key Findings

Results

Double-Buffer (DirectionWaitingStrategy)
Single-Buffer (DirectionWaitingStrategy)
Double-Buffer (PerTensorWaitingStrategy)
Single-Buffer (PerTensorWaitingStrategy)
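The two waiting strategies profiled above can be sketched roughly as follows (a simplified, assumed model with illustrative names, not Deeploy's actual classes): a direction-based strategy hands out one shared future per transfer direction, so a single wait covers all tensors moving that way, while a per-tensor strategy dedicates a future to each tensor:

```python
# Minimal sketch of the two waiting-strategy flavors. The getFuture
# contract mirrors the idea only; signatures are assumptions.

class Future:
    def __init__(self, name: str):
        self.name = name  # variable name emitted into generated code

class DirectionWaitingStrategy:
    """One future per direction: coarse waits, fewer handles."""
    def __init__(self):
        self._futures = {}
    def get_future(self, tensor: str, direction: str) -> Future:
        return self._futures.setdefault(direction, Future(f"future_{direction}"))

class PerTensorWaitingStrategy:
    """One future per tensor: fine-grained waits, more handles."""
    def __init__(self):
        self._futures = {}
    def get_future(self, tensor: str, direction: str) -> Future:
        return self._futures.setdefault(tensor, Future(f"future_{tensor}"))
```

The trade-off: the direction strategy blocks until every transfer in that direction completes, while the per-tensor strategy lets the kernel start as soon as its own operands have landed.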
Example
📝 Walkthrough

Summary by CodeRabbit
Walkthrough

Refactors tiling and DMA codegen: separates profiling mixins from core SB/DB classes, adds direction-aware AsyncDma with alloc/wait semantics, returns alloc/dma/deinit triplets from transfers, updates Snitch pipeline/timers, and adds PULP mchan channel-id bounds checks.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
participant TG as TilingCodeGeneration
participant TA as AnydimAsyncDmaTransferAdapter
participant DMA as AsyncDma
participant WS as WaitingStrategy
participant F as Future
TG->>TA: transfer(tensor,..., direction, comment)
TA->>DMA: getFuture(tensorName, direction)
DMA->>WS: getFuture(tensorName, direction)
WS-->>DMA: Future
DMA-->>TA: Future
TA->>DMA: transfer(..., direction, future=F, comment)
DMA-->>TA: (alloc snippets, dma snippets, deinit snippets)
TA-->>TG: (alloc, dma, deinit) snippets
Note over TG: TG collects open/ingress/egress/close lists and delegates to profiling or core mixin
sequenceDiagram
autonumber
participant CT as ClusterTilingPass
participant SB as SB Codegen
participant DB as DB Codegen
participant PSB as Profiling SB MixIn
participant PDB as Profiling DB MixIn
CT->>CT: check tilingProfiling flag
alt profiling enabled
CT->>PSB: generate setup/loop with profiling hooks
CT->>PDB: generate setup/loop with profiling hooks
else profiling disabled
CT->>SB: generate normal SB loop
CT->>DB: generate normal DB loop
end
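The (alloc, dma, deinit) triplet contract that the walkthrough describes can be sketched like this (a hypothetical simplification; function and snippet names are illustrative, not Deeploy's actual API). Returning three snippet lists lets the caller hoist allocations before the tiling loop, emit the transfer in the loop body, and append teardown after it:

```python
# Sketch of a transfer returning (alloc, dma, deinit) snippet triplets
# and a caller that assembles them into open/body/close sections.

def transfer(tensor: str, direction: str, future: str):
    alloc = [f"{future} = dma_alloc();  // once, before the loop"]
    dma = [f"dma_start({tensor}, {direction}, {future});"]
    deinit = [f"dma_free({future});  // once, after the loop"]
    return alloc, dma, deinit

def collect(transfers):
    """Hoist every alloc to the front and every deinit to the back."""
    opens, body, closes = [], [], []
    for t in transfers:
        alloc, dma, deinit = transfer(*t)
        opens += alloc
        body += dma
        closes += deinit
    return opens + body + closes
```

The key property is that transfers no longer emit a single flat snippet: the caller decides where each part of the triplet lands in the generated function.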
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (1)
23-28: Ensure profiling mixins precede the code-gen base class.

Same MRO problem here: the profiling overrides never run because the concrete code-generation class is listed first. Swap the inheritance order so profiling still injects its measurement code.

```diff
-class ProfilingPULPL3TilingGenerationDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingPULPL3TilingGenerationDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
     pass
```
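The MRO issue behind this fix can be reproduced with two stand-in classes (illustrative names, not Deeploy's). Python resolves attributes left to right along the MRO, so a cooperative mixin only wraps the base method when it is listed first:

```python
# Why mixin order matters: with the base class listed first, its method
# shadows the mixin's override and the profiling hook never runs.

class CodeGen:
    def generate(self):
        return "plain loop"

class ProfilingMixIn:
    def generate(self):
        # Cooperative override: wrap whatever comes next in the MRO.
        return "profiled " + super().generate()

class WrongOrder(CodeGen, ProfilingMixIn):
    # MRO: WrongOrder -> CodeGen -> ProfilingMixIn -> object
    # CodeGen.generate wins; the mixin is never reached.
    pass

class RightOrder(ProfilingMixIn, CodeGen):
    # MRO: RightOrder -> ProfilingMixIn -> CodeGen -> object
    # The mixin wraps, then delegates to CodeGen via super().
    pass
```

This is exactly why the profiling mixin must precede the concrete code-generation class in the bases list.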
🧹 Nitpick comments (1)
Deeploy/TilingExtension/AsyncDma.py (1)
190-194: Preserve DMA comment on blocking wait.
`Future.wait` now accepts the comment payload, but the blocking adapter drops it by calling `future.wait()` without forwarding the string. As a result the generated blocking path loses the annotation while alloc/transfer keep it. Please pass the comment through so the wait snippet stays documented.

```diff
-        callStack.append(future.wait())
+        callStack.append(future.wait(comment))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (1 hunks)
- Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (1 hunks)
- Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2 hunks)
- Deeploy/Targets/PULPOpen/DMA/MchanDma.py (3 hunks)
- Deeploy/Targets/Snitch/Bindings.py (1 hunks)
- Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1 hunks)
- Deeploy/Targets/Snitch/DMA/SnitchDma.py (2 hunks)
- Deeploy/TilingExtension/AsyncDma.py (8 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (5 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (3 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3 hunks)
- DeeployTest/Platforms/Snitch/main.c (1 hunks)
- TargetLibraries/PULPOpen/inc/mchan_v6.h (2 hunks)
- TargetLibraries/PULPOpen/inc/mchan_v7.h (2 hunks)
- TargetLibraries/Snitch/src/CycleCounter.c (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py
🧬 Code graph analysis (15)
TargetLibraries/PULPOpen/inc/mchan_v7.h (1)
TargetLibraries/PULPOpen/inc/mchan_v6.h (2)
mchan_channel_is_busy(77-81)mchan_channel_wait(83-93)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (6)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
_tilingLoop(91-145)Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
_tilingLoop(76-304)Deeploy/DeeployTypes.py (6)
NetworkContext(545-1054)executionBlock(1562-1565)ExecutionBlock(1412-1538)copy(1050-1054)VariableBuffer(227-375)CodeSnippet(37-40)Deeploy/TilingExtension/TilingCodegen.py (3)
TilingSchedule(53-122)VariableReplacementScheme(126-158)HyperRectangle(24-35)Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
_hoistTileNumAndIdxPtr(77-101)Deeploy/TilingExtension/AsyncDma.py (4)
Future(16-39)transfer(119-133)transfer(170-195)transfer(247-325)
DeeployTest/Platforms/Snitch/main.c (1)
TargetLibraries/Snitch/src/CycleCounter.c (4)
ResetTimer(16-26)StartTimer(28-34)getCycles(46-55)StopTimer(36-44)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (4)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
DoubleBufferingTilingCodeGeneration(32-304)ProfilingDoubleBufferingTilingMixIn(307-379)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
SingleBufferingTilingCodeGeneration(31-145)ProfilingSingleBufferingTilingMixIn(148-206)Deeploy/DeeployTypes.py (8)
CodeTransformationPass(2249-2280)apply(2155-2169)apply(2204-2220)apply(2256-2280)NetworkContext(545-1054)executionBlock(1562-1565)ExecutionBlock(1412-1538)CodeGenVerbosity(44-50)Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (1)
apply(252-307)
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (3)
Deeploy/TilingExtension/AsyncDma.py (5)
AsyncDma(85-133)DirectionWaitingStrategy(59-71)Future(16-39)transferOpRepr(108-117)transferOpRepr(158-168)Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
transferOpRepr(62-79)Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)
transferOpRepr(70-89)
Deeploy/Targets/Snitch/Bindings.py (1)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterSynch.py (1)
SnitchSynchCoresPass(15-24)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (5)
Deeploy/DeeployTypes.py (9)
CodeSnippet(37-40)ExecutionBlock(1412-1538)NetworkContext(545-1054)NodeTemplate(82-224)VariableBuffer(227-375)_ReferenceBuffer(512-542)add(715-752)init(260-269)executionBlock(1562-1565)Deeploy/TilingExtension/AsyncDma.py (13)
AsyncDma(85-133)EmptyFuture(136-141)Future(16-39)getFuture(48-49)getFuture(54-56)getFuture(69-71)getFuture(80-82)getFuture(92-93)getFuture(155-156)getFuture(233-234)wait(38-39)init(29-30)deinit(35-36)Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
TilingCodeGeneration(40-307)_generateDmaTransferCalls(113-170)Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
TilingHoistingMixIn(31-153)Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
ProfilingPrototypeMixIn(97-198)PrototypeTilingMixIn(53-94)TilingMetaInfo(13-20)generateAllTilingCode(83-94)generateSetupAndTeardownCode(56-66)measurementArrayDeclaration(100-131)injectPrintCycleDiff(134-181)generateLoopCode(69-80)kernelProfilingWrap(184-198)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
Deeploy/TilingExtension/AsyncDma.py (6)
AsyncDma(85-133)Future(16-39)PerTensorWaitingStrategy(52-56)_transferTemplates(152-153)transferOpRepr(108-117)transferOpRepr(158-168)Deeploy/DeeployTypes.py (2)
NodeTemplate(82-224)VariableBuffer(227-375)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
Deeploy/DeeployTypes.py (2)
NodeTemplate(82-224)VariableBuffer(227-375)Deeploy/TilingExtension/AsyncDma.py (5)
AsyncDma(85-133)_transferTemplates(152-153)transferOpRepr(108-117)transferOpRepr(158-168)Future(16-39)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3)
Deeploy/DeeployTypes.py (5)
CodeSnippet(37-40)ExecutionBlock(1412-1538)executionBlock(1562-1565)addLeft(1435-1448)addRight(1450-1463)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
generateLoopCode(165-206)generateSetupAndTeardownCode(151-162)Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
generateLoopCode(336-379)generateSetupAndTeardownCode(310-333)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (4)
Deeploy/TilingExtension/AsyncDma.py (17)
AsyncDma(85-133)AnydimAsyncDmaTransferAdapter(198-325)EmptyFuture(136-141)Future(16-39)getFuture(48-49)getFuture(54-56)getFuture(69-71)getFuture(80-82)getFuture(92-93)getFuture(155-156)getFuture(233-234)transfer(119-133)transfer(170-195)transfer(247-325)wait(38-39)init(29-30)deinit(35-36)Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
TilingCodeGeneration(40-307)_generateDmaTransferCalls(113-170)Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
TilingHoistingMixIn(31-153)Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
ProfilingPrototypeMixIn(97-198)PrototypeTilingMixIn(53-94)TilingMetaInfo(13-20)generateAllTilingCode(83-94)generateSetupAndTeardownCode(56-66)measurementArrayDeclaration(100-131)injectPrintCycleDiff(134-181)generateLoopCode(69-80)kernelProfilingWrap(184-198)
TargetLibraries/PULPOpen/inc/mchan_v6.h (1)
TargetLibraries/PULPOpen/inc/mchan_v7.h (2)
mchan_channel_is_busy(103-107)mchan_channel_wait(109-119)
Deeploy/TilingExtension/AsyncDma.py (5)
Deeploy/DeeployTypes.py (7)
NodeTemplate(82-224)init(260-269)CodeSnippet(37-40)alloc(271-282)alloc(502-505)VariableBuffer(227-375)NetworkContext(545-1054)Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
transferOpRepr(62-79)checkTransfer(53-60)Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
transferOpRepr(63-95)checkTransfer(49-61)Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
transferOpRepr(70-89)checkTransfer(64-68)Deeploy/TilingExtension/TilingCodegen.py (2)
padShape(202-207)padStride(218-223)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
DoubleBufferingTilingCodeGeneration(32-304)ProfilingDoubleBufferingTilingMixIn(307-379)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
SingleBufferingTilingCodeGeneration(31-145)ProfilingSingleBufferingTilingMixIn(148-206)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
DoubleBufferingTilingCodeGeneration(32-304)ProfilingDoubleBufferingTilingMixIn(307-379)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
SingleBufferingTilingCodeGeneration(31-145)ProfilingSingleBufferingTilingMixIn(148-206)
🪛 Ruff (0.13.1)
Deeploy/Targets/Snitch/DMA/SnitchDma.py
46-58: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
78-78: Unused method argument: variableReplacement
(ARG002)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py
The full review will arrive; this is just a huge nit from me that I noticed even before and tried to address with a different approach in my attempt at the PR. I really don't like the comments in the DMA templates. I would much rather we take my approach of adding a comment as its own CodeSnippet on top of a group of statements. It keeps the templates much smaller and the function calls cleaner while still preserving the desired behavior of having comments. It's obviously not a blocker, but I dislike it enough that I wanted to comment immediately.
I agree, I can adjust the comment handling. I also just realized that I have to take a look at the Timer function in the Snitch platform. I will work on this early next week. You can still give it a pass now or wait for these changes, whichever you prefer.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
10-60: Don't replace reused DMA futures with `EmptyFuture`.

When a waiting strategy hands back the same `Future` object for multiple tensors (e.g., direction/tensor-group strategies still present in the tree), the second invocation goes through the `EmptyFuture` branch. That causes the DMA templates to render `${future}` as an empty string. In practice, the generated C ends up as `= snrt_dma_start_2d(...)`, which does not compile, and we also lose the handle needed for the later wait. The single-buffer path worked before because it always threaded the real `Future` through the transfer calls. Please keep using the real `Future` for code generation and let the set only deduplicate init/deinit/wait bookkeeping.

Suggested fix:

```diff
-from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, EmptyFuture, Future
+from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future
@@
-        future = self.dma.getFuture(tensorName, direction)
-        _future = set([future]) - futures
-        _future = _future.pop() if len(_future) > 0 else EmptyFuture("")
-        futures.add(future)
-
-        callStack.extend(
-            self._generateDmaTransferCalls(ctxt, tensorName, rectangles, tileIdxVar, localBuffer, externalBufferRef,
-                                           direction, _future))
+        future = self.dma.getFuture(tensorName, direction)
+        futures.add(future)
+
+        callStack.extend(
+            self._generateDmaTransferCalls(ctxt, tensorName, rectangles, tileIdxVar, localBuffer, externalBufferRef,
+                                           direction, future))
```

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
11-180: Keep emitting the real future when reusing DMA handles.

Just like in the SB generator, swapping reused futures for `EmptyFuture` means the DMA templates no longer receive a variable name. With Snitch/MCHAN templates that expand `${future}` into the left-hand side, the generated code turns into `= snrt_dma_start_2d(...)`, which is invalid and drops the txid needed for later waits. Please always pass the real `Future` to the transfer helpers; the set already ensures we only init/deinit each handle once.

Suggested fix:

```diff
-from Deeploy.TilingExtension.AsyncDma import AnydimAsyncDmaTransferAdapter, AsyncDma, EmptyFuture, Future
+from Deeploy.TilingExtension.AsyncDma import AnydimAsyncDmaTransferAdapter, AsyncDma, Future
@@
-        future = self.dma.getFuture(tensorName, "ExternalToLocal")
-        # Extract the future that is not already in the set of ingress futures
-        _future = set([future]) - ingressFutures
-        _future = _future.pop() if len(_future) > 0 else EmptyFuture("")
-        ingressFutures.add(future)
+        future = self.dma.getFuture(tensorName, "ExternalToLocal")
+        ingressFutures.add(future)
@@
-        initialDmaTransferCalls = anydimAdapter.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
-                                                         stridesFromShape(externalBufferShape),
-                                                         stridesFromShape(rectangles[0].dims), "ExternalToLocal",
-                                                         _future, math.prod(externalBufferShape))
+        initialDmaTransferCalls = anydimAdapter.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
+                                                         stridesFromShape(externalBufferShape),
+                                                         stridesFromShape(rectangles[0].dims), "ExternalToLocal",
+                                                         future, math.prod(externalBufferShape))
@@
-        ingressDMAStatements.append(_future.wait())
+        ingressDMAStatements.append(future.wait())
@@
-            self._generateDmaTransferCalls(ctxt, tensorName, rectangles, "TILING_I+1", nextLocalBufferReference,
-                                           externalBufferRef, "ExternalToLocal", _future))
+            self._generateDmaTransferCalls(ctxt, tensorName, rectangles, "TILING_I+1", nextLocalBufferReference,
+                                           externalBufferRef, "ExternalToLocal", future))
```
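A runnable sketch of the pattern this review asks for (assumed semantics, illustrative names): deduplicate init/deinit/wait bookkeeping with a set of already-seen futures, but always hand the real future to the generated transfer so the template's `${future}` placeholder never renders empty:

```python
# Dedupe bookkeeping per future handle while still threading the real
# handle through every transfer. Names are placeholders for illustration.

def emit_transfers(tensors, get_future):
    seen = set()
    init_calls, transfer_calls, wait_calls = [], [], []
    for tensor in tensors:
        future = get_future(tensor)
        if future not in seen:
            # Bookkeeping (init/wait) happens once per unique handle...
            seen.add(future)
            init_calls.append(f"init({future})")
            wait_calls.append(f"wait({future})")
        # ...but every transfer still receives the real handle.
        transfer_calls.append(f"dma_start({tensor}, {future})")
    return init_calls, transfer_calls, wait_calls
```

With a shared handle this yields one init and one wait, yet every `dma_start` line still names the handle, so nothing in the generated code collapses to an empty left-hand side.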
🧹 Nitpick comments (1)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
68-108: Eliminate duplicated `_measureCycles` template definition.

`_measureCycles` is defined twice back-to-back with identical bodies. The second definition silently overwrites the first, which is confusing to readers and will trigger Ruff's B024 warning. Please drop the redundant copy to keep the mixin lean.

Apply this diff:

```diff
-    _measureCycles = NodeTemplate("""
-    ${measurements}[${tileIdxVar}] = getCycles();
-    """)
-
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1 hunks)
- Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2 hunks)
- Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1 hunks)
- Deeploy/Targets/Snitch/DMA/SnitchDma.py (2 hunks)
- Deeploy/TilingExtension/AsyncDma.py (9 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (5 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9 hunks)
- DeeployTest/Platforms/Snitch/main.c (1 hunks)
- TargetLibraries/Snitch/src/CycleCounter.c (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py
- Deeploy/Targets/PULPOpen/DMA/MchanDma.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.
🧬 Code graph analysis (8)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(82-224)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (5)
Deeploy/TilingExtension/AsyncDma.py (13)
AsyncDma(85-120)EmptyFuture(123-128)Future(16-39)getFuture(48-49)getFuture(54-56)getFuture(69-71)getFuture(80-82)getFuture(92-93)getFuture(142-143)getFuture(200-201)wait(38-39)init(29-30)deinit(35-36)Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
TilingCodeGeneration(40-289)_generateDmaTransferCalls(115-152)Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (2)
TilingHoistingMixIn(31-153)dictOfArrays(19-28)Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
ProfilingPrototypeMixIn(67-208)PrototypeTilingMixIn(23-64)TilingMetaInfo(13-20)generateAllTilingCode(53-64)generateSetupAndTeardownCode(26-36)measurementArrayDeclaration(110-141)injectPrintCycleDiff(144-191)generateLoopCode(39-50)kernelProfilingWrap(194-208)Deeploy/DeeployTypes.py (5)
add(715-752)CodeSnippet(37-40)init(260-269)executionBlock(1562-1565)ExecutionBlock(1412-1538)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (6)
Deeploy/TilingExtension/AsyncDma.py (17)
AsyncDma(85-120)AnydimAsyncDmaTransferAdapter(165-271)EmptyFuture(123-128)Future(16-39)getFuture(48-49)getFuture(54-56)getFuture(69-71)getFuture(80-82)getFuture(92-93)getFuture(142-143)getFuture(200-201)transfer(113-120)transfer(150-162)transfer(214-271)wait(38-39)init(29-30)deinit(35-36)Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4)
TilingCodeGeneration(40-289)_tilingLoop(67-71)_generateDmaTransferCalls(115-152)_generateExternalReferenceUpdate(154-175)Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (2)
TilingHoistingMixIn(31-153)dictOfArrays(19-28)Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
ProfilingPrototypeMixIn(67-208)PrototypeTilingMixIn(23-64)TilingMetaInfo(13-20)generateAllTilingCode(53-64)generateSetupAndTeardownCode(26-36)measurementArrayDeclaration(110-141)injectPrintCycleDiff(144-191)generateLoopCode(39-50)kernelProfilingWrap(194-208)Deeploy/DeeployTypes.py (12)
NodeTemplate(82-224)CodeSnippet(37-40)VariableBuffer(227-375)_ReferenceBuffer(512-542)NetworkContext(545-1054)executionBlock(1562-1565)ExecutionBlock(1412-1538)add(715-752)init(260-269)nodeName(1568-1571)addLeft(1435-1448)addRight(1450-1463)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (3)
_tilingLoop(68-130)generateSetupAndTeardownCode(136-147)generateLoopCode(150-191)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)
Deeploy/TilingExtension/AsyncDma.py (4)
AsyncDma(85-120)Future(16-39)PerTensorWaitingStrategy(52-56)_transferTemplates(139-140)
Deeploy/TilingExtension/AsyncDma.py (4)
Deeploy/DeeployTypes.py (6)
NodeTemplate(82-224)alloc(271-282)alloc(502-505)CodeSnippet(37-40)VariableBuffer(227-375)NetworkContext(545-1054)Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
checkTransfer(40-47)transferOpRepr(49-60)Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
checkTransfer(34-46)transferOpRepr(48-74)Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
checkTransfer(47-51)transferOpRepr(53-65)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3)
Deeploy/DeeployTypes.py (7)
CodeSnippet(37-40)ExecutionBlock(1412-1538)executionBlock(1562-1565)addLeft(1435-1448)addRight(1450-1463)NodeTemplate(82-224)nodeName(1568-1571)Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
generateLoopCode(314-357)generateSetupAndTeardownCode(288-311)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
generateLoopCode(150-191)generateSetupAndTeardownCode(136-147)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (5)
Deeploy/DeeployTypes.py (4)
NodeTemplate(82-224)NetworkContext(545-1054)executionBlock(1562-1565)copy(1050-1054)Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
_tilingLoop(64-282)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
_tilingLoop(68-130)Deeploy/TilingExtension/TilingCodegen.py (1)
TilingSchedule(53-122)Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
_hoistTileNumAndIdxPtr(77-101)
DeeployTest/Platforms/Snitch/main.c (1)
TargetLibraries/Snitch/src/CycleCounter.c (3)
StartTimer(28-36)getCycles(48-55)StopTimer(38-46)
🪛 Ruff (0.13.1)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
90-91: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
99-100: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
66-66: Unused method argument: variableReplacement
(ARG002)
255-255: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
Deeploy/Targets/Snitch/DMA/SnitchDma.py
32-41: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py
67-67: ProfilingPrototypeMixIn is an abstract base class, but it has no abstract methods or properties
(B024)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
- GitHub Check: siracusa-kernels / test-runner-siracusa
🔇 Additional comments (2)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
15-24: Zero-initialize L3 request and guard waits to avoid deadlocks.

Nice catch: zeroing the `pi_cl_ram_req_t` upfront and skipping `pi_cl_ram_copy_wait` when `size == 0` protects the PerTensor strategy from stalling on futures that never issued a transfer. This keeps first-use futures safe without regressing real transfers whose `size` stays non-zero.

DeeployTest/Platforms/Snitch/main.c (1)
80-94: Revisit DM-core guard for timer teardown.

Reiterating the earlier feedback: `StopTimer()` still tears down the shared perf counter via `snrt_stop_perf_counter`. With the call now executed on every core, any compute core that returns from `RunNetwork` ahead of the DM controller can reach `StopTimer()` first and halt the counter before the DM core samples `getCycles()`, truncating the measured runtime. Please keep the reset/start/stop lifecycle wrapped in the DM-core guard (or otherwise prove the counter cannot be stopped by non-DM cores) before this lands.
If you don't have the energy, I can also change the comments (in this PR or a follow-up). I'll wait for the rest then. Thanks for the effort!
I already implemented all the suggested changes in afca596.
Force-pushed 6180e28 to 0dae064.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
TargetLibraries/Snitch/src/CycleCounter.c (1)
16-26: Guard perf-counter reset to DM core for symmetry and safety.

Start/Stop are gated by `snrt_is_dm_core()`, but Reset isn't. Resetting perf counters from compute cores can clobber the DM core's counter state. Gate Reset as well.

```diff
 void ResetTimer() {
-  snrt_reset_perf_counter(SNRT_PERF_CNT0);
+  if (snrt_is_dm_core()) {
+    snrt_reset_perf_counter(SNRT_PERF_CNT0);
+  }
   uint32_t const core_id = snrt_global_core_idx();
   ...
 }
```
🧹 Nitpick comments (1)
TargetLibraries/Snitch/src/CycleCounter.c (1)
9-15: Optional: mitigate 32-bit counter wrap and shrink `running[]`.

- Switch to 64-bit cycle/instruction counters (`mcycleh`/`minstreth`) when `__riscv_xlen == 32`, or document the 32-bit wrap (~8.6 s @ 500 MHz).
- Change `running[NUM_CORES]` from `uint32_t` to `uint8_t` (or `bool`) to save L1 memory. `NUM_CORES` already includes the DM core (compute cores = `NUM_CORES - 1`), so no build-system changes needed.
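On RV32 the 64-bit counter is split across two CSRs, so a naive hi/lo read can tear when the low word overflows between the two reads. A host-runnable sketch of the standard re-read pattern follows; `read_hi`/`read_lo` are simulated stand-ins for the `mcycleh`/`mcycle` CSR reads, not part of the Snitch runtime:

```c
#include <stdint.h>

/* Simulated split counter: the low word wraps into the high word right
 * after the first low read, exercising the race the loop guards against. */
static uint32_t sim_hi = 0, sim_lo = 0xFFFFFFFFu;
static int lo_reads = 0;

static uint32_t read_hi(void) { return sim_hi; }

static uint32_t read_lo(void) {
  if (lo_reads++ == 0) { /* counter overflows immediately after this read */
    uint32_t v = sim_lo;
    sim_lo = 5;
    sim_hi = 1;
    return v;
  }
  return sim_lo;
}

/* Wrap-safe 64-bit read: re-sample the high half until it is stable
 * around the low-half read, then splice the two words together. */
static uint64_t read_cycles64(void) {
  uint32_t hi, lo, hi2;
  do {
    hi = read_hi();
    lo = read_lo();
    hi2 = read_hi();
  } while (hi != hi2);
  return ((uint64_t)hi << 32) | lo;
}
```

On real hardware the two stubs would be `read_csr(mcycleh)` and `read_csr(mcycle)`; the retry loop runs at most twice per wrap.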
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
TargetLibraries/Snitch/src/CycleCounter.c(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
🔇 Additional comments (4)
TargetLibraries/Snitch/src/CycleCounter.c (4)
29-31: LGTM: DM-only start of perf counter.
Gating `snrt_start_perf_counter(...)` with `snrt_is_dm_core()` avoids compute-core interference.
39-41: LGTM: DM-only stop of perf counter.Symmetric to Start; avoids cross-core counter side effects.
42-45: Correct buffer for minstret end snapshot — resolved.
`instr_end[core_id] = read_csr(minstret);` fixes the earlier overwrite bug.
48-65: Elapsed-counter semantics restored for getCycles/getInstr.Delta logic when running vs. stopped is back; profiling users get elapsed values again.
2025-10-01 (@Xeratec)
After discussion with @lukamac, we identified that the current DMA transfer scheduling is overly strict about synchronization. For example, we always wait for input tile t to finish before starting the transfer of the next tile t+1, even though the hardware could already handle both in parallel. This conservative scheduling guarantees correctness, but it prevents us from taking full advantage of double-buffering. With proper double-buffering, we should instead keep two futures for the inputs and two for the outputs. That way, the DMA engine can already start moving data for tile t+1 while the compute kernel is still working on tile t, and simultaneously flush out the results of tile t-1. This enables true overlap of compute, prefetch, and drain. See #116, which previously attempted to implement this scheduling scheme. This PR shows the correct approach:
In theory, this is the optimal scheduling scheme. However, for most current workloads, the DMA transfers are typically not longer than the control overhead plus kernel execution, so runtime performance is not affected. Additionally, MCHAN only supports one in-flight transaction per direction. The benefit is primarily architectural cleanliness and readiness for cases where transfer time becomes a bottleneck.
Manual Improved Code
static void _closure(void *_closure_args) {
In theory, this is the optimal scheduling scheme. However, for most current workloads, the DMA transfers are typically not longer than the control overhead plus kernel execution, so runtime performance is not affected. Additionally, MCHAN only supports one in-flight transaction per direction. The benefit is primarily architectural cleanliness and readiness for cases where transfer time becomes a bottleneck. Manual Improved Codestatic void _closure(void *_closure_args) {
// CLOSURE ARG CAST
_closure_args_t *args = (_closure_args_t *)_closure_args;
uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;
// CLOSURE FUNCTION CALL
uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;
// Initialize DMA future
uint32_t channel_input_0 = (uint32_t)-1;
uint32_t channel_output_0 = (uint32_t)-1;
uint32_t channel_input_1 = (uint32_t)-1;
uint32_t channel_output_1 = (uint32_t)-1;
// Transfer initial input tile
channel_input_0 = mchan_channel_alloc();
mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];
// Transfer initial input tile
mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];
// Transfer initial input tile
mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];
// TILING LOOP
for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {
switch ((TILING_I) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
break;
}
// Wait for current input tile
// DOUBLE BUFFERING CHECK TILE LOAD
if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {
switch ((TILING_I + 1) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
channel_input_1 = mchan_channel_alloc();
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
channel_input_0 = mchan_channel_alloc();
break;
}
// Transfer next input tile
mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];
}
// Wait for current input tile
// DOUBLE BUFFERING CHECK TILE LOAD
if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {
switch ((TILING_I + 1) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
break;
}
// Transfer next input tile
mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];
}
// Wait for current input tile
// DOUBLE BUFFERING CHECK TILE LOAD
if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {
switch ((TILING_I + 1) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
break;
}
// Transfer next input tile
mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
}
if (TILING_I > 1) {
// Wait for current output tile
if (TILING_I % 2) {
mchan_channel_wait(channel_output_0);
mchan_channel_free(channel_output_0);
} else {
mchan_channel_wait(channel_output_1);
mchan_channel_free(channel_output_1);
}
}
if (TILING_I % 2) {
mchan_channel_wait(channel_input_0);
mchan_channel_free(channel_input_0);
} else {
mchan_channel_wait(channel_input_1);
mchan_channel_free(channel_input_1);
}
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
*DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];
_cluster_fork_args_t DeeployNetwork__cluster_fork_args =
(_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
.DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
.DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
.DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
.DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};
pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);
// Transfer current output tile
if (TILING_I % 2) {
channel_output_0 = mchan_channel_alloc();
} else {
channel_output_1 = mchan_channel_alloc();
}
mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];
// CLOSE TILING LOOP
}
*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;
// Wait for final output tile
mchan_channel_wait(channel_output_0);
mchan_channel_free(channel_output_0);
mchan_channel_wait(channel_output_1);
mchan_channel_free(channel_output_1);
// Deinitialize DMA future
// CLOSURE ARG WRITEBACK
}
Current Code
static void _closure(void *_closure_args) {
// CLOSURE ARG CAST
_closure_args_t *args = (_closure_args_t *)_closure_args;
uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;
// CLOSURE FUNCTION CALL
uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;
// Initialize DMA future
uint32_t channel_output = (uint32_t)-1;
uint32_t channel_input = (uint32_t)-1;
// Transfer initial input tile
channel_input = mchan_channel_alloc();
mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];
// Transfer initial input tile
mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];
// Transfer initial input tile
mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];
// TILING LOOP
for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {
switch ((TILING_I) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
break;
}
// Wait for current input tile
mchan_channel_wait(channel_input);
mchan_channel_free(channel_input);
// DOUBLE BUFFERING CHECK TILE LOAD
if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {
switch ((TILING_I + 1) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
break;
}
// Transfer next input tile
channel_input = mchan_channel_alloc();
mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];
}
// Wait for current input tile
// DOUBLE BUFFERING CHECK TILE LOAD
if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {
switch ((TILING_I + 1) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
break;
}
// Transfer next input tile
mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];
}
// Wait for current input tile
// DOUBLE BUFFERING CHECK TILE LOAD
if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {
switch ((TILING_I + 1) % 2) {
case 0:
DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
break;
case 1:
DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
break;
}
// Transfer next input tile
mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
}
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
*DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];
_cluster_fork_args_t DeeployNetwork__cluster_fork_args =
(_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
.DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
.DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
.DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
.DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};
pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);
// Wait for previous output tile
mchan_channel_wait(channel_output);
mchan_channel_free(channel_output);
// Transfer current output tile
channel_output = mchan_channel_alloc();
mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);
// UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];
// CLOSE TILING LOOP
}
*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;
// Wait for final output tile
mchan_channel_wait(channel_output);
mchan_channel_free(channel_output);
// Deinitialize DMA future
// CLOSURE ARG WRITEBACK
}
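To summarize the improved schedule without wading through the generated identifiers, here is a host-runnable sketch of the wait/issue ordering with two futures per direction. Everything here is illustrative pseudologic (event names, `run_schedule`); it models only the ordering, not the MCHAN driver:

```c
#include <stdio.h>
#include <string.h>

/* Records the order of wait/issue events for the improved schedule:
 * input t+1 is prefetched before blocking on input t, and output t is
 * only awaited two iterations later (two futures per direction). */
static char schedule_log[512];

static void ev(const char *fmt, int tile) {
  char buf[32];
  snprintf(buf, sizeof buf, fmt, tile);
  strcat(schedule_log, buf);
  strcat(schedule_log, " ");
}

static const char *run_schedule(int numTiles) {
  schedule_log[0] = '\0';
  ev("issue_in[%d]", 0); /* prime the first input tile */
  for (int t = 0; t < numTiles; t++) {
    if (t + 1 < numTiles)
      ev("issue_in[%d]", t + 1); /* prefetch before blocking */
    if (t > 1)
      ev("wait_out[%d]", t - 2); /* drain lags compute by two tiles */
    ev("wait_in[%d]", t);
    ev("kernel[%d]", t);
    ev("issue_out[%d]", t);
  }
  ev("wait_out[%d]", numTiles - 2); /* final drain of both output futures */
  ev("wait_out[%d]", numTiles - 1);
  return schedule_log;
}
```

For three tiles this yields compute on tile t overlapping the prefetch of t+1 and the drain of t-1, matching the description above.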
I'll do the changes, I just need some clarifications on the intended behavior.
 def transfer(self, ctxt: NetworkContext, externalBuffer: VariableBuffer, localBuffer: VariableBuffer,
-             shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...],
-             direction: DmaDirection, future: Future) -> List[CodeSnippet]:
+             shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int,
+                                                                                 ...], direction: DmaDirection,
+             future: Future) -> Tuple[List[CodeSnippet], List[CodeSnippet], List[CodeSnippet]]:
     self.checkTransfer(ctxt, externalBuffer, localBuffer, shape, strideExt, strideLoc, direction)
     opRepr = self.transferOpRepr(externalBuffer, localBuffer, shape, strideExt, strideLoc, direction, future)
     template = self._transferTemplates[len(shape)]
-    return [CodeSnippet(template, opRepr)]
-
- def setup(self) -> List[CodeSnippet]:
-     return []
-
- def teardown(self) -> List[CodeSnippet]:
-     return []
+    return [future.alloc()], [CodeSnippet(template, opRepr)], []
I disagree with the API change so I will need some convincing to accept it.
I view the transfer as a method that focuses only on generating code to execute a single transfer.
The overarching CodeGenerator can take care of future allocation and deinit. I prefer this approach of keeping the final leaf abstractions (things wrapping the kernel calls) to stay quite simple and concentrated.
I do understand your concern, and I am open to different solutions. Unfortunately, I do not have the bandwidth to implement this right now. I suggest we either postpone this PR by 1-2 weeks, implement the requested changes in another one, or you adapt the code to your liking.
-# LMACAN: The brackets around ${tileIdxVar} are important to ensure correct order
-# of the modulo operation. Breaking case without the brackets is when we
-# put "TILING_I + 1" for tileIdxVar.
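The pitfall the removed comment documents is plain C operator precedence: the template substitutes `tileIdxVar` textually, so without brackets "TILING_I + 1" expands into an expression where `%` binds tighter than `+`. A minimal sketch (function names are illustrative, not part of the codebase):

```c
/* Textual substitution without brackets yields TILING_I + 1 % 2,
 * which is parsed as TILING_I + (1 % 2) and selects the wrong buffer. */
static int buffer_index_unbracketed(int TILING_I) {
  return TILING_I + 1 % 2; /* == TILING_I + 1, not a 0/1 index */
}

/* With brackets, the whole substituted expression is taken modulo 2,
 * giving the intended double-buffer index. */
static int buffer_index_bracketed(int TILING_I) {
  return (TILING_I + 1) % 2;
}
```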
Kinda rude to remove this comment. It took me effort to hunt down the cause, and I'd like to prevent someone else from making the same mistake.
Oops, sorry, I agree, and I must have overlooked this change. I can revert it.
+# Extract the future that is not already in the set of ingress futures
+_future = set([future]) - ingressFutures
+_future = _future.pop() if len(_future) > 0 else EmptyFuture("")
Can you please explain to me what's the intended behavior with this. Why do we need to get an EmptyFuture? is it because of the allocs? what was getting replicated too much?
It enables the use of stateless futures and prevents generating multiple allocation calls if the same future is reused in multiple transfer calls. This is the case for a PerDirectionWaitingStrategy. The _future variable is passed to the transfer() function, which inserts the future.alloc() code into the generated code. This will of course change if we adapt the transfer() API interface as discussed in #114 (comment)
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return;
  *status_ptr = 1 << channel_id;
}

static uint32_t mchan_channel_is_busy(uint32_t channel_id) {
  // TODO: assert channel_id is smaller than 32
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return 0;
  return *status_ptr & (1 << channel_id);
}

static void mchan_channel_wait(uint32_t channel_id) {
  // TODO: assert channel_id is smaller than 32
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return;
I am against silently failing and using that as a feature in the implementation. The design goal of this was to be as bare-bones as possible, so I wouldn't do this check here but in the code generation where needed.
If we want the check, I would prefer that it crash execution or return an exit status so that the user knows something is wrong.
As we’ve already discussed, this represents a different implementation philosophy. My version is already a functional improvement compared to the previous state. Additionally, I wouldn’t consider this behavior as “failing silently,” since if a channel does not exist, I don’t want to wait for it, and that outcome is intentional. If an invalid channel ID was expected to exist, then the failure effectively occurred earlier during allocation, and I’d be open to adopting your approach at that point.
That said, we could also consider wrapping these function calls with checks to validate that the futures are valid (which would be required in a double-buffering scenario) and handle failures at that level.
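That wrapper-level check could look like the sketch below. The sentinel matches the `(uint32_t)-1` initialization already used for the channel variables in the generated code; `channel_wait_if_valid` and the mock wait are hypothetical names for illustration:

```c
#include <stdint.h>

#define CHANNEL_NONE ((uint32_t)-1) /* "no transfer issued" sentinel */

static int waits_executed = 0;

/* Stand-in for mchan_channel_wait(); only counts real waits. */
static void mock_channel_wait(uint32_t channel_id) {
  (void)channel_id;
  waits_executed++;
}

/* Validity check lifted out of the driver: a future that never issued
 * a transfer still carries CHANNEL_NONE and is skipped, while the bare
 * driver primitive stays free of range checks. */
static void channel_wait_if_valid(uint32_t channel_id) {
  if (channel_id != CHANNEL_NONE) {
    mock_channel_wait(channel_id);
  }
}
```

This keeps the driver bare-bones while making the "skip unissued transfers" behavior an explicit, visible decision at the call site.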
Co-authored-by: Luka Macan <[email protected]>
0dae064 to
0e8d74e
Compare
Actionable comments posted: 6
♻️ Duplicate comments (1)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1)
24-36: Fix MRO so profiling mixins actually run.
Reorder mixins ahead of code-generation base classes.

-class ProfilingSnitchClusterTilingSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+class ProfilingSnitchClusterTilingSB(ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration):
     _printCycleDifference = NodeTemplate(r"""
@@
-class ProfilingSnitchClusterTilingDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingSnitchClusterTilingDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
     _printCycleDifference = NodeTemplate(r"""
🧹 Nitpick comments (3)
TargetLibraries/PULPOpen/inc/mchan_v7.h (1)
98-99: Consider adding debug assertions for invalid channel IDs.
While the silent early returns are safe, they could mask programming errors during development. Consider adding debug-build assertions to catch invalid channel IDs:

 static void mchan_channel_free(uint32_t channel_id) {
+#ifdef DEBUG
+  assert(channel_id <= MCHAN_TRANSFER_ID_MAX && "Invalid channel ID");
+#endif
   if (channel_id > MCHAN_TRANSFER_ID_MAX)
     return;
   *status_ptr = 1 << channel_id;
 }

Apply similar changes to `mchan_channel_is_busy` and `mchan_channel_wait`.
Also applies to: 104-105, 110-111
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
52-59: Consider simplifying the future deduplication logic.
The set-difference approach to determining whether a future is new is correct but not immediately readable. The logic creates a set containing the future, subtracts the already-tracked futures, then pops the result or falls back to `EmptyFuture` if it is empty.
EmptyFutureif empty.A more straightforward alternative:
 future = self.dma.getFuture(tensorName, direction)
-_future = set([future]) - futures
-_future = _future.pop() if len(_future) > 0 else EmptyFuture("")
+_future = future if future not in futures else EmptyFuture("")
 futures.add(future)
90-91: Optional: Use iterable unpacking instead of list concatenation.Static analysis correctly suggests that list concatenation can be replaced with iterable unpacking for slightly better performance and modern Python style.
Apply this diff:
-ingressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer input tiles"})
-                       ] + ingressDMAStatements
+ingressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer input tiles"}),
+                        *ingressDMAStatements]

-egressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer output tiles"})
-                      ] + egressDMAStatements
+egressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer output tiles"}),
+                       *egressDMAStatements]
Also applies to: 99-100
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (1 hunks)
- Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (1 hunks)
- Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1 hunks)
- Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2 hunks)
- Deeploy/Targets/Snitch/Bindings.py (1 hunks)
- Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1 hunks)
- Deeploy/Targets/Snitch/DMA/SnitchDma.py (2 hunks)
- Deeploy/TilingExtension/AsyncDma.py (9 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (5 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4 hunks)
- Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9 hunks)
- DeeployTest/Platforms/Snitch/main.c (1 hunks)
- TargetLibraries/PULPOpen/inc/mchan_v6.h (2 hunks)
- TargetLibraries/PULPOpen/inc/mchan_v7.h (2 hunks)
- TargetLibraries/Snitch/src/CycleCounter.c (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- TargetLibraries/Snitch/src/CycleCounter.c
🚧 Files skipped from review as they are similar to previous changes (3)
- DeeployTest/Platforms/Snitch/main.c
- Deeploy/Targets/Snitch/Bindings.py
- TargetLibraries/PULPOpen/inc/mchan_v6.h
🧰 Additional context used
🪛 Ruff (0.13.3)
Deeploy/Targets/Snitch/DMA/SnitchDma.py
32-41: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py
67-67: ProfilingPrototypeMixIn is an abstract base class, but it has no abstract methods or properties
(B024)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
66-66: Unused method argument: variableReplacement
(ARG002)
255-255: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
90-91: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
99-100: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
🔇 Additional comments (23)
TargetLibraries/PULPOpen/inc/mchan_v7.h (2)
37-37: LGTM! Consistent boundary checks prevent undefined behavior.
The addition of `MCHAN_TRANSFER_ID_MAX` and corresponding boundary checks in the channel management functions correctly prevents bit-shift overflow and undefined behavior when channel IDs exceed the valid range. The implementation is consistent with the mchan_v6.h pattern.
Also applies to: 98-99, 104-105, 110-111
104-105: Early-return in `mchan_channel_wait` ensures `mchan_channel_is_busy` is never called with invalid channel IDs. No further changes required.

Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
15-24: LGTM! Proper initialization and guarded wait pattern.
The changes improve safety and correctness:
- Explicit zero initialization prevents use of uninitialized memory.
- Conditional wait (`${name}.size != 0`) prevents waiting on requests that were never issued.
- Empty `_allocTemplate` appropriately implements the Future interface without requiring explicit allocation for this DMA backend.

Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4)
43-43: LGTM! Useful template for generated code comments.
The `_lineComment` template provides a standard mechanism for injecting explanatory comments into generated code, improving readability of tiling codegen output.
67-71: LGTM! Appropriate abstraction for tiling strategies.
The abstract `_tilingLoop` method enables different implementations (single-buffering, double-buffering, profiling variants) while maintaining a consistent interface. The boolean return indicates tiling applicability, aligning with the existing pattern.
73-96: LGTM! Clean refactoring that supports new tiling strategies.
The refactored `generateTilingLoop`:
- Flattens multiple tiling schedules for single-pass processing.
- Validates that all base offset lists match the expected buffer count.
- Delegates implementation to the abstract `_tilingLoop` method, enabling a strategy pattern for different buffering approaches (SB/DB/profiling).
128-152: LGTM! Flattening aligns with updated DMA transfer API.
The changes flatten the DMA transfer call results from nested tuples into flat lists. This aligns with the PR's update to the transfer API, which now returns `(alloc, dma, deinit)` triplets. The flattening enables downstream processing of individual code snippets across allocation, transfer, and deinitialization phases.

Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
29-29: LGTM! Simplified waiting strategy.
Switching to `DirectionWaitingStrategy` simplifies synchronization by using one future per direction (input/output) rather than tracking multiple tensor groups. This aligns with MCHAN's limitation of one in-flight transaction per direction mentioned in the PR objectives.
14-20: Audit MCHAN channel lifecycles manually.
Automated checks returned no missing waits, but confirm that every path using `_allocTemplate` invokes `_waitTemplate` (i.e., `mchan_channel_alloc()` is always followed by `mchan_channel_wait()`/`free()`).

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (7)
31-39: LGTM! Well-structured templates for buffer switching.
The switch/case templates provide a clean mechanism for selecting buffers based on tile index modulo buffer count, which is essential for double-buffering. The `_referenceUpdate` template handles pointer reassignment for buffer switching.
44-52: LGTM! Robust switch-case assembly.
The `_switch` method properly:
- Validates that the number of case blocks matches the expected buffer count.
- Assembles a complete switch statement with proper opening, cases with breaks, and closure.
- Returns a flat list of CodeSnippets for integration into the tiling loop.
54-62: LGTM! Refactored to support per-buffer code blocks.
The updated `_generateBufferChoice` returns per-buffer code blocks (one for each buffer in the multi-buffer scheme) rather than a single snippet. This enables the `_switch` method to select the appropriate buffer update based on tile index, which is essential for double-buffering.
69-96: LGTM! Clear documentation of double-buffering strategy.
The inline comments provide an excellent high-level overview of the double-buffering implementation, making the complex tiling loop logic much easier to understand. The staged approach (setup, loop with ingress/egress, teardown) clearly delineates the different phases of execution.
139-163: EmptyFuture deduplication pattern enables stateless futures.
The pattern of extracting `_future` with an `EmptyFuture` fallback prevents duplicate allocation calls when the same future is reused across multiple tensors (e.g., with `DirectionWaitingStrategy`). The `_future` is passed to `transfer()` and `wait()`, which insert the allocation code. If the future is already in the set, `EmptyFuture` (with empty templates) is used instead, avoiding duplicate allocations while still generating valid code.
253-264: LGTM! Proper future lifecycle management.
The future lifecycle is correctly managed:
- All futures (ingress ∪ egress) are initialized in setup.
- Futures are waited on at the appropriate points (ingress during loop, egress in teardown).
- All futures are deinitialized in teardown.
This ensures proper resource management across the double-buffering execution.
285-357: LGTM! Clean profiling integration via mixin pattern.
The `ProfilingDoubleBufferingTilingMixIn` cleanly separates profiling concerns from core double-buffering logic:
- `generateSetupAndTeardownCode`: Declares measurement arrays and wraps with initial/final cycle measurements.
- `generateLoopCode`: Injects cycle measurements around ingress DMA, kernel, and egress DMA.
- Delegates to `super()` for base behavior, ensuring proper composition with `PrototypeTilingMixIn`.
This design enables profiling to be optionally applied to double-buffering tiling without modifying core logic.
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)
5-5: LGTM! Appropriate import cleanup.
Removing `abstractmethod` from the imports is correct since this file no longer declares abstract methods. The abstract `_tilingLoop` is now declared in TilingCodeGeneration.py.
26-64: LGTM! Explicit parameter-driven flow improves clarity.
The updated signatures make code snippet passing explicit rather than relying on internal generation:
- `generateSetupAndTeardownCode` and `generateLoopCode` take explicit list parameters.
- `generateAllTilingCode` orchestrates both methods, maintaining a single entry point.
This design improves clarity and testability by making data flow explicit and enables profiling mixins to intercept and augment snippet lists.
Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)
34-41: Guard the dummy DMA transfer behind a compile-time flag.
Wrap the fallback `snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0)` to avoid unintended memory traffic:

```diff
 if (snrt_is_dm_core()) {
     ${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
-    // WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
-    snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
+    // Workaround: ensure txid can be resolved without side-effects.
+    #if defined(DEEPLOY_SNRT_DMA_TXID_WORKAROUND)
+    snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
+    #endif
 }
```

Confirm this 1-byte self-copy (repeat = 0) is safe across all targeted SNRT versions; if not, expose a dedicated "commit" primitive or use a true zero-length transfer.
Deeploy/TilingExtension/AsyncDma.py (1)
113-121: All `.transfer` call sites correctly handle the new triple return value.

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (3)
10-14: LGTM: Import additions support profiling and future handling.
The new imports (`EmptyFuture`, `ProfilingPrototypeMixIn`, `PrototypeTilingMixIn`) are correctly added and used throughout the file to enable profiling-aware code generation and proper future deduplication.
72-130: LGTM: Clear structure and improved maintainability.
The restructured `_tilingLoop` method provides:
- Explicit comments documenting the single-buffering strategy
- Well-organized statement lists (setup, ingress, egress, teardown)
- Proper delegation to the `generateAllTilingCode` mixin method
- Descriptive inline comments via `self._lineComment` snippets
This significantly improves readability and aligns with the PR goal of adding generated-code comments.
133-191: LGTM: Profiling mixin follows established patterns.
`ProfilingSingleBufferingTilingMixIn` correctly:
- Extends both `PrototypeTilingMixIn` and `ProfilingPrototypeMixIn`
- Overrides `generateSetupAndTeardownCode` to inject measurement arrays and print logic
- Overrides `generateLoopCode` to wrap ingress/egress DMA statements with cycle measurements
- Delegates to parent implementations via `super()`
- Uses `kernelProfilingWrap` for kernel-level measurements
The implementation is consistent with the profiling infrastructure described in TilingPrototypes.py and aligns with the PR goal of adding profiling support to Snitch tiling.
```python
class ProfilingPULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
    pass
```
Fix MRO so profiling mixins override base methods.
Place Profiling*MixIn before the CodeGeneration base.
-class ProfilingPULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+class ProfilingPULPClusterTilingGenerationSB(ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration):
pass
@@
-class ProfilingPULPClusterTilingGenerationDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingPULPClusterTilingGenerationDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
    pass

Also applies to: 27-28
🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py around
lines 19-20 (and similarly lines 27-28), the profiling mixin is declared after
the code-generation base which prevents the mixin from overriding base methods
due to MRO; swap the inheritance order so the Profiling*MixIn comes before the
CodeGeneration base (e.g., class
ProfilingPULPClusterTilingGenerationSB(ProfilingSingleBufferingTilingMixIn,
SingleBufferingTilingCodeGeneration): pass) and apply the same swap to the other
affected class definition so profiling methods take precedence.
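The MRO point can be demonstrated with a self-contained sketch; the toy classes below are illustrative only, not the project's actual mixins:

```python
class Base:
    def generate(self):
        return "base"


class ProfilingMixIn:
    def generate(self):
        # Cooperative override: delegate to the next class in the MRO.
        return "profiled:" + super().generate()


# Mixin listed first: its override wins and can delegate via super().
class WithProfiling(ProfilingMixIn, Base):
    pass


# Mixin listed last: Base.generate shadows the mixin entirely.
class WithoutProfiling(Base, ProfilingMixIn):
    pass


print(WithProfiling().generate())     # profiled:base
print(WithoutProfiling().generate())  # base
```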
```python
class ProfilingPULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
    pass
```
Fix MRO so profiling mixins override base methods.
Reorder bases: mixin first.
-class ProfilingPULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+class ProfilingPULPL3TilingGenerationSB(ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration):
pass
@@
-class ProfilingPULPL3TilingGenerationDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingPULPL3TilingGenerationDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
    pass

Also applies to: 27-28
🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py around
lines 19-20 and 27-28 the profiling mixin(s) are listed after the concrete
SingleBufferingTilingCodeGeneration base, causing the MRO to pick base methods
over the mixin overrides; reorder the base classes so the
ProfilingSingleBufferingTilingMixIn (and any other profiling mixins) appear
first in each class definition (mixin before
SingleBufferingTilingCodeGeneration) so the mixin methods take precedence.
```diff
     _transferTemplates = {
         2:
-        NodeTemplate(
-            "if (snrt_is_dm_core()) snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});"
-        ),
+        NodeTemplate("""
+        if (snrt_is_dm_core()) {
+            ${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
+            // WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
+            snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
+        }
+        """),
     }
-    _waitingStrategy = TensorGroupWaitingStrategy(SnitchBarrierFuture, "")
+    _waitingStrategy = PerTensorWaitingStrategy(SnitchFuture)
```
Annotate class attributes with ClassVar to satisfy RUF012 and signal intent.
Mark mutable class attributes as `ClassVar`.
Apply:
-from typing import Dict, Tuple
+from typing import Dict, Tuple, ClassVar
+from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future, PerTensorWaitingStrategy, AsyncDmaWaitingStrategy
@@
-class SnitchDma(AsyncDma):
+class SnitchDma(AsyncDma):
- _transferTemplates = {
+ _transferTemplates: ClassVar[Dict[int, NodeTemplate]] = {
@@
- _waitingStrategy = PerTensorWaitingStrategy(SnitchFuture)
+ _waitingStrategy: ClassVar[AsyncDmaWaitingStrategy] = PerTensorWaitingStrategy(SnitchFuture)

Based on static analysis hints
📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
from typing import Dict, Tuple, ClassVar

from Deeploy.TilingExtension.AsyncDma import (
    AsyncDma,
    DmaDirection,
    Future,
    PerTensorWaitingStrategy,
    AsyncDmaWaitingStrategy,
)


class SnitchDma(AsyncDma):

    _transferTemplates: ClassVar[Dict[int, NodeTemplate]] = {
        2:
        NodeTemplate("""
        if (snrt_is_dm_core()) {
            ${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
            // WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
            snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
        }
        """),
    }

    _waitingStrategy: ClassVar[AsyncDmaWaitingStrategy] = PerTensorWaitingStrategy(SnitchFuture)
```
🤖 Prompt for AI Agents
In Deeploy/Targets/Snitch/DMA/SnitchDma.py around lines 32 to 43, the mutable
class attributes _transferTemplates and _waitingStrategy need explicit ClassVar
typing to satisfy RUF012 and signal they are class-level, not instance-level;
import ClassVar from typing (or typing_extensions if needed) and change their
annotations to e.g. _transferTemplates: ClassVar[Dict[int, NodeTemplate]] =
{...} and _waitingStrategy: ClassVar[PerTensorWaitingStrategy] =
PerTensorWaitingStrategy(SnitchFuture), ensuring imports for any referenced
types are available.
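What `ClassVar` conveys can be shown with a short sketch; `Template` and `Dma` below are toy stand-ins for the project's `NodeTemplate` and DMA classes:

```python
from typing import ClassVar, Dict


class Template:
    """Hypothetical stand-in for NodeTemplate."""

    def __init__(self, text):
        self.text = text


class Dma:
    # ClassVar documents that this mapping is shared by all instances and
    # is not meant to be reassigned per instance (what RUF012 asks for).
    transfer_templates: ClassVar[Dict[int, Template]] = {
        2: Template("snrt_dma_start_2d(...)"),
    }


a, b = Dma(), Dma()
print(a.transfer_templates is b.transfer_templates)  # True
```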
```diff
 # 4.4.1) Wait for previous output tile
 future = self.dma.getFuture(tensorName, "LocalToExternal")
+# Extract the future that is not already in the set of ingress futures
+_future = set([future]) - egressFutures
+_future = _future.pop() if len(_future) > 0 else EmptyFuture("")
 egressFutures.add(future)
+
+egressDMAStatements.append(CodeSnippet(self._lineComment, {"comment": "Wait for previous output tile"}))
+egressDMAStatements.append(_future.wait())

 # 4.4.2) Start transfer for current output tile
 dmaTransferCalls = self._generateDmaTransferCalls(ctxt, tensorName, rectangles, "TILING_I", localBuffer,
-                                                  externalBufferRef, "LocalToExternal", future)
-egressDmaTransferCalls.extend(dmaTransferCalls)
+                                                  externalBufferRef, "LocalToExternal", _future)
+
+egressDMAStatements.append(CodeSnippet(self._lineComment, {"comment": "Transfer current output tile"}))
+egressDMAStatements.extend(dmaTransferCalls)
```
🧩 Analysis chain
Verify that the past review issue has been addressed.
A past review comment (still visible in the annotations) flagged that using EmptyFuture on the egress path prevents correct transaction ID generation for Snitch DMA. The suggested fix was to keep the actual Future for codegen and only use the set for deduping teardown.
The current code still uses the _future pattern with EmptyFuture fallback. Please confirm whether:
- The issue was resolved through changes in the `Future` implementation or transfer API.
- The pattern is now correct for the updated DMA backends (Mchan, Snitch).
- Or if the suggested fix from the past review still needs to be applied.
Run the following script to check if EmptyFuture is causing issues with Snitch DMA transfer generation:
🏁 Script executed:

```bash
#!/bin/bash
# Description: Check for Snitch DMA transfer patterns and EmptyFuture usage

# Look for Snitch DMA transfer generation
rg -nP 'snrt_dma_start|SnitchDma' -C 5

# Check EmptyFuture template definitions
rg -nP 'class EmptyFuture' -A 10

# Look for getFuture usage patterns
rg -nP 'getFuture.*LocalToExternal' -C 3
```

Length of output: 9839
🏁 Script executed:

```bash
#!/bin/bash
# Inspect AsyncDma getFuture and waiting strategy
rg -nP 'class AsyncDma' -A100 Deeploy/TilingExtension/AsyncDma.py
rg -nP 'def getFuture' -C5 Deeploy/TilingExtension/AsyncDma.py
rg -nP 'class PerTensorWaitingStrategy' -A50 Deeploy/TilingExtension/AsyncDma.py
```

Length of output: 11508
Preserve real Future for egress DMA instead of using EmptyFuture fallback
The current fallback to EmptyFuture when deduping egress futures (DoubleBufferingTilingCodeGeneration.py:227-231) suppresses both wait() and transfer calls, breaking Snitch DMA transaction-ID generation. Retain the original future for code emission and use egressFutures only for deduplication.
🤖 Prompt for AI Agents
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
lines 227-242: the code replaces the real future with EmptyFuture when deduping
egress futures which suppresses wait() and DMA transfer uses and breaks DMA
transaction-ID generation; change the logic to always keep and pass the original
`future` into code emission (use it for _future.wait() and for
generateDmaTransferCalls) but use `egressFutures` only to decide whether to emit
the prior-wait() (i.e., if `future` not in `egressFutures` emit the wait and
then add it to `egressFutures`, otherwise skip emitting an extra wait while
still passing the original `future` to transfer call generation).
```python
        return executionBlock


class ProfilingPrototypeMixIn(ABC):
```
Remove unnecessary ABC inheritance from ProfilingPrototypeMixIn.
ProfilingPrototypeMixIn does not declare any abstract methods and serves as a concrete mixin providing profiling functionality. It should not inherit from ABC.
Apply this diff to remove the ABC inheritance:
-class ProfilingPrototypeMixIn(ABC):
+class ProfilingPrototypeMixIn:

If the intent is to ensure subclasses implement certain methods, consider adding abstract methods. Otherwise, remove the ABC inheritance.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-class ProfilingPrototypeMixIn(ABC):
+class ProfilingPrototypeMixIn:
```
🤖 Prompt for AI Agents
In Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py around
line 67, remove the unnecessary inheritance from ABC on the
ProfilingPrototypeMixIn class (change its declaration to be a plain mixin) and
also remove the now-unused ABC import at the top of the file; if the original
intent was to force subclasses to implement methods, instead declare explicit
@abstractmethod(s) and keep ABC, otherwise simply drop the ABC inheritance and
unused import.
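A short demonstration of why `ABC` without abstract methods adds nothing; the classes here are toy examples, not the project's:

```python
from abc import ABC, abstractmethod


class NoAbstractMixin(ABC):
    """ABC inheritance without @abstractmethod has no effect: instantiable."""

    def profile(self):
        return "tick"


class RealAbstract(ABC):
    @abstractmethod
    def profile(self):
        ...


print(NoAbstractMixin().profile())  # tick
try:
    RealAbstract()
except TypeError:
    print("instantiation refused")  # instantiation refused
```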
```python
    _measureCycles = NodeTemplate("""
    ${measurements}[${tileIdxVar}] = getCycles();
    """)

    _measurementArrayDeclaration = NodeTemplate("""
    uint32_t ${measurements}[${totalNumTiles}];
    """)

    _stringDeclaration = NodeTemplate("""
    const static char ${name}[] = "${string}";
    """)

    _printLoopSetup = NodeTemplate("""
    StopTimer();
    for (int ${profileIdxVar} = ((*${tileIdxPtr} > 0) ? ${numTiles}[(*${tileIdxPtr} - 1)] : 0);
         ${profileIdxVar} < ${numTiles}[*${tileIdxPtr}];
         ${profileIdxVar}++){
    """)

    _printCycleDifference = NodeTemplate(r"""
    printf("%s%u] %s%u%s", ${prefixStr}, ${profileIdxVar}, "${flavorStr}", \
           ${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
    """)

    _printLoopTeardown = NodeTemplate("""
    }
    StartTimer();
    """)

    _measureConditionSetup = NodeTemplate("""
    if(${cond}){
    """)

    _measureConditionEnd = NodeTemplate("""
    }
    """)

    _measureCycles = NodeTemplate("""
    ${measurements}[${tileIdxVar}] = getCycles();
    """)
```
Remove duplicate _measureCycles definition.
The _measureCycles template is defined twice (lines 68-70 and lines 105-107) with identical content. This appears to be a copy-paste error.
Apply this diff to remove the duplicate:
- _measureCycles = NodeTemplate("""
- ${measurements}[${tileIdxVar}] = getCycles();
- """)
-
_measurementArrayDeclaration = NodeTemplate("""
uint32_t ${measurements}[${totalNumTiles}];

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    _measurementArrayDeclaration = NodeTemplate("""
    uint32_t ${measurements}[${totalNumTiles}];
    """)

    _stringDeclaration = NodeTemplate("""
    const static char ${name}[] = "${string}";
    """)

    _printLoopSetup = NodeTemplate("""
    StopTimer();
    for (int ${profileIdxVar} = ((*${tileIdxPtr} > 0) ? ${numTiles}[(*${tileIdxPtr} - 1)] : 0);
         ${profileIdxVar} < ${numTiles}[*${tileIdxPtr}];
         ${profileIdxVar}++){
    """)

    _printCycleDifference = NodeTemplate(r"""
    printf("%s%u] %s%u%s", ${prefixStr}, ${profileIdxVar}, "${flavorStr}", \
           ${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
    """)

    _printLoopTeardown = NodeTemplate("""
    }
    StartTimer();
    """)

    _measureConditionSetup = NodeTemplate("""
    if(${cond}){
    """)

    _measureConditionEnd = NodeTemplate("""
    }
    """)

    _measureCycles = NodeTemplate("""
    ${measurements}[${tileIdxVar}] = getCycles();
    """)
```
🤖 Prompt for AI Agents
In Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py around
lines 68 to 107 there is a duplicate definition of the _measureCycles
NodeTemplate (one at ~68-70 and another at ~105-107); remove the second (later)
duplicate definition so _measureCycles is defined only once, ensuring no other
references are changed.







This pull request introduces improvements to the DMA code generation for several backends (`SnitchDma` and `Mchan`) to enable proper double-buffering by overlapping DMA transfers with kernel calls. Additionally, it refactors the profiling infrastructure for Snitch tiling and improves the readability of the generated code by adding helpful comments.

Key Changes

(`EmptyGEMM`): DB reduces runtime by up to 21% with `DirectionWaitingStrategy`. When comparing SB (L=5000) against DB (L=10000) (same number of tiles), DB is ~26–29% faster in the new implementation. On `devel`, SB was slightly faster (-1.7%) due to the incorrect DMA synchronization.
Changed
Fixed
PR Merge Checklist
- The PR is rebased on the latest `devel` commit and pointing to `devel`.
- The `CHANGELOG.md` file has been updated.