Commit 397341f

Support Fully Asynchronous DMAs (#114)
This pull request introduces improvements to the DMA code generation for several backends (`SnitchDma` and `Mchan`), to enable proper double-buffering by overlapping DMA transfers with kernel calls. Additionally, it refactors the profiling infrastructure for Snitch tiling and improves the readability of the generated code by adding some helpful comments.

### Added
- Profiling-aware tiling mixins: `ProfilingDoubleBufferingTilingMixIn` and `ProfilingSingleBufferingTilingMixIn` integrated into the Snitch and PULP tiling generators.
- Optional comments injected into generated code (DMA templates `_initTemplate`, `_allocTemplate`, `_waitTemplate`) for improved readability and traceability.
- Profiling instrumentation for tile-level DMA and kernel execution integrated into the tiling passes for Snitch backends.

### Changed
- Refactored DMA code-generation in the backends (`SnitchDma`, `Mchan`) to enable full overlap of DMA and compute for double-buffering, replacing the earlier (incorrect) synchronization scheme.
- Simplified tiling generator logic by leveraging the profiling mix-ins and consolidating redundant template assignments, improving maintainability and code generation clarity.
- Improved the waiting-strategy architecture: introduced `PerTensorWaitingStrategy` alongside existing `TensorGroupWaitingStrategy`, enabling finer-grained control of DMA futures in DB mode.

### Fixed
- Corrected DMA synchronization bug that previously prevented effective overlapping of transfer and compute in DB mode, especially noticeable for memory-bound kernels.
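
The overlap described above can be illustrated with a small, purely conceptual sketch. It is not Deeploy's generated code: the function name `double_buffered_schedule` and the event strings are hypothetical, and output transfers are left out for brevity. It only shows the ordering that double-buffering aims for, where the load of tile `i + 1` is already in flight while tile `i` is being computed.

```python
from typing import List


def double_buffered_schedule(num_tiles: int) -> List[str]:
    """Return the event order of a double-buffered tile loop (inputs only)."""
    events: List[str] = []
    if num_tiles <= 0:
        return events

    # Prologue: fetch the first tile before entering the loop.
    events.append("start load tile 0 -> buf 0")

    for i in range(num_tiles):
        buf = i % 2
        # Wait only for the transfer that fills the buffer we are about to use.
        events.append(f"wait load tile {i}")
        # Prefetch the next tile into the other buffer before computing.
        if i + 1 < num_tiles:
            events.append(f"start load tile {i + 1} -> buf {1 - buf}")
        events.append(f"compute tile {i} on buf {buf}")
    return events


if __name__ == "__main__":
    for event in double_buffered_schedule(3):
        print(event)
```

In this ordering every `start load` for tile `i + 1` precedes the `compute` of tile `i`, which is what lets a memory-bound kernel hide transfer latency. Waiting on all outstanding transfers at once would serialize the two again.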
1 parent 23e9f02 commit 397341f

17 files changed, +584 -510 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -4,6 +4,7 @@ This file contains the changelog for the Deeploy project. The changelog is divid
 ## Unreleased (Planned Release Target: v0.2.1)
 
 ### List of Pull Requests
+- Support Fully Asynchronous DMAs [#114](https://github.com/pulp-platform/Deeploy/pull/114)
 - Disallow shape inference [#128](https://github.com/pulp-platform/Deeploy/pull/128)
 - Remove memory-aware node bindings [#123](https://github.com/pulp-platform/Deeploy/pull/123)
 - Fix missing const's layout transformation and refactor NCHWtoNHWC passes [#122](https://github.com/pulp-platform/Deeploy/pull/122)
@@ -55,6 +56,8 @@ This file contains the changelog for the Deeploy project. The changelog is divid
 - RequantHelpers.py for Neureka's TileConstraints
 - Added assertion that all the graph tensors after lowering have a shape annotated
 - Added testFloatGEMMnobias
+- Profiling support and optional comments in generated DMA code for better traceability
+- Added new waiting-strategy logic with fine-grained `PerTensorWaitingStrategy`
 
 ### Changed
 - Replaced platform-specific tags (`*-amd64`, `*-arm64`) with direct digest references in `Noelware/docker-manifest-action`.
@@ -91,6 +94,7 @@ This file contains the changelog for the Deeploy project. The changelog is divid
 - Removed Wmem variants of bindings and tile constraints from Neureka
 - Disabled ICCT_ITA_8 MemPool test because it was using a lowering that created shapeless tensors
 - Added missing shape annotation to the testTypeInferenceDifferentTypes
+- Refactored DMA code generation (`SnitchDma`, `Mchan`) to correctly overlap transfers and compute in double-buffering mode
 
 ### Fixed
 - Prevent node duplication for graphs generated via GraphSurgeon
@@ -105,6 +109,7 @@ This file contains the changelog for the Deeploy project. The changelog is divid
 - Missing layout transformation of the const's (bias, mul, add, shift in Conv/RequantizedConv)
 - Keep mul/add rank of requantized Neureka tile constraints
 - Fix bias hoisting in generic GEMM with no bias
+- DMA synchronization bug causing reduced DB performance on memory-bound kernels.
 
 ### Removed
 - Delete outdated and unused `.gitlab-ci.yml` file

Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py

Lines changed: 4 additions & 6 deletions
@@ -7,22 +7,20 @@
 from Deeploy.DeeployTypes import CodeGenVerbosity, CodeTransformationPass, ExecutionBlock, NetworkContext, _NoVerbosity
 from Deeploy.TilingExtension.AsyncDma import AsyncDma
 from Deeploy.TilingExtension.CodeTransformationPasses.DoubleBufferingTilingCodeGeneration import \
-    DoubleBufferingTilingCodeGeneration
+    DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn
 from Deeploy.TilingExtension.CodeTransformationPasses.SingleBufferingTilingCodeGeneration import \
-    SingleBufferingTilingCodeGeneration
-from Deeploy.TilingExtension.CodeTransformationPasses.TilingPrototypes import DoubleBufferingTilingMixIn, \
-    ProfilingDoubleBufferingTilingMixIn, ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingMixIn
+    ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration
 
 
-class PULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration, SingleBufferingTilingMixIn):
+class PULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration):
     pass
 
 
 class ProfilingPULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
     pass
 
 
-class PULPClusterTilingGenerationDB(DoubleBufferingTilingCodeGeneration, DoubleBufferingTilingMixIn):
+class PULPClusterTilingGenerationDB(DoubleBufferingTilingCodeGeneration):
     pass
 
 
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py

Lines changed: 4 additions & 6 deletions
@@ -7,22 +7,20 @@
 from Deeploy.DeeployTypes import CodeGenVerbosity, CodeTransformationPass, ExecutionBlock, NetworkContext, _NoVerbosity
 from Deeploy.TilingExtension.AsyncDma import AsyncDma
 from Deeploy.TilingExtension.CodeTransformationPasses.DoubleBufferingTilingCodeGeneration import \
-    DoubleBufferingTilingCodeGeneration
+    DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn
 from Deeploy.TilingExtension.CodeTransformationPasses.SingleBufferingTilingCodeGeneration import \
-    SingleBufferingTilingCodeGeneration
-from Deeploy.TilingExtension.CodeTransformationPasses.TilingPrototypes import DoubleBufferingTilingMixIn, \
-    ProfilingDoubleBufferingTilingMixIn, ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingMixIn
+    ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration
 
 
-class PULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration, SingleBufferingTilingMixIn):
+class PULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration):
     pass
 
 
 class ProfilingPULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
     pass
 
 
-class PULPL3TilingGenerationDB(DoubleBufferingTilingCodeGeneration, DoubleBufferingTilingMixIn):
+class PULPL3TilingGenerationDB(DoubleBufferingTilingCodeGeneration):
     pass
 
 

Deeploy/Targets/PULPOpen/DMA/L3Dma.py

Lines changed: 9 additions & 2 deletions
@@ -12,9 +12,16 @@
 
 class L3DmaFuture(Future):
 
-    _initTemplate = NodeTemplate("pi_cl_ram_req_t ${name};")
+    _initTemplate = NodeTemplate("pi_cl_ram_req_t ${name} = {0};")
+
     _deinitTemplate = NodeTemplate("")
-    _waitTemplate = NodeTemplate("pi_cl_ram_copy_wait(&${name});")
+
+    _allocTemplate = NodeTemplate("")
+
+    _waitTemplate = NodeTemplate("""
+    if (${name}.size != 0) {
+        pi_cl_ram_copy_wait(&${name});
+    }""")
 
 
 class L3Dma(AsyncDma):

Deeploy/Targets/PULPOpen/DMA/MchanDma.py

Lines changed: 14 additions & 5 deletions
@@ -6,14 +6,23 @@
 from typing import Dict, Tuple
 
 from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation, VariableBuffer
-from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future, TensorGroupWaitingStrategy
+from Deeploy.TilingExtension.AsyncDma import AsyncDma, DirectionWaitingStrategy, DmaDirection, Future
 
 
 class MchanChannelFuture(Future):
 
-    _initTemplate = NodeTemplate("uint32_t ${name} = mchan_channel_alloc();")
-    _deinitTemplate = NodeTemplate("mchan_channel_free(${name});")
-    _waitTemplate = NodeTemplate("mchan_channel_wait(${name});")
+    _initTemplate = NodeTemplate("uint32_t ${name} = (uint32_t) -1;")
+
+    _deinitTemplate = NodeTemplate("")
+
+    _allocTemplate = NodeTemplate("${name} = mchan_channel_alloc();")
+
+    _waitTemplate = NodeTemplate("""
+    if (${name} <= MCHAN_CHANNEL_ID_MAX) {
+        mchan_channel_wait(${name});
+        mchan_channel_free(${name});
+    }
+    """)
 
 
 class MchanDma(AsyncDma):
@@ -22,7 +31,7 @@ class MchanDma(AsyncDma):
         1: NodeTemplate("mchan_transfer_1d(${cmd}, ${loc}, ${ext});"),
         2: NodeTemplate("mchan_transfer_2d_ext_strided(${cmd}, ${loc}, ${ext}, ${size_1d}, ${stride_2d});"),
     }
-    _waitingStrategy = TensorGroupWaitingStrategy(MchanChannelFuture, "channel_id")
+    _waitingStrategy = DirectionWaitingStrategy(MchanChannelFuture, "channel")
 
     def __init__(self, transferTemplates: Dict[int, NodeTemplate] = _transferTemplates) -> None:
         super().__init__(transferTemplates)
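
To make the new init/alloc/wait split in `MchanChannelFuture` easier to picture, the following sketch renders the three added template strings for one future. It uses Python's standard `string.Template` as a stand-in for Deeploy's `NodeTemplate` (an assumption made only so the example is self-contained), and the future name `channel_0` and the helper `render_mchan_future` are hypothetical.

```python
from string import Template

# Template strings copied from the added lines of the diff above.
INIT = Template("uint32_t ${name} = (uint32_t) -1;")
ALLOC = Template("${name} = mchan_channel_alloc();")
WAIT = Template("""if (${name} <= MCHAN_CHANNEL_ID_MAX) {
    mchan_channel_wait(${name});
    mchan_channel_free(${name});
}""")


def render_mchan_future(name: str) -> str:
    """Render init, alloc and wait snippets for one future, in that order."""
    return "\n".join(t.substitute(name=name) for t in (INIT, ALLOC, WAIT))


if __name__ == "__main__":
    print(render_mchan_future("channel_0"))
```

Read this way, the future starts out holding the sentinel `(uint32_t) -1`, so a wait emitted before any channel has been allocated appears to fall through the `<= MCHAN_CHANNEL_ID_MAX` guard without touching the hardware. The channel is freed in the wait snippet rather than in a separate deinit step.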

Deeploy/Targets/Snitch/Bindings.py

Lines changed: 1 addition & 2 deletions
@@ -14,7 +14,7 @@
 from Deeploy.Targets.Generic.Templates import iNoNormTemplate
 from Deeploy.Targets.Generic.TypeCheckers import AddChecker, GEMMChecker, RQAddChecker, SoftmaxChecker, iNoNormChecker
 from Deeploy.Targets.Snitch.CodeTransformationPasses import SnitchClusterTiling, SnitchCoreFilterPass, \
-    SnitchProfileExecutionBlockPass, SnitchSynchCoresPass
+    SnitchSynchCoresPass
 from Deeploy.Targets.Snitch.DMA.SnitchDma import SnitchDma
 from Deeploy.Targets.Snitch.Templates import AddTemplate, FloatGemmTemplate, RQAddTemplate, iSoftmaxTemplate
 from Deeploy.Targets.Snitch.Templates.FloatSoftmaxTemplate import FloatSoftmax_Template
@@ -37,7 +37,6 @@
 
 TiledTransformer = CodeTransformation([
     SnitchCoreFilterPass("compute"),
-    SnitchProfileExecutionBlockPass(),
     TilingVariableReplacement("L1"),
     TilingCallClosure(writeback = False),
     SnitchSynchCoresPass(),

Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py

Lines changed: 28 additions & 11 deletions
@@ -4,38 +4,55 @@
 
 from typing import Tuple
 
-from Deeploy.DeeployTypes import CodeGenVerbosity, CodeTransformationPass, ExecutionBlock, NetworkContext, _NoVerbosity
+from Deeploy.DeeployTypes import CodeGenVerbosity, CodeTransformationPass, ExecutionBlock, NetworkContext, \
+    NodeTemplate, _NoVerbosity
 from Deeploy.TilingExtension.AsyncDma import AsyncDma
 from Deeploy.TilingExtension.CodeTransformationPasses.DoubleBufferingTilingCodeGeneration import \
-    DoubleBufferingTilingCodeGeneration
+    DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn
 from Deeploy.TilingExtension.CodeTransformationPasses.SingleBufferingTilingCodeGeneration import \
-    SingleBufferingTilingCodeGeneration
-from Deeploy.TilingExtension.CodeTransformationPasses.TilingPrototypes import DoubleBufferingTilingMixIn, \
-    SingleBufferingTilingMixIn
+    ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration
 
 
-class SnitchClusterTilingSB(SingleBufferingTilingCodeGeneration, SingleBufferingTilingMixIn):
+class SnitchClusterTilingSB(SingleBufferingTilingCodeGeneration):
     pass
 
 
-class SnitchClusterTilingDB(DoubleBufferingTilingCodeGeneration, DoubleBufferingTilingMixIn):
+class SnitchClusterTilingDB(DoubleBufferingTilingCodeGeneration):
     pass
 
 
+class ProfilingSnitchClusterTilingSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+    _printCycleDifference = NodeTemplate(r"""
+    printf("%s%u][Core %d] %s%u%s", ${prefixStr}, ${profileIdxVar}, snrt_global_core_idx(), "${flavorStr}", \
+    ${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
+    """)
+
+
+class ProfilingSnitchClusterTilingDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+    _printCycleDifference = NodeTemplate(r"""
+    printf("%s%u][Core %d] %s%u%s", ${prefixStr}, ${profileIdxVar}, snrt_global_core_idx(), "${flavorStr}", \
+    ${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
+    """)
+
+
 class SnitchClusterTiling(CodeTransformationPass):
 
     def __init__(self, externalMemory: str, localMemory: str, dma: AsyncDma):
         self.SB = SnitchClusterTilingSB(externalMemory, localMemory, dma)
+        self.profilingSB = ProfilingSnitchClusterTilingSB(externalMemory, localMemory, dma)
+
         self.DB = SnitchClusterTilingDB(externalMemory, localMemory, dma)
+        self.profilingDB = ProfilingSnitchClusterTilingDB(externalMemory, localMemory, dma)
 
     def apply(self,
               ctxt: NetworkContext,
               executionBlock: ExecutionBlock,
               name: str,
               verbose: CodeGenVerbosity = _NoVerbosity) -> Tuple[NetworkContext, ExecutionBlock]:
         if verbose.tilingProfiling:
-            raise NotImplementedError("Profiling not implemented for L2")
-
-        ctxt, executionBlock = self.SB.apply(ctxt, executionBlock, name)
-        ctxt, executionBlock = self.DB.apply(ctxt, executionBlock, name)
+            ctxt, executionBlock = self.profilingSB.apply(ctxt, executionBlock, name)
+            ctxt, executionBlock = self.profilingDB.apply(ctxt, executionBlock, name)
+        else:
+            ctxt, executionBlock = self.SB.apply(ctxt, executionBlock, name)
+            ctxt, executionBlock = self.DB.apply(ctxt, executionBlock, name)
         return ctxt, executionBlock
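
The dispatch that replaces the former `NotImplementedError` can be pictured with a stripped-down model. The classes `StubVerbosity`, `StubPass` and `StubClusterTiling` below are hypothetical stand-ins, not Deeploy types. They keep only the control flow of `SnitchClusterTiling.apply`: the profiling pair of passes runs when `tilingProfiling` is set, and the plain pair runs otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubVerbosity:
    """Hypothetical stand-in for CodeGenVerbosity; only one flag is modeled."""
    tilingProfiling: bool = False


class StubPass:
    """Hypothetical pass that records its label instead of transforming code."""

    def __init__(self, label: str) -> None:
        self.label = label

    def apply(self, log: List[str]) -> List[str]:
        return log + [self.label]


class StubClusterTiling:
    """Mirrors the branch structure of the apply method in the diff above."""

    def __init__(self) -> None:
        self.SB = StubPass("SB")
        self.profilingSB = StubPass("profilingSB")
        self.DB = StubPass("DB")
        self.profilingDB = StubPass("profilingDB")

    def apply(self, log: List[str], verbose: Optional[StubVerbosity] = None) -> List[str]:
        verbose = verbose or StubVerbosity()
        if verbose.tilingProfiling:
            log = self.profilingSB.apply(log)
            log = self.profilingDB.apply(log)
        else:
            log = self.SB.apply(log)
            log = self.DB.apply(log)
        return log


if __name__ == "__main__":
    tiling = StubClusterTiling()
    print(tiling.apply([]))
    print(tiling.apply([], StubVerbosity(tilingProfiling=True)))
```

In both branches the single-buffering pass runs before the double-buffering pass, matching the order in the diff.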

Deeploy/Targets/Snitch/DMA/SnitchDma.py

Lines changed: 18 additions & 8 deletions
@@ -5,31 +5,41 @@
 from typing import Dict, Tuple
 
 from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation, VariableBuffer
-from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future, TensorGroupWaitingStrategy
+from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future, PerTensorWaitingStrategy
 
 
 class SnitchBarrierFuture(Future):
     _initTemplate = NodeTemplate("")
     _deinitTemplate = NodeTemplate("")
+    _allocTemplate = NodeTemplate("")
     _waitTemplate = NodeTemplate("if (snrt_is_dm_core()) snrt_dma_wait_all();")
 
 
 # LMACAN: TODO: Add single transfer waiting
 class SnitchFuture(Future):
-    _initTemplate = NodeTemplate("uint16_t ${name};")
+    _initTemplate = NodeTemplate("snrt_dma_txid_t ${name} = (snrt_dma_txid_t) -1;")
+
     _deinitTemplate = NodeTemplate("")
-    _waitTemplate = NodeTemplate("if (snrt_is_dm_core()) snrt_dma_wait(${name});")
+
+    _allocTemplate = NodeTemplate("")
+
+    _waitTemplate = NodeTemplate(
+        "if ( (${name} != ( (snrt_dma_txid_t) -1) ) && snrt_is_dm_core() ) snrt_dma_wait(${name});")
 
 
 class SnitchDma(AsyncDma):
 
     _transferTemplates = {
         2:
-            NodeTemplate(
-                "if (snrt_is_dm_core()) snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});"
-            ),
+            NodeTemplate("""
+                if (snrt_is_dm_core()) {
+                    ${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
+                    // WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
+                    snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
+                }
+                """),
     }
-    _waitingStrategy = TensorGroupWaitingStrategy(SnitchBarrierFuture, "")
+    _waitingStrategy = PerTensorWaitingStrategy(SnitchFuture)
 
     def __init__(self, transferTemplates: Dict[int, NodeTemplate] = _transferTemplates) -> None:
         super().__init__(transferTemplates)
@@ -43,13 +53,13 @@ def checkTransfer(self, ctxt: NetworkContext, externalBuffer: VariableBuffer, lo
     def transferOpRepr(self, externalBuffer: VariableBuffer, localBuffer: VariableBuffer, shape: Tuple[int, ...],
                        strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...], direction: DmaDirection,
                        future: Future) -> OperatorRepresentation:
-        _ = future
         operatorRepresentation: OperatorRepresentation = {
             "dest": localBuffer.name if direction == "ExternalToLocal" else externalBuffer.name,
             "src": externalBuffer.name if direction == "ExternalToLocal" else localBuffer.name,
             "repeat": shape[0],
             "size": shape[1],
             "stride_dest": strideLoc[0] if direction == "ExternalToLocal" else strideExt[0],
             "stride_src": strideExt[0] if direction == "ExternalToLocal" else strideLoc[0],
+            "future": future.name
         }
         return operatorRepresentation
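
As a rough illustration of what `transferOpRepr` now hands to the transfer template, the sketch below reimplements the same dictionary construction as a free function over plain strings and tuples. The function name `snitch_transfer_op_repr`, the buffer names `in_L2` / `in_L1`, the future name `in_future`, and the stride values are all hypothetical. Any direction other than `"ExternalToLocal"` is treated as the opposite direction, as in the diff.

```python
from typing import Dict, Tuple, Union

OpRepr = Dict[str, Union[str, int]]


def snitch_transfer_op_repr(external_name: str, local_name: str, shape: Tuple[int, int],
                            stride_ext: Tuple[int, ...], stride_loc: Tuple[int, ...],
                            direction: str, future_name: str) -> OpRepr:
    """Build the template variables for one 2D transfer (simplified model)."""
    to_local = direction == "ExternalToLocal"
    return {
        "dest": local_name if to_local else external_name,
        "src": external_name if to_local else local_name,
        "repeat": shape[0],  # number of rows
        "size": shape[1],  # size of one row
        "stride_dest": stride_loc[0] if to_local else stride_ext[0],
        "stride_src": stride_ext[0] if to_local else stride_loc[0],
        # New in this commit: the future's name is exposed so the template
        # can store the transaction ID returned by the DMA start call.
        "future": future_name,
    }


if __name__ == "__main__":
    rep = snitch_transfer_op_repr("in_L2", "in_L1", (4, 64), (256,), (64,),
                                  "ExternalToLocal", "in_future")
    print(rep)
```

With the `"future"` key present, the `${future} = snrt_dma_start_2d(...)` line in the transfer template has a variable to assign to, which is what allows `SnitchFuture` to wait on one transfer rather than on all outstanding ones.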
