
Conversation

@Xeratec (Member) commented Sep 19, 2025

This pull request introduces improvements to the DMA code generation for several backends (SnitchDma and Mchan) to enable proper double-buffering by overlapping DMA transfers with kernel calls. Additionally, it refactors the profiling infrastructure for Snitch tiling and improves the readability of the generated code by adding helpful comments.

Key Changes

  • Memory-bound kernels (EmptyGEMM): DB reduces runtime by up to 21% with DirectionWaitingStrategy. When comparing SB (L=5000) against DB (L=10000) (same number of tiles), DB is ~26–29% faster in the new implementation. On devel, SB was even slightly faster (-1.7%) due to the incorrect DMA synchronization.
  • Compute-bound kernels (FloatGEMM): DB yields only ~2% improvement, and in some cases slightly higher latency than SB with the same number of tiles. This is consistent with expectations, as the additional tiling and control overhead dominates. I will investigate this aspect in a separate PR.
  • Tile equivalence (L=5000 SB vs. L=10000 DB): These experiments allow a fair comparison between DB and SB, showing that DB’s advantage is specific to memory-bound scenarios.

Added

Changed

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and points to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

@Xeratec Xeratec added this to the Release 0.2.1 milestone Sep 19, 2025
@Xeratec Xeratec added the Feature Addition of new features label Sep 19, 2025
@Xeratec Xeratec added this to Deeploy Sep 19, 2025
@Xeratec (Member, Author) commented Sep 19, 2025

2025-09-19 (@Xeratec)

Added

  • Enhanced DMA Code Generation and Template Flexibility: All DMA-related NodeTemplates in L3Dma.py, MchanDma.py, and SnitchDma.py now support an optional comment field, allowing custom comments to be inserted into the generated code snippets for initialization, waiting, and transfer operations. This improves code readability and debugging; see the sketch after this list.
  • Profiling Infrastructure for Snitch Tiling: Introduced new profiling mixins (ProfilingSingleBufferingTilingMixIn, ProfilingDoubleBufferingTilingMixIn) and corresponding tiling classes (ProfilingSnitchClusterTilingSB, ProfilingSnitchClusterTilingDB) to support profiling-aware tiling code generation for Snitch clusters.
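A minimal sketch of what the comment field produces in the generated code, using the Mchan driver calls that appear in the traces further down; the include path, wrapper function, and variable names are placeholders, not output of a real network:

#include <stdint.h>
#include "mchan_v7.h" // assumed include for the PULP Mchan driver

// Sketch: the optional `comment` field of the transfer and wait templates is
// rendered as a plain C comment directly above the emitted driver calls.
static void load_input_tile(uint32_t cmd, void *l1_dst, void *l2_src) {
  // Transfer initial input tile   <-- rendered from the template's comment field
  uint32_t future = mchan_channel_alloc();
  mchan_transfer_1d(cmd, l1_dst, l2_src);

  // Wait for initial input tile   <-- rendered from the template's comment field
  mchan_channel_wait(future);
  mchan_channel_free(future);
}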

Changed

  • Refactoring and Consistency in DMA Waiting Strategies: The DMA waiting strategy for Snitch and Mchan backends has been unified to use PerTensorWaitingStrategy instead of TensorGroupWaitingStrategy, simplifying synchronization and making the behavior consistent across all DMA types.
  • The SnitchClusterTiling class now conditionally applies profiling tiling passes when profiling is enabled, instead of raising a NotImplementedError.
  • Removed the SnitchProfileExecutionBlockPass from the TiledTransfer code generation; it should not be used by default.

Known Issues

The current code has several issues that need to be addressed first.

  • The current code allocates each DMA transfer for Mchan and SnitchDma into a distinct future by using a TensorGroupWaitingStrategy. However, this can break when the AnydimAsyncDmaTransferAdapter is used, because it might issue multiple DMA transfers without keeping track of the multiple futures (transfer IDs). (c185fea)
  • To avoid getting stuck during execution, each DMA transfer is followed by a dummy transfer (same parameters, except the length is set to one). This is a hack and adds configuration overhead; a sketch of the workaround follows this list. (d5823a8)
  • The profiling support for tiled execution on Snitch is very hacky and needs to be cleaned up. (e188356)
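A minimal sketch of the dummy-transfer workaround described above, illustrated with the Mchan driver calls used in the traces below; the exact template and affected backend are in d5823a8, and the wrapper and names here are placeholders:

#include <stdint.h>
#include "mchan_v7.h" // assumed include for the PULP Mchan driver

// Sketch of the workaround: every real transfer is immediately followed by a
// dummy transfer with the same parameters except a length of one, so that
// execution does not get stuck waiting on the transfer. The dummy transfer
// adds configuration overhead for every tile. The real code builds a proper
// command word for the length-one transfer; `1` is only a placeholder here.
static uint32_t start_tile_transfer(uint32_t cmd, void *l1_dst, void *l2_src) {
  uint32_t future = mchan_channel_alloc();
  mchan_transfer_1d(cmd, l1_dst, l2_src); // real transfer
  mchan_transfer_1d(1, l1_dst, l2_src);   // dummy transfer of length one (hack)
  return future;
}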

Example

The following traces are generated with GVSoC to visualize the new behavior and illustrate the changes introduced. The first transaction is the transfer of the output back to L2, followed by three transfers from L2 to L1 for the inputs. All images have the same time scale.

Previous Double-Buffering

Siracusa_testFloatGEMM_doublebuffer_old
python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer

I’m not fully confident that the trace is visualized correctly. The time spent transferring the input data appears unusually short, while the idle period before launching the kernel for the next tile is significantly longer. This suggests that the trace likely only shows the transfer of the first input blob, but not the subsequent inputs. This interpretation also aligns with the figure below: The computation of the next tile in the figure above only starts after the time required to transfer all three input blobs in the figure below.

New Double-Buffering

Siracusa_testFloatGEMM_doublebuffer
python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer

New Single-Buffering

Siracusa_testFloatGEMM_singlebuffer
python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000

@Xeratec Xeratec force-pushed the pr/fix_doublebuffer branch 2 times, most recently from d7cee2e to 737db19 on September 20, 2025 08:07
@Xeratec (Member, Author) commented Sep 20, 2025

2025-09-20 (@Xeratec)

As discussed with @lukamac, I implemented a proper PerTensorWaitingStrategy that also works with the AnydimAsyncDmaTransferAdapter, which generates multiple transfers. The DMA transfer function now returns three lists of code snippets (pre-transfer, transfer, and post-transfer code). This allows us to apply the loop wrapping in AnydimAsyncDmaTransferAdapter to only the transfer code. As shown below, this enables proper double-buffering and overlapping of DMA transfers with kernel calls.

Additionally, I implemented a TensorGroupWaitingStrategy which uses two groups (ExternalToLocal and LocalToExternal). To generate the future allocation only once per group, the futures carry an allocation state. Because, in double-buffering mode, the future for the ingress transfers has to be allocated once before the tiling loop and once inside the tiling loop, the TensorGroupWaitingStrategy actually contains four groups (two ExternalToLocal and two LocalToExternal). The two ExternalToLocal futures map to the same future name but have individual allocation states.
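To make the split concrete, the following hand-written sketch shows the shape of the generated code for one tensor whose transfer the AnydimAsyncDmaTransferAdapter breaks into several hardware transfers; only the middle (transfer) snippets end up inside the adapter's loop, and the names, loop bound, and strides are placeholders:

#include <stdint.h>
#include "mchan_v7.h" // assumed include for the PULP Mchan driver

// Sketch: pre-transfer / transfer / post-transfer snippets for one tensor.
static void transfer_tensor(uint32_t cmd, char *l1_dst, char *l2_src,
                            int numChunks, int l1_stride, int l2_stride) {
  // Pre-transfer snippets: allocate the per-tensor future once.
  uint32_t future = mchan_channel_alloc();

  // Transfer snippets: only this part is wrapped in the adapter's loop when
  // the tensor has more dimensions than the DMA engine supports natively.
  for (int i = 0; i < numChunks; i++) {
    mchan_transfer_2d_ext_strided(cmd, l1_dst + i * l1_stride,
                                  l2_src + i * l2_stride, 64, 128);
  }

  // Post-transfer / wait snippets: a single wait on the same future covers all
  // transfers issued above. In double-buffering mode the wait is emitted in
  // the next loop iteration rather than right after the transfers.
  mchan_channel_wait(future);
  mchan_channel_free(future);
}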

Added

  • Add a dedicated allocation function for DMA futures. (737db19)

Changed

  • Change the DMA transfer functions to return pre-transfer, transfer, and post-transfer code. (737db19)
  • Extend the getFuture functions to also take a transfer direction. (737db19)
  • Use four groups (two ExternalToLocal and two LocalToExternal futures) in TensorGroupWaitingStrategy. (7baa0e3)

Fixed

  • Check that the transfer ID is valid in the Mchan and SnitchDma drivers; see the sketch after this list. (737db19)
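A minimal sketch of the guard, assuming the MCHAN_TRANSFER_ID_MAX bound mentioned in the review summary further down; the actual drivers in mchan_v6.h and mchan_v7.h poll the hardware status register and may differ in detail:

#include <stdint.h>
#include "mchan_v7.h" // assumed include; provides mchan_channel_is_busy

#define MCHAN_TRANSFER_ID_MAX 15

// Sketch of the guard: an out-of-range channel ID (e.g. the (uint32_t)-1
// sentinel used for futures that were never allocated) returns immediately
// instead of waiting forever on a transfer that was never issued.
static inline void mchan_channel_wait_sketch(uint32_t channel_id) {
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return;
  while (mchan_channel_is_busy(channel_id)) {
    // spin until the hardware reports the channel as idle
  }
}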

Known Issues

The current code has several issues that need to be addressed first.

  • To avoid getting stuck during execution, each DMA transfer is followed by a dummy transfer (same parameters, except the length is set to one). This is a hack and adds configuration overhead. (d5823a8)
  • The profiling support for tiled execution on Snitch is very hacky and needs to be cleaned up. (e188356)

Example

The following traces are generated with GVSoC to visualize the new behavior and illustrate the changes introduced. All images have the same time scale. Each trace shows the input transfers for the next tile, followed by the kernel call and the output transfer. The GEMM has three inputs and one output.

All traces were generated by modifying the PULP_Gemm_fp32_fp32_fp32_fp32 kernel to return immediately without computing (a sketch of such a stub is shown after the command) and by running:

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer
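For reference, a sketch of such a stub; the signature and the float32_t typedef below are placeholders and not the actual declaration of PULP_Gemm_fp32_fp32_fp32_fp32 in the target library:

#include <stdint.h>

typedef float float32_t; // placeholder; the real typedef comes from the Deeploy runtime

// Sketch of the measurement stub: the kernel body is replaced by an immediate
// return so the traces capture only DMA transfers and tiling control overhead.
void PULP_Gemm_fp32_stub(const float32_t *A, const float32_t *B, const float32_t *C,
                         float32_t *Y, uint32_t M, uint32_t N, uint32_t O) {
  (void)A; (void)B; (void)C; (void)Y;
  (void)M; (void)N; (void)O;
  return; // skip the actual GEMM computation
}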

New Double-Buffering (PerTensorWaitingStrategy)

Siracusa_testEmptyKernel_doublebuffer_perTensor
Relevant Code
static void _closure(void *_closure_args) {
  // CLOSURE ARG CAST
  _closure_args_t *args = (_closure_args_t *)_closure_args;
  uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;

  // CLOSURE FUNCTION CALL
  uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
  void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
  void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
  void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;

  // Initialize DMA future
  uint32_t B_future = (uint32_t)-1;

  // Initialize DMA future
  uint32_t data_out_future = (uint32_t)-1;

  // Initialize DMA future
  uint32_t A_future = (uint32_t)-1;

  // Initialize DMA future
  uint32_t C_future = (uint32_t)-1;

  // Transfer initial input tile
  A_future = mchan_channel_alloc();
  mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
  DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];

  // Transfer initial input tile
  B_future = mchan_channel_alloc();
  mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];

  // Transfer initial input tile
  C_future = mchan_channel_alloc();
  mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];

  // Wait for initial input tile
  mchan_channel_wait(C_future);
  mchan_channel_free(C_future);

  // Wait for initial input tile
  mchan_channel_wait(A_future);
  mchan_channel_free(A_future);

  // Wait for initial input tile
  mchan_channel_wait(B_future);
  mchan_channel_free(B_future);

  // TILING LOOP
  for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
       TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
      break;
    }

    // Wait for input tile
    mchan_channel_wait(B_future);
    mchan_channel_free(B_future);

    // Wait for input tile
    mchan_channel_wait(A_future);
    mchan_channel_free(A_future);

    // Wait for input tile
    mchan_channel_wait(C_future);
    mchan_channel_free(C_future);

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
        break;
      }

      // Transfer input tile
      A_future = mchan_channel_alloc();
      mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
                        DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
      DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
        break;
      }

      // Transfer input tile
      B_future = mchan_channel_alloc();
      mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
        break;
      }

      // Transfer input tile
      C_future = mchan_channel_alloc();
      mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
                                    DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
    }

    // Wait for output tile
    mchan_channel_wait(data_out_future);
    mchan_channel_free(data_out_future);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
    *DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];

    _cluster_fork_args_t DeeployNetwork__cluster_fork_args =
        (_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};

    pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);

    // Transfer output tile
    data_out_future = mchan_channel_alloc();
    mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
                                  DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
    DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];

    // CLOSE TILING LOOP
  }
  *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;

  // Wait for final output tile
  mchan_channel_wait(data_out_future);
  mchan_channel_free(data_out_future);

  // CLOSURE ARG WRITEBACK
}

New Double-Buffering (TensorGroupWaitingStrategy)

Siracusa_testEmptyKernel_doublebuffer_perGroup
Relevant Code
static void _closure(void *_closure_args) {
  // CLOSURE ARG CAST
  _closure_args_t *args = (_closure_args_t *)_closure_args;
  uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;

  // CLOSURE FUNCTION CALL
  uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
  void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
  void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
  void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;

  // Initialize DMA future
  uint32_t mchan_ExternalToLocal_future = (uint32_t)-1;

  // Initialize DMA future
  uint32_t mchan_LocalToExternal_future = (uint32_t)-1;

  // Transfer initial input tile
  mchan_ExternalToLocal_future = mchan_channel_alloc();
  mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
  DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];

  mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];

  mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];

  // Wait for initial input tile
  mchan_channel_wait(mchan_ExternalToLocal_future);
  mchan_channel_free(mchan_ExternalToLocal_future);

  // TILING LOOP
  for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
       TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
      break;
    }

    // Wait for input tile
    mchan_channel_wait(mchan_ExternalToLocal_future);
    mchan_channel_free(mchan_ExternalToLocal_future);

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
        break;
      }

      // Transfer input tile
      mchan_ExternalToLocal_future = mchan_channel_alloc();
      mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
                        DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
      DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
        break;
      }

      mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
        break;
      }

      mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
                                    DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
    }

    // Wait for output tile
    mchan_channel_wait(mchan_LocalToExternal_future);
    mchan_channel_free(mchan_LocalToExternal_future);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
    *DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];

    _cluster_fork_args_t DeeployNetwork__cluster_fork_args =
        (_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};

    pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);

    // Transfer output tile
    mchan_LocalToExternal_future = mchan_channel_alloc();
    mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
                                  DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
    DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];

    // CLOSE TILING LOOP
  }
  *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;

  // Wait for final output tile
  mchan_channel_wait(mchan_LocalToExternal_future);
  mchan_channel_free(mchan_LocalToExternal_future);

  // CLOSURE ARG WRITEBACK
}

Old Double-Buffering (GlobalWaiting)

Siracusa_testEmptyKernel_doublebuffer_old
Relevant Code
static void _closure(void *_closure_args) {
  // CLOSURE ARG CAST
  _closure_args_t *args = (_closure_args_t *)_closure_args;
  uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;

  // CLOSURE FUNCTION CALL
  uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
  void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
  void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
  void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;

  // INIT Setup
  uint32_t channel_id_future = mchan_channel_alloc();

  // DMA INITIAL
  mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

  // WAIT INITIAL
  mchan_channel_wait(channel_id_future);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
  DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];

  // DMA INITIAL
  mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

  // WAIT INITIAL
  mchan_channel_wait(channel_id_future);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];

  // DMA INITIAL
  mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

  // WAIT INITIAL
  mchan_channel_wait(channel_id_future);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];

  // TILING LOOP
  for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
       TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
      break;
    }

    // DOUBLE BUFFERING CHOOSE BUFFER
    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
      break;
    }

    // WAIT INGRESS
    mchan_channel_wait(channel_id_future);

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
        break;
      }

      // DMA INGRESS
      mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
                        DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
      DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
        break;
      }

      // DMA INGRESS
      mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];

      // DOUBLE BUFFERING CHOOSE BUFFER
      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
        break;
      }

      // DMA INGRESS
      mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
                                    DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
    }

    // WAIT EGRESS
    mchan_channel_wait(channel_id_future);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
    *DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];

    _cluster_fork_args_t DeeployNetwork__cluster_fork_args =
        (_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};

    pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);

    // DMA EGRESS
    mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
                                  DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
    DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];

    // CLOSE TILING LOOP
  }
  *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;

  // WAIT EGRESS
  mchan_channel_wait(channel_id_future);

  // DEINIT Setup
  mchan_channel_free(channel_id_future);

  // CLOSURE ARG WRITEBACK
}

@lukamac lukamac mentioned this pull request Sep 22, 2025
@Xeratec (Member, Author) commented Sep 27, 2025

2025-09-24 (@Xeratec)

I optimized the implementation further and performed profiling. The results confirm that double-buffering (DB) provides clear benefits for memory-bound workloads, while its impact on compute-bound workloads remains limited.

Key Findings

  • Memory-bound kernels (EmptyGEMM): DB reduces runtime by up to 21% with DirectionWaitingStrategy. When comparing SB (L=5000) against DB (L=10000) (same number of tiles), DB is ~26–29% faster in the new implementation. On devel, SB was even slightly faster (-1.7%) due to the incorrect DMA synchronization.
  • Compute-bound kernels (FloatGEMM): DB yields only ~2% improvement, and in some cases slightly higher latency than SB with the same number of tiles. This is consistent with expectations, as the additional tiling and control overhead dominates. I will investigate this aspect in a separate PR.
  • Tile equivalence (L=5000 SB vs. L=10000 DB): These experiments allow a fair comparison between DB and SB, showing that DB’s advantage is specific to memory-bound scenarios.

Results

Double-Buffer (DirectionWaitingStrategy)

Test                       Devel    New      Change
testFloatGEMM (L1=10000)   140789   137984   -2.0%
testEmptyGEMM              7391     5811     -21.4%

Single-Buffer (DirectionWaitingStrategy)

Test                       Devel    New      Change
testFloatGEMM (L1=10000)   117685   117806   +0.1%
testEmptyGEMM (L1=10000)   6408     6424     +0.3%
testFloatGEMM (L1=5000)    140585   140564   -0.0%
testEmptyGEMM (L1=5000)    7263     7309     +0.6%

Double-Buffer (PerTensorWaitingStrategy)

Test                       Devel   New      Change
testFloatGEMM (L1=10000)   -       138456   -
testEmptyGEMM (L1=10000)   -       5802     -

Single-Buffer (PerTensorWaitingStrategy)

Test                       Devel   New      Change
testFloatGEMM (L1=10000)   -       117908   -
testEmptyGEMM (L1=10000)   -       6550     -
testFloatGEMM (L1=5000)    -       140846   -
testEmptyGEMM (L1=5000)    -       7457     -

Example

  • EmptyGEMM
  • L1=10000
  • Double-Buffer
  • PerTensorWaitingStrategy
Siracusa_testEmptyKernel_doublebuffer_perTensor_optimized

@Xeratec Xeratec marked this pull request as ready for review September 27, 2025 16:43
@Xeratec Xeratec moved this to Need Reviewer in Deeploy Sep 27, 2025
@Xeratec Xeratec changed the title from "DRAFT: Support Fully Asynchronous DMAs" to "Support Fully Asynchronous DMAs" Sep 27, 2025
@Xeratec Xeratec added the Bug Something isn't working label Sep 27, 2025
@coderabbitai bot (Contributor) commented Sep 27, 2025

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added optional tiling profiling for single/double buffering across PULP and Snitch targets, with cycle-difference reporting.
  • Improvements
    • More robust double-buffering codegen with per-buffer switching and clearer DMA sequencing.
    • Async DMA gains direction-aware waiting strategies and explicit future allocation, improving transfer orchestration.
  • Bug Fixes
    • Guarded DMA waits to avoid blocking on invalid/no-op transfers.
    • Safer MCHAN handling with bounds checks to prevent out-of-range channel operations.
  • Performance
    • Refined timing: unconditional network run timing and corrected per-core performance counters on Snitch.

Walkthrough

Refactors tiling and DMA codegen: separates profiling mixins from core SB/DB classes, adds direction-aware AsyncDma with alloc/wait semantics, returns alloc/dma/deinit triplets from transfers, updates Snitch pipeline/timers, and adds PULP mchan channel-id bounds checks.

Changes

Cohort / File(s) Summary of Changes
Tiling prototypes & core
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py, Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py
Reworked codegen to use explicit lists of CodeSnippet groups (open/ingress/egress/close/setup/teardown), removed generateInnerCode, added abstract _tilingLoop, flattened transfer snippet returns and updated related APIs.
Single-buffering codegen + profiling
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
Added ProfilingSingleBufferingTilingMixIn, profiling templates and hooks; _generateTransferScheduleCalls adds optional comment and Future handling; tiling loop emits structured snippet lists with profiling measurement wrappers.
Double-buffering codegen + profiling
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
_generateBufferChoice now returns per-buffer blocks; added switch/case buffer-selection helpers and _switch; introduced ProfilingDoubleBufferingTilingMixIn and profiling injection points.
Async DMA, futures & adapters
Deeploy/TilingExtension/AsyncDma.py
Future gains _allocTemplate and alloc(); waiting strategies accept DmaDirection; introduced DirectionWaitingStrategy and BarrierWaitingStrategy; transfer APIs accept direction/future/comment and return (alloc, dma, deinit) triplets; adapters updated to propagate direction/future.
PULP DMA implementations
Deeploy/Targets/PULPOpen/DMA/L3Dma.py, Deeploy/Targets/PULPOpen/DMA/MchanDma.py
L3DmaFuture: init/wait templates adjusted and _allocTemplate added with guarded wait; MchanDma switched to DirectionWaitingStrategy and updated channel lifecycle templates (alloc/init/deinit/wait) including new alloc template and guarded behavior.
Snitch DMA implementation
Deeploy/Targets/Snitch/DMA/SnitchDma.py
Switched waiting strategy to PerTensorWaitingStrategy; SnitchFuture/BarrierFuture add _allocTemplate and guarded init/wait; transfer templates updated (including 2D start hack); transferOpRepr includes future reference.
Target tiling passes (PULP & Snitch)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py, Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py, Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py
Removed mixin multi-inheritance from SB/DB classes (now inherit only core SB/DB codegen); added profiling SB/DB subclasses that include Profiling mixins; cluster tiling dispatches to profiling variants when profiling enabled.
Snitch bindings & pipeline
Deeploy/Targets/Snitch/Bindings.py
Removed SnitchProfileExecutionBlockPass import and instantiation from the transformation pipeline.
Runtime / tests
DeeployTest/Platforms/Snitch/main.c, TargetLibraries/Snitch/src/CycleCounter.c
Timer start/stop now wrap RunNetwork unconditionally with added barriers; CycleCounter adapted to per-core init/end/running state and CSR-based deltas.
PULP mchan headers
TargetLibraries/PULPOpen/inc/mchan_v6.h, TargetLibraries/PULPOpen/inc/mchan_v7.h
Added MCHAN_TRANSFER_ID_MAX (15) and bounds checks with early returns in channel_free, channel_is_busy, and channel_wait.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant TG as TilingCodeGeneration
  participant TA as AnydimAsyncDmaTransferAdapter
  participant DMA as AsyncDma
  participant WS as WaitingStrategy
  participant F as Future

  TG->>TA: transfer(tensor,..., direction, comment)
  TA->>DMA: getFuture(tensorName, direction)
  DMA->>WS: getFuture(tensorName, direction)
  WS-->>DMA: Future
  DMA-->>TA: Future
  TA->>DMA: transfer(..., direction, future=F, comment)
  DMA-->>TA: (alloc snippets, dma snippets, deinit snippets)
  TA-->>TG: (alloc, dma, deinit) snippets
  Note over TG: TG collects open/ingress/egress/close lists and delegates to profiling or core mixin
sequenceDiagram
  autonumber
  participant CT as ClusterTilingPass
  participant SB as SB Codegen
  participant DB as DB Codegen
  participant PSB as Profiling SB MixIn
  participant PDB as Profiling DB MixIn

  CT->>CT: check tilingProfiling flag
  alt profiling enabled
    CT->>PSB: generate setup/loop with profiling hooks
    CT->>PDB: generate setup/loop with profiling hooks
  else profiling disabled
    CT->>SB: generate normal SB loop
    CT->>DB: generate normal DB loop
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

  • Improve Double-Buffering DMA #120 — Changes implement direction-aware waiting strategies, per-future alloc/wait semantics and double-buffering scheduling consistent with the issue's objectives.

Possibly related PRs

Suggested reviewers

  • Victor-Jung
  • lukamac

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 1.64% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly captures the primary enhancement of this changeset, which is to enable fully asynchronous DMA operations, and aligns with the main DMA code generation and waiting‐strategy improvements detailed in the PR. Although the PR also refactors tiling profiling, the core feature is correctly summarized by “Support Fully Asynchronous DMAs.”
Description Check ✅ Passed The description clearly outlines the DMA code generation improvements for overlapping transfers, the introduction of new waiting strategies, and the refactoring of the Snitch tiling profiling infrastructure, directly reflecting the scope of changes in the pull request.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (1)

23-28: Ensure profiling mixins precede the code-gen base class.

Same MRO problem here: the profiling overrides never run because the concrete code-generation class is listed first. Swap the inheritance order so profiling still injects its measurement code.

-class ProfilingPULPL3TilingGenerationDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingPULPL3TilingGenerationDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
     pass
🧹 Nitpick comments (1)
Deeploy/TilingExtension/AsyncDma.py (1)

190-194: Preserve DMA comment on blocking wait.

Future.wait now accepts the comment payload, but the blocking adapter drops it by calling future.wait() without forwarding the string. As a result the generated blocking path loses the annotation while alloc/transfer keep it. Please pass the comment through so the wait snippet stays documented.

-        callStack.append(future.wait())
+        callStack.append(future.wait(comment))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2fa3c31 and 7262da8.

📒 Files selected for processing (16)
  • Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (1 hunks)
  • Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (1 hunks)
  • Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2 hunks)
  • Deeploy/Targets/PULPOpen/DMA/MchanDma.py (3 hunks)
  • Deeploy/Targets/Snitch/Bindings.py (1 hunks)
  • Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1 hunks)
  • Deeploy/Targets/Snitch/DMA/SnitchDma.py (2 hunks)
  • Deeploy/TilingExtension/AsyncDma.py (8 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (5 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (3 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3 hunks)
  • DeeployTest/Platforms/Snitch/main.c (1 hunks)
  • TargetLibraries/PULPOpen/inc/mchan_v6.h (2 hunks)
  • TargetLibraries/PULPOpen/inc/mchan_v7.h (2 hunks)
  • TargetLibraries/Snitch/src/CycleCounter.c (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.

Applied to files:

  • Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.

Applied to files:

  • Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py
🧬 Code graph analysis (15)
TargetLibraries/PULPOpen/inc/mchan_v7.h (1)
TargetLibraries/PULPOpen/inc/mchan_v6.h (2)
  • mchan_channel_is_busy (77-81)
  • mchan_channel_wait (83-93)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (6)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
  • _tilingLoop (91-145)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
  • _tilingLoop (76-304)
Deeploy/DeeployTypes.py (6)
  • NetworkContext (545-1054)
  • executionBlock (1562-1565)
  • ExecutionBlock (1412-1538)
  • copy (1050-1054)
  • VariableBuffer (227-375)
  • CodeSnippet (37-40)
Deeploy/TilingExtension/TilingCodegen.py (3)
  • TilingSchedule (53-122)
  • VariableReplacementScheme (126-158)
  • HyperRectangle (24-35)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
  • _hoistTileNumAndIdxPtr (77-101)
Deeploy/TilingExtension/AsyncDma.py (4)
  • Future (16-39)
  • transfer (119-133)
  • transfer (170-195)
  • transfer (247-325)
DeeployTest/Platforms/Snitch/main.c (1)
TargetLibraries/Snitch/src/CycleCounter.c (4)
  • ResetTimer (16-26)
  • StartTimer (28-34)
  • getCycles (46-55)
  • StopTimer (36-44)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (4)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • DoubleBufferingTilingCodeGeneration (32-304)
  • ProfilingDoubleBufferingTilingMixIn (307-379)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • SingleBufferingTilingCodeGeneration (31-145)
  • ProfilingSingleBufferingTilingMixIn (148-206)
Deeploy/DeeployTypes.py (8)
  • CodeTransformationPass (2249-2280)
  • apply (2155-2169)
  • apply (2204-2220)
  • apply (2256-2280)
  • NetworkContext (545-1054)
  • executionBlock (1562-1565)
  • ExecutionBlock (1412-1538)
  • CodeGenVerbosity (44-50)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (1)
  • apply (252-307)
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (3)
Deeploy/TilingExtension/AsyncDma.py (5)
  • AsyncDma (85-133)
  • DirectionWaitingStrategy (59-71)
  • Future (16-39)
  • transferOpRepr (108-117)
  • transferOpRepr (158-168)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
  • transferOpRepr (62-79)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)
  • transferOpRepr (70-89)
Deeploy/Targets/Snitch/Bindings.py (1)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterSynch.py (1)
  • SnitchSynchCoresPass (15-24)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (5)
Deeploy/DeeployTypes.py (9)
  • CodeSnippet (37-40)
  • ExecutionBlock (1412-1538)
  • NetworkContext (545-1054)
  • NodeTemplate (82-224)
  • VariableBuffer (227-375)
  • _ReferenceBuffer (512-542)
  • add (715-752)
  • init (260-269)
  • executionBlock (1562-1565)
Deeploy/TilingExtension/AsyncDma.py (13)
  • AsyncDma (85-133)
  • EmptyFuture (136-141)
  • Future (16-39)
  • getFuture (48-49)
  • getFuture (54-56)
  • getFuture (69-71)
  • getFuture (80-82)
  • getFuture (92-93)
  • getFuture (155-156)
  • getFuture (233-234)
  • wait (38-39)
  • init (29-30)
  • deinit (35-36)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
  • TilingCodeGeneration (40-307)
  • _generateDmaTransferCalls (113-170)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
  • TilingHoistingMixIn (31-153)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
  • ProfilingPrototypeMixIn (97-198)
  • PrototypeTilingMixIn (53-94)
  • TilingMetaInfo (13-20)
  • generateAllTilingCode (83-94)
  • generateSetupAndTeardownCode (56-66)
  • measurementArrayDeclaration (100-131)
  • injectPrintCycleDiff (134-181)
  • generateLoopCode (69-80)
  • kernelProfilingWrap (184-198)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
Deeploy/TilingExtension/AsyncDma.py (6)
  • AsyncDma (85-133)
  • Future (16-39)
  • PerTensorWaitingStrategy (52-56)
  • _transferTemplates (152-153)
  • transferOpRepr (108-117)
  • transferOpRepr (158-168)
Deeploy/DeeployTypes.py (2)
  • NodeTemplate (82-224)
  • VariableBuffer (227-375)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
Deeploy/DeeployTypes.py (2)
  • NodeTemplate (82-224)
  • VariableBuffer (227-375)
Deeploy/TilingExtension/AsyncDma.py (5)
  • AsyncDma (85-133)
  • _transferTemplates (152-153)
  • transferOpRepr (108-117)
  • transferOpRepr (158-168)
  • Future (16-39)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3)
Deeploy/DeeployTypes.py (5)
  • CodeSnippet (37-40)
  • ExecutionBlock (1412-1538)
  • executionBlock (1562-1565)
  • addLeft (1435-1448)
  • addRight (1450-1463)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • generateLoopCode (165-206)
  • generateSetupAndTeardownCode (151-162)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • generateLoopCode (336-379)
  • generateSetupAndTeardownCode (310-333)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (4)
Deeploy/TilingExtension/AsyncDma.py (17)
  • AsyncDma (85-133)
  • AnydimAsyncDmaTransferAdapter (198-325)
  • EmptyFuture (136-141)
  • Future (16-39)
  • getFuture (48-49)
  • getFuture (54-56)
  • getFuture (69-71)
  • getFuture (80-82)
  • getFuture (92-93)
  • getFuture (155-156)
  • getFuture (233-234)
  • transfer (119-133)
  • transfer (170-195)
  • transfer (247-325)
  • wait (38-39)
  • init (29-30)
  • deinit (35-36)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
  • TilingCodeGeneration (40-307)
  • _generateDmaTransferCalls (113-170)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
  • TilingHoistingMixIn (31-153)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
  • ProfilingPrototypeMixIn (97-198)
  • PrototypeTilingMixIn (53-94)
  • TilingMetaInfo (13-20)
  • generateAllTilingCode (83-94)
  • generateSetupAndTeardownCode (56-66)
  • measurementArrayDeclaration (100-131)
  • injectPrintCycleDiff (134-181)
  • generateLoopCode (69-80)
  • kernelProfilingWrap (184-198)
TargetLibraries/PULPOpen/inc/mchan_v6.h (1)
TargetLibraries/PULPOpen/inc/mchan_v7.h (2)
  • mchan_channel_is_busy (103-107)
  • mchan_channel_wait (109-119)
Deeploy/TilingExtension/AsyncDma.py (5)
Deeploy/DeeployTypes.py (7)
  • NodeTemplate (82-224)
  • init (260-269)
  • CodeSnippet (37-40)
  • alloc (271-282)
  • alloc (502-505)
  • VariableBuffer (227-375)
  • NetworkContext (545-1054)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
  • transferOpRepr (62-79)
  • checkTransfer (53-60)
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
  • transferOpRepr (63-95)
  • checkTransfer (49-61)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
  • transferOpRepr (70-89)
  • checkTransfer (64-68)
Deeploy/TilingExtension/TilingCodegen.py (2)
  • padShape (202-207)
  • padStride (218-223)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • DoubleBufferingTilingCodeGeneration (32-304)
  • ProfilingDoubleBufferingTilingMixIn (307-379)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • SingleBufferingTilingCodeGeneration (31-145)
  • ProfilingSingleBufferingTilingMixIn (148-206)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • DoubleBufferingTilingCodeGeneration (32-304)
  • ProfilingDoubleBufferingTilingMixIn (307-379)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • SingleBufferingTilingCodeGeneration (31-145)
  • ProfilingSingleBufferingTilingMixIn (148-206)
🪛 Ruff (0.13.1)
Deeploy/Targets/Snitch/DMA/SnitchDma.py

46-58: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

78-78: Unused method argument: variableReplacement

(ARG002)

@lukamac
Collaborator

lukamac commented Sep 27, 2025

The full review will arrive; this is just a big nit from me that I noticed earlier and tried to address with a different approach in my own attempt at the PR.

I really don't like the comments in the DMA templates.

I would much rather we take my approach of adding a comment as its own CodeSnippet on top of a group of statements. It keeps the templates much smaller and function calls cleaner while still preserving the desired behavior of having comments.
I want the users (a person developing a DMA driver) to have to implement the smallest possible thing and reduce boilerplate there, when we can do it in the backend for all of them.

It's obviously not a blocker, but I dislike it enough that I wanted to comment on it right away.

@Xeratec
Member Author

Xeratec commented Sep 27, 2025

The full review will arrive; this is just a big nit from me that I noticed earlier and tried to address with a different approach in my own attempt at the PR.

I really don't like the comments in the DMA templates.

I would much rather we take my approach of adding a comment as its own CodeSnippet on top of a group of statements. It keeps the templates much smaller and function calls cleaner while still preserving the desired behavior of having comments. I want the users (a person developing a DMA driver) to have to implement the smallest possible thing and reduce boilerplate there, when we can do it in the backend for all of them.

It's obviously not a blocker, but I dislike it enough that I wanted to comment on it right away.

I agree. I can adjust the comment handling. I also just realized that I have to take a look at the Timer function in the Snitch platform. I will work on this early next week. You can still give it a pass now or wait until after these changes, whichever you prefer.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)

10-60: Don't replace reused DMA futures with EmptyFuture.

When a waiting strategy hands back the same Future object for multiple tensors (e.g., direction/tensor-group strategies still present in the tree), the second invocation goes through the EmptyFuture branch. That causes the DMA templates to render ${future} as an empty string. In practice, the generated C ends up as = snrt_dma_start_2d(...), which does not compile, and we also lose the handle needed for the later wait. The single-buffer path worked before because it always threaded the real Future through the transfer calls. Please keep using the real Future for code generation and let the set only deduplicate init/deinit/wait bookkeeping.

[suggested fix:]

-from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, EmptyFuture, Future
+from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future
@@
-            future = self.dma.getFuture(tensorName, direction)
-            _future = set([future]) - futures
-            _future = _future.pop() if len(_future) > 0 else EmptyFuture("")
-            futures.add(future)
-
-            callStack.extend(
-                self._generateDmaTransferCalls(ctxt, tensorName, rectangles, tileIdxVar, localBuffer, externalBufferRef,
-                                               direction, _future))
+            future = self.dma.getFuture(tensorName, direction)
+            futures.add(future)
+
+            callStack.extend(
+                self._generateDmaTransferCalls(ctxt, tensorName, rectangles, tileIdxVar, localBuffer, externalBufferRef,
+                                               direction, future))
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)

11-180: Keep emitting the real future when reusing DMA handles.

Just like in the SB generator, swapping reused futures for EmptyFuture means the DMA templates no longer receive a variable name. With Snitch/MCHAN templates that expand ${future} into the left-hand side, the generated code turns into = snrt_dma_start_2d(...), which is invalid and drops the txid needed for later waits. Please always pass the real Future to the transfer helpers; the set already ensures we only init/deinit each handle once.

[suggested fix:]

-from Deeploy.TilingExtension.AsyncDma import AnydimAsyncDmaTransferAdapter, AsyncDma, EmptyFuture, Future
+from Deeploy.TilingExtension.AsyncDma import AnydimAsyncDmaTransferAdapter, AsyncDma, Future
@@
-            future = self.dma.getFuture(tensorName, "ExternalToLocal")
-            # Extract the future that is not already in the set of ingress futures
-            _future = set([future]) - ingressFutures
-            _future = _future.pop() if len(_future) > 0 else EmptyFuture("")
-            ingressFutures.add(future)
+            future = self.dma.getFuture(tensorName, "ExternalToLocal")
+            ingressFutures.add(future)
@@
-            initialDmaTransferCalls = anydimAdapter.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
-                                                             stridesFromShape(externalBufferShape),
-                                                             stridesFromShape(rectangles[0].dims), "ExternalToLocal",
-                                                             _future, math.prod(externalBufferShape))
+            initialDmaTransferCalls = anydimAdapter.transfer(ctxt, externalBufferRef, localBuffer, rectangles[0].dims,
+                                                             stridesFromShape(externalBufferShape),
+                                                             stridesFromShape(rectangles[0].dims), "ExternalToLocal",
+                                                             future, math.prod(externalBufferShape))
@@
-            ingressDMAStatements.append(_future.wait())
+            ingressDMAStatements.append(future.wait())
@@
-                self._generateDmaTransferCalls(ctxt, tensorName, rectangles, "TILING_I+1", nextLocalBufferReference,
-                                               externalBufferRef, "ExternalToLocal", _future))
+                self._generateDmaTransferCalls(ctxt, tensorName, rectangles, "TILING_I+1", nextLocalBufferReference,
+                                               externalBufferRef, "ExternalToLocal", future))
🧹 Nitpick comments (1)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

68-108: Eliminate duplicated _measureCycles template definition.

_measureCycles is defined twice back-to-back with identical bodies. The second definition silently overwrites the first, which is confusing to readers and will trigger Ruff’s B024 warning. Please drop the redundant copy to keep the mixin lean.

Apply this diff:

-    _measureCycles = NodeTemplate("""
-    ${measurements}[${tileIdxVar}] = getCycles();
-    """)
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7262da8 and afca596.

📒 Files selected for processing (11)
  • Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1 hunks)
  • Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2 hunks)
  • Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1 hunks)
  • Deeploy/Targets/Snitch/DMA/SnitchDma.py (2 hunks)
  • Deeploy/TilingExtension/AsyncDma.py (9 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (5 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9 hunks)
  • DeeployTest/Platforms/Snitch/main.c (1 hunks)
  • TargetLibraries/Snitch/src/CycleCounter.c (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py
  • Deeploy/Targets/PULPOpen/DMA/MchanDma.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.
🧬 Code graph analysis (8)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
Deeploy/DeeployTypes.py (1)
  • NodeTemplate (82-224)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (5)
Deeploy/TilingExtension/AsyncDma.py (13)
  • AsyncDma (85-120)
  • EmptyFuture (123-128)
  • Future (16-39)
  • getFuture (48-49)
  • getFuture (54-56)
  • getFuture (69-71)
  • getFuture (80-82)
  • getFuture (92-93)
  • getFuture (142-143)
  • getFuture (200-201)
  • wait (38-39)
  • init (29-30)
  • deinit (35-36)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
  • TilingCodeGeneration (40-289)
  • _generateDmaTransferCalls (115-152)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (2)
  • TilingHoistingMixIn (31-153)
  • dictOfArrays (19-28)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
  • ProfilingPrototypeMixIn (67-208)
  • PrototypeTilingMixIn (23-64)
  • TilingMetaInfo (13-20)
  • generateAllTilingCode (53-64)
  • generateSetupAndTeardownCode (26-36)
  • measurementArrayDeclaration (110-141)
  • injectPrintCycleDiff (144-191)
  • generateLoopCode (39-50)
  • kernelProfilingWrap (194-208)
Deeploy/DeeployTypes.py (5)
  • add (715-752)
  • CodeSnippet (37-40)
  • init (260-269)
  • executionBlock (1562-1565)
  • ExecutionBlock (1412-1538)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (6)
Deeploy/TilingExtension/AsyncDma.py (17)
  • AsyncDma (85-120)
  • AnydimAsyncDmaTransferAdapter (165-271)
  • EmptyFuture (123-128)
  • Future (16-39)
  • getFuture (48-49)
  • getFuture (54-56)
  • getFuture (69-71)
  • getFuture (80-82)
  • getFuture (92-93)
  • getFuture (142-143)
  • getFuture (200-201)
  • transfer (113-120)
  • transfer (150-162)
  • transfer (214-271)
  • wait (38-39)
  • init (29-30)
  • deinit (35-36)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4)
  • TilingCodeGeneration (40-289)
  • _tilingLoop (67-71)
  • _generateDmaTransferCalls (115-152)
  • _generateExternalReferenceUpdate (154-175)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (2)
  • TilingHoistingMixIn (31-153)
  • dictOfArrays (19-28)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
  • ProfilingPrototypeMixIn (67-208)
  • PrototypeTilingMixIn (23-64)
  • TilingMetaInfo (13-20)
  • generateAllTilingCode (53-64)
  • generateSetupAndTeardownCode (26-36)
  • measurementArrayDeclaration (110-141)
  • injectPrintCycleDiff (144-191)
  • generateLoopCode (39-50)
  • kernelProfilingWrap (194-208)
Deeploy/DeeployTypes.py (12)
  • NodeTemplate (82-224)
  • CodeSnippet (37-40)
  • VariableBuffer (227-375)
  • _ReferenceBuffer (512-542)
  • NetworkContext (545-1054)
  • executionBlock (1562-1565)
  • ExecutionBlock (1412-1538)
  • add (715-752)
  • init (260-269)
  • nodeName (1568-1571)
  • addLeft (1435-1448)
  • addRight (1450-1463)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (3)
  • _tilingLoop (68-130)
  • generateSetupAndTeardownCode (136-147)
  • generateLoopCode (150-191)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)
Deeploy/TilingExtension/AsyncDma.py (4)
  • AsyncDma (85-120)
  • Future (16-39)
  • PerTensorWaitingStrategy (52-56)
  • _transferTemplates (139-140)
Deeploy/TilingExtension/AsyncDma.py (4)
Deeploy/DeeployTypes.py (6)
  • NodeTemplate (82-224)
  • alloc (271-282)
  • alloc (502-505)
  • CodeSnippet (37-40)
  • VariableBuffer (227-375)
  • NetworkContext (545-1054)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
  • checkTransfer (40-47)
  • transferOpRepr (49-60)
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
  • checkTransfer (34-46)
  • transferOpRepr (48-74)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
  • checkTransfer (47-51)
  • transferOpRepr (53-65)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3)
Deeploy/DeeployTypes.py (7)
  • CodeSnippet (37-40)
  • ExecutionBlock (1412-1538)
  • executionBlock (1562-1565)
  • addLeft (1435-1448)
  • addRight (1450-1463)
  • NodeTemplate (82-224)
  • nodeName (1568-1571)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • generateLoopCode (314-357)
  • generateSetupAndTeardownCode (288-311)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • generateLoopCode (150-191)
  • generateSetupAndTeardownCode (136-147)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (5)
Deeploy/DeeployTypes.py (4)
  • NodeTemplate (82-224)
  • NetworkContext (545-1054)
  • executionBlock (1562-1565)
  • copy (1050-1054)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
  • _tilingLoop (64-282)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
  • _tilingLoop (68-130)
Deeploy/TilingExtension/TilingCodegen.py (1)
  • TilingSchedule (53-122)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
  • _hoistTileNumAndIdxPtr (77-101)
DeeployTest/Platforms/Snitch/main.c (1)
TargetLibraries/Snitch/src/CycleCounter.c (3)
  • StartTimer (28-36)
  • getCycles (48-55)
  • StopTimer (38-46)
🪛 Ruff (0.13.1)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py

90-91: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


99-100: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

66-66: Unused method argument: variableReplacement

(ARG002)


255-255: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Deeploy/Targets/Snitch/DMA/SnitchDma.py

32-41: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py

67-67: ProfilingPrototypeMixIn is an abstract base class, but it has no abstract methods or properties

(B024)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-kernels / test-runner-siracusa
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-kernels / test-runner-siracusa
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-kernels / test-runner-siracusa
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
  • GitHub Check: siracusa-kernels / test-runner-siracusa
  • GitHub Check: siracusa-neureka-kernels-tiled-singlebuffer-L2 / test-runner-siracusa-neureka-tiled
🔇 Additional comments (2)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)

15-24: Zero-initialize L3 request and guard waits to avoid deadlocks

Nice catch: zeroing the pi_cl_ram_req_t upfront and skipping pi_cl_ram_copy_wait when size == 0 protects the PerTensor strategy from stalling on futures that never issued a transfer. This keeps first-use futures safe without regressing real transfers whose size stays non-zero.
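For illustration, the guarded wait could be sketched roughly as below. This is not the actual L3Dma.py content; the template text and the assumption that the zero-initialized pi_cl_ram_req_t exposes a size field are mine.

from Deeploy.DeeployTypes import NodeTemplate

# Zero-initialize the request so that a future which never issued a transfer
# keeps size == 0 (field name assumed) and is safe to "wait" on.
_l3FutureInitTemplate = NodeTemplate("""
pi_cl_ram_req_t ${future} = {0};
""")

# Only wait on futures that actually started a transfer; otherwise skip the wait.
_l3FutureWaitTemplate = NodeTemplate("""
if (${future}.size > 0) {
    pi_cl_ram_copy_wait(&${future});
}
""")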

DeeployTest/Platforms/Snitch/main.c (1)

80-94: Revisit DM-core guard for timer teardown

Reiterating the earlier feedback: StopTimer() still tears down the shared perf counter via snrt_stop_perf_counter. With the call now executed on every core, any compute core that returns from RunNetwork ahead of the DM controller can reach StopTimer() first and halt the counter before the DM core samples getCycles(), truncating the measured runtime whenever that race occurs. Please keep the reset/start/stop lifecycle wrapped in the DM-core guard (or otherwise show that the counter cannot be stopped by non-DM cores) before this lands.

@lukamac
Collaborator

lukamac commented Sep 27, 2025

I agree. I can adjust the comment handling. I also just realized that I have to take a look at the Timer function in the Snitch platform. I will work on this early next week. You can still give it a pass now or wait until after these changes, whichever you prefer.

If you don't have the energy, I can also change the comments (in this PR or a follow-up). I'll wait for the rest, then. Thanks for the effort!

@Xeratec
Member Author

Xeratec commented Sep 29, 2025

If you don't have the energy, I can also change the comments (in this PR or a follow-up). I'll wait for the rest, then. Thanks for the effort!

I already implemented all the suggested changes in afca596.

@Xeratec Xeratec force-pushed the pr/fix_doublebuffer branch from 6180e28 to 0dae064 on September 29, 2025, 15:59
@Xeratec Xeratec moved this from Need Reviewer to In review in Deeploy Sep 29, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
TargetLibraries/Snitch/src/CycleCounter.c (1)

16-26: Guard perf-counter reset to DM core for symmetry and safety.

Start/Stop are gated by snrt_is_dm_core(), but Reset isn’t. Resetting perf counters from compute cores can clobber the DM core’s counter state. Gate Reset as well.

 void ResetTimer() {
-  snrt_reset_perf_counter(SNRT_PERF_CNT0);
+  if (snrt_is_dm_core()) {
+    snrt_reset_perf_counter(SNRT_PERF_CNT0);
+  }
   uint32_t const core_id = snrt_global_core_idx();
   ...
 }
🧹 Nitpick comments (1)
TargetLibraries/Snitch/src/CycleCounter.c (1)

9-15: Optional: mitigate 32-bit counter wrap and shrink running[].

  • Switch to 64-bit cycle/instruction counters (mcycleh/minstreth) when __riscv_xlen == 32, or document the 32-bit wrap (~8.6 s @500 MHz).
  • Change running[NUM_CORES] from uint32_t to uint8_t (or bool) to save L1 memory.
  • NUM_CORES already includes the DM core (compute cores = NUM_CORES - 1), so no build-system changes needed.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6180e28 and 0dae064.

📒 Files selected for processing (1)
  • TargetLibraries/Snitch/src/CycleCounter.c (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.
Learnt from: Xeratec
PR: pulp-platform/Deeploy#105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
🔇 Additional comments (4)
TargetLibraries/Snitch/src/CycleCounter.c (4)

29-31: LGTM: DM-only start of perf counter.

Gating snrt_start_perf_counter(...) with snrt_is_dm_core() avoids compute-core interference.


39-41: LGTM: DM-only stop of perf counter.

Symmetric to Start; avoids cross-core counter side effects.


42-45: Correct buffer for minstret end snapshot — resolved.

instr_end[core_id] = read_csr(minstret); fixes the earlier overwrite bug.


48-65: Elapsed-counter semantics restored for getCycles/getInstr.

Delta logic when running vs. stopped is back; profiling users get elapsed values again.

@Xeratec
Member Author

Xeratec commented Oct 1, 2025

2025-10-01 (@Xeratec)

After discussion with @lukamac, we identified that the current DMA transfer scheduling is overly strict with synchronization. For example, we always wait for input tile t to finish before starting the transfer of the next tile t+1, even though the hardware could already handle this in parallel. This conservative scheduling guarantees correctness, but it prevents us from taking full advantage of double-buffering.

With proper double-buffering, we should instead keep two futures for inputs and two for outputs. That way, the DMA engine can already start moving data for tile t+1 while the compute kernel is still working on tile t, and simultaneously flush out results from tile t-1. This enables true overlap of compute, prefetch, and drain.

See #116, which previously attempted to implement this scheduling scheme. This PR shows the correct approach:

  • Introduces ping-pong buffers for A/B/C and outputs.
  • Uses distinct channels per parity (0/1) to track DMA ownership.
  • Moves synchronization points from “wait before start” to “wait before reuse,” ensuring correctness while maximizing overlap.

In theory, this is the optimal scheduling scheme. However, for most current workloads, the DMA transfers are typically not longer than the control overhead plus kernel execution, so runtime performance is not affected. Additionally, MCHAN only supports one in-flight transaction per direction. The benefit is primarily architectural cleanliness and readiness for cases where transfer time becomes a bottleneck.

Manual Improved Code
static void _closure(void *_closure_args) {
  // CLOSURE ARG CAST
  _closure_args_t *args = (_closure_args_t *)_closure_args;
  uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;

  // CLOSURE FUNCTION CALL
  uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
  void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
  void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
  void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;

  // Initialize DMA future
  uint32_t channel_input_0 = (uint32_t)-1;
  uint32_t channel_output_0 = (uint32_t)-1;
  uint32_t channel_input_1 = (uint32_t)-1;
  uint32_t channel_output_1 = (uint32_t)-1;

  // Transfer initial input tile
  channel_input_0 = mchan_channel_alloc();
  mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
  DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];

  // Transfer initial input tile
  mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];

  // Transfer initial input tile

  mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];

  // TILING LOOP
  for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
       TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {

    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
      break;
    }

    // Wait for current input tile

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      switch ((TILING_I + 1) % 2) {
        case 0:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
        channel_input_1 = mchan_channel_alloc();
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
        channel_input_0 = mchan_channel_alloc();
        break;
      }

      // Transfer next input tile
      mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
                        DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
      DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];
    }

    // Wait for current input tile

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
        break;
      }

      // Transfer next input tile

      mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];
    }

    // Wait for current input tile

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
        break;
      }

      // Transfer next input tile

      mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
                                    DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
    }

    if (TILING_I > 1) {
        // Wait for current output tile
        if (TILING_I % 2) {
            mchan_channel_wait(channel_output_0);
            mchan_channel_free(channel_output_0);
        } else {
            mchan_channel_wait(channel_output_1);
            mchan_channel_free(channel_output_1);
        }
    }
    if (TILING_I % 2) {
        mchan_channel_wait(channel_input_0);
        mchan_channel_free(channel_input_0);
    } else {
        mchan_channel_wait(channel_input_1);
        mchan_channel_free(channel_input_1);
    }

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
    *DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];

    _cluster_fork_args_t DeeployNetwork__cluster_fork_args =
        (_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};

    pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);

    // Transfer current output tile
    if (TILING_I % 2) {
        channel_output_0 = mchan_channel_alloc();
    } else {
        channel_output_1 = mchan_channel_alloc();
    }

    mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
                                  DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
    DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];

    // CLOSE TILING LOOP
  }
  *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;

  // Wait for final output tile
  mchan_channel_wait(channel_output_0);
  mchan_channel_free(channel_output_0);
  mchan_channel_wait(channel_output_1);
  mchan_channel_free(channel_output_1);

  // Deinitialize DMA future

  // CLOSURE ARG WRITEBACK
}
Current Code
static void _closure(void *_closure_args) {
  // CLOSURE ARG CAST
  _closure_args_t *args = (_closure_args_t *)_closure_args;
  uint8_t *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr = args->DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr;

  // CLOSURE FUNCTION CALL
  uint16_t *DeeployNetwork_TILING_CODEGEN_L1__M_ref = (uint16_t *)DeeployNetwork_TILING_CODEGEN_L1__M + 0;
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 5504);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 1408);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 8320);
  float32_t *DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)((char *)DeeployNetwork_MEMORYARENA_L1 + 0);
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_ref = (void *)DeeployNetwork_input_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__A_ref + 1408;
  void *DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref = (void *)DeeployNetwork_weight_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__B_ref + 2048;
  void *DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref = (void *)DeeployNetwork_bias_tensor + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__C_ref + 704;
  void *DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_ref = (void *)DeeployNetwork_output_0 + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 0;
  void *DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1 = (void *)DeeployNetwork_TILING_CODEGEN_L1__data_out_ref + 704;

  // Initialize DMA future
  uint32_t channel_output = (uint32_t)-1;
  uint32_t channel_input = (uint32_t)-1;

  // Transfer initial input tile
  channel_input = mchan_channel_alloc();
  mchan_transfer_1d(1443200, DeeployNetwork_TILING_CODEGEN_L1__A_ref, DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
  DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[0];

  // Transfer initial input tile

  mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_ref, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[0];

  // Transfer initial input tile

  mchan_transfer_2d_ext_strided(1966784, DeeployNetwork_TILING_CODEGEN_L1__C_ref, DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

  // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
  DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[0];

  // TILING LOOP
  for (int TILING_I = DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr];
       TILING_I < DeeployNetwork_TILING_CODEGEN_L1__numTiles[(*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr) + 1]; TILING_I++) {

    switch ((TILING_I) % 2) {
    case 0:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_0;
      break;
    case 1:
      DeeployNetwork_TILING_CODEGEN_L1__A_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
      DeeployNetwork_TILING_CODEGEN_L1__B_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
      DeeployNetwork_TILING_CODEGEN_L1__C_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
      DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = (float32_t *)DeeployNetwork_TILING_CODEGEN_L1__output_0_buffer_1;
      break;
    }

    // Wait for current input tile
    mchan_channel_wait(channel_input);
    mchan_channel_free(channel_input);

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__A_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__input_0_buffer_1;
        break;
      }

      // Transfer next input tile
      channel_input = mchan_channel_alloc();
      mchan_transfer_1d(DeeployNetwork_TILING_CODEGEN_L1__A_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__A_next,
                        DeeployNetwork_TILING_CODEGEN_L1__input_0_ref);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__input_0_ref
      DeeployNetwork_TILING_CODEGEN_L1__input_0_ref += DeeployNetwork_TILING_CODEGEN_L1__A_relativeOffset[TILING_I + 1];
    }

    // Wait for current input tile

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__B_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_buffer_1;
        break;
      }

      // Transfer next input tile

      mchan_transfer_2d_ext_strided(1968128, DeeployNetwork_TILING_CODEGEN_L1__B_next, DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__weight_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__B_relativeOffset[TILING_I + 1];
    }

    // Wait for current input tile

    // DOUBLE BUFFERING CHECK TILE LOAD
    if ((TILING_I + 1) < DeeployNetwork_TILING_CODEGEN_L1__numTiles[*DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr + 1]) {

      switch ((TILING_I + 1) % 2) {
      case 0:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_0;
        break;
      case 1:
        DeeployNetwork_TILING_CODEGEN_L1__C_next = (void *)DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_buffer_1;
        break;
      }

      // Transfer next input tile

      mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__C_cmd[TILING_I + 1], DeeployNetwork_TILING_CODEGEN_L1__C_next,
                                    DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref, 64, 128);

      // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref
      DeeployNetwork_TILING_CODEGEN_L1__bias_tensor_ref += DeeployNetwork_TILING_CODEGEN_L1__C_relativeOffset[TILING_I + 1];
    }

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__M_ref
    *DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M[TILING_I];

    _cluster_fork_args_t DeeployNetwork__cluster_fork_args =
        (_cluster_fork_args_t){.DeeployNetwork_TILING_CODEGEN_L1__M_ref = DeeployNetwork_TILING_CODEGEN_L1__M_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__A_ref = DeeployNetwork_TILING_CODEGEN_L1__A_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__B_ref = DeeployNetwork_TILING_CODEGEN_L1__B_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__C_ref = DeeployNetwork_TILING_CODEGEN_L1__C_ref,
                               .DeeployNetwork_TILING_CODEGEN_L1__data_out_ref = DeeployNetwork_TILING_CODEGEN_L1__data_out_ref};

    pi_cl_team_fork(NUM_CORES, (void *)_cluster_fork, &DeeployNetwork__cluster_fork_args);

    // Wait for previous output tile
    mchan_channel_wait(channel_output);
    mchan_channel_free(channel_output);

    // Transfer current output tile
    channel_output = mchan_channel_alloc();
    mchan_transfer_2d_ext_strided(DeeployNetwork_TILING_CODEGEN_L1__data_out_cmd[TILING_I], DeeployNetwork_TILING_CODEGEN_L1__data_out_ref,
                                  DeeployNetwork_TILING_CODEGEN_L1__output_0_ref, 64, 128);

    // UPDATE VARIABLE DeeployNetwork_TILING_CODEGEN_L1__output_0_ref
    DeeployNetwork_TILING_CODEGEN_L1__output_0_ref += DeeployNetwork_TILING_CODEGEN_L1__data_out_relativeOffset[TILING_I];

    // CLOSE TILING LOOP
  }
  *DeeployNetwork_TILING_CODEGEN_L1__tileIdxPtr += 1;

  // Wait for final output tile
  mchan_channel_wait(channel_output);
  mchan_channel_free(channel_output);

  // Deinitialize DMA future

  // CLOSURE ARG WRITEBACK
}

Collaborator

@lukamac lukamac left a comment

I'll do the changes; I just need some clarification on the intended behavior.

Comment on lines 93 to +120
     def transfer(self, ctxt: NetworkContext, externalBuffer: VariableBuffer, localBuffer: VariableBuffer,
-                 shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int, ...],
-                 direction: DmaDirection, future: Future) -> List[CodeSnippet]:
+                 shape: Tuple[int, ...], strideExt: Tuple[int, ...], strideLoc: Tuple[int,
+                                                                                      ...], direction: DmaDirection,
+                 future: Future) -> Tuple[List[CodeSnippet], List[CodeSnippet], List[CodeSnippet]]:
         self.checkTransfer(ctxt, externalBuffer, localBuffer, shape, strideExt, strideLoc, direction)
         opRepr = self.transferOpRepr(externalBuffer, localBuffer, shape, strideExt, strideLoc, direction, future)
         template = self._transferTemplates[len(shape)]
-        return [CodeSnippet(template, opRepr)]
-
-    def setup(self) -> List[CodeSnippet]:
-        return []
-
-    def teardown(self) -> List[CodeSnippet]:
-        return []
+        return [future.alloc()], [CodeSnippet(template, opRepr)], []
Collaborator

I disagree with the API change, so I will need some convincing to accept it.
I view the transfer as a method that focuses only on generating the code to execute a single transfer.
The overarching CodeGenerator can take care of future allocation and deinit. I prefer keeping the final leaf abstractions (the things wrapping the kernel calls) quite simple and focused.
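For what it's worth, a toy sketch of that split of responsibilities could look like the following; all class and function names here are stand-ins, not the actual Deeploy API.

from typing import List, Set


class Snippet:
    """Stand-in for CodeSnippet: carries one line of generated code."""

    def __init__(self, text: str):
        self.text = text


class FutureSketch:
    """Stand-in for Future: knows how to render its own alloc/wait code."""

    def __init__(self, name: str):
        self.name = name

    def alloc(self) -> Snippet:
        return Snippet(f"uint32_t {self.name} = (uint32_t)-1;")

    def wait(self) -> Snippet:
        return Snippet(f"dma_wait({self.name});  // illustrative call, not a real driver function")


class DmaSketch:
    """The leaf abstraction: transfer() only renders the transfer itself."""

    def transfer(self, src: str, dst: str, future: FutureSketch) -> List[Snippet]:
        return [Snippet(f"{future.name} = dma_start({dst}, {src});  // illustrative call")]


def emit_transfer(dma: DmaSketch, src: str, dst: str, future: FutureSketch,
                  allocated: Set[str]) -> List[Snippet]:
    # The code generator owns the bookkeeping: allocate each future once,
    # then delegate only the transfer rendering to the DMA.
    snippets: List[Snippet] = []
    if future.name not in allocated:
        snippets.append(future.alloc())
        allocated.add(future.name)
    snippets.extend(dma.transfer(src, dst, future))
    return snippets


allocated: Set[str] = set()
fut = FutureSketch("channel_input")
for tile in ("tile_0", "tile_1"):
    for snippet in emit_transfer(DmaSketch(), tile, "l1_buffer", fut, allocated):
        print(snippet.text)  # the alloc line is emitted only once, before the first transfer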

Member Author

I do understand your concern, and I am open to different solutions. Unfortunately, I do not have the bandwidth to implement this right now. I suggest we either postpone this PR by 1-2 weeks, implement the requested changes in another one, or you adapt the code to your liking.

Comment on lines -31 to -33
# LMACAN: The brackets around ${tileIdxVar} are important to ensure correct order
# of the modulo operation. Breaking case without the brackets is when we
# put "TILING_I + 1" for tileIdxVar.
Collaborator

Kinda rude to remove this comment. It took me effort to hunt down the cause, and I'd like to prevent someone else from making the same mistake.

Member Author

Oops, sorry, I agree; I must have overlooked this change. I can revert it.

Comment on lines +140 to +142
# Extract the future that is not already in the set of ingress futures
_future = set([future]) - ingressFutures
_future = _future.pop() if len(_future) > 0 else EmptyFuture("")
Collaborator

Can you please explain the intended behavior here? Why do we need to get an EmptyFuture? Is it because of the allocs? What was getting replicated too much?

Member Author

It enables the use of stateless futures and prevents generating multiple allocation calls if the same future is reused in multiple transfer calls. This is the case for a PerDirectionWaitingStrategy. The _future variable is passed to the transfer() function, which inserts the future.alloc() code into the generated code. This will of course change if we adapt the transfer() API interface as discussed in #114 (comment)
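A tiny standalone sketch of that intent (tensor and channel names are made up):

# With a shared per-direction future, only the first transfer should emit the
# channel allocation; later transfers just reuse the already-allocated handle.
seen_futures = set()
generated = []

for tensor_name, future in [("A", "channel_input"), ("B", "channel_input"), ("C", "channel_input")]:
    if future not in seen_futures:
        generated.append(f"uint32_t {future} = mchan_channel_alloc();")
        seen_futures.add(future)
    generated.append(f"// transfer of tensor {tensor_name} via {future}")

print("\n".join(generated))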

Comment on lines +72 to +85
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return;
  *status_ptr = 1 << channel_id;
}

static uint32_t mchan_channel_is_busy(uint32_t channel_id) {
  // TODO: assert channel_id is smaller than 32
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return 0;
  return *status_ptr & (1 << channel_id);
}

static void mchan_channel_wait(uint32_t channel_id) {
  // TODO: assert channel_id is smaller than 32
  if (channel_id > MCHAN_TRANSFER_ID_MAX)
    return;
Collaborator

I am against silently failing and using that as a feature of the implementation. The design goal here was to be as bare-bones as possible, so I would not do this check in the driver but in the code generation where needed.

If we want the check, I would prefer that it crash execution or return an exit status so that the user knows something is wrong.

Member Author

As we’ve already discussed, this represents a different implementation philosophy. My version is already a functional improvement compared to the previous state. Additionally, I wouldn’t consider this behavior as “failing silently,” since if a channel does not exist, I don’t want to wait for it, and that outcome is intentional. If an invalid channel ID was expected to exist, then the failure effectively occurred earlier during allocation, and I’d be open to adopting your approach at that point.

That said, we could also consider wrapping these function calls with checks to validate that the futures are valid (which would be required in a double-buffering scenario) and handle failures at that level.
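For instance, such a check could be emitted around the wait by the code generation instead of inside the driver. This is only a sketch reusing the helpers and the (uint32_t)-1 sentinel shown above, not proposed code:

from Deeploy.DeeployTypes import NodeTemplate

# Illustrative only: keep the driver bare-bones and let the generated code skip
# futures that were never allocated.
_guardedWaitTemplate = NodeTemplate("""
if (${future} <= MCHAN_TRANSFER_ID_MAX) {
    mchan_channel_wait(${future});
    mchan_channel_free(${future});
}
""")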

@Xeratec Xeratec force-pushed the pr/fix_doublebuffer branch from 0dae064 to 0e8d74e on October 6, 2025, 16:51
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (1)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1)

24-36: Fix MRO so profiling mixins actually run.

Reorder mixins ahead of code-generation base classes.

-class ProfilingSnitchClusterTilingSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+class ProfilingSnitchClusterTilingSB(ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration):
     _printCycleDifference = NodeTemplate(r"""
@@
-class ProfilingSnitchClusterTilingDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingSnitchClusterTilingDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
     _printCycleDifference = NodeTemplate(r"""
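The reason the order matters is plain Python MRO (left-to-right); a generic toy example with made-up class names:

class CodeGenBase:

    def generateLoopCode(self) -> str:
        return "plain loop"


class ProfilingMixIn:

    def generateLoopCode(self) -> str:
        return "profiled loop"


class WrongOrder(CodeGenBase, ProfilingMixIn):
    pass


class RightOrder(ProfilingMixIn, CodeGenBase):
    pass


assert WrongOrder().generateLoopCode() == "plain loop"     # mixin override is shadowed
assert RightOrder().generateLoopCode() == "profiled loop"  # mixin override wins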
🧹 Nitpick comments (3)
TargetLibraries/PULPOpen/inc/mchan_v7.h (1)

98-99: Consider adding debug assertions for invalid channel IDs.

While the silent early returns are safe, they could mask programming errors during development. Consider adding assertions that are compiled in only for debug builds to catch invalid channel IDs:

 static void mchan_channel_free(uint32_t channel_id) {
+#ifdef DEBUG
+  assert(channel_id <= MCHAN_TRANSFER_ID_MAX && "Invalid channel ID");
+#endif
   if (channel_id > MCHAN_TRANSFER_ID_MAX)
     return;
   *status_ptr = 1 << channel_id;
 }

Apply similar changes to mchan_channel_is_busy and mchan_channel_wait.

Also applies to: 104-105, 110-111

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)

52-59: Consider simplifying the future deduplication logic.

The set-difference approach to determine whether a future is new is correct but not immediately readable. The logic creates a set with the future, subtracts already-tracked futures, then pops or uses EmptyFuture if empty.

A more straightforward alternative:

 future = self.dma.getFuture(tensorName, direction)
-_future = set([future]) - futures
-_future = _future.pop() if len(_future) > 0 else EmptyFuture("")
+_future = future if future not in futures else EmptyFuture("")
 futures.add(future)

This achieves the same result (avoiding duplicate init/deinit for already-seen futures) with clearer intent.


90-91: Optional: Use iterable unpacking instead of list concatenation.

Static analysis correctly suggests that list concatenation can be replaced with iterable unpacking for slightly better performance and modern Python style.

Apply this diff:

-ingressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer input tiles"})
-                       ] + ingressDMAStatements
+ingressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer input tiles"}),
+                       *ingressDMAStatements]
-egressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer output tiles"})
-                      ] + egressDMAStatements
+egressDMAStatements = [CodeSnippet(self._lineComment, {"comment": "Transfer output tiles"}),
+                      *egressDMAStatements]

Based on static analysis

Also applies to: 99-100

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0dae064 and 0e8d74e.

📒 Files selected for processing (16)
  • Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (1 hunks)
  • Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (1 hunks)
  • Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1 hunks)
  • Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2 hunks)
  • Deeploy/Targets/Snitch/Bindings.py (1 hunks)
  • Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (1 hunks)
  • Deeploy/Targets/Snitch/DMA/SnitchDma.py (2 hunks)
  • Deeploy/TilingExtension/AsyncDma.py (9 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (5 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (4 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9 hunks)
  • DeeployTest/Platforms/Snitch/main.c (1 hunks)
  • TargetLibraries/PULPOpen/inc/mchan_v6.h (2 hunks)
  • TargetLibraries/PULPOpen/inc/mchan_v7.h (2 hunks)
  • TargetLibraries/Snitch/src/CycleCounter.c (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • TargetLibraries/Snitch/src/CycleCounter.c
🚧 Files skipped from review as they are similar to previous changes (3)
  • DeeployTest/Platforms/Snitch/main.c
  • Deeploy/Targets/Snitch/Bindings.py
  • TargetLibraries/PULPOpen/inc/mchan_v6.h
🧰 Additional context used
🧬 Code graph analysis (12)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)
Deeploy/TilingExtension/AsyncDma.py (4)
  • AsyncDma (85-120)
  • Future (16-39)
  • PerTensorWaitingStrategy (52-56)
  • _transferTemplates (139-140)
TargetLibraries/PULPOpen/inc/mchan_v7.h (1)
TargetLibraries/PULPOpen/inc/mchan_v6.h (2)
  • mchan_channel_is_busy (77-81)
  • mchan_channel_wait (83-93)
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py (4)
Deeploy/DeeployTypes.py (9)
  • CodeGenVerbosity (49-55)
  • CodeTransformationPass (2312-2343)
  • ExecutionBlock (1439-1565)
  • NetworkContext (564-1076)
  • NodeTemplate (87-229)
  • apply (2214-2228)
  • apply (2267-2283)
  • apply (2319-2343)
  • executionBlock (1592-1595)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • DoubleBufferingTilingCodeGeneration (20-282)
  • ProfilingDoubleBufferingTilingMixIn (285-357)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • SingleBufferingTilingCodeGeneration (19-130)
  • ProfilingSingleBufferingTilingMixIn (133-191)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (1)
  • apply (39-52)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)
Deeploy/DeeployTypes.py (1)
  • NodeTemplate (87-229)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3)
Deeploy/DeeployTypes.py (7)
  • CodeSnippet (42-45)
  • ExecutionBlock (1439-1565)
  • executionBlock (1592-1595)
  • addLeft (1462-1475)
  • addRight (1477-1490)
  • NodeTemplate (87-229)
  • nodeName (1598-1601)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • generateLoopCode (314-357)
  • generateSetupAndTeardownCode (288-311)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • generateLoopCode (150-191)
  • generateSetupAndTeardownCode (136-147)
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (1)
Deeploy/TilingExtension/AsyncDma.py (3)
  • AsyncDma (85-120)
  • DirectionWaitingStrategy (59-71)
  • Future (16-39)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • DoubleBufferingTilingCodeGeneration (20-282)
  • ProfilingDoubleBufferingTilingMixIn (285-357)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • SingleBufferingTilingCodeGeneration (19-130)
  • ProfilingSingleBufferingTilingMixIn (133-191)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (4)
Deeploy/TilingExtension/AsyncDma.py (17)
  • AsyncDma (85-120)
  • AnydimAsyncDmaTransferAdapter (165-271)
  • EmptyFuture (123-128)
  • Future (16-39)
  • getFuture (48-49)
  • getFuture (54-56)
  • getFuture (69-71)
  • getFuture (80-82)
  • getFuture (92-93)
  • getFuture (142-143)
  • getFuture (200-201)
  • transfer (113-120)
  • transfer (150-162)
  • transfer (214-271)
  • wait (38-39)
  • init (29-30)
  • deinit (35-36)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (2)
  • TilingCodeGeneration (40-289)
  • _generateDmaTransferCalls (115-152)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
  • TilingHoistingMixIn (31-153)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
  • ProfilingPrototypeMixIn (67-208)
  • PrototypeTilingMixIn (23-64)
  • TilingMetaInfo (13-20)
  • generateAllTilingCode (53-64)
  • generateSetupAndTeardownCode (26-36)
  • measurementArrayDeclaration (110-141)
  • injectPrintCycleDiff (144-191)
  • generateLoopCode (39-50)
  • kernelProfilingWrap (194-208)
Deeploy/TilingExtension/AsyncDma.py (4)
Deeploy/DeeployTypes.py (6)
  • NodeTemplate (87-229)
  • alloc (276-287)
  • alloc (521-524)
  • CodeSnippet (42-45)
  • VariableBuffer (232-391)
  • NetworkContext (564-1076)
Deeploy/Targets/PULPOpen/DMA/L3Dma.py (2)
  • checkTransfer (40-47)
  • transferOpRepr (49-60)
Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)
  • checkTransfer (34-46)
  • transferOpRepr (48-74)
Deeploy/Targets/Snitch/DMA/SnitchDma.py (2)
  • checkTransfer (47-51)
  • transferOpRepr (53-65)
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (5)
Deeploy/DeeployTypes.py (4)
  • NodeTemplate (87-229)
  • NetworkContext (564-1076)
  • executionBlock (1592-1595)
  • ExecutionBlock (1439-1565)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
  • _tilingLoop (64-282)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
  • _tilingLoop (68-130)
Deeploy/TilingExtension/TilingCodegen.py (2)
  • TilingSchedule (53-122)
  • VariableReplacementScheme (126-158)
Deeploy/TilingExtension/CodeTransformationPasses/TilingHoistingMixIn.py (1)
  • _hoistTileNumAndIdxPtr (77-101)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
Deeploy/TilingExtension/AsyncDma.py (13)
  • AsyncDma (85-120)
  • EmptyFuture (123-128)
  • Future (16-39)
  • getFuture (48-49)
  • getFuture (54-56)
  • getFuture (69-71)
  • getFuture (80-82)
  • getFuture (92-93)
  • getFuture (142-143)
  • getFuture (200-201)
  • wait (38-39)
  • init (29-30)
  • deinit (35-36)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (9)
  • ProfilingPrototypeMixIn (67-208)
  • PrototypeTilingMixIn (23-64)
  • TilingMetaInfo (13-20)
  • generateAllTilingCode (53-64)
  • generateSetupAndTeardownCode (26-36)
  • measurementArrayDeclaration (110-141)
  • injectPrintCycleDiff (144-191)
  • generateLoopCode (39-50)
  • kernelProfilingWrap (194-208)
Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
  • DoubleBufferingTilingCodeGeneration (20-282)
  • ProfilingDoubleBufferingTilingMixIn (285-357)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
  • SingleBufferingTilingCodeGeneration (19-130)
  • ProfilingSingleBufferingTilingMixIn (133-191)
🪛 Ruff (0.13.3)
Deeploy/Targets/Snitch/DMA/SnitchDma.py

32-41: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py

67-67: ProfilingPrototypeMixIn is an abstract base class, but it has no abstract methods or properties

(B024)

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

66-66: Unused method argument: variableReplacement

(ARG002)


255-255: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py

90-91: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


99-100: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🔇 Additional comments (23)
TargetLibraries/PULPOpen/inc/mchan_v7.h (2)

37-37: LGTM! Consistent boundary checks prevent undefined behavior.

The addition of MCHAN_TRANSFER_ID_MAX and corresponding boundary checks in the channel management functions correctly prevent bit-shift overflow and undefined behavior when channel IDs exceed the valid range. The implementation is consistent with the mchan_v6.h pattern.

Also applies to: 98-99, 104-105, 110-111


104-105: Early-return in mchan_channel_wait ensures mchan_channel_is_busy is never called with invalid channel IDs. No further changes required.

Deeploy/Targets/PULPOpen/DMA/L3Dma.py (1)

15-24: LGTM! Proper initialization and guarded wait pattern.

The changes improve safety and correctness:

  • Explicit zero initialization prevents use of uninitialized memory.
  • Conditional wait (${name}.size != 0) prevents waiting on requests that were never issued.
  • Empty _allocTemplate appropriately implements the Future interface without requiring explicit allocation for this DMA backend.
Deeploy/TilingExtension/CodeTransformationPasses/TilingCodeGeneration.py (4)

43-43: LGTM! Useful template for generated code comments.

The _lineComment template provides a standard mechanism for injecting explanatory comments into generated code, improving readability of tiling codegen output.


67-71: LGTM! Appropriate abstraction for tiling strategies.

The abstract _tilingLoop method enables different implementations (single-buffering, double-buffering, profiling variants) while maintaining a consistent interface. The boolean return indicates tiling applicability, aligning with the existing pattern.


73-96: LGTM! Clean refactoring that supports new tiling strategies.

The refactored generateTilingLoop:

  • Flattens multiple tiling schedules for single-pass processing.
  • Validates that all base offset lists match the expected buffer count.
  • Delegates implementation to the abstract _tilingLoop method, enabling strategy pattern for different buffering approaches (SB/DB/profiling).

128-152: LGTM! Flattening aligns with updated DMA transfer API.

The changes flatten the DMA transfer call results from nested tuples into flat lists. This aligns with the PR's update to the transfer API, which now returns (alloc, dma, deinit) triplets. The flattening enables downstream processing of individual code snippets across allocation, transfer, and deinitialization phases.

Deeploy/Targets/PULPOpen/DMA/MchanDma.py (2)

29-29: LGTM! Simplified waiting strategy.

Switching to DirectionWaitingStrategy simplifies synchronization by using one future per direction (input/output) rather than tracking multiple tensor groups. This aligns with MCHAN's limitation of one in-flight transaction per direction mentioned in the PR objectives.
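A simplified, self-contained sketch of the behavioral difference (the class names here are made up): a per-direction strategy hands out one shared future per direction, while a per-tensor strategy hands out one future per tensor.

class PerTensorStrategySketch:

    def __init__(self):
        self._futures = {}

    def getFuture(self, tensorName, direction):
        # One future per (tensor, direction) pair.
        return self._futures.setdefault((tensorName, direction), f"future_{tensorName}_{direction}")

class DirectionStrategySketch:

    def __init__(self):
        self._futures = {}

    def getFuture(self, tensorName, direction):
        # One shared future per direction, regardless of the tensor.
        return self._futures.setdefault(direction, f"future_{direction}")

perDirection = DirectionStrategySketch()
assert perDirection.getFuture("A", "ExternalToLocal") == perDirection.getFuture("B", "ExternalToLocal")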


14-20: Audit MCHAN channel lifecycles manually
Automated check returned no missing waits, but confirm every path using _allocTemplate invokes _waitTemplate (i.e., mchan_channel_alloc() is always followed by mchan_channel_wait()/free()).

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (7)

31-39: LGTM! Well-structured templates for buffer switching.

The switch/case templates provide a clean mechanism for selecting buffers based on tile index modulo buffer count, which is essential for double-buffering. The _referenceUpdate template handles pointer reassignment for buffer switching.
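The underlying selection rule is simply tileIdx % numBuffers; a tiny standalone sketch of what the generated switch effectively computes (buffer names hypothetical; the real pass emits a C switch statement via NodeTemplates):

buffers = ["l1_buffer_A", "l1_buffer_B"]  # hypothetical double-buffered L1 tiles
numBuffers = len(buffers)

for tileIdx in range(4):
    localBuffer = buffers[tileIdx % numBuffers]
    print(f"tile {tileIdx}: use {localBuffer}")
# tiles 0 and 2 use l1_buffer_A, tiles 1 and 3 use l1_buffer_B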


44-52: LGTM! Robust switch-case assembly.

The _switch method properly:

  • Validates that the number of case blocks matches the expected buffer count.
  • Assembles a complete switch statement with proper opening, cases with breaks, and closure.
  • Returns a flat list of CodeSnippets for integration into the tiling loop.

54-62: LGTM! Refactored to support per-buffer code blocks.

The updated _generateBufferChoice returns per-buffer code blocks (one for each buffer in the multi-buffer scheme) rather than a single snippet. This enables the _switch method to select the appropriate buffer update based on tile index, which is essential for double-buffering.


69-96: LGTM! Clear documentation of double-buffering strategy.

The inline comments provide an excellent high-level overview of the double-buffering implementation, making the complex tiling loop logic much easier to understand. The staged approach (setup, loop with ingress/egress, teardown) clearly delineates the different phases of execution.


139-163: EmptyFuture deduplication pattern enables stateless futures.

The pattern of extracting _future with EmptyFuture fallback prevents duplicate allocation calls when the same future is reused across multiple tensors (e.g., with DirectionWaitingStrategy). The _future is passed to transfer() and wait(), which insert the allocation code. If the future is already in the set, EmptyFuture (with empty templates) is used instead, avoiding duplicate allocations while still generating valid code.


253-264: LGTM! Proper future lifecycle management.

The future lifecycle is correctly managed:

  • All futures (ingress ∪ egress) are initialized in setup.
  • Futures are waited on at the appropriate points (ingress during loop, egress in teardown).
  • All futures are deinitialized in teardown.

This ensures proper resource management across the double-buffering execution.
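In list form, a purely illustrative sketch of the lifecycle described above (strings stand in for the generated init/wait/deinit snippets):

def buildLifecycle(ingressFutures, egressFutures):
    # All futures (ingress ∪ egress) are initialized in setup and deinitialized
    # in teardown; ingress futures are waited on in the loop, egress futures in
    # teardown.
    allFutures = sorted(set(ingressFutures) | set(egressFutures))
    setup = [f"init({f})" for f in allFutures]
    loop = [f"wait({f})" for f in ingressFutures] + ["kernel()"] + [f"transfer({f})" for f in egressFutures]
    teardown = [f"wait({f})" for f in egressFutures] + [f"deinit({f})" for f in allFutures]
    return setup, loop, teardown

setup, loop, teardown = buildLifecycle(["f_ingress"], ["f_egress"])
print(setup, loop, teardown, sep="\n")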


285-357: LGTM! Clean profiling integration via mixin pattern.

The ProfilingDoubleBufferingTilingMixIn cleanly separates profiling concerns from core double-buffering logic:

  • generateSetupAndTeardownCode: Declares measurement arrays and wraps with initial/final cycle measurements.
  • generateLoopCode: Injects cycle measurements around ingress DMA, kernel, and egress DMA.
  • Delegates to super() for base behavior, ensuring proper composition with PrototypeTilingMixIn.

This design enables profiling to be optionally applied to double-buffering tiling without modifying core logic.

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)

5-5: LGTM! Appropriate import cleanup.

Removing abstractmethod from imports is correct since this file no longer declares abstract methods. The abstract _tilingLoop is now declared in TilingCodeGeneration.py.


26-64: LGTM! Explicit parameter-driven flow improves clarity.

The updated signatures make code snippet passing explicit rather than relying on internal generation:

  • generateSetupAndTeardownCode and generateLoopCode take explicit list parameters.
  • generateAllTilingCode orchestrates both methods, maintaining a single entry point.

This design improves clarity and testability by making data flow explicit and enables profiling mixins to intercept and augment snippet lists.

Deeploy/Targets/Snitch/DMA/SnitchDma.py (1)

34-41: Guard dummy DMA transfer behind a compile-time flag

Wrap the fallback snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0) to avoid unintended memory traffic:

 if (snrt_is_dm_core()) {
     ${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
-    // WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
-    snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
+    // Workaround: ensure txid can be resolved without side-effects.
+    #if defined(DEEPLOY_SNRT_DMA_TXID_WORKAROUND)
+    snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
+    #endif
 }

Confirm this 1-byte self-copy (repeat = 0) is safe across all targeted SNRT versions; if not, expose a dedicated “commit” primitive or use a true zero-length transfer.

Deeploy/TilingExtension/AsyncDma.py (1)

113-121: All .transfer call sites correctly handle the new triple return value.

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (3)

10-14: LGTM: Import additions support profiling and future handling.

The new imports (EmptyFuture, ProfilingPrototypeMixIn, PrototypeTilingMixIn) are correctly added and used throughout the file to enable profiling-aware code generation and proper future deduplication.


72-130: LGTM: Clear structure and improved maintainability.

The restructured _tilingLoop method provides:

  • Explicit comments documenting the single-buffering strategy
  • Well-organized statement lists (setup, ingress, egress, teardown)
  • Proper delegation to generateAllTilingCode mixin method
  • Descriptive inline comments via self._lineComment snippets

This significantly improves readability and aligns with the PR goal of adding generated-code comments.


133-191: LGTM: Profiling mixin follows established patterns.

ProfilingSingleBufferingTilingMixIn correctly:

  • Extends both PrototypeTilingMixIn and ProfilingPrototypeMixIn
  • Overrides generateSetupAndTeardownCode to inject measurement arrays and print logic
  • Overrides generateLoopCode to wrap ingress/egress DMA statements with cycle measurements
  • Delegates to parent implementations via super()
  • Uses kernelProfilingWrap for kernel-level measurements

The implementation is consistent with the profiling infrastructure described in TilingPrototypes.py and aligns with the PR goal of adding profiling support to Snitch tiling.
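Conceptually, the mixins just bracket each statement list with cycle-counter reads; a minimal sketch of that wrapping idea (function and array names are hypothetical, assuming a getCycles()-style counter as used by the profiling templates):

def wrapWithMeasurements(statements, startArray, endArray, tileIdxVar = "TILING_I"):
    # Bracket the given generated statements with cycle-counter reads into the
    # per-tile measurement arrays (names here are illustrative).
    start = f"{startArray}[{tileIdxVar}] = getCycles();"
    end = f"{endArray}[{tileIdxVar}] = getCycles();"
    return [start, *statements, end]

print(wrapWithMeasurements(["run_kernel();"], "measurements_kernel_start", "measurements_kernel_end"))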

Comment on lines 19 to 20
class ProfilingPULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
pass
Contributor

⚠️ Potential issue | 🟠 Major

Fix MRO so profiling mixins override base methods.

Place Profiling*MixIn before the CodeGeneration base.

-class ProfilingPULPClusterTilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+class ProfilingPULPClusterTilingGenerationSB(ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration):
     pass
@@
-class ProfilingPULPClusterTilingGenerationDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingPULPClusterTilingGenerationDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
     pass

Also applies to: 27-28

🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPClusterTiling.py around
lines 19-20 (and similarly lines 27-28), the profiling mixin is declared after
the code-generation base which prevents the mixin from overriding base methods
due to MRO; swap the inheritance order so the Profiling*MixIn comes before the
CodeGeneration base (e.g., class
ProfilingPULPClusterTilingGenerationSB(ProfilingSingleBufferingTilingMixIn,
SingleBufferingTilingCodeGeneration): pass) and apply the same swap to the other
affected class definition so profiling methods take precedence.
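As a standalone illustration of the MRO point (class and method names here are hypothetical), Python resolves attributes left to right along the MRO, so a mixin listed after the concrete base never gets a chance to override it:

class Base:

    def generateLoopCode(self):
        return "base"

class ProfilingMixIn:

    def generateLoopCode(self):
        # Augment, then delegate to the next class in the MRO.
        return "profiling + " + super().generateLoopCode()

class Wrong(Base, ProfilingMixIn):  # Base comes first, so the mixin is skipped.
    pass

class Right(ProfilingMixIn, Base):  # Mixin comes first and delegates to Base.
    pass

print(Wrong().generateLoopCode())  # -> "base"
print(Right().generateLoopCode())  # -> "profiling + base"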

Comment on lines 19 to 20
class ProfilingPULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
pass
Contributor

⚠️ Potential issue | 🟠 Major

Fix MRO so profiling mixins override base methods.

Reorder bases: mixin first.

-class ProfilingPULPL3TilingGenerationSB(SingleBufferingTilingCodeGeneration, ProfilingSingleBufferingTilingMixIn):
+class ProfilingPULPL3TilingGenerationSB(ProfilingSingleBufferingTilingMixIn, SingleBufferingTilingCodeGeneration):
     pass
@@
-class ProfilingPULPL3TilingGenerationDB(DoubleBufferingTilingCodeGeneration, ProfilingDoubleBufferingTilingMixIn):
+class ProfilingPULPL3TilingGenerationDB(ProfilingDoubleBufferingTilingMixIn, DoubleBufferingTilingCodeGeneration):
     pass

Also applies to: 27-28

🤖 Prompt for AI Agents
In Deeploy/Targets/PULPOpen/CodeTransformationPasses/PULPL3Tiling.py around
lines 19-20 and 27-28 the profiling mixin(s) are listed after the concrete
SingleBufferingTilingCodeGeneration base, causing the MRO to pick base methods
over the mixin overrides; reorder the base classes so the
ProfilingSingleBufferingTilingMixIn (and any other profiling mixins) appear
first in each class definition (mixin before
SingleBufferingTilingCodeGeneration) so the mixin methods take precedence.

Comment on lines 32 to 43
_transferTemplates = {
2:
NodeTemplate(
"if (snrt_is_dm_core()) snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});"
),
NodeTemplate("""
if (snrt_is_dm_core()) {
${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
// WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
}
"""),
}
_waitingStrategy = TensorGroupWaitingStrategy(SnitchBarrierFuture, "")
_waitingStrategy = PerTensorWaitingStrategy(SnitchFuture)

Contributor

⚠️ Potential issue | 🟡 Minor

Annotate class attributes with ClassVar to satisfy RUF012 and signal intent.

Mark the mutable class attributes as ClassVar.

Apply:

-from typing import Dict, Tuple
+from typing import Dict, Tuple, ClassVar
+from Deeploy.TilingExtension.AsyncDma import AsyncDma, DmaDirection, Future, PerTensorWaitingStrategy, AsyncDmaWaitingStrategy
@@
-class SnitchDma(AsyncDma):
+class SnitchDma(AsyncDma):
 
-    _transferTemplates = {
+    _transferTemplates: ClassVar[Dict[int, NodeTemplate]] = {
@@
-    _waitingStrategy = PerTensorWaitingStrategy(SnitchFuture)
+    _waitingStrategy: ClassVar[AsyncDmaWaitingStrategy] = PerTensorWaitingStrategy(SnitchFuture)

Based on static analysis hints

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_transferTemplates = {
2:
NodeTemplate(
"if (snrt_is_dm_core()) snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});"
),
NodeTemplate("""
if (snrt_is_dm_core()) {
${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
// WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
}
"""),
}
_waitingStrategy = TensorGroupWaitingStrategy(SnitchBarrierFuture, "")
_waitingStrategy = PerTensorWaitingStrategy(SnitchFuture)
from typing import Dict, Tuple, ClassVar
from Deeploy.TilingExtension.AsyncDma import (
AsyncDma,
DmaDirection,
Future,
PerTensorWaitingStrategy,
AsyncDmaWaitingStrategy,
)
class SnitchDma(AsyncDma):
_transferTemplates: ClassVar[Dict[int, NodeTemplate]] = {
2:
NodeTemplate("""
if (snrt_is_dm_core()) {
${future} = snrt_dma_start_2d(${dest}, ${src}, ${size}, ${stride_dest}, ${stride_src}, ${repeat});
// WIESEP: Hack as otherwise the last commited DMA transaction ID can never be resolved.
snrt_dma_start_2d(${dest}, ${dest}, 1, 0, 0, 0);
}
"""),
}
_waitingStrategy: ClassVar[AsyncDmaWaitingStrategy] = PerTensorWaitingStrategy(SnitchFuture)
🧰 Tools
🪛 Ruff (0.13.3)

32-41: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

🤖 Prompt for AI Agents
In Deeploy/Targets/Snitch/DMA/SnitchDma.py around lines 32 to 43, the mutable
class attributes _transferTemplates and _waitingStrategy need explicit ClassVar
typing to satisfy RUF012 and signal they are class-level, not instance-level;
import ClassVar from typing (or typing_extensions if needed) and change their
annotations to e.g. _transferTemplates: ClassVar[Dict[int, NodeTemplate]] =
{...} and _waitingStrategy: ClassVar[PerTensorWaitingStrategy] =
PerTensorWaitingStrategy(SnitchFuture), ensuring imports for any referenced
types are available.

Comment on lines +227 to +242
# 4.4.1) Wait for previous output tile
future = self.dma.getFuture(tensorName, "LocalToExternal")
# Extract the future that is not already in the set of ingress futures
_future = set([future]) - egressFutures
_future = _future.pop() if len(_future) > 0 else EmptyFuture("")
egressFutures.add(future)

egressDMAStatements.append(CodeSnippet(self._lineComment, {"comment": "Wait for previous output tile"}))
egressDMAStatements.append(_future.wait())

# 4.4.2) Start transfer for current output tile
dmaTransferCalls = self._generateDmaTransferCalls(ctxt, tensorName, rectangles, "TILING_I", localBuffer,
externalBufferRef, "LocalToExternal", future)
egressDmaTransferCalls.extend(dmaTransferCalls)
externalBufferRef, "LocalToExternal", _future)

egressDMAStatements.append(CodeSnippet(self._lineComment, {"comment": "Transfer current output tile"}))
egressDMAStatements.extend(dmaTransferCalls)
Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that the past review issue has been addressed.

A past review comment (still visible in the annotations) flagged that using EmptyFuture on the egress path prevents correct transaction ID generation for Snitch DMA. The suggested fix was to keep the actual Future for codegen and only use the set for deduping teardown.

The current code still uses the _future pattern with EmptyFuture fallback. Please confirm whether:

  1. The issue was resolved through changes in the Future implementation or transfer API.
  2. The pattern is now correct for the updated DMA backends (Mchan, Snitch).
  3. Or if the suggested fix from the past review still needs to be applied.

Run the following script to check if EmptyFuture is causing issues with Snitch DMA transfer generation:


🏁 Script executed:

#!/bin/bash
# Description: Check for Snitch DMA transfer patterns and EmptyFuture usage

# Look for Snitch DMA transfer generation
rg -nP 'snrt_dma_start|SnitchDma' -C 5

# Check EmptyFuture template definitions
rg -nP 'class EmptyFuture' -A 10

# Look for getFuture usage patterns
rg -nP 'getFuture.*LocalToExternal' -C 3

Length of output: 9839


🏁 Script executed:

#!/bin/bash
# Inspect AsyncDma getFuture and waiting strategy
rg -nP 'class AsyncDma' -A100 Deeploy/TilingExtension/AsyncDma.py
rg -nP 'def getFuture' -C5 Deeploy/TilingExtension/AsyncDma.py
rg -nP 'class PerTensorWaitingStrategy' -A50 Deeploy/TilingExtension/AsyncDma.py

Length of output: 11508


Preserve real Future for egress DMA instead of using EmptyFuture fallback
The current fallback to EmptyFuture when deduping egress futures (DoubleBufferingTilingCodeGeneration.py:227-231) suppresses both wait() and transfer calls, breaking Snitch DMA transaction-ID generation. Retain the original future for code emission and use egressFutures only for deduplication.

🤖 Prompt for AI Agents
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
lines 227-242: the code replaces the real future with EmptyFuture when deduping
egress futures which suppresses wait() and DMA transfer uses and breaks DMA
transaction-ID generation; change the logic to always keep and pass the original
`future` into code emission (use it for _future.wait() and for
generateDmaTransferCalls) but use `egressFutures` only to decide whether to emit
the prior-wait() (i.e., if `future` not in `egressFutures` emit the wait and
then add it to `egressFutures`, otherwise skip emitting an extra wait while
still passing the original `future` to transfer call generation).
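A compact sketch of the suggested control flow (variable names taken from the review; waits and transfers reduced to strings), keeping the real future for code emission and using the set only to decide whether to emit a wait:

egressFutures = set()
egressDMAStatements = []

for tensorName in ["out0", "out1", "out0"]:
    future = f"future_{tensorName}"  # stand-in for self.dma.getFuture(tensorName, "LocalToExternal")
    if future not in egressFutures:
        # Emit the "wait for previous output tile" only once per future.
        egressDMAStatements.append(f"wait({future})")
        egressFutures.add(future)
    # Always pass the real future to the transfer codegen so the transaction ID is tracked.
    egressDMAStatements.append(f"transfer({tensorName}, {future})")

print(egressDMAStatements)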

return executionBlock


class ProfilingPrototypeMixIn(ABC):
Contributor

⚠️ Potential issue | 🟡 Minor

Remove unnecessary ABC inheritance from ProfilingPrototypeMixIn.

ProfilingPrototypeMixIn does not declare any abstract methods and serves as a concrete mixin providing profiling functionality. It should not inherit from ABC.

Apply this diff to remove the ABC inheritance:

-class ProfilingPrototypeMixIn(ABC):
+class ProfilingPrototypeMixIn:

If the intent is to ensure subclasses implement certain methods, consider adding abstract methods. Otherwise, remove the ABC inheritance.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class ProfilingPrototypeMixIn(ABC):
class ProfilingPrototypeMixIn:
🧰 Tools
🪛 Ruff (0.13.3)

67-67: ProfilingPrototypeMixIn is an abstract base class, but it has no abstract methods or properties

(B024)

🤖 Prompt for AI Agents
In Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py around
line 67, remove the unnecessary inheritance from ABC on the
ProfilingPrototypeMixIn class (change its declaration to be a plain mixin) and
also remove the now-unused ABC import at the top of the file; if the original
intent was to force subclasses to implement methods, instead declare explicit
@abstractmethod(s) and keep ABC, otherwise simply drop the ABC inheritance and
unused import.

Comment on lines +68 to +107
_measureCycles = NodeTemplate("""
${measurements}[${tileIdxVar}] = getCycles();
""")

_measurementArrayDeclaration = NodeTemplate("""
uint32_t ${measurements}[${totalNumTiles}];
""")

_stringDeclaration = NodeTemplate("""
const static char ${name}[] = "${string}";
""")

_printLoopSetup = NodeTemplate("""
StopTimer();
for (int ${profileIdxVar} = ((*${tileIdxPtr} > 0) ? ${numTiles}[(*${tileIdxPtr} - 1)] : 0);
${profileIdxVar} < ${numTiles}[*${tileIdxPtr}];
${profileIdxVar}++){
""")

_printCycleDifference = NodeTemplate(r"""
printf("%s%u] %s%u%s", ${prefixStr}, ${profileIdxVar}, "${flavorStr}", \
${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
""")

_printLoopTeardown = NodeTemplate("""
}
StartTimer();
""")

_measureConditionSetup = NodeTemplate("""
if(${cond}){
""")

_measureConditionEnd = NodeTemplate("""
}
""")

_measureCycles = NodeTemplate("""
${measurements}[${tileIdxVar}] = getCycles();
""")
Contributor

⚠️ Potential issue | 🟡 Minor

Remove duplicate _measureCycles definition.

The _measureCycles template is defined twice (lines 68-70 and lines 105-107) with identical content. This appears to be a copy-paste error.

Apply this diff to remove the duplicate:

-    _measureCycles = NodeTemplate("""
-    ${measurements}[${tileIdxVar}] = getCycles();
-    """)
-
     _measurementArrayDeclaration = NodeTemplate("""
     uint32_t ${measurements}[${totalNumTiles}];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_measureCycles = NodeTemplate("""
${measurements}[${tileIdxVar}] = getCycles();
""")
_measurementArrayDeclaration = NodeTemplate("""
uint32_t ${measurements}[${totalNumTiles}];
""")
_stringDeclaration = NodeTemplate("""
const static char ${name}[] = "${string}";
""")
_printLoopSetup = NodeTemplate("""
StopTimer();
for (int ${profileIdxVar} = ((*${tileIdxPtr} > 0) ? ${numTiles}[(*${tileIdxPtr} - 1)] : 0);
${profileIdxVar} < ${numTiles}[*${tileIdxPtr}];
${profileIdxVar}++){
""")
_printCycleDifference = NodeTemplate(r"""
printf("%s%u] %s%u%s", ${prefixStr}, ${profileIdxVar}, "${flavorStr}", \
${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
""")
_printLoopTeardown = NodeTemplate("""
}
StartTimer();
""")
_measureConditionSetup = NodeTemplate("""
if(${cond}){
""")
_measureConditionEnd = NodeTemplate("""
}
""")
_measureCycles = NodeTemplate("""
${measurements}[${tileIdxVar}] = getCycles();
""")
_measurementArrayDeclaration = NodeTemplate("""
uint32_t ${measurements}[${totalNumTiles}];
""")
_stringDeclaration = NodeTemplate("""
const static char ${name}[] = "${string}";
""")
_printLoopSetup = NodeTemplate("""
StopTimer();
for (int ${profileIdxVar} = ((*${tileIdxPtr} > 0) ? ${numTiles}[(*${tileIdxPtr} - 1)] : 0);
${profileIdxVar} < ${numTiles}[*${tileIdxPtr}];
${profileIdxVar}++){
""")
_printCycleDifference = NodeTemplate(r"""
printf("%s%u] %s%u%s", ${prefixStr}, ${profileIdxVar}, "${flavorStr}", \
${measurementsEnd}[${profileIdxVar}] - ${measurementsStart}[${profileIdxVar}], ${suffixStr});
""")
_printLoopTeardown = NodeTemplate("""
}
StartTimer();
""")
_measureConditionSetup = NodeTemplate("""
if(${cond}){
""")
_measureConditionEnd = NodeTemplate("""
}
""")
_measureCycles = NodeTemplate("""
${measurements}[${tileIdxVar}] = getCycles();
""")
🤖 Prompt for AI Agents
In Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py around
lines 68 to 107 there is a duplicate definition of the _measureCycles
NodeTemplate (one at ~68-70 and another at ~105-107); remove one of the two identical definitions so _measureCycles is defined only once (the suggested diff drops the first occurrence), ensuring no other references are changed.

Labels: Bug (Something isn't working), Feature (Addition of new features)

Project status: In review