NVIDIA
diff --git a/docs/design/rapids_shuffle_manager_v2_phase1_design.md b/docs/design/rapids_shuffle_manager_v2_phase1_design.md
Lines changed: 649 additions & 0 deletions
diff --git a/docs/design/rapids_shuffle_manager_v2_phase2_design.md b/docs/design/rapids_shuffle_manager_v2_phase2_design.md
Lines changed: 529 additions & 0 deletions
diff --git a/docs/dev/shuffle-metrics.md b/docs/dev/shuffle-metrics.md
Lines changed: 65 additions & 0 deletions
diff --git a/sql-plugin-api/src/main/scala/com/nvidia/spark/rapids/RapidsShuffleHeartbeatHandler.scala b/sql-plugin-api/src/main/scala/com/nvidia/spark/rapids/RapidsShuffleHeartbeatHandler.scala
Lines changed: 45 additions & 1 deletion
@@ -0,0 +1,65 @@
+# Shuffle Metrics: SparkRapidsShuffleDiskSavingsEvent
+
+When using MULTITHREADED shuffle mode with `spark.rapids.shuffle.multithreaded.skipMerge=true`,
+the RAPIDS Accelerator emits `SparkRapidsShuffleDiskSavingsEvent` to the Spark event log.
+This document explains how to interpret and aggregate these events.
+
+## Event Format
+
+Each executor posts its own event when it cleans up shuffle data. A single shuffle may have
+multiple events in the eventlog (one per executor that participated in the shuffle write).
+
+Event format in eventlog (JSON):
+```json
+{"Event":"com.nvidia.spark.rapids.SparkRapidsShuffleDiskSavingsEvent",
+ "shuffleId":0,"bytesFromMemory":7868,"bytesFromDisk":0}
+```
+
+## Why Custom Events Instead of Task Metrics
+
+Spark task metrics are committed when a task completes. However, the shuffle data lifecycle
+extends beyond task completion: buffers may be spilled to disk after a task finishes but
+before the shuffle data is read. The final `bytesFromMemory` and `bytesFromDisk` values
+can only be determined when shuffle cleanup occurs (after the SQL query completes), at
+which point task metrics can no longer be updated.
+
+## Field Descriptions
+
+| Field | Description |
+|-------|-------------|
+| `shuffleId` | The Spark shuffle ID |
+| `bytesFromMemory` | Bytes kept in memory and never written to disk (actual disk I/O savings) |
+| `bytesFromDisk` | Bytes spilled to disk due to memory pressure |
+
+The sum of `bytesFromMemory` and `bytesFromDisk` across all events should approximately
+match the total "Shuffle Bytes Written" reported in task metrics.
+
+## Aggregating Events
+
+To get application-wide totals from an eventlog:
+
+```bash
+grep "SparkRapidsShuffleDiskSavingsEvent" eventlog | \
+  jq -s '{
+    totalBytesFromMemory: (map(.bytesFromMemory) | add),
+    totalBytesFromDisk: (map(.bytesFromDisk) | add),
+    diskSavingsBytes: (map(.bytesFromMemory) | add)
+  }'
+```
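+
+Because a single shuffle can produce one event per executor, a per-shuffle breakdown
+can be more useful than application-wide totals. The following is a minimal sketch of
+that grouping, not part of the plugin: the object name `ShuffleSavingsByShuffle` and its
+helpers are hypothetical, and the regex-based field extraction is a simplification that
+assumes each event occupies a single eventlog line with the integer fields shown above.
+
+```scala
+// Sketch only: regex extraction stands in for a real JSON parser and assumes
+// each event sits on one eventlog line with non-negative integer fields.
+object ShuffleSavingsByShuffle {
+  final case class Totals(bytesFromMemory: Long, bytesFromDisk: Long) {
+    def +(other: Totals): Totals =
+      Totals(bytesFromMemory + other.bytesFromMemory, bytesFromDisk + other.bytesFromDisk)
+  }
+
+  private val EventName = "SparkRapidsShuffleDiskSavingsEvent"
+
+  // Pulls an integer field such as "shuffleId":0 out of one eventlog line.
+  private def field(line: String, name: String): Option[Long] = {
+    val pattern = ("\"" + name + "\"\\s*:\\s*(\\d+)").r
+    pattern.findFirstMatchIn(line).map(_.group(1).toLong)
+  }
+
+  // Groups matching events by shuffleId and sums both byte counters per shuffle.
+  // Lines that lack any of the three fields are skipped.
+  def aggregate(lines: Iterator[String]): Map[Int, Totals] =
+    lines.filter(_.contains(EventName)).foldLeft(Map.empty[Int, Totals]) { (acc, line) =>
+      val parsed = for {
+        id   <- field(line, "shuffleId")
+        mem  <- field(line, "bytesFromMemory")
+        disk <- field(line, "bytesFromDisk")
+      } yield (id.toInt, Totals(mem, disk))
+      parsed.fold(acc) { case (id, t) =>
+        acc.updated(id, acc.get(id).fold(t)(_ + t))
+      }
+    }
+
+  // Usage: pass the path of an uncompressed eventlog as the first argument.
+  def main(args: Array[String]): Unit = {
+    val source = scala.io.Source.fromFile(args(0))
+    try {
+      aggregate(source.getLines()).toSeq.sortBy(_._1).foreach { case (id, t) =>
+        println(s"shuffle $id: memory=${t.bytesFromMemory} disk=${t.bytesFromDisk}")
+      }
+    } finally source.close()
+  }
+}
+```
+
+For production analysis, a proper JSON parser would be a safer choice than the regex
+used here.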
+
+## Timing Considerations
+
+The cleanup mechanism uses a polling model in which each executor polls the driver once
+per second. For short-running applications or scripts, the session may exit before
+executors have had a chance to poll and report their statistics.
+
+To ensure all events are captured, add a delay longer than the polling interval before exiting:
+
+```scala
+// After your last query completes
+Thread.sleep(2000)  // Wait for executor cleanup polling
+spark.stop()
+```
+
+For long-running applications or interactive sessions (spark-shell, notebooks), this is
+typically not an issue as there is enough time between queries for cleanup to complete.
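+
+To illustrate why the delay matters, the sketch below models one poll, cleanup, and
+report round trip entirely in process. It is a simplified illustration under stated
+assumptions: `CleanupPollSketch`, `Driver`, `Executor`, `Stats`, `markForCleanup`, and
+`tick` are hypothetical stand-ins rather than the plugin's classes, and no RPC, timer,
+or threading is modeled.
+
+```scala
+// Sketch only: an in-process stand-in for the poll/cleanup/report round trip.
+// None of these names are the plugin's real classes.
+object CleanupPollSketch {
+  final case class Stats(shuffleId: Int, bytesFromMemory: Long, bytesFromDisk: Long)
+
+  // Hypothetical driver side: tracks pending cleanups per executor and received stats.
+  final class Driver(executors: Set[String]) {
+    private var pending: Map[String, Set[Int]] =
+      executors.map(_ -> Set.empty[Int]).toMap
+    private var reported = Vector.empty[(String, Stats)]
+
+    // Marks a finished shuffle as needing cleanup on every known executor.
+    def markForCleanup(shuffleId: Int): Unit =
+      pending = pending.map { case (e, ids) => e -> (ids + shuffleId) }
+
+    // Answers a poll with the shuffle IDs this executor should clean up.
+    def poll(executorId: String): Seq[Int] = {
+      val ids = pending.getOrElse(executorId, Set.empty[Int]).toSeq.sorted
+      pending = pending.updated(executorId, Set.empty[Int])
+      ids
+    }
+
+    def report(executorId: String, stats: Seq[Stats]): Unit =
+      reported ++= stats.map(s => executorId -> s)
+
+    def received: Vector[(String, Stats)] = reported
+  }
+
+  // Hypothetical executor side: holds per-shuffle statistics until cleanup.
+  final class Executor(val id: String, private var buffers: Map[Int, Stats]) {
+    // One polling tick: ask the driver, drop local state, report what was dropped.
+    def tick(driver: Driver): Unit = {
+      val toClean = driver.poll(id)
+      val stats = toClean.flatMap(sid => buffers.get(sid))
+      buffers = buffers -- toClean
+      if (stats.nonEmpty) driver.report(id, stats)
+    }
+  }
+
+  def main(args: Array[String]): Unit = {
+    val driver = new Driver(Set("exec-1"))
+    val exec = new Executor("exec-1", Map(0 -> Stats(0, 7868L, 0L)))
+    driver.markForCleanup(0)
+    // Nothing has been reported yet: the executor has not polled.
+    println(s"before tick: ${driver.received.size} report(s)")
+    exec.tick(driver)
+    println(s"after tick: ${driver.received.size} report(s)")
+  }
+}
+```
+
+In this model, statistics reach the driver only after the executor's next tick, so an
+application that stops between `markForCleanup` and `tick` would lose them.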
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2026, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -41,3 +41,47 @@ trait RapidsShuffleHeartbeatHandler {
   /** Called when a new peer is seen via heartbeats */
   def addPeer(peer: BlockManagerId): Unit
 }
+
+// ============================================================================
+// Shuffle Cleanup RPC Messages
+// ============================================================================
+
+/**
+ * Statistics for a single shuffle cleanup operation.
+ *
+ * @param shuffleId the shuffle ID that was cleaned up
+ * @param bytesFromMemory bytes that were read from memory (never spilled to disk)
+ * @param bytesFromDisk bytes that were read from disk (spilled at some point)
+ * @param numExpansions number of buffer expansions that occurred
+ * @param numSpills number of buffers that were spilled to disk
+ * @param numForcedFileOnly number of buffers that used forced file-only mode
+ */
+case class ShuffleCleanupStats(
+    shuffleId: Int,
+    bytesFromMemory: Long,
+    bytesFromDisk: Long,
+    numExpansions: Int = 0,
+    numSpills: Int = 0,
+    numForcedFileOnly: Int = 0) extends Serializable
+
+/**
+ * Executor polls driver for shuffles that need to be cleaned up.
+ *
+ * @param executorId identifier for the executor
+ */
+case class RapidsShuffleCleanupPollMsg(executorId: String)
+
+/**
+ * Driver response with shuffle IDs that need cleanup.
+ *
+ * @param shuffleIds list of shuffle IDs to clean up
+ */
+case class RapidsShuffleCleanupResponseMsg(shuffleIds: Array[Int])
+
+/**
+ * Executor reports cleanup statistics to driver.
+ *
+ * @param executorId identifier for the executor
+ * @param stats cleanup statistics for each shuffle
+ */
+case class RapidsShuffleCleanupStatsMsg(executorId: String, stats: Array[ShuffleCleanupStats])