Conversation
Replace the old update_constants_from_blob + cross-method sharing with a unified per-weight caching approach. The first method to initialize loads its constants from the blob and caches them by FQN. Subsequent methods with matching FQNs reuse cached GPU tensors via update_user_managed_constant_buffer_pairs, skipping blob loading entirely. This eliminates duplicate GPU weight allocations for multi-method models (e.g., prefill/decode), reducing peak GPU memory from ~35 GB to ~17.6 GB for Qwen 3.5 MoE. Also adds GPU peak memory reporting to the Qwen3.5 MoE runner and a CI check (< 20 GB) in test_model_e2e.sh.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18901
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 15 New Failures, 3 Unrelated Failures as of commit 9aeea5f with merge base 87e65ac.
NEW FAILURES: the following jobs have failed.
BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude review this PR.
Claude finished @digantdesai's task in 1m 47s.

Review of PR #18901: Per-weight constant cache
Overall: Good improvement. Replacing the …

Issues

1. Dangling pointers in … The … Actually, on closer inspection this is fine since …
2. The …
3. First method still loads the full blob even when some constants are already cached — When … This is likely acceptable for now since the primary use case is full overlap, but worth a comment noting this limitation.
4. GPU memory measurement is not peak memory — …
5. Missing blank line between function and member variable:

   ```cpp
   return Error::Ok;
   }
   mutable std::mutex shared_constants_mutex_;
   ```

   There should be a blank line between the function's closing brace and the `shared_constants_mutex_` member declaration.

6. Thread safety: The calls to …

7. CI test uses:

   ```bash
   PEAK_MEM=$(echo "$OUTPUT" | grep -oP 'GPU peak memory usage: \K[0-9.]+' || true)
   ```
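On item 4, the underlying point (a single instantaneous memory query reads the current footprint, not the high-water mark) can be illustrated with a minimal, hypothetical tracker; the real runner would have to query the CUDA allocator or sample during execution, not this simulated bookkeeping:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: a peak tracker must fold every allocation into a
// running maximum. Sampling `current` at one instant (e.g., after
// inference finishes and scratch buffers are freed) can report far less
// than the true peak reached mid-run.
struct MemTracker {
  std::size_t current = 0;  // bytes live right now
  std::size_t peak = 0;     // high-water mark over the whole run
  void on_alloc(std::size_t bytes) {
    current += bytes;
    peak = std::max(peak, current);
  }
  void on_free(std::size_t bytes) { current -= bytes; }
};
```

For example, allocating 10 then 20 bytes and freeing the 20 leaves `current` at 10 while `peak` remains 30; a point-in-time query at the end would understate usage by two thirds.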
Minor / Nits
Looks Good
Problem: Multi-method AOTI models (e.g., Qwen3.5 MoE with separate prefill/decode methods) load the full weight blob independently for each method, even when they share identical weights. This causes duplicate GPU allocations -- Qwen3.5 MoE peaked at ~35 GB, making it impossible to run on a single 24 GB GPU (e.g., 4090).
Solution: Introduce a per-weight FQN-keyed constant cache in CudaBackend. The first method loads its constants from the blob and caches them. Subsequent methods with matching FQNs skip blob loading entirely and reuse cached GPU tensors via update_user_managed_constant_buffer_pairs. A legacy fallback path is preserved for older AOTI models without constant management APIs.

Results
Peak GPU memory: 35.4 GB → 17.6 GB (-50%)
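The load-once-per-FQN idea can be sketched as below. This is a minimal host-side illustration under stated assumptions, not the actual CudaBackend code: the class name is hypothetical, a mutex stands in for whatever synchronization the backend uses, and std::vector buffers stand in for the GPU tensors handed to update_user_managed_constant_buffer_pairs.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: the first method to request a weight loads it
// (the std::function stands in for blob I/O plus device allocation);
// later methods with the same fully-qualified name (FQN) receive the
// cached buffer, so each weight is materialized once no matter how many
// methods (e.g., prefill and decode) reference it.
class SharedConstantCache {
 public:
  std::shared_ptr<std::vector<uint8_t>> get_or_load(
      const std::string& fqn,
      const std::function<std::vector<uint8_t>()>& load_from_blob) {
    std::lock_guard<std::mutex> guard(mutex_);  // serialize method init
    auto it = cache_.find(fqn);
    if (it != cache_.end()) {
      return it->second;  // cache hit: no blob read, no new allocation
    }
    auto buf = std::make_shared<std::vector<uint8_t>>(load_from_blob());
    cache_.emplace(fqn, buf);
    return buf;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<std::vector<uint8_t>>>
      cache_;
};
```

With two methods whose weights fully overlap, the loader runs once per FQN instead of once per method, which is consistent with the roughly halved peak reported above.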