
Per weight constant cache #18901 (Draft)

Gasoonjia wants to merge 4 commits into main from per-weight-constant-cache

Conversation

@Gasoonjia
Contributor

Problem: Multi-method AOTI models (e.g., Qwen3.5 MoE with separate prefill/decode methods) load the full weight blob independently for each method, even when they share identical weights. This causes duplicate GPU allocations -- Qwen3.5 MoE peaked at ~35 GB, making it impossible to run on a single 24 GB GPU (e.g., 4090).

Solution: Introduce a per-weight FQN-keyed constant cache in CudaBackend. The first method loads its constants from the blob and caches them. Subsequent methods with matching FQNs skip blob loading entirely and reuse cached GPU tensors via update_user_managed_constant_buffer_pairs. A legacy fallback path is preserved for older AOTI models without constant management APIs.

Results
Peak GPU memory: 35.4 GB → 17.6 GB (-50%)

gasoonjia and others added 3 commits April 14, 2026 18:21
Replace the old update_constants_from_blob + cross-method sharing with a
unified per-weight caching approach. The first method to initialize loads
its constants from the blob and caches them by FQN. Subsequent methods
with matching FQNs reuse cached GPU tensors via
update_user_managed_constant_buffer_pairs, skipping blob loading entirely.

This eliminates duplicate GPU weight allocations for multi-method models
(e.g., prefill/decode), reducing peak GPU memory from ~35 GB to ~17.6 GB
for Qwen 3.5 MoE.

Also adds GPU peak memory reporting to the Qwen3.5 MoE runner and a
CI check (< 20 GB) in test_model_e2e.sh.
@pytorch-bot

pytorch-bot bot commented Apr 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18901

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 15 New Failures, 3 Unrelated Failures

As of commit 9aeea5f with merge base 87e65ac:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 15, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@digantdesai
Contributor

@claude review this PR.

@claude

claude bot commented Apr 15, 2026

Claude finished @digantdesai's task in 1m 47s.


Review of PR #18901 — Per-weight constant cache

  • Read changed files and PR diff
  • Review backends/cuda/runtime/cuda_backend.cpp
  • Review backends/aoti/aoti_delegate_handle.h
  • Review examples/models/qwen3_5_moe/main.cpp
  • Review .ci/scripts/test_model_e2e.sh
  • Post review summary

Overall: Good improvement. Replacing the share_kv_cache_across_methods compile-spec opt-in with automatic per-weight caching is the right direction — it's simpler for callers and handles the general case (shared weights, not just KV cache). The legacy fallback is a reasonable safety net. A few issues to address:


Issues

1. Dangling pointers in update_user_managed_constant_buffer_pairs call (cuda_backend.cpp:835-843)

The pairs vector stores AOTInductorConstantMapEntry entries whose const char* fields point into the key/value std::string objects of the local fqn_to_name map, which at first glance looks like a lifetime hazard, especially since the lock on shared_constants_mutex_ is released at line 843 before pairs is used at line 847. On closer inspection it is safe: the structured binding internal_name is a const reference into fqn_to_name (not a copy), and fqn_to_name outlives pairs in this scope, so the c_str() pointers remain valid; the AtenTensorHandle read from shared_constant_tensors_ inside the lock is an opaque pointer copied by value into pairs. No lifetime issue here — disregard.

2. allocate_constant_on_gpu is dead code (cuda_backend.cpp:669-698)

The allocate_constant_on_gpu static method is defined but never called anywhere in this PR. The three new function pointer types (get_constant_data_size, get_constant_from_folded, get_constants_blob_size) are loaded but also never used. If these are intended for a future per-weight granular loading optimization (loading individual constants instead of the full blob), that's understandable, but as-is they are unused code.

Suggested fix: remove the dead code.

3. First method still loads the full blob even when some constants are already cached (cuda_backend.cpp:765-793)

When uncached_fqns is non-empty, the code calls update_constants_from_blob which loads all constants from the blob into GPU memory, not just the uncached ones. It then extracts and caches the new ones. This means the first method still does a full blob load, and if a second method has partially overlapping constants, it will also do a full blob load for its uncached subset. The "full blob load + extract" approach works for the common case (100% overlap between prefill/decode), but for partial overlap it wastes GPU memory transiently.

This is likely acceptable for now since the primary use case is full overlap, but worth a comment noting this limitation.

4. GPU memory measurement is not peak memory (examples/models/qwen3_5_moe/main.cpp:253-260)

cudaMemGetInfo at the end of execution reports current GPU memory usage, not peak. If memory was allocated and freed during execution (e.g., the old blob approach), the peak would have been higher but would not be captured. The printf says "peak memory usage" which is misleading. Consider using cudaMemPoolGetAttribute with cudaMemPoolAttrUsedMemHigh or tracking allocations over time. For this PR's purpose (verifying the cache avoids duplicates) it works as a sanity check since the weights stay allocated, but the label is inaccurate.

5. Missing blank line between function and member variable (cuda_backend.cpp:900-901)

    return Error::Ok;
  }
  mutable std::mutex shared_constants_mutex_;

There should be a blank line between load_constants_legacy and the member variable declaration for readability (matching the style used elsewhere in this class).

6. Thread safety: get_constant_name / get_constant_original_fqn called under lock (cuda_backend.cpp:739-752)

The calls to handle->get_constant_name() and handle->get_constant_original_fqn() happen while holding shared_constants_mutex_. These are AOTI runtime calls that shouldn't be slow, but holding a mutex during external library calls is a pattern to be cautious about. Since the lock is needed to check shared_constant_tensors_, consider splitting: enumerate constants first (no lock needed), then take the lock only for the cache lookup.

7. CI test uses -oP (PCRE) in grep (.ci/scripts/test_model_e2e.sh:406)

PEAK_MEM=$(echo "$OUTPUT" | grep -oP 'GPU peak memory usage: \K[0-9.]+' || true)

grep -P requires PCRE support which may not be available in all CI environments (e.g., macOS grep doesn't support -P). Since this CI likely runs on Linux this is probably fine, but sed or grep -oE with a capture group would be more portable.


Minor / Nits

  • The comment at line 660 mentions allocate_constant_on_gpu() as "the allocation primitive" — if the method is removed per issue 2 above, update this comment block.
  • The comment block at lines 651-664 is thorough and well-written.
  • constants_extracted_ member removal is clean — the cache naturally handles first-vs-subsequent without a separate flag.

Looks Good

  • The legacy fallback (load_constants_legacy) correctly preserves the old behavior for models compiled with older torch versions.
  • Removing the share_kv_cache_across_methods compile spec simplifies the API surface — users no longer need to opt in.
  • The AOTIDelegateHandle additions follow the existing pattern.
  • Error handling in load_constants_with_cache properly returns errors up the call chain via ET_CHECK_OK_OR_RETURN_ERROR.

@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 15, 2026 03:47 — with GitHub Actions Inactive
