Weight offloading API surface (CUDA backend) by mergennachin · Pull Request #19711 · pytorch/executorch

mergennachin · 2026-05-20T22:21:56Z

Weight offloading design surface (CUDA backend)

Design-only PR for CUDA-backend weight offloading: weights live in
CPU memory, the runtime streams only the currently-needed ones to GPU
through a capped cudaMemPool. Headers and docstrings only -- no
implementation bodies, no caller, no wiring on the partitioner or
runtime side. All four design files are marked
EXPERIMENTAL -- NOT YET WIRED.
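
The budget idea above can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: the class `BudgetedResidency`, its `acquire` method, and the least-recently-used eviction policy are hypothetical and do not come from the PR, which ships headers and docstrings only. It models byte accounting under a cap, not actual `cudaMemPool` calls.

```python
from collections import OrderedDict


class BudgetedResidency:
    """Toy model of a capped device pool (hypothetical, not the PR's API).

    Tracks which weights count as "on GPU" and how many host-to-device
    copies a runtime would need to issue under a fixed byte budget.
    """

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.resident: "OrderedDict[str, int]" = OrderedDict()  # name -> nbytes
        self.used_bytes = 0
        self.uploads = 0

    def acquire(self, name: str, nbytes: int) -> None:
        """Make `name` resident, evicting least recently used weights if needed."""
        if nbytes > self.budget_bytes:
            raise MemoryError(f"{name} ({nbytes} B) exceeds the pool budget")
        if name in self.resident:
            self.resident.move_to_end(name)  # already resident: mark as recently used
            return
        while self.used_bytes + nbytes > self.budget_bytes:
            _, freed = self.resident.popitem(last=False)  # evict the oldest entry
            self.used_bytes -= freed
        self.resident[name] = nbytes
        self.used_bytes += nbytes
        self.uploads += 1  # a real runtime would copy CPU -> GPU here


# Two 60-byte weights cannot both fit in a 100-byte budget, so revisiting
# the first weight forces a second upload of it.
pool = BudgetedResidency(budget_bytes=100)
for weight_name, size in [("layer0.w", 60), ("layer1.w", 60), ("layer0.w", 60)]:
    pool.acquire(weight_name, size)
```

The simulation makes the trade-off visible: a smaller budget lowers peak GPU memory but raises the number of uploads, which is the cost a prefetcher on a separate copy stream would try to hide.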

Public knobs (CudaPartitioner(weight_offload=True, ...) and the
weight_offload_budget_mb runtime spec) are intentionally NOT
exposed in this PR; they ship with the implementation. Two open items
block wiring and are documented inline:

* Probe op preservation -- an identity custom op with
  mutates_args=() is a DCE target through inductor; the
  implementation PR must give probe non-elidable semantics that
  don't trip torch.export's parameter-output validation, plus a
  test that asserts the lowered AOTI wrapper actually emits the
  probe calls.
* AOTI blob layout -- WeightCatalog::build needs per-constant
  offsets and dtype/shape. AOTI doesn't expose either today;
  implementation PR must either land upstream shims or serialize
  the metadata into the partition payload at export time.
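
For the second option in the AOTI blob layout item (serializing the metadata at export time), a minimal sketch might look like the following. Everything here is an assumption made for illustration: the `ConstantEntry` fields, the 16-byte alignment, the back-to-back layout, and the JSON encoding are hypothetical, and the real blob layout is whatever AOTI produces.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class ConstantEntry:
    """Hypothetical per-constant record a weight catalog could be built from."""
    name: str
    offset: int              # byte offset into the weight blob
    nbytes: int
    dtype: str
    shape: Tuple[int, ...]


def build_catalog(
    constants: List[Tuple[str, str, Tuple[int, ...], int]],
    alignment: int = 16,
) -> List[ConstantEntry]:
    """Lay constants out back to back, rounding each offset up to `alignment`.

    Each input tuple is (name, dtype, shape, itemsize_in_bytes).
    """
    entries: List[ConstantEntry] = []
    cursor = 0
    for name, dtype, shape, itemsize in constants:
        numel = 1
        for dim in shape:
            numel *= dim
        cursor = (cursor + alignment - 1) // alignment * alignment
        nbytes = numel * itemsize
        entries.append(ConstantEntry(name, cursor, nbytes, dtype, shape))
        cursor += nbytes
    return entries


def serialize(entries: List[ConstantEntry]) -> bytes:
    """Encode the catalog as UTF-8 JSON so it can travel with the payload."""
    return json.dumps([asdict(e) for e in entries]).encode("utf-8")


def deserialize(payload: bytes) -> List[ConstantEntry]:
    """Decode the catalog; JSON turns tuples into lists, so restore the shape."""
    return [
        ConstantEntry(**{**d, "shape": tuple(d["shape"])})
        for d in json.loads(payload)
    ]


catalog = build_catalog([
    ("fc.weight", "float16", (3, 5), 2),   # 30 bytes at offset 0
    ("fc.bias", "float32", (5,), 4),       # 20 bytes, offset rounded up to 32
])
restored = deserialize(serialize(catalog))
```

The point of the sketch is only that offsets, dtype, and shape can be recorded once at export time and read back unchanged at load time, without the runtime having to query AOTI for them.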

Read order:

backends/cuda/passes/weight_offload_pass.py -- export half
backends/cuda/runtime/weight_offload/weight_offload.h -- runtime
backends/cuda/runtime/weight_offload/probe_op.h -- c-shim
backends/cuda/runtime/weight_offload/prefetcher.h -- copy stream

See: #19709

pytorch-bot · 2026-05-20T22:22:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19711

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 53 Pending, 3 Unrelated Failures, 3 Unclassified Failures

As of commit 1a0ca10 with merge base 54f1f28:

NEW FAILURE - The following job has failed:

Lint / link-check / lint-urls (gh)

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Check Labels / Check labels (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: GraphQL query
pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command docker exec -t d59b4168cd2e3fb4adb484b30c196744f9586c2e6579d3892e7b49cfc8e3e1f3 /exec failed with exit code 127
pull / test-arm-backend-no-driver (test_run_tosa) / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
RuntimeError: Command docker exec -t 84b698e8f30143c74de243912f641a4487d0c1e30643dbdafc7c75b604dd3d53 /exec failed with exit code 127

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / android / run-emulator (gh) (detected as infra flaky with no log or failing log classifier)
pull / test-arm-backend-no-driver (test_pytest_models_tosa) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)
pull / test-arm-backend-no-driver (test_pytest_ops_no_target) / linux-job (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-05-20T22:22:42Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Design-only PR for CUDA-backend weight offloading: weights live in CPU memory, the runtime streams only the currently-needed ones to GPU through a capped cudaMemPool. Headers and docstrings only -- no implementation bodies, no caller, no wiring on the partitioner or runtime side. All four design files are marked ``EXPERIMENTAL -- NOT YET WIRED``.

Public knobs (``CudaPartitioner(weight_offload=True, ...)`` and the ``weight_offload_budget_mb`` runtime spec) are intentionally NOT exposed in this PR; they ship with the implementation. Four open items block wiring and are documented inline:

* Probe op preservation -- an identity custom op with ``mutates_args=()`` is a DCE target through inductor; the implementation PR must give probe non-elidable semantics that don't trip torch.export's parameter-output validation, plus a test that asserts the lowered AOTI wrapper actually emits the probe calls.
* AOTI blob layout -- ``WeightCatalog::build`` needs per-constant offsets and dtype/shape. AOTI doesn't expose either today; implementation PR must either land upstream shims or serialize the metadata into the offload payload at export time.
* Payload transport channel -- the pass has to run from ``AotiBackend.preprocess`` (the partitioner contract forbids mutating the ExportedProgram from ``partition()``); the implementation PR picks between ``processed_bytes`` and a per-method ``NamedDataStore`` entry.
* Schedule / cursor order -- the runtime cursor hard-fails on a mismatch against the recorded schedule, but the pass observes parameter order before inductor lowering reorders / duplicates reads. Implementation PR either regenerates the schedule from the post-lowering wrapper or extends probe with a self-identifying ``probe_id`` / FQN arg so no cursor is needed.

Read order:

backends/cuda/passes/weight_offload_pass.py -- export half
backends/cuda/runtime/weight_offload/weight_offload.h -- runtime
backends/cuda/runtime/weight_offload/probe_op.h -- c-shim
backends/cuda/runtime/weight_offload/prefetcher.h -- copy stream

See: #19709
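
The schedule / cursor order item can be made concrete with a toy comparison of the two approaches. This is a sketch under assumptions: `CursorRuntime`, `SelfIdentifyingRuntime`, `ScheduleMismatch`, and the idea that the cursor compares FQNs are all hypothetical stand-ins, since the PR does not include implementation bodies.

```python
from typing import Dict, List


class ScheduleMismatch(RuntimeError):
    """Raised when a probe call disagrees with the recorded schedule."""


class CursorRuntime:
    """Cursor approach: every probe call must match the next schedule entry."""

    def __init__(self, schedule: List[str]) -> None:
        self.schedule = schedule
        self.cursor = 0

    def on_probe(self, observed_fqn: str) -> None:
        in_range = self.cursor < len(self.schedule)
        expected = self.schedule[self.cursor] if in_range else None
        if observed_fqn != expected:
            raise ScheduleMismatch(
                f"step {self.cursor}: expected {expected!r}, got {observed_fqn!r}"
            )
        self.cursor += 1


class SelfIdentifyingRuntime:
    """probe_id approach: each call names its weight, so call order is free."""

    def __init__(self, id_to_fqn: Dict[int, str]) -> None:
        self.id_to_fqn = id_to_fqn
        self.requested: List[str] = []

    def on_probe(self, probe_id: int) -> None:
        self.requested.append(self.id_to_fqn[probe_id])


# Order recorded by the pass before lowering, and a reordered order such as
# lowering might produce afterwards.
recorded_schedule = ["a.w", "b.w", "c.w"]
lowered_order = ["b.w", "a.w", "c.w"]

cursor_runtime = CursorRuntime(recorded_schedule)
try:
    for fqn in lowered_order:
        cursor_runtime.on_probe(fqn)
    cursor_failed = False
except ScheduleMismatch:
    cursor_failed = True  # the very first reordered read trips the cursor

fqn_to_id = {fqn: i for i, fqn in enumerate(recorded_schedule)}
id_runtime = SelfIdentifyingRuntime({i: fqn for fqn, i in fqn_to_id.items()})
for fqn in lowered_order:
    id_runtime.on_probe(fqn_to_id[fqn])
```

With the same reordered reads, the cursor variant fails at the first step while the self-identifying variant serves every request, which is why the second option removes the need for a cursor at the cost of an extra probe argument.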

meta-cla Bot added the CLA Signed label on May 20, 2026

mergennachin force-pushed the cuda_weight_offload_v4 branch 5 times, most recently from bb0b1ca to f7a8396, on May 21, 2026 18:31

mergennachin force-pushed the cuda_weight_offload_v4 branch from f7a8396 to 1a0ca10 on May 21, 2026 18:38

Weight offloading API surface (CUDA backend) #19711
mergennachin wants to merge 1 commit into main from cuda_weight_offload_v4
