Weight offloading API surface (CUDA backend) #19711

Draft
mergennachin wants to merge 1 commit into main from cuda_weight_offload_v4

@mergennachin
Contributor

@mergennachin mergennachin commented May 20, 2026

Weight offloading design surface (CUDA backend)

Design-only PR for CUDA-backend weight offloading: weights live in
CPU memory, and the runtime streams only the currently needed ones to
the GPU through a capped cudaMemPool. Headers and docstrings only -- no
implementation bodies, no caller, no wiring on the partitioner or
runtime side. All four design files are marked
EXPERIMENTAL -- NOT YET WIRED.
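
To make the capped-pool idea concrete, here is a minimal host-side sketch in plain Python. It only models the bookkeeping: which weights are resident, how many bytes are in use, and how many host-to-device copies a given access order would trigger. The class name `BudgetedResidency`, the `acquire` method, and the least-recently-used eviction policy are all assumptions for illustration; this PR does not state an eviction policy, and nothing here calls CUDA.

```python
from collections import OrderedDict


class BudgetedResidency:
    """Toy model of a capped device pool (hypothetical, no CUDA calls)."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        # Maps weight name -> size in bytes, ordered from least to most recently used.
        self.resident: "OrderedDict[str, int]" = OrderedDict()
        self.used_bytes = 0
        self.copies = 0  # number of simulated host-to-device copies

    def acquire(self, name: str, size: int) -> None:
        """Make `name` resident, evicting least recently used weights if needed."""
        if size > self.budget_bytes:
            raise ValueError(f"{name} ({size} B) exceeds the pool budget")
        if name in self.resident:
            self.resident.move_to_end(name)  # hit: refresh recency, no copy
            return
        while self.used_bytes + size > self.budget_bytes:
            _, freed = self.resident.popitem(last=False)  # evict the oldest entry
            self.used_bytes -= freed
        self.resident[name] = size
        self.used_bytes += size
        self.copies += 1


# Assumed access order: w0 is reused once, then w2 forces evictions.
pool = BudgetedResidency(budget_bytes=100)
for weight, size in [("w0", 60), ("w1", 30), ("w0", 60), ("w2", 50)]:
    pool.acquire(weight, size)
```

With a 100-byte budget, the second access to `w0` is a hit, while `w2` only fits after both earlier weights are evicted. A real implementation would also have to order evictions against in-flight kernels, which this sketch ignores.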

Public knobs (CudaPartitioner(weight_offload=True, ...) and the
weight_offload_budget_mb runtime spec) are intentionally NOT
exposed in this PR; they ship with the implementation. Two open items
block wiring and are documented inline:

  • Probe op preservation -- an identity custom op with
    mutates_args=() is a DCE target through inductor; the
    implementation PR must give probe non-elidable semantics that
    don't trip torch.export's parameter-output validation, plus a
    test that asserts the lowered AOTI wrapper actually emits the
    probe calls.

  • AOTI blob layout -- WeightCatalog::build needs per-constant
    offsets and dtype/shape. AOTI doesn't expose either today; the
    implementation PR must either land upstream shims or serialize
    the metadata into the partition payload at export time.
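
As an illustration of the second option (serializing the metadata at export time), the sketch below records a per-constant offset, byte size, dtype, and shape, then round-trips the table through JSON. Everything in it is hypothetical: `ConstantEntry`, `build_entries`, the back-to-back packing with 16-byte alignment, and the JSON encoding are assumptions, not the actual AOTI blob layout or the format this PR will adopt.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Tuple

# Element sizes for the handful of dtypes this sketch understands.
DTYPE_SIZES = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


@dataclass(frozen=True)
class ConstantEntry:
    """Hypothetical catalog record for one constant in the weight blob."""

    fqn: str
    offset: int
    nbytes: int
    dtype: str
    shape: Tuple[int, ...]


def build_entries(
    constants: Iterable[Tuple[str, str, Tuple[int, ...]]], alignment: int = 16
) -> List[ConstantEntry]:
    """Assign offsets assuming constants are packed in order with padding."""
    entries: List[ConstantEntry] = []
    offset = 0
    for fqn, dtype, shape in constants:
        numel = 1
        for dim in shape:
            numel *= dim
        nbytes = numel * DTYPE_SIZES[dtype]
        entries.append(ConstantEntry(fqn, offset, nbytes, dtype, tuple(shape)))
        # Round the next offset up to the assumed alignment.
        offset = (offset + nbytes + alignment - 1) // alignment * alignment
    return entries


def to_payload(entries: List[ConstantEntry]) -> bytes:
    """Encode the table as UTF-8 JSON (one possible payload encoding)."""
    return json.dumps([asdict(e) for e in entries]).encode("utf-8")


def from_payload(payload: bytes) -> List[ConstantEntry]:
    """Decode the table; JSON lists are converted back to shape tuples."""
    return [
        ConstantEntry(**{**d, "shape": tuple(d["shape"])})
        for d in json.loads(payload)
    ]


entries = build_entries(
    [("layers.0.weight", "float16", (3, 5)), ("layers.0.bias", "float32", (5,))]
)
restored = from_payload(to_payload(entries))
```

The point of the round trip is that a runtime-side catalog could be rebuilt from the payload alone, without asking AOTI for offsets it does not expose today.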

Read order:

backends/cuda/passes/weight_offload_pass.py -- export half
backends/cuda/runtime/weight_offload/weight_offload.h -- runtime
backends/cuda/runtime/weight_offload/probe_op.h -- c-shim
backends/cuda/runtime/weight_offload/prefetcher.h -- copy stream

See: #19709

@pytorch-bot

pytorch-bot Bot commented May 20, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19711

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 53 Pending, 3 Unrelated Failures, 3 Unclassified Failures

As of commit 1a0ca10 with merge base 54f1f28:

NEW FAILURE - The following job has failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 20, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mergennachin mergennachin force-pushed the cuda_weight_offload_v4 branch 5 times, most recently from bb0b1ca to f7a8396 Compare May 21, 2026 18:31
Design-only PR for CUDA-backend weight offloading: weights live in
CPU memory, the runtime streams only the currently-needed ones to GPU
through a capped cudaMemPool. Headers and docstrings only -- no
implementation bodies, no caller, no wiring on the partitioner or
runtime side. All four design files are marked
``EXPERIMENTAL -- NOT YET WIRED``.

Public knobs (``CudaPartitioner(weight_offload=True, ...)`` and the
``weight_offload_budget_mb`` runtime spec) are intentionally NOT
exposed in this PR; they ship with the implementation. Four open
items block wiring and are documented inline:

  * Probe op preservation -- an identity custom op with
    ``mutates_args=()`` is a DCE target through inductor; the
    implementation PR must give probe non-elidable semantics that
    don't trip torch.export's parameter-output validation, plus a
    test that asserts the lowered AOTI wrapper actually emits the
    probe calls.

  * AOTI blob layout -- ``WeightCatalog::build`` needs per-constant
    offsets and dtype/shape. AOTI doesn't expose either today; the
    implementation PR must either land upstream shims or serialize
    the metadata into the offload payload at export time.

  * Payload transport channel -- the pass has to run from
    ``AotiBackend.preprocess`` (the partitioner contract forbids
    mutating the ExportedProgram from ``partition()``); the
    implementation PR picks between ``processed_bytes`` and a
    per-method ``NamedDataStore`` entry.

  * Schedule / cursor order -- the runtime cursor hard-fails on a
    mismatch against the recorded schedule, but the pass observes
    parameter order before inductor lowering reorders / duplicates
    reads. The implementation PR either regenerates the schedule
    from the post-lowering wrapper or extends probe with a
    self-identifying ``probe_id`` / FQN arg so no cursor is needed.
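
The schedule / cursor item can be illustrated with a small pure-Python model. `CursorResolver` resolves each probe call by its position in a recorded schedule and fails on the first disagreement; `ProbeIdResolver` looks each call up by an id carried on the call, so order and duplication no longer matter. Both classes, the way a mismatch is detected (comparing names), and the example orders are assumptions for illustration, not the runtime's actual interface.

```python
from typing import Dict, List


class ScheduleMismatch(RuntimeError):
    """Raised when a probe call disagrees with the recorded schedule."""


class CursorResolver:
    """Resolves probe calls by position in a schedule recorded at export."""

    def __init__(self, schedule: List[str]) -> None:
        self.schedule = schedule
        self.cursor = 0

    def on_probe(self, observed_fqn: str) -> str:
        if self.cursor >= len(self.schedule):
            raise ScheduleMismatch(f"extra probe for {observed_fqn!r}")
        expected = self.schedule[self.cursor]
        if observed_fqn != expected:
            raise ScheduleMismatch(
                f"probe {self.cursor}: expected {expected!r}, got {observed_fqn!r}"
            )
        self.cursor += 1
        return expected


class ProbeIdResolver:
    """Resolves probe calls by a self-identifying id; no cursor state."""

    def __init__(self, table: Dict[int, str]) -> None:
        self.table = table

    def on_probe(self, probe_id: int) -> str:
        return self.table[probe_id]


# Order observed by the pass before lowering.
recorded = ["embed.weight", "layer0.weight", "layer1.weight"]
# Hypothetical post-lowering order: one read hoisted, one duplicated.
lowered = ["layer0.weight", "embed.weight", "embed.weight", "layer1.weight"]

id_of = {fqn: i for i, fqn in enumerate(recorded)}
by_id = ProbeIdResolver({i: fqn for fqn, i in id_of.items()})
resolved = [by_id.on_probe(id_of[fqn]) for fqn in lowered]

cursor = CursorResolver(recorded)
try:
    for fqn in lowered:
        cursor.on_probe(fqn)
    cursor_failed = False
except ScheduleMismatch:
    cursor_failed = True
```

In this model the cursor fails on the very first reordered read, while the id-based lookup resolves every call, including the duplicate.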

Read order:

  backends/cuda/passes/weight_offload_pass.py            -- export half
  backends/cuda/runtime/weight_offload/weight_offload.h  -- runtime
  backends/cuda/runtime/weight_offload/probe_op.h        -- c-shim
  backends/cuda/runtime/weight_offload/prefetcher.h      -- copy stream

See: #19709
@mergennachin mergennachin force-pushed the cuda_weight_offload_v4 branch from f7a8396 to 1a0ca10 Compare May 21, 2026 18:38