[ET Device Support] Module: allocate device memory for planned buffers by Gasoonjia · Pull Request #19746 · pytorch/executorch

Gasoonjia · 2026-05-23T03:40:40Z

Clone of #18476, recreated because of a bot crash.

…level device array

This diff adds device placement information to the ExecuTorch schema so that it can represent tensor-level device type information, which the upcoming tensor_parser updates require. This is part of the Phase 1 implementation to make ET device type work E2E without user-specified device placement.

Design doc: https://docs.google.com/document/d/1lwd9BlohmwkN5EEvRulO_b-XnZBwv1nMb5l2K3jfuwA/edit?tab=t.0#heading=h.o6anuvkix4bu

Differential Revision: [D93635657](https://our.internmc.facebook.com/intern/diff/D93635657/) [ghstack-poisoned]

This diff extends `TensorImpl` to carry device information, enabling the runtime tensor to track which device its data resides on (CPU, CUDA, etc.). This is a prerequisite for parsing device info from the schema and allocating device memory. Differential Revision: [D93635655](https://our.internmc.facebook.com/intern/diff/D93635655/) [ghstack-poisoned]

…stry

This diff introduces the `DeviceAllocator` abstract interface and `DeviceAllocatorRegistry` for device-specific memory allocation. This is a foundational abstraction that enables the runtime to dispatch memory operations to the appropriate device backend other than CPU (CUDA, etc.).

**DeviceAllocator interface provides:**
- `init_buffer()` - Initialize memory buffer pools for memory-planned tensors
- `get_offset_address()` - Get pointer to offset within pre-allocated buffer
- `allocate()` / `deallocate()` - Dynamic device memory allocation
- `copy_host_to_device()` / `copy_device_to_host()` - Data transfer between host and device
- `device_type()` - Returns the device type this allocator handles

**DeviceAllocatorRegistry provides:**
- Singleton registry mapping DeviceType → DeviceAllocator
- `register_allocator()` / `get_allocator()` methods
- Fixed-size array indexed by device type (no dynamic allocation, embedded-friendly)

**Design notes:**
- Registry stores raw pointers (non-owning); allocators are expected to be singletons with static lifetime
- Follows ExecuTorch's embedded-first philosophy (no std::unique_ptr, no heap allocation in registry)
- Convenience free functions `register_device_allocator()` and `get_device_allocator()` for ease of use

Differential Revision: [D93635656](https://our.internmc.facebook.com/intern/diff/D93635656/) [ghstack-poisoned]
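To make the registry design concrete, here is a minimal sketch of a non-owning, fixed-size registry keyed by device type. It is not the code from this PR: the method names come from the description above, but every signature, the `DeviceType` values, `kNumDeviceTypes`, and `FakeCudaAllocator` (a host-memory stand-in) are assumptions for illustration. `init_buffer()` and `get_offset_address()` are left out for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical device enum; the real definition lives in the ExecuTorch runtime.
enum class DeviceType : std::uint8_t { CPU = 0, CUDA = 1 };
constexpr std::size_t kNumDeviceTypes = 2;

// Illustrative interface. Method names follow the PR description;
// parameter and return types are assumptions.
class DeviceAllocator {
 public:
  virtual ~DeviceAllocator() = default;
  virtual void* allocate(std::size_t nbytes) = 0;
  virtual void deallocate(void* ptr) = 0;
  virtual bool copy_host_to_device(void* dst, const void* src, std::size_t nbytes) = 0;
  virtual bool copy_device_to_host(void* dst, const void* src, std::size_t nbytes) = 0;
  virtual DeviceType device_type() const = 0;
};

// Stand-in "CUDA" allocator backed by host memory so the sketch runs anywhere.
class FakeCudaAllocator final : public DeviceAllocator {
 public:
  void* allocate(std::size_t nbytes) override { return std::malloc(nbytes); }
  void deallocate(void* ptr) override { std::free(ptr); }
  bool copy_host_to_device(void* dst, const void* src, std::size_t nbytes) override {
    std::memcpy(dst, src, nbytes);
    return true;
  }
  bool copy_device_to_host(void* dst, const void* src, std::size_t nbytes) override {
    std::memcpy(dst, src, nbytes);
    return true;
  }
  DeviceType device_type() const override { return DeviceType::CUDA; }
};

// Singleton registry: a fixed-size array of raw, non-owning pointers.
class DeviceAllocatorRegistry {
 public:
  static DeviceAllocatorRegistry& instance() {
    static DeviceAllocatorRegistry registry;
    return registry;
  }

  // Returns false for a null allocator, an out-of-range type, or a taken slot.
  bool register_allocator(DeviceAllocator* allocator) {
    if (allocator == nullptr) return false;
    const std::size_t i = static_cast<std::size_t>(allocator->device_type());
    if (i >= kNumDeviceTypes || slots_[i] != nullptr) return false;
    slots_[i] = allocator;
    return true;
  }

  // Returns nullptr when nothing is registered for the given type.
  DeviceAllocator* get_allocator(DeviceType type) const {
    const std::size_t i = static_cast<std::size_t>(type);
    return i < kNumDeviceTypes ? slots_[i] : nullptr;
  }

 private:
  DeviceAllocatorRegistry() = default;
  std::array<DeviceAllocator*, kNumDeviceTypes> slots_{};  // all nullptr initially
};
```

Because the registry never owns the allocators, a backend would typically register an object with static storage duration, which matches the "singletons with static lifetime" expectation in the design notes.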

…vice mapping

Adds the NonConstBufferDevice table to the FlatBuffer schema (program.fbs) and the corresponding Python dataclass to schema.py. This enables mapping each non-constant planned memory buffer to a specific device type (CPU, CUDA, etc.). The field is optional and absent for CPU-only programs, ensuring zero binary size regression.

Differential Revision: [D97335597](https://our.internmc.facebook.com/intern/diff/D97335597/) [ghstack-poisoned]

…r device type

Extends memory planning to separate device tensors from CPU tensors into distinct memory buffers. Non-CPU TensorSpecs (e.g., CUDA) are pre-assigned device-specific mem_ids before the greedy/naive algorithm runs, ensuring they get planned into independent memory buffers that never share space with CPU tensors.

Differential Revision: [D97447105](https://our.internmc.facebook.com/intern/diff/D97447105/) [ghstack-poisoned]
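The pre-assignment step can be sketched as follows. The actual planning pass is presumably written in Python on the export side; C++ is used here only to keep the examples on this page in one language. `TensorSpec`, `preassign_mem_ids`, the choice of mem_id 1 for CPU, and the "one mem_id per non-CPU device type" policy are all assumptions, not details taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical device enum for illustration.
enum class DeviceType : std::uint8_t { CPU = 0, CUDA = 1 };

// Simplified stand-in for a tensor spec: only the fields this sketch needs.
struct TensorSpec {
  DeviceType device;
  std::uint32_t mem_id;  // 0 means "not assigned yet" in this sketch
};

// Assigns mem_ids before any planning algorithm runs.
// Assumption: all CPU tensors share mem_id 1, and each non-CPU device type
// gets its own mem_id starting at 2, so device tensors never share a buffer
// with CPU tensors. Returns the device-to-mem_id mapping that was created.
std::map<DeviceType, std::uint32_t> preassign_mem_ids(std::vector<TensorSpec>& specs) {
  constexpr std::uint32_t kCpuMemId = 1;
  std::uint32_t next_id = kCpuMemId + 1;
  std::map<DeviceType, std::uint32_t> device_to_mem_id;

  for (TensorSpec& spec : specs) {
    if (spec.device == DeviceType::CPU) {
      spec.mem_id = kCpuMemId;
      continue;
    }
    auto it = device_to_mem_id.find(spec.device);
    if (it == device_to_mem_id.end()) {
      // First tensor seen on this device: reserve a fresh mem_id for it.
      it = device_to_mem_id.emplace(spec.device, next_id++).first;
    }
    spec.mem_id = it->second;
  }
  return device_to_mem_id;
}
```

Once the mem_ids are fixed this way, a greedy or naive planner that only packs tensors within the same mem_id cannot place a CUDA tensor and a CPU tensor in the same buffer.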

…meta

Enable serializing non_const_buffer_device into the PTE file.

Differential Revision: [D97850707](https://our.internmc.facebook.com/intern/diff/D97850707/) [ghstack-poisoned]

…ifetime management

Introduces DeviceMemoryBuffer, an RAII wrapper that owns a single device memory allocation. On destruction, it automatically calls DeviceAllocator::deallocate() to free the memory. This mirrors the role of std::vector<uint8_t> for CPU planned buffers, but for non-CPU device memory (CUDA, etc.).

Key features:
- Static factory create(size, type, index) looks up DeviceAllocator from registry
- Move-only semantics (no copy) to enforce single ownership
- as_span() accessor wraps device pointer for use with HierarchicalAllocator
- Destructor is no-op for default-constructed or moved-from instances

Differential Revision: [D97850709](https://our.internmc.facebook.com/intern/diff/D97850709/) [ghstack-poisoned]
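The ownership rules listed above can be illustrated with a small move-only wrapper. This is a sketch under assumptions rather than the class from the PR: the factory here takes the allocator directly instead of looking it up in the registry by `(type, index)`, `as_span()` is omitted, and `CountingAllocator` is a hypothetical host-memory stand-in used only to observe frees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Minimal allocator interface for this sketch (signatures are assumptions).
struct DeviceAllocator {
  virtual ~DeviceAllocator() = default;
  virtual void* allocate(std::size_t nbytes) = 0;
  virtual void deallocate(void* ptr) = 0;
};

// Host-backed stand-in that counts live allocations.
struct CountingAllocator final : DeviceAllocator {
  int live = 0;
  void* allocate(std::size_t nbytes) override {
    void* p = std::malloc(nbytes);
    if (p != nullptr) ++live;
    return p;
  }
  void deallocate(void* ptr) override {
    std::free(ptr);
    --live;
  }
};

class DeviceMemoryBuffer {
 public:
  DeviceMemoryBuffer() = default;  // empty: destructor does nothing

  // Simplified factory: returns an empty buffer if allocation fails.
  static DeviceMemoryBuffer create(std::size_t size, DeviceAllocator* allocator) {
    DeviceMemoryBuffer buf;
    if (allocator != nullptr) {
      buf.ptr_ = allocator->allocate(size);
      if (buf.ptr_ != nullptr) {
        buf.size_ = size;
        buf.allocator_ = allocator;
      }
    }
    return buf;
  }

  // Move-only: copying would create two owners of one allocation.
  DeviceMemoryBuffer(const DeviceMemoryBuffer&) = delete;
  DeviceMemoryBuffer& operator=(const DeviceMemoryBuffer&) = delete;

  DeviceMemoryBuffer(DeviceMemoryBuffer&& other) noexcept
      : ptr_(std::exchange(other.ptr_, nullptr)),
        size_(std::exchange(other.size_, 0)),
        allocator_(std::exchange(other.allocator_, nullptr)) {}

  DeviceMemoryBuffer& operator=(DeviceMemoryBuffer&& other) noexcept {
    if (this != &other) {
      reset();
      ptr_ = std::exchange(other.ptr_, nullptr);
      size_ = std::exchange(other.size_, 0);
      allocator_ = std::exchange(other.allocator_, nullptr);
    }
    return *this;
  }

  ~DeviceMemoryBuffer() { reset(); }

  void* data() const { return ptr_; }
  std::size_t size() const { return size_; }

 private:
  // Frees the allocation if this instance owns one; safe to call repeatedly.
  void reset() {
    if (ptr_ != nullptr && allocator_ != nullptr) allocator_->deallocate(ptr_);
    ptr_ = nullptr;
    size_ = 0;
    allocator_ = nullptr;
  }

  void* ptr_ = nullptr;
  std::size_t size_ = 0;
  DeviceAllocator* allocator_ = nullptr;  // non-owning
};
```

Moving a buffer transfers the allocation and leaves the source empty, so exactly one destructor ends up calling `deallocate()`.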

Add memory_planned_buffer_device(index) to MethodMeta, returning the Device (type + index) for each planned memory buffer. This reads from the non_const_buffer_device field in the serialized ExecutionPlan. For CPU-only programs (or legacy PTE files without non_const_buffer_device), all buffers default to Device{CPU, 0}. The sparse list only stores entries for non-CPU buffers, so the lookup scans for a matching buffer_idx. This API enables Module::load_method() to query each buffer's target device and allocate accordingly (malloc for CPU, DeviceAllocator for CUDA, etc.). Differential Revision: [D97850708](https://our.internmc.facebook.com/intern/diff/D97850708/) [ghstack-poisoned]
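The sparse lookup with a CPU default can be sketched like this. Only `buffer_idx` and the function name come from the description above; the `Device` and `NonConstBufferDevice` layouts, the other field names, and the plain-value return type are assumptions (the real accessor is a `MethodMeta` member reading FlatBuffer data and may return an error-carrying type).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration.
enum class DeviceType : std::int8_t { CPU = 0, CUDA = 1 };

struct Device {
  DeviceType type;
  std::int8_t index;
};

// One sparse entry: present only for buffers that are NOT on CPU.
struct NonConstBufferDevice {
  std::uint32_t buffer_idx;
  DeviceType device_type;
  std::int8_t device_index;
};

// Linear scan over the sparse list. A buffer with no entry, or a program with
// an empty list (CPU-only or legacy PTE), resolves to Device{CPU, 0}.
Device memory_planned_buffer_device(
    const std::vector<NonConstBufferDevice>& sparse_entries,
    std::uint32_t buffer_idx) {
  for (const NonConstBufferDevice& entry : sparse_entries) {
    if (entry.buffer_idx == buffer_idx) {
      return Device{entry.device_type, entry.device_index};
    }
  }
  return Device{DeviceType::CPU, 0};
}
```

A linear scan is a reasonable fit when the list holds only the few non-CPU buffers of a method; a larger list might justify a sorted layout, but nothing in the description suggests that is needed.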

This diff extends MemoryManager with optional per-buffer device type metadata so the runtime explicitly knows which planned memory buffers are on which device. This enables future device-aware dispatch and debugging.

Changes:
- New constructor taking planned_buffer_devices as an extra input for device info
- New accessors: planned_buffer_devices(), has_device_memory()
- No existing functionality is changed.

Differential Revision: [D97850706](https://our.internmc.facebook.com/intern/diff/D97850706/) [ghstack-poisoned]
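As a rough picture of the new accessors, the sketch below stores one device type per planned buffer and reports whether any of them is non-CPU. The class name `MemoryManagerSketch` is hypothetical, `std::vector` is used only for brevity (the real runtime avoids heap-owning containers), and the meaning given to `has_device_memory()` is an assumption based on its name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical device enum for illustration.
enum class DeviceType : std::uint8_t { CPU = 0, CUDA = 1 };

class MemoryManagerSketch {
 public:
  // Existing-style construction: no device metadata, everything is CPU.
  MemoryManagerSketch() = default;

  // New-style construction: one entry per planned buffer.
  explicit MemoryManagerSketch(std::vector<DeviceType> planned_buffer_devices)
      : planned_buffer_devices_(std::move(planned_buffer_devices)) {}

  const std::vector<DeviceType>& planned_buffer_devices() const {
    return planned_buffer_devices_;
  }

  // Assumed semantics: true if at least one planned buffer is not on CPU.
  bool has_device_memory() const {
    return std::any_of(
        planned_buffer_devices_.begin(), planned_buffer_devices_.end(),
        [](DeviceType d) { return d != DeviceType::CPU; });
  }

 private:
  std::vector<DeviceType> planned_buffer_devices_;
};
```

Keeping the metadata optional means callers that use the original constructor see the same behavior as before, which is consistent with the note that no existing functionality is changed.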

…e memory for planned buffers"

This diff enables the Module API to load programs whose memory is planned on a non-CPU device. It updates Module::load_method() to detect device buffers via MethodMeta and allocate device memory using the registered DeviceAllocator. Device memory is managed via DeviceMemoryBuffer RAII objects stored in PlannedMemory, ensuring proper cleanup when the Method is destroyed.

Differential Revision: [D97850705](https://our.internmc.facebook.com/intern/diff/D97850705/) [ghstack-poisoned]
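The per-buffer dispatch described here might look roughly like the following. This is a simplified sketch, not `Module::load_method()` itself: `BufferInfo`, `allocate_planned`, and the `PlannedMemory` layout are hypothetical, and a `std::unique_ptr` with a `std::free` deleter stands in for `DeviceMemoryBuffer` and a real device allocator so the example runs without any device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical device enum for illustration.
enum class DeviceType : std::uint8_t { CPU = 0, CUDA = 1 };

// What the loader would learn about each planned buffer (size and device).
struct BufferInfo {
  std::size_t size;
  DeviceType device;
};

// Stand-in for an RAII device buffer: host malloc/free instead of a device API.
struct FakeDeviceFree {
  void operator()(void* p) const { std::free(p); }
};
using DeviceBuffer = std::unique_ptr<void, FakeDeviceFree>;

// Owns all planned memory; destroying it releases CPU and device buffers alike.
struct PlannedMemory {
  std::vector<std::vector<std::uint8_t>> cpu_buffers;
  std::vector<DeviceBuffer> device_buffers;
  std::vector<void*> base_pointers;  // one per planned buffer, in plan order
};

// Allocates each planned buffer on the side its device calls for.
// Returns false if a device allocation fails.
bool allocate_planned(const std::vector<BufferInfo>& infos, PlannedMemory& out) {
  for (const BufferInfo& info : infos) {
    if (info.device == DeviceType::CPU) {
      out.cpu_buffers.emplace_back(info.size);  // zero-initialized host bytes
      out.base_pointers.push_back(out.cpu_buffers.back().data());
    } else {
      DeviceBuffer buf(std::malloc(info.size));  // real code: DeviceAllocator
      if (!buf) return false;
      out.base_pointers.push_back(buf.get());
      out.device_buffers.push_back(std::move(buf));
    }
  }
  return true;
}
```

Storing the owning objects next to the raw base pointers ties the device memory's lifetime to the same structure that owns the CPU buffers, which is the cleanup property the description attributes to `PlannedMemory`.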

[ghstack-poisoned]

Differential Revision: D97850705 Pull Request resolved: #18476

pytorch-bot · 2026-05-23T03:40:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19746

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Gasoonjia added 30 commits February 18, 2026 11:25

[ET Device Support] Emitter reads non_const_buffer_device from graph …

0829c5d

…meta Enable serializing non_const_buffer_device into the PTE file. Differential Revision: [D97850707](https://our.internmc.facebook.com/intern/diff/D97850707/) [ghstack-poisoned]

Update (base update)

280c1d2

[ghstack-poisoned]

Update (base update)

8bd18ab

[ghstack-poisoned]

Update (base update)

0c8ce9e

[ghstack-poisoned]

Update (base update)

764a2c8

[ghstack-poisoned]

Update (base update)

cbab086

[ghstack-poisoned]

[ET Device Support] Module: allocate device memory for planned buffers

b68f982

Differential Revision: D97850705 Pull Request resolved: #18476

Gasoonjia requested review from kirklandsign, larryliu0820 and shoumikhin as code owners May 23, 2026 03:40

meta-cla Bot added the CLA Signed label May 23, 2026

kirklandsign approved these changes May 23, 2026

Gasoonjia merged commit 12f62f2 into main May 23, 2026
169 of 179 checks passed

Gasoonjia deleted the gh/gasoonjia/156/base branch May 23, 2026 03:48
