
[ET Device Support] MemoryManager: add per-buffer device metadata#19737

Merged
Gasoonjia merged 26 commits into main from gh/gasoonjia/155/base
May 22, 2026

Conversation

@Gasoonjia
Contributor

Recreated #18475 manually after a bot crash.

…level device array

This diff adds device placement information to the ExecuTorch schema so that it can represent tensor-level device types, a prerequisite for the upcoming tensor_parser updates.

This is part of the Phase 1 implementation to make ET device type work E2E without user-specified device placement.

Design doc: https://docs.google.com/document/d/1lwd9BlohmwkN5EEvRulO_b-XnZBwv1nMb5l2K3jfuwA/edit?tab=t.0#heading=h.o6anuvkix4bu

Differential Revision: [D93635657](https://our.internmc.facebook.com/intern/diff/D93635657/)

[ghstack-poisoned]
This diff extends `TensorImpl` to carry device information, enabling the runtime tensor to track which device its data resides on (CPU, CUDA, etc.). This is a prerequisite for parsing device info from the schema and allocating device memory.

Differential Revision: [D93635655](https://our.internmc.facebook.com/intern/diff/D93635655/)

[ghstack-poisoned]
…stry

This diff introduces the `DeviceAllocator` abstract interface and `DeviceAllocatorRegistry` for device-specific memory allocation. This is a foundational abstraction that enables the runtime to dispatch memory operations to the appropriate non-CPU device backend (CUDA, etc.).

**DeviceAllocator interface provides:**
- `init_buffer()` - Initialize memory buffer pools for memory-planned tensors
- `get_offset_address()` - Get pointer to offset within pre-allocated buffer
- `allocate()` / `deallocate()` - Dynamic device memory allocation
- `copy_host_to_device()` / `copy_device_to_host()` - Data transfer between host and device
- `device_type()` - Returns the device type this allocator handles

**DeviceAllocatorRegistry provides:**
- Singleton registry mapping DeviceType → DeviceAllocator
- `register_allocator()` / `get_allocator()` methods
- Fixed-size array indexed by device type (no dynamic allocation, embedded-friendly)

**Design notes:**
- Registry stores raw pointers (non-owning) - allocators are expected to be singletons with static lifetime
- Follows ExecuTorch's embedded-first philosophy (no std::unique_ptr, no heap allocation in registry)
- Convenience free functions `register_device_allocator()` and `get_device_allocator()` for ease of use
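
A minimal sketch of the registry pattern described above, not the code from this diff. The `DeviceType` enum values, the reduced three-method interface, the `FakeCudaAllocator` stand-in (a host arena, no real device memory), and the rule that a second registration for the same device is rejected are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical device enum; the real runtime defines its own device types.
enum class DeviceType : uint8_t { CPU = 0, CUDA = 1, NumTypes = 2 };

// Abstract interface, reduced to three of the operations listed above.
class DeviceAllocator {
 public:
  virtual ~DeviceAllocator() = default;
  virtual void* allocate(size_t nbytes) = 0;
  virtual void deallocate(void* ptr) = 0;
  virtual DeviceType device_type() const = 0;
};

// Fixed-size table indexed by device type. It stores non-owning raw
// pointers, so registered allocators must outlive the registry's users.
class DeviceAllocatorRegistry {
 public:
  static DeviceAllocatorRegistry& instance() {
    static DeviceAllocatorRegistry registry;
    return registry;
  }

  // Assumption: returns false when the slot is taken or out of range.
  bool register_allocator(DeviceAllocator* alloc) {
    assert(alloc != nullptr);
    const size_t idx = static_cast<size_t>(alloc->device_type());
    if (idx >= slots_.size() || slots_[idx] != nullptr) {
      return false;
    }
    slots_[idx] = alloc;
    return true;
  }

  // Returns nullptr when no allocator is registered for the device type.
  DeviceAllocator* get_allocator(DeviceType type) const {
    const size_t idx = static_cast<size_t>(type);
    return idx < slots_.size() ? slots_[idx] : nullptr;
  }

 private:
  DeviceAllocatorRegistry() = default;
  std::array<DeviceAllocator*, static_cast<size_t>(DeviceType::NumTypes)>
      slots_{};
};

// Stand-in "CUDA" allocator: a bump allocator over a static host arena,
// so the sketch runs without any device.
class FakeCudaAllocator final : public DeviceAllocator {
 public:
  void* allocate(size_t nbytes) override {
    if (used_ + nbytes > sizeof(arena_)) {
      return nullptr;
    }
    void* p = arena_ + used_;
    used_ += nbytes;
    return p;
  }
  void deallocate(void*) override {}  // bump allocator: nothing to release
  DeviceType device_type() const override { return DeviceType::CUDA; }

 private:
  alignas(16) unsigned char arena_[256];
  size_t used_ = 0;
};
```

Because the table is a plain `std::array` of raw pointers, lookup is a bounds check plus an index and the registry itself never touches the heap, which matches the embedded-friendly constraint stated in the design notes.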

Differential Revision: [D93635656](https://our.internmc.facebook.com/intern/diff/D93635656/)

[ghstack-poisoned]
…vice mapping

Adds the NonConstBufferDevice table to the FlatBuffer schema (program.fbs) and the
corresponding Python dataclass to schema.py. This enables mapping each non-constant
planned memory buffer to a specific device type (CPU, CUDA, etc.).

The field is optional and absent for CPU-only programs, ensuring zero binary size regression.

Differential Revision: [D97335597](https://our.internmc.facebook.com/intern/diff/D97335597/)

[ghstack-poisoned]
…r device type

Extends memory planning to separate device tensors from CPU tensors into distinct
memory buffers. Non-CPU TensorSpecs (e.g., CUDA) are pre-assigned device-specific
mem_ids before the greedy/naive algorithm runs, ensuring they get planned into
independent memory buffers that never share space with CPU tensors.
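
The pre-assignment step can be sketched as follows. The actual planner is not shown in this description, so this is only an illustration written in C++ for consistency with the runtime sketches here: the `TensorSpec` struct, the `-1` "unassigned" marker, and the choice of starting device `mem_id` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class DeviceType : uint8_t { CPU = 0, CUDA = 1 };

// Hypothetical reduced spec: only the fields this sketch needs.
struct TensorSpec {
  DeviceType device;
  int mem_id;  // -1 means "not assigned yet, leave it to the planner"
};

// Gives every non-CPU device type its own mem_id, starting at
// first_device_mem_id. CPU specs are left untouched so the regular
// planning algorithm can place them afterwards.
void preassign_device_mem_ids(
    std::vector<TensorSpec>& specs,
    int first_device_mem_id) {
  std::map<DeviceType, int> id_for_device;
  int next_id = first_device_mem_id;
  for (TensorSpec& spec : specs) {
    if (spec.device == DeviceType::CPU) {
      continue;
    }
    auto it = id_for_device.find(spec.device);
    if (it == id_for_device.end()) {
      it = id_for_device.emplace(spec.device, next_id++).first;
    }
    spec.mem_id = it->second;
  }
}
```

Since device specs receive a `mem_id` that no CPU spec is given, a planner that only shares space within one `mem_id` cannot overlap a device tensor with a CPU tensor.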

Differential Revision: [D97447105](https://our.internmc.facebook.com/intern/diff/D97447105/)

[ghstack-poisoned]
…meta

Enables serializing non_const_buffer_device into the PTE file.

Differential Revision: [D97850707](https://our.internmc.facebook.com/intern/diff/D97850707/)

[ghstack-poisoned]
…ifetime management

Introduces DeviceMemoryBuffer, an RAII wrapper that owns a single device
memory allocation. On destruction, it automatically calls
DeviceAllocator::deallocate() to free the memory. This mirrors the role of
std::vector<uint8_t> for CPU planned buffers, but for non-CPU device memory (CUDA, etc.).

Key features:
- Static factory create(size, type, index) looks up DeviceAllocator from registry
- Move-only semantics (no copy) to enforce single ownership
- as_span() accessor wraps device pointer for use with HierarchicalAllocator
- Destructor is no-op for default-constructed or moved-from instances
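
The ownership rules above can be illustrated with a reduced sketch. It is not the class from this diff: `create()` here takes the allocator directly instead of looking it up in the registry by device type and index, `as_span()` is replaced by plain `data()`/`size()` accessors, and `CountingAllocator` is a hypothetical host-memory stand-in that counts live allocations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Hypothetical allocator: host malloc stands in for device memory.
struct CountingAllocator {
  int live = 0;  // number of allocations not yet released
  void* allocate(size_t nbytes) {
    void* p = std::malloc(nbytes);
    if (p != nullptr) {
      ++live;
    }
    return p;
  }
  void deallocate(void* p) {
    std::free(p);
    --live;
  }
};

class DeviceMemoryBuffer {
 public:
  DeviceMemoryBuffer() = default;

  // Simplified factory: the allocator is passed in rather than looked up.
  static DeviceMemoryBuffer create(CountingAllocator* alloc, size_t size) {
    DeviceMemoryBuffer buf;
    buf.ptr_ = alloc->allocate(size);
    if (buf.ptr_ != nullptr) {
      buf.alloc_ = alloc;
      buf.size_ = size;
    }
    return buf;
  }

  // Move-only: copying would create two owners of one allocation.
  DeviceMemoryBuffer(const DeviceMemoryBuffer&) = delete;
  DeviceMemoryBuffer& operator=(const DeviceMemoryBuffer&) = delete;

  DeviceMemoryBuffer(DeviceMemoryBuffer&& other) noexcept
      : alloc_(std::exchange(other.alloc_, nullptr)),
        ptr_(std::exchange(other.ptr_, nullptr)),
        size_(std::exchange(other.size_, 0)) {}

  DeviceMemoryBuffer& operator=(DeviceMemoryBuffer&& other) noexcept {
    if (this != &other) {
      release();
      alloc_ = std::exchange(other.alloc_, nullptr);
      ptr_ = std::exchange(other.ptr_, nullptr);
      size_ = std::exchange(other.size_, 0);
    }
    return *this;
  }

  ~DeviceMemoryBuffer() { release(); }

  void* data() const { return ptr_; }
  size_t size() const { return size_; }

 private:
  // No-op for default-constructed or moved-from instances.
  void release() {
    if (alloc_ != nullptr && ptr_ != nullptr) {
      alloc_->deallocate(ptr_);
    }
    alloc_ = nullptr;
    ptr_ = nullptr;
    size_ = 0;
  }

  CountingAllocator* alloc_ = nullptr;
  void* ptr_ = nullptr;
  size_t size_ = 0;
};
```

Moving a buffer transfers the allocation and leaves the source empty, so exactly one destructor ends up calling `deallocate()`.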

Differential Revision: [D97850709](https://our.internmc.facebook.com/intern/diff/D97850709/)

[ghstack-poisoned]
Add memory_planned_buffer_device(index) to MethodMeta, returning the
Device (type + index) for each planned memory buffer. This reads from
the non_const_buffer_device field in the serialized ExecutionPlan.

For CPU-only programs (or legacy PTE files without non_const_buffer_device),
all buffers default to Device{CPU, 0}. The sparse list only stores entries
for non-CPU buffers, so the lookup scans for a matching buffer_idx.

This API enables Module::load_method() to query each buffer's target device
and allocate accordingly (malloc for CPU, DeviceAllocator for CUDA, etc.).
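
The sparse lookup with a CPU default might look like the sketch below. The `BufferDeviceEntry` struct is a hypothetical in-memory mirror of one serialized entry, and the free-function signature is an assumption; the real accessor is a `MethodMeta` member that reads from the FlatBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class DeviceType : uint8_t { CPU = 0, CUDA = 1 };

struct Device {
  DeviceType type;
  uint8_t index;
};

// Hypothetical mirror of one sparse entry: only non-CPU buffers appear.
struct BufferDeviceEntry {
  uint32_t buffer_idx;
  DeviceType type;
  uint8_t device_index;
};

// Linear scan over the sparse list. A missing entry, including the case
// of a legacy program with no list at all (entries == nullptr, count == 0),
// falls back to Device{CPU, 0}.
Device memory_planned_buffer_device(
    const BufferDeviceEntry* entries,
    size_t count,
    uint32_t buffer_idx) {
  for (size_t i = 0; i < count; ++i) {
    if (entries[i].buffer_idx == buffer_idx) {
      return Device{entries[i].type, entries[i].device_index};
    }
  }
  return Device{DeviceType::CPU, 0};
}
```

A linear scan is a reasonable fit when the list only holds the few non-CPU buffers; a caller that queries many buffers repeatedly could cache the results instead.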

Differential Revision: [D97850708](https://our.internmc.facebook.com/intern/diff/D97850708/)

[ghstack-poisoned]
…-buffer device metadata"

This diff extends MemoryManager with optional per-buffer device type metadata so that the runtime explicitly knows which device each planned memory buffer resides on. This enables future device-aware dispatch and debugging.

Changes:
- New constructor taking planned_buffer_devices as extra input for device info
- New accessors: planned_buffer_devices(), has_device_memory()
- No existing functionality is changed.
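
A reduced sketch of how the optional metadata and the two accessors could fit together. The real `MemoryManager` also holds its allocators, which are omitted here; the `DeviceSpan` view type, the class name `MemoryManagerSketch`, and computing `has_device_memory()` by scanning on each call are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class DeviceType : uint8_t { CPU = 0, CUDA = 1 };

// Hypothetical non-owning view over one device type per planned buffer.
struct DeviceSpan {
  const DeviceType* data;
  size_t size;
};

class MemoryManagerSketch {
 public:
  // Existing CPU-only path: no device metadata is supplied.
  MemoryManagerSketch() = default;

  // New path: the caller passes one device type per planned buffer.
  explicit MemoryManagerSketch(DeviceSpan planned_buffer_devices)
      : devices_(planned_buffer_devices) {}

  DeviceSpan planned_buffer_devices() const { return devices_; }

  // True if at least one planned buffer lives on a non-CPU device.
  bool has_device_memory() const {
    for (size_t i = 0; i < devices_.size; ++i) {
      if (devices_.data[i] != DeviceType::CPU) {
        return true;
      }
    }
    return false;
  }

 private:
  DeviceSpan devices_{nullptr, 0};
};
```

With an empty span as the default, callers that never pass device metadata see `has_device_memory()` return false, which is consistent with the statement that existing behavior is unchanged.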

Differential Revision: [D97850706](https://our.internmc.facebook.com/intern/diff/D97850706/)

[ghstack-poisoned]
Differential Revision: D97850706

Pull Request resolved: #18475
@pytorch-bot

pytorch-bot Bot commented May 22, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19737

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label May 22, 2026
@Gasoonjia Gasoonjia merged commit c5e0a03 into main May 22, 2026
166 of 171 checks passed
@Gasoonjia Gasoonjia deleted the gh/gasoonjia/155/base branch May 22, 2026 00:40

2 participants