Skip to content

fix: disable USM compression for any multi-device context#930

Open
F3zz1k wants to merge 1 commit into
intel:masterfrom
F3zz1k:fix/svm-disable-compression-multidevice
Open

fix: disable USM compression for any multi-device context#930
F3zz1k wants to merge 1 commit into
intel:masterfrom
F3zz1k:fix/svm-disable-compression-multidevice

Conversation

@F3zz1k
Copy link
Copy Markdown

@F3zz1k F3zz1k commented May 24, 2026

Summary

Closes #921

The peer-USM compression guard in SVMAllocsManager::createUnifiedMemoryAllocation is gated on device->hasAnyPeerAccess().value_or(false). For the first USM device allocation in a multi-device L0 context that std::optional can be observed as Some(false) (peer-access cache not yet populated by initializePeerAccessForDevices). The disable arm is skipped, the BO is allocated compressed, and any later peer-import returns garbage through dma-buf because the importing device has no matching surface metadata.

This patch adds memoryProperties.rootDeviceIndices.size() > 1 as an additional, synchronously-known gate. Multi-device contexts conservatively disable compression, correct because a BO in a multi-device context can become peer-imported at any subsequent point in the application's lifetime.

Verification

  • 2× and 4x Intel Arc Pro B70 (BMG-G31), Linux 7.1-rc4 + xe (with Leon Romanovsky's "dma-buf: Always build with DMABUF_MOVE_NOTIFY")
  • Built locally with the patch loaded via LD_LIBRARY_PATH against compute-runtime 26.05/26.18, cross-device sycl::queue::memcpy(host, peer_usm_ptr) returned garbage before; returns source data correctly after
  • Cause-and-effect confirmed by reverting the patch on the same build, returns to garbage
  • The OpenCL UNRECOVERABLE_IF(!allocation) at enqueue_svm.h:308 symptom on the same workload is resolved by this fix
  • llama.cpp tensor-split (--device SYCL0,SYCL1 --split-mode layer) regression-tested on both patched and stock, no behavioural change there (that workload's allocation timing happens to dodge the race)

Note: This patch remains necessary even on kernels with DRM_XE_VM_BIND_FLAG_DECOMPRESS (7.1-rc1+).
The kernel-side xe_bo_decompress() at drivers/gpu/drm/xe/xe_bo.c:3537 silently skips non-VRAM buffers, and
peer-imported BOs are ttm_bo_type_sg, so cross-device decompression is a no-op even when the flag is
accepted. Until that's addressed kernel-side, disabling compression at allocation time for multi-device
contexts is the correct behaviour.

Unit tests

Not added in this commit. The targeted code path requires a platform where releaseHelper->isUsmCompressionSupportedOnPeerAccess() returns false (release2001 / release2002 = BMG / LNL) AND usmCompressionSupported(hwInfo) returns true. The default MockReleaseHelper returns true for the former, so a test under shared/test/unit_test/memory_manager/ would skip on most configurations.

A natural home would be shared/test/unit_test/xe2_hpg_core/ which already uses the BMG release helper, or a new test using MockReleaseHelper overrides. If really needed, I could add it, would appreciate guidance on the preferred location and pattern (the existing givenLocalMemoryEnabledAndCompressionEnabledThenDeviceSideSharedUsmIsCompressed test in unified_memory_manager_tests.cpp:721 is the natural sibling but uses single-root setup).

Related

cc @ivysochyn — this directly extends the guard you added in dc85f33 and refactored in a832f01. Race observed where hasAnyPeerAccess() is Some(false) for the very first multi-device alloc.

The existing peer-USM compression guard in createUnifiedMemoryAllocation
is gated on device->hasAnyPeerAccess().value_or(false). That
std::optional<bool> is populated lazily by initializePeerAccessForDevices
at driver init and can be observed as Some(false) for the first USM
device allocation in a multi-device L0 context. The disable arm is then
skipped, the BO is allocated with the compression flag enabled, and any
later peer-import returns the compressed bytes through dma-buf without
the matching surface metadata.

Use rootDeviceIndices.size() > 1 as an additional gate. That property is
known synchronously at allocation time and conservatively disables
compression for any BO in a multi-device context, where peer-import can
happen at any subsequent point regardless of the current peer-access
cache state.

Verified on 2x Intel Arc Pro B70 (BMG-G31), Linux 7.1-rc4 + xe (with
Leon Romanovsky's "dma-buf: Always build with DMABUF_MOVE_NOTIFY"):
multi-device sycl::queue::memcpy(host, dev0_usm_ptr) executed on dev1
previously returned garbage; now returns the source data correctly.
The OpenCL backend symptom (UNRECOVERABLE_IF(!allocation) at
enqueue_svm.h:308) is resolved by the same fix.

Related: vllm-project/vllm#41663, intel#916, intel#921, intel#922.
Signed-off-by: Gabriel Bouffard <gabriel.a.bouffard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-BMG Level Zero device enumeration broken by deferred internal-engine init (regression NEO 25.40→25.44, Arc Pro B50/B60/B70)

1 participant