
[XPU] Fix illegal instruction error when running Intel P800-compiled RDMA libs on Hygon P800#7935

Open
hong19860320 wants to merge 1 commit into PaddlePaddle:develop from hong19860320:hongming/fix_rdma_for_hygon

hong19860320 (Collaborator) commented May 27, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Cross-platform binary incompatibility: a shared library compiled with -march=native on an Intel host can contain instructions that are illegal on Hygon (AMD-based) CPUs.

Modifications

The RDMA library compiled on an Intel-based P800 machine fails with an "illegal instruction" error on import rdma_comm when it is run on a Hygon (海光) P800 machine, which prevents the cache messager from starting. The root cause is that -march=native targets the instruction set of the build host's CPU, and the Hygon host CPU does not implement every instruction that the Intel host CPU does. As a result, -march=native cannot be used for a library that is built on one machine and deployed on the other.
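One way to confirm this failure mode without taking down the calling process is to attempt the import in a child interpreter and inspect its exit status. The following is a minimal sketch, not part of this PR; the helper names `import_in_subprocess` and `crashed_with_sigill` are hypothetical, and it assumes a POSIX host, where `subprocess` reports death by signal N as return code -N.

```python
import signal
import subprocess
import sys


def import_in_subprocess(module_name):
    """Import module_name in a child interpreter and return the child's exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
    )
    return proc.returncode


def crashed_with_sigill(returncode):
    """True if the exit code means the child was killed by SIGILL (illegal instruction)."""
    # Assumption: POSIX semantics, where a negative return code is the negated signal number.
    return returncode == -signal.SIGILL
```

On an affected machine, `crashed_with_sigill(import_in_subprocess("rdma_comm"))` would be expected to return `True`; a return code of 0 means the import succeeded.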

Solution: Replace -march=native with a more portable option such as -march=x86-64-v3 (AVX2 baseline), depending on the Hygon CPU feature set required.
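Whether a given target host can run an x86-64-v3 binary can be checked from its CPU flags. The sketch below is an illustration under stated assumptions, not code from this PR: it covers only the features that x86-64-v3 adds on top of x86-64-v2 (AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, OSXSAVE), it assumes the flag spellings commonly seen in Linux `/proc/cpuinfo` (LZCNT reported as `abm`, and `xsave` used as a stand-in for OSXSAVE), and the function name `missing_v3_flags` is hypothetical.

```python
# Features x86-64-v3 adds over x86-64-v2, spelled as they commonly appear in
# Linux /proc/cpuinfo. Assumption: "abm" stands for LZCNT and "xsave" is used
# as a proxy for OSXSAVE.
X86_64_V3_FLAGS = {
    "avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave",
}


def missing_v3_flags(cpuinfo_text):
    """Return the sorted x86-64-v3 flags absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return sorted(X86_64_V3_FLAGS - set(value.split()))
    # No 'flags' line at all (for example a non-x86 host): report everything.
    return sorted(X86_64_V3_FLAGS)
```

Running it on the deployment host, for example with `missing_v3_flags(open("/proc/cpuinfo").read())`, returns an empty list when every listed flag is present; any name it returns suggests a lower baseline is needed for that host.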

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 27, 2026

Thanks for your contribution!

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-27 10:25:05

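📋 Review summary

PR overview: Fixes the import failure that occurs when the RDMA library compiled on an Intel P800 runs on a Hygon P800, caused by illegal instructions generated by -march=native.
Scope of change: cache_manager/transfer_factory/kvcache_transfer/CMakeLists.txt, setup.py
Affected tags: [XPU] [KVCache] [PD Disaggregation]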

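Issues

| Level | File | Summary |
| --- | --- | --- |
| ❓ Question | CMakeLists.txt:18 | The condition covers only DEVICE_TYPE=xpu. Do other device types also need cross-platform deployment? |
| ❓ Question | CMakeLists.txt:19 | -march=x86-64-v3 requires GCC 11+. Please confirm the minimum version in the build environment. |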

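Status of previous findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | In the PR description, Usage or Command and Accuracy Tests are empty, and no Checklist item is checked | ⚠️ Still present |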

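📝 PR convention check

The title format is compliant (it contains the official [XPU] tag). In the description, Motivation and Modifications are filled in, but the Usage or Command and Accuracy Tests sections are empty, and no Checklist item is checked.

Suggested title (can be copied directly):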

  • [XPU] Fix illegal instruction error when running Intel P800-compiled RDMA libs on Hygon P800
Suggested PR description (can be copied directly):
## Motivation
Cross-platform binary incompatibility — a shared library compiled with -march=native on Intel generates instructions that are illegal on Hygon (AMD-based) CPUs. The RDMA library compiled on an Intel P800 machine, when run on a Hygon (海光) P800 machine, causes an "illegal instruction" error upon `import rdma_comm`, preventing the cache messenger from starting.

## Modifications
- `fastdeploy/cache_manager/transfer_factory/kvcache_transfer/CMakeLists.txt`: when `DEVICE_TYPE=xpu` and the architecture is `x86_64/AMD64`, replace `-march=native` with `-march=x86-64-v3` (AVX2 baseline); all other cases keep `-march=native` unchanged.
- `setup.py`: add `-DDEVICE_TYPE={get_device_type()}` to the cmake arguments, passing the device type to CMake to enable the conditional compilation logic.

## Usage or Command
N/A

## Accuracy Tests
N/A (compile option change; does not affect model output)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests. (Compile option change; no unit tests needed)
- [x] Provide accuracy results. (N/A; does not affect model output)
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

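Overall assessment

The fix is sound: conditional compilation replaces -march=native with the portable -march=x86-64-v3 for the XPU + x86_64 case, which resolves the binary compatibility problem across CPU vendors. Two questions need the author's confirmation before this can be merged.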
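The two pieces of this change, passing the device type from setup.py and branching on it in CMake, can be sketched in Python as follows. This is an illustrative model only, assuming nothing beyond what the PR description states: `cmake_args_for` and `march_flag_for` are hypothetical names, and the real setup.py obtains the device type from its own `get_device_type()` rather than from a parameter.

```python
import re


def cmake_args_for(device_type):
    """Build the extra CMake argument that carries the device type."""
    return [f"-DDEVICE_TYPE={device_type}"]


def march_flag_for(device_type, system_processor):
    """Model the CMake condition: portable baseline for XPU on x86-64, else native."""
    # CMake's MATCHES performs an unanchored regex match, so re.search is the
    # closest Python analogue.
    is_x86_64 = re.search(r"x86_64|AMD64", system_processor) is not None
    if device_type == "xpu" and is_x86_64:
        return "-march=x86-64-v3"
    return "-march=native"
```

For example, `march_flag_for("xpu", "x86_64")` selects the portable baseline, while any other device type, or an XPU build on a non-x86-64 processor, keeps `-march=native`.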

set(CMAKE_CXX_COMPILER g++)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=native -std=c++11")
if(DEVICE_TYPE STREQUAL "xpu" AND CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=x86-64-v3 -std=c++11")

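❓ Question: The condition covers only DEVICE_TYPE=xpu. Other device types (such as dcu) that are compiled on an Intel machine and run on a Hygon machine would trigger the same illegal instruction error.

Is fixing only the XPU case intentional? If so, consider explaining the reason in a comment. If other device types also need cross-platform deployment, the condition could be extended to: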

if(CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=x86-64-v3 -std=c++11")
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=native -std=c++11")
endif()

Collaborator Author


At present only XPU has a hardware environment this complex; other hardware can add the same handling if needed.

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=native -std=c++11")
if(DEVICE_TYPE STREQUAL "xpu" AND CMAKE_SYSTEM_PROCESSOR MATCHES "x86_64|AMD64")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2 -Ofast -ffast-math -funroll-loops -march=x86-64-v3 -std=c++11")
else()

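❓ Question: The -march=x86-64-v3 level name is only supported from GCC 11 (released in 2021). Please confirm that the minimum GCC version in the build environment meets this requirement; otherwise compilation fails with error: bad value (x86-64-v3) for -march= switch.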

Collaborator Author


P800 builds are based on the ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleqa:xpu-ubuntu2204-x86_64-gcc123-py310 image, whose default gcc version is 12.3.
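The version requirement discussed in this thread could also be checked programmatically before configuring the build. The sketch below is a hedged illustration rather than anything in this PR: `gcc_supports_x86_64_v3` is a hypothetical helper that only inspects the major version in a GCC version string, relying on the GCC 11 threshold stated by the reviewer.

```python
import re


def gcc_supports_x86_64_v3(version_text):
    """True if the GCC version string has a major version of at least 11."""
    # Accept both full versions ("12.3.0") and major-only output ("12").
    match = re.search(r"\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(0)) >= 11
```

The input could come from `subprocess.run(["gcc", "-dumpfullversion"], capture_output=True, text=True).stdout`; for the gcc 12.3 image mentioned above, a string such as "12.3.0" passes the check.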


PaddlePaddle-bot commented May 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-27 17:06:00

This CI report is generated from the following code (updated every 30 minutes):


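1 Task overview

The main test task, Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage, has passed. Required tasks currently have no failures and none running or pending. There are 5 Optional failures, for reference only; of these, Check PR Template can be resolved by the author completing the checklist items in the PR template.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 39(0) | 39 | 31 | 5 | 0 | 0 | 3 |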

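2 Task status summary

Note on the Log column: failed tasks use the log link pre-generated by the tool; running tasks use the Job link.

2.1 Required tasks: 8/10 passed

Required tasks block merging, and failures should be handled first. Currently required_failed=0, required_running=0, required_pending=0. Another 2 required contexts are not counted as passed (possibly skipped/other); if the Checks page still blocks merging, please confirm manually.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ | 8 other required tasks passed | - | - | - | - | - |
| ⏭️ | 2 required tasks not counted as passed | - | Not a failure state (possibly skipped/other) | If blocked by branch protection, check the Checks page | CI details | - |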

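2.2 Optional tasks: 23/29 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | xpu_build_test / xpu-build-test | 23s | Job | - |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 16m3s | Job | - |
| ❌ | Check PR Template | 21s | Job | - |
| ❌ | CI_HPU | 1h55m | Job | - |
| ❌ | Trigger Jenkins for PR | 16s | Job | - |
| ⏭️ | 1 other optional task skipped/not run | - | - | - |
| ✅ | 23 other optional tasks passed | - | - | - |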

3 Failure details (required only)

No required tasks failed, so ci_failure_analyzer does not need to be called in this round.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@42dbd1f). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7935   +/-   ##
==========================================
  Coverage           ?   73.19%           
==========================================
  Files              ?      404           
  Lines              ?    56841           
  Branches           ?     8890           
==========================================
  Hits               ?    41604           
  Misses             ?    12415           
  Partials           ?     2822           
Flag Coverage Δ
GPU 73.19% <ø> (?)
