ci: add ci_test workflows and fix NVIDIA probing #590

Draft
zhangyue207 wants to merge 14 commits into master from codex/fix-nvidia-platform-probing

Collaborator

@zhangyue207 zhangyue207 commented May 6, 2026

Summary

  • Move the ci_test CI work into master.
  • Replace the in-tree .ci implementation with the external InfiniTensor/ci submodule.
  • Add .github/workflows/ci_test.yml and .github/ci_config.yaml as the parent GitHub Actions entrypoint/config.
  • Run the NVIDIA job with --devices nvidia.
  • Make tests/conftest.py respect explicit --devices platform selections before falling back to the historical broad torch-device mapping.
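
The selection rule in the last bullet can be sketched as follows. This is a minimal illustration, not the code in tests/conftest.py: the names `_BROAD_TORCH_DEVICE_PLATFORMS` and `candidate_platforms` are hypothetical, and only the `cuda` mapping is taken from this PR description.

```python
# Hypothetical names; the real tests/conftest.py is not reproduced here.
_BROAD_TORCH_DEVICE_PLATFORMS = {
    # Per the PR description, device=cuda historically covered all
    # CUDA-like InfiniOps platforms. Other torch devices are omitted.
    "cuda": ("nvidia", "metax", "iluvatar"),
}


def candidate_platforms(torch_device, explicit_devices=None):
    """Return the platforms that may be queried for a given torch device."""
    broad = _BROAD_TORCH_DEVICE_PLATFORMS.get(torch_device, ())
    if not explicit_devices:
        # No explicit --devices: keep the historical broad fallback.
        return list(broad)
    selected = set(explicit_devices)
    # Explicit --devices: narrow the broad mapping to the selected
    # platforms, preserving the mapping order.
    return [platform for platform in broad if platform in selected]
```

Under these assumptions, `candidate_platforms("cuda", ["nvidia"])` yields only `nvidia`, while `candidate_platforms("cuda")` still yields all three CUDA-like platforms.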

Motivation

This PR brings the ci_test workflow changes into master and fixes the NVIDIA validation failure found while testing that workflow locally.

The NVIDIA job crashed because device=cuda was treated as all CUDA-like InfiniOps platforms: nvidia, metax, and iluvatar. On an NVIDIA runner, skip_op_without_platform_impl could still query MetaX / Iluvatar implementation indices, which may abort in native code instead of producing a safe pytest skip.
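
The failure mode can be illustrated with a small stand-in. This is a sketch under stated assumptions: `has_platform_impl`, `fake_query`, and the `probed` list are hypothetical, and a Python exception stands in for the native abort, which in the real case would terminate the process rather than raise.

```python
def has_platform_impl(op_name, platforms, query_impl_indices):
    """Return True if any allowed platform reports an implementation.

    query_impl_indices is a hypothetical callable taking
    (op_name, platform) and returning a list of implementation indices.
    """
    for platform in platforms:
        if query_impl_indices(op_name, platform):
            return True
    return False


probed = []  # records every platform the stand-in query was asked about


def fake_query(op_name, platform):
    """Stand-in for a native query on an NVIDIA-only runner (assumption)."""
    probed.append(platform)
    if platform != "nvidia":
        # Stand-in for the native abort described above.
        raise RuntimeError(f"native abort while probing {platform}")
    # Pretend only "add" has an NVIDIA implementation.
    return [0] if op_name == "add" else []
```

With the platform list narrowed to `["nvidia"]`, a missing operator returns `False` and the caller can issue a normal pytest skip. With the broad list, the same lookup moves on to `metax` and hits the simulated abort.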

Closes: N/A.

Type of Change

  • fix — bug fix
  • build / ci — build system or CI configuration

Platforms Affected

  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • Build system / CMake / CI

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | Yes | 3758 passed, 1363 skipped | Local self-hosted NVIDIA flow via `python3 .ci/run.py --config .github/ci_config.yaml --job nvidia_gpu --gpu-id 0 --local`. |
| Iluvatar | No | Not run | No Iluvatar hardware available on this machine. |
| MetaX | No | Not run | No MetaX hardware available on this machine. |
| Cambricon | No | Not run | No Cambricon hardware available on this machine. |
| Moore | No | Not run | No Moore hardware available on this machine. |
| Ascend | No | Not run | No Ascend hardware available on this machine. |

Benchmark / Performance Impact

N/A — this PR only changes CI/test platform selection behavior.

Notes for Reviewers

This PR is now targeted at master, so it includes the original ci_test branch changes plus the follow-up NVIDIA probing fix.

Review focus:

  • The .ci submodule migration and .github workflow/config boundary.
  • With explicit --devices nvidia, device=cuda only queries nvidia implementation indices.
  • Without explicit --devices, the historical broad fallback remains unchanged.

Checklist

  • PR title and commit messages follow Conventional Commits.
  • Branch name follows <type>/xxx-yyyy-zzzz. Existing branch is codex/fix-nvidia-platform-probing.
  • Commits are meaningful and no fixup! / squash! / wip commits remain.
  • Changes are scoped to moving ci_test CI into master and fixing the NVIDIA CI failure.
  • ruff check tests/conftest.py passes.
  • ruff format --check tests/conftest.py passes.
  • python3 -m py_compile tests/conftest.py passes.
  • Matrix generation was checked with python3 .ci/config_to_matrix.py --config .github/ci_config.yaml --dump-by-type.
  • pytest was run locally on every supported platform this PR can affect: NVIDIA only; other hardware was not available.
  • Standalone regression test: none added; validation is through the NVIDIA CI job path that previously crashed.
  • No secrets, third-party code, or unsafe native-code changes were added.

@zhangyue207 zhangyue207 force-pushed the codex/fix-nvidia-platform-probing branch from f6909f0 to eacc238 Compare May 6, 2026 09:32
@zhangyue207 zhangyue207 force-pushed the codex/fix-nvidia-platform-probing branch from eacc238 to 656626d Compare May 6, 2026 12:02
@zhangyue207 zhangyue207 changed the base branch from ci_test to master May 6, 2026 15:57
@zhangyue207 zhangyue207 requested a review from a team May 6, 2026 15:57
@zhangyue207 zhangyue207 changed the title fix: avoid cross-platform CUDA probing in tests ci: add ci_test workflows and fix NVIDIA probing May 6, 2026
@zhangyue207 zhangyue207 marked this pull request as draft May 7, 2026 06:58