[CI] Cross-platform — Part 2: ARM/Spark workflow#5698
Conversation
There was a problem hiding this comment.
Update (df786d2): Reviewed incremental diff since 5cbba72 (1 file changed, 1 commit).
New commits reviewed
| Commit | PR | Description |
|---|---|---|
| feba370 | #5826 | Defer pxr loading in isaaclab.sim, .assets, .scene, .sim/utils |
| 207f319 | #5844 | Log Kit version info on AppLauncher startup |
| c54e512 | — | Scope test tags to arm_ci only; misc review fixes |
| ea8ecd6 | #5891 | Split tasks into core/contrib |
| acb09f2 | — | Merge remote-tracking branch origin/develop |
| 5cbba72 | — | Move cartpole smoke test into test/core/ after tasks split |
| df786d2 | — | Raise cartpole perception smoke-test timeout for cold shader cache |
Latest change (df786d2)
Raises test_train_cartpole_perception timeout from default (presumably matching state test) to 1800 s to accommodate first-run shader compilation on cold GPU cache. Adds a comment explaining the reasoning. Minimal, sensible change — shader compilation on first camera-enabled run can easily exceed 600 s on some hardware.
Full analysis (unchanged from prior update)
1. Deferred pxr imports (#5826) ✅ — All 9 modules now defer from pxr import ... into function bodies. Type hints stay under TYPE_CHECKING. Fixes Kit startup crash when env-cfg parse pulls pxr before AppLauncher.
2. Kit version logging (#5844) ✅ — Simple _log_kit_version_info() added to AppLauncher. Prints Kit version, kernel version, and git hash to sys.__stderr__.
3. Tasks core/contrib restructuring (#5891) ✅ — Massive but mechanical: direct.* / manager_based.* → flat core.* / contrib.*. Gym IDs unchanged. Docs updated. Changelog fragment present.
4. CI/test adjustments ✅ — Test markers narrowed, cartpole smoke test moved, perception timeout raised for cold cache.
Previous concern status
| Issue | Status |
|---|---|
P1 (conftest.py OSError handling) |
✅ Resolved |
| P2 (TAB delimiter) | ✅ Resolved |
No new issues found. LGTM from bot. 🤖
Foundation for cross-platform CI. Registers four pytest markers (windows, windows_ci, arm, arm_ci), teaches AppLauncher to recognize them in argv so they do not leak into Isaac Sim's argparse, and moves the AssetConverterBase USD scratch directory from a hardcoded /tmp/IsaacLab to tempfile.gettempdir() for cross-platform compatibility. Tags source/isaaclab/test/deps/test_torch.py and test_scipy.py with the new markers so they are selectable by future cross-platform jobs. Workflow files (arm-ci.yaml, windows-ci.yaml) ship in follow-up PRs.
00966f4 to
ca8fa07
Compare
4b8ecb8 to
3e904c3
Compare
Mirrors build.yaml's spirit but stays minimal for the aarch64 path: Tier 1 (gates none — continue-on-error): general-arm, install-arm, kit-launch-arm Tier 2 (meaningful, marker-filtered): kitless-arm, determinism-arm Every job sets continue-on-error: true while the aarch64 runner setup stabilizes. Every pytest invocation passes --timeout=N --timeout-method=signal so a single hung test cannot consume the whole job slot. Inline scripts use set -e to fail on the first nonzero return. Tags three test_rendering_*_kitless.py files plus test_differential_ik.py and test_operational_space.py with the arm_ci marker so the Tier 2 jobs can select them via pytest -m arm_ci.
3e904c3 to
98a5b68
Compare
b7862d0 to
48819bf
Compare
- source/isaaclab/setup.py: gate pytetwild==0.2.3 with `platform_machine != 'aarch64'`. No aarch64 wheel on PyPI; source build fails because the transitive `geogram` dep hardcodes `-m64` in its CMakeLists. The single call site at sim/schemas/schemas.py already lazy-imports it with a clear "install manually" message, so aarch64 users keep everything except automatic volume-deformable tetrahedralization. - docker/Dockerfile.arm-ci: new lightweight Dockerfile that layers cmake / build-essential / git + EULA env vars + a bash entrypoint onto the multi-arch Isaac Sim base image. Replaces the previous inline `docker run + apt-get install + docker commit` chain in arm-ci.yaml. - docker/Dockerfile.base: reverted the libgmp / libmpfr / libeigen3 / libcgal / libboost additions from the arm64-conditional apt block — those were only needed by pytetwild's fTetWild build, which we no longer install on aarch64. - .github/workflows/arm-ci.yaml: build via docker/Dockerfile.arm-ci instead of the inline apt-install-and-commit pattern. Test steps no longer need to re-specify --entrypoint or EULA env vars on every docker run.
48819bf to
b3d7f7c
Compare
Forces run_docker_tests=false in build.yaml's changes job so all heavy test jobs skip via their existing if-gate. Saves CI runner time + cost during ARM CI iteration on PR isaac-sim#5698. Must be reverted before final review.
| new_test_files = [] | ||
| for test_file in test_files: | ||
| with open(test_file) as f: | ||
| if marker_token in f.read(): | ||
| new_test_files.append(test_file) | ||
| test_files = new_test_files |
There was a problem hiding this comment.
Missing error handling in ci_marker file scan
The second-pass scan opens test_file without a try/except OSError, so a race-condition delete, permission error, or unreadable symlink on CI would raise an unhandled exception and abort the entire test orchestrator session. The first scan (pre-scan at line ~703) correctly wraps the same open() in except OSError: continue, but this second scan omits that guard. The inconsistency means a transient filesystem issue during the second pass would produce a hard failure rather than a graceful skip.
| new_test_files = [] | |
| for test_file in test_files: | |
| with open(test_file) as f: | |
| if marker_token in f.read(): | |
| new_test_files.append(test_file) | |
| test_files = new_test_files | |
| if ci_marker: | |
| # Match both `@pytest.mark.<marker>` (per-function) and | |
| # `pytestmark = pytest.mark.<marker>` / `pytestmark = [..., pytest.mark.<marker>, ...]` | |
| # (module-level) by looking for the common `pytest.mark.<marker>` substring. | |
| marker_token = f"pytest.mark.{ci_marker}" | |
| new_test_files = [] | |
| for test_file in test_files: | |
| try: | |
| with open(test_file) as f: | |
| if marker_token in f.read(): | |
| new_test_files.append(test_file) | |
| except OSError: | |
| continue | |
| test_files = new_test_files |
| echo "::error::detect-changes: pattern line missing tab separator: '$line'" | ||
| exit 1 | ||
| fi | ||
| patterns+=("$line") | ||
| done <<< "$PATTERNS_INPUT" | ||
|
|
||
| if [ "${#patterns[@]}" -eq 0 ]; then | ||
| echo "::error::detect-changes received no valid pattern lines" | ||
| exit 1 | ||
| fi | ||
|
|
||
| render_table() { | ||
| local files="$1" entry regex desc count sample | ||
| echo "| Pattern | What it covers | Matched files |" | ||
| echo "|---|---|---|" | ||
| for entry in "${patterns[@]}"; do |
There was a problem hiding this comment.
TAB-delimiter requirement is fragile across editors and git configs
The action requires literal TAB characters between the regex and description in each pattern line, and exits with code 1 if any line lacks a tab (lines 72–73). A git attribute like text=auto, an editor that normalizes whitespace, or copy-paste from a browser would silently convert tabs to spaces. When that happens the changes job fails before writing run to GITHUB_OUTPUT, causing all downstream arm-ci jobs to error rather than run — no test results are produced and branch-protection checks would report a failing status. A spaces-delimited format or a structured YAML list would be more portable.
| # Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md). | ||
| # All rights reserved. | ||
| # | ||
| # SPDX-License-Identifier: BSD-3-Clause | ||
|
|
||
| # ARM/Spark CI — exercises Isaac Lab on aarch64 Linux self-hosted runners | ||
| # (NVIDIA DGX Spark). Single job that builds Dockerfile.base for linux/arm64 | ||
| # and runs the `arm_ci`-marked pytest set against it. ECR is not wired for | ||
| # arm64, so build output stays local to the runner; the action's local | ||
| # exact-tag and deps-tag checks short-circuit rebuilds when the runner is |
There was a problem hiding this comment.
TEMP
build.yaml run_docker_tests: 'false' override
The PR description checklist includes an unchecked item: "Revert the TEMP build.yaml run_docker_tests: 'false' override before draft → ready." build.yaml is not part of this PR's diff, which makes it unclear whether the override is already in the base branch (from Part 1 #5695) or still present on this branch. If the override is still active, merging would suppress Docker-based tests in the main Linux CI pipeline.
The pre-scan at the top of pytest_sessionstart wraps its `open()` in `except OSError: continue` to tolerate races (a marker-tagged file deleted between os.walk and open(), an unreadable symlink, etc.). The post-scan filter further down was missing the same guard, so a single transient I/O error during the marker-token check would raise an unhandled exception and abort the orchestrator before any test ran. Mirror the pre-scan's try/except so both passes degrade gracefully. Flagged P1 by Greptile auto-review on commit 4bd088f.
Pre-scan: log loud `::warning::` then continue. The pre-scan is best-effort (missed files may still be picked up via _collect_test_files's normal walk path), so silent continue was acceptable but unhelpful — adding a visible warning lets the issue surface without aborting. Post-scan: raise RuntimeError. This is the final filter pass — a silent drop here would let a marker-tagged test exit the run while the orchestrator still reports success. Loud abort on OSError is the right signal; the failure case (permission flap, race-deleted file, broken symlink on a previously-walked path) is exotic enough that hitting it indicates a real CI environment issue worth investigating. Replaces the silent `continue` introduced in a9711ef from Greptile's P1 suggestion — that suggestion mirrored pre-scan's behavior mechanically but missed the semantic difference between the two passes.
Match the existing ISAACSIM_CI_SHORT post-scan's behavior — no OSError handling, let `open()` propagate. Same semantic for both markers: if a test file can't be read during scan, the orchestrator aborts with the raw OSError and CI goes red. No silent drops. Reverts the try/except additions from a9711ef (Greptile P1 suggestion) and d7fb1bd (the differentiated raise/warn version).
|
Why are we creating yet another separate CI for ARM instead of just letting some of the tests from run on ARM? I vote against that. |
build.yaml's `changes` job had its own ~95-line inline detector while arm-ci.yaml consumed the new detect-changes composite. Route both workflows through the composite so the gating logic stays in one place. Extract the deps-hash computation into _lib/compute-deps-hash.sh, sourced by both docker-build (local-store check) and ecr-build-push-pull (registry check). The two caches stay independent; only the hash schema is shared, which removes the silent-drift risk the docker-build docstring previously warned about. Extract the docker-config + nvcr.io login into _lib/setup-docker-config.sh, sourced by both composites. The script is idempotent (early-returns when DOCKER_CONFIG already points at a valid temp dir), so the `skip-docker-config` input on docker-build and the corresponding plumbing in ecr-build-push-pull's delegation are no longer needed. Bump arm-ci.yaml's actions/checkout from v4 to v6 to match the rest of the workflows.
Per @myurasov-nv's review feedback: same OS (Linux), different runner pool shouldn't justify a separate workflow file. Other multi-arch repos (NVIDIA Warp, NumPy, newton, PyTorch) put cross-arch jobs in one workflow with different `runs-on:` labels. The build/test work itself doesn't change. Add an `arm-ci` job to build.yaml that builds and tests on the same [self-hosted, arm64] runner (no ECR for arm64, so the image must stay local to the runner that builds it). Source-side fixes from the original series stay: AppLauncher EXP_PATH fallback, pytetwild aarch64 gate, OVRTX/OVPhysX skip-on-aarch64, conftest.py CI_MARKER orchestration. Gate the amd64 `build` and `build-curobo` jobs on absence of the `ci-skip-amd64` PR label so this branch can iterate on arm-ci alone without burning GPU minutes. All 18 test-* jobs skip transitively via `needs: build`. Remove the label to re-enable the full amd64 surface. Delete .github/workflows/arm-ci.yaml.
…park-ci-cartpole # Conflicts: # .github/workflows/build.yaml # source/isaaclab/setup.py # tools/conftest.py
Iteration helper. The label-gate added in the prior commit (then named ci-skip-amd64) only skipped build.yaml's amd64 build jobs. Expanding to also gate docs.yaml (build-latest-docs, build-multi-docs), install-ci.yml (both x86 and arm installation test jobs), and license-check.yaml so that PRs iterating on a single platform can run only their target workflow. Rename ci-skip-amd64 -> ci-arm-only to reflect the broader scope.
PR 5695 lays foundation only. The two tagged tests (test_scipy.py, test_torch.py) had pytestmark applied here but no in-PR consumer — the arm-ci and windows-ci workflows that select on these markers ship in isaac-sim#5698 and isaac-sim#5700 respectively. Each downstream PR should land the marker annotations on the tests it actually wants in its canary, alongside the workflow that runs them. This keeps 5695 self-contained as scaffolding and prevents unused annotations from accumulating here.
After folding arm-ci.yaml into build.yaml, the detect-changes composite had a single caller. The composite + sparse-checkout pattern was overhead to support cross-workflow sharing that no longer exists. Restore the inline bash detector that was on develop before the migration. Net: -50 lines, one fewer directory under .github/actions/.
arm64 runs on a different runner pool ([self-hosted, arm64]) than the
amd64 build ([self-hosted, gpu]), so the local docker tag could not
collide in practice. The extra env var was defensive only.
Follow the same suffix pattern that build-curobo already uses
(`${{ env.CI_IMAGE_TAG }}-curobo`) for consistency.
`local-hit`, `local-deps-hit`, and `was-built` were declared as action-level outputs anticipating downstream telemetry callers that never materialized. Grep across .github/ shows zero consumers. Internal step outputs of the same names continue to gate the conditional steps inside the action (the local exact-tag short-circuit, the local deps-tag short-circuit, the post-build deps-tag application), so the deletion is purely surface-area: external declarations go, in-action flow unchanged.
Cleanups in response to self-review: 1. compute-deps-hash.sh refactored from source-with-env-vars to exec-with-positional-args. Caller captures hash via $(...) or pipes into $GITHUB_OUTPUT. More idiomatic, testable from any shell, no env-var contract. 2. docker-build gains a `deps-hash` input. ecr-build-push-pull computes the hash once in its registry-side check step, writes it to $GITHUB_OUTPUT, and passes it forward to docker-build's delegation. docker-build's local check uses the input if non-empty, else computes itself. Hash now computed exactly once per build. 3. `extra-tags` input removed from docker-build. The single consumer (ecr-build-push-pull) now does its own `docker tag <commit> <ECR>` step after the build returns. Less surface area on the primitive. 4. `evict-stale-cache` input added on docker-build (default false). The 14-day eviction is opt-in rather than implicit. No current caller opts in; runners that need disk cleanup do so explicitly. 5. Three unused action-level outputs (local-hit, local-deps-hit, was-built) on docker-build removed — no consumers grep'd. 6. docker-build description reverted to develop's one-liner. 7. ci_marker pre-scan in tools/conftest.py removed. The pre-scan previously overrode TESTS_TO_SKIP for marker-tagged files, which only papered over one mismatch (test_differential_ik.py is in TESTS_TO_SKIP as "Failing" yet was arm_ci-tagged). Respect the global skip list: a skip-listed test stays skipped on all platforms. arm-ci canary goes 8 -> 7 with the one known-failing file correctly dropped. 8. ci-arm-only label gates removed from build.yaml, docs.yaml, install-ci.yml, license-check.yaml. The iteration helper was one-shot; arm-ci is now green, the gate's job is done. Drop the gates and the corresponding repo label. 9. triggered_jobs string in build.yaml's `changes` job reverted to develop's text — the "non-root verify is folded" detail snuck in from an earlier commit and isn't strictly relevant.
The five pre-existing inputs (image-tag, isaacsim-base-image, isaacsim-version, dockerfile-path, context-path) had their descriptions rewritten during the split out of ecr-build-push-pull, with no behavior change. Restore develop's original wording; keep descriptions only on the genuinely-new inputs (platform, cache-from, cache-to, deps-hash, evict-stale-cache).
F2: test_differential_ik.py carried `pytestmark = pytest.mark.arm_ci`,
but the file is in tools/test_settings.py TESTS_TO_SKIP ("Failing"), so
_collect_test_files drops it before the ci_marker filter ever sees it —
the marker was inert. A skip-listed test should not claim to run on any
platform; remove the marker. (The other arm_ci-tagged tests are not in
TESTS_TO_SKIP and are unaffected.)
F3: the changelog fragment credited a pytetwild aarch64 exclusion, but
that gate now lives on develop (via the packaging cleanup that removed
setup.py) and is already documented by a separate fragment. This PR's
diff no longer touches pytetwild gating, so the bullet described a
change not present here. Drop it; keep the in-scope AppLauncher
EXP_PATH fallback entry.
The two _lib helpers were a sourced bash script (setup-docker-config) and an exec'd bash script (compute-deps-hash) — two different calling conventions for what are both "shared step-logic." Unify them on the GHA-native pattern: each becomes a composite action with an inline `run:` block and a declared interface. - _lib/compute-deps-hash/action.yml: inputs (dockerfile-path, isaacsim-base-image, isaacsim-version) -> output `hash`. Both docker-build and ecr-build-push-pull `uses:` it, so the deps-hash schema lives in exactly one place (no drift between the local-store check and the registry check). - _lib/setup-docker-config/action.yml: no inputs; writes DOCKER_CONFIG to $GITHUB_ENV and logs into nvcr.io. Idempotent (a second invocation in the same job short-circuits), so ecr-build-push-pull delegating to docker-build doesn't redo it. Call sites change from `run: . _lib/x.sh` / `$(_lib/x.sh ...)` to `uses: ./.github/actions/_lib/x`. Dedup is unchanged: ecr computes the hash once via the composite and forwards it to docker-build through the deps-hash input; docker-build computes its own only when the input is empty (the arm-ci direct-call path). Delete the two .sh files — the composite action is now the single shared unit, so no separate script is needed.
The `evict-stale-cache` input on docker-build was inert — no caller set it, so the 14-day deps-tag eviction never ran. The Spark runner is long-lived self-hosted with no ECR, so its local deps-cache tags accumulate indefinitely; it's the legitimate consumer. Set evict-stale-cache: "true" on the arm-ci build so stale tags are pruned each run.
Review fixes (F4-F6): - test_cartpole_training_smoke.py: repoint the perception case from the retired/deprecated Isaac-Cartpole-RGB-Camera-Direct-v0 to the canonical Isaac-Cartpole-Camera-Direct-v0 (defaults to rgb, has an rl_games entry point, not on the removal path); fix the docstring claim about camera envs registering "only" rl_games (they register rl_games + skrl, just no rsl_rl). - isaaclab changelog: add a bullet for the AssetConverterBase default USD dir moving from hardcoded /tmp to tempfile.gettempdir() (a user-facing cross-platform change that was undocumented). - isaaclab_tasks changelog: the package's delta here is test-only (arm_ci markers + aarch64 skip guards in rendering test helpers), so convert the fragment to .skip per AGENTS.md (no user-facing entry). Arm-only test tags: - test_scipy.py / test_torch.py carried [windows_ci, arm_ci]; this is the ARM workflow PR, so scope their tags to arm_ci only. windows_ci test tags belong to the Windows workflow PR.
The develop merge brought in isaac-sim#5891 (split tasks tests into core/contrib), which relocated every test file under test/core/ or test/contrib/ and left only shared helpers + conftest at the test-suite root. The new test_cartpole_training_smoke.py predated that split and was the lone test_*.py still at top-level. Cartpole is a core task, so move it under test/core/ to match the new layout, and bump its _REPO_ROOT path depth from parents[3] to parents[4] (matching the sibling core/ tests).
test_train_cartpole_perception timed out at the default 600s on a cold-cache Spark runner: the first camera-enabled run compiles shaders (~600s) before training even starts, which alone consumes the budget. It passed earlier only because that runner had a warm shader cache. Give the camera case a 1800s timeout (the state case keeps 600s).
Should be fixed. |
Summary
arm-cijob inside the existingDocker + Testsworkflow (build.yaml) — sharing itschanges/configgating rather than a standalone workflow. Marker-driven:pytest -m arm_ciselects the aarch64-safe subset. ~28 min cold, ~12 min on a warm runner.pytest.mark.arm_ci; growing the arm canary is a one-line marker, no workflow edit.docker-buildandecr-build-push-pull, so each lives in exactly one place (no drift between the local-store and registry cache checks).1. ARM/Spark CI job
1.1 New
arm-cijob on[self-hosted, arm64]: buildsDockerfile.baseforlinux/arm64via thedocker-buildaction (local cache only — ECR is not wired for arm64, so build and test share one runner), then runs thearm_ci-marked tests viarun-testswithci-marker: arm_ci.1.2 Folded into
build.yamlso the gating (changes,config) and thedocker-build/run-testsactions are shared with the amd64 jobs rather than duplicated in a separate workflow file.1.3 Opts into
docker-build's 14-day stale-cache eviction (evict-stale-cache: true) — the long-lived Spark runner has no ECR, so its localdeps-*tags would otherwise accumulate indefinitely.2. Shared CI utilities
2.1 New
_lib/compute-deps-hashcomposite action — the deps-cache hash (install files + base-image digest). Bothdocker-build(local-store check) andecr-build-push-pull(registry check) call it, so a local hit and a registry hit always resolve the samedeps-<hash>tag. The hash is computed once per build and forwarded to the delegated build via adeps-hashinput.2.2 New
_lib/setup-docker-configcomposite action — credsStore-disabled docker config +nvcr.iologin, idempotent so a caller delegating todocker-builddoes not redo it.2.3
docker-buildis the local-cache build primitive (exact-tag + deps-tag short-circuits);ecr-build-push-pullkeeps the ECR-cache orchestration and delegates the actual build to it.2.4
run-testsgains aci-markerinput, forwarded into the container asCI_MARKERand read bytools/conftest.pyto select test files by marker — the same channel as the existingTEST_FILTER_PATTERN/TEST_INCLUDE_FILESknobs.3. Cross-platform source fixes
3.1
AppLauncherderivesEXP_PATHfrom the installedisaacsimpackage when the env var is unset (the bootstrap early-returns under some aarch64 pip layouts), instead ofKeyError. No-op whenEXP_PATHis already set.3.2
AssetConverterBaseresolves its default USD output undertempfile.gettempdir()instead of a hardcoded/tmp, honoring$TMPDIRand working on Windows.3.3 The kitless rendering test helper skips (rather than fails) the OVRTX/OVPhysX cases on aarch64, where those wheels are not published.
3.4
tools/conftest.pygainsCI_MARKERorchestration: a file-level marker filter so only marker-tagged files spawn per-file subprocesses, parallel to the existingISAACSIM_CI_SHORTpath and not aliased to it.4. Series
Part 2 of
[CI] Cross-platform —. Depends on #5695 (Part 1: marker registration + AppLauncher argv-strip scaffolding). The Windows workflow is #5700 (Part 3).Test plan
arm-cigreen end-to-end on Spark (build + marker-gated tests), cold and warm cache.build.yamljobs unaffected by the composite extraction and thedocker-build/ecr-build-push-pulldelegation.develop(core/contrib test split, Kit-version logging, deferred pxr loading); allarm_ci-marked tests still collected and green after the test-tree move.daily-compatibilityon this branch before merge (gh workflow run "Backwards Compatibility Tests" --ref jichuanh/windows-spark-ci-cartpole).