Skip to content

[CI] Cross-platform — Part 2: ARM/Spark workflow#5698

Open
hujc7 wants to merge 59 commits into
isaac-sim:developfrom
hujc7:jichuanh/windows-spark-ci-cartpole
Open

[CI] Cross-platform — Part 2: ARM/Spark workflow#5698
hujc7 wants to merge 59 commits into
isaac-sim:developfrom
hujc7:jichuanh/windows-spark-ci-cartpole

Conversation

@hujc7
Copy link
Copy Markdown
Collaborator

@hujc7 hujc7 commented May 20, 2026

Summary

  • Adds aarch64 CI on NVIDIA DGX Spark self-hosted runners as an arm-ci job inside the existing Docker + Tests workflow (build.yaml) — sharing its changes/config gating rather than a standalone workflow. Marker-driven: pytest -m arm_ci selects the aarch64-safe subset. ~28 min cold, ~12 min on a warm runner.
  • 7 aarch64-safe test files opt in via pytest.mark.arm_ci; growing the arm canary is a one-line marker, no workflow edit.
  • Extracts the deps-cache hash and docker-config/login into 2 shared composite actions consumed by both docker-build and ecr-build-push-pull, so each lives in exactly one place (no drift between the local-store and registry cache checks).
  • 4 cross-platform source fixes that aarch64 enablement surfaced, all no-ops on amd64.

1. ARM/Spark CI job

1.1 New arm-ci job on [self-hosted, arm64]: builds Dockerfile.base for linux/arm64 via the docker-build action (local cache only — ECR is not wired for arm64, so build and test share one runner), then runs the arm_ci-marked tests via run-tests with ci-marker: arm_ci.

1.2 Folded into build.yaml so the gating (changes, config) and the docker-build/run-tests actions are shared with the amd64 jobs rather than duplicated in a separate workflow file.

1.3 Opts into docker-build's 14-day stale-cache eviction (evict-stale-cache: true) — the long-lived Spark runner has no ECR, so its local deps-* tags would otherwise accumulate indefinitely.

2. Shared CI utilities

2.1 New _lib/compute-deps-hash composite action — the deps-cache hash (install files + base-image digest). Both docker-build (local-store check) and ecr-build-push-pull (registry check) call it, so a local hit and a registry hit always resolve the same deps-<hash> tag. The hash is computed once per build and forwarded to the delegated build via a deps-hash input.

2.2 New _lib/setup-docker-config composite action — credsStore-disabled docker config + nvcr.io login, idempotent so a caller delegating to docker-build does not redo it.

2.3 docker-build is the local-cache build primitive (exact-tag + deps-tag short-circuits); ecr-build-push-pull keeps the ECR-cache orchestration and delegates the actual build to it.

2.4 run-tests gains a ci-marker input, forwarded into the container as CI_MARKER and read by tools/conftest.py to select test files by marker — the same channel as the existing TEST_FILTER_PATTERN/TEST_INCLUDE_FILES knobs.

3. Cross-platform source fixes

3.1 AppLauncher derives EXP_PATH from the installed isaacsim package when the env var is unset (the bootstrap early-returns under some aarch64 pip layouts), instead of KeyError. No-op when EXP_PATH is already set.

3.2 AssetConverterBase resolves its default USD output under tempfile.gettempdir() instead of a hardcoded /tmp, honoring $TMPDIR and working on Windows.

3.3 The kitless rendering test helper skips (rather than fails) the OVRTX/OVPhysX cases on aarch64, where those wheels are not published.

3.4 tools/conftest.py gains CI_MARKER orchestration: a file-level marker filter so only marker-tagged files spawn per-file subprocesses, parallel to the existing ISAACSIM_CI_SHORT path and not aliased to it.

4. Series

Part 2 of [CI] Cross-platform —. Depends on #5695 (Part 1: marker registration + AppLauncher argv-strip scaffolding). The Windows workflow is #5700 (Part 3).

Test plan

  • arm-ci green end-to-end on Spark (build + marker-gated tests), cold and warm cache.
  • amd64 build.yaml jobs unaffected by the composite extraction and the docker-build/ecr-build-push-pull delegation.
  • Merged latest develop (core/contrib test split, Kit-version logging, deferred pxr loading); all arm_ci-marked tests still collected and green after the test-tree move.
  • Trigger daily-compatibility on this branch before merge (gh workflow run "Backwards Compatibility Tests" --ref jichuanh/windows-spark-ci-cartpole).

@github-actions github-actions Bot added isaac-lab Related to Isaac Lab team infrastructure labels May 20, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Update (df786d2): Reviewed incremental diff since 5cbba72 (1 file changed, 1 commit).

New commits reviewed

Commit PR Description
feba370 #5826 Defer pxr loading in isaaclab.sim, .assets, .scene, .sim/utils
207f319 #5844 Log Kit version info on AppLauncher startup
c54e512 Scope test tags to arm_ci only; misc review fixes
ea8ecd6 #5891 Split tasks into core/contrib
acb09f2 Merge remote-tracking branch origin/develop
5cbba72 Move cartpole smoke test into test/core/ after tasks split
df786d2 Raise cartpole perception smoke-test timeout for cold shader cache

Latest change (df786d2)

Raises test_train_cartpole_perception timeout from default (presumably matching state test) to 1800 s to accommodate first-run shader compilation on cold GPU cache. Adds a comment explaining the reasoning. Minimal, sensible change — shader compilation on first camera-enabled run can easily exceed 600 s on some hardware.

Full analysis (unchanged from prior update)

1. Deferred pxr imports (#5826) ✅ — All 9 modules now defer from pxr import ... into function bodies. Type hints stay under TYPE_CHECKING. Fixes Kit startup crash when env-cfg parse pulls pxr before AppLauncher.

2. Kit version logging (#5844) ✅ — Simple _log_kit_version_info() added to AppLauncher. Prints Kit version, kernel version, and git hash to sys.__stderr__.

3. Tasks core/contrib restructuring (#5891) ✅ — Massive but mechanical: direct.* / manager_based.* → flat core.* / contrib.*. Gym IDs unchanged. Docs updated. Changelog fragment present.

4. CI/test adjustments ✅ — Test markers narrowed, cartpole smoke test moved, perception timeout raised for cold cache.

Previous concern status

Issue Status
P1 (conftest.py OSError handling) ✅ Resolved
P2 (TAB delimiter) ✅ Resolved

No new issues found. LGTM from bot. 🤖

Foundation for cross-platform CI. Registers four pytest markers
(windows, windows_ci, arm, arm_ci), teaches AppLauncher to recognize
them in argv so they do not leak into Isaac Sim's argparse, and moves
the AssetConverterBase USD scratch directory from a hardcoded
/tmp/IsaacLab to tempfile.gettempdir() for cross-platform compatibility.

Tags source/isaaclab/test/deps/test_torch.py and test_scipy.py with
the new markers so they are selectable by future cross-platform jobs.

Workflow files (arm-ci.yaml, windows-ci.yaml) ship in follow-up PRs.
@hujc7 hujc7 force-pushed the jichuanh/windows-spark-ci-cartpole branch 2 times, most recently from 00966f4 to ca8fa07 Compare May 20, 2026 23:30
@hujc7 hujc7 changed the title [CI] Experimental IsaacLab install probe on Windows via uv [CI] Add ARM/Spark CI workflow May 20, 2026
@hujc7 hujc7 changed the title [CI] Add ARM/Spark CI workflow [CI] Cross-platform CI 2/3: ARM/Spark workflow May 20, 2026
@hujc7 hujc7 changed the title [CI] Cross-platform CI 2/3: ARM/Spark workflow [CI] Cross-platform: ARM/Spark workflow May 20, 2026
@hujc7 hujc7 changed the title [CI] Cross-platform: ARM/Spark workflow [CI] Cross-platform — Part 2: ARM/Spark workflow May 20, 2026
@hujc7 hujc7 force-pushed the jichuanh/windows-spark-ci-cartpole branch 4 times, most recently from 4b8ecb8 to 3e904c3 Compare May 21, 2026 04:43
Mirrors build.yaml's spirit but stays minimal for the aarch64 path:
  Tier 1 (gates none — continue-on-error): general-arm, install-arm, kit-launch-arm
  Tier 2 (meaningful, marker-filtered): kitless-arm, determinism-arm

Every job sets continue-on-error: true while the aarch64 runner setup
stabilizes. Every pytest invocation passes --timeout=N --timeout-method=signal
so a single hung test cannot consume the whole job slot. Inline scripts
use set -e to fail on the first nonzero return.

Tags three test_rendering_*_kitless.py files plus test_differential_ik.py
and test_operational_space.py with the arm_ci marker so the Tier 2 jobs
can select them via pytest -m arm_ci.
@hujc7 hujc7 force-pushed the jichuanh/windows-spark-ci-cartpole branch from 3e904c3 to 98a5b68 Compare May 21, 2026 05:32
@hujc7 hujc7 force-pushed the jichuanh/windows-spark-ci-cartpole branch 7 times, most recently from b7862d0 to 48819bf Compare May 21, 2026 07:33
- source/isaaclab/setup.py: gate pytetwild==0.2.3 with
  `platform_machine != 'aarch64'`. No aarch64 wheel on PyPI; source build
  fails because the transitive `geogram` dep hardcodes `-m64` in its
  CMakeLists. The single call site at sim/schemas/schemas.py already
  lazy-imports it with a clear "install manually" message, so aarch64
  users keep everything except automatic volume-deformable tetrahedralization.

- docker/Dockerfile.arm-ci: new lightweight Dockerfile that layers
  cmake / build-essential / git + EULA env vars + a bash entrypoint onto
  the multi-arch Isaac Sim base image. Replaces the previous inline
  `docker run + apt-get install + docker commit` chain in arm-ci.yaml.

- docker/Dockerfile.base: reverted the libgmp / libmpfr / libeigen3 /
  libcgal / libboost additions from the arm64-conditional apt block —
  those were only needed by pytetwild's fTetWild build, which we no
  longer install on aarch64.

- .github/workflows/arm-ci.yaml: build via docker/Dockerfile.arm-ci
  instead of the inline apt-install-and-commit pattern. Test steps no
  longer need to re-specify --entrypoint or EULA env vars on every
  docker run.
@hujc7 hujc7 force-pushed the jichuanh/windows-spark-ci-cartpole branch from 48819bf to b3d7f7c Compare May 21, 2026 07:39
hujc7 added 2 commits May 21, 2026 07:44
Forces run_docker_tests=false in build.yaml's changes job so all
heavy test jobs skip via their existing if-gate. Saves CI runner
time + cost during ARM CI iteration on PR isaac-sim#5698. Must be reverted
before final review.
Comment thread tools/conftest.py
Comment on lines +736 to +741
new_test_files = []
for test_file in test_files:
with open(test_file) as f:
if marker_token in f.read():
new_test_files.append(test_file)
test_files = new_test_files
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing error handling in ci_marker file scan

The second-pass scan opens test_file without a try/except OSError, so a race-condition delete, permission error, or unreadable symlink on CI would raise an unhandled exception and abort the entire test orchestrator session. The first scan (pre-scan at line ~703) correctly wraps the same open() in except OSError: continue, but this second scan omits that guard. The inconsistency means a transient filesystem issue during the second pass would produce a hard failure rather than a graceful skip.

Suggested change
new_test_files = []
for test_file in test_files:
with open(test_file) as f:
if marker_token in f.read():
new_test_files.append(test_file)
test_files = new_test_files
if ci_marker:
# Match both `@pytest.mark.<marker>` (per-function) and
# `pytestmark = pytest.mark.<marker>` / `pytestmark = [..., pytest.mark.<marker>, ...]`
# (module-level) by looking for the common `pytest.mark.<marker>` substring.
marker_token = f"pytest.mark.{ci_marker}"
new_test_files = []
for test_file in test_files:
try:
with open(test_file) as f:
if marker_token in f.read():
new_test_files.append(test_file)
except OSError:
continue
test_files = new_test_files

Comment on lines +66 to +81
echo "::error::detect-changes: pattern line missing tab separator: '$line'"
exit 1
fi
patterns+=("$line")
done <<< "$PATTERNS_INPUT"

if [ "${#patterns[@]}" -eq 0 ]; then
echo "::error::detect-changes received no valid pattern lines"
exit 1
fi

render_table() {
local files="$1" entry regex desc count sample
echo "| Pattern | What it covers | Matched files |"
echo "|---|---|---|"
for entry in "${patterns[@]}"; do
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TAB-delimiter requirement is fragile across editors and git configs

The action requires literal TAB characters between the regex and description in each pattern line, and exits with code 1 if any line lacks a tab (lines 72–73). A git attribute like text=auto, an editor that normalizes whitespace, or copy-paste from a browser would silently convert tabs to spaces. When that happens the changes job fails before writing run to GITHUB_OUTPUT, causing all downstream arm-ci jobs to error rather than run — no test results are produced and branch-protection checks would report a failing status. A spaces-delimited format or a structured YAML list would be more portable.

Comment thread .github/workflows/arm-ci.yaml Outdated
Comment on lines +1 to +10
# Copyright (c) 2022-2026, The Isaac Lab Project Developers (https://github.com/isaac-sim/IsaacLab/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: BSD-3-Clause

# ARM/Spark CI — exercises Isaac Lab on aarch64 Linux self-hosted runners
# (NVIDIA DGX Spark). Single job that builds Dockerfile.base for linux/arm64
# and runs the `arm_ci`-marked pytest set against it. ECR is not wired for
# arm64, so build output stays local to the runner; the action's local
# exact-tag and deps-tag checks short-circuit rebuilds when the runner is
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 TEMP build.yaml run_docker_tests: 'false' override

The PR description checklist includes an unchecked item: "Revert the TEMP build.yaml run_docker_tests: 'false' override before draft → ready." build.yaml is not part of this PR's diff, which makes it unclear whether the override is already in the base branch (from Part 1 #5695) or still present on this branch. If the override is still active, merging would suppress Docker-based tests in the main Linux CI pipeline.

hujc7 added 3 commits May 27, 2026 07:53
The pre-scan at the top of pytest_sessionstart wraps its `open()` in
`except OSError: continue` to tolerate races (a marker-tagged file
deleted between os.walk and open(), an unreadable symlink, etc.).
The post-scan filter further down was missing the same guard, so a
single transient I/O error during the marker-token check would raise
an unhandled exception and abort the orchestrator before any test ran.

Mirror the pre-scan's try/except so both passes degrade gracefully.

Flagged P1 by Greptile auto-review on commit 4bd088f.
Pre-scan: log loud `::warning::` then continue. The pre-scan is
best-effort (missed files may still be picked up via
_collect_test_files's normal walk path), so silent continue was
acceptable but unhelpful — adding a visible warning lets the issue
surface without aborting.

Post-scan: raise RuntimeError. This is the final filter pass — a
silent drop here would let a marker-tagged test exit the run while
the orchestrator still reports success. Loud abort on OSError is the
right signal; the failure case (permission flap, race-deleted file,
broken symlink on a previously-walked path) is exotic enough that
hitting it indicates a real CI environment issue worth investigating.

Replaces the silent `continue` introduced in a9711ef from
Greptile's P1 suggestion — that suggestion mirrored pre-scan's
behavior mechanically but missed the semantic difference between
the two passes.
Match the existing ISAACSIM_CI_SHORT post-scan's behavior — no
OSError handling, let `open()` propagate. Same semantic for both
markers: if a test file can't be read during scan, the orchestrator
aborts with the raw OSError and CI goes red. No silent drops.

Reverts the try/except additions from a9711ef (Greptile P1
suggestion) and d7fb1bd (the differentiated raise/warn version).
@myurasov-nv
Copy link
Copy Markdown
Member

Why are we creating yet another separate CI for ARM instead of just letting some of the tests from run on ARM? I vote against that.

hujc7 added 2 commits May 31, 2026 23:27
build.yaml's `changes` job had its own ~95-line inline detector while
arm-ci.yaml consumed the new detect-changes composite. Route both
workflows through the composite so the gating logic stays in one place.

Extract the deps-hash computation into _lib/compute-deps-hash.sh,
sourced by both docker-build (local-store check) and
ecr-build-push-pull (registry check). The two caches stay independent;
only the hash schema is shared, which removes the silent-drift risk the
docker-build docstring previously warned about.

Extract the docker-config + nvcr.io login into _lib/setup-docker-config.sh,
sourced by both composites. The script is idempotent (early-returns when
DOCKER_CONFIG already points at a valid temp dir), so the
`skip-docker-config` input on docker-build and the corresponding
plumbing in ecr-build-push-pull's delegation are no longer needed.

Bump arm-ci.yaml's actions/checkout from v4 to v6 to match the rest of
the workflows.
Per @myurasov-nv's review feedback: same OS (Linux), different runner pool
shouldn't justify a separate workflow file. Other multi-arch repos (NVIDIA
Warp, NumPy, newton, PyTorch) put cross-arch jobs in one workflow with
different `runs-on:` labels. The build/test work itself doesn't change.

Add an `arm-ci` job to build.yaml that builds and tests on the same
[self-hosted, arm64] runner (no ECR for arm64, so the image must stay
local to the runner that builds it). Source-side fixes from the original
series stay: AppLauncher EXP_PATH fallback, pytetwild aarch64 gate,
OVRTX/OVPhysX skip-on-aarch64, conftest.py CI_MARKER orchestration.

Gate the amd64 `build` and `build-curobo` jobs on absence of the
`ci-skip-amd64` PR label so this branch can iterate on arm-ci alone
without burning GPU minutes. All 18 test-* jobs skip transitively via
`needs: build`. Remove the label to re-enable the full amd64 surface.

Delete .github/workflows/arm-ci.yaml.
hujc7 added 2 commits June 1, 2026 01:10
…park-ci-cartpole

# Conflicts:
#	.github/workflows/build.yaml
#	source/isaaclab/setup.py
#	tools/conftest.py
@hujc7 hujc7 added ci-arm-only Iteration: skip all non-arm-ci checks (docs, install, license, amd64 build) and removed ci-skip-amd64 labels Jun 1, 2026
Iteration helper. The label-gate added in the prior commit (then named
ci-skip-amd64) only skipped build.yaml's amd64 build jobs. Expanding to
also gate docs.yaml (build-latest-docs, build-multi-docs), install-ci.yml
(both x86 and arm installation test jobs), and license-check.yaml so
that PRs iterating on a single platform can run only their target
workflow.

Rename ci-skip-amd64 -> ci-arm-only to reflect the broader scope.
@hujc7 hujc7 marked this pull request as draft June 1, 2026 01:29
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request Jun 1, 2026
PR 5695 lays foundation only. The two tagged tests (test_scipy.py,
test_torch.py) had pytestmark applied here but no in-PR consumer —
the arm-ci and windows-ci workflows that select on these markers
ship in isaac-sim#5698 and isaac-sim#5700 respectively.

Each downstream PR should land the marker annotations on the tests
it actually wants in its canary, alongside the workflow that runs
them. This keeps 5695 self-contained as scaffolding and prevents
unused annotations from accumulating here.
hujc7 added 3 commits June 1, 2026 16:29
After folding arm-ci.yaml into build.yaml, the detect-changes composite
had a single caller. The composite + sparse-checkout pattern was overhead
to support cross-workflow sharing that no longer exists. Restore the
inline bash detector that was on develop before the migration.

Net: -50 lines, one fewer directory under .github/actions/.
arm64 runs on a different runner pool ([self-hosted, arm64]) than the
amd64 build ([self-hosted, gpu]), so the local docker tag could not
collide in practice. The extra env var was defensive only.

Follow the same suffix pattern that build-curobo already uses
(`${{ env.CI_IMAGE_TAG }}-curobo`) for consistency.
`local-hit`, `local-deps-hit`, and `was-built` were declared as
action-level outputs anticipating downstream telemetry callers that
never materialized. Grep across .github/ shows zero consumers.

Internal step outputs of the same names continue to gate the
conditional steps inside the action (the local exact-tag short-circuit,
the local deps-tag short-circuit, the post-build deps-tag application),
so the deletion is purely surface-area: external declarations go, in-action
flow unchanged.
@hujc7 hujc7 removed the ci-arm-only Iteration: skip all non-arm-ci checks (docs, install, license, amd64 build) label Jun 1, 2026
hujc7 added 9 commits June 1, 2026 18:42
Cleanups in response to self-review:

1. compute-deps-hash.sh refactored from source-with-env-vars to
   exec-with-positional-args. Caller captures hash via $(...) or pipes
   into $GITHUB_OUTPUT. More idiomatic, testable from any shell, no
   env-var contract.

2. docker-build gains a `deps-hash` input. ecr-build-push-pull computes
   the hash once in its registry-side check step, writes it to
   $GITHUB_OUTPUT, and passes it forward to docker-build's delegation.
   docker-build's local check uses the input if non-empty, else computes
   itself. Hash now computed exactly once per build.

3. `extra-tags` input removed from docker-build. The single consumer
   (ecr-build-push-pull) now does its own `docker tag <commit> <ECR>`
   step after the build returns. Less surface area on the primitive.

4. `evict-stale-cache` input added on docker-build (default false).
   The 14-day eviction is opt-in rather than implicit. No current
   caller opts in; runners that need disk cleanup do so explicitly.

5. Three unused action-level outputs (local-hit, local-deps-hit,
   was-built) on docker-build removed — no consumers grep'd.

6. docker-build description reverted to develop's one-liner.

7. ci_marker pre-scan in tools/conftest.py removed. The pre-scan
   previously overrode TESTS_TO_SKIP for marker-tagged files, which
   only papered over one mismatch (test_differential_ik.py is in
   TESTS_TO_SKIP as "Failing" yet was arm_ci-tagged). Respect the
   global skip list: a skip-listed test stays skipped on all
   platforms. arm-ci canary goes 8 -> 7 with the one known-failing
   file correctly dropped.

8. ci-arm-only label gates removed from build.yaml, docs.yaml,
   install-ci.yml, license-check.yaml. The iteration helper was
   one-shot; arm-ci is now green, the gate's job is done. Drop the
   gates and the corresponding repo label.

9. triggered_jobs string in build.yaml's `changes` job reverted to
   develop's text — the "non-root verify is folded" detail snuck in
   from an earlier commit and isn't strictly relevant.
The five pre-existing inputs (image-tag, isaacsim-base-image,
isaacsim-version, dockerfile-path, context-path) had their descriptions
rewritten during the split out of ecr-build-push-pull, with no behavior
change. Restore develop's original wording; keep descriptions only on
the genuinely-new inputs (platform, cache-from, cache-to, deps-hash,
evict-stale-cache).
F2: test_differential_ik.py carried `pytestmark = pytest.mark.arm_ci`,
but the file is in tools/test_settings.py TESTS_TO_SKIP ("Failing"), so
_collect_test_files drops it before the ci_marker filter ever sees it —
the marker was inert. A skip-listed test should not claim to run on any
platform; remove the marker. (The other arm_ci-tagged tests are not in
TESTS_TO_SKIP and are unaffected.)

F3: the changelog fragment credited a pytetwild aarch64 exclusion, but
that gate now lives on develop (via the packaging cleanup that removed
setup.py) and is already documented by a separate fragment. This PR's
diff no longer touches pytetwild gating, so the bullet described a
change not present here. Drop it; keep the in-scope AppLauncher
EXP_PATH fallback entry.
The two _lib helpers were a sourced bash script (setup-docker-config) and
an exec'd bash script (compute-deps-hash) — two different calling
conventions for what are both "shared step-logic." Unify them on the
GHA-native pattern: each becomes a composite action with an inline
`run:` block and a declared interface.

- _lib/compute-deps-hash/action.yml: inputs (dockerfile-path,
  isaacsim-base-image, isaacsim-version) -> output `hash`. Both
  docker-build and ecr-build-push-pull `uses:` it, so the deps-hash
  schema lives in exactly one place (no drift between the local-store
  check and the registry check).
- _lib/setup-docker-config/action.yml: no inputs; writes DOCKER_CONFIG
  to $GITHUB_ENV and logs into nvcr.io. Idempotent (a second invocation
  in the same job short-circuits), so ecr-build-push-pull delegating to
  docker-build doesn't redo it.

Call sites change from `run: . _lib/x.sh` / `$(_lib/x.sh ...)` to
`uses: ./.github/actions/_lib/x`. Dedup is unchanged: ecr computes the
hash once via the composite and forwards it to docker-build through the
deps-hash input; docker-build computes its own only when the input is
empty (the arm-ci direct-call path).

Delete the two .sh files — the composite action is now the single shared
unit, so no separate script is needed.
The `evict-stale-cache` input on docker-build was inert — no caller set
it, so the 14-day deps-tag eviction never ran. The Spark runner is
long-lived self-hosted with no ECR, so its local deps-cache tags
accumulate indefinitely; it's the legitimate consumer. Set
evict-stale-cache: "true" on the arm-ci build so stale tags are pruned
each run.
Review fixes (F4-F6):
- test_cartpole_training_smoke.py: repoint the perception case from the
  retired/deprecated Isaac-Cartpole-RGB-Camera-Direct-v0 to the canonical
  Isaac-Cartpole-Camera-Direct-v0 (defaults to rgb, has an rl_games entry
  point, not on the removal path); fix the docstring claim about camera
  envs registering "only" rl_games (they register rl_games + skrl, just
  no rsl_rl).
- isaaclab changelog: add a bullet for the AssetConverterBase default USD
  dir moving from hardcoded /tmp to tempfile.gettempdir() (a user-facing
  cross-platform change that was undocumented).
- isaaclab_tasks changelog: the package's delta here is test-only
  (arm_ci markers + aarch64 skip guards in rendering test helpers), so
  convert the fragment to .skip per AGENTS.md (no user-facing entry).

Arm-only test tags:
- test_scipy.py / test_torch.py carried [windows_ci, arm_ci]; this is the
  ARM workflow PR, so scope their tags to arm_ci only. windows_ci test
  tags belong to the Windows workflow PR.
The develop merge brought in isaac-sim#5891 (split tasks tests into core/contrib),
which relocated every test file under test/core/ or test/contrib/ and left
only shared helpers + conftest at the test-suite root. The new
test_cartpole_training_smoke.py predated that split and was the lone
test_*.py still at top-level. Cartpole is a core task, so move it under
test/core/ to match the new layout, and bump its _REPO_ROOT path depth
from parents[3] to parents[4] (matching the sibling core/ tests).
test_train_cartpole_perception timed out at the default 600s on a
cold-cache Spark runner: the first camera-enabled run compiles shaders
(~600s) before training even starts, which alone consumes the budget.
It passed earlier only because that runner had a warm shader cache.
Give the camera case a 1800s timeout (the state case keeps 600s).
@hujc7
Copy link
Copy Markdown
Collaborator Author

hujc7 commented Jun 1, 2026

Why are we creating yet another separate CI for ARM instead of just letting some of the tests from run on ARM? I vote against that.

Should be fixed.

@hujc7 hujc7 marked this pull request as ready for review June 2, 2026 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants