Bug report
Steps to reproduce
- Push any change to a branch with CI enabled
- Wait for the `build-and-test (rolling, ubuntu:noble)` job
- Observe `test_scenario_config_management` failing intermittently
Expected behavior
test_scenario_config_management should pass - it launches a single demo node (temp_sensor) and waits for the gateway to discover it.
Actual behavior
The test times out after 60 seconds with:
AssertionError: Discovery incomplete after 60.0s - found 2 apps, need 1.
Missing apps: {'temp_sensor'}, Missing areas: set()
Key observation: the gateway discovers 2 apps but none of them is temp_sensor. This suggests DDS cross-contamination from other tests leaking through ROS_DOMAIN_ID isolation on Rolling.
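The failure message reports how many apps were found but not which ones, so the log alone cannot confirm the contamination hypothesis. Below is a minimal sketch of a polling wait that also names the unexpected apps. `wait_for_discovery`, `fetch_app_ids`, and the injectable `clock`/`sleep` parameters are hypothetical, not the actual `ros2_medkit_test_utils` API. The message format only mirrors the one quoted above, and areas are omitted for brevity.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_discovery(
    fetch_app_ids: Callable[[], Iterable[str]],
    required_apps: Set[str],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poll until every required app is discovered, or fail with a diagnostic message.

    Hypothetical helper: `fetch_app_ids` stands in for whatever queries the gateway.
    """
    start = clock()
    while True:
        found = set(fetch_app_ids())
        if required_apps <= found:
            return found
        elapsed = clock() - start
        if elapsed >= timeout:
            missing = sorted(required_apps - found)
            # Apps this test did not launch: if they belong to another test's
            # launch, that would point to DDS cross-contamination.
            unexpected = sorted(found - required_apps)
            raise AssertionError(
                f"Discovery incomplete after {elapsed:.1f}s - found {len(found)} apps, "
                f"need {len(required_apps)}. Missing apps: {missing}, "
                f"unexpected apps: {unexpected}"
            )
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the helper testable without real waiting: with a fake clock, a node that only appears after 70 s passes under a 90 s timeout and fails under 60 s.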
Root cause analysis
Three contributing factors:
- DDS discovery is slower on Rolling - newer Fast-RTPS/CycloneDDS versions have different timing characteristics, so the 60-second `DISCOVERY_TIMEOUT` is borderline.
- ROS_DOMAIN_ID contamination - the gateway found 2 apps instead of the expected 1, and neither was the right one. This points to stale DDS participants from a previous test bleeding into this test's domain. The CMakeLists.txt already documents this risk:

      # Each test also gets a unique ROS_DOMAIN_ID to prevent DDS cross-contamination
      # between tests (e.g., stale DDS participants from previous tests leaking into
      # subsequent test's graph discovery).

- CI runner contention - Docker-based Rolling CI shares resources, adding latency to DDS discovery.
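ROS_DOMAIN_ID isolation rests on the DDS port mapping: each domain ID selects its own block of UDP ports, so participants in different domains normally do not exchange discovery traffic. The sketch below applies the default port mapping from the OMG DDSI-RTPS specification. It assumes the RMW implementations on Rolling keep these defaults (Fast DDS and Cyclone DDS both use them unless reconfigured). `rtps_ports` is a hypothetical helper, not part of ros2_medkit.

```python
# Default UDP port mapping from the OMG DDSI-RTPS specification.
PB, DG, PG = 7400, 250, 2       # port base, domain gain, participant gain
D0, D1, D2, D3 = 0, 10, 1, 11   # offsets: discovery mcast/ucast, user mcast/ucast


def rtps_ports(domain_id: int, participant_id: int = 0) -> dict:
    """Return the default RTPS ports used by one participant in a DDS domain."""
    base = PB + DG * domain_id
    return {
        "discovery_multicast": base + D0,
        "discovery_unicast": base + D1 + PG * participant_id,
        "user_multicast": base + D2,
        "user_unicast": base + D3 + PG * participant_id,
    }


if __name__ == "__main__":
    for domain_id in (100, 101, 102):
        print(domain_id, rtps_ports(domain_id))
```

Under these defaults, consecutive domain IDs are 250 ports apart. Sequential IDs therefore do not share ports as long as each domain has fewer than roughly 120 participants. From domain ID 102 upward, however, the base port is at or above 32768, inside the default Linux ephemeral port range (32768-60999). The ROS 2 documentation flags that range as a possible source of port conflicts.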
Proposed fixes (ascending complexity)
1. Increase `DISCOVERY_TIMEOUT` from 60s to 90s (1-line change, lowest risk)
   - File: `src/ros2_medkit_integration_tests/ros2_medkit_test_utils/constants.py:39`
   - The CMake scenario test timeout is already 300s, so 90s is well within bounds
2. Increase the `ROS_DOMAIN_ID` stride for integration tests
   - File: `src/ros2_medkit_integration_tests/CMakeLists.txt`
   - Domain IDs are currently sequential (100, 101, 102...). Increasing the stride to 2-5 would reduce DDS participant leakage between tests
3. Increase the gateway refresh interval in tests from 1s to 2s
   - File: `src/ros2_medkit_integration_tests/ros2_medkit_test_utils/launch_helpers.py:98`
   - Reduces DDS middleware strain and gives more time per cycle
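The domain IDs themselves are assigned in `CMakeLists.txt`. The Python sketch below only models that allocation, as a way to check a candidate stride before changing it. `allocate_domain_ids` and `unsafe_ids` are hypothetical helpers. The safe ranges come from the ROS 2 documentation on `ROS_DOMAIN_ID`. For the default Linux ephemeral port range (32768-60999), that documentation lists domain IDs 0-101 and 215-232 as free of collisions with ephemeral ports.

```python
# Linux domain-ID ranges that avoid the default ephemeral port range,
# per the ROS 2 ROS_DOMAIN_ID documentation (range() upper bounds are exclusive).
LINUX_SAFE_DOMAIN_RANGES = (range(0, 102), range(215, 233))


def allocate_domain_ids(num_tests: int, base: int, stride: int) -> list:
    """Assign one ROS_DOMAIN_ID per test, spaced `stride` apart (hypothetical model)."""
    return [base + i * stride for i in range(num_tests)]


def unsafe_ids(domain_ids) -> list:
    """Return the IDs that fall outside the Linux-safe ranges."""
    return [d for d in domain_ids if not any(d in r for r in LINUX_SAFE_DOMAIN_RANGES)]


if __name__ == "__main__":
    for stride in (1, 3):
        ids = allocate_domain_ids(5, base=100, stride=stride)
        print(stride, ids, unsafe_ids(ids))
```

For example, five tests starting at 100 with stride 1 yield `[100, 101, 102, 103, 104]`, and 102-104 fall outside the safe ranges. A stride of 2-5 from the same base pushes more IDs into the 102-214 band, so the base ID may need to move along with the stride.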
Environment
- ros2_medkit version: 0.3.0 (main branch, commit bdf6fe3)
- ROS 2 distro: Rolling (ubuntu:noble)
- OS: Ubuntu Noble (24.04) in Docker (GitHub Actions CI)
Additional information
- CI run: https://github.com/selfpatch/ros2_medkit/actions/runs/23308288985/job/67788427165
- Rolling is already marked `continue-on-error: true` in CI, so this doesn't block PRs
- Jazzy and Humble are not affected: the same test passes consistently there, as their DDS versions have faster/more reliable discovery
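Since only Rolling is affected, one variant of proposed fix 1 would raise the timeout on Rolling alone. The minimal sketch below assumes the tests run in an environment where the ROS 2 setup script has exported `ROS_DISTRO` (e.g. `rolling`, `jazzy`, `humble`). `discovery_timeout` and its lookup table are hypothetical, not existing code in `constants.py`.

```python
import os
from typing import Optional

# Hypothetical per-distro overrides: only Rolling shows the slow discovery.
_DISCOVERY_TIMEOUTS = {"rolling": 90.0}
_DEFAULT_DISCOVERY_TIMEOUT = 60.0


def discovery_timeout(distro: Optional[str] = None) -> float:
    """Pick the discovery timeout (seconds) for the active ROS 2 distro."""
    if distro is None:
        # ROS_DISTRO is set by sourcing the ROS 2 setup script.
        distro = os.environ.get("ROS_DISTRO", "")
    return _DISCOVERY_TIMEOUTS.get(distro.lower(), _DEFAULT_DISCOVERY_TIMEOUT)
```

Keeping Jazzy and Humble at 60s preserves their current failure latency, at the cost of one more distro-specific branch to maintain.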