[BUG] Flaky integration test test_scenario_config_management on Rolling CI #296

@bburda

Bug report

Steps to reproduce

  1. Push any change to a branch with CI enabled
  2. Wait for the build-and-test (rolling, ubuntu:noble) job
  3. Observe test_scenario_config_management failing intermittently

Expected behavior

test_scenario_config_management should pass - it launches a single demo node (temp_sensor) and waits for the gateway to discover it.

Actual behavior

The test times out after 60 seconds with:

AssertionError: Discovery incomplete after 60.0s - found 2 apps, need 1.
Missing apps: {'temp_sensor'}, Missing areas: set()

Key observation: the gateway discovers 2 apps but none of them is temp_sensor. This suggests DDS cross-contamination from other tests leaking through ROS_DOMAIN_ID isolation on Rolling.
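
For illustration, here is a minimal sketch of how a discovery wait can end with exactly this message even though the app count is higher than required. The helper name `wait_for_apps` and the `fetch_app_ids` callable are hypothetical and are not the actual ros2_medkit test utilities; the point is only that a name-based check fails when the discovered apps are not the expected ones.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_apps(
    fetch_app_ids: Callable[[], Iterable[str]],  # hypothetical: returns app IDs seen by the gateway
    required_apps: Set[str],
    timeout: float = 60.0,
    poll_interval: float = 1.0,
) -> Set[str]:
    """Poll until every required app is discovered or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        found = set(fetch_app_ids())
        missing = required_apps - found
        if not missing:
            return found
        if time.monotonic() >= deadline:
            # The count alone can look fine (2 >= 1) while the required name is absent.
            raise AssertionError(
                f"Discovery incomplete after {timeout}s - found {len(found)} apps, "
                f"need {len(required_apps)}. Missing apps: {missing}"
            )
        time.sleep(poll_interval)
```

If `fetch_app_ids` keeps returning two unrelated app IDs, the loop never sees `temp_sensor` and raises after the timeout, which matches the "found 2 apps, need 1" symptom above.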

Root cause analysis

Three likely contributing factors:

  1. DDS discovery appears to be slower on Rolling, which ships a newer Fast DDS (formerly Fast-RTPS) / Cyclone DDS version with different timing characteristics. The 60-second DISCOVERY_TIMEOUT is borderline.

  2. ROS_DOMAIN_ID contamination - the gateway reported 2 apps instead of the expected 1, and neither of them was temp_sensor. This points to stale DDS participants from a previous test bleeding into this test's domain. The CMakeLists.txt already documents this risk:

    # Each test also gets a unique ROS_DOMAIN_ID to prevent DDS cross-contamination
    # between tests (e.g., stale DDS participants from previous tests leaking into
    # subsequent test's graph discovery).
    
  3. CI runner contention - Docker-based Rolling CI shares resources, adding latency to DDS discovery.

Proposed fixes (ascending complexity)

1. Increase DISCOVERY_TIMEOUT from 60s to 90s (1-line change, lowest risk)

  • File: src/ros2_medkit_integration_tests/ros2_medkit_test_utils/constants.py:39
  • CMake scenario test timeout is already 300s, so 90s is well within bounds
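
The headroom arithmetic behind this can be sketched as follows. Only `DISCOVERY_TIMEOUT` is a name from the repository; `CMAKE_TEST_TIMEOUT` and `discovery_headroom` are hypothetical and exist only for this example.

```python
# Value proposed in this issue (currently 60.0).
DISCOVERY_TIMEOUT = 90.0  # seconds

# Assumption: mirrors the 300 s CMake scenario test timeout mentioned above.
CMAKE_TEST_TIMEOUT = 300.0  # seconds


def discovery_headroom(discovery_timeout: float, test_timeout: float) -> float:
    """Seconds left for launch, assertions and teardown after one worst-case discovery wait."""
    if discovery_timeout >= test_timeout:
        raise ValueError("discovery timeout must be shorter than the overall test timeout")
    return test_timeout - discovery_timeout


print(discovery_headroom(DISCOVERY_TIMEOUT, CMAKE_TEST_TIMEOUT))  # 210.0
```

This assumes a single discovery wait per test; a scenario that waits for discovery several times would consume the budget proportionally faster.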

2. Increase ROS_DOMAIN_ID stride for integration tests

  • File: src/ros2_medkit_integration_tests/CMakeLists.txt
  • Currently domain IDs are sequential (100, 101, 102...). Increasing the stride to 2-5 may reduce DDS participant leakage between tests
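
When picking a stride, it may help to see how many tests still fit. The sketch below is not the CMake logic from the repository: the function names are hypothetical, the base of 100 is taken from the sequential IDs mentioned above, and the port constants are the RTPS defaults (port base 7400, domain gain 250), which cap usable domain IDs at 232.

```python
RTPS_PORT_BASE = 7400    # PB in the RTPS port mapping
RTPS_DOMAIN_GAIN = 250   # DG in the RTPS port mapping
MAX_DOMAIN_ID = 232      # highest domain ID whose ports still fit in 16 bits


def domain_id_for_test(index: int, base: int = 100, stride: int = 1) -> int:
    """Domain ID for the index-th test (0-based), spaced by `stride`."""
    domain_id = base + index * stride
    if not 0 <= domain_id <= MAX_DOMAIN_ID:
        raise ValueError(f"domain ID {domain_id} is outside 0..{MAX_DOMAIN_ID}")
    return domain_id


def discovery_multicast_port(domain_id: int) -> int:
    """Default RTPS discovery multicast port for a domain (PB + DG * domainId)."""
    return RTPS_PORT_BASE + RTPS_DOMAIN_GAIN * domain_id


def max_tests(base: int = 100, stride: int = 1) -> int:
    """How many tests can get a unique domain ID before exceeding MAX_DOMAIN_ID."""
    return (MAX_DOMAIN_ID - base) // stride + 1


print(discovery_multicast_port(100))  # 32400
print(discovery_multicast_port(101))  # 32650
print(max_tests(stride=1))            # 133
print(max_tests(stride=5))            # 27
```

With a base of 100, a stride of 5 leaves room for 27 tests, so the stride should be checked against the number of integration tests. The ROS 2 documentation on ROS_DOMAIN_ID also lists 0-101 as the safe range on Linux because higher IDs map into the default ephemeral port range, which may be worth checking against the current base.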

3. Increase gateway refresh interval in tests from 1s to 2s

  • File: src/ros2_medkit_integration_tests/ros2_medkit_test_utils/launch_helpers.py:98
  • Reduces DDS middleware strain, gives more time per cycle

Environment

  • ros2_medkit version: 0.3.0 (main branch, commit bdf6fe3)
  • ROS 2 distro: Rolling (ubuntu:noble)
  • OS: Ubuntu Noble (24.04) in Docker (GitHub Actions CI)
