
feat(ci): shuffle-tests#112300

Draft
joshuarli wants to merge 73 commits into master from shuffle-tests-v2

Conversation

@joshuarli
Member

reviving #108286

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 6, 2026
- Replace 50-line inline bash Snuba bootstrap with bootstrap-snuba.py
  (adopted from getsentry), which parallelizes ClickHouse DB setup with
  retries and cleaner error handling
- Extract failure aggregation from inline bash to report_shuffle_failures.py:
  deduplicates across shards, writes a consolidated GitHub Actions job
  summary, and upserts GitHub issues (creates or comments on existing)
- Add test_report_shuffle_failures.py with 33 tests covering all pure
  functions and the gh CLI boundary
- Add a report job that runs after all matrix shards, downloads failure
  artifacts, and calls the reporter — replacing per-shard issue creation
  with one centralized deduplicated pass
…mestamp

The test used a hardcoded date of 2025-01-01 which is now outside
ClickHouse's 90-day retention window, causing Snuba to return fewer
rows than expected. Use a recent timestamp (1 hour ago) instead,
matching the pattern already used in the neighbouring
test_export_clickhouse_rows test.
The test was frozen to 2025-01-15, a date now outside Snuba's 90-day
retention window. Snuba returns no data for timestamps that old,
causing the result list to be empty and the index-0 access to raise
IndexError.

Use a module-level constant frozen to yesterday so the test remains
deterministic while always staying within the retention window.
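The pattern described above can be sketched as a module-level constant (the name MOCK_DATETIME mirrors the commit message; the exact rounding is illustrative, not the actual test code):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: a frozen-but-recent timestamp. Deterministic within a
# single test run, yet always inside a rolling 90-day retention window.
MOCK_DATETIME = (datetime.now(timezone.utc) - timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)

# Always within retention, no matter when the suite runs.
assert datetime.now(timezone.utc) - MOCK_DATETIME < timedelta(days=90)
```

Freezing the test clock to this constant keeps assertions deterministic while the data it inserts never ages out of the retention window.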
The test used base_time = datetime(2025, 10, 29, ...) which is now
163+ days ago, outside the EAP item retention_days=90 window. Snuba
returns no data for that detector, causing the KeyError on
data[str(detector.id)].

Use MOCK_DATETIME (already defined as yesterday) so base_time is
always within retention. The class-level @freeze_time(MOCK_DATETIME)
already establishes this as the frozen 'now', so using it for
base_time is consistent with the rest of the test suite.
Add 7 new patterns drawn from 27038a4:

- #9 Snuba cross-test contamination: dedicated project + tight query window
- #10 Expired retention window: replace hardcoded 2025 dates with before_now()
- #11 Global state teardown gap: try/finally cleanup + autouse fixture
- #12 Hardcoded IDs causing uniqueness collisions: use uuid4().hex
- #13 Bucket-count boundary flakiness: freeze time mid-period
- #14 Batch timeout causing premature flush: set max_batch_time=300
- #15 Polling loops dependent on time.sleep: add poll_interval=0 parameter

Update the lookup checklist to cover the new patterns.
Cut SKILL.md by ~50% - collapse phases into numbered sections,
inline the artifact-reading one-liner, remove prose repetition.

Cut common-patterns.md by ~60% - lookup table up front, each pattern
trimmed to symptom + fix code only, no diagnostic preamble.
Increase matrix from 11 to 16 shards for faster wall-clock time.

Disable detect-test-pollution temporarily: it re-runs the full test
suite for each failing shard (~10min overhead) which isn't worth the
cost until the baseline flake rate is lower. Also remove the now-dead
testids truncation and upload steps that only fed pollution detection.
The failure artifact always writes type=flaky in the interim.
A prior test that opens a sentry_sdk span leaves the trace active in the
SDK scope. get_trace_id() then returns a non-None value, failing the
assertion that expects None when no span is active.

Wrap the emission in sentry_sdk.isolation_scope() to ensure the handler
runs with a clean scope regardless of what ran before.
reserve_model_ids(Monitor, 14) was called after setup_cross_db_deletion_data(),
which creates monitors with auto-generated IDs. setUp() deletes all monitors but
does not reset the DB sequence, so if a previous test consumed IDs 4 or lower,
the next auto-ID could be 5, 7, 9, or 11 — exactly the IDs the test then tries
to INSERT explicitly, causing an IntegrityError.

Move reserve_model_ids to before any setup calls so all auto-generated
monitors land above 14 and can never collide with the explicit IDs.
Adopt the setup-devservices action from master which starts devservices
in the background before Python/venv setup begins. This lets docker
image pulls and container starts overlap with pip install and other
sentry env setup, saving ~1-2 min per shard.

Also start bootstrap-snuba.py in background before wait.sh completes
so Snuba Phase 1 (ClickHouse DB creation + bootstrap) overlaps with
the final devservices health checks — saving another ~30s.

Move bootstrap-snuba.py to .github/actions/setup-devservices/ where
master keeps it, removing the duplicate in workflows/scripts/.
test_emit shares the same trace context leak as test_gke_emit: when
test_emit_with_trace_id runs first it leaves an active SDK span, making
get_trace_id() return non-None in the next test. Wrap the emit call in
sentry_sdk.isolation_scope() across all parametrizations.

Drop GitHub issue creation from report_shuffle_failures.py entirely —
the workflow now only generates the job summary. Remove GH_TOKEN,
GH_REPO, and issues:write permission from the report job. Update tests
to match.
assert_ttls (test_buffer.py): Skip TTL=-2 (the key vanished between keys()
and ttl(), a TOCTOU race with concurrent xdist workers sharing Redis)
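A minimal sketch of that guard (FakeRedis and the function signature are illustrative stand-ins, not the real test helper):

```python
def assert_ttls(client, expected_max):
    # Redis TTL conventions: -2 means the key no longer exists, -1 means no
    # expiry. With concurrent xdist workers sharing Redis, a key can be
    # deleted in the gap between keys() and ttl() (a TOCTOU race), so a -2
    # result is skipped rather than treated as a failure.
    for key in client.keys("*"):
        ttl = client.ttl(key)
        if ttl == -2:
            continue  # key vanished between keys() and ttl(); another worker's doing
        assert 0 < ttl <= expected_max, f"{key}: ttl={ttl}"

class FakeRedis:
    """Tiny stand-in so the sketch runs without a Redis server."""
    def __init__(self, ttls):
        self._ttls = ttls
    def keys(self, pattern):
        return list(self._ttls)
    def ttl(self, key):
        return self._ttls[key]

# A key that disappeared (-2) no longer fails the assertion:
assert_ttls(FakeRedis({"a": 10, "b": -2, "c": 60}), expected_max=60)
```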

TestGetActiveOrgs.setUp (test_common.py): Wrap 100 create_organization/
create_project calls in time_machine.travel(tick=True) so each gets a
unique millisecond timestamp — class-level @freeze_time causes
MaxSnowflakeRetryError because the generator can't advance the clock
to resolve sequence exhaustion.

test_drain_shard_flush_all__upper_bound (test_outbox.py): Raise barrier
timeout from 10s to 30s for both barrier tests — 10s is too tight under
16-shard CI runner load.

test_get_span_counts_with_many_projects (sampling spans): Same
MaxSnowflakeRetryError as above — 200 create_project calls under class-
level @freeze_time. Wrap the creation loop in time_machine.travel(tick=True).

test_delete_error_events (test_project.py): Snuba eventstream deletions
are processed asynchronously. Add a retry loop (max 5s) after the
scheduled deletion runs before asserting events == 0.
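The retry loop in that last fix can be sketched as a generic polling helper (wait_until and its parameters are illustrative, not the actual helper in the test):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it is truthy or the deadline passes.

    Useful when a backing store (here: Snuba eventstream deletions) applies
    writes asynchronously, so an immediate assertion would race the store.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())  # one last check at the deadline

# Usage sketch: poll until the deleted events are gone, then assert once.
assert wait_until(lambda: True, timeout=0.5)
assert not wait_until(lambda: False, timeout=0.2, interval=0.05)
```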
test_updates_grouping_config_if_current_config_is_not_default: The
assertion 'expected_expiry - actual_expiry < 1' fails when both
timestamps are int()-truncated and land exactly 1 second apart. The
comment already says 'one-second tolerance necessary' — fix to <= 1.
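The off-by-one is easy to reproduce with int()-truncated timestamps (the values below are illustrative):

```python
# Two timestamps only 0.999s apart can truncate to ints exactly 1 apart:
expected_expiry = int(100.0)   # e.g. int(some_datetime.timestamp())
actual_expiry = int(99.001)

assert expected_expiry - actual_expiry == 1
assert not (expected_expiry - actual_expiry < 1)   # strict check: flaky failure
assert expected_expiry - actual_expiry <= 1        # tolerant check: passes
```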

test_workflow_engine_serializer: set_value('success') was called at the
top of the test, long before the HTTP request. A concurrent xdist
worker's flushdb() (from the spans buffer fixture) can clear the Redis
key in that window, making the endpoint return 'pending'. Move
set_value() to immediately before the request.

Add a TODO section to fix-flaky-tests/SKILL.md for the two remaining
unfixed flakes (test_simple CrossTransactionAssertionError, test_unmerge
times_seen drift) so future sessions can pick them up.
…n_id=2

Another test's integration can have id=2. Use self.integration.id + 1_000_000
to guarantee the ID doesn't exist regardless of test ordering.
…ring

test_drain_shard_not_flush_all__upper_bound: Use Organization(id=2) for
outbox2 so it's in a different shard. Same-org outboxes coalesce —
processing outbox1 deletes outbox2 as a side effect, breaking the
assertion. A different-shard outbox is unaffected by the shard-1 drain.
Switch back to _PairwiseSync (no timeout) since CI load can break the
1s timeout even on valid test execution.

test_basic[new_group]: Move optimize_snuba_table('events') to BEFORE
get_event_by_processing_counter('x1'). The reprocessed event x1 may
still have the old group_id in Snuba until ClickHouse processes the
replace message — optimize must run first.
…p count

@patch('sentry.conduit.tasks.time.sleep') patches time.sleep on the
global time module, causing retry sleeps from sentry.utils.retries to
accumulate in mock_sleep.call_count (giving ~2333 instead of 100).

Patching 'sentry.conduit.tasks.time' instead replaces only the local
module reference in tasks.py, so retries.py keeps using real time.sleep
and the assert mock_time.sleep.call_count == NUM_DELTAS is reliable.
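The distinction is where the patch is applied, not what is patched. A self-contained sketch of the principle (the two classes stand in for the real modules; they are not Sentry code):

```python
import time
from unittest import mock

class Tasks:
    """Stand-in for sentry.conduit.tasks: holds its own reference to time."""
    time = time

    def run(self):
        self.time.sleep(0)

class Retries:
    """Stand-in for sentry.utils.retries: uses the global time module."""
    def backoff(self):
        time.sleep(0)

# Patching time.sleep itself counts every caller, the flaky behaviour:
with mock.patch("time.sleep") as mock_sleep:
    Tasks().run()
    Retries().backoff()
    assert mock_sleep.call_count == 2  # retry sleeps pollute the count

# Patching only the local module reference counts just the code under test:
with mock.patch.object(Tasks, "time") as mock_time:
    Tasks().run()
    Retries().backoff()  # real time.sleep, not recorded
    assert mock_time.sleep.call_count == 1
```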
…ation in push_changes_timeout

TestRecalibrateOrgsTasks.setUp: Wrap 6 create_old_project calls in
time_machine.travel(tick=True) to prevent MaxSnowflakeRetryError.

test_push_changes_timeout: time.time() side_effect=[0,0,200] exhausts
when retry machinery makes extra calls. Use chain([0,0], repeat(200))
so additional calls keep returning 200 (still past timeout) without
raising StopIteration.
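The itertools fix can be checked in isolation (the mock name is illustrative):

```python
from itertools import chain, repeat
from unittest import mock

# side_effect as a plain list raises StopIteration once exhausted:
fake_time = mock.Mock(side_effect=[0, 0, 200])
assert [fake_time() for _ in range(3)] == [0, 0, 200]
raised = False
try:
    fake_time()  # a fourth call, e.g. from retry machinery
except StopIteration:
    raised = True
assert raised

# chain + repeat keeps returning the final value indefinitely:
fake_time = mock.Mock(side_effect=chain([0, 0], repeat(200)))
assert [fake_time() for _ in range(6)] == [0, 0, 200, 200, 200, 200]
```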
maxage=1 caused key 1 to expire before the assertion under slow CI
(16 parallel shards slow down execution beyond 1 second).
…orting

Re-enables pollution detection in a lightweight form (~30s per shard
instead of the previous ~10min full-suite re-run):

- After a shuffle-test failure, run the failing test in isolation
  (no shuffle). If it passes alone → mark failure.json as type="pollution".
  If it fails alone → keep type="flaky" (genuine intermittent failure).
- The existing 'report' job writes all failures to GITHUB_STEP_SUMMARY
  with a "passes in isolation (likely pollution)" label for pollution
  cases. No GitHub issues are created.

report_shuffle_failures.py: Handle pollution failures without
polluting_testid (simple isolation detection) by showing the traceback
with a "passes in isolation" label rather than requiring polluter info.
Restores the original detect_test_pollution.py-based bisection that
automatically finds the polluting test. Replaces the lightweight
isolation-only detection added in the previous commit.

The script does binary search through testids (tests that ran before
the failing one) to find the exact polluting test. Writes result to
GITHUB_STEP_SUMMARY (no issue created). Failure artifacts get
type='pollution' with polluting_testid when bisection succeeds, or
type='flaky' when the test also fails in isolation.
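The bisection itself is a standard binary search over the prefix of tests that ran before the failure. A sketch under the assumptions that exactly one polluter exists and that the failure does not reproduce with an empty prefix (the names and signature are illustrative, not detect_test_pollution.py's actual API):

```python
def find_polluter(testids, reproduces):
    """Return the test whose earlier execution makes the target fail.

    testids: tests that ran before the failing test, in order.
    reproduces(prefix): run `prefix` then the failing test; True if it fails.
    """
    lo, hi = 1, len(testids)
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(testids[:mid]):
            hi = mid       # polluter is within the first `mid` tests
        else:
            lo = mid + 1   # polluter comes after position `mid`
    return testids[lo - 1]

# Toy check: pretend "c" is the polluting test.
tests = ["a", "b", "c", "d", "e"]
assert find_polluter(tests, lambda prefix: "c" in prefix) == "c"
```

Each probe is one test-suite invocation, so the polluter is found in O(log n) runs instead of re-running the full suite.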
Documents all systemic infrastructure changes, individual test fixes grouped
by root cause, Snuba changes, workflow changes, and current state.
test_recalibrate_orgs_with_sliding_window_org: A concurrent xdist
worker's flushdb() can clear both the sliding-window cache keys and
the recalibration factor keys in the gap between the first and second
recalibrate_orgs() calls. Re-seed both sets of Redis keys before the
second invocation so the double-down assertion is robust.

test_large_digest: Using self.notification (a full pickled+compressed
Event object, potentially hundreds of KB per record) × 8192 records
pushes the DIGEST_OPEN Lua response past rb.Cluster buffer limits,
causing random truncation. Switch to a tiny string value since the
test only checks record count.
Hardcoded group_id=2 for the control events collides with group.id when
test ordering puts this test early in the session. Both events land in
the same group, the suspect-tags scorer sees no cross-group signal, and
the score collapses to 0.0 instead of 2.76. Use group.id + 1_000_000 to
guarantee no collision regardless of ordering.
sentry.utils.snuba._snuba_pool is a module-level global created at
import time from settings.SENTRY_SNUBA (default: port 1218). When the
xdist worker later overrides SENTRY_SNUBA to the per-worker snuba-gw
port (1230+), the pool still points at 1218, causing connection refused
on every eventstream.insert call on that worker.

Recreate _snuba_pool immediately after the settings override so all
subsequent Snuba requests on the worker reach the correct container.
… import

The previous attempt imported sentry.utils.snuba at configure time, which
triggered an import chain hitting sentry.api.helpers.error_upsampling before
Django settings were fully ready in the worker subprocess, causing
KeyError: 'analytics.backend' and crashing all workers.

Fixed approach: check sys.modules instead of importing the module.
- If sentry.utils.snuba is already in sys.modules (imported before
  configure_for_worker ran via conftest loading), recreate _snuba_pool
  in-place pointing at the per-worker URL.
- If it is not yet imported, the next import will read the already-updated
  settings.SENTRY_SNUBA and create the pool with the correct URL automatically.
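The sys.modules check can be sketched generically (the function name and the json demo are illustrative, not the actual conftest code):

```python
import sys

def reconfigure_if_imported(name, reconfigure):
    """Patch an already-imported module in place; otherwise do nothing.

    If the module is not yet in sys.modules, its first import will read the
    already-updated settings on its own, so no eager import is needed and no
    premature import chain is triggered.
    """
    mod = sys.modules.get(name)
    if mod is not None:
        reconfigure(mod)

import json  # guaranteed to be importable for the demo
reconfigure_if_imported("json", lambda m: setattr(m, "_demo_patched", True))
assert getattr(json, "_demo_patched", False) is True

# Not imported: the callback must not run (it would raise here if it did).
reconfigure_if_imported("not_a_real_module_xyz", lambda m: 1 / 0)
```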
test_cache_invalidation_error_handling: self.project is a lazy cached
property that initializes self.organization, self.team, and self.project
on first access. When this test runs first in a shuffled order, that
lazy init happens inside the patch(cache.delete, side_effect=Exception)
context. Organization.save() post_save fires cache.delete, which raises
the patched Exception and escapes the test. Fix: eagerly access
self.project before entering the patch context.

test_get_active_orgs_no_max_projects: Add super().setUp() so base class
initialization (including SnubaTestCase.init_snuba() and cell context
setup) runs correctly. Missing super() call left the Snuba/cell state
uninitialised in shuffle-first ordering, contributing to MaxSnowflakeRetryError.
…t_setup too

The configure-time sys.modules check sometimes misses: if sentry.utils.snuba
is first imported after configure_for_worker ran but before the settings
override took effect (an edge case in plugin ordering), the pool gets created
with port 1218 and nothing updates it afterwards.

Add _ensure_snuba_pool() called from pytest_runtest_setup (before every test)
as a safety net. It checks whether _snuba_pool already points at the correct
per-worker port and skips the recreation if so — overhead is one attribute
lookup per test. Use sentry.net.http.connection_from_url (matching what
snuba.py uses) instead of urllib3 directly.