
feat(ci): shuffle-tests#112300

Draft
joshuarli wants to merge 73 commits into master from shuffle-tests-v2

Conversation

@joshuarli
Member

reviving #108286

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 6, 2026
- Replace 50-line inline bash Snuba bootstrap with bootstrap-snuba.py
  (adopted from getsentry), which parallelizes ClickHouse DB setup with
  retries and cleaner error handling
- Extract failure aggregation from inline bash to report_shuffle_failures.py:
  deduplicates across shards, writes a consolidated GitHub Actions job
  summary, and upserts GitHub issues (creates or comments on existing)
- Add test_report_shuffle_failures.py with 33 tests covering all pure
  functions and the gh CLI boundary
- Add a report job that runs after all matrix shards, downloads failure
  artifacts, and calls the reporter — replacing per-shard issue creation
  with one centralized deduplicated pass
…mestamp

The test used a hardcoded date of 2025-01-01 which is now outside
ClickHouse's 90-day retention window, causing Snuba to return fewer
rows than expected. Use a recent timestamp (1 hour ago) instead,
matching the pattern already used in the neighbouring
test_export_clickhouse_rows test.
The test was frozen to 2025-01-15, a date now outside Snuba's 90-day
retention window. Snuba returns no data for timestamps that old,
causing the result list to be empty and the index-0 access to raise
IndexError.

Use a module-level constant frozen to yesterday so the test remains
deterministic while always staying within the retention window.
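The pattern described above can be sketched as a module-level constant (the name MOCK_DATETIME mirrors the commit message; the exact rounding is illustrative, not the actual test code):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch: a frozen-but-recent timestamp. Deterministic within a
# single test run, yet always inside a rolling 90-day retention window.
MOCK_DATETIME = (datetime.now(timezone.utc) - timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)

# Always within retention, no matter when the suite runs.
assert datetime.now(timezone.utc) - MOCK_DATETIME < timedelta(days=90)
```

Freezing the test clock to this constant keeps assertions deterministic while the data it inserts never ages out of the retention window.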
The test used base_time = datetime(2025, 10, 29, ...) which is now
163+ days ago, outside the EAP item retention_days=90 window. Snuba
returns no data for that detector, causing the KeyError on
data[str(detector.id)].

Use MOCK_DATETIME (already defined as yesterday) so base_time is
always within retention. The class-level @freeze_time(MOCK_DATETIME)
already establishes this as the frozen 'now', so using it for
base_time is consistent with the rest of the test suite.
Add 7 new patterns drawn from 27038a4:

- #9 Snuba cross-test contamination: dedicated project + tight query window
- #10 Expired retention window: replace hardcoded 2025 dates with before_now()
- #11 Global state teardown gap: try/finally cleanup + autouse fixture
- #12 Hardcoded IDs causing uniqueness collisions: use uuid4().hex
- #13 Bucket-count boundary flakiness: freeze time mid-period
- #14 Batch timeout causing premature flush: set max_batch_time=300
- #15 Polling loops dependent on time.sleep: add poll_interval=0 parameter

Update the lookup checklist to cover the new patterns.
Cut SKILL.md by ~50% - collapse phases into numbered sections,
inline the artifact-reading one-liner, remove prose repetition.

Cut common-patterns.md by ~60% - lookup table up front, each pattern
trimmed to symptom + fix code only, no diagnostic preamble.
Increase matrix from 11 to 16 shards for faster wall-clock time.

Disable detect-test-pollution temporarily: it re-runs the full test
suite for each failing shard (~10min overhead) which isn't worth the
cost until the baseline flake rate is lower. Also remove the now-dead
testids truncation and upload steps that only fed pollution detection.
The failure artifact always writes type=flaky in the interim.
A prior test that opens a sentry_sdk span leaves the trace active in the
SDK scope. get_trace_id() then returns a non-None value, failing the
assertion that expects None when no span is active.

Wrap the emission in sentry_sdk.isolation_scope() to ensure the handler
runs with a clean scope regardless of what ran before.
reserve_model_ids(Monitor, 14) was called after setup_cross_db_deletion_data(),
which creates monitors with auto-generated IDs. setUp() deletes all monitors but
does not reset the DB sequence, so if a previous test consumed IDs 4 or lower,
the next auto-ID could be 5, 7, 9, or 11 — exactly the IDs the test then tries
to INSERT explicitly, causing an IntegrityError.

Move reserve_model_ids to before any setup calls so all auto-generated
monitors land above 14 and can never collide with the explicit IDs.
Adopt the setup-devservices action from master which starts devservices
in the background before Python/venv setup begins. This lets docker
image pulls and container starts overlap with pip install and other
sentry env setup, saving ~1-2 min per shard.

Also start bootstrap-snuba.py in background before wait.sh completes
so Snuba Phase 1 (ClickHouse DB creation + bootstrap) overlaps with
the final devservices health checks — saving another ~30s.

Move bootstrap-snuba.py to .github/actions/setup-devservices/ where
master keeps it, removing the duplicate in workflows/scripts/.
test_emit shares the same trace context leak as test_gke_emit: when
test_emit_with_trace_id runs first it leaves an active SDK span, making
get_trace_id() return non-None in the next test. Wrap the emit call in
sentry_sdk.isolation_scope() across all parametrizations.

Drop GitHub issue creation from report_shuffle_failures.py entirely —
the workflow now only generates the job summary. Remove GH_TOKEN,
GH_REPO, and issues:write permission from the report job. Update tests
to match.
assert_ttls (test_buffer.py): Skip TTL=-2 (the key vanished between keys()
and ttl(), a TOCTOU race with concurrent xdist workers sharing Redis)
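A minimal sketch of that guard (FakeRedis and the function signature are illustrative stand-ins, not the real test helper):

```python
def assert_ttls(client, expected_max):
    # Redis TTL conventions: -2 means the key no longer exists, -1 means no
    # expiry. With concurrent xdist workers sharing Redis, a key can be
    # deleted in the gap between keys() and ttl() (a TOCTOU race), so a -2
    # result is skipped rather than treated as a failure.
    for key in client.keys("*"):
        ttl = client.ttl(key)
        if ttl == -2:
            continue  # key vanished between keys() and ttl(); another worker's doing
        assert 0 < ttl <= expected_max, f"{key}: ttl={ttl}"

class FakeRedis:
    """Tiny stand-in so the sketch runs without a Redis server."""
    def __init__(self, ttls):
        self._ttls = ttls
    def keys(self, pattern):
        return list(self._ttls)
    def ttl(self, key):
        return self._ttls[key]

# A key that disappeared (-2) no longer fails the assertion:
assert_ttls(FakeRedis({"a": 10, "b": -2, "c": 60}), expected_max=60)
```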

TestGetActiveOrgs.setUp (test_common.py): Wrap 100 create_organization/
create_project calls in time_machine.travel(tick=True) so each gets a
unique millisecond timestamp — class-level @freeze_time causes
MaxSnowflakeRetryError because the generator can't advance the clock
to resolve sequence exhaustion.

test_drain_shard_flush_all__upper_bound (test_outbox.py): Raise barrier
timeout from 10s to 30s for both barrier tests — 10s is too tight under
16-shard CI runner load.

test_get_span_counts_with_many_projects (sampling spans): Same
MaxSnowflakeRetryError as above — 200 create_project calls under class-
level @freeze_time. Wrap the creation loop in time_machine.travel(tick=True).

test_delete_error_events (test_project.py): Snuba eventstream deletions
are processed asynchronously. Add a retry loop (max 5s) after the
scheduled deletion runs before asserting events == 0.
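The retry loop in that last fix can be sketched as a generic polling helper (wait_until and its parameters are illustrative, not the actual helper in the test):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it is truthy or the deadline passes.

    Useful when a backing store (here: Snuba eventstream deletions) applies
    writes asynchronously, so an immediate assertion would race the store.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())  # one last check at the deadline

# Usage sketch: poll until the deleted events are gone, then assert once.
assert wait_until(lambda: True, timeout=0.5)
assert not wait_until(lambda: False, timeout=0.2, interval=0.05)
```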
test_updates_grouping_config_if_current_config_is_not_default: The
assertion 'expected_expiry - actual_expiry < 1' fails when both
timestamps are int()-truncated and land exactly 1 second apart. The
comment already says 'one-second tolerance necessary' — fix to <= 1.
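The off-by-one is easy to reproduce with int()-truncated timestamps (the values below are illustrative):

```python
# Two timestamps only 0.999s apart can truncate to ints exactly 1 apart:
expected_expiry = int(100.0)   # e.g. int(some_datetime.timestamp())
actual_expiry = int(99.001)

assert expected_expiry - actual_expiry == 1
assert not (expected_expiry - actual_expiry < 1)   # strict check: flaky failure
assert expected_expiry - actual_expiry <= 1        # tolerant check: passes
```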

test_workflow_engine_serializer: set_value('success') was called at the
top of the test, long before the HTTP request. A concurrent xdist
worker's flushdb() (from the spans buffer fixture) can clear the Redis
key in that window, making the endpoint return 'pending'. Move
set_value() to immediately before the request.

Add a TODO section to fix-flaky-tests/SKILL.md for the two remaining
unfixed flakes (test_simple CrossTransactionAssertionError, test_unmerge
times_seen drift) so future sessions can pick them up.
…n_id=2

Another test's integration can have id=2. Use self.integration.id + 1_000_000
to guarantee the ID doesn't exist regardless of test ordering.
…ring

test_drain_shard_not_flush_all__upper_bound: Use Organization(id=2) for
outbox2 so it's in a different shard. Same-org outboxes coalesce —
processing outbox1 deletes outbox2 as a side effect, breaking the
assertion. A different-shard outbox is unaffected by the shard-1 drain.
Switch back to _PairwiseSync (no timeout) since CI load can break the
1s timeout even on valid test execution.

test_basic[new_group]: Move optimize_snuba_table('events') to BEFORE
get_event_by_processing_counter('x1'). The reprocessed event x1 may
still have the old group_id in Snuba until ClickHouse processes the
replace message — optimize must run first.
…p count

@patch('sentry.conduit.tasks.time.sleep') patches time.sleep on the
global time module, causing retry sleeps from sentry.utils.retries to
accumulate in mock_sleep.call_count (giving ~2333 instead of 100).

Patching 'sentry.conduit.tasks.time' instead replaces only the local
module reference in tasks.py, so retries.py keeps using real time.sleep
and the assert mock_time.sleep.call_count == NUM_DELTAS is reliable.
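The distinction is where the patch is applied, not what is patched. A self-contained sketch of the principle (the two classes stand in for the real modules; they are not Sentry code):

```python
import time
from unittest import mock

class Tasks:
    """Stand-in for sentry.conduit.tasks: holds its own reference to time."""
    time = time

    def run(self):
        self.time.sleep(0)

class Retries:
    """Stand-in for sentry.utils.retries: uses the global time module."""
    def backoff(self):
        time.sleep(0)

# Patching time.sleep itself counts every caller, the flaky behaviour:
with mock.patch("time.sleep") as mock_sleep:
    Tasks().run()
    Retries().backoff()
    assert mock_sleep.call_count == 2  # retry sleeps pollute the count

# Patching only the local module reference counts just the code under test:
with mock.patch.object(Tasks, "time") as mock_time:
    Tasks().run()
    Retries().backoff()  # real time.sleep, not recorded
    assert mock_time.sleep.call_count == 1
```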
…ation in push_changes_timeout

TestRecalibrateOrgsTasks.setUp: Wrap 6 create_old_project calls in
time_machine.travel(tick=True) to prevent MaxSnowflakeRetryError.

test_push_changes_timeout: time.time() side_effect=[0,0,200] exhausts
when retry machinery makes extra calls. Use chain([0,0], repeat(200))
so additional calls keep returning 200 (still past timeout) without
raising StopIteration.
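The itertools fix can be checked in isolation (the mock name is illustrative):

```python
from itertools import chain, repeat
from unittest import mock

# side_effect as a plain list raises StopIteration once exhausted:
fake_time = mock.Mock(side_effect=[0, 0, 200])
assert [fake_time() for _ in range(3)] == [0, 0, 200]
raised = False
try:
    fake_time()  # a fourth call, e.g. from retry machinery
except StopIteration:
    raised = True
assert raised

# chain + repeat keeps returning the final value indefinitely:
fake_time = mock.Mock(side_effect=chain([0, 0], repeat(200)))
assert [fake_time() for _ in range(6)] == [0, 0, 200, 200, 200, 200]
```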
maxage=1 caused key 1 to expire before the assertion under slow CI
(16 parallel shards slow down execution beyond 1 second).
…orting

Re-enables pollution detection in a lightweight form (~30s per shard
instead of the previous ~10min full-suite re-run):

- After a shuffle-test failure, run the failing test in isolation
  (no shuffle). If it passes alone → mark failure.json as type="pollution".
  If it fails alone → keep type="flaky" (genuine intermittent failure).
- The existing 'report' job writes all failures to GITHUB_STEP_SUMMARY
  with a "passes in isolation (likely pollution)" label for pollution
  cases. No GitHub issues are created.

report_shuffle_failures.py: Handle pollution failures without
polluting_testid (simple isolation detection) by showing the traceback
with a "passes in isolation" label rather than requiring polluter info.
Restores the original detect_test_pollution.py-based bisection that
automatically finds the polluting test. Replaces the lightweight
isolation-only detection added in the previous commit.

The script does binary search through testids (tests that ran before
the failing one) to find the exact polluting test. Writes result to
GITHUB_STEP_SUMMARY (no issue created). Failure artifacts get
type='pollution' with polluting_testid when bisection succeeds, or
type='flaky' when the test also fails in isolation.
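The bisection itself is a standard binary search over the prefix of tests that ran before the failure. A sketch under the assumptions that exactly one polluter exists and that the failure does not reproduce with an empty prefix (the names and signature are illustrative, not detect_test_pollution.py's actual API):

```python
def find_polluter(testids, reproduces):
    """Return the test whose earlier execution makes the target fail.

    testids: tests that ran before the failing test, in order.
    reproduces(prefix): run `prefix` then the failing test; True if it fails.
    """
    lo, hi = 1, len(testids)
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(testids[:mid]):
            hi = mid       # polluter is within the first `mid` tests
        else:
            lo = mid + 1   # polluter comes after position `mid`
    return testids[lo - 1]

# Toy check: pretend "c" is the polluting test.
tests = ["a", "b", "c", "d", "e"]
assert find_polluter(tests, lambda prefix: "c" in prefix) == "c"
```

Each probe is one test-suite invocation, so the polluter is found in O(log n) runs instead of re-running the full suite.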
Documents all systemic infrastructure changes, individual test fixes grouped
by root cause, Snuba changes, workflow changes, and current state.
test_recalibrate_orgs_with_sliding_window_org: A concurrent xdist
worker's flushdb() can clear both the sliding-window cache keys and
the recalibration factor keys in the gap between the first and second
recalibrate_orgs() calls. Re-seed both sets of Redis keys before the
second invocation so the double-down assertion is robust.

test_large_digest: Using self.notification (a full pickled+compressed
Event object, potentially hundreds of KB per record) × 8192 records
pushes the DIGEST_OPEN Lua response past rb.Cluster buffer limits,
causing random truncation. Switch to a tiny string value since the
test only checks record count.
Hardcoded group_id=2 for the control events collides with group.id when
test ordering puts this test early in the session. Both events land in
the same group, the suspect-tags scorer sees no cross-group signal, and
the score collapses to 0.0 instead of 2.76. Use group.id + 1_000_000 to
guarantee no collision regardless of ordering.
sentry.utils.snuba._snuba_pool is a module-level global created at
import time from settings.SENTRY_SNUBA (default: port 1218). When the
xdist worker later overrides SENTRY_SNUBA to the per-worker snuba-gw
port (1230+), the pool still points at 1218, causing connection refused
on every eventstream.insert call on that worker.

Recreate _snuba_pool immediately after the settings override so all
subsequent Snuba requests on the worker reach the correct container.
… import

The previous attempt imported sentry.utils.snuba at configure time, which
triggered an import chain hitting sentry.api.helpers.error_upsampling before
Django settings were fully ready in the worker subprocess, causing
KeyError: 'analytics.backend' and crashing all workers.

Fixed approach: check sys.modules instead of importing the module.
- If sentry.utils.snuba is already in sys.modules (imported before
  configure_for_worker ran via conftest loading), recreate _snuba_pool
  in-place pointing at the per-worker URL.
- If it is not yet imported, the next import will read the already-updated
  settings.SENTRY_SNUBA and create the pool with the correct URL automatically.
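The sys.modules check can be sketched generically (the function name and the json demo are illustrative, not the actual conftest code):

```python
import sys

def reconfigure_if_imported(name, reconfigure):
    """Patch an already-imported module in place; otherwise do nothing.

    If the module is not yet in sys.modules, its first import will read the
    already-updated settings on its own, so no eager import is needed and no
    premature import chain is triggered.
    """
    mod = sys.modules.get(name)
    if mod is not None:
        reconfigure(mod)

import json  # guaranteed to be importable for the demo
reconfigure_if_imported("json", lambda m: setattr(m, "_demo_patched", True))
assert getattr(json, "_demo_patched", False) is True

# Not imported: the callback must not run (it would raise here if it did).
reconfigure_if_imported("not_a_real_module_xyz", lambda m: 1 / 0)
```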
test_cache_invalidation_error_handling: self.project is a lazy cached
property that initializes self.organization, self.team, and self.project
on first access. When this test runs first in a shuffled order, that
lazy init happens inside the patch(cache.delete, side_effect=Exception)
context. Organization.save() post_save fires cache.delete, which raises
the patched Exception and escapes the test. Fix: eagerly access
self.project before entering the patch context.

test_get_active_orgs_no_max_projects: Add super().setUp() so base class
initialization (including SnubaTestCase.init_snuba() and cell context
setup) runs correctly. Missing super() call left the Snuba/cell state
uninitialised in shuffle-first ordering, contributing to MaxSnowflakeRetryError.
…t_setup too

The configure-time sys.modules check sometimes misses: if sentry.utils.snuba
is first imported after configure_for_worker ran but before the settings
override took effect (an edge case in plugin ordering), the pool gets created
with port 1218 and nothing updates it afterwards.

Add _ensure_snuba_pool() called from pytest_runtest_setup (before every test)
as a safety net. It checks whether _snuba_pool already points at the correct
per-worker port and skips the recreation if so — overhead is one attribute
lookup per test. Use sentry.net.http.connection_from_url (matching what
snuba.py uses) instead of urllib3 directly.