feat: consolidate CI dashboard pages and fix data pipelines by ludamad · Pull Request #20507 · AztecProtocol/aztec-packages

ludamad · 2026-02-13T22:26:56Z

Summary

Consolidate CI Insights + Test Timings + CI Attribution into a single 3-tab CI Insights page
Remove attribution tab from Cost Overview (now only Overview + Resource Details)
Replace test-timings page with redirect to CI Insights
Fix test data pipeline: backfill historical daily stats, merge test_events for dates without daily_stats, remove noise "started" events from pub/sub
Fix cost calculation: report unknown when instance type AND vCPUs are both unknown (instead of guessing 192 vCPUs)
Add missing EC2 instance rates (m6a.xlarge, m6a.4xlarge, m6a.8xlarge, m6a.24xlarge)
Make SQLite the source of truth for CI runs (drop Redis reads)
Fix gunicorn 25.x deadlock by removing --preload

Test plan

Load /ci-insights — default "Overview" tab shows MQ chart, test outcomes, flakes table
Click "Test Details" tab — shows duration charts, test tables with 42+ dates of data
Click "Attribution" tab — shows cost by type/user/instances
Load /test-timings — redirects to /ci-insights?tab=test-details
Load /cost-overview — only Overview + Resource Details tabs
Nav bar on all pages: 3 links (cost overview, namespace billing, ci insights)
Service starts cleanly and responds to requests
Deployed and verified on ci.aztec-labs.com

- Fix subprocess race condition with fcntl file lock
- Warm billing caches on startup with --preload
- Add test timings link to all dashboard nav bars
- Reduce gunicorn workers from 100 to 50
- Add METRICS_DB_PATH env var for SQLite location
- Fix Content-Encoding stripping for proxied responses
- Kill stale ci-metrics process before restart

- Track test successes via daily aggregate table (test_daily_stats) without persisting individual passed events; backfill from existing test_events
- Fix instance type detection in log_ci_run to prefer EC2_INSTANCE_TYPE env var over metadata endpoint (which fails in Docker)
- Add CloudTrail backfill to resolve unknown instance types for historical CI runs and recalculate costs
- Add test success counts to CI Insights chart (stacked bar: successes, flakes, failures)
- Add time period metadata to all API responses and display in dashboard headers (ci-insights, cost-overview, test-timings)
- Use test_daily_stats for CI performance endpoint counts (proper aggregation across weekly/monthly granularity)
- Increase proxy timeout to 180s for slow BigQuery fetches
- Reduce ci-metrics to 1 worker to avoid redundant cache warmups

…ange

- CloudTrail resolver now joins RunInstances + CreateTags events by instance ID, then matches to ci_runs via Dashboard and Name tags instead of bare timestamp proximity
- Restore merge_train_failure_slack_notify to match base branch

The previous CloudTrail resolver had four issues causing near-zero match rates:
1. Single-pass event fetching hit the 5000-event pagination limit, missing most RunInstances events beyond ~16 days. Now fetches in daily chunks.
2. CreateTags filter discarded Name-only events (line 126 of aws_request_instance_type), losing the Name tag for ~90% of instances. Now accumulates all tags first, then filters by Group=build-instance.
3. Name tag parsing couldn't handle INSTANCE_POSTFIX suffixes (e.g. pr-123_arm64_a1-fast). Now uses regex to extract branch name regardless of postfix format.
4. Matching window was 10 minutes (only matched first CI step). Now allows 90 minutes to match all steps on an instance.

Tested against real data: resolves 4187/4638 (90%) unknown instance types across 90 days of CloudTrail history.
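The daily-chunk fetching from item 1 could look roughly like the sketch below. It only shows how a long lookback is split into one-day windows so that no single query has to page through the whole history. `daily_windows`, `fetch_all`, and the `fetch_window` callable are hypothetical names for illustration, not the resolver's actual functions.

```python
from datetime import datetime, timedelta, timezone


def daily_windows(end, days):
    """Yield contiguous (start, stop) pairs, oldest first, covering `days` days up to `end`."""
    cursor = end - timedelta(days=days)
    while cursor < end:
        stop = min(cursor + timedelta(days=1), end)
        yield cursor, stop
        cursor = stop


def fetch_all(fetch_window, end, days):
    """Call the (hypothetical) per-window fetcher once per day and concatenate the results."""
    events = []
    for start, stop in daily_windows(end, days):
        events.extend(fetch_window(start, stop))
    return events


if __name__ == "__main__":
    now = datetime(2026, 2, 13, tzinfo=timezone.utc)
    # Stand-in fetcher that returns one fake event per window.
    fake = lambda start, stop: [{"window_start": start.isoformat()}]
    print(len(fetch_all(fake, now, 90)))
```

Each window is queried independently, so a busy day can still be paginated on its own without truncating older days.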

The API was reading CI runs from a Redis+SQLite hybrid, but the hourly Redis sync used INSERT OR REPLACE which overwrote CloudTrail-enriched instance_type and cost_usd back to empty values. Now:
- get_ci_runs() reads exclusively from SQLite
- sync_ci_runs_to_sqlite() uses ON CONFLICT DO UPDATE that preserves enriched fields (only overwrites if Redis has non-empty values)
- app.py calls updated to drop unused Redis connection argument
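A minimal sketch of an upsert that preserves enriched fields, assuming a simplified `ci_runs` table. The column set, `make_db`, and `sync_run` are illustrative assumptions, not the actual schema or the body of sync_ci_runs_to_sqlite(). It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or newer).

```python
import sqlite3

# Empty strings and NULLs coming from the sync source never overwrite a stored value.
UPSERT = """
INSERT INTO ci_runs (run_id, status, instance_type, cost_usd)
VALUES (:run_id, :status, :instance_type, :cost_usd)
ON CONFLICT(run_id) DO UPDATE SET
    status = excluded.status,
    instance_type = COALESCE(NULLIF(excluded.instance_type, ''), ci_runs.instance_type),
    cost_usd = COALESCE(excluded.cost_usd, ci_runs.cost_usd)
"""


def make_db():
    """Create an in-memory database with a simplified, hypothetical ci_runs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ci_runs ("
        "run_id TEXT PRIMARY KEY, status TEXT, instance_type TEXT, cost_usd REAL)"
    )
    return conn


def sync_run(conn, run):
    """Insert a run, or update it while keeping already-enriched fields."""
    conn.execute(UPSERT, run)
    conn.commit()
```

With `INSERT OR REPLACE`, the whole row is deleted and rewritten, which is how the enriched columns were being reset; the `COALESCE(NULLIF(...))` form only replaces a column when the incoming value is non-empty.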

- Add hardcoded rates for m6a.xlarge/4xlarge/8xlarge/24xlarge that were missing, causing 192-vCPU fallback ($100+ instead of ~$8 for 8xlarge)
- Make pricing discovery dynamic: query DB for distinct instance types so newly resolved types get live pricing automatically
- Add recalculate_all_costs() to fix historical cost data

Instead of guessing 192 vCPUs (which massively overestimates), return None so the cost shows as unknown rather than a fabricated number.
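A sketch of that behavior, assuming the cost is derived from an hourly rate per instance type with a per-vCPU estimate as a fallback. The function name, the rate table, and the numbers are placeholders for illustration, not real prices or the actual implementation.

```python
from typing import Optional

# Hypothetical rates in USD per hour, for illustration only.
HOURLY_RATES = {"example.large": 0.10}
PER_VCPU_HOURLY = 0.05


def estimate_cost(instance_type: Optional[str], vcpus: Optional[int], hours: float) -> Optional[float]:
    """Return an estimated cost, or None when there is nothing to base an estimate on."""
    rate = HOURLY_RATES.get(instance_type) if instance_type else None
    if rate is not None:
        return rate * hours
    if vcpus:
        # Instance type unknown or unpriced, but the vCPU count still allows a rough estimate.
        return vcpus * PER_VCPU_HOURLY * hours
    # Both unknown: report "unknown" instead of assuming a large default machine.
    return None
```

Callers then need to treat `None` as "unknown" when summing or charting costs, rather than coercing it to zero.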

… page

Merge CI Insights + Test Timings + CI Attribution into a single 3-tab CI Insights page (Overview, Test Details, Attribution). Remove redundant cost chart and KPIs from CI Insights. Remove attribution tab from Cost Overview. Replace test-timings page with redirect. Update nav across all dashboard pages to 3 links.

gunicorn 25.x introduced a control socket that deadlocks when combined with --preload. The worker process gets stuck after fork and never serves requests. Removing --preload fixes the issue.

- Fix _backfill_daily_stats to be incremental (fills gaps instead of skipping when table is non-empty)
- Merge test_events into by_date chart so historical failed/flaked data appears even without daily_stats rows
- Call _upsert_daily_stats from sync_failed_tests_to_sqlite so synced events populate daily stats
- Stop persisting 'started' events to test_events (no duration, bloats DB)
- Remove ci:test:started from pub/sub channels (not used for stats)
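The gap-filling idea behind the incremental backfill can be sketched as follows. The two-table schema here is a deliberately simplified assumption (the real test_events and test_daily_stats tables likely carry more columns), and `make_schema` and `backfill_daily_stats` are hypothetical names.

```python
import sqlite3


def make_schema():
    """Create simplified, hypothetical versions of the two tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_events (day TEXT, test_name TEXT, status TEXT)")
    conn.execute(
        "CREATE TABLE test_daily_stats (day TEXT PRIMARY KEY, failed INTEGER, flaked INTEGER)"
    )
    return conn


def backfill_daily_stats(conn):
    """Aggregate only the days that have events but no daily stats row yet."""
    conn.execute(
        """
        INSERT INTO test_daily_stats (day, failed, flaked)
        SELECT e.day,
               SUM(e.status = 'failed'),
               SUM(e.status = 'flaked')
        FROM test_events e
        WHERE NOT EXISTS (SELECT 1 FROM test_daily_stats d WHERE d.day = e.day)
        GROUP BY e.day
        """
    )
    conn.commit()
```

Because existing rows are skipped per day rather than per table, running the backfill again is a no-op and a partially populated table no longer blocks it.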

Send Accept-Encoding: identity to ci-metrics so it returns uncompressed responses. rk.py's Flask-Compress then handles browser compression in one clean step, avoiding the deflate encoding issue that caused garbled output in browsers.

Instead of double-compression (ci-metrics compresses, requests decompresses, Flask-Compress re-compresses), pass the browser's Accept-Encoding to ci-metrics and stream raw compressed bytes back. This avoids the deflate encoding issue that caused garbled output.
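To illustrate the general failure mode, here is a small standard-library sketch of what a client sees when a gzip body is compressed a second time but decoded only once. It does not reproduce the specific deflate issue mentioned above or the rk.py proxy code; `browser_decode` is a hypothetical stand-in for a client honoring a single `Content-Encoding: gzip` header.

```python
import gzip

body = b'{"status": "ok"}'
once = gzip.compress(body)   # what the backend sends
twice = gzip.compress(once)  # what a re-compressing proxy would send


def browser_decode(payload):
    """Decode exactly one layer of gzip, as a client does for a single Content-Encoding header."""
    return gzip.decompress(payload)


if __name__ == "__main__":
    print(browser_decode(once) == body)           # single layer decodes to the original JSON
    print(browser_decode(twice)[:2] == b"\x1f\x8b")  # still gzip bytes: rendered as garbage
```

Streaming the upstream bytes through untouched keeps the body and its `Content-Encoding` header in agreement, so the client's single decode step is enough.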

Adds two test scripts hooked into ci3/bootstrap.sh test_cmds:
- test_proxy: spins up a backend Flask-Compress server and a proxy using the same raw-stream pattern as rk.py, verifying that gzip content passes through without double-compression (regression test for the garbled binary output bug)
- test_views: static checks that ci-insights.html has 3 tabs, cost-overview.html has 2 tabs (no attribution), test-timings.html redirects, and nav links are consistent across pages

ludamad added 10 commits February 13, 2026 15:21

fix: return unknown cost when instance type and vCPUs are both unknown

0245302

Instead of guessing 192 vCPUs (which massively overestimates), return None so the cost shows as unknown rather than a fabricated number.

fix: remove gunicorn --preload to prevent ci-metrics deadlock

2acea3d

gunicorn 25.x introduced a control socket that deadlocks when combined with --preload. The worker process gets stuck after fork and never serves requests. Removing --preload fixes the issue.

ludamad requested a review from charlielye as a code owner February 13, 2026 22:26

ludamad added 4 commits February 14, 2026 20:47

fix: prevent double compression in ci-metrics proxy

fbc3af9

Send Accept-Encoding: identity to ci-metrics so it returns uncompressed responses. rk.py's Flask-Compress then handles browser compression in one clean step, avoiding the deflate encoding issue that caused garbled output in browsers.

revert: remove lip-service ci-metrics tests

129fea6

ludamad wants to merge 14 commits into merge-train/spartan from ad/fix/ci-metrics-deploy-v2
