feat(webapp): add a new backend for the realtime runs feed (#3864)
## Summary
Adds a second backend for the realtime runs feed (`useRealtimeRun`,
`subscribeToRunsWithTag`, `subscribeToBatch`), built to stay healthy
when a single busy environment has many subscribers watching many runs
at once. It is gated behind a feature flag with the existing backend as
the default, so nothing changes for users until it is enabled per
environment.
## Design
A run change is published once, as a small self-describing record, to a
single per-environment channel. Every feed is then a predicate over that
one stream rather than owning a channel:
- A per-instance router indexes the currently-held feeds by run, tag,
and batch. When a run changes it hydrates the affected rows once and
serializes them once, then fans the result to every matching feed. One
hot shared tag watched by many subscribers costs a single database query
and serialize, not one per subscriber.
- Feeds that don't match a change are never woken. Wake delivery per
environment is coalesced on a leading edge (250ms default), so a burst
of changes costs one wake, and cold reads coalesce onto a single
short-TTL-cached resolve.
- An admission gate bounds how many cold ClickHouse resolves run
concurrently, so a mass reconnect across many distinct filters queues
instead of stampeding the database.
- Changes that land while a client is between long-polls are delivered
on its next poll instead of waiting for the periodic backstop: each
environment buffers its recent change records, subscriptions linger
briefly after the last feed closes, and a newly-armed poll replays
exactly the connection's gap.
- The per-connection replay cursors behind that are shared across
instances via Redis (a single timestamp each), so a poll landing on a
different instance behind the load balancer still reads the connection's
true gap instead of falling back to a cold resolve. Cursor reads have a
bounded deadline and degrade to the cold-read path on any Redis trouble.
- Tag subscriptions with multiple tags match runs carrying all of the
tags, mirroring the existing backend's filter semantics, and live
long-polls hold for about 20 seconds to match its cadence.
- The per-environment channel supports Redis Cluster sharded pub/sub, so
the wake path scales horizontally across shards by environment.
- The backend reports its health through OpenTelemetry metrics (delivery
lag, poll resolution paths, backstop outcomes, replay and cursor-store
activity), with a provisioned Grafana dashboard for local development.
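
The router bullet can be illustrated with a minimal sketch. It is not the code in this PR: `FeedRouter`, `ChangeRecord`, `Feed`, and the `hydrate` callback are hypothetical names, and indexing a tag feed under its first tag is just one simple way to get the "all of the tags" semantics. The point is that hydration and serialization happen once per change, and only when at least one feed matches.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
type ChangeRecord = { runId: string; tags: string[]; batchId?: string };
type Feed = {
  id: string;
  runId?: string;
  tags?: string[]; // a multi-tag filter matches runs carrying ALL of these tags
  batchId?: string;
  deliver: (payload: string) => void;
};

class FeedRouter {
  private readonly byRun = new Map<string, Set<Feed>>();
  private readonly byTag = new Map<string, Set<Feed>>();
  private readonly byBatch = new Map<string, Set<Feed>>();
  private readonly hydrate: (runId: string) => unknown;

  constructor(hydrate: (runId: string) => unknown) {
    this.hydrate = hydrate;
  }

  add(feed: Feed): void {
    if (feed.runId !== undefined) this.index(this.byRun, feed.runId, feed);
    if (feed.batchId !== undefined) this.index(this.byBatch, feed.batchId, feed);
    // Index a tag feed under its first tag only; the full filter is re-checked in publish().
    const firstTag = feed.tags?.[0];
    if (firstTag !== undefined) this.index(this.byTag, firstTag, feed);
  }

  private index(map: Map<string, Set<Feed>>, key: string, feed: Feed): void {
    let set = map.get(key);
    if (!set) {
      set = new Set();
      map.set(key, set);
    }
    set.add(feed);
  }

  // Returns how many feeds were woken for this change.
  publish(change: ChangeRecord): number {
    const matched = new Set<Feed>();
    for (const f of this.byRun.get(change.runId) ?? []) matched.add(f);
    if (change.batchId !== undefined) {
      for (const f of this.byBatch.get(change.batchId) ?? []) matched.add(f);
    }
    for (const tag of change.tags) {
      for (const f of this.byTag.get(tag) ?? []) {
        if ((f.tags ?? []).every((t) => change.tags.includes(t))) matched.add(f);
      }
    }
    if (matched.size === 0) return 0; // nothing matches: no hydrate, no wake
    const payload = JSON.stringify(this.hydrate(change.runId)); // hydrate + serialize once
    for (const f of matched) f.deliver(payload);
    return matched.size;
  }
}
```

With two feeds on a shared tag, one `publish` call would invoke `hydrate` once and hand the same serialized string to both.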
Everything is behind the feature flag and tunable via env vars; the
existing backend remains the default.
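
As a rough illustration of the admission gate from the design list, the sketch below bounds how many async tasks run at once and queues the rest. `AdmissionGate` and its `limit` parameter are hypothetical names chosen for this example. The real gate, its default limit, and how it is configured are not shown in this description.

```typescript
// A minimal concurrency gate: at most `limit` tasks run at once, the rest wait in FIFO order.
class AdmissionGate {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Gate is full: park until a finishing task hands over its slot.
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) {
        next(); // transfer the slot directly, so `active` stays unchanged
      } else {
        this.active--;
      }
    }
  }
}
```

A cold resolve would then be wrapped as `gate.run(() => resolveFromClickHouse(filter))`, where `resolveFromClickHouse` is again a placeholder. A mass reconnect queues behind the gate instead of issuing every query at once.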
// "1" shares per-connection replay cursors fleet-wide via Redis, so a load-balancer hop reads the connection's true inter-poll gap instead of cold-resolving.
- // Pre-RBAC, the resource was the searchParams object itself and
- // the legacy `checkAuthorization` iterated `Object.keys`, so a
- // JWT with type-level `read:tags` (no id) granted access to the
- // unfiltered runs stream. Including `{ type: "tags" }` here
- // preserves that — per-id `read:tags:<tag>` still grants only
- // when the filter includes that tag.
+ // `{ type: "tags" }` preserves pre-RBAC type-level `read:tags` access to the unfiltered stream; per-id `read:tags:<tag>` still grants only when the filter includes that tag.