Commit 499060e

fix(webapp): scope leaderElection-lost recovery to reconnect strategy
The previous commit routed leaderElection(false) through handle(), which under the exit strategy schedules process.exit. In a multi-instance deployment that turns lost leader election — a normal operational state — into a restart loop: exit, supervisor restarts, election fails again, exit, and so on. Add a dedicated notifyLeaderElectionLost() on ReplicationErrorRecovery that the reconnect strategy treats as another retry trigger, while exit and log strategies no-op. Wire the wrapper services through the new method.
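
For orientation, here is a minimal sketch of the strategy dispatch this change slots into. It is not the actual implementation: the factory signature, the deps object, the strategy shape, and the backoff math are assumptions; only the notifyLeaderElectionLost body mirrors the diff below. The point it illustrates is that handle() escalates according to the configured strategy (including scheduling process.exit under "exit"), while notifyLeaderElectionLost() only ever schedules a reconnect.

// Hypothetical sketch, not the real replicationErrorRecovery.server.ts.
type RecoveryStrategy =
  | { type: "reconnect"; initialDelayMs: number }
  | { type: "exit" }
  | { type: "log" };

type ReplicationErrorRecovery = {
  handle(error: unknown): void;
  notifyStreamStarted(): void;
  notifyLeaderElectionLost(error: unknown): void;
  dispose(): void;
};

function createReplicationErrorRecovery(
  strategy: RecoveryStrategy,
  deps: {
    isShuttingDown(): boolean;
    reconnect(): Promise<void>;
    log(error: unknown): void;
  }
): ReplicationErrorRecovery {
  let attempt = 0;
  let pendingReconnect: ReturnType<typeof setTimeout> | undefined;

  function scheduleReconnect(error: unknown) {
    if (strategy.type !== "reconnect") return;
    // Assumed backoff shape: double the initial delay per failed attempt.
    const delayMs = strategy.initialDelayMs * 2 ** attempt;
    attempt += 1;
    deps.log(error);
    pendingReconnect = setTimeout(() => void deps.reconnect(), delayMs);
  }

  return {
    handle(error) {
      if (deps.isShuttingDown()) return;
      if (strategy.type === "exit") {
        deps.log(error);
        // This is the path that made routing leaderElection(false) through
        // handle() dangerous: the supervisor restarts the process, the
        // election fails again, and the instance restart-loops. The zero
        // delay here is an invented placeholder for "schedules process.exit".
        pendingReconnect = setTimeout(() => process.exit(1), 0);
      } else if (strategy.type === "reconnect") {
        scheduleReconnect(error);
      } else {
        deps.log(error);
      }
    },
    notifyStreamStarted() {
      // A healthy stream resets the backoff counter (per the diff below).
      attempt = 0;
    },
    notifyLeaderElectionLost(error) {
      if (deps.isShuttingDown()) return;
      // Losing the lock to a peer is normal; only reconnect retries.
      if (strategy.type !== "reconnect") return;
      scheduleReconnect(error);
    },
    dispose() {
      if (pendingReconnect) clearTimeout(pendingReconnect);
    },
  };
}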

3 files changed: 18 additions & 4 deletions

apps/webapp/app/services/replicationErrorRecovery.server.ts

Lines changed: 13 additions & 0 deletions
@@ -44,6 +44,11 @@ export type ReplicationErrorRecovery = {
   // Called from the replication client's "start" event handler. Resets the
   // reconnect attempt counter so the next failure starts from initialDelayMs.
   notifyStreamStarted(): void;
+  // Called from the replication client's "leaderElection" event handler with
+  // isLeader=false. Only the reconnect strategy acts on this; exit and log
+  // strategies treat losing the lock as a normal multi-instance state (an
+  // "exit" instance would otherwise restart-loop whenever a peer holds it).
+  notifyLeaderElectionLost(error: unknown): void;
   // Cancel any pending reconnect/exit timer. Called from shutdown().
   dispose(): void;
 };

@@ -145,6 +150,14 @@ export function createReplicationErrorRecovery(
         attempt = 0;
       }
     },
+    notifyLeaderElectionLost(error) {
+      if (isShuttingDown()) return;
+      // Only the reconnect strategy should react. For exit, losing the
+      // lock to a peer would otherwise trigger a restart loop. For log,
+      // we keep historical no-op semantics.
+      if (strategy.type !== "reconnect") return;
+      scheduleReconnect(error);
+    },
     dispose() {
       if (pendingReconnect) {
         clearTimeout(pendingReconnect);
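
The type comments above say when each method is meant to be called. The following wiring sketch makes that lifecycle concrete; the event names ("start", "leaderElection", "error") come from those comments, but the ReplicationClientLike interface, listener signatures, and helper names are assumptions rather than the real replication client API, and ReplicationErrorRecovery refers to the sketch type shown after the commit message.

// Minimal event-source interface so the sketch stays self-contained; the real
// replication client's surface is not shown in this diff.
interface ReplicationClientLike {
  on(event: "start", listener: () => void): void;
  on(event: "leaderElection", listener: (isLeader: boolean) => void): void;
  on(event: "error", listener: (error: unknown) => void): void;
}

function wireRecovery(
  client: ReplicationClientLike,
  recovery: ReplicationErrorRecovery
) {
  // A fresh stream means the last failure is behind us: reset the backoff.
  client.on("start", () => recovery.notifyStreamStarted());

  // Hard stream errors still go through the configured strategy
  // (reconnect, exit, or log).
  client.on("error", (error) => recovery.handle(error));

  // Losing the election is not an error. Only the reconnect strategy retries;
  // exit and log instances simply wait while a peer holds the lock.
  client.on("leaderElection", (isLeader) => {
    if (!isLeader) {
      recovery.notifyLeaderElectionLost(
        new Error("Failed to acquire replication leader lock")
      );
    }
  });
}

function shutdownRecovery(recovery: ReplicationErrorRecovery) {
  // shutdown() cancels any pending reconnect/exit timer.
  recovery.dispose();
}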

apps/webapp/app/services/runsReplicationService.server.ts

Lines changed: 4 additions & 3 deletions
@@ -289,9 +289,10 @@ export class RunsReplicationService {
       if (!isLeader) {
         // Failed leader election doesn't throw or emit an "error" event —
         // subscribe() just emits leaderElection(false), calls stop(), and
-        // returns. Nudge the recovery handler so reconnect doesn't silently
-        // stall when another instance holds the lock.
-        this._errorRecovery.handle(
+        // returns. Route through a dedicated handler so only the reconnect
+        // strategy acts; the exit strategy must not restart-loop when
+        // another instance holds the lock.
+        this._errorRecovery.notifyLeaderElectionLost(
           new Error("Failed to acquire replication leader lock")
         );
       }
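
The comment in this handler describes how subscribe() reports a failed election. A purely hypothetical sketch of that client-side path (none of these names are taken from the real client) shows why the wrapper must react to the leaderElection event rather than to a thrown error or an "error" event:

// Hypothetical internals of the client's subscribe() path, written only to
// illustrate the comment above: a failed election never throws and never
// emits "error"; it surfaces solely as leaderElection(false).
interface LeaderElectingClient {
  tryAcquireLeaderLock(): Promise<boolean>;
  emit(event: "leaderElection", isLeader: boolean): void;
  startStreaming(): Promise<void>;
  stop(): Promise<void>;
}

async function subscribe(client: LeaderElectingClient): Promise<void> {
  const isLeader = await client.tryAcquireLeaderLock();
  client.emit("leaderElection", isLeader);

  if (!isLeader) {
    // No throw, no "error" event: just stop and return, which is why the
    // wrapper's leaderElection handler is the only place a retry can start.
    await client.stop();
    return;
  }

  await client.startStreaming();
}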

apps/webapp/app/services/sessionsReplicationService.server.ts

Lines changed: 1 addition & 1 deletion
@@ -269,7 +269,7 @@ export class SessionsReplicationService {
       this.logger.info("Leader election", { isLeader });
       if (!isLeader) {
         // See RunsReplicationService for the rationale.
-        this._errorRecovery.handle(
+        this._errorRecovery.notifyLeaderElectionLost(
           new Error("Failed to acquire replication leader lock")
         );
       }
