Summary
If the event bus worker fails to start (e.g., database not yet reachable via Toxiproxy during container startup), it logs the error and never retries. The event bus remains permanently non-functional until the container is restarted.
Reproduction
Using the resilience test bed (tests/resilience/):
- Start the Docker stack — occasionally, mesh starts before Toxiproxy's Postgres proxy is fully ready
- Mesh logs show:
[EventBus] Error during startup: connect ECONNREFUSED 172.19.0.6:5433
- After this, events are published to the DB but never delivered (worker is dead)
- Only a container restart fixes it
Expected Behavior
The event bus startup should retry with exponential backoff (e.g., 1s, 2s, 4s, up to 30s) until it succeeds. This is critical for:
- Container orchestration where service ordering isn't guaranteed
- Rolling deployments where the database connection pool may be temporarily saturated
- Transient network issues during startup
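The backoff schedule described above can be sketched as a small pure function. This is only an illustration: `backoffDelayMs` and its parameter names are hypothetical and not part of the mesh codebase, and the 1s base and 30s cap are taken from the example values in this issue.

```typescript
// Hypothetical helper: delay to wait after a failed attempt (0-based index).
// Doubles the base delay on each attempt and caps the result at maxDelayMs.
function backoffDelayMs(
  attempt: number,
  baseDelayMs = 1000,
  maxDelayMs = 30_000,
): number {
  return Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
}

// Delays for the first seven failed attempts.
const schedule = Array.from({ length: 7 }, (_, attempt) => backoffDelayMs(attempt));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

With these values the cap is reached on the sixth failure, so a service that stays down is polled roughly every 30 seconds from then on.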
Impact
- Silent data loss: Events are published and stored in Postgres but never delivered. No further error logs after the initial failure — the system appears healthy.
- Production risk: In Kubernetes, pod startup order is non-deterministic. If the DB connection pool is briefly saturated during a rolling deployment, the event bus dies permanently on affected pods.
Suggested Fix
In apps/mesh/src/api/app.ts (line ~788), wrap the eventBus.start() call in a retry loop:
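async function startEventBusWithRetry(eventBus, maxRetries = 10, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await eventBus.start();
      return;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      if (attempt === maxRetries - 1) {
        // Last attempt: give up without sleeping, since nothing follows the wait.
        console.error(`[EventBus] Failed to start after ${maxRetries} attempts:`, message);
        return;
      }
      // Exponential backoff: 1s, 2s, 4s, ... capped at 30s.
      const delay = Math.min(baseDelayMs * 2 ** attempt, 30_000);
      console.error(`[EventBus] Startup failed (attempt ${attempt + 1}/${maxRetries}), retrying in ${delay}ms:`, message);
      await new Promise(r => setTimeout(r, delay));
    }
  }
}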
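One possible refinement, sketched under assumptions: returning a boolean lets the caller surface a permanent failure (for example through a health check) instead of leaving it only in the logs, and an injectable sleep makes the loop testable without real timers. The `Startable` interface, `startWithRetry`, and the `sleep` parameter are hypothetical names for illustration. The real event bus type in apps/mesh may look different.

```typescript
// Assumed minimal shape of the event bus; the real type may differ.
interface Startable {
  start(): Promise<void>;
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries start() with capped exponential backoff.
// Resolves to true on success, false once every attempt has failed.
async function startWithRetry(
  bus: Startable,
  maxRetries = 10,
  baseDelayMs = 1000,
  sleep: Sleep = realSleep, // injectable so tests do not wait on real timers
): Promise<boolean> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await bus.start();
      return true;
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      if (attempt === maxRetries - 1) {
        console.error(`[EventBus] Failed to start after ${maxRetries} attempts: ${message}`);
        return false;
      }
      const delay = Math.min(baseDelayMs * 2 ** attempt, 30_000);
      console.error(
        `[EventBus] Startup failed (attempt ${attempt + 1}/${maxRetries}), retrying in ${delay}ms: ${message}`,
      );
      await sleep(delay);
    }
  }
  return false;
}

// Demo: a fake bus that fails twice and then starts, with a sleep that only records delays.
let calls = 0;
const flakyBus: Startable = {
  async start() {
    calls += 1;
    if (calls < 3) throw new Error("connect ECONNREFUSED");
  },
};
const waited: number[] = [];
const demo = startWithRetry(flakyBus, 10, 1000, async (ms) => {
  waited.push(ms);
});
```

In the demo the fake bus starts on the third call, after recorded waits of 1000ms and 2000ms. A caller that does not want to block server startup could invoke the function without awaiting it and act on the resolved boolean later.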
Found By
Resilience test bed: tests/resilience/docker-compose.yml — observed during initial container startup when Toxiproxy wasn't ready before mesh