Skip to content

Event bus worker does not retry on startup failure #2994

@tlgimenes

Description

@tlgimenes

Summary

If the event bus worker fails to start (e.g., database not yet reachable via Toxiproxy during container startup), it logs the error and never retries. The event bus remains permanently non-functional until the container is restarted.

Reproduction

Using the resilience test bed (tests/resilience/):

  1. Start the Docker stack — occasionally, mesh starts before Toxiproxy's Postgres proxy is fully ready
  2. Mesh logs show: [EventBus] Error during startup: connect ECONNREFUSED 172.19.0.6:5433
  3. After this, events are published to the DB but never delivered (worker is dead)
  4. Only a container restart fixes it

Expected Behavior

The event bus startup should retry with exponential backoff (e.g., 1s, 2s, 4s, up to 30s) until it succeeds. This is critical for:

  • Container orchestration where service ordering isn't guaranteed
  • Rolling deployments where the database connection pool may be temporarily saturated
  • Transient network issues during startup

Impact

  • Silent data loss: Events are published and stored in Postgres but never delivered. No further error logs after the initial failure — the system appears healthy.
  • Production risk: In Kubernetes, pod startup order is non-deterministic. If the DB connection pool is briefly saturated during a rolling deployment, the event bus dies permanently on affected pods.

Suggested Fix

In apps/mesh/src/api/app.ts (line ~788), wrap the eventBus.start() call in a retry loop:

async function startEventBusWithRetry(eventBus, maxRetries = 10, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      await eventBus.start();
      return;
    } catch (error) {
      const delay = Math.min(baseDelayMs * 2 ** attempt, 30_000);
      console.error(`[EventBus] Startup failed (attempt ${attempt + 1}/${maxRetries}), retrying in ${delay}ms:`, error.message);
      await new Promise(r => setTimeout(r, delay));
    }
  }
  console.error("[EventBus] Failed to start after all retries");
}

Found By

Resilience test bed: tests/resilience/docker-compose.yml — observed during initial container startup when Toxiproxy wasn't ready before mesh

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions