infra: agent backend cannot scale beyond a few concurrent users #65

@jerelvelarde

Problem

The agent backend runs on a single uvicorn worker process with an in-memory checkpointer on a 512MB Render starter instance. This is a global bottleneck, not a per-user one: all concurrent users share the same event loop, the same memory pool, and the same checkpointer limit of 200 conversation threads.

Currently the app breaks at ~3 concurrent connections (#63). At production scale (100-1000 users), it would be effectively unusable.

Architecture bottlenecks

1. Single worker process

  • uvicorn runs with 1 worker (default) — all requests share one Python event loop
  • Each GPT-5.4 visualization call takes 10-30s
  • LangGraph has synchronous sections that block the event loop
  • Throughput: ~2-6 visualization requests/minute
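The throughput figure above follows directly from the per-call latency if each visualization call is assumed to occupy the single worker for its full duration. A minimal sketch of that arithmetic (the function name is hypothetical, and real throughput will sit somewhere above this worst case whenever calls overlap on the event loop):

```python
def requests_per_minute(latency_s: float, workers: int = 1) -> float:
    """Worst-case throughput when every call fully serializes on its worker."""
    return workers * 60.0 / latency_s


# 30s calls on one worker give the low end, 10s calls the high end.
print(requests_per_minute(30))  # 2.0
print(requests_per_minute(10))  # 6.0
```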

2. In-memory checkpointer (BoundedMemorySaver)

  • All conversation state stored in RAM, in a shared global pool of 200 conversation threads
  • FIFO eviction: after 200 conversations across ALL users, oldest threads are silently deleted
  • Users lose conversation context mid-session with no error
  • Not thread-safe — designed for single-process async only
  • On 512MB starter plan, memory pressure builds well before 200 threads
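To illustrate why FIFO eviction loses context silently, here is a minimal sketch of a bounded store with the same eviction policy. It is not the code in bounded_memory_saver.py; the class and method names are hypothetical, and only the max_threads parameter comes from this issue.

```python
from collections import OrderedDict
from typing import Any, Optional


class FifoThreadStore:
    """Illustrative stand-in for a bounded in-memory checkpointer."""

    def __init__(self, max_threads: int = 200) -> None:
        self._max_threads = max_threads
        self._threads: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, thread_id: str, state: Any) -> None:
        # A new thread arriving at capacity evicts the oldest one, no error raised.
        if thread_id not in self._threads and len(self._threads) >= self._max_threads:
            self._threads.popitem(last=False)
        # Updating an existing key keeps its original position (FIFO, not LRU).
        self._threads[thread_id] = state

    def get(self, thread_id: str) -> Optional[Any]:
        # An evicted thread looks identical to one that never existed.
        return self._threads.get(thread_id)


store = FifoThreadStore(max_threads=3)
for tid in ("a", "b", "c", "d"):
    store.put(tid, {"messages": [tid]})
print(store.get("a"))  # None
```

Because eviction is ordered by creation time rather than last use, an active conversation is evicted as soon as enough newer threads are created, which matches the mid-session context loss described above.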

3. No backpressure or error surfacing

  • When the backend is saturated, requests hang silently — no timeout, no error, no retry
  • Frontend shows no indication that the agent is overloaded
  • Health check at /health returns 200 even when the event loop is blocked
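One way to make a health check reflect saturation is to measure event-loop lag: how much later than requested the loop wakes a sleeping coroutine. The sketch below uses only asyncio; the function names and the 0.5s threshold are assumptions, not part of the current /health handler.

```python
import asyncio
import time


async def measure_loop_lag(interval: float = 0.05) -> float:
    """Return how many seconds late the loop resumed us after `interval`."""
    start = time.monotonic()
    await asyncio.sleep(interval)
    return max(0.0, time.monotonic() - start - interval)


async def health_status(max_lag: float = 0.5) -> tuple[int, str]:
    """Report 503 when the loop is too congested to schedule work on time."""
    lag = await measure_loop_lag()
    if lag > max_lag:
        return 503, f"event loop lag {lag:.2f}s"
    return 200, "ok"
```

A probe like this cannot answer while the loop is fully blocked, so it complements an external timeout on the health check rather than replacing one.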

Scale projections

| Concurrent users | Behavior |
|---|---|
| 1-5 | Works fine |
| 10-20 | Noticeable latency, requests queue |
| 50+ | Requests time out, SSE connections drop |
| 100+ | Effectively down, health checks fail, Render restarts |

Proposed solution

Phase 1 — Quick wins (config changes only)

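  • Add --workers 4 to the uvicorn startCommand in render.yaml, which raises throughput by up to ~4x. Each worker is a separate process with its own BoundedMemorySaver, so until Phase 2 lands a conversation's requests can reach a worker that does not hold its state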
  • Upgrade agent service from starter (512MB) to standard (1GB+) in render.yaml
  • Enable rate limiting (RATE_LIMIT_ENABLED=true) with reasonable limits (e.g. 20 req/min per IP)
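For reference, a limit such as 20 req/min per IP can be expressed as a sliding-window counter. This is an illustrative sketch, not the limiter behind RATE_LIMIT_ENABLED; the class name and the injectable clock are assumptions made to keep it testable.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, DefaultDict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(
        self,
        limit: int = 20,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window
        self._clock = clock
        self._hits: DefaultDict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

Like the checkpointer, an in-process limiter keeps separate counters in each worker, so with --workers 4 the effective per-IP limit can be up to four times the configured value unless the counters move to shared storage.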

Phase 2 — Persistent checkpointer

  • Replace BoundedMemorySaver with PostgreSQL or SQLite async checkpointer
  • Conversation state survives restarts and doesn't consume RAM
  • No more silent thread eviction — threads persist until explicitly cleaned up
  • Render already supports managed Postgres — can add as a service in render.yaml
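A rough sketch of what the Postgres option could look like, assuming the langgraph-checkpoint-postgres package is installed and the connection string is exposed as DATABASE_URL (both are assumptions, as are the helper names). The import is deferred so the module loads even where the package is absent.

```python
import os
from contextlib import asynccontextmanager
from typing import Mapping


def checkpointer_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Read the Postgres connection string (DATABASE_URL is an assumed name)."""
    dsn = env.get("DATABASE_URL")
    if not dsn:
        raise RuntimeError("DATABASE_URL is not set")
    return dsn


@asynccontextmanager
async def postgres_checkpointer(dsn: str):
    """Yield a persistent checkpointer to use in place of BoundedMemorySaver."""
    # Deferred import: requires the langgraph-checkpoint-postgres package.
    from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

    async with AsyncPostgresSaver.from_conn_string(dsn) as saver:
        await saver.setup()  # creates the checkpoint tables if they are missing
        yield saver
```

The yielded saver would be passed as the checkpointer when the graph is compiled, and the context would typically be held open for the lifetime of the app.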

Phase 3 — Error handling and backpressure

  • Add frontend timeout — show error after ~30s of no response instead of hanging forever
  • Add backend concurrency limit — return 503 "busy" when at capacity rather than queuing indefinitely
  • Add connection health monitoring — detect dropped SSE connections and surface to user
  • Reuse thread IDs per browser tab (sessionStorage) to avoid creating unnecessary threads
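The backend concurrency limit could be sketched as a small gate that rejects work once a fixed number of requests are in flight, instead of letting them queue. The names below (ConcurrencyGate, AtCapacity, handle) are hypothetical, and the status tuples stand in for real HTTP responses.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple


class AtCapacity(Exception):
    """Raised when the gate has no free slots."""


class ConcurrencyGate:
    """Admit at most `limit` requests at once; reject the rest immediately."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._in_flight = 0

    def __enter__(self) -> "ConcurrencyGate":
        if self._in_flight >= self._limit:
            raise AtCapacity()
        self._in_flight += 1
        return self

    def __exit__(self, *exc_info: object) -> None:
        self._in_flight -= 1


async def handle(
    gate: ConcurrencyGate, work: Callable[[], Awaitable[Any]]
) -> Tuple[int, Any]:
    """Run `work` if a slot is free, otherwise answer 503 "busy" right away."""
    try:
        with gate:
            return 200, await work()
    except AtCapacity:
        return 503, "busy"
```

Rejecting early gives the frontend a definite signal it can show or retry on, which is the behavior the 503 bullet asks for. The gate is per process, so the limit applies to each worker separately.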

Phase 4 — Horizontal scaling

  • Use Gunicorn with uvicorn workers for proper process management
  • Verify Render auto-scaling (1-3 instances) works correctly with persistent checkpointer
  • Add Redis or Postgres for shared state across instances
  • Load test at target concurrency (100+ users) to validate
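Gunicorn reads its settings from a Python file, so the process-management step might start from something like the sketch below (a hypothetical gunicorn.conf.py). The environment variable names, defaults, and timeout values are assumptions to be tuned during load testing.

```python
# gunicorn.conf.py (hypothetical): Gunicorn supervising uvicorn workers.
import os

# Listen on the port the platform provides (PORT is an assumed variable name).
bind = f"0.0.0.0:{os.environ.get('PORT', '8000')}"

# Number of worker processes; each has its own event loop and memory.
workers = int(os.environ.get("WEB_CONCURRENCY", "4"))

# Run each worker as an ASGI server via uvicorn's Gunicorn worker class.
worker_class = "uvicorn.workers.UvicornWorker"

# Workers silent for longer than this many seconds are killed and restarted.
timeout = 120

# Seconds allowed for in-flight requests to finish during a restart.
graceful_timeout = 30
```

With this file in place the service would be started with something like `gunicorn -c gunicorn.conf.py main:app`, where main:app is a placeholder for the actual ASGI application path.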

Related issues

Key files

  • apps/agent/main.py — uvicorn config, BoundedMemorySaver(max_threads=200)
  • apps/agent/src/bounded_memory_saver.py — FIFO eviction logic
  • render.yaml — Render service config (starter plan, no worker config)
  • apps/app/src/app/api/copilotkit/route.ts — Frontend → agent connection
