infra: agent backend cannot scale beyond a few concurrent users (#65)

## Problem

The agent backend runs on a **single uvicorn worker process** with an **in-memory checkpointer** on a **512MB Render starter instance**. This is a global bottleneck — not per-user. All concurrent users share the same event loop, the same memory pool, and the same 200-thread checkpoint limit.

Currently the app breaks at ~3 concurrent connections (#63). At production scale (100-1000 users), it would be effectively unusable.

## Architecture bottlenecks

### 1. Single worker process
- uvicorn runs with **1 worker** (default) — all requests share one Python event loop
- Each GPT-5.4 visualization call takes 10-30s
- LangGraph has synchronous sections that block the event loop
- **Throughput: ~2-6 visualization requests/minute** if calls end up effectively serialized on the one loop (60s / 30s = 2, 60s / 10s = 6)
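
The event-loop blocking described above can be reproduced in isolation. The sketch below is not taken from the agent code: `blocking_step` is a hypothetical stand-in for a synchronous section, and the heartbeat simply counts how often other coroutines get a turn while the work runs.

```python
import asyncio
import time


def blocking_step(seconds: float) -> None:
    """Hypothetical stand-in for a synchronous section (not LangGraph code)."""
    time.sleep(seconds)


async def heartbeat_during(work) -> int:
    """Count how many 10 ms heartbeats fire while `work` runs."""
    ticks = 0
    done = asyncio.Event()

    async def heartbeat() -> None:
        nonlocal ticks
        while not done.is_set():
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # give the heartbeat task a chance to start
    await work()
    done.set()
    await hb
    return ticks


async def blocked() -> None:
    # Runs on the event loop thread: no other coroutine can run meanwhile.
    blocking_step(0.2)


async def offloaded() -> None:
    # Runs in a worker thread: the event loop stays free for other requests.
    await asyncio.to_thread(blocking_step, 0.2)


async def main() -> tuple[int, int]:
    return await heartbeat_during(blocked), await heartbeat_during(offloaded)


ticks_blocked, ticks_offloaded = asyncio.run(main())
print(ticks_blocked, ticks_offloaded)
```

The blocked run lets the heartbeat fire only once, after the work has finished, while the offloaded run lets it fire many times. Whether `asyncio.to_thread` is the right fix for the actual synchronous sections depends on what they do, so treat this as a diagnostic, not a patch.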

### 2. In-memory checkpointer (BoundedMemorySaver)
- All conversation state stored in RAM — **shared global pool of 200 threads**
- FIFO eviction: after 200 conversations across ALL users, oldest threads are silently deleted
- Users lose conversation context mid-session with no error
- Not thread-safe — designed for single-process async only
- On 512MB starter plan, memory pressure builds well before 200 threads
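
To make the silent eviction concrete, here is a minimal sketch of a FIFO-bounded store. `FifoThreadStore` is a hypothetical stand-in written for this issue, not the real `BoundedMemorySaver`, whose internals may differ.

```python
from collections import OrderedDict
from typing import Any, Optional


class FifoThreadStore:
    """Hypothetical bounded in-memory store; not the real BoundedMemorySaver."""

    def __init__(self, max_threads: int) -> None:
        self.max_threads = max_threads
        self._threads: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, thread_id: str, state: Any) -> None:
        # Updating an existing key keeps its position, so this is FIFO, not LRU.
        self._threads[thread_id] = state
        while len(self._threads) > self.max_threads:
            self._threads.popitem(last=False)  # drop the oldest thread, no error

    def get(self, thread_id: str) -> Optional[Any]:
        return self._threads.get(thread_id)


store = FifoThreadStore(max_threads=200)
store.put("alice-thread", {"messages": ["hello"]})
for i in range(200):  # 200 other conversations arrive, from any users
    store.put(f"other-{i}", {"messages": []})

print(store.get("alice-thread"))  # None: the context is gone without any error
```

Because the limit is global, one user's conversation is evicted by other users' traffic, regardless of how recently that user was active.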

### 3. No backpressure or error surfacing
- When the backend is saturated, requests hang silently — no timeout, no error, no retry
- Frontend shows no indication that the agent is overloaded
- Health check at `/health` returns 200 even when the event loop is blocked
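
One way a health check could notice a blocked loop is to measure how late a periodic timer fires. The sketch below uses only `asyncio`; `LoopLagMonitor`, its interval, and its threshold are assumptions for illustration, not existing code in `apps/agent`.

```python
import asyncio
import time


class LoopLagMonitor:
    """Tracks how late a periodic timer fires; large lag means the loop was blocked."""

    def __init__(self, interval: float = 0.05) -> None:
        self.interval = interval
        self.worst_lag = 0.0

    async def run(self) -> None:
        while True:
            started = time.monotonic()
            await asyncio.sleep(self.interval)
            lag = time.monotonic() - started - self.interval
            self.worst_lag = max(self.worst_lag, lag)

    def health_status(self, threshold: float = 0.1) -> int:
        """Return 200 if the loop kept up since the last check, else 503."""
        status = 503 if self.worst_lag > threshold else 200
        self.worst_lag = 0.0  # start a fresh observation window
        return status


async def main() -> tuple[int, int]:
    monitor = LoopLagMonitor(interval=0.02)
    task = asyncio.create_task(monitor.run())
    await asyncio.sleep(0.1)
    before = monitor.health_status()
    time.sleep(0.3)  # simulate a synchronous section blocking the loop
    await asyncio.sleep(0.05)  # let the monitor observe the delay
    after = monitor.health_status()
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return before, after


before, after = asyncio.run(main())
print(before, after)
```

A handler on the same loop can only answer once the loop is free again, so this reports recent blocking after the fact; the external prober still needs its own timeout to catch a loop that is blocked right now.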

## Scale projections

| Concurrent users | Behavior |
|---|---|
| 1-5 | Works fine |
| 10-20 | Noticeable latency, requests queue |
| 50+ | Requests time out, SSE connections drop |
| 100+ | Effectively down, health checks fail, Render restarts |

## Proposed solution

### Phase 1 — Quick wins (config changes only)
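- [ ] Add `--workers 4` to uvicorn startCommand in `render.yaml`: up to ~4x throughput, but each worker is a separate process with its own `BoundedMemorySaver` and memory footprint, so a thread's requests can land on a worker that lacks its state until Phase 2 is done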
- [ ] Upgrade agent service from starter (512MB) to standard (1GB+) in `render.yaml`
- [ ] Enable rate limiting (`RATE_LIMIT_ENABLED=true`) with reasonable limits (e.g. 20 req/min per IP)
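
For reference, a limit like "20 req/min per IP" can be expressed as a sliding window. This is a sketch of the idea only; the limiter behind `RATE_LIMIT_ENABLED` may work differently, and `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allows at most `limit` requests per client within the last `window` seconds."""

    def __init__(
        self,
        limit: int = 20,
        window: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


now = [0.0]  # fake clock so the example runs instantly
limiter = SlidingWindowLimiter(limit=20, window=60.0, clock=lambda: now[0])
results = [limiter.allow("203.0.113.7") for _ in range(21)]
print(results.count(True), results[-1])  # 20 False

now[0] = 61.0
print(limiter.allow("203.0.113.7"))  # True: the window has passed
```

An in-process counter like this is per worker, so with several workers or instances the effective limit is multiplied unless the counts live in a shared store.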

### Phase 2 — Persistent checkpointer
- [ ] Replace `BoundedMemorySaver` with PostgreSQL or SQLite async checkpointer
- [ ] Conversation state survives restarts and doesn't consume RAM
- [ ] No more silent thread eviction — threads persist until explicitly cleaned up
- [ ] Render already supports managed Postgres — can add as a service in `render.yaml`
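
What "survives restarts" means in practice can be shown with the standard-library `sqlite3` module. This is an illustration of the property only: `SqliteThreadStore` is hypothetical, and the real change would swap in a LangGraph checkpointer backed by Postgres or SQLite rather than hand-rolled storage.

```python
import json
import os
import sqlite3
import tempfile
from typing import Any, Optional


class SqliteThreadStore:
    """Illustrative only; a real deployment would use a LangGraph checkpointer."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS threads "
            "(thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self._db.commit()

    def put(self, thread_id: str, state: Any) -> None:
        self._db.execute(
            "INSERT OR REPLACE INTO threads (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self._db.commit()

    def get(self, thread_id: str) -> Optional[Any]:
        row = self._db.execute(
            "SELECT state FROM threads WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def close(self) -> None:
        self._db.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "threads.db")

    first = SqliteThreadStore(path)
    first.put("alice-thread", {"messages": ["hello"]})
    first.close()  # simulate the process exiting

    second = SqliteThreadStore(path)  # simulate the process restarting
    restored = second.get("alice-thread")
    second.close()

print(restored)
```

A file-based store only helps on a single instance with a persistent disk; for Phase 4, state has to live somewhere every instance can reach, which favors Postgres.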

### Phase 3 — Error handling and backpressure
- [ ] Add frontend timeout — show error after ~30s of no response instead of hanging forever
- [ ] Add backend concurrency limit — return 503 "busy" when at capacity rather than queuing indefinitely
- [ ] Add connection health monitoring — detect dropped SSE connections and surface to user
- [ ] Reuse thread IDs per browser tab (`sessionStorage`) to avoid creating unnecessary threads
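
The backend concurrency limit above could look roughly like this. `ConcurrencyGate` and the capacity of 2 are assumptions for the sketch; the real limit would be wired into the request handler or middleware and tuned by load testing.

```python
import asyncio


class ConcurrencyGate:
    """Admits at most `capacity` requests at once; extras are rejected, not queued."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_flight = 0

    async def handle(self, work) -> tuple[int, str]:
        # No await between the check and the increment, so on a single
        # event loop this admission decision cannot interleave.
        if self.in_flight >= self.capacity:
            return 503, "busy"
        self.in_flight += 1
        try:
            return 200, await work()
        finally:
            self.in_flight -= 1


async def slow_call() -> str:
    await asyncio.sleep(0.05)  # stand-in for a long visualization call
    return "done"


async def main() -> list:
    gate = ConcurrencyGate(capacity=2)
    return await asyncio.gather(*(gate.handle(slow_call) for _ in range(3)))


results = asyncio.run(main())
print(results)  # [(200, 'done'), (200, 'done'), (503, 'busy')]
```

Rejecting immediately gives the frontend something to show and retry on, instead of a request that hangs with no signal.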

### Phase 4 — Horizontal scaling
- [ ] Use Gunicorn with uvicorn workers for proper process management
- [ ] Verify Render auto-scaling (1-3 instances) works correctly with persistent checkpointer
- [ ] Add Redis or Postgres for shared state across instances
- [ ] Load test at target concurrency (100+ users) to validate

## Related issues

- #63 — Agent stops responding after multiple concurrent tabs (symptom of this)
- #58 — Quality regression (long-running visualization calls exacerbate the single-worker bottleneck)
- #62 — Planning step before visualization (adds an extra round-trip, making concurrency even more critical)

## Key files

- `apps/agent/main.py` — uvicorn config, BoundedMemorySaver(max_threads=200)
- `apps/agent/src/bounded_memory_saver.py` — FIFO eviction logic
- `render.yaml` — Render service config (starter plan, no worker config)
- `apps/app/src/app/api/copilotkit/route.ts` — Frontend → agent connection
