This document explains what happens after a load-balanced endpoint is deployed on Runpod and is actively running. It covers container startup, request flows, execution patterns, and the security model.
When you deploy a load-balanced Endpoint with `flash build` and `flash deploy`:
```mermaid
graph TD
    A["User Code"] -->|flash build| B["Package Application"]
    B -->|Manifest + Handlers| C["flash deploy"]
    C -->|Upload + Provision| D["Runpod Container"]
    D --> E["FastAPI Server (uvicorn port 8000)"]
    E --> F["Load routes from manifest"]
    F --> G["Endpoint Ready"]
```
Container setup:

- Base image: `runpod/worker-flash-lb:latest` (GPU) or `runpod/worker-flash-lb-cpu:latest` (CPU); contains FastAPI, uvicorn, and Flash runtime dependencies
- Entrypoint: loads the manifest and starts the FastAPI server on port 8000
- Runpod exposes this via an HTTPS endpoint URL
- Health check: Runpod polls `/ping` every 30 seconds
- Environment: `FLASH_MODULE_PATH` is injected automatically, `RUNPOD_API_KEY` is injected when `makes_remote_calls=True`, plus any explicit `env={}` vars
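As a quick sanity check of the injected environment, a handler can read these variables at runtime. A minimal sketch (`flash_env_summary` is a hypothetical helper name; the variable names are the ones listed above):

```python
import os

def flash_env_summary() -> dict:
    """Report the Flash-injected env vars described above."""
    return {
        "module_path": os.environ.get("FLASH_MODULE_PATH"),
        "has_api_key": "RUNPOD_API_KEY" in os.environ,
    }
```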
The runtime loads your application from the manifest and registers routes:
```python
# user code
from runpod_flash import Endpoint

api = Endpoint(name="my-api", cpu="cpu3c-4-8", workers=(1, 5))

@api.post("/api/process")
async def process_data(x: int, y: int):
    return {"result": x + y}
```

At runtime, Flash's LB handler discovers this route from the manifest and registers it with FastAPI. The resulting server handles:

- `POST /api/process` -> calls `process_data(x, y)`
- `GET /ping` -> health check (built-in)
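The registration step can be pictured as a plain route table: manifest entries keyed by method and path, mapped onto handler coroutines. This is an illustration of the idea only, not Flash's actual loader:

```python
import asyncio

# Hypothetical route table: (method, path) pairs from the manifest
# mapped onto handler coroutines.
async def process_data(x: int, y: int):
    return {"result": x + y}

ROUTES = {("POST", "/api/process"): process_data}

async def dispatch(method: str, path: str, payload: dict):
    """Look up the handler for a request and call it with the JSON body."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, {"detail": "Not Found"}
    return 200, await handler(**payload)
```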
When a client makes an HTTP request to a deployed LB endpoint:
```mermaid
sequenceDiagram
    participant Client
    participant Runpod as Runpod Router
    participant Container as Endpoint Container
    participant FastAPI
    participant UserFunc as User Function
    Client->>Runpod: HTTPS POST /api/process
    Runpod->>Container: Forward to port 8000
    Container->>FastAPI: HTTP POST /api/process
    FastAPI->>FastAPI: Match route
    FastAPI->>UserFunc: Call process_data(x=5, y=3)
    UserFunc-->>FastAPI: Return {"result": 8}
    FastAPI-->>Container: HTTP 200 response
    Container-->>Runpod: Response body
    Runpod-->>Client: HTTPS response
```
Example:

```python
# user code
api = Endpoint(name="my-api", cpu="cpu3c-4-8", workers=(1, 5))

@api.post("/api/process")
async def process_data(x: int, y: int):
    return {"result": x + y}

# client request
# POST https://{endpoint-id}.api.runpod.ai/api/process
# Content-Type: application/json
# {"x": 5, "y": 3}
#
# response: {"result": 8}
```

LB endpoints use subdomain-based URLs:

```
https://{endpoint-id}.api.runpod.ai/{path}
```

This differs from QB endpoints, which use path-based URLs:

```
https://api.runpod.ai/v2/{endpoint-id}/runsync
```
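The two schemes can be captured in a pair of string helpers (hypothetical function names; the URL formats are the ones shown above):

```python
def lb_url(endpoint_id: str, path: str) -> str:
    # Subdomain-based URL used by load-balanced (LB) endpoints.
    return f"https://{endpoint_id}.api.runpod.ai/{path.lstrip('/')}"

def qb_url(endpoint_id: str) -> str:
    # Path-based URL used by queue-based (QB) endpoints.
    return f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
```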
The `/execute` endpoint accepts and runs arbitrary Python code. It exists only during local development (`flash run`).
In local development (`flash run`):

- `/execute` is available for Flash's remote code execution protocol
- Code originates from your own `Endpoint`-decorated functions
- Safe because only you can run code locally

In deployed endpoints (`flash deploy`):

- `/execute` is not exposed
- Only user-defined routes and `/ping` are available
- No arbitrary code execution is possible
Why this matters: an exposed `/execute` endpoint would let anyone with network access run arbitrary Python code on your infrastructure, including executing system commands and stealing credentials.
```
Request 1 (POST /api/process) -> Worker 1
Request 2 (POST /api/users)   -> Worker 1 (concurrent, async)
Request 3 (POST /api/health)  -> Worker 2 (new worker)
```
Runpod scales workers based on the REQUEST_COUNT scaler:

- When active requests exceed `scaler_value`, new workers spin up
- When requests drop, workers scale down after `idle_timeout`
- Async functions can handle multiple requests concurrently within a single worker
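The scaling rule can be illustrated with a toy calculation: one worker per `scaler_value` active requests, clamped to the `workers=(min, max)` range configured on the endpoint. This is a sketch of the REQUEST_COUNT idea, not Runpod's actual autoscaler:

```python
import math

def desired_workers(active_requests: int, scaler_value: int,
                    min_workers: int, max_workers: int) -> int:
    # One worker per `scaler_value` active requests, clamped to the
    # endpoint's configured (min, max) worker range.
    needed = math.ceil(active_requests / scaler_value) if active_requests else 0
    return max(min_workers, min(max_workers, needed))
```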
Async concurrency example:

```python
@api.post("/api/process")
async def process_data(x: int):
    import asyncio
    await asyncio.sleep(10)  # simulate work
    return {"result": x}

# 5 concurrent requests:
# requests 1-3: concurrent on worker 1 (async)
# requests 4-5: concurrent on worker 2 (new worker)
# all 5 complete in ~10s
```

Synchronous functions block the worker thread and are processed one at a time per worker.
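The "all 5 complete in ~10s" behavior comes straight from asyncio: awaiting a sleep yields the event loop so other requests can run. A scaled-down, standalone demo of the same effect, using 0.1 s sleeps instead of 10 s (`run_concurrently` is a name invented for this sketch):

```python
import asyncio
import time

async def process_data(x: int):
    await asyncio.sleep(0.1)  # stand-in for the 10 s of simulated work
    return {"result": x}

async def run_concurrently(n: int):
    # Launch n handler calls at once; total time is ~one sleep, not n sleeps.
    start = time.perf_counter()
    results = await asyncio.gather(*(process_data(i) for i in range(n)))
    return results, time.perf_counter() - start
```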
```
POST https://{endpoint-id}.api.runpod.ai/api/users
{"invalid json

# Response: 422 Unprocessable Entity
```

```python
@api.post("/api/users")
async def create_user(name: str):
    if not name:
        raise ValueError("Name required")
    return {"id": 1, "name": name}

# POST with {"name": ""}
# Response: 500 Internal Server Error
```

```
GET https://{endpoint-id}.api.runpod.ai/ping
# 200 OK: {"status": "healthy"}

# Runpod polls /ping every 30 seconds
# 200 OK      -> worker healthy
# non-200     -> worker unhealthy, may be replaced
# no response -> worker down, replaced
```
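Those polling rules amount to a three-way classification, which can be written out as a tiny function (illustrative only, not Runpod's implementation; `None` models "no response"):

```python
def worker_state(status_code):
    """Classify a /ping result per the health-check rules above."""
    if status_code is None:
        return "down"        # no response -> worker replaced
    if status_code == 200:
        return "healthy"
    return "unhealthy"       # non-200 -> may be replaced
```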
Direct HTTP request latency (no-op function):

```
client -> Runpod router: 10-50ms
FastAPI routing:          1-5ms
function execution:       variable
response:                10-50ms
total:                   ~30-110ms
```
- FastAPI app baseline: ~50-100MB
- Per function in namespace: ~0.5-5MB
- Runpod allocates based on pod type
- Runpod has limits on request body size
- Consider streaming for large payloads
Container logs include:
- Request arrival and route matching
- Function execution and errors
- Response generation
View logs in the Runpod console or via `runpod-cli logs <endpoint-id>`.
"Connection refused"
- Container not running or uvicorn failed to start
- Check container logs
"Timeout"
- Function took too long
- Increase `execution_timeout_ms` on the endpoint
"500 Internal Server Error"
- Function raised an exception
- Check container logs for the traceback
"404 Not Found"
- Route not registered in manifest
- Verify route paths in your code
- Load-Balanced Endpoints -- user guide for LB endpoints
- Load Balancer Endpoints (Internal) -- provisioning architecture
- Flash SDK Reference -- complete API reference