Skip to content

feat(llm-gateway): complete Cloudflare Worker implementation (Phases 1–7)#752

Open
iscekic wants to merge 80 commits intomainfrom
feat/llm-gateway
Open

feat(llm-gateway): complete Cloudflare Worker implementation (Phases 1–7)#752
iscekic wants to merge 80 commits intomainfrom
feat/llm-gateway

Conversation

@iscekic
Copy link
Contributor

@iscekic iscekic commented Mar 3, 2026

Summary

  • Implements the llm-gateway Cloudflare Worker — a transparent drop-in replacement for the Next.js /api/openrouter/[...path]/route.ts proxy
  • Full Hono middleware chain matching the existing request flow: body parsing → auth → anonymous gate → rate limiting → provider resolution → balance/org checks → request transform → upstream proxy → background tasks
  • 168 unit tests passing, typecheck clean

Phases

Phase Commit Description
1 5a6b119 Scaffolding: package.json, wrangler.jsonc, tsconfig, Hono stub
2 775f840 Request parsing, JWT auth, anonymous gate
3 7ce0844 Rate limiting (KV sliding window), provider resolution, BYOK
4 faeacf1 Balance/org checks, request validation, request transform
5 84202a8 Upstream proxy, response handling, custom LLM (Vercel AI SDK), abuse service
6 6121e81 Background tasks: usage accounting, API metrics, request logging, abuse cost
7 b31334a 168 unit tests covering every middleware and library module

Follow-up improvements (also in this PR)

  • b07e629 — Replace all setTimeout with Workers-native scheduler.wait()
  • b5eaf47 — Convert O11Y service binding from .fetch() HTTP to RPC (ingestApiMetrics method on WorkerEntrypoint), eliminating HTTP routing overhead and admin token auth for internal calls

Architecture

  • Hono middleware chain — each concern is a named middleware using c.set()/c.get() on Hono context
  • Cloudflare bindings: Hyperdrive (Postgres), KV (rate limiting), Service binding (O11Y via RPC), Secrets Store (all API keys)
  • Platform mappings: after()ctx.waitUntil(), response.clone()ReadableStream.tee(), setTimeoutscheduler.wait(), Sentry → structured logging
  • Smart Placement — colocated with DB, not at edge

Key files

llm-gateway/
  src/
    index.ts              # Hono app with full middleware chain
    middleware/            # 14 middleware files
    handler/proxy.ts      # Upstream fetch + ReadableStream.tee() + background tasks
    background/           # usage-accounting, api-metrics, request-logging, abuse-cost
    lib/                  # Pure functions: models, providers, BYOK, custom-llm, etc.
  test/unit/              # 24 test files, 168 tests
cloudflare-o11y/
  src/index.ts            # Added ingestApiMetrics RPC method

iscekic added 11 commits March 2, 2026 20:22
Add llm-gateway Cloudflare Worker package with:
- pnpm workspace registration
- package.json, tsconfig.json, wrangler.jsonc, eslint.config.mjs
- vitest.config.ts / vitest.workers.config.ts (unit + integration)
- src/index.ts: Hono app stub returning 501 on POST /chat/completions
- src/env.ts: Cloudflare.Env type alias
- src/types.ts: HonoContext, Variables, OpenRouterChatCompletionRequest
- src/logger.ts: workers-tagged-logger setup
- worker-configuration.d.ts: type stub (replace with wrangler types output)
- Smoke test confirming worker module loads cleanly

pnpm typecheck passes, pnpm test passes.
Remove all bindings (Hyperdrive, KV, services, secrets store, vars) — these will be added incrementally as each phase needs them. Simplify worker-configuration.d.ts stub and env.ts to match.
- Use shared Hyperdrive id (624ec80...dd10) with localConnectionString inline, no dev env override
- Create and use real KV namespace id for USER_CACHE_KV (c92d83fa...)
- Secret binding name matches secret name: NEXTAUTH_SECRET_PROD (mirrors session-ingest pattern)
- Drop env.dev block entirely
- Middleware chain: request-timing, parse-body, extract-ip,
  resolve-auto-model, auth, anonymous-gate
- Routes /api/gateway/chat/completions and /api/openrouter/chat/completions
- lib/jwt.ts: verifyGatewayJwt + isPepperValid (jose ERR_JWT_EXPIRED fix)
- worker-utils: add userExistsWithCache (single source of truth for
  KV-backed user existence check); consumed by both session-ingest and llm-gateway
- llm-gateway uses shared USER_EXISTS_CACHE KV (ab836697) from session-ingest
- Unit tests: jwt, parse-body, models, kilo-auto-model (39 tests)
- lib/rate-limit.ts: KV sliding window (1h/200 free models; 24h/10k promotions)
- middleware/free-model-rate-limit.ts: 429 for Kilo free models over limit
- middleware/promotion-limit.ts: 401 for anonymous users over promotion limit
- middleware/log-free-model-usage.ts: waitUntil DB insert + KV increment
- lib/byok.ts: BYOK lookup with Web Crypto AES-256-GCM decryption
- lib/providers.ts: Provider type, PROVIDERS map, getProvider() with BYOK/custom LLM/free model routing
- lib/provider-specific.ts: applyProviderSpecificLogic() porting all provider sub-modules
- lib/tool-calling.ts: normalizeToolCallIds/repairTools/hasAttemptCompletionTool (async Web Crypto SHA-256)
- middleware/provider-resolution.ts: pre-fetches secrets in parallel, wires provider result to context
- types/hono.ts: added provider, userByok, customLlm, secrets variables
- wrangler.jsonc + worker-configuration.d.ts: RATE_LIMIT_KV + provider secrets bindings
- 54 tests passing (7 new rate-limit + 8 providers)
- lib/provider-hash.ts: async Web Crypto SHA-256 (generateProviderSpecificHash)
- lib/promotions.ts: isActiveReviewPromo, isActiveCloudAgentPromo
- lib/extract-headers.ts: getFraudDetectionHeaders, extractProjectHeaders, normalizeProjectId
- lib/prompt-info.ts: extractPromptInfo, estimateChatTokens
- lib/org-restrictions.ts: checkOrganizationModelRestrictions, getBalanceAndOrgSettings (Drizzle query, no credit expiration)
- middleware/request-validation.ts: max_tokens cap, dead model 404, rate-limited-to-death 404
- middleware/balance-and-org.ts: balance check, org model/provider restrictions, data collection gate
- middleware/request-transform.ts: safety_identifier, prompt_cache_key, tool repair, provider-specific logic, extraHeaders on context
- lib/tool-calling.ts: export ENABLE_TOOL_REPAIR
- types/hono.ts: add fraudHeaders, projectId, taskId, editorName, machineId, xKiloCodeVersion, numericKiloCodeVersion, extraHeaders
- src/index.ts: wire requestValidationMiddleware, balanceAndOrgCheckMiddleware, requestTransformMiddleware
- test/unit/provider-hash.test.ts: 6 tests
- test/unit/org-restrictions.test.ts: 11 tests
- 71 tests passing, typecheck clean
- handler/proxy.ts: upstream fetch (openRouterRequest + customLlmRequest),
  abuse classification (non-blocking, 2s timeout), 402→503 conversion,
  error logging, makeErrorReadable, free model SSE rewrite, pass-through
- lib/response-helpers.ts: getOutputHeaders, wrapResponse, makeErrorReadable
- lib/rewrite-free-model-response.ts: SSE stream transformer (eventsource-parser),
  model name rewrite, reasoning_details normalisation, cost field stripping
- lib/abuse-service.ts: classifyAbuse + reportAbuseCost via CF Access headers
- lib/custom-llm/: full port of customLlmRequest (Vercel AI SDK, Anthropic +
  OpenAI-compatible, streaming + non-streaming, reasoning, tool calling,
  temp_phase DB tracking with Web Crypto SHA-256 instead of Node.js crypto.hash)
- index.ts: replace notImplemented stub with proxyHandler
- package.json: add ai, @ai-sdk/anthropic, @ai-sdk/openai, eventsource-parser
- wrangler.jsonc + worker-configuration.d.ts: ABUSE_SERVICE_URL,
  ABUSE_CF_ACCESS_CLIENT_ID, ABUSE_CF_ACCESS_CLIENT_SECRET bindings
- All 71 existing unit tests pass, typecheck clean
…metrics, request logging, abuse cost)

- background/usage-accounting.ts: parse SSE/non-streaming response, compute cost, insert
  microdollar_usage + metadata via CTE, update kilocode_users.microdollars_used, ingest
  org token usage. Fixes duplicate parseMicrodollarUsageFromString that was left in the stub.
- background/api-metrics.ts: drain background stream to extract inferenceProvider, emit
  ApiMetricsParams to O11Y service binding via POST /ingest/api-metrics
- background/request-logging.ts: insert api_request_log for Kilo employees (@kilo.ai /
  @kilocode.ai emails or KILO_ORGANIZATION_ID)
- background/abuse-cost.ts: report upstream cost to abuse service after usage is computed
- handler/proxy.ts: rewritten to tee response.body into 4 streams (client + accounting +
  metrics + logging) using ReadableStream.tee() chains; all background tasks scheduled via
  ctx.waitUntil() with 25s internal timeout; free model path only tees for metrics
- wrangler.jsonc + worker-configuration.d.ts: add O11Y_KILO_GATEWAY_CLIENT_SECRET Secrets
  Store binding needed for O11Y service auth
… passing)

Add comprehensive unit tests covering every middleware and library module:

- middleware-chain: full e2e flow with mocked DB/KV/Secrets, health check,
  404, body validation, anonymous gate, route parity, missing IP
- anonymous-gate: free model anonymous access, paid model 401, authUser passthrough
- free-model-rate-limit: KV sliding window 200/hr, per-IP isolation, non-Kilo skip
- request-validation: max_tokens overflow, dead free models, rate-limited-to-death
- response-helpers: header whitelisting, wrapResponse, BYOK error messages
- rewrite-free-model-response: SSE cost stripping, model replacement,
  reasoning_content→reasoning conversion, JSON non-streaming path
- tool-calling: repairTools deduplication/insertion/orphan removal,
  dropToolStrictProperties, normalizeToolCallIds, hasAttemptCompletionTool
- extract-headers: fraud detection headers, project ID normalization (HTTPS/SSH git URLs)
- prompt-info: extractPromptInfo with multipart content, estimateChatTokens
- feature-detection: validateFeatureHeader with valid/invalid/null inputs
- anonymous: createAnonymousContext, isAnonymousContext type guards
- promotions: isActiveReviewPromo/isActiveCloudAgentPromo with time mocking
- abuse-service: classifyRequest/classifyAbuse/reportCost/reportAbuseCost
  with mocked fetch, error handling, CF Access headers
- request-logging: isKiloEmployee guard (@kilo.ai, @kilocode.ai, org ID),
  DB insert, error handling
- shared test helpers (helpers.ts): JWT signing, mock Env, mock KV/Secrets,
  SSE stream builders
Use the Workers-native scheduler.wait() API instead of setTimeout for all
timer-based patterns. scheduler.wait() is an awaitable alternative that
integrates properly with the Workers I/O scheduler.

Replaced in:
- handler/proxy.ts: withTimeout helper and abuse classification 2s timeout
- background/api-metrics.ts: stream read timeout during inference provider extraction
- background/usage-accounting.ts: retry backoff delay for DB concurrency failures
- worker-configuration.d.ts: added scheduler global type declaration
…etch

Add ingestApiMetrics RPC method to the O11Y worker's WorkerEntrypoint,
matching the existing ingestSessionMetrics pattern used by session-ingest.

This eliminates HTTP routing overhead, JSON serialization, and the
X-O11Y-ADMIN-TOKEN auth header for internal service-to-service calls.

Changes:
- cloudflare-o11y/src/index.ts: add ingestApiMetrics RPC method that
  validates with ApiMetricsParamsSchema and writes to Analytics Engine + Pipeline
- llm-gateway/src/o11y-binding.d.ts: declare O11YBinding type extending
  Fetcher with the ingestApiMetrics RPC method signature
- llm-gateway/src/env.ts: override Cloudflare.Env O11Y type with O11YBinding
- llm-gateway/src/background/api-metrics.ts: replace o11y.fetch() HTTP call
  with o11y.ingestApiMetrics() RPC call, remove clientSecret parameter
- llm-gateway/src/handler/proxy.ts: remove o11yClientSecretPromise pre-fetch
  and clientSecret plumbing, simplify BackgroundTaskParams o11y type
@iscekic iscekic self-assigned this Mar 3, 2026
@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 3, 2026

Code Review Summary

Status: 2 New Issues Found (101 total) | Recommendation: Address before merge

Fix these issues in Kilo Cloud

Overview

Severity Count (new) Count (total)
CRITICAL 0 4
WARNING 2 82
SUGGESTION 0 15

New Issues in This Review Pass

Issue Details (click to expand)

WARNING

File Line Issue
llm-gateway/src/background/kilo-pass.ts 510 Triple as unknown as Tx cast violates project coding rules — appears 3 times (lines 510, 512, 516)
llm-gateway/src/background/kilo-pass.ts 312 new Date(nextYearlyIssueAtIso) bypasses parseIso validation — could produce Invalid Date and corrupt issue month to "NaN-NaN-01"

Prior Review Issues (99 comments)

Many issues from prior review passes have been fixed by the author, including:

  • err.message exposed to client → now returns generic error
  • ✅ Redundant userExistsWithCache DB query → single lookup
  • null as unknown as string cast → removed
  • ✅ Missing await on applyProviderSpecificLogic → fixed
  • ✅ Dead src/types.ts file → deleted
  • as ReadableStream<Uint8Array> cast → removed
  • ✅ Null-forgiving ! operator → extracted local const
  • ✅ Unsafe as cast on HTTP status → narrowed at source
  • ✅ Race condition in KV rate limiter → replaced with Durable Object
  • zai provider double config → restructured as if/else chain
  • as cast in anonymous.ts → removed, uses flow-sensitive typing
  • ✅ Uncaught JSON.parse in parseAwsCredentials → wrapped in try/catch
  • ✅ Missing max_completion_tokens validation → now validated
  • ✅ Missing org membership check for custom LLMs → added
  • .tee() on free model path → replaced with buffer+replay
  • ✅ Non-constant-time string comparison for pepper → uses timingSafeEqual
  • NaN from parseFloat → guarded with Number.isNaN()
  • ✅ Reader not cancelled in catch block → added reader.cancel()
  • ✅ Error logging not in waitUntil → wrapped
  • ✅ Dead abuse-cost.ts file → deleted
  • ✅ Unused wrangler secrets → removed
  • ✅ 402→503 early return without background tasks → now schedules tasks first

Several issues were acknowledged as matching the reference implementation (not regressions):

  • JSON.parse without try/catch in SSE parser (matches reference)
  • JSON.parse on tool call arguments (matches reference)
  • new URL() without try/catch (matches reference)
  • Internal error message exposed via e.message (matches reference)
  • atob() without try/catch (matches reference, data from DB)
  • toLowerCase() on model ID (matches reference)

Outstanding issues from prior passes that remain open:

  • as ReadableStreamDefaultReader<Uint8Array> casts (3 locations)
  • as casts in abuse-service.ts (response.json)
  • as cast in custom-llm/index.ts line 162
  • as cast in vercel-routing.ts line 48
  • as cast in tool-calling.ts line 15
  • as cast in index.ts line 25 (useWorkersLogger)
  • Stale file header comments in usage-accounting.ts
  • abuse_classification hardcoded to 0
  • hasGenerationEndpoint: true for custom LLMs
  • Free model list duplication between models.ts and providers.ts
  • context_length mismatch for minimax model
  • Anonymous user ID embeds raw IP address
  • fetchWithBackoff retries on all 4xx (including auth failures)
  • FIRST_TOPUP_BONUS hardcoded constant
  • Inconsistent error response formats across middlewares
  • select() with no columns fetches all columns
  • Timestamp array scalability in RateLimitDO
Files Reviewed (110+ files)

New llm-gateway worker (core):

  • llm-gateway/src/index.ts - Entry point, middleware chain
  • llm-gateway/src/handler/proxy.ts - Core proxy handler
  • llm-gateway/src/handler/background-tasks.ts - Background task orchestration

Middleware:

  • llm-gateway/src/middleware/auth.ts
  • llm-gateway/src/middleware/anonymous-gate.ts
  • llm-gateway/src/middleware/balance-and-org.ts
  • llm-gateway/src/middleware/extract-ip.ts
  • llm-gateway/src/middleware/free-model-rate-limit.ts
  • llm-gateway/src/middleware/log-free-model-usage.ts
  • llm-gateway/src/middleware/parse-body.ts
  • llm-gateway/src/middleware/promotion-limit.ts
  • llm-gateway/src/middleware/provider-resolution.ts
  • llm-gateway/src/middleware/request-timing.ts
  • llm-gateway/src/middleware/request-transform.ts
  • llm-gateway/src/middleware/request-validation.ts
  • llm-gateway/src/middleware/resolve-auto-model.ts

Background tasks:

  • llm-gateway/src/background/api-metrics.ts
  • llm-gateway/src/background/kilo-pass.ts - 2 new issues
  • llm-gateway/src/background/request-logging.ts
  • llm-gateway/src/background/usage-accounting.ts

Libraries:

  • llm-gateway/src/lib/abuse-service.ts
  • llm-gateway/src/lib/anonymous.ts
  • llm-gateway/src/lib/byok.ts
  • llm-gateway/src/lib/custom-llm/index.ts
  • llm-gateway/src/lib/extract-headers.ts
  • llm-gateway/src/lib/feature-detection.ts
  • llm-gateway/src/lib/jwt.ts
  • llm-gateway/src/lib/models.ts
  • llm-gateway/src/lib/org-restrictions.ts
  • llm-gateway/src/lib/promotions.ts
  • llm-gateway/src/lib/prompt-info.ts
  • llm-gateway/src/lib/provider-hash.ts
  • llm-gateway/src/lib/provider-specific.ts
  • llm-gateway/src/lib/providers.ts
  • llm-gateway/src/lib/rate-limit.ts
  • llm-gateway/src/lib/response-helpers.ts
  • llm-gateway/src/lib/rewrite-free-model-response.ts
  • llm-gateway/src/lib/sentry.ts
  • llm-gateway/src/lib/tool-calling.ts
  • llm-gateway/src/lib/vercel-routing.ts

Durable Objects:

  • llm-gateway/src/dos/RateLimitDO.ts

Shared packages:

  • packages/worker-utils/src/o11y-schemas.ts
  • packages/worker-utils/src/user-exists-cache.ts
  • packages/worker-utils/src/index.ts

Other workers (refactored):

  • cloudflare-o11y/src/index.ts
  • cloudflare-o11y/src/api-metrics-routes.ts
  • cloudflare-o11y/src/o11y-analytics.ts
  • cloudflare-o11y/src/session-metrics-analytics.ts
  • cloudflare-session-ingest/src/env.ts
  • cloudflare-session-ingest/src/middleware/kilo-jwt-auth.ts
  • cloudflare-session-ingest/src/o11y-binding.d.ts
  • cloudflare-ai-attribution/src/ai-attribution.worker.ts
  • cloudflare-webhook-agent-ingest/src/index.ts

Config & tests: 40+ test files, config files (eslint, vitest, wrangler, tsconfig, package.json)

GIGAPOTATO_API_URL, OPENROUTER_ORG_ID, and ABUSE_SERVICE_URL were plain
env vars but should be secrets. Move them to secrets_store_secrets and
call .get() at the callsites so the Secrets Store Fetcher objects are
resolved to strings before use.
Replace .tee()-based response body splitting with a buffer-and-replay
approach for background tasks. With .tee(), all stream branches share
the same source — if any consumer (usage accounting DB writes, request
logging) stalls, backpressure propagates to the client stream, causing
periodic 5s freezes during streaming responses.

Now the pass-through path pipes upstream chunks through a TransformStream
directly to the client while accumulating a copy. Background tasks
(accounting, metrics, logging) replay the buffered data after the stream
completes, fully decoupled from client delivery.

Also: restore O11Y service binding (service name: o11y), cancel
unconsumed tee branches in free-model and fallback paths.
- Type-safe ReadableStream readers (cast to ReadableStream<Uint8Array>)
- Remove unnecessary type assertions (as string, as number, as unknown[])
- Replace non-null assertions with local variables or typeof narrowing
- Use import type for type-only imports
- Await async applyProviderSpecificLogic (was a floating promise)
- Remove redundant union type ('vercel' | string → string)
Separate RATE_LIMIT_KV from USER_EXISTS_CACHE into its own KV namespace
(llm-gateway-rate-limit, b22ee150a8fb4f63970bd3ff69f23e4d).
Move ApiMetricsParamsSchema and SessionMetricsParamsSchema from
local definitions in cloudflare-o11y and hand-maintained .d.ts files
into @kilocode/worker-utils/src/o11y-schemas.ts as the single source
of truth.

Consumers updated:
- cloudflare-o11y: imports schema + inferred types from worker-utils
- llm-gateway: o11y-binding.d.ts uses ApiMetricsParams from worker-utils
- cloudflare-session-ingest: o11y-binding.d.ts uses SessionMetricsParams
- Deleted cloudflare-o11y/src/session-metrics-schema.ts (moved)
iscekic added 2 commits March 3, 2026 03:07
…mers

Export ApiMetricsParams/SessionMetricsParams as z.input (accepts
undefined for fields with .default()) so callers don't need ?? ''
workarounds. Add ApiMetricsParamsParsed/SessionMetricsParamsParsed
(z.infer) for post-.parse() consumers in o11y analytics.
iscekic added 2 commits March 3, 2026 16:45
Reference calls usageLimitExceededResponse() which queries credit_transactions
to distinguish first-time users from returning payers:
- first-time: "Paid Model - Credits Required" + credits-required message
- returning:  "Low Credit Warning!" + add-credits message

Worker was always returning the first-time variant, misidentifying
returning users. Add hasUserMadePaidTopup() (mirrors summarizeUserPayments)
and branch on the result.
Reference uses Sentry for: invalid JSON parse errors, upstream 402
(converted to 503) and upstream 5xx responses. Worker was using
console.error/console.warn only, creating a blind spot in production.

- Add @sentry/cloudflare dependency
- Create src/lib/sentry.ts with captureException helper
- Wrap Hono app with withSentry() in index.ts (uses same DSN as reference)
- Capture invalid JSON parse exceptions in parse-body middleware
- Capture 402-to-503 conversion and upstream 5xx errors in proxy handler
- Capture unhandled errors in app.onError
@iscekic iscekic force-pushed the feat/llm-gateway branch from d6dc4f1 to 3ef5600 Compare March 3, 2026 16:18
// Port of src/lib/processUsage.ts — simplified:
// - No Sentry spans/captures (use console.error/warn)
// - No PostHog first-usage events
// - No KiloPass threshold check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: Stale comments — the file header claims "No PostHog first-usage events" and "No KiloPass threshold check", but the code below actually implements both features (PostHog events around line 729+, KiloPass threshold check around line 545+). These misleading comments should be removed or updated to reflect the current implementation.

iscekic added 8 commits March 3, 2026 18:16
The reference route.ts calls modelDoesNotExistResponse() which returns
HTTP 404 with { error: 'Model not found', message: 'The requested model
could not be found.' }. The worker was returning HTTP 400 with
{ error: 'model is required' } — wrong status code, wrong error text,
and missing message field.
- Add @sentry/cloudflare stub (no-op withSentry + captureException) to
  vitest alias config, matching the existing cloudflare:workers stub
  pattern. Fixes 4 suites that couldn't load due to transitive import.
- Fix free-model-rate-limit test: expected body.error.code but the
  middleware (correctly) returns { error: string, message: string },
  not a nested object.
- Update middleware-chain test for the 404 model-not-found change.
…ateway and /api/openrouter

The reference validates that the [...path] portion is exactly
/chat/completions and returns invalidPathResponse() (HTTP 400) for
anything else. The worker was falling through to the generic notFound
handler returning HTTP 404 { error: 'Not found' }.

Now requests to e.g. /api/gateway/other or /api/openrouter/v1/models
return HTTP 400 with the exact reference error shape. Truly unknown
paths (outside /api/gateway/ and /api/openrouter/) still return 404.
…ponse

The reference modelNotAllowedResponse() returns:
  error:   'Model not allowed for your team.'
  message: 'The requested model is not allowed for your team.'

The worker was using restrictionError.message for both fields, producing
identical values. Now the message field matches the reference.
… new users

The reference usageLimitExceededResponse() appends
'Get $20 free on your first topup!' for users with no payment history.
The worker was omitting this, showing only the generic 'you need to add
credits' text.
…ilo free models

The reference makeErrorReadable() checks whether the estimated token
count (JSON.stringify(request).length/4 + max output tokens) exceeds
the model's context_length for Kilo free models and returns a clear
user-facing message. The worker had a placeholder comment instead.

- Add context_length to KiloFreeModel type and all model entries
- Add getKiloFreeModelContextLength() lookup
- Add estimateTokenCount() matching the reference algorithm
- Implement the context-length check in makeErrorReadable()
- Destructure requestedModel (was in type but not used)
The reference returns stealthModelError() for Kilo stealth models
(inference_providers includes 'stealth'), producing
{ error: 'Stealth model unable to process request', message: same }
with the upstream status code preserved. The worker had no stealth
model handling at all.

- Add isKiloStealthModel() to models.ts
- Add stealth check as the final fallback in makeErrorReadable(),
  matching the reference ordering (BYOK → context-length → stealth)
Port the reference shouldRouteToVercel() logic that routes ~10% of
eligible non-BYOK requests to Vercel AI Gateway, increasing to ~90%
when OpenRouter error rate exceeds 50% (automatic failover).

- Add vercel-routing.ts with shouldRouteToVercel(), getGatewayErrorRate(),
  and deterministic SHA-256 hash-based routing using taskId or userId
- Add preferredModels list to models.ts (tested/recommended models)
- Wire into getProvider() step 3 (after BYOK + custom LLM, before free
  model gateway routing)
- Pass randomSeed (x-kilocode-taskid header or user.id) from
  provider-resolution middleware

Exclusions match the reference: Anthropic models, data_collection=deny,
non-preferred models, openrouter/* prefix, Kilo free models with
non-openrouter gateways.
id: 'custom',
apiUrl: customLlmRow.base_url,
apiKey: customLlmRow.api_key,
hasGenerationEndpoint: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: hasGenerationEndpoint: true is incorrect for custom LLMs.

Custom LLMs don't expose an OpenRouter-style /generation?id= endpoint. Setting this to true causes fetchGeneration in usage-accounting.ts (line 815) to make a spurious HTTP request to customLlmRow.base_url/generation?id=<messageId> after every successful custom LLM request.

While the call is wrapped in a try/catch (line 835) so it won't crash, it:

  • Adds unnecessary latency to the background accounting task
  • Sends unexpected requests to the customer's custom LLM endpoint
  • Logs a warning on every custom LLM request, creating noise
Suggested change
hasGenerationEndpoint: true,
apiKey: customLlmRow.api_key,
hasGenerationEndpoint: false,

/** RPC method called by llm-gateway via service binding. */
async ingestApiMetrics(params: ApiMetricsParams): Promise<void> {
const parsed = ApiMetricsParamsSchema.parse(params);
writeApiMetricsDataPoint(parsed, 'kilo-gateway', this.env, (p) => this.ctx.waitUntil(p));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: writeApiMetricsDataPoint is not awaited, unlike the sibling ingestSessionMetrics method (line 30) which awaits writeSessionMetricsDataPoint.

writeApiMetricsDataPoint is synchronous for the Analytics Engine write but calls waitUntil for the Stream send. Since this is an RPC method called via service binding, the caller (sendApiMetrics in api-metrics.ts) awaits the result. The method will return before the Stream send completes, which is likely fine since waitUntil extends the execution context. However, the inconsistency with ingestSessionMetrics is worth noting — if writeApiMetricsDataPoint ever becomes async (e.g., for error handling), the missing await would silently swallow errors.


const app = new Hono<HonoContext>();

app.use('*', useWorkersLogger('llm-gateway') as Parameters<typeof app.use>[1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: as Parameters<typeof app.use>[1] cast violates the project coding rule to "STRONGLY AVOID typescript's as operator".

Consider using a type-safe wrapper or satisfies instead. If the type mismatch is due to workers-tagged-logger returning a handler typed against an older Hono version, a thin adapter function would be safer:

const logger = useWorkersLogger('llm-gateway');
app.use('*', (c, next) => logger(c, next));

iscekic added 9 commits March 3, 2026 18:53
Move the 402→503 conversion after scheduleBackgroundTasks() so that
API metrics, usage accounting, and request logging are emitted even
when the upstream provider returns 402 Payment Required.

Previously the 402 check returned early before any background tasks
were scheduled, causing metrics gaps. The reference implementation
(route.ts) always calls emitApiMetricsForResponse() before the 402
check — this change matches that behavior.

Add proxy-402.test.ts with tests confirming:
- Background tasks are scheduled before the 503 response is returned
- BYOK 402 responses are NOT converted to 503 (pass through)
- Non-402 errors also schedule background tasks
The free model response path in proxy.ts was explicitly setting
accountingStream: null and loggingStream: null, skipping both usage
accounting and request logging. The reference implementation (route.ts)
calls accountForMicrodollarUsage() and handleRequestLogging() for ALL
non-error responses including free models.

Now the free model path provides accounting and logging streams gated
by !isAnon, matching the paid path behavior. Anonymous free model
requests still correctly skip accounting and logging (only metrics).

Add tests confirming:
- Authenticated free model requests get all three streams
- Anonymous free model requests get only metricsStream
…uto-models

When the client requests kilo/auto*, resolveAutoModelMiddleware mutates
requestBody.model to the resolved model (e.g., anthropic/claude-sonnet-4.6).
The API metrics code was using requestBody.model as requestedModel, which
lost the original kilo/auto* identifier.

The reference (route.ts:399) captures requestedModelLowerCased BEFORE
auto-model resolution and passes that to emitApiMetricsForResponse.

Fix: use autoModel ?? resolvedModel instead of requestBody.model. When
autoModel is set (kilo/auto*), it preserves the original identifier;
when null (non-auto requests), resolvedModel IS the requested model.

Add background-tasks.test.ts verifying both paths.
…o suffixes

The reference (route.ts:400) passes normalizeModelId(originalModelIdLowerCased)
which strips :free, :exacto etc. suffixes from the resolvedModel in API metrics.
The worker was passing resolvedModel as-is, preserving the :free suffix.

- Export normalizeModelId() from models.ts (port of src/lib/model-utils.ts)
- Replace local copy in org-restrictions.ts with the shared export
- Apply normalizeModelId() to resolvedModel in background-tasks.ts metrics

Add tests verifying :free and :exacto suffixes are stripped, and models
without colon suffixes are unchanged.
The reference (route.ts:220) awaits logFreeModelRequest() synchronously
before the upstream request, ensuring the rate-limit entry is counted
even if the upstream fails. The worker was firing the DB insert via
waitUntil() non-blocking.

Now the DB insert is awaited before next(), matching the reference.
DO increments remain non-blocking (worker-specific optimization).
DB insert failure is caught and logged but does not block the request.

Add log-free-model-usage.test.ts verifying:
- DB insert happens before the handler (next())
- Handler still runs if DB insert fails
- Non-free models skip the insert entirely
The PostHog API key was fetched via fire-and-forget .then() without
being awaited. Whether posthogApiKey was populated when background
tasks captured it into bgCommon depended on microtask ordering —
first-usage PostHog events could silently drop if the Secrets Store
round-trip hadn't completed.

Now POSTHOG_API_KEY.get() runs in parallel with ABUSE_SERVICE_URL.get()
inside a single Promise.all(), guaranteeing both are resolved before
the upstream request and bgCommon construction. Still fail-open on
error.
Reference (route.ts:276) defaults to false when transforms array is
absent. Worker was defaulting to null, which could violate a NOT NULL
constraint on the DB column.
The reference (creditTransactions.ts:58) uses notExists(kilo_pass_issuance_items)
to exclude KiloPass bonus credits from the payment count. Without this,
a user who received only KiloPass bonus credits would be classified as
a 'returning user' (no first-topup bonus shown) in the worker, but as
a 'new user' (first-topup bonus shown) in the reference.

Add the notExists subquery to hasUserMadePaidTopup, matching the
reference's summarizeUserPayments behavior.
Use the same pattern as vercel-routing.test.ts: cast globalThis to
Record<string, unknown> once and check via the alias, avoiding the
TS2339 'scheduler does not exist on typeof globalThis' error.
const db = getWorkerDb(c.env.HYPERDRIVE.connectionString);

const rows = await db
.select()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: select() with no column arguments fetches ALL columns from kilocode_users. This includes potentially large/sensitive fields that are never used by the gateway (e.g., profile data). Consider selecting only the columns actually needed downstream:

.select({
  id: kilocode_users.id,
  api_token_pepper: kilocode_users.api_token_pepper,
  google_user_email: kilocode_users.google_user_email,
  microdollars_used: kilocode_users.microdollars_used,
  total_microdollars_acquired: kilocode_users.total_microdollars_acquired,
  is_admin: kilocode_users.is_admin,
})

This reduces data transfer from Postgres and avoids leaking unnecessary PII into the request context.

let response: Response;
if (customLlm) {
const db = getWorkerDb(c.env.HYPERDRIVE.connectionString);
const isLegacyExtension = !!fraudHeaders.http_user_agent?.startsWith('Kilo-Code/');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: fraudHeaders is read from context here (line 76), but it was set by requestTransformMiddleware which runs before proxyHandler. However, isLegacyExtension checks fraudHeaders.http_user_agent — if requestTransformMiddleware ever fails to set fraudHeaders, this would throw a runtime error accessing .http_user_agent on undefined. The middleware chain ordering is correct today, but this implicit dependency is fragile. Consider adding a null check:

const isLegacyExtension = !!fraudHeaders?.http_user_agent?.startsWith('Kilo-Code/');

requestedModel: resolvedModel,
request: requestBody,
response: new Response(errorBodyBytes, response),
isUserByok: !!userByok,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: new Response(errorBodyBytes, response) passes the original response as the ResponseInit. The Response constructor accepts { status, statusText, headers } from the second argument. Since response is a full Response object (not a plain ResponseInit), this relies on the runtime extracting only the relevant fields. While this works in practice on Cloudflare Workers, it's an implicit behavior. Consider being explicit:

new Response(errorBodyBytes, { status: response.status, statusText: response.statusText, headers: response.headers })

function extractMessageTextParts(content: unknown): string[] {
if (typeof content === 'string') return [content];
if (!Array.isArray(content)) return [];
return (content as Array<Record<string, unknown>>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: as Array<Record<string, unknown>> cast violates the project coding rule to "STRONGLY AVOID typescript's as operator". Since content is already narrowed to Array by the Array.isArray(content) check, you can use a type guard in the filter predicate without the cast:

return content
  .filter(
    (part): part is { type: string; text: string } =>
      part !== null &&
      typeof part === 'object' &&
      ('type' in part) &&
      (part.type === 'input_text' || part.type === 'output_text') &&
      ('text' in part) &&
      typeof part.text === 'string'
  )
  .map(p => p.text);

console.debug('[getGatewayErrorRate] query timeout');
return fallback;
}
const rows = result.rows as unknown as Array<{ gateway: string; errorRate: number }>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: as unknown as Array<{ gateway: string; errorRate: number }> is a double as cast that violates the project coding rules. The db.execute generic parameter already types the rows, so this cast is unnecessary. Consider:

const rows = result.rows;
return {
  openrouter: rows.find(r => r.gateway === 'openrouter')?.errorRate ?? 0,
  vercel: rows.find(r => r.gateway === 'vercel')?.errorRate ?? 0,
};

If the types don't align, use satisfies or a runtime check instead of as unknown as.

const providerConfig: OpenRouterProviderConfig = {};

if (params.organizationPlan === 'enterprise' && providerAllowList.length > 0) {
const requiredProviders = extraRequiredProviders(normalizedModelId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: extraRequiredProviders is called with normalizedModelId (which strips the colon suffix via normalizeModelId), but the kiloFreeModelProviders map keys include colons (e.g., corethink:free, x-ai/grok-code-fast-1:optimized:free). This means the lookup will NEVER match for models like corethink:free (normalized to corethink) or x-ai/grok-code-fast-1:optimized:free (normalized to x-ai/grok-code-fast-1). The provider restriction check for enterprise orgs will silently skip these models.

While the previous reviewer noted this is "safe because the default behavior is permissive", this means enterprise orgs with a provider_allow_list that excludes corethink or stealth will NOT be blocked from using those free models — the restriction is silently bypassed. This could be a compliance issue for enterprise customers who expect their allow list to be enforced.

): Promise<Response> {
const maxElapsedMs = 20_000;
const startedAt = Date.now();
let nextDelayMs = 200 * (1 + (Math.random() - 0.5) / 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: fetchWithBackoff retries on ALL 4xx responses (r.status >= 400), including 401 (auth failure) and 403 (forbidden). Auth failures will never succeed on retry. Consider only retrying on 404 and 5xx:

r => r.status === 404 || r.status >= 500

requested_model: usageContext.requested_model,
cache_discount: usageStats.cacheDiscount_mUsd ?? null,
has_error: usageStats.hasError,
abuse_classification: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: abuse_classification is hardcoded to 0 here, even though usageContext.abuse_request_id is available. The reference implementation stores the abuse classification request ID in this field. This means the abuse_classification column in microdollar_usage will always be 0, making it impossible to correlate usage records with abuse classifications in the database.

Suggested change
abuse_classification: 0,
abuse_classification: usageContext.abuse_request_id ?? 0,

}

// 3. Vercel AI Gateway A/B routing (non-BYOK, non-custom-LLM)
if (await shouldRouteToVercel(db, requestedModel, request, randomSeed)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: shouldRouteToVercel is called with requestedModel (already lowercased by parseBodyMiddleware), but preferredModels in models.ts contains mixed-case entries like anthropic/claude-opus-4.6. The preferredModels.includes(requestedModel) check in vercel-routing.ts:109 will fail for any model whose preferredModels entry uses different casing than the lowercased requestedModel. Currently all entries happen to be lowercase, but this is fragile — adding a model with uppercase characters to preferredModels would silently break Vercel routing for it.

}

function isToolMessage(msg: ChatMessage): msg is ToolMessage {
return msg.role === 'tool' && typeof (msg as Record<string, unknown>).tool_call_id === 'string';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: as Record<string, unknown> cast violates the project coding rules. This can be replaced with an in check:

Suggested change
return msg.role === 'tool' && typeof (msg as Record<string, unknown>).tool_call_id === 'string';
return msg.role === 'tool' && 'tool_call_id' in msg && typeof msg.tool_call_id === 'string';

iscekic added 3 commits March 3, 2026 19:26
Switch from the shared Next.js Sentry DSN to a dedicated llm-gateway
project. Enable sendDefaultPii for automatic IP collection on events.
},
{
public_id: 'minimax/minimax-m2.5:free',
context_length: 204_800,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: context_length mismatch — minimax/minimax-m2.5:free is 204_800 here but 1_000_000 in providers.ts. The value here feeds getKiloFreeModelContextLength()makeErrorReadable(), which will incorrectly report context-length errors at ~205K tokens when the model actually supports 1M tokens. This will cause false "context length exceeded" error messages for users with large prompts between 205K and 1M tokens.

Suggested change
context_length: 204_800,
context_length: 1_000_000,

} catch (err) {
if (err instanceof Error) {
// jose uses error.code for JWT-specific errors
if ((err as { code?: string }).code === 'ERR_JWT_EXPIRED') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION]: as cast violates the project coding rule to "STRONGLY AVOID typescript's as operator". Since err is already narrowed to Error by the instanceof check, you can use an in guard instead:

Suggested change
if ((err as { code?: string }).code === 'ERR_JWT_EXPIRED') {
if ('code' in err && err.code === 'ERR_JWT_EXPIRED') {

export const extractIpMiddleware = createMiddleware<HonoContext>(async (c, next) => {
// CF-Connecting-IP is the authoritative source on Cloudflare Workers
const cfIp = c.req.header('CF-Connecting-IP');
const xffIp = c.req.header('x-forwarded-for')?.split(',')[0]?.trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: x-forwarded-for is trivially spoofable by clients. On Cloudflare Workers, CF-Connecting-IP is the only trustworthy source — it's set by Cloudflare's edge and cannot be forged. If CF-Connecting-IP is absent, the request likely didn't traverse Cloudflare's edge (e.g., direct-to-origin), so trusting x-forwarded-for means an attacker could spoof their IP to bypass rate limiting (freeModelRateLimitMiddleware, promotionLimitMiddleware) and abuse detection.

Consider either:

  • Rejecting requests without CF-Connecting-IP (they shouldn't reach a Workers deployment normally)
  • Or at minimum, logging a warning when falling back to x-forwarded-for so it's visible in monitoring

iscekic added 2 commits March 3, 2026 19:51
- Include test/integration/**/*.test.ts in Vitest config and remove
  the previous exclusion to enable running integration tests
- Add POSTHOG_API_KEY to test makeEnv helper for analytics-dependent
  tests
const effectiveThreshold = getEffectiveKiloPassThreshold(user.kiloPassThreshold ?? null);
if (effectiveThreshold === null || user.microdollarsUsed < effectiveThreshold) return;

const subscriptionState = await getKiloPassStateForUser(tx as unknown as Tx, kiloUserId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: Triple as unknown as Tx cast violates the project coding rule to "STRONGLY AVOID typescript's as operator".

This pattern appears three times (lines 510, 512, 516). The Tx type is extracted via a conditional type from WorkerDb['transaction'], but the Drizzle transaction callback parameter should already be compatible. Consider defining Tx directly from the transaction callback's parameter type, or using a wrapper function that accepts the transaction and calls the inner functions, avoiding the cast entirely:

type Tx = Parameters<Parameters<WorkerDb['transaction']>[0]>[0];

If the types genuinely don't align, a single cast at the db.transaction call site (wrapping the callback) would be cleaner than casting on every usage.

if (!anchor) return null;
// currentPeriodStart = nextYearlyIssueAt - 1 month (or startedAt)
const currentPeriodStart = nextYearlyIssueAtIso
? addMonths(new Date(nextYearlyIssueAtIso), -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[WARNING]: new Date(nextYearlyIssueAtIso) bypasses the parseIso validation used on line 308.

If nextYearlyIssueAtIso is a truthy but malformed string (e.g., "not-a-date"), parseIso on line 308 returns null and anchor falls through to startedAtIso. But on line 311, the truthy check passes and new Date(nextYearlyIssueAtIso) produces Invalid Date, which propagates through addMonthscomputeIssueMonth producing "NaN-NaN-01" as the issue month.

Use the already-validated result instead:

Suggested change
? addMonths(new Date(nextYearlyIssueAtIso), -1)
? addMonths(parseIso(nextYearlyIssueAtIso) ?? anchor, -1)

Or reuse the parseIso result from line 308 by extracting it into a local variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant