feat(routing): add canonical model pools and admin controls#352
feat(routing): add canonical model pools and admin controls#352vfeitoza wants to merge 1 commit into
Conversation
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR adds canonical model pool routing and operator controls. The main changes are:
Confidence Score: 2/5This should be fixed before merging.
Focus on Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Client request model] --> B[ResolveExecutionSelectorWithContext]
B --> C[Composed canonical resolver]
C --> D[Weighted counter advances]
D --> E[Execution selector chosen]
E --> F[ResolveRequestModelWithAuthorizer]
F --> G[Canonical resolver called again]
G --> H[Weighted counter advances again]
H --> I[Metadata and fallbacks may describe different candidate]
|
| "recommendation", "rebuild with -tags=swagger") | ||
| } | ||
|
|
||
| routingResolver := routing.NewResolver(appCfg.Routing, app.routingState.Service).WithRuntime(providerResult.Registry) |
There was a problem hiding this comment.
When routing.model_pools is empty, routing.NewResolver returns nil, and this line immediately calls WithRuntime on that nil value. Existing configs that do not opt into canonical pools will panic during app startup before the server is created.
Context Used: CLAUDE.md (source)
| return nil | ||
| } | ||
|
|
||
| cfg.Defaults.Strategy = ResolveRoutingStrategy(cfg.Defaults.Strategy) | ||
| if !cfg.Defaults.Strategy.Valid() { | ||
| return fmt.Errorf("routing.defaults.strategy must be one of: priority_failover, weighted_round_robin") | ||
| } | ||
| if cfg.Defaults.SessionAffinityTTL <= 0 { | ||
| cfg.Defaults.SessionAffinityTTL = 30 * time.Minute | ||
| } |
There was a problem hiding this comment.
Partial defaults disable failover
The loader restores durations, attempts, and retry statuses when they are omitted, but it does not restore boolean defaults. If a config file includes a partial routing.defaults block, YAML zero values can turn session_affinity, failover.enabled, or retry_on_model_errors from their documented defaults of true into false, so pool failover or affinity can be silently disabled by an otherwise unrelated routing setting.
Context Used: CLAUDE.md (source)
| var canonicalResolution *core.CanonicalRoutingResolution | ||
| if canonicalResolver, ok := resolver.(interface { | ||
| ResolveWithContext(context.Context, core.RequestedModelSelector) (*core.CanonicalRoutingResolution, bool, error) | ||
| }); ok { | ||
| canonicalResolution, _, err = canonicalResolver.ResolveWithContext(ctx, requested) | ||
| if err != nil { |
There was a problem hiding this comment.
ResolveExecutionSelectorWithContext already calls the composed resolver and may mutate weighted-round-robin counters to choose the execution selector. This second canonical resolve can choose a different candidate for weighted pools and advances the counter again, so the request can execute against candidate A while EffectiveCandidate, fallbacks, and selected provider metadata describe candidate B.
Context Used: CLAUDE.md (source)
| if l.resolver == nil { | ||
| continue | ||
| } | ||
| resolution, matched, err := l.resolver.Resolve(core.NewRequestedModelSelector(canonical, "")) |
There was a problem hiding this comment.
ExposedModelsFiltered uses the live resolver to decide which canonical model to expose. For weighted_round_robin, Resolve increments the resolver counters, so every GET /v1/models poll consumes routing slots and changes which backend future inference requests will hit. A read-only model listing endpoint should not mutate runtime selection state.
Context Used: CLAUDE.md (source)
| case KindPoolCandidate: | ||
| if entry.ProviderName == "" || entry.Model == "" { | ||
| return Entry{}, newValidationError("provider_name and model are required for pool_candidate", nil) | ||
| } | ||
| if entry.Key == "" { | ||
| entry.Key = entry.ProviderName + "/" + entry.Model | ||
| } |
There was a problem hiding this comment.
Pool-candidate state is keyed only as provider/model, without the canonical model name. If two canonical pools reuse the same exact provider/model candidate, disabling that candidate for one pool disables it for every pool. That breaks the advertised per-pool candidate control and can take unrelated canonical models out of rotation.
Context Used: CLAUDE.md (source)
| if resolver := routing.NewResolver(h.routingConfig, h.routingState); resolver != nil { | ||
| resolver = resolver.WithRuntime(h.registry) | ||
| if resolution, matched, err := resolver.Resolve(core.NewRequestedModelSelector(canonicalName, "")); err == nil && matched && resolution != nil { | ||
| effectiveCandidate = resolution.Primary.QualifiedModel() | ||
| effectiveProviderName = resolution.Primary.Provider | ||
| } |
There was a problem hiding this comment.
The admin endpoint creates a fresh resolver each time it builds the pool response. For weighted pools, that resolver has empty counters and no shared affinity state, so it reports the deterministic first candidate rather than the candidate the live resolver would actually select. The dashboard can mark the wrong row as the effective candidate.
Context Used: CLAUDE.md (source)
📝 WalkthroughWalkthroughAdds canonical model pool routing: config + validation, resolver and selection strategies (priority failover / weighted round robin), session affinity, failover policy, runtime-based eligibility, persistent routing state with multi-backend stores, admin APIs and dashboard UI, and integration into request resolution and fallback flows. ChangesCanonical Routing Feature
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
7e1d804 to
779fa78
Compare
There was a problem hiding this comment.
Actionable comments posted: 25
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/admin/dashboard/static/js/modules/aliases.js (1)
42-78: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winExtract repeated map lookup to avoid redundant calls.
routingCandidates.get(this.qualifiedModelName(model))is called 13 times per model row. Extract the result once to improve performance and readability.♻️ Proposed refactor
const rows = this.models.map((model) => ({ + const routingEntry = routingCandidates.get(this.qualifiedModelName(model)) || {}; key: 'model:' + this.qualifiedModelName(model), display_name: this.qualifiedModelName(model), secondary_name: '', provider_name: model.provider_name || '', provider_type: model.provider_type || '', model: model.model, is_alias: false, alias: null, access: model && model.access ? model.access : null, kind_badge: '', masking_alias: null, alias_state_class: '', alias_state_text: '', - canonical_model: (routingCandidates.get(this.qualifiedModelName(model)) || {}).canonical_model || '', - routing_state: (routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || null, - routing_pool: (routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || null, - canonical_enabled: (routingCandidates.get(this.qualifiedModelName(model)) || {}).pool ? ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool.enabled !== false) : true, - canonical_status: ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).status || '', - canonical_reason: ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).status_reason || '', - routing_strategy: ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).strategy || '', - candidate_priority: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).priority ?? null), - candidate_weight: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).weight ?? null), - is_config_primary: Boolean(((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).is_config_primary), - is_effective_candidate: Boolean(((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).is_effective_candidate), - effective_candidate: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).effective_candidate || ''), - config_primary_candidate: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).config_primary_candidate || '') + canonical_model: routingEntry.canonical_model || '', + routing_state: routingEntry.routing_state || null, + routing_pool: routingEntry.pool || null, + canonical_enabled: routingEntry.pool ? (routingEntry.pool.enabled !== false) : true, + canonical_status: (routingEntry.pool || {}).status || '', + canonical_reason: (routingEntry.pool || {}).status_reason || '', + routing_strategy: (routingEntry.pool || {}).strategy || '', + candidate_priority: ((routingEntry.routing_state || {}).priority ?? null), + candidate_weight: ((routingEntry.routing_state || {}).weight ?? null), + is_config_primary: Boolean((routingEntry.routing_state || {}).is_config_primary), + is_effective_candidate: Boolean((routingEntry.routing_state || {}).is_effective_candidate), + effective_candidate: ((routingEntry.pool || {}).effective_candidate || ''), + config_primary_candidate: ((routingEntry.pool || {}).config_primary_candidate || '') }));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 42 - 78, The code repeatedly calls routingCandidates.get(this.qualifiedModelName(model)) inside the rows mapping causing redundant lookups; refactor the rows creation in the models.map callback to compute const rc = routingCandidates.get(this.qualifiedModelName(model)) || {} (and const pool = rc.pool || {} if helpful) once at the top of the callback, then replace all occurrences with rc, pool or rc.routing_state so each property (canonical_model, routing_state, pool, pool.enabled, pool.status, pool.status_reason, pool.strategy, pool.effective_candidate, pool.config_primary_candidate, routing_state.priority, routing_state.weight, etc.) reads from those local variables instead of calling routingCandidates.get(...) repeatedly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@config/config_test.go`:
- Around line 720-769: The test TestLoad_RoutingConfigYAML is missing assertions
for the failover block parsed into cfg.Routing.Defaults.Failover; add assertions
that cfg.Routing.Defaults.Failover.Enabled == true,
cfg.Routing.Defaults.Failover.MaxAttempts == 5,
cfg.Routing.Defaults.Failover.RetryOnStatuses contains 429 and 503 (and has
expected length), and cfg.Routing.Defaults.Failover.RetryOnModelErrors == false
so regressions in failover parsing are caught.
In `@config/routing.go`:
- Around line 88-90: Validate and reject or sanitize invalid HTTP status codes
in cfg.Defaults.Failover.RetryOnStatuses during config load: in the routing
config initialization (the code that sets cfg.Defaults.Failover.RetryOnStatuses
in config/routing.go) iterate the slice and ensure each status is within the
HTTP range 100..599, either filtering out invalid values or returning a config
validation error; update the default-setting block that currently sets
[]int{429,500,502,503,504} to perform this validation (refer to
cfg.Defaults.Failover.RetryOnStatuses) so runtime failover matching cannot
receive values like 0, negative, or >599.
In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 523-530: The toggleCanonicalModelEnabled function sends the wrong
enabled value because it uses !(row.canonical_enabled === false) instead of
inverting the current boolean; update toggleCanonicalModelEnabled to compute the
new state as enabled: !row.canonical_enabled (or
Boolean(!row.canonical_enabled)) when calling submitRoutingStateChange, keeping
the existing null/undefined guard for row and row.canonical_model and preserving
the submitted payload fields canonical_model and kind.
- Around line 489-501: submitRoutingStateChange currently doesn't catch
fetch/network errors; wrap the body of submitRoutingStateChange in a try/catch
and on any thrown error behave like the other submit helpers: catch the error,
call the same error handling/logging path (or return false) so the caller
doesn't get an unhandled rejection, and ensure you still call/await
this.fetchRoutingState() and this.fetchRoutingPools() only on success; reference
the existing method submitRoutingStateChange and related helpers
handleFetchResponse, isStaleAuthFetchResult, fetchRoutingState, and
fetchRoutingPools when implementing the catch behavior.
- Around line 216-244: fetchRoutingState() updates this.routingStateViews but
never calls this.syncDisplayModels(), so UI models
(groupDisplayModels()/providerRoutingEnabled()) can remain stale; after
successfully assigning this.routingStateViews (the line setting it from payload)
call this.syncDisplayModels(), and also call this.syncDisplayModels() in the
catch block after setting this.routingStateViews = [] to mirror
fetchRoutingPools() behavior; reference fetchRoutingState, routingStateViews,
syncDisplayModels, fetchRoutingPools, groupDisplayModels, and
providerRoutingEnabled when making the change.
- Around line 246-266: fetchRoutingPools() currently treats any non-handled
response the same and doesn't update routingStateAvailable like
fetchRoutingState() does; update fetchRoutingPools() so that after calling
this.handleFetchResponse(res, 'routing model pools', request) you check for a
503 condition (the same condition used in fetchRoutingState()) and set
this.routingStateAvailable = false and return; otherwise keep the existing
!handled path that clears this.routingPools and calls this.syncDisplayModels();
reference functions/fields: fetchRoutingPools, fetchRoutingState,
handleFetchResponse, routingStateAvailable, routingPools, syncDisplayModels.
In `@internal/admin/handler_routing_pools_test.go`:
- Around line 16-91: The test
TestBuildRoutingPoolResponses_UsesRoutingConfigAndState (and the file
internal/admin/handler_routing_pools_test.go) must be converted into
table-driven tests to cover additional scenarios: weighted round-robin/default
behavior, nil routing state, and empty model pools; refactor the current
specific test into a table of cases each describing input
registry/service/config (construct via providers.NewModelRegistry(),
routingstate.NewService()/nil, NewHandler(..., WithRoutingState(...),
WithRoutingConfig(...))) and expected outputs from
h.buildRoutingPoolResponses(), and iterate over cases asserting Strategy,
Enabled, Candidates, EffectiveCandidate, ConfigPrimaryCandidate, and
BlockedCandidates for each case. Ensure each case has a descriptive name, uses
t.Run, and reuses helper setup code for creating handlers and invoking
buildRoutingPoolResponses to avoid duplication.
In `@internal/admin/handler_routing_state_test.go`:
- Around line 36-88: Replace the four separate tests with a single table-driven
test that iterates cases describing name, setup (using newRoutingStateHandler
and optional pre-seed via h.UpsertRoutingState), HTTP method and path, request
body, and expected status; call the appropriate handler function by inspecting
the method/path and assert rec.Code equals expected. Include cases: list empty
(GET -> 200, body "[]\n"), upsert provider success (PUT valid body -> 200),
upsert malformed/missing enabled (PUT invalid body -> 400), delete success
(DELETE existing key -> 204), delete not-found (DELETE non-existent key -> 404),
and malformed JSON for any endpoint (invalid JSON -> 400). Ensure you reference
the handler functions ListRoutingState, UpsertRoutingState, DeleteRoutingState
and use newRoutingStateHandler to create fresh handler instances per table case.
In `@internal/admin/handler_routing_state.go`:
- Around line 75-77: In the routingstate.KindPoolCandidate branch update the
equality check to include canonical model matching as well as provider and
model: inside the if that currently compares current.ProviderName ==
entry.ProviderName && current.Model == entry.Model, add a comparison for the
canonical model (e.g. current.CanonicalModel == entry.CanonicalModel or the
actual field name used for canonical_model) so the pool-candidate response only
returns when provider, model and canonical_model all match.
In `@internal/app/app.go`:
- Around line 278-292: The routing state (routingStateResult assigned to
app.routingState via routingstate.NewWithSharedStorage / routingstate.New) is
not being closed on error paths or during Shutdown; update the initialization
error handling to include app.routingState.Close() in the errors.Join close list
when routingStateResult was created, and add app.routingState.Close() to the
Shutdown/teardown join calls (where app.pricingOverrides.Close(),
app.modelOverrides.Close(), app.aliases.Close(), etc. are joined) so
routing-state resources are always closed on startup failure and normal
shutdown.
- Around line 437-441: routing.NewResolver(...) can return nil, so avoid
chaining .WithRuntime(...) directly; instead assign the result to a local
variable (routingResolver := routing.NewResolver(appCfg.Routing,
app.routingState.Service)), check if routingResolver != nil, and only then call
routingResolver = routingResolver.WithRuntime(providerResult.Registry); leave
routingResolver nil otherwise so subsequent code (e.g., composedModelResolver :=
routing.NewComposedResolver(app.aliases.Service, routingResolver)) doesn't panic
at startup.
In `@internal/gateway/fallback.go`:
- Around line 57-72: The code calls o.failoverPolicy methods/fields without
checking for nil when CanonicalModel is set; add a defensive nil-check around
access to o.failoverPolicy in the canonical path: when computing shouldAttempt
(replace o.failoverPolicy.ShouldAttempt(primaryErr)) only call it if
o.failoverPolicy != nil, otherwise keep shouldAttempt false (or log/handle
accordingly), and similarly before comparing attempts against
o.failoverPolicy.MaxAttempts inside the loop only access MaxAttempts if
o.failoverPolicy != nil; update the logic around ShouldAttemptFallback, the
shouldAttempt variable, the canonicalPool check, and the attempts loop to avoid
dereferencing a nil o.failoverPolicy.
- Around line 159-174: The code uses o.failoverPolicy without a nil check: when
computing shouldAttempt (overriding ShouldAttemptFallback) and when comparing
attempts with o.failoverPolicy.MaxAttempts; add defensive nil checks so
o.failoverPolicy is non-nil before accessing methods/properties. Specifically,
in the block that sets shouldAttempt (currently: if workflow != nil &&
workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" {
shouldAttempt = o.failoverPolicy.ShouldAttempt(primaryErr) }) guard the call
with o.failoverPolicy != nil, and in the for-loop condition (canonicalPool &&
o.failoverPolicy.MaxAttempts > 0 && attempts >= o.failoverPolicy.MaxAttempts-1)
ensure o.failoverPolicy != nil is checked first (e.g., canonicalPool &&
o.failoverPolicy != nil && o.failoverPolicy.MaxAttempts > 0 && ...), mirroring
the defensive checks used in tryFallbackResponse.
In `@internal/gateway/request_model_resolution.go`:
- Around line 66-67: The code currently calls
ResolveExecutionSelectorWithContext (producing resolvedSelector, aliasApplied)
and then invokes the routing resolver again to compute canonical
selector/metadata, which can advance state twice; instead call the resolver only
once and capture both the execution selector and its canonical metadata in that
single invocation (either extend ResolveExecutionSelectorWithContext to return
canonical metadata or call the canonical-resolve helper once and store its
results), then pass those stored values into RequestModelResolution enrichment
and any execution code so both use the identical selector and metadata (use the
variables resolvedSelector, aliasApplied, and the new
canonicalMetadata/canonicalSelector you obtain from the single resolver call).
In `@internal/providers/config_test.go`:
- Around line 66-84: The clearProviderConfigEnvVars helper only clears
unsuffixed keys, causing resolveProviders and
TestResolveProviders_NoProvidersNoEnvVars to be flaky when ambient
suffixed/provider-family vars (e.g. *_EAST_*, VERTEX_*, BEDROCK_* or
provider-prefixed variations) exist; update clearProviderConfigEnvVars to also
remove any environment variables whose names start with or contain the provider
prefixes listed (OPENAI, ANTHROPIC, GEMINI, DEEPSEEK, XAI, GROQ, OPENROUTER,
ZAI, AZURE, ORACLE, VLLM, OLLAMA) or their family suffixes by scanning
os.Environ (or t.Setenv for each match) and unsetting keys that match those
patterns so resolveProviders and TestResolveProviders_NoProvidersNoEnvVars run
deterministically.
In `@internal/routing/exposed_models_test.go`:
- Around line 33-98: Combine the two tests into a single table-driven test that
iterates cases and calls NewCanonicalExposedModelLister(...).ExposedModels() for
each case; each case should specify the routing strategy and provider runtime
snapshots (healthy/unhealthy) plus the expected selected model properties (e.g.,
expect OwnedBy == "b" or expect ID == "claude-sonnet-4-6"). Reuse the existing
testCatalog models and the staticRuntimeProvider snapshots from the current
tests, and assert on models length and the expected field for each case. Keep
test logic extracted around NewCanonicalExposedModelLister, ExposedModels, and
the assertions so each table row only changes config.RoutingConfig and snapshots
and the expected result.
In `@internal/routing/exposed_models.go`:
- Around line 59-64: The nil-check on the locally-created pointer "chosen" is
impossible and triggers a lint error; remove the dead branch and append the
cloned model directly: after creating "clone := *model" and setting "clone.ID =
canonical", use models = append(models, clone) (or append(models, *chosen) but
without the if chosen != nil check) so the code no longer contains the SA4031
unreachable nil test.
In `@internal/routing/pool_evaluator.go`:
- Around line 112-118: The aggregation currently increments availableCount for
any effectiveEnabled runtime even if runtime.Status == "unhealthy", causing pool
state to be reported as "enabled" incorrectly; update the logic so
availableCount is only incremented when effectiveEnabled && runtime.Status !=
"unhealthy" (same condition used for selectable and selectableCount), and apply
the same fix to the later pool-status aggregation block that uses
effectiveEnabled/availableCount (the section around
selectableCount/availableCount and the block at lines ~140-143) so unhealthy
runtimes are excluded from "available" counts everywhere.
In `@internal/routing/resolver_test.go`:
- Around line 313-315: The lint QF1001 flags the negated conjunction in
manualDisableStateChecker.CandidateEnabled; replace the negation with an
explicit disjunction to satisfy the linter by returning selector.Provider !=
s.disabledProvider || selector.Model != s.disabledModel instead of
!(selector.Provider == s.disabledProvider && selector.Model == s.disabledModel),
keeping references to the same symbols
(manualDisableStateChecker.CandidateEnabled, selector.Provider, selector.Model,
s.disabledProvider, s.disabledModel).
In `@internal/routing/resolver.go`:
- Around line 61-66: Resolver.WithRuntime currently assigns to Resolver.runtime
without synchronization while ResolveWithContext reads it during request
handling, risking a data race; either make the field access concurrency-safe by
acquiring the Resolver.mu mutex when writing (and ensure readers use the same mu
when reading in ResolveWithContext/EvaluatePool paths) or explicitly document
and enforce that WithRuntime is init-only and never called after the Resolver is
published; update the implementation to use the Resolver.mu lock around the
write in WithRuntime (and audit ResolveWithContext/EvaluatePool to use mu when
accessing runtime) or add clear comments and tests asserting init-only usage.
In `@internal/routing/runtime.go`:
- Around line 35-39: The branch that returns "healthy" when snapshot.Registered
&& snapshot.DiscoveredModelCount > 0 && modelFetchError == "" must also consider
availability failures: check snapshot.LastAvailabilityError (or the equivalent
field) and treat non-empty/non-nil availability errors as "degraded" before
returning "healthy". Update the logic around the case that uses
snapshot.UsingCachedModels and snapshot.LastModelFetchSuccessAt to first return
"degraded" if snapshot.LastAvailabilityError is set, and apply the same fix to
the other similar branch referenced (the one at lines 44-47) so both branches
consider snapshot.LastAvailabilityError prior to returning "healthy".
In `@internal/routing/session_affinity.go`:
- Around line 46-49: The current use of core.GetRequestID(ctx) to create
affinity keys (canonicalModel + "|request_id|" + requestID) produces many
one-off keys and the lazy cleanup in the read path (the code referenced around
Put/Get cleanup lines) can let the map grow unbounded; fix by either removing
the request_id fallback from the affinity key generation (only use stable
identifiers) or by bounding the affinity map in the Put path: in the Put method
add a pruning strategy (e.g., TTL per entry or a periodic sweep every N writes
and/or a max-entries eviction) and ensure the existing lazy cleanup code in Get
still complements it; modify the code that builds the key (where canonicalModel
and core.GetRequestID(ctx) are used) and update Put/Get cleanup logic to enforce
bounded size.
In `@internal/routing/strategy.go`:
- Around line 12-20: The priority-failover branch (case
config.RoutingStrategyPriorityFailover) blindly indexes ordered[0] which will
panic if pool.Candidates is empty; add a guard that checks len(pool.Candidates)
(or len(ordered)) == 0 and return an appropriate error (or nil/empty values
consistent with the function signature) before sorting/returning, referencing
the RoutingStrategyPriorityFailover case, pool.Candidates and ordered slice to
locate where to insert the check.
In `@internal/routingstate/service.go`:
- Around line 55-56: normalizeEntry currently overwrites entry.UpdatedAt with
time.Now().UTC(), so calling it from Refresh causes in-memory timestamps to
change even when storage didn't; modify Refresh to preserve persisted UpdatedAt
by either adding a no-update flag to normalizeEntry (e.g., normalizeEntry(entry,
preserveTimestamps bool)) or by capturing the original entry.UpdatedAt before
calling normalizeEntry and restoring it afterwards if no storage changes
occurred; ensure references in Refresh use the preserved UpdatedAt rather than
the value set inside normalizeEntry.
In `@internal/routingstate/types.go`:
- Around line 63-65: For entries of KindPoolCandidate, always canonicalize the
Key to the canonical form provider/model instead of leaving a caller-supplied
Key; in the code handling KindPoolCandidate (the block referencing entry.Key,
entry.ProviderName and entry.Model in internal/routingstate/types.go) remove the
conditional that only sets entry.Key when empty and unconditionally set
entry.Key = entry.ProviderName + "/" + entry.Model so the Key is stable and
unique for a given provider+model.
---
Outside diff comments:
In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 42-78: The code repeatedly calls
routingCandidates.get(this.qualifiedModelName(model)) inside the rows mapping
causing redundant lookups; refactor the rows creation in the models.map callback
to compute const rc = routingCandidates.get(this.qualifiedModelName(model)) ||
{} (and const pool = rc.pool || {} if helpful) once at the top of the callback,
then replace all occurrences with rc, pool or rc.routing_state so each property
(canonical_model, routing_state, pool, pool.enabled, pool.status,
pool.status_reason, pool.strategy, pool.effective_candidate,
pool.config_primary_candidate, routing_state.priority, routing_state.weight,
etc.) reads from those local variables instead of calling
routingCandidates.get(...) repeatedly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c536d95d-8d9a-4bb5-9f8c-89658dc71a81
📒 Files selected for processing (53)
config/config.example.yamlconfig/config.goconfig/config_example_test.goconfig/config_test.goconfig/routing.godocs/features/canonical-routing.mdxinternal/admin/dashboard/static/js/dashboard.jsinternal/admin/dashboard/static/js/modules/aliases.jsinternal/admin/dashboard/static/js/modules/aliases.test.cjsinternal/admin/dashboard/templates/model-table-body.htmlinternal/admin/handler.gointernal/admin/handler_providers.gointernal/admin/handler_routing_pools.gointernal/admin/handler_routing_pools_test.gointernal/admin/handler_routing_state.gointernal/admin/handler_routing_state_test.gointernal/admin/handler_test.gointernal/admin/routes.gointernal/admin/routes_test.gointernal/app/app.gointernal/core/canonical_routing.gointernal/core/request_model_resolution.gointernal/fallback/resolver.gointernal/fallback/resolver_test.gointernal/gateway/fallback.gointernal/gateway/inference_orchestrator.gointernal/gateway/request_model_resolution.gointernal/providers/config_test.gointernal/routing/composed_resolver.gointernal/routing/exposed_models.gointernal/routing/exposed_models_test.gointernal/routing/failover_policy.gointernal/routing/pool_evaluator.gointernal/routing/resolver.gointernal/routing/resolver_test.gointernal/routing/runtime.gointernal/routing/session_affinity.gointernal/routing/strategy.gointernal/routing/types.gointernal/routingstate/factory.gointernal/routingstate/service.gointernal/routingstate/service_test.gointernal/routingstate/store.gointernal/routingstate/store_mongodb.gointernal/routingstate/store_postgresql.gointernal/routingstate/store_sqlite.gointernal/routingstate/types.gointernal/server/fallback_test.gointernal/server/handlers.gointernal/server/http.gointernal/server/internal_chat_completion_executor.gointernal/server/request_model_resolution_test.gointernal/server/translated_inference_service.go
| func TestLoad_RoutingConfigYAML(t *testing.T) { | ||
| clearAllConfigEnvVars(t) | ||
|
|
||
| withTempDir(t, func(dir string) { | ||
| yaml := ` | ||
| routing: | ||
| defaults: | ||
| strategy: weighted_round_robin | ||
| session_affinity: false | ||
| session_affinity_ttl: 45m | ||
| failover: | ||
| enabled: true | ||
| max_attempts: 5 | ||
| retry_on_statuses: [429, 503] | ||
| retry_on_model_errors: false | ||
| model_pools: | ||
| claude-sonnet-4-6: | ||
| candidates: | ||
| - provider: anthropic_b | ||
| model: claude-sonnet-4-6 | ||
| weight: 10 | ||
| priority: 1 | ||
| - provider: anthropic_a | ||
| model: claude-sonnet-4-6-20250929 | ||
| weight: 8 | ||
| priority: 2 | ||
| ` | ||
| if err := os.WriteFile(filepath.Join(dir, "config.yaml"), []byte(yaml), 0644); err != nil { | ||
| t.Fatalf("Failed to write config.yaml: %v", err) | ||
| } | ||
|
|
||
| result, err := Load() | ||
| if err != nil { | ||
| t.Fatalf("Load() failed: %v", err) | ||
| } | ||
| cfg := result.Config | ||
| if cfg.Routing.Defaults.Strategy != RoutingStrategyWeightedRoundRobin { | ||
| t.Fatalf("Routing.Defaults.Strategy = %q, want %q", cfg.Routing.Defaults.Strategy, RoutingStrategyWeightedRoundRobin) | ||
| } | ||
| if cfg.Routing.Defaults.SessionAffinity { | ||
| t.Fatal("expected SessionAffinity=false from YAML") | ||
| } | ||
| if cfg.Routing.Defaults.SessionAffinityTTL != 45*time.Minute { | ||
| t.Fatalf("SessionAffinityTTL = %s, want 45m", cfg.Routing.Defaults.SessionAffinityTTL) | ||
| } | ||
| pool := cfg.Routing.ModelPools["claude-sonnet-4-6"] | ||
| if len(pool.Candidates) != 2 { | ||
| t.Fatalf("len(pool.Candidates) = %d, want 2", len(pool.Candidates)) | ||
| } | ||
| }) |
There was a problem hiding this comment.
Assert loaded failover fields in the routing YAML test.
Line 730 through Line 735 sets failover inputs, but the test only validates strategy, affinity, TTL, and candidate count. A regression in failover parsing could pass unnoticed.
✅ Suggested test assertions
if cfg.Routing.Defaults.SessionAffinityTTL != 45*time.Minute {
t.Fatalf("SessionAffinityTTL = %s, want 45m", cfg.Routing.Defaults.SessionAffinityTTL)
}
+ failover := cfg.Routing.Defaults.Failover
+ if !failover.Enabled {
+ t.Fatal("expected Failover.Enabled=true from YAML")
+ }
+ if failover.MaxAttempts != 5 {
+ t.Fatalf("Failover.MaxAttempts = %d, want 5", failover.MaxAttempts)
+ }
+ if !reflect.DeepEqual(failover.RetryOnStatuses, []int{429, 503}) {
+ t.Fatalf("Failover.RetryOnStatuses = %v, want [429 503]", failover.RetryOnStatuses)
+ }
+ if failover.RetryOnModelErrors {
+ t.Fatal("expected Failover.RetryOnModelErrors=false from YAML")
+ }
pool := cfg.Routing.ModelPools["claude-sonnet-4-6"]As per coding guidelines: “Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@config/config_test.go` around lines 720 - 769, The test
TestLoad_RoutingConfigYAML is missing assertions for the failover block parsed
into cfg.Routing.Defaults.Failover; add assertions that
cfg.Routing.Defaults.Failover.Enabled == true,
cfg.Routing.Defaults.Failover.MaxAttempts == 5,
cfg.Routing.Defaults.Failover.RetryOnStatuses contains 429 and 503 (and has
expected length), and cfg.Routing.Defaults.Failover.RetryOnModelErrors == false
so regressions in failover parsing are caught.
| if len(cfg.Defaults.Failover.RetryOnStatuses) == 0 { | ||
| cfg.Defaults.Failover.RetryOnStatuses = []int{429, 500, 502, 503, 504} | ||
| } |
There was a problem hiding this comment.
Validate retry_on_statuses values before accepting config.
Invalid HTTP codes (e.g., 0, 700, negative values) are currently accepted and can silently break failover matching at runtime. Add range validation (100..599) during config load.
Proposed fix
if len(cfg.Defaults.Failover.RetryOnStatuses) == 0 {
cfg.Defaults.Failover.RetryOnStatuses = []int{429, 500, 502, 503, 504}
}
+ for idx, status := range cfg.Defaults.Failover.RetryOnStatuses {
+ if status < 100 || status > 599 {
+ return fmt.Errorf("routing.defaults.failover.retry_on_statuses[%d] must be a valid HTTP status code (100-599)", idx)
+ }
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@config/routing.go` around lines 88 - 90, Validate and reject or sanitize
invalid HTTP status codes in cfg.Defaults.Failover.RetryOnStatuses during config
load: in the routing config initialization (the code that sets
cfg.Defaults.Failover.RetryOnStatuses in config/routing.go) iterate the slice
and ensure each status is within the HTTP range 100..599, either filtering out
invalid values or returning a config validation error; update the
default-setting block that currently sets []int{429,500,502,503,504} to perform
this validation (refer to cfg.Defaults.Failover.RetryOnStatuses) so runtime
failover matching cannot receive values like 0, negative, or >599.
| async fetchRoutingState() { | ||
| try { | ||
| const request = this.adminRequestOptions(); | ||
| const res = await fetch('/admin/routing-state', request); | ||
| if (res.status === 503) { | ||
| this.routingStateAvailable = false; | ||
| this.routingStateViews = []; | ||
| this.routingPools = []; | ||
| this.syncDisplayModels(); | ||
| return; | ||
| } | ||
| const handled = this.handleFetchResponse(res, 'routing state', request); | ||
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | ||
| return; | ||
| } | ||
| if (!handled) { | ||
| this.routingStateViews = []; | ||
| this.routingPools = []; | ||
| this.syncDisplayModels(); | ||
| return; | ||
| } | ||
| this.routingStateAvailable = true; | ||
| const payload = await res.json(); | ||
| this.routingStateViews = Array.isArray(payload) ? payload : []; | ||
| } catch (e) { | ||
| console.error('Failed to fetch routing state:', e); | ||
| this.routingStateViews = []; | ||
| } | ||
| }, |
There was a problem hiding this comment.
Missing syncDisplayModels() after successful fetch.
fetchRoutingPools() calls syncDisplayModels() after updating routingPools, but fetchRoutingState() does not call it after updating routingStateViews. Since groupDisplayModels() uses routingStateViews for providerRoutingEnabled(), the display models may not reflect updated routing state.
🐛 Proposed fix
this.routingStateAvailable = true;
const payload = await res.json();
this.routingStateViews = Array.isArray(payload) ? payload : [];
+ this.syncDisplayModels();
} catch (e) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async fetchRoutingState() { | |
| try { | |
| const request = this.adminRequestOptions(); | |
| const res = await fetch('/admin/routing-state', request); | |
| if (res.status === 503) { | |
| this.routingStateAvailable = false; | |
| this.routingStateViews = []; | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| const handled = this.handleFetchResponse(res, 'routing state', request); | |
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | |
| return; | |
| } | |
| if (!handled) { | |
| this.routingStateViews = []; | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| this.routingStateAvailable = true; | |
| const payload = await res.json(); | |
| this.routingStateViews = Array.isArray(payload) ? payload : []; | |
| } catch (e) { | |
| console.error('Failed to fetch routing state:', e); | |
| this.routingStateViews = []; | |
| } | |
| }, | |
| async fetchRoutingState() { | |
| try { | |
| const request = this.adminRequestOptions(); | |
| const res = await fetch('/admin/routing-state', request); | |
| if (res.status === 503) { | |
| this.routingStateAvailable = false; | |
| this.routingStateViews = []; | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| const handled = this.handleFetchResponse(res, 'routing state', request); | |
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | |
| return; | |
| } | |
| if (!handled) { | |
| this.routingStateViews = []; | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| this.routingStateAvailable = true; | |
| const payload = await res.json(); | |
| this.routingStateViews = Array.isArray(payload) ? payload : []; | |
| this.syncDisplayModels(); | |
| } catch (e) { | |
| console.error('Failed to fetch routing state:', e); | |
| this.routingStateViews = []; | |
| } | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 216 -
244, fetchRoutingState() updates this.routingStateViews but never calls
this.syncDisplayModels(), so UI models
(groupDisplayModels()/providerRoutingEnabled()) can remain stale; after
successfully assigning this.routingStateViews (the line setting it from payload)
call this.syncDisplayModels(), and also call this.syncDisplayModels() in the
catch block after setting this.routingStateViews = [] to mirror
fetchRoutingPools() behavior; reference fetchRoutingState, routingStateViews,
syncDisplayModels, fetchRoutingPools, groupDisplayModels, and
providerRoutingEnabled when making the change.
| async fetchRoutingPools() { | ||
| try { | ||
| const request = this.adminRequestOptions(); | ||
| const res = await fetch('/admin/routing/model-pools', request); | ||
| const handled = this.handleFetchResponse(res, 'routing model pools', request); | ||
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | ||
| return; | ||
| } | ||
| if (!handled) { | ||
| this.routingPools = []; | ||
| this.syncDisplayModels(); | ||
| return; | ||
| } | ||
| const payload = await res.json(); | ||
| this.routingPools = Array.isArray(payload) ? payload : []; | ||
| this.syncDisplayModels(); | ||
| } catch (e) { | ||
| console.error('Failed to fetch routing pools:', e); | ||
| this.routingPools = []; | ||
| } | ||
| }, |
There was a problem hiding this comment.
Missing 503 handling for consistency with fetchRoutingState().
fetchRoutingState() explicitly handles 503 by setting routingStateAvailable = false, but fetchRoutingPools() does not. If the routing pools endpoint returns 503, the generic !handled path runs without marking the feature as unavailable.
🐛 Proposed fix
async fetchRoutingPools() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing/model-pools', request);
+ if (res.status === 503) {
+ this.routingStateAvailable = false;
+ this.routingPools = [];
+ this.syncDisplayModels();
+ return;
+ }
const handled = this.handleFetchResponse(res, 'routing model pools', request);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async fetchRoutingPools() { | |
| try { | |
| const request = this.adminRequestOptions(); | |
| const res = await fetch('/admin/routing/model-pools', request); | |
| const handled = this.handleFetchResponse(res, 'routing model pools', request); | |
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | |
| return; | |
| } | |
| if (!handled) { | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| const payload = await res.json(); | |
| this.routingPools = Array.isArray(payload) ? payload : []; | |
| this.syncDisplayModels(); | |
| } catch (e) { | |
| console.error('Failed to fetch routing pools:', e); | |
| this.routingPools = []; | |
| } | |
| }, | |
| async fetchRoutingPools() { | |
| try { | |
| const request = this.adminRequestOptions(); | |
| const res = await fetch('/admin/routing/model-pools', request); | |
| if (res.status === 503) { | |
| this.routingStateAvailable = false; | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| const handled = this.handleFetchResponse(res, 'routing model pools', request); | |
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | |
| return; | |
| } | |
| if (!handled) { | |
| this.routingPools = []; | |
| this.syncDisplayModels(); | |
| return; | |
| } | |
| const payload = await res.json(); | |
| this.routingPools = Array.isArray(payload) ? payload : []; | |
| this.syncDisplayModels(); | |
| } catch (e) { | |
| console.error('Failed to fetch routing pools:', e); | |
| this.routingPools = []; | |
| } | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 246 -
266, fetchRoutingPools() currently treats any non-handled response the same and
doesn't update routingStateAvailable like fetchRoutingState() does; update
fetchRoutingPools() so that after calling this.handleFetchResponse(res, 'routing
model pools', request) you check for a 503 condition (the same condition used in
fetchRoutingState()) and set this.routingStateAvailable = false and return;
otherwise keep the existing !handled path that clears this.routingPools and
calls this.syncDisplayModels(); reference functions/fields: fetchRoutingPools,
fetchRoutingState, handleFetchResponse, routingStateAvailable, routingPools,
syncDisplayModels.
| async submitRoutingStateChange(payload) { | ||
| const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) }); | ||
| const res = await fetch('/admin/routing-state', request); | ||
| const handled = this.handleFetchResponse(res, 'routing state update', request); | ||
| if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { | ||
| return false; | ||
| } | ||
| if (!handled) { | ||
| return false; | ||
| } | ||
| await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]); | ||
| return true; | ||
| }, |
There was a problem hiding this comment.
Missing error handling for network failures.
Unlike other submit functions (e.g., submitAliasForm, toggleAliasEnabled), this function lacks a try/catch wrapper. If fetch() throws a network error, the promise rejects unhandled and the caller's await will propagate the exception silently.
🐛 Proposed fix
async submitRoutingStateChange(payload) {
+ try {
const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
const res = await fetch('/admin/routing-state', request);
const handled = this.handleFetchResponse(res, 'routing state update', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return false;
}
if (!handled) {
return false;
}
await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
return true;
+ } catch (e) {
+ console.error('Failed to update routing state:', e);
+ return false;
+ }
},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 489 -
501, submitRoutingStateChange currently doesn't catch fetch/network errors; wrap
the body of submitRoutingStateChange in a try/catch and on any thrown error
behave like the other submit helpers: catch the error, call the same error
handling/logging path (or return false) so the caller doesn't get an unhandled
rejection, and ensure you still call/await this.fetchRoutingState() and
this.fetchRoutingPools() only on success; reference the existing method
submitRoutingStateChange and related helpers handleFetchResponse,
isStaleAuthFetchResult, fetchRoutingState, and fetchRoutingPools when
implementing the catch behavior.
| case snapshot.Registered && snapshot.DiscoveredModelCount > 0 && modelFetchError == "": | ||
| if snapshot.UsingCachedModels && snapshot.LastModelFetchSuccessAt == nil { | ||
| return "degraded" | ||
| } | ||
| return "healthy" |
There was a problem hiding this comment.
Availability errors are masked as healthy when models are discovered.
Line 35-39 returns "healthy" before considering availability failures, so snapshots with non-empty LastAvailabilityError can be misclassified.
Suggested adjustment
switch {
+ case availabilityError != "" && snapshot.DiscoveredModelCount > 0:
+ return "degraded"
case snapshot.Registered && snapshot.DiscoveredModelCount > 0 && modelFetchError == "":
if snapshot.UsingCachedModels && snapshot.LastModelFetchSuccessAt == nil {
return "degraded"
}
return "healthy"Also applies to: 44-47
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routing/runtime.go` around lines 35 - 39, The branch that returns
"healthy" when snapshot.Registered && snapshot.DiscoveredModelCount > 0 &&
modelFetchError == "" must also consider availability failures: check
snapshot.LastAvailabilityError (or the equivalent field) and treat
non-empty/non-nil availability errors as "degraded" before returning "healthy".
Update the logic around the case that uses snapshot.UsingCachedModels and
snapshot.LastModelFetchSuccessAt to first return "degraded" if
snapshot.LastAvailabilityError is set, and apply the same fix to the other
similar branch referenced (the one at lines 44-47) so both branches consider
snapshot.LastAvailabilityError prior to returning "healthy".
| case config.RoutingStrategyPriorityFailover: | ||
| ordered := append([]Candidate(nil), pool.Candidates...) | ||
| sort.SliceStable(ordered, func(i, j int) bool { | ||
| if ordered[i].Priority != ordered[j].Priority { | ||
| return ordered[i].Priority < ordered[j].Priority | ||
| } | ||
| return ordered[i].QualifiedModel() < ordered[j].QualifiedModel() | ||
| }) | ||
| return ordered[0], ordered[1:], nil |
There was a problem hiding this comment.
Add empty-pool guard before indexing ordered[0].
Line 20 can panic when pool.Candidates is empty in the priority_failover path.
Proposed minimal fix
func selectCandidates(strategy config.RoutingStrategy, pool Pool, counters map[string]int) (Candidate, []Candidate, error) {
+ if len(pool.Candidates) == 0 {
+ return Candidate{}, nil, fmt.Errorf("pool %q has no candidates", pool.CanonicalModel)
+ }
+
switch strategy {
case config.RoutingStrategyPriorityFailover:
ordered := append([]Candidate(nil), pool.Candidates...)
sort.SliceStable(ordered, func(i, j int) bool {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routing/strategy.go` around lines 12 - 20, The priority-failover
branch (case config.RoutingStrategyPriorityFailover) blindly indexes ordered[0]
which will panic if pool.Candidates is empty; add a guard that checks
len(pool.Candidates) (or len(ordered)) == 0 and return an appropriate error (or
nil/empty values consistent with the function signature) before
sorting/returning, referencing the RoutingStrategyPriorityFailover case,
pool.Candidates and ordered slice to locate where to insert the check.
| normalized, err := normalizeEntry(entry) | ||
| if err != nil { |
There was a problem hiding this comment.
Do not rewrite persisted timestamps during Refresh.
normalizeEntry updates UpdatedAt to time.Now().UTC(), so every refresh changes in-memory timestamps even when no data changed in storage.
💡 Proposed fix
for _, entry := range entries {
+ createdAt := entry.CreatedAt
+ updatedAt := entry.UpdatedAt
normalized, err := normalizeEntry(entry)
if err != nil {
return fmt.Errorf("load routing state %q: %w", entry.Key, err)
}
+ normalized.CreatedAt = createdAt
+ normalized.UpdatedAt = updatedAt
next.order = append(next.order, normalized.Key)
next.byKey[normalized.Key] = normalized📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| normalized, err := normalizeEntry(entry) | |
| if err != nil { | |
| for _, entry := range entries { | |
| createdAt := entry.CreatedAt | |
| updatedAt := entry.UpdatedAt | |
| normalized, err := normalizeEntry(entry) | |
| if err != nil { | |
| return fmt.Errorf("load routing state %q: %w", entry.Key, err) | |
| } | |
| normalized.CreatedAt = createdAt | |
| normalized.UpdatedAt = updatedAt | |
| next.order = append(next.order, normalized.Key) | |
| next.byKey[normalized.Key] = normalized | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routingstate/service.go` around lines 55 - 56, normalizeEntry
currently overwrites entry.UpdatedAt with time.Now().UTC(), so calling it from
Refresh causes in-memory timestamps to change even when storage didn't; modify
Refresh to preserve persisted UpdatedAt by either adding a no-update flag to
normalizeEntry (e.g., normalizeEntry(entry, preserveTimestamps bool)) or by
capturing the original entry.UpdatedAt before calling normalizeEntry and
restoring it afterwards if no storage changes occurred; ensure references in
Refresh use the preserved UpdatedAt rather than the value set inside
normalizeEntry.
| if entry.Key == "" { | ||
| entry.Key = entry.ProviderName + "/" + entry.Model | ||
| } |
There was a problem hiding this comment.
Canonicalize pool_candidate keys unconditionally.
For KindPoolCandidate, keeping a caller-supplied non-empty Key allows multiple keys for the same provider/model. That breaks stable identity and can create duplicate logical entries.
💡 Proposed fix
case KindPoolCandidate:
if entry.ProviderName == "" || entry.Model == "" {
return Entry{}, newValidationError("provider_name and model are required for pool_candidate", nil)
}
- if entry.Key == "" {
- entry.Key = entry.ProviderName + "/" + entry.Model
- }
+ entry.Key = entry.ProviderName + "/" + entry.Model🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routingstate/types.go` around lines 63 - 65, For entries of
KindPoolCandidate, always canonicalize the Key to the canonical form
provider/model instead of leaving a caller-supplied Key; in the code handling
KindPoolCandidate (the block referencing entry.Key, entry.ProviderName and
entry.Model in internal/routingstate/types.go) remove the conditional that only
sets entry.Key when empty and unconditionally set entry.Key = entry.ProviderName
+ "/" + entry.Model so the Key is stable and unique for a given provider+model.
Introduce canonical model pool routing with priority failover, weighted distribution, session affinity, runtime-aware candidate filtering, admin state controls, and user-facing documentation.
779fa78 to
62c4df5
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (15)
internal/routing/exposed_models_test.go (1)
33-98: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winMerge these exposed-model behavior tests into one table-driven test.
Both cases follow the same scaffold and should be expressed as table rows to keep future pool-strategy/runtime-state permutations easy to extend.
As per coding guidelines, "
**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes. Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/exposed_models_test.go` around lines 33 - 98, Combine the two tests TestCanonicalExposedModelLister_UsesEffectiveResolverChoice and TestCanonicalExposedModelLister_SkipsUnhealthyCandidates into a single table-driven test that iterates over cases; each case should include name, routing config, catalog models, provider snapshots, expected count and expected model fields, and then for each case construct the lister with NewCanonicalExposedModelLister and call ExposedModels() asserting length and expected values. Keep references to the original scenarios (weighted vs priority strategies, provider snapshots with LastModelFetchError vs LastModelFetchSuccessAt) as separate table entries so future permutations are simple to add, and reuse any setup helpers to build testCatalog and staticRuntimeProvider to avoid duplication.internal/admin/handler_routing_state_test.go (1)
36-88: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winConsolidate routing-state handler tests into a table-driven suite and expand error-path cases.
The current four standalone tests should be converted to table-driven form and include key negative-path cases (e.g., delete not found, malformed JSON) for this behavior change.
As per coding guidelines, "
**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes. Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/handler_routing_state_test.go` around lines 36 - 88, Replace the four standalone tests with a single table-driven test that iterates cases calling the handler methods (ListRoutingState, UpsertRoutingState, DeleteRoutingState) via the same newRoutingStateHandler helper and echo context; for each case define name, method, path body, expected status and expected body (including negative cases: delete-not-found, malformed JSON, missing enabled field), run as subtests with t.Run, construct the httptest request/recorder, set Content-Type, invoke the appropriate handler method, and assert status and response body; ensure cases cover request translation, response normalization, error handling, default config and provider-specific parameter mapping (e.g., provider_name "anthropic_a") so tests replace TestListRoutingState_Empty, TestUpsertRoutingState_Provider, TestUpsertRoutingState_InvalidMissingEnabled and TestDeleteRoutingState.config/config_test.go (1)
730-734:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert loaded failover defaults in
TestLoad_RoutingConfigYAML.The test writes
routing.defaults.failovervalues but never verifies them, so failover parsing regressions can slip through.As per coding guidelines: “Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping.”Suggested test assertions
if cfg.Routing.Defaults.SessionAffinityTTL != 45*time.Minute { t.Fatalf("SessionAffinityTTL = %s, want 45m", cfg.Routing.Defaults.SessionAffinityTTL) } + failover := cfg.Routing.Defaults.Failover + if !failover.Enabled { + t.Fatal("expected Failover.Enabled=true from YAML") + } + if failover.MaxAttempts != 5 { + t.Fatalf("Failover.MaxAttempts = %d, want 5", failover.MaxAttempts) + } + if !reflect.DeepEqual(failover.RetryOnStatuses, []int{429, 503}) { + t.Fatalf("Failover.RetryOnStatuses = %v, want [429 503]", failover.RetryOnStatuses) + } + if failover.RetryOnModelErrors { + t.Fatal("expected Failover.RetryOnModelErrors=false from YAML") + } pool := cfg.Routing.ModelPools["claude-sonnet-4-6"]Also applies to: 756-768
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/config_test.go` around lines 730 - 734, Add assertions in TestLoad_RoutingConfigYAML to verify that routing.defaults.failover is parsed correctly: assert that the loaded config's Routing.Defaults.Failover.Enabled equals true, MaxAttempts equals 5, RetryOnStatuses contains 429 and 503, and RetryOnModelErrors equals false (use the test's config loader and the TestLoad_RoutingConfigYAML test function and the Routing.Defaults.Failover struct fields to locate where to add these checks); also add similar assertions for the other YAML block referenced around lines 756-768 to ensure both failover default blocks are covered.internal/routing/session_affinity.go (1)
46-49:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid request-ID affinity keys; they cause unbounded map growth.
request_idfallback creates mostly single-use keys, while cleanup is lazy onGet, so stale entries can accumulate indefinitely.Suggested minimal fix
func affinityKey(ctx context.Context, canonicalModel string) string { @@ userPath := strings.TrimSpace(core.UserPathFromContext(ctx)) if userPath != "" { return canonicalModel + "|user_path|" + userPath } - requestID := strings.TrimSpace(core.GetRequestID(ctx)) - if requestID != "" { - return canonicalModel + "|request_id|" + requestID - } return "" }Also applies to: 67-69, 82-84
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/session_affinity.go` around lines 46 - 49, The current affinity-key construction uses request-scoped IDs (core.GetRequestID(ctx) stored in variable requestID) which produces mostly single-use keys and leaks entries; remove the branches that return canonicalModel + "|request_id|" + requestID so request_id is no longer used as a fallback. Locate the places that compute requestID (calls to core.GetRequestID(ctx) and the return expressions that append "|request_id|" to canonicalModel) and change them to skip to the next stable fallback (or simply return canonicalModel) so only stable identifiers are used for affinity keys.internal/routing/strategy.go (1)
12-20:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winGuard empty pools before indexing
ordered[0].The
priority_failoverpath panics whenpool.Candidatesis empty.Suggested minimal fix
func selectCandidates(strategy config.RoutingStrategy, pool Pool, counters map[string]int) (Candidate, []Candidate, error) { + if len(pool.Candidates) == 0 { + return Candidate{}, nil, fmt.Errorf("pool %q has no candidates", pool.CanonicalModel) + } + switch strategy { case config.RoutingStrategyPriorityFailover: ordered := append([]Candidate(nil), pool.Candidates...)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/strategy.go` around lines 12 - 20, In the RoutingStrategyPriorityFailover branch (case config.RoutingStrategyPriorityFailover) guard against an empty pool before accessing ordered[0]: check if len(pool.Candidates) == 0 (or len(ordered) == 0) and return a clear error (or nil candidate with an error) instead of indexing; then proceed to sort and return ordered[0], ordered[1:], nil only when non-empty. Ensure you update any callers expecting a candidate/error accordingly.internal/admin/dashboard/static/js/modules/aliases.js (4)
489-500:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCatch network failures in
submitRoutingStateChange.If
fetch()throws here, the promise rejects unhandled and the toggle callers never get the expectedfalseresult.🔧 Minimal fix
async submitRoutingStateChange(payload) { - const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) }); - const res = await fetch('/admin/routing-state', request); - const handled = this.handleFetchResponse(res, 'routing state update', request); - if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { - return false; - } - if (!handled) { - return false; - } - await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]); - return true; + try { + const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) }); + const res = await fetch('/admin/routing-state', request); + const handled = this.handleFetchResponse(res, 'routing state update', request); + if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { + return false; + } + if (!handled) { + return false; + } + await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]); + return true; + } catch (e) { + console.error('Failed to update routing state:', e); + return false; + } },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 489 - 500, submitRoutingStateChange currently calls fetch('/admin/routing-state', ...) without catching network exceptions, so any thrown error will bubble out; wrap the fetch/response handling in a try/catch inside submitRoutingStateChange, call this.handleFetchResponse as before and run the existing isStaleAuthFetchResult check, but on any thrown error (or if handleFetchResponse indicates failure) return false instead of letting the exception propagate; ensure the catch logs or handles the error appropriately and returns false so callers always get a boolean result.
246-265:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winHandle routing-pool 503s the same way as routing-state 503s.
A 503 from
/admin/routing/model-poolscurrently falls through the generic path, soroutingStateAvailablestays true and the UI can keep stale pool-derived rows.🔧 Minimal fix
async fetchRoutingPools() { try { const request = this.adminRequestOptions(); const res = await fetch('/admin/routing/model-pools', request); + if (res.status === 503) { + this.routingStateAvailable = false; + this.routingPools = []; + this.syncDisplayModels(); + return; + } const handled = this.handleFetchResponse(res, 'routing model pools', request); if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { return; } if (!handled) { @@ } catch (e) { console.error('Failed to fetch routing pools:', e); this.routingPools = []; + this.syncDisplayModels(); } },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 246 - 265, In fetchRoutingPools(), add an explicit check for a 503 response from the /admin/routing/model-pools call and handle it like the routing-state 503: if res.status === 503 set this.routingStateAvailable = false, set this.routingPools = [], call this.syncDisplayModels(), and return early; place this check after awaiting fetch/res and before parsing the body so stale pool-derived rows are cleared. Reference: method fetchRoutingPools, properties routingStateAvailable and routingPools, and helper syncDisplayModels.
523-529:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInvert the canonical toggle value.
When
row.canonical_enabledistrue, this still sendsenabled: true, so the admin toggle cannot disable an active canonical model.🔧 Minimal fix
async toggleCanonicalModelEnabled(row) { if (!row || !row.canonical_model) return; await this.submitRoutingStateChange({ kind: 'canonical_model', canonical_model: row.canonical_model, - enabled: !(row.canonical_enabled === false) + enabled: !row.canonical_enabled }); },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 523 - 529, The toggleCanonicalModelEnabled function is sending the wrong enabled value: change the enabled calculation so it inverts row.canonical_enabled (e.g., set enabled to !row.canonical_enabled) before calling submitRoutingStateChange; locate the toggleCanonicalModelEnabled method and replace the current enabled: !(row.canonical_enabled === false) expression with an explicit inversion of row.canonical_enabled so toggling an active canonical_model will disable it and vice versa.
237-243:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winSync derived rows after routing-state updates.
routingStateViewschanges here, butsyncDisplayModels()is never called afterward, sogroupDisplayModels()andproviderRoutingEnabled()can keep showing stale routing state until some other fetch happens.🔧 Minimal fix
this.routingStateAvailable = true; const payload = await res.json(); this.routingStateViews = Array.isArray(payload) ? payload : []; + this.syncDisplayModels(); } catch (e) { console.error('Failed to fetch routing state:', e); this.routingStateViews = []; + this.syncDisplayModels(); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 237 - 243, routingStateViews is updated in the fetch block but syncDisplayModels() is never called, leaving groupDisplayModels() and providerRoutingEnabled() with stale data; after assigning this.routingStateViews = Array.isArray(payload) ? payload : [] in the try block and after this.routingStateViews = [] in the catch block, call this.syncDisplayModels() so derived rows and routing-dependent computed state are recalculated immediately (refer to syncDisplayModels(), groupDisplayModels(), providerRoutingEnabled(), and the routingStateViews assignment site).internal/gateway/fallback.go (1)
57-72:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard canonical failover policy access.
Both fallback paths call
o.failoverPolicy.ShouldAttempt(...)and readMaxAttemptswhenever a canonical model is active. If the orchestrator is built without a policy, this panics instead of returning the primary error.🔧 Minimal fix
shouldAttempt := ShouldAttemptFallback(primaryErr) - if workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" { + if workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" && o.failoverPolicy != nil { shouldAttempt = o.failoverPolicy.ShouldAttempt(primaryErr) } @@ - canonicalPool := workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" + canonicalPool := workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" && o.failoverPolicy != nilAlso applies to: 159-174
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/gateway/fallback.go` around lines 57 - 72, The code dereferences o.failoverPolicy (calling o.failoverPolicy.ShouldAttempt(...) and reading o.failoverPolicy.MaxAttempts) when a canonical model is active, which can panic if failoverPolicy is nil; update the logic in the canonical-path checks (around ShouldAttemptFallback usage and where canonicalPool is set) to first test o.failoverPolicy != nil before calling ShouldAttempt or accessing MaxAttempts so that if failoverPolicy is nil you keep the original shouldAttempt value and do not apply the attempts-limit break; apply the same nil-guard pattern to the second occurrence of this logic (the similar block at lines 159-174) to ensure the function returns the primary error instead of panicking when no policy is configured.internal/routing/pool_evaluator.go (1)
112-118:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not count unhealthy candidates as available.
selectablealready excludesruntime.Status == "unhealthy", butavailableCountstill includes those candidates. That lets the pool reportenabledeven when one or more candidates are runtime-ineligible.🔧 Minimal fix
- if effectiveEnabled { + if effectiveEnabled && runtime.Status != "unhealthy" { availableCount++ }Also applies to: 137-145
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/pool_evaluator.go` around lines 112 - 118, The code increments availableCount whenever effectiveEnabled is true but ignores runtime.Status, so unhealthy runtimes are counted as available; change the increment to only occur when the candidate is selectable (i.e., use the existing selectable condition or check effectiveEnabled && runtime.Status != "unhealthy") — replace "if effectiveEnabled { availableCount++ }" with "if selectable { availableCount++ }" (and make the same change in the other similar block where availableCount is incremented, identified by the same variables selectable, effectiveEnabled, runtime.Status, availableCount, selectableCount).internal/routing/exposed_models.go (1)
59-64:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove the impossible nil branch.
chosenis always non-nil because it points to a localclone, and this is already tripping SA4031 in CI.🔧 Minimal fix
clone := *model clone.ID = canonical - chosen := &clone - if chosen != nil { - models = append(models, *chosen) - } + models = append(models, clone) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/exposed_models.go` around lines 59 - 64, The code creates a local clone (clone := *model), takes its address into chosen (chosen := &clone) and then guards with an impossible nil check (if chosen != nil) before appending; remove that branch and append the clone (or *chosen) directly to models. Update the block around clone, chosen and models (remove the if chosen != nil check and the chosen variable if you prefer) so you simply set clone.ID = canonical and append clone (or *&clone) to models.internal/admin/handler_routing_state.go (1)
75-77:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMatch pool-candidate responses using
canonical_modelas part of identity.
provider_name + modelis ambiguous across pools; includecanonical_modelto avoid returning the wrong updated entry.Targeted fix
case routingstate.KindPoolCandidate: - if current.ProviderName == entry.ProviderName && current.Model == entry.Model { + if current.ProviderName == entry.ProviderName && + current.Model == entry.Model && + current.CanonicalModel == entry.CanonicalModel { return c.JSON(http.StatusOK, current) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/handler_routing_state.go` around lines 75 - 77, The match for routingstate.KindPoolCandidate currently only compares provider and model (current.ProviderName == entry.ProviderName && current.Model == entry.Model), which can return wrong entries across pools; update the identity check to include the canonical model as well (e.g., compare current.CanonicalModel == entry.CanonicalModel in the same conditional) so the case for routingstate.KindPoolCandidate only returns when ProviderName, Model, and CanonicalModel all match the incoming entry.internal/routing/resolver.go (1)
61-66:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winSynchronize access to
Resolver.runtimeon both write and read paths.
WithRuntimewrites shared state unsafely, whileResolveWithContextreads it on the request path before locking. This can race under concurrent runtime refreshes.Minimal thread-safe fix
func (r *Resolver) WithRuntime(runtime RuntimeSnapshotProvider) *Resolver { if r == nil { return nil } + r.mu.Lock() + defer r.mu.Unlock() r.runtime = runtime return r } @@ - evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(r.runtime)) + r.mu.Lock() + runtime := r.runtime + r.mu.Unlock() + evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(runtime))Also applies to: 97-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/resolver.go` around lines 61 - 66, The write to Resolver.runtime in WithRuntime is not synchronized with reads in ResolveWithContext; modify WithRuntime to take the Resolver's mutex (e.g., r.mu.Lock()/Unlock()) around the assignment, and also ensure ResolveWithContext reads r.runtime under the same mutex (e.g., r.mu.RLock()/RUnlock()) before using it (or promote to full Lock if a mutex type is different); if a mutex field does not exist on Resolver, add a sync.RWMutex named mu and use it consistently in WithRuntime and ResolveWithContext to prevent races.internal/app/app.go (1)
278-293:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClose
routingStateduring normalShutdownto complete lifecycle symmetry.
routingStateis initialized and closed on some startup-failure paths, but it is still not closed in the normal shutdown sequence.Minimal shutdown addition
// 6. Close model pricing overrides subsystem. if a.pricingOverrides != nil { if err := a.pricingOverrides.Close(); err != nil { slog.Error("model pricing overrides close error", "error", err) errs = append(errs, fmt.Errorf("model pricing overrides close: %w", err)) } } + + // 7. Close routing state subsystem. + if a.routingState != nil { + if err := a.routingState.Close(); err != nil { + slog.Error("routing state close error", "error", err) + errs = append(errs, fmt.Errorf("routing state close: %w", err)) + } + } - // 7. Close reusable guardrails subsystem. + // 8. Close reusable guardrails subsystem.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/app.go` around lines 278 - 293, The routingState created as routingstate.Result and assigned to app.routingState is not being closed on normal shutdown; update the shutdown/Close/Shutdown method to include closing app.routingState (nil-check first) so its Close() is called during normal lifecycle termination, and add it to the errors.Join call (or equivalent aggregated close handling) alongside app.pricingOverrides.Close(), app.modelOverrides.Close(), app.aliases.Close(), etc., to ensure symmetric initialization/teardown of routingState.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/admin/handler_routing_pools.go`:
- Around line 82-83: The loop over h.routingConfig.ModelPools yields
nondeterministic ordering; before iterating to build responses, collect the map
keys (canonical), sort them (e.g., using sort.Strings), then iterate over the
sorted slice and use each canonical to look up poolCfg from
h.routingConfig.ModelPools when constructing responses—this makes the pool
listing deterministic while keeping existing logic that uses canonicalName and
poolCfg.
In `@internal/gateway/request_model_resolution.go`:
- Around line 112-121: The copy of canonical metadata into the
RequestModelResolution struct is missing the FailoverUsed and FallbackTarget
fields; update the block that assigns from canonicalResolution to resolution
(the if canonicalResolution != nil { ... } block in request_model_resolution.go)
to also set resolution.FailoverUsed = canonicalResolution.FailoverUsed and
resolution.FallbackTarget = canonicalResolution.FallbackTarget so these failover
outcome details are propagated downstream.
In `@internal/routingstate/service_test.go`:
- Around line 24-47: Convert the two related tests into a single table-driven
test that iterates test cases describing initial store entries (using
memoryStore, Upsert, and Entry with KindProvider/KindPoolCandidate,
ProviderName, Model, Enabled), the candidate list (routing.Candidate slice), the
target model, and the expected results for both FilterCandidates and
CandidateEnabled; for each case call NewService, apply Upsert entries, run
service.FilterCandidates(targetModel, candidates) and assert the filtered
result, and also call service.CandidateEnabled(core.ModelSelector{Provider: ...,
Model: ...}) to assert the default/enabled behavior—this consolidates
TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue into table-driven subtests.
In `@internal/routingstate/service.go`:
- Around line 113-130: StartBackgroundRefresh currently calls s.Refresh with
context.Background(), making refresh calls uninterruptible; change it to create
a cancelable context (e.g. ctx, cancel :=
context.WithCancel(context.Background())) that the goroutine uses when calling
s.Refresh(ctx) and ensure the returned stop function calls cancel() (and closes
the stop channel or signals the goroutine) so in-flight Refresh calls are
canceled promptly; update the goroutine to respect context cancellation when
invoking Refresh and to call cancel/cleanup when exiting.
- Around line 172-190: FilterCandidates currently queries
s.CanonicalModelEnabled, s.ProviderEnabled, and s.CandidateEnabled separately
which can interleave with Refresh; change it to take one consistent
routing-state snapshot under the Service lock and use that snapshot for all
checks. Concretely: add or use a snapshot-returning helper (e.g. a new method
like Service.SnapshotRoutingState or similar) that reads canonical, provider and
candidate state while holding the lock, then modify FilterCandidates to call
that snapshot method once and call snapshot.CanonicalModelEnabled,
snapshot.ProviderEnabled, and snapshot.CandidateEnabled (or access snapshot
fields) for all decisions so the method uses a single, consistent view of
routing state. Ensure no per-candidate locked helper calls remain.
---
Duplicate comments:
In `@config/config_test.go`:
- Around line 730-734: Add assertions in TestLoad_RoutingConfigYAML to verify
that routing.defaults.failover is parsed correctly: assert that the loaded
config's Routing.Defaults.Failover.Enabled equals true, MaxAttempts equals 5,
RetryOnStatuses contains 429 and 503, and RetryOnModelErrors equals false (use
the test's config loader and the TestLoad_RoutingConfigYAML test function and
the Routing.Defaults.Failover struct fields to locate where to add these
checks); also add similar assertions for the other YAML block referenced around
lines 756-768 to ensure both failover default blocks are covered.
In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 489-500: submitRoutingStateChange currently calls
fetch('/admin/routing-state', ...) without catching network exceptions, so any
thrown error will bubble out; wrap the fetch/response handling in a try/catch
inside submitRoutingStateChange, call this.handleFetchResponse as before and run
the existing isStaleAuthFetchResult check, but on any thrown error (or if
handleFetchResponse indicates failure) return false instead of letting the
exception propagate; ensure the catch logs or handles the error appropriately
and returns false so callers always get a boolean result.
- Around line 246-265: In fetchRoutingPools(), add an explicit check for a 503
response from the /admin/routing/model-pools call and handle it like the
routing-state 503: if res.status === 503 set this.routingStateAvailable = false,
set this.routingPools = [], call this.syncDisplayModels(), and return early;
place this check after awaiting fetch/res and before parsing the body so stale
pool-derived rows are cleared. Reference: method fetchRoutingPools, properties
routingStateAvailable and routingPools, and helper syncDisplayModels.
- Around line 523-529: The toggleCanonicalModelEnabled function is sending the
wrong enabled value: change the enabled calculation so it inverts
row.canonical_enabled (e.g., set enabled to !row.canonical_enabled) before
calling submitRoutingStateChange; locate the toggleCanonicalModelEnabled method
and replace the current enabled: !(row.canonical_enabled === false) expression
with an explicit inversion of row.canonical_enabled so toggling an active
canonical_model will disable it and vice versa.
- Around line 237-243: routingStateViews is updated in the fetch block but
syncDisplayModels() is never called, leaving groupDisplayModels() and
providerRoutingEnabled() with stale data; after assigning this.routingStateViews
= Array.isArray(payload) ? payload : [] in the try block and after
this.routingStateViews = [] in the catch block, call this.syncDisplayModels() so
derived rows and routing-dependent computed state are recalculated immediately
(refer to syncDisplayModels(), groupDisplayModels(), providerRoutingEnabled(),
and the routingStateViews assignment site).
In `@internal/admin/handler_routing_state_test.go`:
- Around line 36-88: Replace the four standalone tests with a single
table-driven test that iterates cases calling the handler methods
(ListRoutingState, UpsertRoutingState, DeleteRoutingState) via the same
newRoutingStateHandler helper and echo context; for each case define name,
method, path body, expected status and expected body (including negative cases:
delete-not-found, malformed JSON, missing enabled field), run as subtests with
t.Run, construct the httptest request/recorder, set Content-Type, invoke the
appropriate handler method, and assert status and response body; ensure cases
cover request translation, response normalization, error handling, default
config and provider-specific parameter mapping (e.g., provider_name
"anthropic_a") so tests replace TestListRoutingState_Empty,
TestUpsertRoutingState_Provider, TestUpsertRoutingState_InvalidMissingEnabled
and TestDeleteRoutingState.
In `@internal/admin/handler_routing_state.go`:
- Around line 75-77: The match for routingstate.KindPoolCandidate currently only
compares provider and model (current.ProviderName == entry.ProviderName &&
current.Model == entry.Model), which can return wrong entries across pools;
update the identity check to include the canonical model as well (e.g., compare
current.CanonicalModel == entry.CanonicalModel in the same conditional) so the
case for routingstate.KindPoolCandidate only returns when ProviderName, Model,
and CanonicalModel all match the incoming entry.
In `@internal/app/app.go`:
- Around line 278-293: The routingState created as routingstate.Result and
assigned to app.routingState is not being closed on normal shutdown; update the
shutdown/Close/Shutdown method to include closing app.routingState (nil-check
first) so its Close() is called during normal lifecycle termination, and add it
to the errors.Join call (or equivalent aggregated close handling) alongside
app.pricingOverrides.Close(), app.modelOverrides.Close(), app.aliases.Close(),
etc., to ensure symmetric initialization/teardown of routingState.
In `@internal/gateway/fallback.go`:
- Around line 57-72: The code dereferences o.failoverPolicy (calling
o.failoverPolicy.ShouldAttempt(...) and reading o.failoverPolicy.MaxAttempts)
when a canonical model is active, which can panic if failoverPolicy is nil;
update the logic in the canonical-path checks (around ShouldAttemptFallback
usage and where canonicalPool is set) to first test o.failoverPolicy != nil
before calling ShouldAttempt or accessing MaxAttempts so that if failoverPolicy
is nil you keep the original shouldAttempt value and do not apply the
attempts-limit break; apply the same nil-guard pattern to the second occurrence
of this logic (the similar block at lines 159-174) to ensure the function
returns the primary error instead of panicking when no policy is configured.
In `@internal/routing/exposed_models_test.go`:
- Around line 33-98: Combine the two tests
TestCanonicalExposedModelLister_UsesEffectiveResolverChoice and
TestCanonicalExposedModelLister_SkipsUnhealthyCandidates into a single
table-driven test that iterates over cases; each case should include name,
routing config, catalog models, provider snapshots, expected count and expected
model fields, and then for each case construct the lister with
NewCanonicalExposedModelLister and call ExposedModels() asserting length and
expected values. Keep references to the original scenarios (weighted vs priority
strategies, provider snapshots with LastModelFetchError vs
LastModelFetchSuccessAt) as separate table entries so future permutations are
simple to add, and reuse any setup helpers to build testCatalog and
staticRuntimeProvider to avoid duplication.
In `@internal/routing/exposed_models.go`:
- Around line 59-64: The code creates a local clone (clone := *model), takes its
address into chosen (chosen := &clone) and then guards with an impossible nil
check (if chosen != nil) before appending; remove that branch and append the
clone (or *chosen) directly to models. Update the block around clone, chosen and
models (remove the if chosen != nil check and the chosen variable if you prefer)
so you simply set clone.ID = canonical and append clone (or *&clone) to models.
In `@internal/routing/pool_evaluator.go`:
- Around line 112-118: The code increments availableCount whenever
effectiveEnabled is true but ignores runtime.Status, so unhealthy runtimes are
counted as available; change the increment to only occur when the candidate is
selectable (i.e., use the existing selectable condition or check
effectiveEnabled && runtime.Status != "unhealthy") — replace "if
effectiveEnabled { availableCount++ }" with "if selectable { availableCount++ }"
(and make the same change in the other similar block where availableCount is
incremented, identified by the same variables selectable, effectiveEnabled,
runtime.Status, availableCount, selectableCount).
In `@internal/routing/resolver.go`:
- Around line 61-66: The write to Resolver.runtime in WithRuntime is not
synchronized with reads in ResolveWithContext; modify WithRuntime to take the
Resolver's mutex (e.g., r.mu.Lock()/Unlock()) around the assignment, and also
ensure ResolveWithContext reads r.runtime under the same mutex (e.g.,
r.mu.RLock()/RUnlock()) before using it (or promote to full Lock if a mutex type
is different); if a mutex field does not exist on Resolver, add a sync.RWMutex
named mu and use it consistently in WithRuntime and ResolveWithContext to
prevent races.
In `@internal/routing/session_affinity.go`:
- Around line 46-49: The current affinity-key construction uses request-scoped
IDs (core.GetRequestID(ctx) stored in variable requestID) which produces mostly
single-use keys and leaks entries; remove the branches that return
canonicalModel + "|request_id|" + requestID so request_id is no longer used as a
fallback. Locate the places that compute requestID (calls to
core.GetRequestID(ctx) and the return expressions that append "|request_id|" to
canonicalModel) and change them to skip to the next stable fallback (or simply
return canonicalModel) so only stable identifiers are used for affinity keys.
In `@internal/routing/strategy.go`:
- Around line 12-20: In the RoutingStrategyPriorityFailover branch (case
config.RoutingStrategyPriorityFailover) guard against an empty pool before
accessing ordered[0]: check if len(pool.Candidates) == 0 (or len(ordered) == 0)
and return a clear error (or nil candidate with an error) instead of indexing;
then proceed to sort and return ordered[0], ordered[1:], nil only when
non-empty. Ensure you update any callers expecting a candidate/error
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: bb57203b-83ae-4abc-b516-b9f8de7b0d37
📒 Files selected for processing (53)
config/config.example.yamlconfig/config.goconfig/config_example_test.goconfig/config_test.goconfig/routing.godocs/features/canonical-routing.mdxinternal/admin/dashboard/static/js/dashboard.jsinternal/admin/dashboard/static/js/modules/aliases.jsinternal/admin/dashboard/static/js/modules/aliases.test.cjsinternal/admin/dashboard/templates/model-table-body.htmlinternal/admin/handler.gointernal/admin/handler_providers.gointernal/admin/handler_routing_pools.gointernal/admin/handler_routing_pools_test.gointernal/admin/handler_routing_state.gointernal/admin/handler_routing_state_test.gointernal/admin/handler_test.gointernal/admin/routes.gointernal/admin/routes_test.gointernal/app/app.gointernal/core/canonical_routing.gointernal/core/request_model_resolution.gointernal/fallback/resolver.gointernal/fallback/resolver_test.gointernal/gateway/fallback.gointernal/gateway/inference_orchestrator.gointernal/gateway/request_model_resolution.gointernal/providers/config_test.gointernal/routing/composed_resolver.gointernal/routing/exposed_models.gointernal/routing/exposed_models_test.gointernal/routing/failover_policy.gointernal/routing/pool_evaluator.gointernal/routing/resolver.gointernal/routing/resolver_test.gointernal/routing/runtime.gointernal/routing/session_affinity.gointernal/routing/strategy.gointernal/routing/types.gointernal/routingstate/factory.gointernal/routingstate/service.gointernal/routingstate/service_test.gointernal/routingstate/store.gointernal/routingstate/store_mongodb.gointernal/routingstate/store_postgresql.gointernal/routingstate/store_sqlite.gointernal/routingstate/types.gointernal/server/fallback_test.gointernal/server/handlers.gointernal/server/http.gointernal/server/internal_chat_completion_executor.gointernal/server/request_model_resolution_test.gointernal/server/translated_inference_service.go
| for canonical, poolCfg := range h.routingConfig.ModelPools { | ||
| canonicalName := strings.TrimSpace(canonical) |
There was a problem hiding this comment.
Make pool listing deterministic.
Iterating h.routingConfig.ModelPools directly makes response order nondeterministic across requests. Sort canonical model keys before building responses.
Suggested minimal fix
import (
"net/http"
+ "sort"
"strings"
@@
- for canonical, poolCfg := range h.routingConfig.ModelPools {
+ keys := make([]string, 0, len(h.routingConfig.ModelPools))
+ for canonical := range h.routingConfig.ModelPools {
+ keys = append(keys, canonical)
+ }
+ sort.Strings(keys)
+
+ for _, canonical := range keys {
+ poolCfg := h.routingConfig.ModelPools[canonical]
canonicalName := strings.TrimSpace(canonical)
if canonicalName == "" {
continue
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for canonical, poolCfg := range h.routingConfig.ModelPools { | |
| canonicalName := strings.TrimSpace(canonical) | |
| keys := make([]string, 0, len(h.routingConfig.ModelPools)) | |
| for canonical := range h.routingConfig.ModelPools { | |
| keys = append(keys, canonical) | |
| } | |
| sort.Strings(keys) | |
| for _, canonical := range keys { | |
| poolCfg := h.routingConfig.ModelPools[canonical] | |
| canonicalName := strings.TrimSpace(canonical) | |
| if canonicalName == "" { | |
| continue | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/admin/handler_routing_pools.go` around lines 82 - 83, The loop over
h.routingConfig.ModelPools yields nondeterministic ordering; before iterating to
build responses, collect the map keys (canonical), sort them (e.g., using
sort.Strings), then iterate over the sorted slice and use each canonical to look
up poolCfg from h.routingConfig.ModelPools when constructing responses—this
makes the pool listing deterministic while keeping existing logic that uses
canonicalName and poolCfg.
| if canonicalResolution != nil { | ||
| resolution.CanonicalModel = canonicalResolution.CanonicalModel | ||
| resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...) | ||
| resolution.RoutingStrategy = string(canonicalResolution.Strategy) | ||
| resolution.ConfigPrimary = canonicalResolution.ConfigPrimary | ||
| resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate | ||
| resolution.SelectedProviderName = canonicalResolution.SelectedProviderName | ||
| resolution.SelectedExactModel = canonicalResolution.SelectedExactModel | ||
| resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...) | ||
| } |
There was a problem hiding this comment.
Propagate FailoverUsed and FallbackTarget into RequestModelResolution.
The canonical metadata copy omits these fields, so downstream logs/diagnostics lose failover outcome details.
💡 Proposed fix
if canonicalResolution != nil {
resolution.CanonicalModel = canonicalResolution.CanonicalModel
resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...)
resolution.RoutingStrategy = string(canonicalResolution.Strategy)
resolution.ConfigPrimary = canonicalResolution.ConfigPrimary
resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate
resolution.SelectedProviderName = canonicalResolution.SelectedProviderName
resolution.SelectedExactModel = canonicalResolution.SelectedExactModel
resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...)
+ resolution.FailoverUsed = canonicalResolution.FailoverUsed
+ resolution.FallbackTarget = canonicalResolution.FallbackTarget
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if canonicalResolution != nil { | |
| resolution.CanonicalModel = canonicalResolution.CanonicalModel | |
| resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...) | |
| resolution.RoutingStrategy = string(canonicalResolution.Strategy) | |
| resolution.ConfigPrimary = canonicalResolution.ConfigPrimary | |
| resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate | |
| resolution.SelectedProviderName = canonicalResolution.SelectedProviderName | |
| resolution.SelectedExactModel = canonicalResolution.SelectedExactModel | |
| resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...) | |
| } | |
| if canonicalResolution != nil { | |
| resolution.CanonicalModel = canonicalResolution.CanonicalModel | |
| resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...) | |
| resolution.RoutingStrategy = string(canonicalResolution.Strategy) | |
| resolution.ConfigPrimary = canonicalResolution.ConfigPrimary | |
| resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate | |
| resolution.SelectedProviderName = canonicalResolution.SelectedProviderName | |
| resolution.SelectedExactModel = canonicalResolution.SelectedExactModel | |
| resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...) | |
| resolution.FailoverUsed = canonicalResolution.FailoverUsed | |
| resolution.FallbackTarget = canonicalResolution.FallbackTarget | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/gateway/request_model_resolution.go` around lines 112 - 121, The
copy of canonical metadata into the RequestModelResolution struct is missing the
FailoverUsed and FallbackTarget fields; update the block that assigns from
canonicalResolution to resolution (the if canonicalResolution != nil { ... }
block in request_model_resolution.go) to also set resolution.FailoverUsed =
canonicalResolution.FailoverUsed and resolution.FallbackTarget =
canonicalResolution.FallbackTarget so these failover outcome details are
propagated downstream.
| func TestServiceFilterCandidatesHonorsManualDisable(t *testing.T) { | ||
| store := &memoryStore{} | ||
| service, err := NewService(store) | ||
| if err != nil { t.Fatalf("NewService() error = %v", err) } | ||
| if err := service.Upsert(context.Background(), Entry{Kind: KindProvider, ProviderName: "anthropic_a", Enabled: false}); err != nil { | ||
| t.Fatalf("Upsert provider state error = %v", err) | ||
| } | ||
| if err := service.Upsert(context.Background(), Entry{Kind: KindPoolCandidate, ProviderName: "anthropic_b", Model: "claude-sonnet-4-6", Enabled: false}); err != nil { | ||
| t.Fatalf("Upsert candidate state error = %v", err) | ||
| } | ||
| candidates := []routing.Candidate{{Provider: "anthropic_a", Model: "claude-sonnet-4-6"}, {Provider: "anthropic_b", Model: "claude-sonnet-4-6"}, {Provider: "anthropic_c", Model: "claude-sonnet-4-6"}} | ||
| filtered := service.FilterCandidates("claude-sonnet-4-6", candidates) | ||
| if len(filtered) != 1 || filtered[0].Provider != "anthropic_c" { | ||
| t.Fatalf("filtered = %+v, want only anthropic_c", filtered) | ||
| } | ||
| } | ||
|
|
||
| func TestServiceCandidateEnabledDefaultsTrue(t *testing.T) { | ||
| service, err := NewService(&memoryStore{}) | ||
| if err != nil { t.Fatalf("NewService() error = %v", err) } | ||
| if !service.CandidateEnabled(core.ModelSelector{Provider: "anthropic_b", Model: "claude-opus-4-7"}) { | ||
| t.Fatal("expected unspecified candidate to default to enabled") | ||
| } | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Prefer a table-driven shape for these related candidate-state tests.
These two cases exercise the same behavior surface (FilterCandidates / CandidateEnabled) with overlapping setup; converting to table-driven cases will keep future routing-state scenarios easier to add and maintain.
As per coding guidelines: “**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routingstate/service_test.go` around lines 24 - 47, Convert the two
related tests into a single table-driven test that iterates test cases
describing initial store entries (using memoryStore, Upsert, and Entry with
KindProvider/KindPoolCandidate, ProviderName, Model, Enabled), the candidate
list (routing.Candidate slice), the target model, and the expected results for
both FilterCandidates and CandidateEnabled; for each case call NewService, apply
Upsert entries, run service.FilterCandidates(targetModel, candidates) and assert
the filtered result, and also call
service.CandidateEnabled(core.ModelSelector{Provider: ..., Model: ...}) to
assert the default/enabled behavior—this consolidates
TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue into table-driven subtests.
| func (s *Service) StartBackgroundRefresh(interval time.Duration) func() { | ||
| if interval <= 0 { | ||
| interval = time.Minute | ||
| } | ||
| stop := make(chan struct{}) | ||
| go func() { | ||
| ticker := time.NewTicker(interval) | ||
| defer ticker.Stop() | ||
| for { | ||
| select { | ||
| case <-ticker.C: | ||
| _ = s.Refresh(context.Background()) | ||
| case <-stop: | ||
| return | ||
| } | ||
| } | ||
| }() | ||
| return func() { close(stop) } |
There was a problem hiding this comment.
Do not refresh with an uncancelable context.
StartBackgroundRefresh always uses context.Background(), so a slow or hung store call cannot be interrupted and the returned stop function cannot stop the goroutine until that I/O returns.
🔧 Minimal fix
go func() {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
- _ = s.Refresh(context.Background())
+ refreshCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+ _ = s.Refresh(refreshCtx)
+ cancel()
case <-stop:
return
}
}
}()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routingstate/service.go` around lines 113 - 130,
StartBackgroundRefresh currently calls s.Refresh with context.Background(),
making refresh calls uninterruptible; change it to create a cancelable context
(e.g. ctx, cancel := context.WithCancel(context.Background())) that the
goroutine uses when calling s.Refresh(ctx) and ensure the returned stop function
calls cancel() (and closes the stop channel or signals the goroutine) so
in-flight Refresh calls are canceled promptly; update the goroutine to respect
context cancellation when invoking Refresh and to call cancel/cleanup when
exiting.
| func (s *Service) FilterCandidates(canonical string, candidates []routing.Candidate) []routing.Candidate { | ||
| if s == nil { | ||
| return append([]routing.Candidate(nil), candidates...) | ||
| } | ||
| if !s.CanonicalModelEnabled(canonical) { | ||
| return nil | ||
| } | ||
| filtered := make([]routing.Candidate, 0, len(candidates)) | ||
| for _, candidate := range candidates { | ||
| selector := core.ModelSelector{Provider: candidate.Provider, Model: candidate.Model} | ||
| if !s.ProviderEnabled(candidate.Provider) { | ||
| continue | ||
| } | ||
| if !s.CandidateEnabled(selector) { | ||
| continue | ||
| } | ||
| filtered = append(filtered, candidate) | ||
| } | ||
| return filtered |
There was a problem hiding this comment.
Use one snapshot for FilterCandidates.
This method reads canonical, provider, and candidate state through separate locked helpers, so a concurrent Refresh can mix old and new routing state within one request. That can still route a candidate after the canonical model was just disabled.
🔧 Minimal fix
func (s *Service) FilterCandidates(canonical string, candidates []routing.Candidate) []routing.Candidate {
if s == nil {
return append([]routing.Candidate(nil), candidates...)
}
- if !s.CanonicalModelEnabled(canonical) {
+
+ s.mu.RLock()
+ snap := s.snapshot
+ s.mu.RUnlock()
+
+ if entry, ok := snap.canonicalModels[strings.TrimSpace(canonical)]; ok && !entry.Enabled {
return nil
}
filtered := make([]routing.Candidate, 0, len(candidates))
for _, candidate := range candidates {
- selector := core.ModelSelector{Provider: candidate.Provider, Model: candidate.Model}
- if !s.ProviderEnabled(candidate.Provider) {
+ provider := strings.TrimSpace(candidate.Provider)
+ model := strings.TrimSpace(candidate.Model)
+ if entry, ok := snap.providers[provider]; ok && !entry.Enabled {
continue
}
- if !s.CandidateEnabled(selector) {
+ if entry, ok := snap.candidates[provider+"/"+model]; ok && !entry.Enabled {
continue
}
filtered = append(filtered, candidate)
}
return filtered
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routingstate/service.go` around lines 172 - 190, FilterCandidates
currently queries s.CanonicalModelEnabled, s.ProviderEnabled, and
s.CandidateEnabled separately which can interleave with Refresh; change it to
take one consistent routing-state snapshot under the Service lock and use that
snapshot for all checks. Concretely: add or use a snapshot-returning helper
(e.g. a new method like Service.SnapshotRoutingState or similar) that reads
canonical, provider and candidate state while holding the lock, then modify
FilterCandidates to call that snapshot method once and call
snapshot.CanonicalModelEnabled, snapshot.ProviderEnabled, and
snapshot.CandidateEnabled (or access snapshot fields) for all decisions so the
method uses a single, consistent view of routing state. Ensure no per-candidate
locked helper calls remain.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (12)
internal/routingstate/service_test.go (1)
24-47: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winConsider table-driven tests for these related candidate-state scenarios.
Both tests exercise
Servicebehavior around candidate filtering and enabled/disabled state. A table-driven approach would consolidate setup, make future test cases easier to add, and align with the project's testing guidelines.As per coding guidelines: "
**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routingstate/service_test.go` around lines 24 - 47, The two tests for candidate state should be converted into a single table-driven test to consolidate related scenarios: create a slice of test cases each containing a name, initial store entries (e.g., entries for KindProvider/KindPoolCandidate with Enabled true/false), the model string, input candidates, and the expected filtered result or expected CandidateEnabled boolean; then loop over cases and for each case call NewService (with memoryStore), apply Upsert for provided entries, and assert behavior via Service.FilterCandidates and/or Service.CandidateEnabled. Locate and update the existing TestServiceFilterCandidatesHonorsManualDisable and TestServiceCandidateEnabledDefaultsTrue functions (and usage of NewService, memoryStore, Service.FilterCandidates, Service.CandidateEnabled) to implement this table-driven approach so future cases can be added easily.internal/admin/dashboard/static/js/modules/aliases.js (4)
238-243:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRouting-state refresh does not update rendered rows.
After assigning
routingStateViewsat Line 239 (and in the catch path at Line 242),syncDisplayModels()is not called, so routing badges/toggles can stay stale.🔧 Minimal fix
this.routingStateAvailable = true; const payload = await res.json(); this.routingStateViews = Array.isArray(payload) ? payload : []; + this.syncDisplayModels(); } catch (e) { console.error('Failed to fetch routing state:', e); this.routingStateViews = []; + this.syncDisplayModels(); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 238 - 243, The routing-state update assigns this.routingStateViews but never refreshes the rendered rows; after setting this.routingStateViews in both the try path (after const payload = await res.json()) and the catch path (where this.routingStateViews = []), call this.syncDisplayModels() to re-sync UI state (i.e., add this.syncDisplayModels() immediately after both assignments to ensure badges/toggles update).
523-529:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCanonical toggle computes the wrong next enabled state.
At Line 528,
!(row.canonical_enabled === false)keepstrueastrueinstead of toggling it off.🔧 Minimal fix
await this.submitRoutingStateChange({ kind: 'canonical_model', canonical_model: row.canonical_model, - enabled: !(row.canonical_enabled === false) + enabled: !row.canonical_enabled });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 523 - 529, The toggleCanonicalModelEnabled handler computes the next enabled state incorrectly; replace the expression !(row.canonical_enabled === false) with a proper boolean toggle such as !Boolean(row.canonical_enabled) (or simply !row.canonical_enabled) when calling submitRoutingStateChange so the canonical_enabled value actually flips; update the enabled property passed to submitRoutingStateChange in toggleCanonicalModelEnabled to use that corrected expression.
489-501:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
submitRoutingStateChangecan throw unhandled network errors.This path has no
try/catch, unlike other submit helpers. A failedfetch()rejects and escapes without returningfalse.🔧 Minimal fix
async submitRoutingStateChange(payload) { - const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) }); - const res = await fetch('/admin/routing-state', request); - const handled = this.handleFetchResponse(res, 'routing state update', request); - if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { - return false; - } - if (!handled) { - return false; - } - await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]); - return true; + try { + const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) }); + const res = await fetch('/admin/routing-state', request); + const handled = this.handleFetchResponse(res, 'routing state update', request); + if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) { + return false; + } + if (!handled) { + return false; + } + await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]); + return true; + } catch (e) { + console.error('Failed to update routing state:', e); + return false; + } },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 489 - 501, submitRoutingStateChange currently performs fetch() without a try/catch so network errors can throw and escape; wrap the body of submitRoutingStateChange in a try/catch, catch any errors from fetch() or subsequent awaits (including handleFetchResponse, fetchRoutingState, fetchRoutingPools), log or handle the error as appropriate, and return false on error so the caller gets a safe failure value; keep existing checks for isStaleAuthFetchResult(handled) and the normal success path that returns true.
246-257:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
fetchRoutingPoolsmisses explicit 503 handling parity.
fetchRoutingState()marks routing state as unavailable on 503, butfetchRoutingPools()does not, which can leave availability state inconsistent.🔧 Minimal fix
async fetchRoutingPools() { try { const request = this.adminRequestOptions(); const res = await fetch('/admin/routing/model-pools', request); + if (res.status === 503) { + this.routingStateAvailable = false; + this.routingPools = []; + this.syncDisplayModels(); + return; + } const handled = this.handleFetchResponse(res, 'routing model pools', request);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 246 - 257, fetchRoutingPools currently omits explicit 503 handling and can leave routing availability inconsistent; update the async fetchRoutingPools method to mirror fetchRoutingState by checking the response status (res.status === 503) after handleFetchResponse and setting this.routingUnavailable = true (and clearing this.routingPools / calling this.syncDisplayModels() as needed) then return; also ensure on the success path you set this.routingUnavailable = false so availability is updated when pools load successfully; use the existing res, handled, and this.routingPools references in fetchRoutingPools to implement this parity.internal/routing/runtime.go (1)
35-47:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvailability errors can still be classified as healthy.
At Line 35 and Line 46, a provider with discovered models and non-empty
LastAvailabilityErrorcan still return"healthy".🔧 Suggested adjustment
switch { + case availabilityError != "" && snapshot.DiscoveredModelCount > 0: + return "degraded" case snapshot.Registered && snapshot.DiscoveredModelCount > 0 && modelFetchError == "": if snapshot.UsingCachedModels && snapshot.LastModelFetchSuccessAt == nil { return "degraded" } return "healthy" @@ - case snapshot.DiscoveredModelCount > 0: + case snapshot.DiscoveredModelCount > 0 && availabilityError == "": return "healthy"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/runtime.go` around lines 35 - 47, The current health logic in the switch (using snapshot.Registered, snapshot.DiscoveredModelCount, modelFetchError, availabilityError, snapshot.UsingCachedModels, snapshot.LastModelFetchSuccessAt) can still return "healthy" when availabilityError (LastAvailabilityError) is non-empty; update the conditions that return "healthy" to also require availabilityError == "" (e.g., the branches that check snapshot.DiscoveredModelCount > 0 and modelFetchError == "" and the final snapshot.DiscoveredModelCount > 0 branch), and ensure branches that handle availabilityError != "" move to "degraded" or "unhealthy" as intended so availability errors always prevent a "healthy" result.config/routing.go (1)
88-90:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate
retry_on_statusesvalues during config load.Invalid codes (e.g., 0, negative, >599) are currently accepted and can break retry matching behavior.
🔧 Minimal fix
if len(cfg.Defaults.Failover.RetryOnStatuses) == 0 { cfg.Defaults.Failover.RetryOnStatuses = []int{429, 500, 502, 503, 504} } + for idx, status := range cfg.Defaults.Failover.RetryOnStatuses { + if status < 100 || status > 599 { + return fmt.Errorf("routing.defaults.failover.retry_on_statuses[%d] must be a valid HTTP status code (100-599)", idx) + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@config/routing.go` around lines 88 - 90, The RetryOnStatuses slice (cfg.Defaults.Failover.RetryOnStatuses) accepts invalid HTTP codes (0, negative, >599) which can break retry logic; add validation during config load to either reject the config with a clear error or sanitize the slice by filtering out values outside 100..599 (or specifically 100–599 inclusive) and duplicate values, e.g., in the config-loading/validation function that constructs cfg.Defaults.Failover, check each status in RetryOnStatuses and return a descriptive error if any are out of range (or remove invalid entries and log a warning) so only valid HTTP status codes remain.internal/routing/strategy.go (1)
12-20:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winAdd empty-candidate guard before
ordered[0].Line 20 panics when
pool.Candidatesis empty inpriority_failover.🔧 Minimal fix
func selectCandidates(strategy config.RoutingStrategy, pool Pool, counters map[string]int) (Candidate, []Candidate, error) { + if len(pool.Candidates) == 0 { + return Candidate{}, nil, fmt.Errorf("pool %q has no candidates", pool.CanonicalModel) + } + switch strategy { case config.RoutingStrategyPriorityFailover: ordered := append([]Candidate(nil), pool.Candidates...)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/strategy.go` around lines 12 - 20, Add an empty-candidate guard before accessing ordered[0] in the config.RoutingStrategyPriorityFailover branch: after building ordered := append([]Candidate(nil), pool.Candidates...) check if len(ordered) == 0 and return the function's zero-value candidate(s) and a descriptive error (e.g. fmt.Errorf("priority_failover: no candidates available")) instead of indexing into an empty slice; keep existing behavior for non-empty/one-element slices so ordered[1:] continues to work.internal/routing/session_affinity.go (1)
46-49:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAffinity map can grow unbounded with request-id keys.
Using request-scoped keys at Line 46 while pruning only on
Get(Line 67) meansPut-heavy flows can accumulate expired entries indefinitely.🔧 Minimal mitigation
func affinityKey(ctx context.Context, canonicalModel string) string { @@ - requestID := strings.TrimSpace(core.GetRequestID(ctx)) - if requestID != "" { - return canonicalModel + "|request_id|" + requestID - } return "" }Also applies to: 67-69, 74-84
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/session_affinity.go` around lines 46 - 49, The affinity map grows unbounded because you create keys using request-scoped values (core.GetRequestID(ctx) → requestID → canonicalModel+"|request_id|"+requestID) but only prune on Get; update the Put path (and any code using the affinity map) to either skip adding request_id-scoped keys entirely or enforce pruning/TTL when inserting so ephemeral request IDs cannot accumulate. Specifically, modify the code paths that build keys from requestID (the block using core.GetRequestID(ctx) and the functions/methods that call Put/Get on the affinity map) to: 1) avoid persisting request-scoped keys into the map or 2) run the same expiry/pruning logic on Put as is run in Get (or attach a TTL eviction) so expired request_id entries are removed on insert. Ensure references to canonicalModel, requestID, Get, and Put are updated accordingly.internal/admin/handler_routing_state.go (1)
75-77:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInclude
canonical_modelin pool-candidate response matching.
KindPoolCandidatematching is ambiguous with onlyprovider_name + model; it can return the wrong updated row when that pair exists in multiple canonical pools.Proposed minimal fix
case routingstate.KindPoolCandidate: - if current.ProviderName == entry.ProviderName && current.Model == entry.Model { + if current.ProviderName == entry.ProviderName && + current.Model == entry.Model && + current.CanonicalModel == entry.CanonicalModel { return c.JSON(http.StatusOK, current) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/admin/handler_routing_state.go` around lines 75 - 77, The KindPoolCandidate match in the switch (case routingstate.KindPoolCandidate) currently only compares current.ProviderName == entry.ProviderName && current.Model == entry.Model and must also compare canonical_model; update the matching condition to include current.CanonicalModel == entry.CanonicalModel (or the struct field name CanonicalModel) so the handler returns current only when provider, model, and canonical_model all match; ensure any tests or callers that rely on this behavior are adjusted if necessary.internal/routing/resolver.go (1)
61-67:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSynchronize
Resolver.runtimereads/writes to avoid race conditions.
WithRuntimewritesr.runtimeunsafely, whileResolveWithContextreads it on request handling path; this can race under concurrent updates.Proposed minimal fix
func (r *Resolver) WithRuntime(runtime RuntimeSnapshotProvider) *Resolver { if r == nil { return nil } + r.mu.Lock() + defer r.mu.Unlock() r.runtime = runtime return r } @@ - evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(r.runtime)) + r.mu.Lock() + runtime := r.runtime + r.mu.Unlock() + evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(runtime))Also applies to: 97-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/routing/resolver.go` around lines 61 - 67, Resolver.runtime is being written in WithRuntime without synchronization while ResolveWithContext reads it concurrently; add synchronization by introducing a mutex on Resolver (e.g., mu sync.RWMutex) and use mu.Lock()/mu.Unlock() in WithRuntime and mu.RLock()/mu.RUnlock() around reads in ResolveWithContext (and any other accessors/setters such as the other writer at the referenced location), ensuring all reads/writes to the runtime field are guarded.internal/app/app.go (1)
55-55:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClose
routingStateduring normalShutdownas well.
routingStateis now a managed subsystem, butShutdowndoes not close it; this can leak store/refresh resources on normal app termination.Proposed minimal fix
// 7. Close reusable guardrails subsystem. if a.guardrails != nil { if err := a.guardrails.Close(); err != nil { slog.Error("guardrails close error", "error", err) errs = append(errs, fmt.Errorf("guardrails close: %w", err)) } } + + // 8. Close routing state subsystem. + if a.routingState != nil { + if err := a.routingState.Close(); err != nil { + slog.Error("routing state close error", "error", err) + errs = append(errs, fmt.Errorf("routing state close: %w", err)) + } + } - // 8. Close managed auth keys subsystem. + // 9. Close managed auth keys subsystem. if a.authKeys != nil {Also applies to: 278-293
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/app.go` at line 55, routingState (the *routingstate.Result field) is not being closed during normal Shutdown, causing resource leaks; update the app Shutdown method (and any shutdown/cleanup path that currently handles other subsystems) to call routingState.Close() (or routingState.Shutdown() if that API exists) and handle/log any error, and also ensure the same closure is applied in the alternate shutdown/cleanup block that mirrors lines 278-293 so routingState is always closed on normal termination.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/routing/exposed_models.go`:
- Around line 103-114: The fallback branch in
CombinedExposedModelLister.ExposedModelsFiltered bypasses the provided allow
filter by appending lister.ExposedModels() directly; update that branch to
iterate the returned models from lister.ExposedModels() and apply the
allow(core.ModelSelector) predicate to each model before appending so only
permitted models are included (mirror behavior of
serverFilteredExposedModelLister.ExposedModelsFiltered). Ensure you reference
the existing symbols CombinedExposedModelLister, ExposedModelsFiltered,
serverFilteredExposedModelLister, ExposedModels, and the allow parameter when
implementing the per-model check.
In `@internal/routing/failover_policy.go`:
- Around line 51-59: The 5xx/429 retry check is wrongly guarded by the
RetryOnModelErrors early return: in function(s) using the RetryOnModelErrors
boolean (see symbol RetryOnModelErrors) move or duplicate the HTTP status check
so that server-error/429 logic runs regardless of RetryOnModelErrors;
specifically, ensure the block that checks status >=
http.StatusInternalServerError || status == http.StatusTooManyRequests is
evaluated independently (or before) the early return, and keep the model-error
string checks gated by RetryOnModelErrors; also ensure RetryOnStatuses (or
similar configured status list) is consulted for explicit overrides.
---
Duplicate comments:
In `@config/routing.go`:
- Around line 88-90: The RetryOnStatuses slice
(cfg.Defaults.Failover.RetryOnStatuses) accepts invalid HTTP codes (0, negative,
>599) which can break retry logic; add validation during config load to either
reject the config with a clear error or sanitize the slice by filtering out
values outside 100..599 (or specifically 100–599 inclusive) and duplicate
values, e.g., in the config-loading/validation function that constructs
cfg.Defaults.Failover, check each status in RetryOnStatuses and return a
descriptive error if any are out of range (or remove invalid entries and log a
warning) so only valid HTTP status codes remain.
In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 238-243: The routing-state update assigns this.routingStateViews
but never refreshes the rendered rows; after setting this.routingStateViews in
both the try path (after const payload = await res.json()) and the catch path
(where this.routingStateViews = []), call this.syncDisplayModels() to re-sync UI
state (i.e., add this.syncDisplayModels() immediately after both assignments to
ensure badges/toggles update).
- Around line 523-529: The toggleCanonicalModelEnabled handler computes the next
enabled state incorrectly; replace the expression !(row.canonical_enabled ===
false) with a proper boolean toggle such as !Boolean(row.canonical_enabled) (or
simply !row.canonical_enabled) when calling submitRoutingStateChange so the
canonical_enabled value actually flips; update the enabled property passed to
submitRoutingStateChange in toggleCanonicalModelEnabled to use that corrected
expression.
- Around line 489-501: submitRoutingStateChange currently performs fetch()
without a try/catch so network errors can throw and escape; wrap the body of
submitRoutingStateChange in a try/catch, catch any errors from fetch() or
subsequent awaits (including handleFetchResponse, fetchRoutingState,
fetchRoutingPools), log or handle the error as appropriate, and return false on
error so the caller gets a safe failure value; keep existing checks for
isStaleAuthFetchResult(handled) and the normal success path that returns true.
- Around line 246-257: fetchRoutingPools currently omits explicit 503 handling
and can leave routing availability inconsistent; update the async
fetchRoutingPools method to mirror fetchRoutingState by checking the response
status (res.status === 503) after handleFetchResponse and setting
this.routingUnavailable = true (and clearing this.routingPools / calling
this.syncDisplayModels() as needed) then return; also ensure on the success path
you set this.routingUnavailable = false so availability is updated when pools
load successfully; use the existing res, handled, and this.routingPools
references in fetchRoutingPools to implement this parity.
In `@internal/admin/handler_routing_state.go`:
- Around line 75-77: The KindPoolCandidate match in the switch (case
routingstate.KindPoolCandidate) currently only compares current.ProviderName ==
entry.ProviderName && current.Model == entry.Model and must also compare
canonical_model; update the matching condition to include current.CanonicalModel
== entry.CanonicalModel (or the struct field name CanonicalModel) so the handler
returns current only when provider, model, and canonical_model all match; ensure
any tests or callers that rely on this behavior are adjusted if necessary.
In `@internal/app/app.go`:
- Line 55: routingState (the *routingstate.Result field) is not being closed
during normal Shutdown, causing resource leaks; update the app Shutdown method
(and any shutdown/cleanup path that currently handles other subsystems) to call
routingState.Close() (or routingState.Shutdown() if that API exists) and
handle/log any error, and also ensure the same closure is applied in the
alternate shutdown/cleanup block that mirrors lines 278-293 so routingState is
always closed on normal termination.
In `@internal/routing/resolver.go`:
- Around line 61-67: Resolver.runtime is being written in WithRuntime without
synchronization while ResolveWithContext reads it concurrently; add
synchronization by introducing a mutex on Resolver (e.g., mu sync.RWMutex) and
use mu.Lock()/mu.Unlock() in WithRuntime and mu.RLock()/mu.RUnlock() around
reads in ResolveWithContext (and any other accessors/setters such as the other
writer at the referenced location), ensuring all reads/writes to the runtime
field are guarded.
In `@internal/routing/runtime.go`:
- Around line 35-47: The current health logic in the switch (using
snapshot.Registered, snapshot.DiscoveredModelCount, modelFetchError,
availabilityError, snapshot.UsingCachedModels, snapshot.LastModelFetchSuccessAt)
can still return "healthy" when availabilityError (LastAvailabilityError) is
non-empty; update the conditions that return "healthy" to also require
availabilityError == "" (e.g., the branches that check
snapshot.DiscoveredModelCount > 0 and modelFetchError == "" and the final
snapshot.DiscoveredModelCount > 0 branch), and ensure branches that handle
availabilityError != "" move to "degraded" or "unhealthy" as intended so
availability errors always prevent a "healthy" result.
In `@internal/routing/session_affinity.go`:
- Around line 46-49: The affinity map grows unbounded because you create keys
using request-scoped values (core.GetRequestID(ctx) → requestID →
canonicalModel+"|request_id|"+requestID) but only prune on Get; update the Put
path (and any code using the affinity map) to either skip adding
request_id-scoped keys entirely or enforce pruning/TTL when inserting so
ephemeral request IDs cannot accumulate. Specifically, modify the code paths
that build keys from requestID (the block using core.GetRequestID(ctx) and the
functions/methods that call Put/Get on the affinity map) to: 1) avoid persisting
request-scoped keys into the map or 2) run the same expiry/pruning logic on Put
as is run in Get (or attach a TTL eviction) so expired request_id entries are
removed on insert. Ensure references to canonicalModel, requestID, Get, and Put
are updated accordingly.
In `@internal/routing/strategy.go`:
- Around line 12-20: Add an empty-candidate guard before accessing ordered[0] in
the config.RoutingStrategyPriorityFailover branch: after building ordered :=
append([]Candidate(nil), pool.Candidates...) check if len(ordered) == 0 and
return the function's zero-value candidate(s) and a descriptive error (e.g.
fmt.Errorf("priority_failover: no candidates available")) instead of indexing
into an empty slice; keep existing behavior for non-empty/one-element slices so
ordered[1:] continues to work.
In `@internal/routingstate/service_test.go`:
- Around line 24-47: The two tests for candidate state should be converted into
a single table-driven test to consolidate related scenarios: create a slice of
test cases each containing a name, initial store entries (e.g., entries for
KindProvider/KindPoolCandidate with Enabled true/false), the model string, input
candidates, and the expected filtered result or expected CandidateEnabled
boolean; then loop over cases and for each case call NewService (with
memoryStore), apply Upsert for provided entries, and assert behavior via
Service.FilterCandidates and/or Service.CandidateEnabled. Locate and update the
existing TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue functions (and usage of NewService,
memoryStore, Service.FilterCandidates, Service.CandidateEnabled) to implement
this table-driven approach so future cases can be added easily.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d41ff2ac-3e50-40ac-941b-f338818144b0
📒 Files selected for processing (53)
config/config.example.yamlconfig/config.goconfig/config_example_test.goconfig/config_test.goconfig/routing.godocs/features/canonical-routing.mdxinternal/admin/dashboard/static/js/dashboard.jsinternal/admin/dashboard/static/js/modules/aliases.jsinternal/admin/dashboard/static/js/modules/aliases.test.cjsinternal/admin/dashboard/templates/model-table-body.htmlinternal/admin/handler.gointernal/admin/handler_providers.gointernal/admin/handler_routing_pools.gointernal/admin/handler_routing_pools_test.gointernal/admin/handler_routing_state.gointernal/admin/handler_routing_state_test.gointernal/admin/handler_test.gointernal/admin/routes.gointernal/admin/routes_test.gointernal/app/app.gointernal/core/canonical_routing.gointernal/core/request_model_resolution.gointernal/fallback/resolver.gointernal/fallback/resolver_test.gointernal/gateway/fallback.gointernal/gateway/inference_orchestrator.gointernal/gateway/request_model_resolution.gointernal/providers/config_test.gointernal/routing/composed_resolver.gointernal/routing/exposed_models.gointernal/routing/exposed_models_test.gointernal/routing/failover_policy.gointernal/routing/pool_evaluator.gointernal/routing/resolver.gointernal/routing/resolver_test.gointernal/routing/runtime.gointernal/routing/session_affinity.gointernal/routing/strategy.gointernal/routing/types.gointernal/routingstate/factory.gointernal/routingstate/service.gointernal/routingstate/service_test.gointernal/routingstate/store.gointernal/routingstate/store_mongodb.gointernal/routingstate/store_postgresql.gointernal/routingstate/store_sqlite.gointernal/routingstate/types.gointernal/server/fallback_test.gointernal/server/handlers.gointernal/server/http.gointernal/server/internal_chat_completion_executor.gointernal/server/request_model_resolution_test.gointernal/server/translated_inference_service.go
| func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model { | ||
| if l == nil { | ||
| return nil | ||
| } | ||
| var result []core.Model | ||
| for _, lister := range l.listers { | ||
| if filtered, ok := lister.(serverFilteredExposedModelLister); ok { | ||
| result = append(result, filtered.ExposedModelsFiltered(allow)...) | ||
| continue | ||
| } | ||
| result = append(result, lister.ExposedModels()...) | ||
| } |
There was a problem hiding this comment.
allow filtering is bypassed in the fallback branch.
When allow is provided, Line 113 still appends unfiltered models from listers that only implement ExposedModels(). That can expose models that should have been filtered out.
🔧 Minimal fix
func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model {
@@
for _, lister := range l.listers {
if filtered, ok := lister.(serverFilteredExposedModelLister); ok {
result = append(result, filtered.ExposedModelsFiltered(allow)...)
continue
}
+ if allow != nil {
+ continue
+ }
result = append(result, lister.ExposedModels()...)
}
return result
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model { | |
| if l == nil { | |
| return nil | |
| } | |
| var result []core.Model | |
| for _, lister := range l.listers { | |
| if filtered, ok := lister.(serverFilteredExposedModelLister); ok { | |
| result = append(result, filtered.ExposedModelsFiltered(allow)...) | |
| continue | |
| } | |
| result = append(result, lister.ExposedModels()...) | |
| } | |
| func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model { | |
| if l == nil { | |
| return nil | |
| } | |
| var result []core.Model | |
| for _, lister := range l.listers { | |
| if filtered, ok := lister.(serverFilteredExposedModelLister); ok { | |
| result = append(result, filtered.ExposedModelsFiltered(allow)...) | |
| continue | |
| } | |
| if allow != nil { | |
| continue | |
| } | |
| result = append(result, lister.ExposedModels()...) | |
| } | |
| return result | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routing/exposed_models.go` around lines 103 - 114, The fallback
branch in CombinedExposedModelLister.ExposedModelsFiltered bypasses the provided
allow filter by appending lister.ExposedModels() directly; update that branch to
iterate the returned models from lister.ExposedModels() and apply the
allow(core.ModelSelector) predicate to each model before appending so only
permitted models are included (mirror behavior of
serverFilteredExposedModelLister.ExposedModelsFiltered). Ensure you reference
the existing symbols CombinedExposedModelLister, ExposedModelsFiltered,
serverFilteredExposedModelLister, ExposedModels, and the allow parameter when
implementing the per-model check.
| if !p.RetryOnModelErrors { | ||
| return false | ||
| } | ||
| message := strings.ToLower(strings.TrimSpace(gatewayErr.Message)) | ||
| if strings.Contains(message, "model") && (strings.Contains(message, "unavailable") || strings.Contains(message, "not found") || strings.Contains(message, "unsupported")) { | ||
| return true | ||
| } | ||
| if status >= http.StatusInternalServerError || status == http.StatusTooManyRequests { | ||
| return true |
There was a problem hiding this comment.
5xx/429 retry logic is accidentally gated by RetryOnModelErrors.
At Line 51, returning early when RetryOnModelErrors is false makes Line 58–59 unreachable, so transient HTTP failures won’t fail over unless explicitly listed in RetryOnStatuses.
Suggested minimal fix
if _, ok := p.RetryOnStatuses[status]; ok {
return true
}
+ if status >= http.StatusInternalServerError || status == http.StatusTooManyRequests {
+ return true
+ }
if !p.RetryOnModelErrors {
return false
}
message := strings.ToLower(strings.TrimSpace(gatewayErr.Message))
if strings.Contains(message, "model") && (strings.Contains(message, "unavailable") || strings.Contains(message, "not found") || strings.Contains(message, "unsupported")) {
return true
}
- if status >= http.StatusInternalServerError || status == http.StatusTooManyRequests {
- return true
- }
return false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/routing/failover_policy.go` around lines 51 - 59, The 5xx/429 retry
check is wrongly guarded by the RetryOnModelErrors early return: in function(s)
using the RetryOnModelErrors boolean (see symbol RetryOnModelErrors) move or
duplicate the HTTP status check so that server-error/429 logic runs regardless
of RetryOnModelErrors; specifically, ensure the block that checks status >=
http.StatusInternalServerError || status == http.StatusTooManyRequests is
evaluated independently (or before) the early return, and keep the model-error
string checks gated by RetryOnModelErrors; also ensure RetryOnStatuses (or
similar configured status list) is consulted for explicit overrides.
Summary
/v1/models, request resolution, and failover metadata with the same effective routing decisiondocs/features/canonical-routing.mdxWhat changed
1. Canonical routing configuration
This PR introduces a new
routingconfig block with:routing.defaults.strategyrouting.defaults.session_affinityrouting.defaults.session_affinity_ttlrouting.defaults.failover.enabledrouting.defaults.failover.max_attemptsrouting.defaults.failover.retry_on_statusesrouting.defaults.failover.retry_on_model_errorsrouting.model_poolsIt also adds validation and defaults in
config/routing.goand expandsconfig/config.example.yamlso operators have a concrete example to follow.2. Canonical model pool resolver
This PR adds a dedicated routing layer in
internal/routing/that:priority_failoverandweighted_round_robinKey files:
internal/routing/resolver.gointernal/routing/pool_evaluator.gointernal/routing/session_affinity.gointernal/routing/exposed_models.gointernal/routing/failover_policy.go3. Persisted routing state
This PR adds
internal/routingstate/for persisted operational routing state, including:That state is then consumed by runtime routing, admin APIs, and dashboard views.
Key files:
internal/routingstate/service.gointernal/routingstate/store_sqlite.gointernal/routingstate/store_postgresql.gointernal/routingstate/store_mongodb.go4. Runtime-aware eligibility and failover
Candidate eligibility is now shared across resolver,
/v1/models, and admin status.Behavior implemented here:
healthycandidates remain eligibledegradedcandidates remain eligible but are marked degradedunhealthycandidates are removed from effective selection5.
/v1/modelsalignmentGET /v1/modelsnow reflects the same effective routing policy used for real request execution.Instead of exposing a canonical model based on the first configured candidate, the response is derived from the effective candidate selected by the routing layer.
6. Admin API and dashboard
This PR adds new admin routing endpoints and dashboard behavior for operators.
Admin endpoints:
GET /admin/routing-statePUT /admin/routing-stateDELETE /admin/routing-stateGET /admin/routing/model-poolsDashboard updates include:
Config PrimaryvsEffective CandidateKey files:
internal/admin/handler_routing_state.gointernal/admin/handler_routing_pools.gointernal/admin/dashboard/static/js/modules/aliases.jsinternal/admin/dashboard/templates/model-table-body.html7. Structured routing metadata
This PR extends routing/failover observability by carrying structured routing metadata through request resolution and fallback execution.
Examples of metadata now propagated on request resolution:
Key files:
internal/core/canonical_routing.gointernal/core/request_model_resolution.gointernal/gateway/request_model_resolution.gointernal/gateway/fallback.goDeveloper notes
internal/routing/and is wired into the app ininternal/app/app.go.internal/routing/exposed_models.gonow uses the same routing decision surface as real request execution, so/v1/modelsand translated request routing stay consistent.user_pathas the affinity key and falls back to request ID when no scoped user path is available.Documentation
User-facing documentation for this feature lives at:
docs/features/canonical-routing.mdxThat doc explains:
routingconfig block/v1/modelsbehaviorconfigured_provider_models_modestill interacts with canonical poolsTest plan
go test ./internal/routing ./internal/routingstate ./internal/gateway ./internal/server ./internal/admin ./internal/admin/dashboard ./internal/appnode --test internal/admin/dashboard/static/js/modules/aliases.test.cjsSummary by CodeRabbit
New Features
Documentation
Tests