Skip to content

feat(routing): add canonical model pools and admin controls#352

Open
vfeitoza wants to merge 1 commit into
ENTERPILOT:mainfrom
vfeitoza:feat/canonical-routing-failover
Open

feat(routing): add canonical model pools and admin controls#352
vfeitoza wants to merge 1 commit into
ENTERPILOT:mainfrom
vfeitoza:feat/canonical-routing-failover

Conversation

@vfeitoza
Copy link
Copy Markdown
Contributor

@vfeitoza vfeitoza commented May 25, 2026

Summary

  • add canonical model pool routing so one public model name can resolve to multiple exact provider/model candidates
  • add pool-aware failover, session affinity, runtime-aware candidate eligibility, and persisted routing state
  • add admin/dashboard controls and visibility for provider, canonical model, and candidate state
  • align /v1/models, request resolution, and failover metadata with the same effective routing decision
  • document the feature and new user-facing configuration in docs/features/canonical-routing.mdx

What changed

1. Canonical routing configuration

This PR introduces a new routing config block with:

  • routing.defaults.strategy
  • routing.defaults.session_affinity
  • routing.defaults.session_affinity_ttl
  • routing.defaults.failover.enabled
  • routing.defaults.failover.max_attempts
  • routing.defaults.failover.retry_on_statuses
  • routing.defaults.failover.retry_on_model_errors
  • routing.model_pools

It also adds validation and defaults in config/routing.go and expands config/config.example.yaml so operators have a concrete example to follow.

2. Canonical model pool resolver

This PR adds a dedicated routing layer in internal/routing/ that:

  • resolves canonical model names to exact provider/model candidates
  • supports priority_failover and weighted_round_robin
  • applies session affinity before strategy selection
  • repins automatically when the pinned candidate becomes ineligible
  • filters candidates using both manual routing state and runtime health

Key files:

  • internal/routing/resolver.go
  • internal/routing/pool_evaluator.go
  • internal/routing/session_affinity.go
  • internal/routing/exposed_models.go
  • internal/routing/failover_policy.go

3. Persisted routing state

This PR adds internal/routingstate/ for persisted operational routing state, including:

  • provider enable/disable
  • canonical model enable/disable
  • pool candidate enable/disable

That state is then consumed by runtime routing, admin APIs, and dashboard views.

Key files:

  • internal/routingstate/service.go
  • internal/routingstate/store_sqlite.go
  • internal/routingstate/store_postgresql.go
  • internal/routingstate/store_mongodb.go

4. Runtime-aware eligibility and failover

Candidate eligibility is now shared across resolver, /v1/models, and admin status.

Behavior implemented here:

  • healthy candidates remain eligible
  • degraded candidates remain eligible but are marked degraded
  • unhealthy candidates are removed from effective selection
  • canonical pool failover respects pool-scoped failover policy and max attempts

5. /v1/models alignment

GET /v1/models now reflects the same effective routing policy used for real request execution.

Instead of exposing a canonical model based on the first configured candidate, the response is derived from the effective candidate selected by the routing layer.

6. Admin API and dashboard

This PR adds new admin routing endpoints and dashboard behavior for operators.

Admin endpoints:

  • GET /admin/routing-state
  • PUT /admin/routing-state
  • DELETE /admin/routing-state
  • GET /admin/routing/model-pools

Dashboard updates include:

  • provider/canonical/candidate toggles
  • candidate runtime status badges
  • Config Primary vs Effective Candidate
  • blocked candidate visibility and pool status context

Key files:

  • internal/admin/handler_routing_state.go
  • internal/admin/handler_routing_pools.go
  • internal/admin/dashboard/static/js/modules/aliases.js
  • internal/admin/dashboard/templates/model-table-body.html

7. Structured routing metadata

This PR extends routing/failover observability by carrying structured routing metadata through request resolution and fallback execution.

Examples of metadata now propagated on request resolution:

  • canonical model
  • routing strategy
  • config primary candidate
  • effective candidate
  • selected provider name
  • selected exact model
  • blocked candidates
  • failover used
  • fallback target

Key files:

  • internal/core/canonical_routing.go
  • internal/core/request_model_resolution.go
  • internal/gateway/request_model_resolution.go
  • internal/gateway/fallback.go

Developer notes

  • The canonical routing feature is centered around internal/routing/ and is wired into the app in internal/app/app.go.
  • internal/routing/exposed_models.go now uses the same routing decision surface as real request execution, so /v1/models and translated request routing stay consistent.
  • Session affinity currently prefers user_path as the affinity key and falls back to request ID when no scoped user path is available.
  • Failover metadata is updated in the workflow resolution when an actual fallback candidate succeeds.
  • The dashboard consumes pool-level and candidate-level routing state through the admin API rather than recomputing it client-side.

Documentation

User-facing documentation for this feature lives at:

  • docs/features/canonical-routing.mdx

That doc explains:

  • the new routing config block
  • supported strategies
  • session affinity behavior
  • failover behavior
  • runtime eligibility rules
  • /v1/models behavior
  • admin/dashboard controls
  • how configured_provider_models_mode still interacts with canonical pools

Test plan

  • go test ./internal/routing ./internal/routingstate ./internal/gateway ./internal/server ./internal/admin ./internal/admin/dashboard ./internal/app
  • node --test internal/admin/dashboard/static/js/modules/aliases.test.cjs

Summary by CodeRabbit

  • New Features

    • Canonical model pool routing with priority-failover and weighted round-robin strategies.
    • Session affinity (configurable enable/TTL) and configurable failover/retry policies.
    • Admin REST APIs and dashboard controls to view/manage routing state and model pools.
    • Exposed-model listing reflects canonical/effective candidate selection and eligibility.
  • Documentation

    • Added comprehensive canonical routing guide, examples, rollout checklist, and observability notes.
  • Tests

    • Added unit/integration tests covering routing config, resolver, pool evaluation, failover, admin handlers, and exposure listers.

Review Change Stack

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 25, 2026

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 25, 2026

Greptile Summary

This PR adds canonical model pool routing and operator controls. The main changes are:

  • New routing config for model pools, strategies, affinity, and failover.
  • Pool-aware resolver logic with runtime eligibility and failover metadata.
  • Persisted provider, canonical model, and candidate routing state.
  • Admin API and dashboard views for routing state and pool status.
  • /v1/models exposure based on effective canonical routing decisions.

Confidence Score: 2/5

This should be fixed before merging.

  • Default deployments without routing pools can panic on startup.
  • Weighted routing can select one backend but record metadata and fallbacks for another.
  • Read-only model listing can change future weighted routing choices.
  • Candidate disable controls can affect multiple pools unintentionally.

Focus on internal/app/app.go, internal/gateway/request_model_resolution.go, internal/routing/exposed_models.go, internal/routingstate/types.go, and config/routing.go.

Important Files Changed

Filename Overview
internal/app/app.go Wires routing into startup, server config, and admin setup; contains the no-pool startup panic and cleanup omissions.
internal/gateway/request_model_resolution.go Adds canonical metadata to request resolution; currently resolves weighted pools twice per request.
internal/routing/exposed_models.go Exposes canonical models via the routing layer; model listing mutates weighted routing counters.
internal/routingstate/types.go Defines persisted routing-state identity; pool-candidate keys are missing the canonical pool dimension.
config/routing.go Adds routing config validation and defaults; partial YAML can drop documented boolean defaults.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Client request model] --> B[ResolveExecutionSelectorWithContext]
    B --> C[Composed canonical resolver]
    C --> D[Weighted counter advances]
    D --> E[Execution selector chosen]
    E --> F[ResolveRequestModelWithAuthorizer]
    F --> G[Canonical resolver called again]
    G --> H[Weighted counter advances again]
    H --> I[Metadata and fallbacks may describe different candidate]
Loading

Comments Outside Diff (1)

  1. internal/app/app.go, line 313 (link)

    P2 Routing state can leak

    After routing state is initialized, this later error cleanup path closes the existing subsystems but omits app.routingState.Close(). If guardrail initialization fails, the routing-state storage and background refresh goroutine can remain open even though app construction returns an error.

    Context Used: CLAUDE.md (source)

Reviews (1): Last reviewed commit: "feat(routing): add canonical model pools..." | Re-trigger Greptile

Comment thread internal/app/app.go
"recommendation", "rebuild with -tags=swagger")
}

routingResolver := routing.NewResolver(appCfg.Routing, app.routingState.Service).WithRuntime(providerResult.Registry)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Startup panic without pools

When routing.model_pools is empty, routing.NewResolver returns nil, and this line immediately calls WithRuntime on that nil value. Existing configs that do not opt into canonical pools will panic during app startup before the server is created.

Context Used: CLAUDE.md (source)

Comment thread config/routing.go
Comment on lines +75 to +84
return nil
}

cfg.Defaults.Strategy = ResolveRoutingStrategy(cfg.Defaults.Strategy)
if !cfg.Defaults.Strategy.Valid() {
return fmt.Errorf("routing.defaults.strategy must be one of: priority_failover, weighted_round_robin")
}
if cfg.Defaults.SessionAffinityTTL <= 0 {
cfg.Defaults.SessionAffinityTTL = 30 * time.Minute
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Partial defaults disable failover

The loader restores durations, attempts, and retry statuses when they are omitted, but it does not restore boolean defaults. If a config file includes a partial routing.defaults block, YAML zero values can turn session_affinity, failover.enabled, or retry_on_model_errors from their documented defaults of true into false, so pool failover or affinity can be silently disabled by an otherwise unrelated routing setting.

Context Used: CLAUDE.md (source)

Comment on lines +77 to +82
var canonicalResolution *core.CanonicalRoutingResolution
if canonicalResolver, ok := resolver.(interface {
ResolveWithContext(context.Context, core.RequestedModelSelector) (*core.CanonicalRoutingResolution, bool, error)
}); ok {
canonicalResolution, _, err = canonicalResolver.ResolveWithContext(ctx, requested)
if err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Resolver runs twice

ResolveExecutionSelectorWithContext already calls the composed resolver and may mutate weighted-round-robin counters to choose the execution selector. This second canonical resolve can choose a different candidate for weighted pools and advances the counter again, so the request can execute against candidate A while EffectiveCandidate, fallbacks, and selected provider metadata describe candidate B.

Context Used: CLAUDE.md (source)

if l.resolver == nil {
continue
}
resolution, matched, err := l.resolver.Resolve(core.NewRequestedModelSelector(canonical, ""))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Model listing changes routing

ExposedModelsFiltered uses the live resolver to decide which canonical model to expose. For weighted_round_robin, Resolve increments the resolver counters, so every GET /v1/models poll consumes routing slots and changes which backend future inference requests will hit. A read-only model listing endpoint should not mutate runtime selection state.

Context Used: CLAUDE.md (source)

Comment on lines +59 to +65
case KindPoolCandidate:
if entry.ProviderName == "" || entry.Model == "" {
return Entry{}, newValidationError("provider_name and model are required for pool_candidate", nil)
}
if entry.Key == "" {
entry.Key = entry.ProviderName + "/" + entry.Model
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Candidate state is global

Pool-candidate state is keyed only as provider/model, without the canonical model name. If two canonical pools reuse the same exact provider/model candidate, disabling that candidate for one pool disables it for every pool. That breaks the advertised per-pool candidate control and can take unrelated canonical models out of rotation.

Context Used: CLAUDE.md (source)

Comment on lines +102 to +107
if resolver := routing.NewResolver(h.routingConfig, h.routingState); resolver != nil {
resolver = resolver.WithRuntime(h.registry)
if resolution, matched, err := resolver.Resolve(core.NewRequestedModelSelector(canonicalName, "")); err == nil && matched && resolution != nil {
effectiveCandidate = resolution.Primary.QualifiedModel()
effectiveProviderName = resolution.Primary.Provider
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Effective candidate is stale

The admin endpoint creates a fresh resolver each time it builds the pool response. For weighted pools, that resolver has empty counters and no shared affinity state, so it reports the deterministic first candidate rather than the candidate the live resolver would actually select. The dashboard can mark the wrong row as the effective candidate.

Context Used: CLAUDE.md (source)

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 25, 2026

📝 Walkthrough

Walkthrough

Adds canonical model pool routing: config + validation, resolver and selection strategies (priority failover / weighted round robin), session affinity, failover policy, runtime-based eligibility, persistent routing state with multi-backend stores, admin APIs and dashboard UI, and integration into request resolution and fallback flows.

Changes

Canonical Routing Feature

Layer / File(s) Summary
Routing configuration & core contracts
config/config.example.yaml, config/config.go, config/routing.go, config/config_test.go, internal/core/canonical_routing.go
New routing config section, types for RoutingConfig/Defaults/Failover/ModelPools, strategy normalization/validation, and core CanonicalRoutingResolution data model.
Resolver, strategies & session affinity
internal/routing/*.go, internal/routing/strategy.go, internal/routing/session_affinity.go, internal/routing/composed_resolver.go, internal/routing/resolver_test.go
Implements Resolver over configured pools, selection algorithms for priority-failover and weighted round-robin, session affinity store with TTL, and composed resolver chaining alias + pool resolution with tests.
Pool evaluation & failover policy
internal/routing/pool_evaluator.go, internal/routing/runtime.go, internal/routing/failover_policy.go
Evaluate candidate enablement/selectability from routing state and runtime snapshots, classify runtime health, and implement FailoverPolicy deciding retry eligibility and max attempts.
Routing types & utilities
internal/routing/types.go
Candidate/Pool types, Registry and RuntimeSnapshotProvider interfaces, and normalization helpers.
Routing-state storage backends & types
internal/routingstate/types.go, internal/routingstate/store.go, internal/routingstate/store_sqlite.go, internal/routingstate/store_postgresql.go, internal/routingstate/store_mongodb.go
Defines persistent Store interface, ValidationError, and SQLite/Postgres/Mongo store implementations with consistent upsert/list/delete semantics.
Routing state service & factory
internal/routingstate/service.go, internal/routingstate/factory.go, internal/routingstate/service_test.go
In-memory routingstate Service with snapshotting, filtering, Upsert/Delete with refresh, background refresh, and factory for lifecycle-managed initialization and Close.
Exposed model listing
internal/routing/exposed_models.go, internal/routing/exposed_models_test.go
CanonicalExposedModelLister and CombinedExposedModelLister that surface effective candidate models via resolver + catalog integration.
Request resolution & orchestrator wiring
internal/gateway/request_model_resolution.go, internal/gateway/inference_orchestrator.go, internal/server/*
Resolve with canonical routing when available, populate RequestModelResolution with canonical/pool metadata, and thread FailoverPolicy into orchestrator/executor and handler wiring.
Fallback integration
internal/fallback/resolver.go, internal/gateway/fallback.go, internal/server/fallback_test.go
Canonical pool fallbacks take precedence; fallback attempts use FailoverPolicy for attempt gating and record failover metadata into workflow resolution.
Admin APIs: routing state & pools
internal/admin/handler_routing_state.go, internal/admin/handler_routing_pools.go, internal/admin/handler_routing_state_test.go, internal/admin/handler_routing_pools_test.go
Endpoints to list/upsert/delete routing state and to list configured model pools with evaluated candidate/runtime status and blocked reasons.
Provider status integration
internal/admin/handler_providers.go, internal/admin/handler.go, internal/admin/handler_test.go
Provider status rows now consider routing-state enablement (manual-disable overrides runtime). Admin handler accepts routing state/config dependencies.
Admin route registration & dashboard UI
internal/admin/routes.go, internal/admin/routes_test.go, internal/admin/dashboard/static/js/*, internal/admin/dashboard/templates/*
Registers routing admin routes; dashboard fetches routing state/pools, enriches model table with canonical/candidate metadata, adds toggles and badges, and includes unit test coverage.
App integration
internal/app/app.go
App initializes routingState, composes resolver and failover policy, wires composed resolver and exposed-model lister into server config, and includes routingState in lifecycle cleanup.
Documentation & tests
docs/features/canonical-routing.mdx, various *_test.go and JS tests
Adds full feature docs, numerous unit tests for config, resolver strategies, session affinity, routingstate service, stores, exposed models, fallback integration, and test helpers for isolation.
Test environment isolation
config/config_example_test.go, internal/providers/config_test.go
Improved test isolation with temp-dir scoping and environment variable clearing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

release:internal

Poem

🐰 I hopped through pools of canonical names,
I sorted by priority and played round-robin games,
I pinned tiny paths with a soft affinity kiss,
Nudged failovers gently when endpoints went amiss,
Hooray — routing now routes with a rabbit’s bliss!

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@vfeitoza vfeitoza force-pushed the feat/canonical-routing-failover branch from 7e1d804 to 779fa78 Compare May 25, 2026 09:14
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 25

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/admin/dashboard/static/js/modules/aliases.js (1)

42-78: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Extract repeated map lookup to avoid redundant calls.

routingCandidates.get(this.qualifiedModelName(model)) is called 13 times per model row. Extract the result once to improve performance and readability.

♻️ Proposed refactor
                 const rows = this.models.map((model) => ({
+                    const routingEntry = routingCandidates.get(this.qualifiedModelName(model)) || {};
                     key: 'model:' + this.qualifiedModelName(model),
                     display_name: this.qualifiedModelName(model),
                     secondary_name: '',
                     provider_name: model.provider_name || '',
                     provider_type: model.provider_type || '',
                     model: model.model,
                     is_alias: false,
                     alias: null,
                     access: model && model.access ? model.access : null,
                     kind_badge: '',
                     masking_alias: null,
                     alias_state_class: '',
                     alias_state_text: '',
-                    canonical_model: (routingCandidates.get(this.qualifiedModelName(model)) || {}).canonical_model || '',
-                    routing_state: (routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || null,
-                    routing_pool: (routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || null,
-                    canonical_enabled: (routingCandidates.get(this.qualifiedModelName(model)) || {}).pool ? ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool.enabled !== false) : true,
-                    canonical_status: ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).status || '',
-                    canonical_reason: ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).status_reason || '',
-                    routing_strategy: ((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).strategy || '',
-                    candidate_priority: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).priority ?? null),
-                    candidate_weight: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).weight ?? null),
-                    is_config_primary: Boolean(((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).is_config_primary),
-                    is_effective_candidate: Boolean(((routingCandidates.get(this.qualifiedModelName(model)) || {}).routing_state || {}).is_effective_candidate),
-                    effective_candidate: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).effective_candidate || ''),
-                    config_primary_candidate: (((routingCandidates.get(this.qualifiedModelName(model)) || {}).pool || {}).config_primary_candidate || '')
+                    canonical_model: routingEntry.canonical_model || '',
+                    routing_state: routingEntry.routing_state || null,
+                    routing_pool: routingEntry.pool || null,
+                    canonical_enabled: routingEntry.pool ? (routingEntry.pool.enabled !== false) : true,
+                    canonical_status: (routingEntry.pool || {}).status || '',
+                    canonical_reason: (routingEntry.pool || {}).status_reason || '',
+                    routing_strategy: (routingEntry.pool || {}).strategy || '',
+                    candidate_priority: ((routingEntry.routing_state || {}).priority ?? null),
+                    candidate_weight: ((routingEntry.routing_state || {}).weight ?? null),
+                    is_config_primary: Boolean((routingEntry.routing_state || {}).is_config_primary),
+                    is_effective_candidate: Boolean((routingEntry.routing_state || {}).is_effective_candidate),
+                    effective_candidate: ((routingEntry.pool || {}).effective_candidate || ''),
+                    config_primary_candidate: ((routingEntry.pool || {}).config_primary_candidate || '')
                 }));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 42 - 78,
The code repeatedly calls routingCandidates.get(this.qualifiedModelName(model))
inside the rows mapping causing redundant lookups; refactor the rows creation in
the models.map callback to compute const rc =
routingCandidates.get(this.qualifiedModelName(model)) || {} (and const pool =
rc.pool || {} if helpful) once at the top of the callback, then replace all
occurrences with rc, pool or rc.routing_state so each property (canonical_model,
routing_state, pool, pool.enabled, pool.status, pool.status_reason,
pool.strategy, pool.effective_candidate, pool.config_primary_candidate,
routing_state.priority, routing_state.weight, etc.) reads from those local
variables instead of calling routingCandidates.get(...) repeatedly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/config_test.go`:
- Around line 720-769: The test TestLoad_RoutingConfigYAML is missing assertions
for the failover block parsed into cfg.Routing.Defaults.Failover; add assertions
that cfg.Routing.Defaults.Failover.Enabled == true,
cfg.Routing.Defaults.Failover.MaxAttempts == 5,
cfg.Routing.Defaults.Failover.RetryOnStatuses contains 429 and 503 (and has
expected length), and cfg.Routing.Defaults.Failover.RetryOnModelErrors == false
so regressions in failover parsing are caught.

In `@config/routing.go`:
- Around line 88-90: Validate and reject or sanitize invalid HTTP status codes
in cfg.Defaults.Failover.RetryOnStatuses during config load: in the routing
config initialization (the code that sets cfg.Defaults.Failover.RetryOnStatuses
in config/routing.go) iterate the slice and ensure each status is within the
HTTP range 100..599, either filtering out invalid values or returning a config
validation error; update the default-setting block that currently sets
[]int{429,500,502,503,504} to perform this validation (refer to
cfg.Defaults.Failover.RetryOnStatuses) so runtime failover matching cannot
receive values like 0, negative, or >599.

In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 523-530: The toggleCanonicalModelEnabled function sends the wrong
enabled value because it uses !(row.canonical_enabled === false) instead of
inverting the current boolean; update toggleCanonicalModelEnabled to compute the
new state as enabled: !row.canonical_enabled (or
Boolean(!row.canonical_enabled)) when calling submitRoutingStateChange, keeping
the existing null/undefined guard for row and row.canonical_model and preserving
the submitted payload fields canonical_model and kind.
- Around line 489-501: submitRoutingStateChange currently doesn't catch
fetch/network errors; wrap the body of submitRoutingStateChange in a try/catch
and on any thrown error behave like the other submit helpers: catch the error,
call the same error handling/logging path (or return false) so the caller
doesn't get an unhandled rejection, and ensure you still call/await
this.fetchRoutingState() and this.fetchRoutingPools() only on success; reference
the existing method submitRoutingStateChange and related helpers
handleFetchResponse, isStaleAuthFetchResult, fetchRoutingState, and
fetchRoutingPools when implementing the catch behavior.
- Around line 216-244: fetchRoutingState() updates this.routingStateViews but
never calls this.syncDisplayModels(), so UI models
(groupDisplayModels()/providerRoutingEnabled()) can remain stale; after
successfully assigning this.routingStateViews (the line setting it from payload)
call this.syncDisplayModels(), and also call this.syncDisplayModels() in the
catch block after setting this.routingStateViews = [] to mirror
fetchRoutingPools() behavior; reference fetchRoutingState, routingStateViews,
syncDisplayModels, fetchRoutingPools, groupDisplayModels, and
providerRoutingEnabled when making the change.
- Around line 246-266: fetchRoutingPools() currently treats any non-handled
response the same and doesn't update routingStateAvailable like
fetchRoutingState() does; update fetchRoutingPools() so that after calling
this.handleFetchResponse(res, 'routing model pools', request) you check for a
503 condition (the same condition used in fetchRoutingState()) and set
this.routingStateAvailable = false and return; otherwise keep the existing
!handled path that clears this.routingPools and calls this.syncDisplayModels();
reference functions/fields: fetchRoutingPools, fetchRoutingState,
handleFetchResponse, routingStateAvailable, routingPools, syncDisplayModels.

In `@internal/admin/handler_routing_pools_test.go`:
- Around line 16-91: The test
TestBuildRoutingPoolResponses_UsesRoutingConfigAndState (and the file
internal/admin/handler_routing_pools_test.go) must be converted into
table-driven tests to cover additional scenarios: weighted round-robin/default
behavior, nil routing state, and empty model pools; refactor the current
specific test into a table of cases each describing input
registry/service/config (construct via providers.NewModelRegistry(),
routingstate.NewService()/nil, NewHandler(..., WithRoutingState(...),
WithRoutingConfig(...))) and expected outputs from
h.buildRoutingPoolResponses(), and iterate over cases asserting Strategy,
Enabled, Candidates, EffectiveCandidate, ConfigPrimaryCandidate, and
BlockedCandidates for each case. Ensure each case has a descriptive name, uses
t.Run, and reuses helper setup code for creating handlers and invoking
buildRoutingPoolResponses to avoid duplication.

In `@internal/admin/handler_routing_state_test.go`:
- Around line 36-88: Replace the four separate tests with a single table-driven
test that iterates cases describing name, setup (using newRoutingStateHandler
and optional pre-seed via h.UpsertRoutingState), HTTP method and path, request
body, and expected status; call the appropriate handler function by inspecting
the method/path and assert rec.Code equals expected. Include cases: list empty
(GET -> 200, body "[]\n"), upsert provider success (PUT valid body -> 200),
upsert malformed/missing enabled (PUT invalid body -> 400), delete success
(DELETE existing key -> 204), delete not-found (DELETE non-existent key -> 404),
and malformed JSON for any endpoint (invalid JSON -> 400). Ensure you reference
the handler functions ListRoutingState, UpsertRoutingState, DeleteRoutingState
and use newRoutingStateHandler to create fresh handler instances per table case.

In `@internal/admin/handler_routing_state.go`:
- Around line 75-77: In the routingstate.KindPoolCandidate branch update the
equality check to include canonical model matching as well as provider and
model: inside the if that currently compares current.ProviderName ==
entry.ProviderName && current.Model == entry.Model, add a comparison for the
canonical model (e.g. current.CanonicalModel == entry.CanonicalModel or the
actual field name used for canonical_model) so the pool-candidate response only
returns when provider, model and canonical_model all match.

In `@internal/app/app.go`:
- Around line 278-292: The routing state (routingStateResult assigned to
app.routingState via routingstate.NewWithSharedStorage / routingstate.New) is
not being closed on error paths or during Shutdown; update the initialization
error handling to include app.routingState.Close() in the errors.Join close list
when routingStateResult was created, and add app.routingState.Close() to the
Shutdown/teardown join calls (where app.pricingOverrides.Close(),
app.modelOverrides.Close(), app.aliases.Close(), etc. are joined) so
routing-state resources are always closed on startup failure and normal
shutdown.
- Around line 437-441: routing.NewResolver(...) can return nil, so avoid
chaining .WithRuntime(...) directly; instead assign the result to a local
variable (routingResolver := routing.NewResolver(appCfg.Routing,
app.routingState.Service)), check if routingResolver != nil, and only then call
routingResolver = routingResolver.WithRuntime(providerResult.Registry); leave
routingResolver nil otherwise so subsequent code (e.g., composedModelResolver :=
routing.NewComposedResolver(app.aliases.Service, routingResolver)) doesn't panic
at startup.

In `@internal/gateway/fallback.go`:
- Around line 57-72: The code calls o.failoverPolicy methods/fields without
checking for nil when CanonicalModel is set; add a defensive nil-check around
access to o.failoverPolicy in the canonical path: when computing shouldAttempt
(replace o.failoverPolicy.ShouldAttempt(primaryErr)) only call it if
o.failoverPolicy != nil, otherwise keep shouldAttempt false (or log/handle
accordingly), and similarly before comparing attempts against
o.failoverPolicy.MaxAttempts inside the loop only access MaxAttempts if
o.failoverPolicy != nil; update the logic around ShouldAttemptFallback, the
shouldAttempt variable, the canonicalPool check, and the attempts loop to avoid
dereferencing a nil o.failoverPolicy.
- Around line 159-174: The code uses o.failoverPolicy without a nil check: when
computing shouldAttempt (overriding ShouldAttemptFallback) and when comparing
attempts with o.failoverPolicy.MaxAttempts; add defensive nil checks so
o.failoverPolicy is non-nil before accessing methods/properties. Specifically,
in the block that sets shouldAttempt (currently: if workflow != nil &&
workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" {
shouldAttempt = o.failoverPolicy.ShouldAttempt(primaryErr) }) guard the call
with o.failoverPolicy != nil, and in the for-loop condition (canonicalPool &&
o.failoverPolicy.MaxAttempts > 0 && attempts >= o.failoverPolicy.MaxAttempts-1)
ensure o.failoverPolicy != nil is checked first (e.g., canonicalPool &&
o.failoverPolicy != nil && o.failoverPolicy.MaxAttempts > 0 && ...), mirroring
the defensive checks used in tryFallbackResponse.

In `@internal/gateway/request_model_resolution.go`:
- Around line 66-67: The code currently calls
ResolveExecutionSelectorWithContext (producing resolvedSelector, aliasApplied)
and then invokes the routing resolver again to compute canonical
selector/metadata, which can advance state twice; instead call the resolver only
once and capture both the execution selector and its canonical metadata in that
single invocation (either extend ResolveExecutionSelectorWithContext to return
canonical metadata or call the canonical-resolve helper once and store its
results), then pass those stored values into RequestModelResolution enrichment
and any execution code so both use the identical selector and metadata (use the
variables resolvedSelector, aliasApplied, and the new
canonicalMetadata/canonicalSelector you obtain from the single resolver call).

In `@internal/providers/config_test.go`:
- Around line 66-84: The clearProviderConfigEnvVars helper only clears
unsuffixed keys, causing resolveProviders and
TestResolveProviders_NoProvidersNoEnvVars to be flaky when ambient
suffixed/provider-family vars (e.g. *_EAST_*, VERTEX_*, BEDROCK_* or
provider-prefixed variations) exist; update clearProviderConfigEnvVars to also
remove any environment variables whose names start with or contain the provider
prefixes listed (OPENAI, ANTHROPIC, GEMINI, DEEPSEEK, XAI, GROQ, OPENROUTER,
ZAI, AZURE, ORACLE, VLLM, OLLAMA) or their family suffixes by scanning
os.Environ (or t.Setenv for each match) and unsetting keys that match those
patterns so resolveProviders and TestResolveProviders_NoProvidersNoEnvVars run
deterministically.

In `@internal/routing/exposed_models_test.go`:
- Around line 33-98: Combine the two tests into a single table-driven test that
iterates cases and calls NewCanonicalExposedModelLister(...).ExposedModels() for
each case; each case should specify the routing strategy and provider runtime
snapshots (healthy/unhealthy) plus the expected selected model properties (e.g.,
expect OwnedBy == "b" or expect ID == "claude-sonnet-4-6"). Reuse the existing
testCatalog models and the staticRuntimeProvider snapshots from the current
tests, and assert on models length and the expected field for each case. Keep
test logic extracted around NewCanonicalExposedModelLister, ExposedModels, and
the assertions so each table row only changes config.RoutingConfig and snapshots
and the expected result.

In `@internal/routing/exposed_models.go`:
- Around line 59-64: The nil-check on the locally-created pointer "chosen" is
impossible and triggers a lint error; remove the dead branch and append the
cloned model directly: after creating "clone := *model" and setting "clone.ID =
canonical", use models = append(models, clone) (or append(models, *chosen) but
without the if chosen != nil check) so the code no longer contains the SA4031
unreachable nil test.

In `@internal/routing/pool_evaluator.go`:
- Around line 112-118: The aggregation currently increments availableCount for
any effectiveEnabled runtime even if runtime.Status == "unhealthy", causing pool
state to be reported as "enabled" incorrectly; update the logic so
availableCount is only incremented when effectiveEnabled && runtime.Status !=
"unhealthy" (same condition used for selectable and selectableCount), and apply
the same fix to the later pool-status aggregation block that uses
effectiveEnabled/availableCount (the section around
selectableCount/availableCount and the block at lines ~140-143) so unhealthy
runtimes are excluded from "available" counts everywhere.

In `@internal/routing/resolver_test.go`:
- Around line 313-315: The lint QF1001 flags the negated conjunction in
manualDisableStateChecker.CandidateEnabled; replace the negation with an
explicit disjunction to satisfy the linter by returning selector.Provider !=
s.disabledProvider || selector.Model != s.disabledModel instead of
!(selector.Provider == s.disabledProvider && selector.Model == s.disabledModel),
keeping references to the same symbols
(manualDisableStateChecker.CandidateEnabled, selector.Provider, selector.Model,
s.disabledProvider, s.disabledModel).

In `@internal/routing/resolver.go`:
- Around line 61-66: Resolver.WithRuntime currently assigns to Resolver.runtime
without synchronization while ResolveWithContext reads it during request
handling, risking a data race; either make the field access concurrency-safe by
acquiring the Resolver.mu mutex when writing (and ensure readers use the same mu
when reading in ResolveWithContext/EvaluatePool paths) or explicitly document
and enforce that WithRuntime is init-only and never called after the Resolver is
published; update the implementation to use the Resolver.mu lock around the
write in WithRuntime (and audit ResolveWithContext/EvaluatePool to use mu when
accessing runtime) or add clear comments and tests asserting init-only usage.

In `@internal/routing/runtime.go`:
- Around line 35-39: The branch that returns "healthy" when snapshot.Registered
&& snapshot.DiscoveredModelCount > 0 && modelFetchError == "" must also consider
availability failures: check snapshot.LastAvailabilityError (or the equivalent
field) and treat non-empty/non-nil availability errors as "degraded" before
returning "healthy". Update the logic around the case that uses
snapshot.UsingCachedModels and snapshot.LastModelFetchSuccessAt to first return
"degraded" if snapshot.LastAvailabilityError is set, and apply the same fix to
the other similar branch referenced (the one at lines 44-47) so both branches
consider snapshot.LastAvailabilityError prior to returning "healthy".

In `@internal/routing/session_affinity.go`:
- Around line 46-49: The current use of core.GetRequestID(ctx) to create
affinity keys (canonicalModel + "|request_id|" + requestID) produces many
one-off keys and the lazy cleanup in the read path (the code referenced around
Put/Get cleanup lines) can let the map grow unbounded; fix by either removing
the request_id fallback from the affinity key generation (only use stable
identifiers) or by bounding the affinity map in the Put path: in the Put method
add a pruning strategy (e.g., TTL per entry or a periodic sweep every N writes
and/or a max-entries eviction) and ensure the existing lazy cleanup code in Get
still complements it; modify the code that builds the key (where canonicalModel
and core.GetRequestID(ctx) are used) and update Put/Get cleanup logic to enforce
bounded size.

In `@internal/routing/strategy.go`:
- Around line 12-20: The priority-failover branch (case
config.RoutingStrategyPriorityFailover) blindly indexes ordered[0] which will
panic if pool.Candidates is empty; add a guard that checks len(pool.Candidates)
(or len(ordered)) == 0 and return an appropriate error (or nil/empty values
consistent with the function signature) before sorting/returning, referencing
the RoutingStrategyPriorityFailover case, pool.Candidates and ordered slice to
locate where to insert the check.

In `@internal/routingstate/service.go`:
- Around line 55-56: normalizeEntry currently overwrites entry.UpdatedAt with
time.Now().UTC(), so calling it from Refresh causes in-memory timestamps to
change even when storage didn't; modify Refresh to preserve persisted UpdatedAt
by either adding a no-update flag to normalizeEntry (e.g., normalizeEntry(entry,
preserveTimestamps bool)) or by capturing the original entry.UpdatedAt before
calling normalizeEntry and restoring it afterwards if no storage changes
occurred; ensure references in Refresh use the preserved UpdatedAt rather than
the value set inside normalizeEntry.

In `@internal/routingstate/types.go`:
- Around line 63-65: For entries of KindPoolCandidate, always canonicalize the
Key to the canonical form provider/model instead of leaving a caller-supplied
Key; in the code handling KindPoolCandidate (the block referencing entry.Key,
entry.ProviderName and entry.Model in internal/routingstate/types.go) remove the
conditional that only sets entry.Key when empty and unconditionally set
entry.Key = entry.ProviderName + "/" + entry.Model so the Key is stable and
unique for a given provider+model.

---

Outside diff comments:
In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 42-78: The code repeatedly calls
routingCandidates.get(this.qualifiedModelName(model)) inside the rows mapping
causing redundant lookups; refactor the rows creation in the models.map callback
to compute const rc = routingCandidates.get(this.qualifiedModelName(model)) ||
{} (and const pool = rc.pool || {} if helpful) once at the top of the callback,
then replace all occurrences with rc, pool or rc.routing_state so each property
(canonical_model, routing_state, pool, pool.enabled, pool.status,
pool.status_reason, pool.strategy, pool.effective_candidate,
pool.config_primary_candidate, routing_state.priority, routing_state.weight,
etc.) reads from those local variables instead of calling
routingCandidates.get(...) repeatedly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c536d95d-8d9a-4bb5-9f8c-89658dc71a81

📥 Commits

Reviewing files that changed from the base of the PR and between 7c84044 and 7e1d804.

📒 Files selected for processing (53)
  • config/config.example.yaml
  • config/config.go
  • config/config_example_test.go
  • config/config_test.go
  • config/routing.go
  • docs/features/canonical-routing.mdx
  • internal/admin/dashboard/static/js/dashboard.js
  • internal/admin/dashboard/static/js/modules/aliases.js
  • internal/admin/dashboard/static/js/modules/aliases.test.cjs
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/handler.go
  • internal/admin/handler_providers.go
  • internal/admin/handler_routing_pools.go
  • internal/admin/handler_routing_pools_test.go
  • internal/admin/handler_routing_state.go
  • internal/admin/handler_routing_state_test.go
  • internal/admin/handler_test.go
  • internal/admin/routes.go
  • internal/admin/routes_test.go
  • internal/app/app.go
  • internal/core/canonical_routing.go
  • internal/core/request_model_resolution.go
  • internal/fallback/resolver.go
  • internal/fallback/resolver_test.go
  • internal/gateway/fallback.go
  • internal/gateway/inference_orchestrator.go
  • internal/gateway/request_model_resolution.go
  • internal/providers/config_test.go
  • internal/routing/composed_resolver.go
  • internal/routing/exposed_models.go
  • internal/routing/exposed_models_test.go
  • internal/routing/failover_policy.go
  • internal/routing/pool_evaluator.go
  • internal/routing/resolver.go
  • internal/routing/resolver_test.go
  • internal/routing/runtime.go
  • internal/routing/session_affinity.go
  • internal/routing/strategy.go
  • internal/routing/types.go
  • internal/routingstate/factory.go
  • internal/routingstate/service.go
  • internal/routingstate/service_test.go
  • internal/routingstate/store.go
  • internal/routingstate/store_mongodb.go
  • internal/routingstate/store_postgresql.go
  • internal/routingstate/store_sqlite.go
  • internal/routingstate/types.go
  • internal/server/fallback_test.go
  • internal/server/handlers.go
  • internal/server/http.go
  • internal/server/internal_chat_completion_executor.go
  • internal/server/request_model_resolution_test.go
  • internal/server/translated_inference_service.go

Comment thread config/config_test.go
Comment on lines +720 to +769
func TestLoad_RoutingConfigYAML(t *testing.T) {
clearAllConfigEnvVars(t)

withTempDir(t, func(dir string) {
yaml := `
routing:
defaults:
strategy: weighted_round_robin
session_affinity: false
session_affinity_ttl: 45m
failover:
enabled: true
max_attempts: 5
retry_on_statuses: [429, 503]
retry_on_model_errors: false
model_pools:
claude-sonnet-4-6:
candidates:
- provider: anthropic_b
model: claude-sonnet-4-6
weight: 10
priority: 1
- provider: anthropic_a
model: claude-sonnet-4-6-20250929
weight: 8
priority: 2
`
if err := os.WriteFile(filepath.Join(dir, "config.yaml"), []byte(yaml), 0644); err != nil {
t.Fatalf("Failed to write config.yaml: %v", err)
}

result, err := Load()
if err != nil {
t.Fatalf("Load() failed: %v", err)
}
cfg := result.Config
if cfg.Routing.Defaults.Strategy != RoutingStrategyWeightedRoundRobin {
t.Fatalf("Routing.Defaults.Strategy = %q, want %q", cfg.Routing.Defaults.Strategy, RoutingStrategyWeightedRoundRobin)
}
if cfg.Routing.Defaults.SessionAffinity {
t.Fatal("expected SessionAffinity=false from YAML")
}
if cfg.Routing.Defaults.SessionAffinityTTL != 45*time.Minute {
t.Fatalf("SessionAffinityTTL = %s, want 45m", cfg.Routing.Defaults.SessionAffinityTTL)
}
pool := cfg.Routing.ModelPools["claude-sonnet-4-6"]
if len(pool.Candidates) != 2 {
t.Fatalf("len(pool.Candidates) = %d, want 2", len(pool.Candidates))
}
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert loaded failover fields in the routing YAML test.

Line 730 through Line 735 sets failover inputs, but the test only validates strategy, affinity, TTL, and candidate count. A regression in failover parsing could pass unnoticed.

✅ Suggested test assertions
 		if cfg.Routing.Defaults.SessionAffinityTTL != 45*time.Minute {
 			t.Fatalf("SessionAffinityTTL = %s, want 45m", cfg.Routing.Defaults.SessionAffinityTTL)
 		}
+		failover := cfg.Routing.Defaults.Failover
+		if !failover.Enabled {
+			t.Fatal("expected Failover.Enabled=true from YAML")
+		}
+		if failover.MaxAttempts != 5 {
+			t.Fatalf("Failover.MaxAttempts = %d, want 5", failover.MaxAttempts)
+		}
+		if !reflect.DeepEqual(failover.RetryOnStatuses, []int{429, 503}) {
+			t.Fatalf("Failover.RetryOnStatuses = %v, want [429 503]", failover.RetryOnStatuses)
+		}
+		if failover.RetryOnModelErrors {
+			t.Fatal("expected Failover.RetryOnModelErrors=false from YAML")
+		}
 		pool := cfg.Routing.ModelPools["claude-sonnet-4-6"]

As per coding guidelines: “Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/config_test.go` around lines 720 - 769, The test
TestLoad_RoutingConfigYAML is missing assertions for the failover block parsed
into cfg.Routing.Defaults.Failover; add assertions that
cfg.Routing.Defaults.Failover.Enabled == true,
cfg.Routing.Defaults.Failover.MaxAttempts == 5,
cfg.Routing.Defaults.Failover.RetryOnStatuses contains 429 and 503 (and has
expected length), and cfg.Routing.Defaults.Failover.RetryOnModelErrors == false
so regressions in failover parsing are caught.

Comment thread config/routing.go
Comment on lines +88 to +90
if len(cfg.Defaults.Failover.RetryOnStatuses) == 0 {
cfg.Defaults.Failover.RetryOnStatuses = []int{429, 500, 502, 503, 504}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate retry_on_statuses values before accepting config.

Invalid HTTP codes (e.g., 0, 700, negative values) are currently accepted and can silently break failover matching at runtime. Add range validation (100..599) during config load.

Proposed fix
 	if len(cfg.Defaults.Failover.RetryOnStatuses) == 0 {
 		cfg.Defaults.Failover.RetryOnStatuses = []int{429, 500, 502, 503, 504}
 	}
+	for idx, status := range cfg.Defaults.Failover.RetryOnStatuses {
+		if status < 100 || status > 599 {
+			return fmt.Errorf("routing.defaults.failover.retry_on_statuses[%d] must be a valid HTTP status code (100-599)", idx)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/routing.go` around lines 88 - 90, Validate and reject or sanitize
invalid HTTP status codes in cfg.Defaults.Failover.RetryOnStatuses during config
load: in the routing config initialization (the code that sets
cfg.Defaults.Failover.RetryOnStatuses in config/routing.go) iterate the slice
and ensure each status is within the HTTP range 100..599, either filtering out
invalid values or returning a config validation error; update the
default-setting block that currently sets []int{429,500,502,503,504} to perform
this validation (refer to cfg.Defaults.Failover.RetryOnStatuses) so runtime
failover matching cannot receive values like 0, negative, or >599.

Comment on lines +216 to +244
async fetchRoutingState() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing-state', request);
if (res.status === 503) {
this.routingStateAvailable = false;
this.routingStateViews = [];
this.routingPools = [];
this.syncDisplayModels();
return;
}
const handled = this.handleFetchResponse(res, 'routing state', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return;
}
if (!handled) {
this.routingStateViews = [];
this.routingPools = [];
this.syncDisplayModels();
return;
}
this.routingStateAvailable = true;
const payload = await res.json();
this.routingStateViews = Array.isArray(payload) ? payload : [];
} catch (e) {
console.error('Failed to fetch routing state:', e);
this.routingStateViews = [];
}
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing syncDisplayModels() after successful fetch.

fetchRoutingPools() calls syncDisplayModels() after updating routingPools, but fetchRoutingState() does not call it after updating routingStateViews. Since groupDisplayModels() uses routingStateViews for providerRoutingEnabled(), the display models may not reflect updated routing state.

🐛 Proposed fix
                     this.routingStateAvailable = true;
                     const payload = await res.json();
                     this.routingStateViews = Array.isArray(payload) ? payload : [];
+                    this.syncDisplayModels();
                 } catch (e) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fetchRoutingState() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing-state', request);
if (res.status === 503) {
this.routingStateAvailable = false;
this.routingStateViews = [];
this.routingPools = [];
this.syncDisplayModels();
return;
}
const handled = this.handleFetchResponse(res, 'routing state', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return;
}
if (!handled) {
this.routingStateViews = [];
this.routingPools = [];
this.syncDisplayModels();
return;
}
this.routingStateAvailable = true;
const payload = await res.json();
this.routingStateViews = Array.isArray(payload) ? payload : [];
} catch (e) {
console.error('Failed to fetch routing state:', e);
this.routingStateViews = [];
}
},
async fetchRoutingState() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing-state', request);
if (res.status === 503) {
this.routingStateAvailable = false;
this.routingStateViews = [];
this.routingPools = [];
this.syncDisplayModels();
return;
}
const handled = this.handleFetchResponse(res, 'routing state', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return;
}
if (!handled) {
this.routingStateViews = [];
this.routingPools = [];
this.syncDisplayModels();
return;
}
this.routingStateAvailable = true;
const payload = await res.json();
this.routingStateViews = Array.isArray(payload) ? payload : [];
this.syncDisplayModels();
} catch (e) {
console.error('Failed to fetch routing state:', e);
this.routingStateViews = [];
}
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 216 -
244, fetchRoutingState() updates this.routingStateViews but never calls
this.syncDisplayModels(), so UI models
(groupDisplayModels()/providerRoutingEnabled()) can remain stale; after
successfully assigning this.routingStateViews (the line setting it from payload)
call this.syncDisplayModels(), and also call this.syncDisplayModels() in the
catch block after setting this.routingStateViews = [] to mirror
fetchRoutingPools() behavior; reference fetchRoutingState, routingStateViews,
syncDisplayModels, fetchRoutingPools, groupDisplayModels, and
providerRoutingEnabled when making the change.

Comment on lines +246 to +266
async fetchRoutingPools() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing/model-pools', request);
const handled = this.handleFetchResponse(res, 'routing model pools', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return;
}
if (!handled) {
this.routingPools = [];
this.syncDisplayModels();
return;
}
const payload = await res.json();
this.routingPools = Array.isArray(payload) ? payload : [];
this.syncDisplayModels();
} catch (e) {
console.error('Failed to fetch routing pools:', e);
this.routingPools = [];
}
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing 503 handling for consistency with fetchRoutingState().

fetchRoutingState() explicitly handles 503 by setting routingStateAvailable = false, but fetchRoutingPools() does not. If the routing pools endpoint returns 503, the generic !handled path runs without marking the feature as unavailable.

🐛 Proposed fix
             async fetchRoutingPools() {
                 try {
                     const request = this.adminRequestOptions();
                     const res = await fetch('/admin/routing/model-pools', request);
+                    if (res.status === 503) {
+                        this.routingStateAvailable = false;
+                        this.routingPools = [];
+                        this.syncDisplayModels();
+                        return;
+                    }
                     const handled = this.handleFetchResponse(res, 'routing model pools', request);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fetchRoutingPools() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing/model-pools', request);
const handled = this.handleFetchResponse(res, 'routing model pools', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return;
}
if (!handled) {
this.routingPools = [];
this.syncDisplayModels();
return;
}
const payload = await res.json();
this.routingPools = Array.isArray(payload) ? payload : [];
this.syncDisplayModels();
} catch (e) {
console.error('Failed to fetch routing pools:', e);
this.routingPools = [];
}
},
async fetchRoutingPools() {
try {
const request = this.adminRequestOptions();
const res = await fetch('/admin/routing/model-pools', request);
if (res.status === 503) {
this.routingStateAvailable = false;
this.routingPools = [];
this.syncDisplayModels();
return;
}
const handled = this.handleFetchResponse(res, 'routing model pools', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return;
}
if (!handled) {
this.routingPools = [];
this.syncDisplayModels();
return;
}
const payload = await res.json();
this.routingPools = Array.isArray(payload) ? payload : [];
this.syncDisplayModels();
} catch (e) {
console.error('Failed to fetch routing pools:', e);
this.routingPools = [];
}
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 246 -
266, fetchRoutingPools() currently treats any non-handled response the same and
doesn't update routingStateAvailable like fetchRoutingState() does; update
fetchRoutingPools() so that after calling this.handleFetchResponse(res, 'routing
model pools', request) you check for a 503 condition (the same condition used in
fetchRoutingState()) and set this.routingStateAvailable = false and return;
otherwise keep the existing !handled path that clears this.routingPools and
calls this.syncDisplayModels(); reference functions/fields: fetchRoutingPools,
fetchRoutingState, handleFetchResponse, routingStateAvailable, routingPools,
syncDisplayModels.

Comment on lines +489 to +501
async submitRoutingStateChange(payload) {
const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
const res = await fetch('/admin/routing-state', request);
const handled = this.handleFetchResponse(res, 'routing state update', request);
if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
return false;
}
if (!handled) {
return false;
}
await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
return true;
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing error handling for network failures.

Unlike other submit functions (e.g., submitAliasForm, toggleAliasEnabled), this function lacks a try/catch wrapper. If fetch() throws a network error, the promise rejects unhandled and the caller's await will propagate the exception silently.

🐛 Proposed fix
             async submitRoutingStateChange(payload) {
+                try {
                     const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
                     const res = await fetch('/admin/routing-state', request);
                     const handled = this.handleFetchResponse(res, 'routing state update', request);
                     if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
                         return false;
                     }
                     if (!handled) {
                         return false;
                     }
                     await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
                     return true;
+                } catch (e) {
+                    console.error('Failed to update routing state:', e);
+                    return false;
+                }
             },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 489 -
501, submitRoutingStateChange currently doesn't catch fetch/network errors; wrap
the body of submitRoutingStateChange in a try/catch and on any thrown error
behave like the other submit helpers: catch the error, call the same error
handling/logging path (or return false) so the caller doesn't get an unhandled
rejection, and ensure you still call/await this.fetchRoutingState() and
this.fetchRoutingPools() only on success; reference the existing method
submitRoutingStateChange and related helpers handleFetchResponse,
isStaleAuthFetchResult, fetchRoutingState, and fetchRoutingPools when
implementing the catch behavior.

Comment on lines +35 to +39
case snapshot.Registered && snapshot.DiscoveredModelCount > 0 && modelFetchError == "":
if snapshot.UsingCachedModels && snapshot.LastModelFetchSuccessAt == nil {
return "degraded"
}
return "healthy"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Availability errors are masked as healthy when models are discovered.

Line 35-39 returns "healthy" before considering availability failures, so snapshots with non-empty LastAvailabilityError can be misclassified.

Suggested adjustment
 	switch {
+	case availabilityError != "" && snapshot.DiscoveredModelCount > 0:
+		return "degraded"
 	case snapshot.Registered && snapshot.DiscoveredModelCount > 0 && modelFetchError == "":
 		if snapshot.UsingCachedModels && snapshot.LastModelFetchSuccessAt == nil {
 			return "degraded"
 		}
 		return "healthy"

Also applies to: 44-47

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/runtime.go` around lines 35 - 39, The branch that returns
"healthy" when snapshot.Registered && snapshot.DiscoveredModelCount > 0 &&
modelFetchError == "" must also consider availability failures: check
snapshot.LastAvailabilityError (or the equivalent field) and treat
non-empty/non-nil availability errors as "degraded" before returning "healthy".
Update the logic around the case that uses snapshot.UsingCachedModels and
snapshot.LastModelFetchSuccessAt to first return "degraded" if
snapshot.LastAvailabilityError is set, and apply the same fix to the other
similar branch referenced (the one at lines 44-47) so both branches consider
snapshot.LastAvailabilityError prior to returning "healthy".

Comment thread internal/routing/session_affinity.go
Comment on lines +12 to +20
case config.RoutingStrategyPriorityFailover:
ordered := append([]Candidate(nil), pool.Candidates...)
sort.SliceStable(ordered, func(i, j int) bool {
if ordered[i].Priority != ordered[j].Priority {
return ordered[i].Priority < ordered[j].Priority
}
return ordered[i].QualifiedModel() < ordered[j].QualifiedModel()
})
return ordered[0], ordered[1:], nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add empty-pool guard before indexing ordered[0].

Line 20 can panic when pool.Candidates is empty in the priority_failover path.

Proposed minimal fix
 func selectCandidates(strategy config.RoutingStrategy, pool Pool, counters map[string]int) (Candidate, []Candidate, error) {
+	if len(pool.Candidates) == 0 {
+		return Candidate{}, nil, fmt.Errorf("pool %q has no candidates", pool.CanonicalModel)
+	}
+
 	switch strategy {
 	case config.RoutingStrategyPriorityFailover:
 		ordered := append([]Candidate(nil), pool.Candidates...)
 		sort.SliceStable(ordered, func(i, j int) bool {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/strategy.go` around lines 12 - 20, The priority-failover
branch (case config.RoutingStrategyPriorityFailover) blindly indexes ordered[0]
which will panic if pool.Candidates is empty; add a guard that checks
len(pool.Candidates) (or len(ordered)) == 0 and return an appropriate error (or
nil/empty values consistent with the function signature) before
sorting/returning, referencing the RoutingStrategyPriorityFailover case,
pool.Candidates and ordered slice to locate where to insert the check.

Comment on lines +55 to +56
normalized, err := normalizeEntry(entry)
if err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not rewrite persisted timestamps during Refresh.

normalizeEntry updates UpdatedAt to time.Now().UTC(), so every refresh changes in-memory timestamps even when no data changed in storage.

💡 Proposed fix
 	for _, entry := range entries {
+		createdAt := entry.CreatedAt
+		updatedAt := entry.UpdatedAt
 		normalized, err := normalizeEntry(entry)
 		if err != nil {
 			return fmt.Errorf("load routing state %q: %w", entry.Key, err)
 		}
+		normalized.CreatedAt = createdAt
+		normalized.UpdatedAt = updatedAt
 		next.order = append(next.order, normalized.Key)
 		next.byKey[normalized.Key] = normalized
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
normalized, err := normalizeEntry(entry)
if err != nil {
for _, entry := range entries {
createdAt := entry.CreatedAt
updatedAt := entry.UpdatedAt
normalized, err := normalizeEntry(entry)
if err != nil {
return fmt.Errorf("load routing state %q: %w", entry.Key, err)
}
normalized.CreatedAt = createdAt
normalized.UpdatedAt = updatedAt
next.order = append(next.order, normalized.Key)
next.byKey[normalized.Key] = normalized
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routingstate/service.go` around lines 55 - 56, normalizeEntry
currently overwrites entry.UpdatedAt with time.Now().UTC(), so calling it from
Refresh causes in-memory timestamps to change even when storage didn't; modify
Refresh to preserve persisted UpdatedAt by either adding a no-update flag to
normalizeEntry (e.g., normalizeEntry(entry, preserveTimestamps bool)) or by
capturing the original entry.UpdatedAt before calling normalizeEntry and
restoring it afterwards if no storage changes occurred; ensure references in
Refresh use the preserved UpdatedAt rather than the value set inside
normalizeEntry.

Comment on lines +63 to +65
if entry.Key == "" {
entry.Key = entry.ProviderName + "/" + entry.Model
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Canonicalize pool_candidate keys unconditionally.

For KindPoolCandidate, keeping a caller-supplied non-empty Key allows multiple keys for the same provider/model. That breaks stable identity and can create duplicate logical entries.

💡 Proposed fix
 	case KindPoolCandidate:
 		if entry.ProviderName == "" || entry.Model == "" {
 			return Entry{}, newValidationError("provider_name and model are required for pool_candidate", nil)
 		}
-		if entry.Key == "" {
-			entry.Key = entry.ProviderName + "/" + entry.Model
-		}
+		entry.Key = entry.ProviderName + "/" + entry.Model
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routingstate/types.go` around lines 63 - 65, For entries of
KindPoolCandidate, always canonicalize the Key to the canonical form
provider/model instead of leaving a caller-supplied Key; in the code handling
KindPoolCandidate (the block referencing entry.Key, entry.ProviderName and
entry.Model in internal/routingstate/types.go) remove the conditional that only
sets entry.Key when empty and unconditionally set entry.Key = entry.ProviderName
+ "/" + entry.Model so the Key is stable and unique for a given provider+model.

Introduce canonical model pool routing with priority failover, weighted distribution, session affinity, runtime-aware candidate filtering, admin state controls, and user-facing documentation.
@vfeitoza vfeitoza force-pushed the feat/canonical-routing-failover branch from 779fa78 to 62c4df5 Compare May 25, 2026 09:23
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (15)
internal/routing/exposed_models_test.go (1)

33-98: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Merge these exposed-model behavior tests into one table-driven test.

Both cases follow the same scaffold and should be expressed as table rows to keep future pool-strategy/runtime-state permutations easy to extend.

As per coding guidelines, "**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes. Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/exposed_models_test.go` around lines 33 - 98, Combine the
two tests TestCanonicalExposedModelLister_UsesEffectiveResolverChoice and
TestCanonicalExposedModelLister_SkipsUnhealthyCandidates into a single
table-driven test that iterates over cases; each case should include name,
routing config, catalog models, provider snapshots, expected count and expected
model fields, and then for each case construct the lister with
NewCanonicalExposedModelLister and call ExposedModels() asserting length and
expected values. Keep references to the original scenarios (weighted vs priority
strategies, provider snapshots with LastModelFetchError vs
LastModelFetchSuccessAt) as separate table entries so future permutations are
simple to add, and reuse any setup helpers to build testCatalog and
staticRuntimeProvider to avoid duplication.
internal/admin/handler_routing_state_test.go (1)

36-88: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Consolidate routing-state handler tests into a table-driven suite and expand error-path cases.

The current four standalone tests should be converted to table-driven form and include key negative-path cases (e.g., delete not found, malformed JSON) for this behavior change.

As per coding guidelines, "**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes. Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_routing_state_test.go` around lines 36 - 88, Replace
the four standalone tests with a single table-driven test that iterates cases
calling the handler methods (ListRoutingState, UpsertRoutingState,
DeleteRoutingState) via the same newRoutingStateHandler helper and echo context;
for each case define name, method, path body, expected status and expected body
(including negative cases: delete-not-found, malformed JSON, missing enabled
field), run as subtests with t.Run, construct the httptest request/recorder, set
Content-Type, invoke the appropriate handler method, and assert status and
response body; ensure cases cover request translation, response normalization,
error handling, default config and provider-specific parameter mapping (e.g.,
provider_name "anthropic_a") so tests replace TestListRoutingState_Empty,
TestUpsertRoutingState_Provider, TestUpsertRoutingState_InvalidMissingEnabled
and TestDeleteRoutingState.
config/config_test.go (1)

730-734: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert loaded failover defaults in TestLoad_RoutingConfigYAML.

The test writes routing.defaults.failover values but never verifies them, so failover parsing regressions can slip through.

Suggested test assertions
 		if cfg.Routing.Defaults.SessionAffinityTTL != 45*time.Minute {
 			t.Fatalf("SessionAffinityTTL = %s, want 45m", cfg.Routing.Defaults.SessionAffinityTTL)
 		}
+		failover := cfg.Routing.Defaults.Failover
+		if !failover.Enabled {
+			t.Fatal("expected Failover.Enabled=true from YAML")
+		}
+		if failover.MaxAttempts != 5 {
+			t.Fatalf("Failover.MaxAttempts = %d, want 5", failover.MaxAttempts)
+		}
+		if !reflect.DeepEqual(failover.RetryOnStatuses, []int{429, 503}) {
+			t.Fatalf("Failover.RetryOnStatuses = %v, want [429 503]", failover.RetryOnStatuses)
+		}
+		if failover.RetryOnModelErrors {
+			t.Fatal("expected Failover.RetryOnModelErrors=false from YAML")
+		}
 		pool := cfg.Routing.ModelPools["claude-sonnet-4-6"]
As per coding guidelines: “Add or update tests for behavior changes to cover request translation, response normalization, error handling, default configuration, and provider-specific parameter mapping.”

Also applies to: 756-768

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/config_test.go` around lines 730 - 734, Add assertions in
TestLoad_RoutingConfigYAML to verify that routing.defaults.failover is parsed
correctly: assert that the loaded config's Routing.Defaults.Failover.Enabled
equals true, MaxAttempts equals 5, RetryOnStatuses contains 429 and 503, and
RetryOnModelErrors equals false (use the test's config loader and the
TestLoad_RoutingConfigYAML test function and the Routing.Defaults.Failover
struct fields to locate where to add these checks); also add similar assertions
for the other YAML block referenced around lines 756-768 to ensure both failover
default blocks are covered.
internal/routing/session_affinity.go (1)

46-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid request-ID affinity keys; they cause unbounded map growth.

request_id fallback creates mostly single-use keys, while cleanup is lazy on Get, so stale entries can accumulate indefinitely.

Suggested minimal fix
 func affinityKey(ctx context.Context, canonicalModel string) string {
@@
 	userPath := strings.TrimSpace(core.UserPathFromContext(ctx))
 	if userPath != "" {
 		return canonicalModel + "|user_path|" + userPath
 	}
-	requestID := strings.TrimSpace(core.GetRequestID(ctx))
-	if requestID != "" {
-		return canonicalModel + "|request_id|" + requestID
-	}
 	return ""
 }

Also applies to: 67-69, 82-84

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/session_affinity.go` around lines 46 - 49, The current
affinity-key construction uses request-scoped IDs (core.GetRequestID(ctx) stored
in variable requestID) which produces mostly single-use keys and leaks entries;
remove the branches that return canonicalModel + "|request_id|" + requestID so
request_id is no longer used as a fallback. Locate the places that compute
requestID (calls to core.GetRequestID(ctx) and the return expressions that
append "|request_id|" to canonicalModel) and change them to skip to the next
stable fallback (or simply return canonicalModel) so only stable identifiers are
used for affinity keys.
internal/routing/strategy.go (1)

12-20: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard empty pools before indexing ordered[0].

The priority_failover path panics when pool.Candidates is empty.

Suggested minimal fix
 func selectCandidates(strategy config.RoutingStrategy, pool Pool, counters map[string]int) (Candidate, []Candidate, error) {
+	if len(pool.Candidates) == 0 {
+		return Candidate{}, nil, fmt.Errorf("pool %q has no candidates", pool.CanonicalModel)
+	}
+
 	switch strategy {
 	case config.RoutingStrategyPriorityFailover:
 		ordered := append([]Candidate(nil), pool.Candidates...)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/strategy.go` around lines 12 - 20, In the
RoutingStrategyPriorityFailover branch (case
config.RoutingStrategyPriorityFailover) guard against an empty pool before
accessing ordered[0]: check if len(pool.Candidates) == 0 (or len(ordered) == 0)
and return a clear error (or nil candidate with an error) instead of indexing;
then proceed to sort and return ordered[0], ordered[1:], nil only when
non-empty. Ensure you update any callers expecting a candidate/error
accordingly.
internal/admin/dashboard/static/js/modules/aliases.js (4)

489-500: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Catch network failures in submitRoutingStateChange.

If fetch() throws here, the promise rejects unhandled and the toggle callers never get the expected false result.

🔧 Minimal fix
             async submitRoutingStateChange(payload) {
-                const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
-                const res = await fetch('/admin/routing-state', request);
-                const handled = this.handleFetchResponse(res, 'routing state update', request);
-                if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
-                    return false;
-                }
-                if (!handled) {
-                    return false;
-                }
-                await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
-                return true;
+                try {
+                    const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
+                    const res = await fetch('/admin/routing-state', request);
+                    const handled = this.handleFetchResponse(res, 'routing state update', request);
+                    if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
+                        return false;
+                    }
+                    if (!handled) {
+                        return false;
+                    }
+                    await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
+                    return true;
+                } catch (e) {
+                    console.error('Failed to update routing state:', e);
+                    return false;
+                }
             },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 489 -
500, submitRoutingStateChange currently calls fetch('/admin/routing-state', ...)
without catching network exceptions, so any thrown error will bubble out; wrap
the fetch/response handling in a try/catch inside submitRoutingStateChange, call
this.handleFetchResponse as before and run the existing isStaleAuthFetchResult
check, but on any thrown error (or if handleFetchResponse indicates failure)
return false instead of letting the exception propagate; ensure the catch logs
or handles the error appropriately and returns false so callers always get a
boolean result.

246-265: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle routing-pool 503s the same way as routing-state 503s.

A 503 from /admin/routing/model-pools currently falls through the generic path, so routingStateAvailable stays true and the UI can keep stale pool-derived rows.

🔧 Minimal fix
             async fetchRoutingPools() {
                 try {
                     const request = this.adminRequestOptions();
                     const res = await fetch('/admin/routing/model-pools', request);
+                    if (res.status === 503) {
+                        this.routingStateAvailable = false;
+                        this.routingPools = [];
+                        this.syncDisplayModels();
+                        return;
+                    }
                     const handled = this.handleFetchResponse(res, 'routing model pools', request);
                     if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
                         return;
                     }
                     if (!handled) {
@@
                 } catch (e) {
                     console.error('Failed to fetch routing pools:', e);
                     this.routingPools = [];
+                    this.syncDisplayModels();
                 }
             },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 246 -
265, In fetchRoutingPools(), add an explicit check for a 503 response from the
/admin/routing/model-pools call and handle it like the routing-state 503: if
res.status === 503 set this.routingStateAvailable = false, set this.routingPools
= [], call this.syncDisplayModels(), and return early; place this check after
awaiting fetch/res and before parsing the body so stale pool-derived rows are
cleared. Reference: method fetchRoutingPools, properties routingStateAvailable
and routingPools, and helper syncDisplayModels.

523-529: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Invert the canonical toggle value.

When row.canonical_enabled is true, this still sends enabled: true, so the admin toggle cannot disable an active canonical model.

🔧 Minimal fix
             async toggleCanonicalModelEnabled(row) {
                 if (!row || !row.canonical_model) return;
                 await this.submitRoutingStateChange({
                     kind: 'canonical_model',
                     canonical_model: row.canonical_model,
-                    enabled: !(row.canonical_enabled === false)
+                    enabled: !row.canonical_enabled
                 });
             },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 523 -
529, The toggleCanonicalModelEnabled function is sending the wrong enabled
value: change the enabled calculation so it inverts row.canonical_enabled (e.g.,
set enabled to !row.canonical_enabled) before calling submitRoutingStateChange;
locate the toggleCanonicalModelEnabled method and replace the current enabled:
!(row.canonical_enabled === false) expression with an explicit inversion of
row.canonical_enabled so toggling an active canonical_model will disable it and
vice versa.

237-243: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sync derived rows after routing-state updates.

routingStateViews changes here, but syncDisplayModels() is never called afterward, so groupDisplayModels() and providerRoutingEnabled() can keep showing stale routing state until some other fetch happens.

🔧 Minimal fix
                     this.routingStateAvailable = true;
                     const payload = await res.json();
                     this.routingStateViews = Array.isArray(payload) ? payload : [];
+                    this.syncDisplayModels();
                 } catch (e) {
                     console.error('Failed to fetch routing state:', e);
                     this.routingStateViews = [];
+                    this.syncDisplayModels();
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 237 -
243, routingStateViews is updated in the fetch block but syncDisplayModels() is
never called, leaving groupDisplayModels() and providerRoutingEnabled() with
stale data; after assigning this.routingStateViews = Array.isArray(payload) ?
payload : [] in the try block and after this.routingStateViews = [] in the catch
block, call this.syncDisplayModels() so derived rows and routing-dependent
computed state are recalculated immediately (refer to syncDisplayModels(),
groupDisplayModels(), providerRoutingEnabled(), and the routingStateViews
assignment site).
internal/gateway/fallback.go (1)

57-72: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard canonical failover policy access.

Both fallback paths call o.failoverPolicy.ShouldAttempt(...) and read MaxAttempts whenever a canonical model is active. If the orchestrator is built without a policy, this panics instead of returning the primary error.

🔧 Minimal fix
 	shouldAttempt := ShouldAttemptFallback(primaryErr)
-	if workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" {
+	if workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" && o.failoverPolicy != nil {
 		shouldAttempt = o.failoverPolicy.ShouldAttempt(primaryErr)
 	}
@@
-	canonicalPool := workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != ""
+	canonicalPool := workflow != nil && workflow.Resolution != nil && workflow.Resolution.CanonicalModel != "" && o.failoverPolicy != nil

Also applies to: 159-174

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/gateway/fallback.go` around lines 57 - 72, The code dereferences
o.failoverPolicy (calling o.failoverPolicy.ShouldAttempt(...) and reading
o.failoverPolicy.MaxAttempts) when a canonical model is active, which can panic
if failoverPolicy is nil; update the logic in the canonical-path checks (around
ShouldAttemptFallback usage and where canonicalPool is set) to first test
o.failoverPolicy != nil before calling ShouldAttempt or accessing MaxAttempts so
that if failoverPolicy is nil you keep the original shouldAttempt value and do
not apply the attempts-limit break; apply the same nil-guard pattern to the
second occurrence of this logic (the similar block at lines 159-174) to ensure
the function returns the primary error instead of panicking when no policy is
configured.
internal/routing/pool_evaluator.go (1)

112-118: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not count unhealthy candidates as available.

selectable already excludes runtime.Status == "unhealthy", but availableCount still includes those candidates. That lets the pool report enabled even when one or more candidates are runtime-ineligible.

🔧 Minimal fix
-		if effectiveEnabled {
+		if effectiveEnabled && runtime.Status != "unhealthy" {
 			availableCount++
 		}

Also applies to: 137-145

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/pool_evaluator.go` around lines 112 - 118, The code
increments availableCount whenever effectiveEnabled is true but ignores
runtime.Status, so unhealthy runtimes are counted as available; change the
increment to only occur when the candidate is selectable (i.e., use the existing
selectable condition or check effectiveEnabled && runtime.Status != "unhealthy")
— replace "if effectiveEnabled { availableCount++ }" with "if selectable {
availableCount++ }" (and make the same change in the other similar block where
availableCount is incremented, identified by the same variables selectable,
effectiveEnabled, runtime.Status, availableCount, selectableCount).
internal/routing/exposed_models.go (1)

59-64: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the impossible nil branch.

chosen is always non-nil because it points to a local clone, and this is already tripping SA4031 in CI.

🔧 Minimal fix
 		clone := *model
 		clone.ID = canonical
-		chosen := &clone
-		if chosen != nil {
-			models = append(models, *chosen)
-		}
+		models = append(models, clone)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/exposed_models.go` around lines 59 - 64, The code creates a
local clone (clone := *model), takes its address into chosen (chosen := &clone)
and then guards with an impossible nil check (if chosen != nil) before
appending; remove that branch and append the clone (or *chosen) directly to
models. Update the block around clone, chosen and models (remove the if chosen
!= nil check and the chosen variable if you prefer) so you simply set clone.ID =
canonical and append clone (or *&clone) to models.
internal/admin/handler_routing_state.go (1)

75-77: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Match pool-candidate responses using canonical_model as part of identity.

provider_name + model is ambiguous across pools; include canonical_model to avoid returning the wrong updated entry.

Targeted fix
 			case routingstate.KindPoolCandidate:
-				if current.ProviderName == entry.ProviderName && current.Model == entry.Model {
+				if current.ProviderName == entry.ProviderName &&
+					current.Model == entry.Model &&
+					current.CanonicalModel == entry.CanonicalModel {
 					return c.JSON(http.StatusOK, current)
 				}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_routing_state.go` around lines 75 - 77, The match for
routingstate.KindPoolCandidate currently only compares provider and model
(current.ProviderName == entry.ProviderName && current.Model == entry.Model),
which can return wrong entries across pools; update the identity check to
include the canonical model as well (e.g., compare current.CanonicalModel ==
entry.CanonicalModel in the same conditional) so the case for
routingstate.KindPoolCandidate only returns when ProviderName, Model, and
CanonicalModel all match the incoming entry.
internal/routing/resolver.go (1)

61-66: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Synchronize access to Resolver.runtime on both write and read paths.

WithRuntime writes shared state unsafely, while ResolveWithContext reads it on the request path before locking. This can race under concurrent runtime refreshes.

Minimal thread-safe fix
 func (r *Resolver) WithRuntime(runtime RuntimeSnapshotProvider) *Resolver {
 	if r == nil {
 		return nil
 	}
+	r.mu.Lock()
+	defer r.mu.Unlock()
 	r.runtime = runtime
 	return r
 }
@@
-	evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(r.runtime))
+	r.mu.Lock()
+	runtime := r.runtime
+	r.mu.Unlock()
+	evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(runtime))

Also applies to: 97-97

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/resolver.go` around lines 61 - 66, The write to
Resolver.runtime in WithRuntime is not synchronized with reads in
ResolveWithContext; modify WithRuntime to take the Resolver's mutex (e.g.,
r.mu.Lock()/Unlock()) around the assignment, and also ensure ResolveWithContext
reads r.runtime under the same mutex (e.g., r.mu.RLock()/RUnlock()) before using
it (or promote to full Lock if a mutex type is different); if a mutex field does
not exist on Resolver, add a sync.RWMutex named mu and use it consistently in
WithRuntime and ResolveWithContext to prevent races.
internal/app/app.go (1)

278-293: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Close routingState during normal Shutdown to complete lifecycle symmetry.

routingState is initialized and closed on some startup-failure paths, but it is still not closed in the normal shutdown sequence.

Minimal shutdown addition
 	// 6. Close model pricing overrides subsystem.
 	if a.pricingOverrides != nil {
 		if err := a.pricingOverrides.Close(); err != nil {
 			slog.Error("model pricing overrides close error", "error", err)
 			errs = append(errs, fmt.Errorf("model pricing overrides close: %w", err))
 		}
 	}
+
+	// 7. Close routing state subsystem.
+	if a.routingState != nil {
+		if err := a.routingState.Close(); err != nil {
+			slog.Error("routing state close error", "error", err)
+			errs = append(errs, fmt.Errorf("routing state close: %w", err))
+		}
+	}
 
-	// 7. Close reusable guardrails subsystem.
+	// 8. Close reusable guardrails subsystem.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/app.go` around lines 278 - 293, The routingState created as
routingstate.Result and assigned to app.routingState is not being closed on
normal shutdown; update the shutdown/Close/Shutdown method to include closing
app.routingState (nil-check first) so its Close() is called during normal
lifecycle termination, and add it to the errors.Join call (or equivalent
aggregated close handling) alongside app.pricingOverrides.Close(),
app.modelOverrides.Close(), app.aliases.Close(), etc., to ensure symmetric
initialization/teardown of routingState.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/admin/handler_routing_pools.go`:
- Around line 82-83: The loop over h.routingConfig.ModelPools yields
nondeterministic ordering; before iterating to build responses, collect the map
keys (canonical), sort them (e.g., using sort.Strings), then iterate over the
sorted slice and use each canonical to look up poolCfg from
h.routingConfig.ModelPools when constructing responses—this makes the pool
listing deterministic while keeping existing logic that uses canonicalName and
poolCfg.

In `@internal/gateway/request_model_resolution.go`:
- Around line 112-121: The copy of canonical metadata into the
RequestModelResolution struct is missing the FailoverUsed and FallbackTarget
fields; update the block that assigns from canonicalResolution to resolution
(the if canonicalResolution != nil { ... } block in request_model_resolution.go)
to also set resolution.FailoverUsed = canonicalResolution.FailoverUsed and
resolution.FallbackTarget = canonicalResolution.FallbackTarget so these failover
outcome details are propagated downstream.

In `@internal/routingstate/service_test.go`:
- Around line 24-47: Convert the two related tests into a single table-driven
test that iterates test cases describing initial store entries (using
memoryStore, Upsert, and Entry with KindProvider/KindPoolCandidate,
ProviderName, Model, Enabled), the candidate list (routing.Candidate slice), the
target model, and the expected results for both FilterCandidates and
CandidateEnabled; for each case call NewService, apply Upsert entries, run
service.FilterCandidates(targetModel, candidates) and assert the filtered
result, and also call service.CandidateEnabled(core.ModelSelector{Provider: ...,
Model: ...}) to assert the default/enabled behavior—this consolidates
TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue into table-driven subtests.

In `@internal/routingstate/service.go`:
- Around line 113-130: StartBackgroundRefresh currently calls s.Refresh with
context.Background(), making refresh calls uninterruptible; change it to create
a cancelable context (e.g. ctx, cancel :=
context.WithCancel(context.Background())) that the goroutine uses when calling
s.Refresh(ctx) and ensure the returned stop function calls cancel() (and closes
the stop channel or signals the goroutine) so in-flight Refresh calls are
canceled promptly; update the goroutine to respect context cancellation when
invoking Refresh and to call cancel/cleanup when exiting.
- Around line 172-190: FilterCandidates currently queries
s.CanonicalModelEnabled, s.ProviderEnabled, and s.CandidateEnabled separately
which can interleave with Refresh; change it to take one consistent
routing-state snapshot under the Service lock and use that snapshot for all
checks. Concretely: add or use a snapshot-returning helper (e.g. a new method
like Service.SnapshotRoutingState or similar) that reads canonical, provider and
candidate state while holding the lock, then modify FilterCandidates to call
that snapshot method once and call snapshot.CanonicalModelEnabled,
snapshot.ProviderEnabled, and snapshot.CandidateEnabled (or access snapshot
fields) for all decisions so the method uses a single, consistent view of
routing state. Ensure no per-candidate locked helper calls remain.

---

Duplicate comments:
In `@config/config_test.go`:
- Around line 730-734: Add assertions in TestLoad_RoutingConfigYAML to verify
that routing.defaults.failover is parsed correctly: assert that the loaded
config's Routing.Defaults.Failover.Enabled equals true, MaxAttempts equals 5,
RetryOnStatuses contains 429 and 503, and RetryOnModelErrors equals false (use
the test's config loader and the TestLoad_RoutingConfigYAML test function and
the Routing.Defaults.Failover struct fields to locate where to add these
checks); also add similar assertions for the other YAML block referenced around
lines 756-768 to ensure both failover default blocks are covered.

In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 489-500: submitRoutingStateChange currently calls
fetch('/admin/routing-state', ...) without catching network exceptions, so any
thrown error will bubble out; wrap the fetch/response handling in a try/catch
inside submitRoutingStateChange, call this.handleFetchResponse as before and run
the existing isStaleAuthFetchResult check, but on any thrown error (or if
handleFetchResponse indicates failure) return false instead of letting the
exception propagate; ensure the catch logs or handles the error appropriately
and returns false so callers always get a boolean result.
- Around line 246-265: In fetchRoutingPools(), add an explicit check for a 503
response from the /admin/routing/model-pools call and handle it like the
routing-state 503: if res.status === 503 set this.routingStateAvailable = false,
set this.routingPools = [], call this.syncDisplayModels(), and return early;
place this check after awaiting fetch/res and before parsing the body so stale
pool-derived rows are cleared. Reference: method fetchRoutingPools, properties
routingStateAvailable and routingPools, and helper syncDisplayModels.
- Around line 523-529: The toggleCanonicalModelEnabled function is sending the
wrong enabled value: change the enabled calculation so it inverts
row.canonical_enabled (e.g., set enabled to !row.canonical_enabled) before
calling submitRoutingStateChange; locate the toggleCanonicalModelEnabled method
and replace the current enabled: !(row.canonical_enabled === false) expression
with an explicit inversion of row.canonical_enabled so toggling an active
canonical_model will disable it and vice versa.
- Around line 237-243: routingStateViews is updated in the fetch block but
syncDisplayModels() is never called, leaving groupDisplayModels() and
providerRoutingEnabled() with stale data; after assigning this.routingStateViews
= Array.isArray(payload) ? payload : [] in the try block and after
this.routingStateViews = [] in the catch block, call this.syncDisplayModels() so
derived rows and routing-dependent computed state are recalculated immediately
(refer to syncDisplayModels(), groupDisplayModels(), providerRoutingEnabled(),
and the routingStateViews assignment site).

In `@internal/admin/handler_routing_state_test.go`:
- Around line 36-88: Replace the four standalone tests with a single
table-driven test that iterates cases calling the handler methods
(ListRoutingState, UpsertRoutingState, DeleteRoutingState) via the same
newRoutingStateHandler helper and echo context; for each case define name,
method, path body, expected status and expected body (including negative cases:
delete-not-found, malformed JSON, missing enabled field), run as subtests with
t.Run, construct the httptest request/recorder, set Content-Type, invoke the
appropriate handler method, and assert status and response body; ensure cases
cover request translation, response normalization, error handling, default
config and provider-specific parameter mapping (e.g., provider_name
"anthropic_a") so tests replace TestListRoutingState_Empty,
TestUpsertRoutingState_Provider, TestUpsertRoutingState_InvalidMissingEnabled
and TestDeleteRoutingState.

In `@internal/admin/handler_routing_state.go`:
- Around line 75-77: The match for routingstate.KindPoolCandidate currently only
compares provider and model (current.ProviderName == entry.ProviderName &&
current.Model == entry.Model), which can return wrong entries across pools;
update the identity check to include the canonical model as well (e.g., compare
current.CanonicalModel == entry.CanonicalModel in the same conditional) so the
case for routingstate.KindPoolCandidate only returns when ProviderName, Model,
and CanonicalModel all match the incoming entry.

In `@internal/app/app.go`:
- Around line 278-293: The routingState created as routingstate.Result and
assigned to app.routingState is not being closed on normal shutdown; update the
shutdown/Close/Shutdown method to include closing app.routingState (nil-check
first) so its Close() is called during normal lifecycle termination, and add it
to the errors.Join call (or equivalent aggregated close handling) alongside
app.pricingOverrides.Close(), app.modelOverrides.Close(), app.aliases.Close(),
etc., to ensure symmetric initialization/teardown of routingState.

In `@internal/gateway/fallback.go`:
- Around line 57-72: The code dereferences o.failoverPolicy (calling
o.failoverPolicy.ShouldAttempt(...) and reading o.failoverPolicy.MaxAttempts)
when a canonical model is active, which can panic if failoverPolicy is nil;
update the logic in the canonical-path checks (around ShouldAttemptFallback
usage and where canonicalPool is set) to first test o.failoverPolicy != nil
before calling ShouldAttempt or accessing MaxAttempts so that if failoverPolicy
is nil you keep the original shouldAttempt value and do not apply the
attempts-limit break; apply the same nil-guard pattern to the second occurrence
of this logic (the similar block at lines 159-174) to ensure the function
returns the primary error instead of panicking when no policy is configured.

In `@internal/routing/exposed_models_test.go`:
- Around line 33-98: Combine the two tests
TestCanonicalExposedModelLister_UsesEffectiveResolverChoice and
TestCanonicalExposedModelLister_SkipsUnhealthyCandidates into a single
table-driven test that iterates over cases; each case should include name,
routing config, catalog models, provider snapshots, expected count and expected
model fields, and then for each case construct the lister with
NewCanonicalExposedModelLister and call ExposedModels() asserting length and
expected values. Keep references to the original scenarios (weighted vs priority
strategies, provider snapshots with LastModelFetchError vs
LastModelFetchSuccessAt) as separate table entries so future permutations are
simple to add, and reuse any setup helpers to build testCatalog and
staticRuntimeProvider to avoid duplication.

In `@internal/routing/exposed_models.go`:
- Around line 59-64: The code creates a local clone (clone := *model), takes its
address into chosen (chosen := &clone) and then guards with an impossible nil
check (if chosen != nil) before appending; remove that branch and append the
clone (or *chosen) directly to models. Update the block around clone, chosen and
models (remove the if chosen != nil check and the chosen variable if you prefer)
so you simply set clone.ID = canonical and append clone (or *&clone) to models.

In `@internal/routing/pool_evaluator.go`:
- Around line 112-118: The code increments availableCount whenever
effectiveEnabled is true but ignores runtime.Status, so unhealthy runtimes are
counted as available; change the increment to only occur when the candidate is
selectable (i.e., use the existing selectable condition or check
effectiveEnabled && runtime.Status != "unhealthy") — replace "if
effectiveEnabled { availableCount++ }" with "if selectable { availableCount++ }"
(and make the same change in the other similar block where availableCount is
incremented, identified by the same variables selectable, effectiveEnabled,
runtime.Status, availableCount, selectableCount).

In `@internal/routing/resolver.go`:
- Around line 61-66: The write to Resolver.runtime in WithRuntime is not
synchronized with reads in ResolveWithContext; modify WithRuntime to take the
Resolver's mutex (e.g., r.mu.Lock()/Unlock()) around the assignment, and also
ensure ResolveWithContext reads r.runtime under the same mutex (e.g.,
r.mu.RLock()/RUnlock()) before using it (or promote to full Lock if a mutex type
is different); if a mutex field does not exist on Resolver, add a sync.RWMutex
named mu and use it consistently in WithRuntime and ResolveWithContext to
prevent races.

In `@internal/routing/session_affinity.go`:
- Around line 46-49: The current affinity-key construction uses request-scoped
IDs (core.GetRequestID(ctx) stored in variable requestID) which produces mostly
single-use keys and leaks entries; remove the branches that return
canonicalModel + "|request_id|" + requestID so request_id is no longer used as a
fallback. Locate the places that compute requestID (calls to
core.GetRequestID(ctx) and the return expressions that append "|request_id|" to
canonicalModel) and change them to skip to the next stable fallback (or simply
return canonicalModel) so only stable identifiers are used for affinity keys.

In `@internal/routing/strategy.go`:
- Around line 12-20: In the RoutingStrategyPriorityFailover branch (case
config.RoutingStrategyPriorityFailover) guard against an empty pool before
accessing ordered[0]: check if len(pool.Candidates) == 0 (or len(ordered) == 0)
and return a clear error (or nil candidate with an error) instead of indexing;
then proceed to sort and return ordered[0], ordered[1:], nil only when
non-empty. Ensure you update any callers expecting a candidate/error
accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bb57203b-83ae-4abc-b516-b9f8de7b0d37

📥 Commits

Reviewing files that changed from the base of the PR and between 7e1d804 and 779fa78.

📒 Files selected for processing (53)
  • config/config.example.yaml
  • config/config.go
  • config/config_example_test.go
  • config/config_test.go
  • config/routing.go
  • docs/features/canonical-routing.mdx
  • internal/admin/dashboard/static/js/dashboard.js
  • internal/admin/dashboard/static/js/modules/aliases.js
  • internal/admin/dashboard/static/js/modules/aliases.test.cjs
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/handler.go
  • internal/admin/handler_providers.go
  • internal/admin/handler_routing_pools.go
  • internal/admin/handler_routing_pools_test.go
  • internal/admin/handler_routing_state.go
  • internal/admin/handler_routing_state_test.go
  • internal/admin/handler_test.go
  • internal/admin/routes.go
  • internal/admin/routes_test.go
  • internal/app/app.go
  • internal/core/canonical_routing.go
  • internal/core/request_model_resolution.go
  • internal/fallback/resolver.go
  • internal/fallback/resolver_test.go
  • internal/gateway/fallback.go
  • internal/gateway/inference_orchestrator.go
  • internal/gateway/request_model_resolution.go
  • internal/providers/config_test.go
  • internal/routing/composed_resolver.go
  • internal/routing/exposed_models.go
  • internal/routing/exposed_models_test.go
  • internal/routing/failover_policy.go
  • internal/routing/pool_evaluator.go
  • internal/routing/resolver.go
  • internal/routing/resolver_test.go
  • internal/routing/runtime.go
  • internal/routing/session_affinity.go
  • internal/routing/strategy.go
  • internal/routing/types.go
  • internal/routingstate/factory.go
  • internal/routingstate/service.go
  • internal/routingstate/service_test.go
  • internal/routingstate/store.go
  • internal/routingstate/store_mongodb.go
  • internal/routingstate/store_postgresql.go
  • internal/routingstate/store_sqlite.go
  • internal/routingstate/types.go
  • internal/server/fallback_test.go
  • internal/server/handlers.go
  • internal/server/http.go
  • internal/server/internal_chat_completion_executor.go
  • internal/server/request_model_resolution_test.go
  • internal/server/translated_inference_service.go

Comment on lines +82 to +83
for canonical, poolCfg := range h.routingConfig.ModelPools {
canonicalName := strings.TrimSpace(canonical)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make pool listing deterministic.

Iterating h.routingConfig.ModelPools directly makes response order nondeterministic across requests. Sort canonical model keys before building responses.

Suggested minimal fix
 import (
 	"net/http"
+	"sort"
 	"strings"
@@
-	for canonical, poolCfg := range h.routingConfig.ModelPools {
+	keys := make([]string, 0, len(h.routingConfig.ModelPools))
+	for canonical := range h.routingConfig.ModelPools {
+		keys = append(keys, canonical)
+	}
+	sort.Strings(keys)
+
+	for _, canonical := range keys {
+		poolCfg := h.routingConfig.ModelPools[canonical]
 		canonicalName := strings.TrimSpace(canonical)
 		if canonicalName == "" {
 			continue
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for canonical, poolCfg := range h.routingConfig.ModelPools {
canonicalName := strings.TrimSpace(canonical)
keys := make([]string, 0, len(h.routingConfig.ModelPools))
for canonical := range h.routingConfig.ModelPools {
keys = append(keys, canonical)
}
sort.Strings(keys)
for _, canonical := range keys {
poolCfg := h.routingConfig.ModelPools[canonical]
canonicalName := strings.TrimSpace(canonical)
if canonicalName == "" {
continue
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_routing_pools.go` around lines 82 - 83, The loop over
h.routingConfig.ModelPools yields nondeterministic ordering; before iterating to
build responses, collect the map keys (canonical), sort them (e.g., using
sort.Strings), then iterate over the sorted slice and use each canonical to look
up poolCfg from h.routingConfig.ModelPools when constructing responses—this
makes the pool listing deterministic while keeping existing logic that uses
canonicalName and poolCfg.

Comment on lines +112 to +121
if canonicalResolution != nil {
resolution.CanonicalModel = canonicalResolution.CanonicalModel
resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...)
resolution.RoutingStrategy = string(canonicalResolution.Strategy)
resolution.ConfigPrimary = canonicalResolution.ConfigPrimary
resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate
resolution.SelectedProviderName = canonicalResolution.SelectedProviderName
resolution.SelectedExactModel = canonicalResolution.SelectedExactModel
resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate FailoverUsed and FallbackTarget into RequestModelResolution.

The canonical metadata copy omits these fields, so downstream logs/diagnostics lose failover outcome details.

💡 Proposed fix
 	if canonicalResolution != nil {
 		resolution.CanonicalModel = canonicalResolution.CanonicalModel
 		resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...)
 		resolution.RoutingStrategy = string(canonicalResolution.Strategy)
 		resolution.ConfigPrimary = canonicalResolution.ConfigPrimary
 		resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate
 		resolution.SelectedProviderName = canonicalResolution.SelectedProviderName
 		resolution.SelectedExactModel = canonicalResolution.SelectedExactModel
 		resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...)
+		resolution.FailoverUsed = canonicalResolution.FailoverUsed
+		resolution.FallbackTarget = canonicalResolution.FallbackTarget
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if canonicalResolution != nil {
resolution.CanonicalModel = canonicalResolution.CanonicalModel
resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...)
resolution.RoutingStrategy = string(canonicalResolution.Strategy)
resolution.ConfigPrimary = canonicalResolution.ConfigPrimary
resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate
resolution.SelectedProviderName = canonicalResolution.SelectedProviderName
resolution.SelectedExactModel = canonicalResolution.SelectedExactModel
resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...)
}
if canonicalResolution != nil {
resolution.CanonicalModel = canonicalResolution.CanonicalModel
resolution.CanonicalPoolFallbacks = append([]core.ModelSelector(nil), canonicalResolution.Fallbacks...)
resolution.RoutingStrategy = string(canonicalResolution.Strategy)
resolution.ConfigPrimary = canonicalResolution.ConfigPrimary
resolution.EffectiveCandidate = canonicalResolution.EffectiveCandidate
resolution.SelectedProviderName = canonicalResolution.SelectedProviderName
resolution.SelectedExactModel = canonicalResolution.SelectedExactModel
resolution.BlockedCandidates = append([]core.BlockedCandidate(nil), canonicalResolution.BlockedCandidates...)
resolution.FailoverUsed = canonicalResolution.FailoverUsed
resolution.FallbackTarget = canonicalResolution.FallbackTarget
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/gateway/request_model_resolution.go` around lines 112 - 121, The
copy of canonical metadata into the RequestModelResolution struct is missing the
FailoverUsed and FallbackTarget fields; update the block that assigns from
canonicalResolution to resolution (the if canonicalResolution != nil { ... }
block in request_model_resolution.go) to also set resolution.FailoverUsed =
canonicalResolution.FailoverUsed and resolution.FallbackTarget =
canonicalResolution.FallbackTarget so these failover outcome details are
propagated downstream.

Comment on lines +24 to +47
func TestServiceFilterCandidatesHonorsManualDisable(t *testing.T) {
store := &memoryStore{}
service, err := NewService(store)
if err != nil { t.Fatalf("NewService() error = %v", err) }
if err := service.Upsert(context.Background(), Entry{Kind: KindProvider, ProviderName: "anthropic_a", Enabled: false}); err != nil {
t.Fatalf("Upsert provider state error = %v", err)
}
if err := service.Upsert(context.Background(), Entry{Kind: KindPoolCandidate, ProviderName: "anthropic_b", Model: "claude-sonnet-4-6", Enabled: false}); err != nil {
t.Fatalf("Upsert candidate state error = %v", err)
}
candidates := []routing.Candidate{{Provider: "anthropic_a", Model: "claude-sonnet-4-6"}, {Provider: "anthropic_b", Model: "claude-sonnet-4-6"}, {Provider: "anthropic_c", Model: "claude-sonnet-4-6"}}
filtered := service.FilterCandidates("claude-sonnet-4-6", candidates)
if len(filtered) != 1 || filtered[0].Provider != "anthropic_c" {
t.Fatalf("filtered = %+v, want only anthropic_c", filtered)
}
}

func TestServiceCandidateEnabledDefaultsTrue(t *testing.T) {
service, err := NewService(&memoryStore{})
if err != nil { t.Fatalf("NewService() error = %v", err) }
if !service.CandidateEnabled(core.ModelSelector{Provider: "anthropic_b", Model: "claude-opus-4-7"}) {
t.Fatal("expected unspecified candidate to default to enabled")
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Prefer a table-driven shape for these related candidate-state tests.

These two cases exercise the same behavior surface (FilterCandidates / CandidateEnabled) with overlapping setup; converting to table-driven cases will keep future routing-state scenarios easier to add and maintain.

As per coding guidelines: “**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routingstate/service_test.go` around lines 24 - 47, Convert the two
related tests into a single table-driven test that iterates test cases
describing initial store entries (using memoryStore, Upsert, and Entry with
KindProvider/KindPoolCandidate, ProviderName, Model, Enabled), the candidate
list (routing.Candidate slice), the target model, and the expected results for
both FilterCandidates and CandidateEnabled; for each case call NewService, apply
Upsert entries, run service.FilterCandidates(targetModel, candidates) and assert
the filtered result, and also call
service.CandidateEnabled(core.ModelSelector{Provider: ..., Model: ...}) to
assert the default/enabled behavior—this consolidates
TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue into table-driven subtests.

Comment on lines +113 to +130
func (s *Service) StartBackgroundRefresh(interval time.Duration) func() {
if interval <= 0 {
interval = time.Minute
}
stop := make(chan struct{})
go func() {
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
_ = s.Refresh(context.Background())
case <-stop:
return
}
}
}()
return func() { close(stop) }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not refresh with an uncancelable context.

StartBackgroundRefresh always uses context.Background(), so a slow or hung store call cannot be interrupted and the returned stop function cannot stop the goroutine until that I/O returns.

🔧 Minimal fix
 	go func() {
 		ticker := time.NewTicker(interval)
 		defer ticker.Stop()
 		for {
 			select {
 			case <-ticker.C:
-				_ = s.Refresh(context.Background())
+				refreshCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+				_ = s.Refresh(refreshCtx)
+				cancel()
 			case <-stop:
 				return
 			}
 		}
 	}()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routingstate/service.go` around lines 113 - 130,
StartBackgroundRefresh currently calls s.Refresh with context.Background(),
making refresh calls uninterruptible; change it to create a cancelable context
(e.g. ctx, cancel := context.WithCancel(context.Background())) that the
goroutine uses when calling s.Refresh(ctx) and ensure the returned stop function
calls cancel() (and closes the stop channel or signals the goroutine) so
in-flight Refresh calls are canceled promptly; update the goroutine to respect
context cancellation when invoking Refresh and to call cancel/cleanup when
exiting.

Comment on lines +172 to +190
func (s *Service) FilterCandidates(canonical string, candidates []routing.Candidate) []routing.Candidate {
if s == nil {
return append([]routing.Candidate(nil), candidates...)
}
if !s.CanonicalModelEnabled(canonical) {
return nil
}
filtered := make([]routing.Candidate, 0, len(candidates))
for _, candidate := range candidates {
selector := core.ModelSelector{Provider: candidate.Provider, Model: candidate.Model}
if !s.ProviderEnabled(candidate.Provider) {
continue
}
if !s.CandidateEnabled(selector) {
continue
}
filtered = append(filtered, candidate)
}
return filtered
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use one snapshot for FilterCandidates.

This method reads canonical, provider, and candidate state through separate locked helpers, so a concurrent Refresh can mix old and new routing state within one request. That can still route a candidate after the canonical model was just disabled.

🔧 Minimal fix
 func (s *Service) FilterCandidates(canonical string, candidates []routing.Candidate) []routing.Candidate {
 	if s == nil {
 		return append([]routing.Candidate(nil), candidates...)
 	}
-	if !s.CanonicalModelEnabled(canonical) {
+
+	s.mu.RLock()
+	snap := s.snapshot
+	s.mu.RUnlock()
+
+	if entry, ok := snap.canonicalModels[strings.TrimSpace(canonical)]; ok && !entry.Enabled {
 		return nil
 	}
 	filtered := make([]routing.Candidate, 0, len(candidates))
 	for _, candidate := range candidates {
-		selector := core.ModelSelector{Provider: candidate.Provider, Model: candidate.Model}
-		if !s.ProviderEnabled(candidate.Provider) {
+		provider := strings.TrimSpace(candidate.Provider)
+		model := strings.TrimSpace(candidate.Model)
+		if entry, ok := snap.providers[provider]; ok && !entry.Enabled {
 			continue
 		}
-		if !s.CandidateEnabled(selector) {
+		if entry, ok := snap.candidates[provider+"/"+model]; ok && !entry.Enabled {
 			continue
 		}
 		filtered = append(filtered, candidate)
 	}
 	return filtered
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routingstate/service.go` around lines 172 - 190, FilterCandidates
currently queries s.CanonicalModelEnabled, s.ProviderEnabled, and
s.CandidateEnabled separately which can interleave with Refresh; change it to
take one consistent routing-state snapshot under the Service lock and use that
snapshot for all checks. Concretely: add or use a snapshot-returning helper
(e.g. a new method like Service.SnapshotRoutingState or similar) that reads
canonical, provider and candidate state while holding the lock, then modify
FilterCandidates to call that snapshot method once and call
snapshot.CanonicalModelEnabled, snapshot.ProviderEnabled, and
snapshot.CandidateEnabled (or access snapshot fields) for all decisions so the
method uses a single, consistent view of routing state. Ensure no per-candidate
locked helper calls remain.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (12)
internal/routingstate/service_test.go (1)

24-47: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider table-driven tests for these related candidate-state scenarios.

Both tests exercise Service behavior around candidate filtering and enabled/disabled state. A table-driven approach would consolidate setup, make future test cases easier to add, and align with the project's testing guidelines.

As per coding guidelines: "**/*_test.go: Prefer table-driven tests in Go code for testing behavior changes."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routingstate/service_test.go` around lines 24 - 47, The two tests
for candidate state should be converted into a single table-driven test to
consolidate related scenarios: create a slice of test cases each containing a
name, initial store entries (e.g., entries for KindProvider/KindPoolCandidate
with Enabled true/false), the model string, input candidates, and the expected
filtered result or expected CandidateEnabled boolean; then loop over cases and
for each case call NewService (with memoryStore), apply Upsert for provided
entries, and assert behavior via Service.FilterCandidates and/or
Service.CandidateEnabled. Locate and update the existing
TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue functions (and usage of NewService,
memoryStore, Service.FilterCandidates, Service.CandidateEnabled) to implement
this table-driven approach so future cases can be added easily.
internal/admin/dashboard/static/js/modules/aliases.js (4)

238-243: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Routing-state refresh does not update rendered rows.

After assigning routingStateViews at Line 239 (and in the catch path at Line 242), syncDisplayModels() is not called, so routing badges/toggles can stay stale.

🔧 Minimal fix
                     this.routingStateAvailable = true;
                     const payload = await res.json();
                     this.routingStateViews = Array.isArray(payload) ? payload : [];
+                    this.syncDisplayModels();
                 } catch (e) {
                     console.error('Failed to fetch routing state:', e);
                     this.routingStateViews = [];
+                    this.syncDisplayModels();
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 238 -
243, The routing-state update assigns this.routingStateViews but never refreshes
the rendered rows; after setting this.routingStateViews in both the try path
(after const payload = await res.json()) and the catch path (where
this.routingStateViews = []), call this.syncDisplayModels() to re-sync UI state
(i.e., add this.syncDisplayModels() immediately after both assignments to ensure
badges/toggles update).

523-529: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Canonical toggle computes the wrong next enabled state.

At Line 528, !(row.canonical_enabled === false) keeps true as true instead of toggling it off.

🔧 Minimal fix
                 await this.submitRoutingStateChange({
                     kind: 'canonical_model',
                     canonical_model: row.canonical_model,
-                    enabled: !(row.canonical_enabled === false)
+                    enabled: !row.canonical_enabled
                 });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 523 -
529, The toggleCanonicalModelEnabled handler computes the next enabled state
incorrectly; replace the expression !(row.canonical_enabled === false) with a
proper boolean toggle such as !Boolean(row.canonical_enabled) (or simply
!row.canonical_enabled) when calling submitRoutingStateChange so the
canonical_enabled value actually flips; update the enabled property passed to
submitRoutingStateChange in toggleCanonicalModelEnabled to use that corrected
expression.

489-501: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

submitRoutingStateChange can throw unhandled network errors.

This path has no try/catch, unlike other submit helpers. A failed fetch() rejects and escapes without returning false.

🔧 Minimal fix
             async submitRoutingStateChange(payload) {
-                const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
-                const res = await fetch('/admin/routing-state', request);
-                const handled = this.handleFetchResponse(res, 'routing state update', request);
-                if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
-                    return false;
-                }
-                if (!handled) {
-                    return false;
-                }
-                await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
-                return true;
+                try {
+                    const request = this.adminRequestOptions({ method: 'PUT', body: JSON.stringify(payload) });
+                    const res = await fetch('/admin/routing-state', request);
+                    const handled = this.handleFetchResponse(res, 'routing state update', request);
+                    if (typeof this.isStaleAuthFetchResult === 'function' && this.isStaleAuthFetchResult(handled)) {
+                        return false;
+                    }
+                    if (!handled) {
+                        return false;
+                    }
+                    await Promise.all([this.fetchRoutingState(), this.fetchRoutingPools()]);
+                    return true;
+                } catch (e) {
+                    console.error('Failed to update routing state:', e);
+                    return false;
+                }
             },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 489 -
501, submitRoutingStateChange currently performs fetch() without a try/catch so
network errors can throw and escape; wrap the body of submitRoutingStateChange
in a try/catch, catch any errors from fetch() or subsequent awaits (including
handleFetchResponse, fetchRoutingState, fetchRoutingPools), log or handle the
error as appropriate, and return false on error so the caller gets a safe
failure value; keep existing checks for isStaleAuthFetchResult(handled) and the
normal success path that returns true.

246-257: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

fetchRoutingPools misses explicit 503 handling parity.

fetchRoutingState() marks routing state as unavailable on 503, but fetchRoutingPools() does not, which can leave availability state inconsistent.

🔧 Minimal fix
             async fetchRoutingPools() {
                 try {
                     const request = this.adminRequestOptions();
                     const res = await fetch('/admin/routing/model-pools', request);
+                    if (res.status === 503) {
+                        this.routingStateAvailable = false;
+                        this.routingPools = [];
+                        this.syncDisplayModels();
+                        return;
+                    }
                     const handled = this.handleFetchResponse(res, 'routing model pools', request);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/dashboard/static/js/modules/aliases.js` around lines 246 -
257, fetchRoutingPools currently omits explicit 503 handling and can leave
routing availability inconsistent; update the async fetchRoutingPools method to
mirror fetchRoutingState by checking the response status (res.status === 503)
after handleFetchResponse and setting this.routingUnavailable = true (and
clearing this.routingPools / calling this.syncDisplayModels() as needed) then
return; also ensure on the success path you set this.routingUnavailable = false
so availability is updated when pools load successfully; use the existing res,
handled, and this.routingPools references in fetchRoutingPools to implement this
parity.
internal/routing/runtime.go (1)

35-47: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Availability errors can still be classified as healthy.

At Line 35 and Line 46, a provider with discovered models and non-empty LastAvailabilityError can still return "healthy".

🔧 Suggested adjustment
 	switch {
+	case availabilityError != "" && snapshot.DiscoveredModelCount > 0:
+		return "degraded"
 	case snapshot.Registered && snapshot.DiscoveredModelCount > 0 && modelFetchError == "":
 		if snapshot.UsingCachedModels && snapshot.LastModelFetchSuccessAt == nil {
 			return "degraded"
 		}
 		return "healthy"
@@
-	case snapshot.DiscoveredModelCount > 0:
+	case snapshot.DiscoveredModelCount > 0 && availabilityError == "":
 		return "healthy"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/runtime.go` around lines 35 - 47, The current health logic
in the switch (using snapshot.Registered, snapshot.DiscoveredModelCount,
modelFetchError, availabilityError, snapshot.UsingCachedModels,
snapshot.LastModelFetchSuccessAt) can still return "healthy" when
availabilityError (LastAvailabilityError) is non-empty; update the conditions
that return "healthy" to also require availabilityError == "" (e.g., the
branches that check snapshot.DiscoveredModelCount > 0 and modelFetchError == ""
and the final snapshot.DiscoveredModelCount > 0 branch), and ensure branches
that handle availabilityError != "" move to "degraded" or "unhealthy" as
intended so availability errors always prevent a "healthy" result.
config/routing.go (1)

88-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate retry_on_statuses values during config load.

Invalid codes (e.g., 0, negative, >599) are currently accepted and can break retry matching behavior.

🔧 Minimal fix
 	if len(cfg.Defaults.Failover.RetryOnStatuses) == 0 {
 		cfg.Defaults.Failover.RetryOnStatuses = []int{429, 500, 502, 503, 504}
 	}
+	for idx, status := range cfg.Defaults.Failover.RetryOnStatuses {
+		if status < 100 || status > 599 {
+			return fmt.Errorf("routing.defaults.failover.retry_on_statuses[%d] must be a valid HTTP status code (100-599)", idx)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@config/routing.go` around lines 88 - 90, The RetryOnStatuses slice
(cfg.Defaults.Failover.RetryOnStatuses) accepts invalid HTTP codes (0, negative,
>599) which can break retry logic; add validation during config load to either
reject the config with a clear error or sanitize the slice by filtering out
values outside 100..599 (or specifically 100–599 inclusive) and duplicate
values, e.g., in the config-loading/validation function that constructs
cfg.Defaults.Failover, check each status in RetryOnStatuses and return a
descriptive error if any are out of range (or remove invalid entries and log a
warning) so only valid HTTP status codes remain.
internal/routing/strategy.go (1)

12-20: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add empty-candidate guard before ordered[0].

Line 20 panics when pool.Candidates is empty in priority_failover.

🔧 Minimal fix
 func selectCandidates(strategy config.RoutingStrategy, pool Pool, counters map[string]int) (Candidate, []Candidate, error) {
+	if len(pool.Candidates) == 0 {
+		return Candidate{}, nil, fmt.Errorf("pool %q has no candidates", pool.CanonicalModel)
+	}
+
 	switch strategy {
 	case config.RoutingStrategyPriorityFailover:
 		ordered := append([]Candidate(nil), pool.Candidates...)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/strategy.go` around lines 12 - 20, Add an empty-candidate
guard before accessing ordered[0] in the config.RoutingStrategyPriorityFailover
branch: after building ordered := append([]Candidate(nil), pool.Candidates...)
check if len(ordered) == 0 and return the function's zero-value candidate(s) and
a descriptive error (e.g. fmt.Errorf("priority_failover: no candidates
available")) instead of indexing into an empty slice; keep existing behavior for
non-empty/one-element slices so ordered[1:] continues to work.
internal/routing/session_affinity.go (1)

46-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Affinity map can grow unbounded with request-id keys.

Using request-scoped keys at Line 46 while pruning only on Get (Line 67) means Put-heavy flows can accumulate expired entries indefinitely.

🔧 Minimal mitigation
 func affinityKey(ctx context.Context, canonicalModel string) string {
@@
-	requestID := strings.TrimSpace(core.GetRequestID(ctx))
-	if requestID != "" {
-		return canonicalModel + "|request_id|" + requestID
-	}
 	return ""
 }

Also applies to: 67-69, 74-84

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/session_affinity.go` around lines 46 - 49, The affinity map
grows unbounded because you create keys using request-scoped values
(core.GetRequestID(ctx) → requestID → canonicalModel+"|request_id|"+requestID)
but only prune on Get; update the Put path (and any code using the affinity map)
to either skip adding request_id-scoped keys entirely or enforce pruning/TTL
when inserting so ephemeral request IDs cannot accumulate. Specifically, modify
the code paths that build keys from requestID (the block using
core.GetRequestID(ctx) and the functions/methods that call Put/Get on the
affinity map) to: 1) avoid persisting request-scoped keys into the map or 2) run
the same expiry/pruning logic on Put as is run in Get (or attach a TTL eviction)
so expired request_id entries are removed on insert. Ensure references to
canonicalModel, requestID, Get, and Put are updated accordingly.
internal/admin/handler_routing_state.go (1)

75-77: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include canonical_model in pool-candidate response matching.

KindPoolCandidate matching is ambiguous with only provider_name + model; it can return the wrong updated row when that pair exists in multiple canonical pools.

Proposed minimal fix
 			case routingstate.KindPoolCandidate:
-				if current.ProviderName == entry.ProviderName && current.Model == entry.Model {
+				if current.ProviderName == entry.ProviderName &&
+					current.Model == entry.Model &&
+					current.CanonicalModel == entry.CanonicalModel {
 					return c.JSON(http.StatusOK, current)
 				}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/admin/handler_routing_state.go` around lines 75 - 77, The
KindPoolCandidate match in the switch (case routingstate.KindPoolCandidate)
currently only compares current.ProviderName == entry.ProviderName &&
current.Model == entry.Model and must also compare canonical_model; update the
matching condition to include current.CanonicalModel == entry.CanonicalModel (or
the struct field name CanonicalModel) so the handler returns current only when
provider, model, and canonical_model all match; ensure any tests or callers that
rely on this behavior are adjusted if necessary.
internal/routing/resolver.go (1)

61-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize Resolver.runtime reads/writes to avoid race conditions.

WithRuntime writes r.runtime unsafely, while ResolveWithContext reads it on request handling path; this can race under concurrent updates.

Proposed minimal fix
 func (r *Resolver) WithRuntime(runtime RuntimeSnapshotProvider) *Resolver {
 	if r == nil {
 		return nil
 	}
+	r.mu.Lock()
+	defer r.mu.Unlock()
 	r.runtime = runtime
 	return r
 }
@@
-	evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(r.runtime))
+	r.mu.Lock()
+	runtime := r.runtime
+	r.mu.Unlock()
+	evaluated := EvaluatePool(r.strategy, pool, r.state, RuntimeInfoByProvider(runtime))

Also applies to: 97-97

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/resolver.go` around lines 61 - 67, Resolver.runtime is being
written in WithRuntime without synchronization while ResolveWithContext reads it
concurrently; add synchronization by introducing a mutex on Resolver (e.g., mu
sync.RWMutex) and use mu.Lock()/mu.Unlock() in WithRuntime and
mu.RLock()/mu.RUnlock() around reads in ResolveWithContext (and any other
accessors/setters such as the other writer at the referenced location), ensuring
all reads/writes to the runtime field are guarded.
internal/app/app.go (1)

55-55: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Close routingState during normal Shutdown as well.

routingState is now a managed subsystem, but Shutdown does not close it; this can leak store/refresh resources on normal app termination.

Proposed minimal fix
 	// 7. Close reusable guardrails subsystem.
 	if a.guardrails != nil {
 		if err := a.guardrails.Close(); err != nil {
 			slog.Error("guardrails close error", "error", err)
 			errs = append(errs, fmt.Errorf("guardrails close: %w", err))
 		}
 	}
+
+	// 8. Close routing state subsystem.
+	if a.routingState != nil {
+		if err := a.routingState.Close(); err != nil {
+			slog.Error("routing state close error", "error", err)
+			errs = append(errs, fmt.Errorf("routing state close: %w", err))
+		}
+	}
 
-	// 8. Close managed auth keys subsystem.
+	// 9. Close managed auth keys subsystem.
 	if a.authKeys != nil {

Also applies to: 278-293

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/app.go` at line 55, routingState (the *routingstate.Result
field) is not being closed during normal Shutdown, causing resource leaks;
update the app Shutdown method (and any shutdown/cleanup path that currently
handles other subsystems) to call routingState.Close() (or
routingState.Shutdown() if that API exists) and handle/log any error, and also
ensure the same closure is applied in the alternate shutdown/cleanup block that
mirrors lines 278-293 so routingState is always closed on normal termination.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/routing/exposed_models.go`:
- Around line 103-114: The fallback branch in
CombinedExposedModelLister.ExposedModelsFiltered bypasses the provided allow
filter by appending lister.ExposedModels() directly; update that branch to
iterate the returned models from lister.ExposedModels() and apply the
allow(core.ModelSelector) predicate to each model before appending so only
permitted models are included (mirror behavior of
serverFilteredExposedModelLister.ExposedModelsFiltered). Ensure you reference
the existing symbols CombinedExposedModelLister, ExposedModelsFiltered,
serverFilteredExposedModelLister, ExposedModels, and the allow parameter when
implementing the per-model check.

In `@internal/routing/failover_policy.go`:
- Around line 51-59: The 5xx/429 retry check is wrongly guarded by the
RetryOnModelErrors early return: in function(s) using the RetryOnModelErrors
boolean (see symbol RetryOnModelErrors) move or duplicate the HTTP status check
so that server-error/429 logic runs regardless of RetryOnModelErrors;
specifically, ensure the block that checks status >=
http.StatusInternalServerError || status == http.StatusTooManyRequests is
evaluated independently (or before) the early return, and keep the model-error
string checks gated by RetryOnModelErrors; also ensure RetryOnStatuses (or
similar configured status list) is consulted for explicit overrides.

---

Duplicate comments:
In `@config/routing.go`:
- Around line 88-90: The RetryOnStatuses slice
(cfg.Defaults.Failover.RetryOnStatuses) accepts invalid HTTP codes (0, negative,
>599) which can break retry logic; add validation during config load to either
reject the config with a clear error or sanitize the slice by filtering out
values outside 100..599 (or specifically 100–599 inclusive) and duplicate
values, e.g., in the config-loading/validation function that constructs
cfg.Defaults.Failover, check each status in RetryOnStatuses and return a
descriptive error if any are out of range (or remove invalid entries and log a
warning) so only valid HTTP status codes remain.

In `@internal/admin/dashboard/static/js/modules/aliases.js`:
- Around line 238-243: The routing-state update assigns this.routingStateViews
but never refreshes the rendered rows; after setting this.routingStateViews in
both the try path (after const payload = await res.json()) and the catch path
(where this.routingStateViews = []), call this.syncDisplayModels() to re-sync UI
state (i.e., add this.syncDisplayModels() immediately after both assignments to
ensure badges/toggles update).
- Around line 523-529: The toggleCanonicalModelEnabled handler computes the next
enabled state incorrectly; replace the expression !(row.canonical_enabled ===
false) with a proper boolean toggle such as !Boolean(row.canonical_enabled) (or
simply !row.canonical_enabled) when calling submitRoutingStateChange so the
canonical_enabled value actually flips; update the enabled property passed to
submitRoutingStateChange in toggleCanonicalModelEnabled to use that corrected
expression.
- Around line 489-501: submitRoutingStateChange currently performs fetch()
without a try/catch so network errors can throw and escape; wrap the body of
submitRoutingStateChange in a try/catch, catch any errors from fetch() or
subsequent awaits (including handleFetchResponse, fetchRoutingState,
fetchRoutingPools), log or handle the error as appropriate, and return false on
error so the caller gets a safe failure value; keep existing checks for
isStaleAuthFetchResult(handled) and the normal success path that returns true.
- Around line 246-257: fetchRoutingPools currently omits explicit 503 handling
and can leave routing availability inconsistent; update the async
fetchRoutingPools method to mirror fetchRoutingState by checking the response
status (res.status === 503) after handleFetchResponse and setting
this.routingUnavailable = true (and clearing this.routingPools / calling
this.syncDisplayModels() as needed) then return; also ensure on the success path
you set this.routingUnavailable = false so availability is updated when pools
load successfully; use the existing res, handled, and this.routingPools
references in fetchRoutingPools to implement this parity.

In `@internal/admin/handler_routing_state.go`:
- Around line 75-77: The KindPoolCandidate match in the switch (case
routingstate.KindPoolCandidate) currently only compares current.ProviderName ==
entry.ProviderName && current.Model == entry.Model and must also compare
canonical_model; update the matching condition to include current.CanonicalModel
== entry.CanonicalModel (or the struct field name CanonicalModel) so the handler
returns current only when provider, model, and canonical_model all match; ensure
any tests or callers that rely on this behavior are adjusted if necessary.

In `@internal/app/app.go`:
- Line 55: routingState (the *routingstate.Result field) is not being closed
during normal Shutdown, causing resource leaks; update the app Shutdown method
(and any shutdown/cleanup path that currently handles other subsystems) to call
routingState.Close() (or routingState.Shutdown() if that API exists) and
handle/log any error, and also ensure the same closure is applied in the
alternate shutdown/cleanup block that mirrors lines 278-293 so routingState is
always closed on normal termination.

In `@internal/routing/resolver.go`:
- Around line 61-67: Resolver.runtime is being written in WithRuntime without
synchronization while ResolveWithContext reads it concurrently; add
synchronization by introducing a mutex on Resolver (e.g., mu sync.RWMutex) and
use mu.Lock()/mu.Unlock() in WithRuntime and mu.RLock()/mu.RUnlock() around
reads in ResolveWithContext (and any other accessors/setters such as the other
writer at the referenced location), ensuring all reads/writes to the runtime
field are guarded.

In `@internal/routing/runtime.go`:
- Around line 35-47: The current health logic in the switch (using
snapshot.Registered, snapshot.DiscoveredModelCount, modelFetchError,
availabilityError, snapshot.UsingCachedModels, snapshot.LastModelFetchSuccessAt)
can still return "healthy" when availabilityError (LastAvailabilityError) is
non-empty; update the conditions that return "healthy" to also require
availabilityError == "" (e.g., the branches that check
snapshot.DiscoveredModelCount > 0 and modelFetchError == "" and the final
snapshot.DiscoveredModelCount > 0 branch), and ensure branches that handle
availabilityError != "" move to "degraded" or "unhealthy" as intended so
availability errors always prevent a "healthy" result.

In `@internal/routing/session_affinity.go`:
- Around line 46-49: The affinity map grows unbounded because you create keys
using request-scoped values (core.GetRequestID(ctx) → requestID →
canonicalModel+"|request_id|"+requestID) but only prune on Get; update the Put
path (and any code using the affinity map) to either skip adding
request_id-scoped keys entirely or enforce pruning/TTL when inserting so
ephemeral request IDs cannot accumulate. Specifically, modify the code paths
that build keys from requestID (the block using core.GetRequestID(ctx) and the
functions/methods that call Put/Get on the affinity map) to: 1) avoid persisting
request-scoped keys into the map or 2) run the same expiry/pruning logic on Put
as is run in Get (or attach a TTL eviction) so expired request_id entries are
removed on insert. Ensure references to canonicalModel, requestID, Get, and Put
are updated accordingly.

In `@internal/routing/strategy.go`:
- Around line 12-20: Add an empty-candidate guard before accessing ordered[0] in
the config.RoutingStrategyPriorityFailover branch: after building ordered :=
append([]Candidate(nil), pool.Candidates...) check if len(ordered) == 0 and
return the function's zero-value candidate(s) and a descriptive error (e.g.
fmt.Errorf("priority_failover: no candidates available")) instead of indexing
into an empty slice; keep existing behavior for non-empty/one-element slices so
ordered[1:] continues to work.

In `@internal/routingstate/service_test.go`:
- Around line 24-47: The two tests for candidate state should be converted into
a single table-driven test to consolidate related scenarios: create a slice of
test cases each containing a name, initial store entries (e.g., entries for
KindProvider/KindPoolCandidate with Enabled true/false), the model string, input
candidates, and the expected filtered result or expected CandidateEnabled
boolean; then loop over cases and for each case call NewService (with
memoryStore), apply Upsert for provided entries, and assert behavior via
Service.FilterCandidates and/or Service.CandidateEnabled. Locate and update the
existing TestServiceFilterCandidatesHonorsManualDisable and
TestServiceCandidateEnabledDefaultsTrue functions (and usage of NewService,
memoryStore, Service.FilterCandidates, Service.CandidateEnabled) to implement
this table-driven approach so future cases can be added easily.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d41ff2ac-3e50-40ac-941b-f338818144b0

📥 Commits

Reviewing files that changed from the base of the PR and between 779fa78 and 62c4df5.

📒 Files selected for processing (53)
  • config/config.example.yaml
  • config/config.go
  • config/config_example_test.go
  • config/config_test.go
  • config/routing.go
  • docs/features/canonical-routing.mdx
  • internal/admin/dashboard/static/js/dashboard.js
  • internal/admin/dashboard/static/js/modules/aliases.js
  • internal/admin/dashboard/static/js/modules/aliases.test.cjs
  • internal/admin/dashboard/templates/model-table-body.html
  • internal/admin/handler.go
  • internal/admin/handler_providers.go
  • internal/admin/handler_routing_pools.go
  • internal/admin/handler_routing_pools_test.go
  • internal/admin/handler_routing_state.go
  • internal/admin/handler_routing_state_test.go
  • internal/admin/handler_test.go
  • internal/admin/routes.go
  • internal/admin/routes_test.go
  • internal/app/app.go
  • internal/core/canonical_routing.go
  • internal/core/request_model_resolution.go
  • internal/fallback/resolver.go
  • internal/fallback/resolver_test.go
  • internal/gateway/fallback.go
  • internal/gateway/inference_orchestrator.go
  • internal/gateway/request_model_resolution.go
  • internal/providers/config_test.go
  • internal/routing/composed_resolver.go
  • internal/routing/exposed_models.go
  • internal/routing/exposed_models_test.go
  • internal/routing/failover_policy.go
  • internal/routing/pool_evaluator.go
  • internal/routing/resolver.go
  • internal/routing/resolver_test.go
  • internal/routing/runtime.go
  • internal/routing/session_affinity.go
  • internal/routing/strategy.go
  • internal/routing/types.go
  • internal/routingstate/factory.go
  • internal/routingstate/service.go
  • internal/routingstate/service_test.go
  • internal/routingstate/store.go
  • internal/routingstate/store_mongodb.go
  • internal/routingstate/store_postgresql.go
  • internal/routingstate/store_sqlite.go
  • internal/routingstate/types.go
  • internal/server/fallback_test.go
  • internal/server/handlers.go
  • internal/server/http.go
  • internal/server/internal_chat_completion_executor.go
  • internal/server/request_model_resolution_test.go
  • internal/server/translated_inference_service.go

Comment on lines +103 to +114
func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model {
if l == nil {
return nil
}
var result []core.Model
for _, lister := range l.listers {
if filtered, ok := lister.(serverFilteredExposedModelLister); ok {
result = append(result, filtered.ExposedModelsFiltered(allow)...)
continue
}
result = append(result, lister.ExposedModels()...)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

allow filtering is bypassed in the fallback branch.

When allow is provided, Line 113 still appends unfiltered models from listers that only implement ExposedModels(). That can expose models that should have been filtered out.

🔧 Minimal fix
 func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model {
@@
 	for _, lister := range l.listers {
 		if filtered, ok := lister.(serverFilteredExposedModelLister); ok {
 			result = append(result, filtered.ExposedModelsFiltered(allow)...)
 			continue
 		}
+		if allow != nil {
+			continue
+		}
 		result = append(result, lister.ExposedModels()...)
 	}
 	return result
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model {
if l == nil {
return nil
}
var result []core.Model
for _, lister := range l.listers {
if filtered, ok := lister.(serverFilteredExposedModelLister); ok {
result = append(result, filtered.ExposedModelsFiltered(allow)...)
continue
}
result = append(result, lister.ExposedModels()...)
}
func (l *CombinedExposedModelLister) ExposedModelsFiltered(allow func(core.ModelSelector) bool) []core.Model {
if l == nil {
return nil
}
var result []core.Model
for _, lister := range l.listers {
if filtered, ok := lister.(serverFilteredExposedModelLister); ok {
result = append(result, filtered.ExposedModelsFiltered(allow)...)
continue
}
if allow != nil {
continue
}
result = append(result, lister.ExposedModels()...)
}
return result
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/exposed_models.go` around lines 103 - 114, The fallback
branch in CombinedExposedModelLister.ExposedModelsFiltered bypasses the provided
allow filter by appending lister.ExposedModels() directly; update that branch to
iterate the returned models from lister.ExposedModels() and apply the
allow(core.ModelSelector) predicate to each model before appending so only
permitted models are included (mirror behavior of
serverFilteredExposedModelLister.ExposedModelsFiltered). Ensure you reference
the existing symbols CombinedExposedModelLister, ExposedModelsFiltered,
serverFilteredExposedModelLister, ExposedModels, and the allow parameter when
implementing the per-model check.

Comment on lines +51 to +59
if !p.RetryOnModelErrors {
return false
}
message := strings.ToLower(strings.TrimSpace(gatewayErr.Message))
if strings.Contains(message, "model") && (strings.Contains(message, "unavailable") || strings.Contains(message, "not found") || strings.Contains(message, "unsupported")) {
return true
}
if status >= http.StatusInternalServerError || status == http.StatusTooManyRequests {
return true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

5xx/429 retry logic is accidentally gated by RetryOnModelErrors.

At Line 51, returning early when RetryOnModelErrors is false makes Line 58–59 unreachable, so transient HTTP failures won’t fail over unless explicitly listed in RetryOnStatuses.

Suggested minimal fix
 	if _, ok := p.RetryOnStatuses[status]; ok {
 		return true
 	}
+	if status >= http.StatusInternalServerError || status == http.StatusTooManyRequests {
+		return true
+	}
 	if !p.RetryOnModelErrors {
 		return false
 	}
 	message := strings.ToLower(strings.TrimSpace(gatewayErr.Message))
 	if strings.Contains(message, "model") && (strings.Contains(message, "unavailable") || strings.Contains(message, "not found") || strings.Contains(message, "unsupported")) {
 		return true
 	}
-	if status >= http.StatusInternalServerError || status == http.StatusTooManyRequests {
-		return true
-	}
 	return false
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/routing/failover_policy.go` around lines 51 - 59, The 5xx/429 retry
check is wrongly guarded by the RetryOnModelErrors early return: in function(s)
using the RetryOnModelErrors boolean (see symbol RetryOnModelErrors) move or
duplicate the HTTP status check so that server-error/429 logic runs regardless
of RetryOnModelErrors; specifically, ensure the block that checks status >=
http.StatusInternalServerError || status == http.StatusTooManyRequests is
evaluated independently (or before) the early return, and keep the model-error
string checks gated by RetryOnModelErrors; also ensure RetryOnStatuses (or
similar configured status list) is consulted for explicit overrides.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants