fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status #9754
Open
localai-bot wants to merge 2 commits into
Conversation
fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy state of their owning node, causing /embeddings (and other inference paths) to dispatch to a backend whose process was gone or drained. The downstream symptom in a live cluster was pgvector rejecting inserts with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)" because the misbehaving backend silently returned a malformed (oversized) tensor; the Models page showed the model as "running" without an associated node, like a stale entry, even though the node was no longer visible in the Nodes view.

Two changes here, plus a third in a follow-up commit:

- MarkDraining now cascade-deletes node_models rows for the affected node, mirroring MarkOffline. Drains are explicit operator actions — the box has been intentionally taken out of rotation — so clearing the rows stops the Models UI from misreporting and prevents the routing layer from picking those rows if scheduling logic is ever relaxed. In-flight requests already hold their gRPC client through Route() and finish normally; the only observable effect is a non-fatal IncrementInFlight warning, acceptable for a drain. MarkUnhealthy is deliberately left status-only: it fires from managers_distributed / reconciler on a single nats.ErrNoResponders with no retry, so a transient NATS hiccup must not nuke every loaded model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on backend_nodes.status = healthy in addition to node_models.state = loaded. The previous version relied on the second node-fetch step to reject non-healthy nodes, but a concurrent reader could still pick the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to DisablePerModelHealthCheck and inverted at the call site so per-model gRPC probing is on by default. The probe (now made consecutive-miss aware in a follow-up commit) independently health-checks each model's gRPC address and removes stale node_models rows when the backend has crashed even though the worker's node-level heartbeat is still arriving.

Migration: the field had no CLI flag, env var binding, or YAML key in tree (only the bare struct field), so there is no user-facing migration. Anything constructing DistributedConfig in code needs to drop the assignment (default now does the right thing) or invert it.

Assisted-by: Claude:claude-opus-4-7
go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
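For orientation, a minimal sketch of the cascade-delete shape the first bullet describes. The store type, driver, and exact schema are assumptions inferred from the commit message, not the actual LocalAI code:

```go
package nodes

import (
	"context"
	"database/sql"
)

// NodeStore is a hypothetical stand-in for the real node store; only the
// shape needed for this sketch is shown.
type NodeStore struct {
	db *sql.DB
}

// MarkDraining flips the node status and, like MarkOffline, cascade-deletes
// the node's node_models rows in the same transaction, so stale rows cannot
// outlive the drain. Table and column names are taken from the commit text.
func (s *NodeStore) MarkDraining(ctx context.Context, nodeID string) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	if _, err := tx.ExecContext(ctx,
		`UPDATE backend_nodes SET status = 'draining' WHERE id = ?`, nodeID); err != nil {
		return err
	}
	// A drain is an explicit operator action: every "loaded" row for this
	// node is stale by definition, so remove them all.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM node_models WHERE node_id = ?`, nodeID); err != nil {
		return err
	}
	return tx.Commit()
}
```

Doing the status flip and the row cleanup in one transaction is what keeps the Models UI and the routing layer from ever observing a half-drained node.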
…removes a row

The per-model gRPC probe used to remove a node_models row on a single failed health check. With the per-model probe now on by default, that made any 5-second gRPC blip (network jitter, a long-running request hogging the worker's gRPC server thread, brief GC pause) trigger a full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before removal. At the default 15s tick a model must be unreachable for ~45s before reap; a single successful probe in between resets the streak. Per-(node, model, replica) state tracked under a mutex on the monitor. If the removal call itself fails, the miss counter is left in place so the next tick retries rather than starting the streak over.

Tests:

- removes stale model via per-model health check after consecutive failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a success (covers the reset-on-success path and verifies the counter reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7
go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
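A minimal sketch of the consecutive-miss bookkeeping described above. The monitor shape, key fields, and removal hook are assumptions; only the threshold of 3 and the reset/retry semantics come from the commit message:

```go
package nodes

import "sync"

// perModelMissThreshold comes from the commit message: three consecutive
// failed probes (~45s at the default 15s tick) before a row is reaped.
const perModelMissThreshold = 3

// probeKey is a hypothetical per-(node, model, replica) key.
type probeKey struct {
	nodeID  string
	model   string
	replica int
}

// healthMonitor is a stand-in for the real monitor; remove is whatever
// deletes the stale node_models row for the given key.
type healthMonitor struct {
	mu     sync.Mutex
	misses map[probeKey]int
	remove func(probeKey) error
}

// observe records one probe result and reaps the row only after
// perModelMissThreshold consecutive misses.
func (m *healthMonitor) observe(k probeKey, probeOK bool) {
	m.mu.Lock()
	defer m.mu.Unlock()

	if probeOK {
		delete(m.misses, k) // a single success resets the streak
		return
	}
	m.misses[k]++
	if m.misses[k] < perModelMissThreshold {
		return // not enough consecutive misses yet
	}
	if err := m.remove(k); err != nil {
		// Removal failed: keep the counter so the next tick retries the
		// removal instead of restarting the 3-miss streak.
		return
	}
	delete(m.misses, k)
}
```

With observe called once per tick per replica, three straight misses mean roughly 45 seconds of unreachability before the row is removed, which is the behavior the new tests pin down.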
localai-bot force-pushed from 78fb08b to 9029ec2
Summary
Stale node_models rows (state="loaded") were surviving past the healthy state of their owning node, causing /embeddings (and other inference paths) to dispatch to a backend whose process was gone or drained. The downstream symptom was pgvector rejecting inserts as "vector cannot have more than 16000 dimensions (SQLSTATE 54000)" because the misbehaving backend silently returned a malformed (oversized) tensor; the Models page showed the model as "running" without an associated node, like a stale entry, even though the node was no longer visible in the Nodes view.

Why
Repro path observed live:
1. qwen3-embedding-4b was registered on a worker that had been operationally drained (process gone, but node-level heartbeats still arriving).
2. FindAndLockNodeWithModel picked the stale node_models row because the JOIN only filtered by node_models.state = "loaded"; the status check happened in a second query whose failure rolled back the transaction but didn't prevent the same row from being picked next time (see the sketch after this list).
3. probeHealth (gRPC ping) passed against the drained worker.
4. /embeddings returned 200 → LocalRecall's pgvector insert blew up at the 16000-dim ceiling.
5. MarkUnhealthy/MarkDraining only flipped backend_nodes.status; unlike MarkOffline, they didn't cascade-clean node_models. Stale rows accumulated.
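To make the race in step 2 concrete, here is a hypothetical reconstruction of the pre-fix lookup; the function name aside, the query shape, column names, and locking details are assumptions based on the description above:

```go
package nodes

import (
	"context"
	"database/sql"
	"fmt"
)

// findAndLockOld is a hypothetical reconstruction of the pre-fix lookup.
// The row selection only checks node_models.state; the owning node's
// status is validated in a second query. In the window between the two,
// a concurrent reader can pick the same stale row, and a failure here
// rolls the transaction back without removing the row.
func findAndLockOld(ctx context.Context, tx *sql.Tx, model string) (string, error) {
	var nodeID, addr string
	err := tx.QueryRowContext(ctx, `
		SELECT nm.node_id, nm.grpc_address
		FROM node_models nm
		JOIN backend_nodes bn ON bn.id = nm.node_id
		WHERE nm.model = ? AND nm.state = 'loaded'
		LIMIT 1`, model).Scan(&nodeID, &addr) // row locking elided
	if err != nil {
		return "", err
	}

	// Second step: reject non-healthy nodes. Too late: the stale row was
	// already selectable, and it stays selectable for the next caller.
	var status string
	if err := tx.QueryRowContext(ctx,
		`SELECT status FROM backend_nodes WHERE id = ?`, nodeID).Scan(&status); err != nil {
		return "", err
	}
	if status != "healthy" {
		return "", fmt.Errorf("node %s not healthy (status=%s)", nodeID, status)
	}
	return addr, nil
}
```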
Changes (defense-in-depth)

- MarkDraining cascade-deletes node_models — mirrors MarkOffline. A drain is an explicit operator action, and routing already filters these nodes out, but clearing the rows stops the UI from misreporting and prevents future scheduling-logic relaxations from picking them up. MarkUnhealthy stays status-only: it fires on a single nats.ErrNoResponders with no retry, so a transient NATS hiccup must not clear every loaded model (see the first commit message above).
- FindAndLockNodeWithModel's inner JOIN now also filters by backend_nodes.status = healthy — the previous version relied on the second node-fetch step to reject non-healthy nodes, but a concurrent reader could still pick the same stale row in the same window. Belt-and-braces; the fixed query shape is sketched after this list.
- DistributedConfig.PerModelHealthCheck → DisablePerModelHealthCheck (inverted; on by default) — the per-model gRPC probe independently health-checks each model's gRPC address and removes stale node_models rows when the backend has crashed even though the worker's node-level heartbeat is still arriving.
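Under the same assumptions as the reconstruction above, the fix amounts to moving the status predicate into the row selection itself, so a stale row is never a candidate:

```go
package nodes

// Hypothetical post-fix query for FindAndLockNodeWithModel: the healthy
// filter is part of the JOIN's WHERE clause, so even a concurrent reader
// cannot select a row whose owning node is not healthy.
const findAndLockQuery = `
	SELECT nm.node_id, nm.grpc_address
	FROM node_models nm
	JOIN backend_nodes bn ON bn.id = nm.node_id
	WHERE nm.model = ?
	  AND nm.state = 'loaded'
	  AND bn.status = 'healthy'
	LIMIT 1`
```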
Migration

DistributedConfig.PerModelHealthCheck was renamed and inverted to DisablePerModelHealthCheck. The field had no CLI flag, env var binding, or YAML key in tree (only the bare struct field), so there's no user-facing migration. Anything constructing DistributedConfig in code needs to drop the assignment (the default now does the right thing) or invert it, as sketched below.
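A minimal before/after sketch of that in-code migration; the struct here is a stand-in carrying only the renamed field:

```go
package main

import "fmt"

// DistributedConfig is a minimal stand-in for the real struct in
// core/config; only the renamed field is shown.
type DistributedConfig struct {
	// Was: PerModelHealthCheck bool (probing opt-in). Inverted so the
	// zero value enables the per-model gRPC probe.
	DisablePerModelHealthCheck bool
}

func main() {
	cfg := DistributedConfig{}                      // default: probe enabled
	probeEnabled := !cfg.DisablePerModelHealthCheck // inversion at the call site
	fmt.Println(probeEnabled)                       // true

	optOut := DistributedConfig{DisablePerModelHealthCheck: true}
	fmt.Println(!optOut.DisablePerModelHealthCheck) // false: probe disabled
}
```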
Test plan

- go test ./core/services/nodes/ (100s, all pass)
- go vet ./core/services/nodes/ ./core/config/ ./core/application/ clean
- /embeddings either returns an error (no replica) or routes to a healthy one — no 200-with-garbage, no pgvector dimension overflow downstream.

🤖 Generated with Claude Code