fix(executor): 把 Ollama 4xx 标记为 is_client_error,保护节点不被错请求踢下线#1
fix(executor): 把 Ollama 4xx 标记为 is_client_error,保护节点不被错请求踢下线#1yanbangqing-beep wants to merge 1 commit into
Conversation
Coordinated with keycompute's matching commit on feat/b1-b2-b6-fix
which introduces is_client_error on NodeTaskResult::Failed.
Before: every Ollama error (4xx, 5xx, network) was reported back to the
server as the same opaque ollama_error Failed result. The server then
counted each failure against the node's consecutive_failure_count, so a
single bad client request (e.g. embed-only model used as chat) could
exclude an otherwise healthy node after server-side requeues.
After:
* protocol/types.rs: mirror server's `#[serde(default)] is_client_error`
on NodeTaskResult::Failed. Old servers ignoring the field still see
the previous failure behaviour.
* client/ollama.rs: chat() now returns NodeTokenError::HttpError
{ status, message } instead of stringifying the status into Ollama(_).
Preserves the HTTP status so the executor can classify the failure.
* runtime/executor.rs: new classify_ollama_error helper:
- HTTP 4xx → is_client_error = true
(server: terminal-fail immediately, no node failure_count++)
- HTTP 5xx / network errors → is_client_error = false
(server: existing requeue + failure_count behaviour)
Unit tests assert the 4xx vs 5xx classification.
|
@bot review |
Review of PR #1: fix(executor): classify Ollama 4xx as client error to protect node from being excludedOverall AssessmentThis PR addresses a critical bug (B1) where client-induced 4xx errors from Ollama (e.g., using an embedding-only model for chat) were causing node exclusion after 3 failures. The changes appropriately preserve the HTTP status code, classify 4xx as Blocking IssuesNone. The code is correct and addresses the bug as described. Non-blocking Suggestions
Suggested Tests
ConclusionThe changes are well-structured, follow the stated design, and fix a real production bug. The code is maintainable, backward-compatible, and the new tests are appropriate. No blocking issues. The non-blocking suggestions are minor and can be addressed in a follow-up. Ready to merge. Trigger: |
概述
本 PR 配套 keycompute/keycompute#13 的协议变更,在 node-token 客户端这一侧:
修复的 bug:B1 —— 任意一个会让 Ollama 返回 4xx 的请求(如把 embed-only 模型当 chat 用),会被服务端算到节点头上,3 次失败后整个节点 `excluded` 下线。
复现步骤
前置:node-token 启动后已注册并 online,本地 Ollama 装了至少一个 embed 模型(如 `nomic-embed-text:latest`)。
发一个用 embed-only 模型当 chat 模型的请求 —— Ollama 会返 `400 "does not support chat"`:
```bash
curl -sS http://localhost:3000/v1/chat/completions \
-H "Authorization: Bearer sk-..." \
-d '{"model":"node:nomic-embed-text:latest","messages":[{"role":"user","content":"hi"}]}'
```
改前:
```
ERROR node_token::client::ollama: Ollama chat failed with status 400: {"error":"..."}
INFO node_token::runtime::executor: Task ... completed: action=Requeued, node_status=online
ERROR node_token::client::ollama: Ollama chat failed with status 400: {"error":"..."}
INFO node_token::runtime::executor: Task ... completed: action=Failed, node_status=excluded
WARN Node EXCLUDED ...
```
服务端把 1 个客户端错请求算成 3 次节点失败,把节点踢下线。
改后:
```
ERROR node_token::client::ollama: Ollama chat failed with status 400: {"error":"..."}
INFO node_token::runtime::executor: Task ... completed: action=Failed, node_status=online
```
节点保持 online,任务直接 terminal failed,服务端把错误回传给 HTTP client。
根本原因
src/client/ollama.rs的 `chat()` 函数在收到非 2xx 响应时,把 status code 字符串化:```rust
// 改前:status code 被压成字符串,executor 拿不到结构化信息
return Err(NodeTokenError::Ollama(format!(
"Chat failed: HTTP {} - {}",
status, body
)));
```
`src/runtime/executor.rs` 的 `execute` 看到任何错误都打成相同的 `Failed { code: "ollama_error", message }`,服务端无从区分。
修复方案
1.
src/client/ollama.rs::chat()改用已经存在但未使用的 `NodeTokenError::HttpError { status, message }` 变体,保留 HTTP status:
```rust
return Err(NodeTokenError::HttpError {
status,
message: body,
});
```
2.
src/protocol/types.rs镜像服务端协议,加 `is_client_error: bool` 到 `NodeTaskResult::Failed`,带 `#[serde(default)]` 保持向后兼容(老服务端忽略未知字段)。
3.
src/runtime/executor.rs加 `classify_ollama_error` helper:
```rust
fn classify_ollama_error(err: &NodeTokenError) -> NodeTaskResult {
let (code, message, is_client_error) = match err {
NodeTokenError::HttpError { status, message } if (400..500).contains(status) => (
format!("ollama_http_{}", status),
message.clone(),
true, // 4xx → 客户端请求错
),
NodeTokenError::HttpError { status, message } => (
format!("ollama_http_{}", status),
message.clone(),
false, // 5xx → 节点/服务端问题
),
NodeTokenError::Network(e) => (
"ollama_network".to_string(), e.to_string(), false,
),
other => (
"ollama_error".to_string(), other.to_string(), false,
),
};
NodeTaskResult::Failed { code, message, is_client_error }
}
```
`execute` 里替换为单行调用 `classify_ollama_error(&e)`。
测试
新增的单元测试(test-first)
src/runtime/executor.rs::tests:调整的测试
跑测试
`cargo test --release`:所有现有测试 + 新增 2 个全绿。
向后兼容性
可独立 merge 不破已部署服务端。
关联