Skip to content

fix(executor): 把 Ollama 4xx 标记为 is_client_error,保护节点不被错请求踢下线#1

Open
yanbangqing-beep wants to merge 1 commit into
keycompute:mainfrom
yanbangqing-beep:feat/b1-b2-b6-fix
Open

fix(executor): 把 Ollama 4xx 标记为 is_client_error,保护节点不被错请求踢下线#1
yanbangqing-beep wants to merge 1 commit into
keycompute:mainfrom
yanbangqing-beep:feat/b1-b2-b6-fix

Conversation

@yanbangqing-beep
Copy link
Copy Markdown

概述

本 PR 配套 keycompute/keycompute#13 的协议变更,在 node-token 客户端这一侧:

  1. 保留 Ollama 的 HTTP status code(目前被字符串化丢失)
  2. 按 4xx vs 5xx 分类失败,把 4xx 标记为 `is_client_error: true` 发回服务端
  3. 让服务端能据此区分"客户端请求错"和"节点不健康",不再把客户端的错误请求算成节点失败

修复的 bug:B1 —— 任意一个会让 Ollama 返回 4xx 的请求(如把 embed-only 模型当 chat 用),会被服务端算到节点头上,3 次失败后整个节点 `excluded` 下线。


复现步骤

前置:node-token 启动后已注册并 online,本地 Ollama 装了至少一个 embed 模型(如 `nomic-embed-text:latest`)。

发一个用 embed-only 模型当 chat 模型的请求 —— Ollama 会返 `400 "does not support chat"`:

```bash
curl -sS http://localhost:3000/v1/chat/completions \
-H "Authorization: Bearer sk-..." \
-d '{"model":"node:nomic-embed-text:latest","messages":[{"role":"user","content":"hi"}]}'
```

改前:

```
ERROR node_token::client::ollama: Ollama chat failed with status 400: {"error":"..."}
INFO node_token::runtime::executor: Task ... completed: action=Requeued, node_status=online
ERROR node_token::client::ollama: Ollama chat failed with status 400: {"error":"..."}
INFO node_token::runtime::executor: Task ... completed: action=Failed, node_status=excluded
WARN Node EXCLUDED ...
```

服务端把 1 个客户端错请求算成 3 次节点失败,把节点踢下线。

改后:

```
ERROR node_token::client::ollama: Ollama chat failed with status 400: {"error":"..."}
INFO node_token::runtime::executor: Task ... completed: action=Failed, node_status=online
```

节点保持 online,任务直接 terminal failed,服务端把错误回传给 HTTP client。


根本原因

src/client/ollama.rs 的 `chat()` 函数在收到非 2xx 响应时,把 status code 字符串化:

```rust
// 改前:status code 被压成字符串,executor 拿不到结构化信息
return Err(NodeTokenError::Ollama(format!(
"Chat failed: HTTP {} - {}",
status, body
)));
```

`src/runtime/executor.rs` 的 `execute` 看到任何错误都打成相同的 `Failed { code: "ollama_error", message }`,服务端无从区分。


修复方案

1. src/client/ollama.rs::chat()

改用已经存在但未使用的 `NodeTokenError::HttpError { status, message }` 变体,保留 HTTP status:

```rust
return Err(NodeTokenError::HttpError {
status,
message: body,
});
```

2. src/protocol/types.rs

镜像服务端协议,加 `is_client_error: bool` 到 `NodeTaskResult::Failed`,带 `#[serde(default)]` 保持向后兼容(老服务端忽略未知字段)。

3. src/runtime/executor.rs

加 `classify_ollama_error` helper:

```rust
fn classify_ollama_error(err: &NodeTokenError) -> NodeTaskResult {
let (code, message, is_client_error) = match err {
NodeTokenError::HttpError { status, message } if (400..500).contains(status) => (
format!("ollama_http_{}", status),
message.clone(),
true, // 4xx → 客户端请求错
),
NodeTokenError::HttpError { status, message } => (
format!("ollama_http_{}", status),
message.clone(),
false, // 5xx → 节点/服务端问题
),
NodeTokenError::Network(e) => (
"ollama_network".to_string(), e.to_string(), false,
),
other => (
"ollama_error".to_string(), other.to_string(), false,
),
};
NodeTaskResult::Failed { code, message, is_client_error }
}
```

`execute` 里替换为单行调用 `classify_ollama_error(&e)`。


测试

新增的单元测试(test-first)

src/runtime/executor.rs::tests:

  • `test_classify_ollama_4xx_is_client_error`:输入 `HttpError{status:400}`,断言 `is_client_error == true`,`code == "ollama_http_400"`
  • `test_classify_ollama_5xx_is_not_client_error`:输入 `HttpError{status:503}`,断言 `is_client_error == false`

调整的测试

  • `test_execute_result_construction_failure` 加上 `is_client_error: false` 字段断言(协议变了,顺手验证默认值)
  • `tests/integration_error_scenarios.rs::test_ollama_connection_failure` 的错误信息断言扩展接受 `HttpError`(因为错误变体名变了)

跑测试

`cargo test --release`:所有现有测试 + 新增 2 个全绿。


向后兼容性

服务端 客户端 行为
客户端发了 `is_client_error` 字段,老服务端忽略未知字段,行为同现状(B1 未消)
B1 消失 —— 4xx 不再踢节点

可独立 merge 不破已部署服务端。


关联

  • 服务端配套 PR:keycompute/keycompute#13(协议加 `is_client_error`、handle_failed_submission 分流、admin recover endpoint)

Coordinated with keycompute's matching commit on feat/b1-b2-b6-fix
which introduces is_client_error on NodeTaskResult::Failed.

Before: every Ollama error (4xx, 5xx, network) was reported back to the
server as the same opaque ollama_error Failed result. The server then
counted each failure against the node's consecutive_failure_count, so a
single bad client request (e.g. embed-only model used as chat) could
exclude an otherwise healthy node after server-side requeues.

After:

* protocol/types.rs: mirror server's `#[serde(default)] is_client_error`
  on NodeTaskResult::Failed. Old servers ignoring the field still see
  the previous failure behaviour.

* client/ollama.rs: chat() now returns NodeTokenError::HttpError
  { status, message } instead of stringifying the status into Ollama(_).
  Preserves the HTTP status so the executor can classify the failure.

* runtime/executor.rs: new classify_ollama_error helper:
  - HTTP 4xx → is_client_error = true
    (server: terminal-fail immediately, no node failure_count++)
  - HTTP 5xx / network errors → is_client_error = false
    (server: existing requeue + failure_count behaviour)

Unit tests assert the 4xx vs 5xx classification.
@aiqubits
Copy link
Copy Markdown
Member

@bot review

@github-actions
Copy link
Copy Markdown

Review of PR #1: fix(executor): classify Ollama 4xx as client error to protect node from being excluded

Overall Assessment

This PR addresses a critical bug (B1) where client-induced 4xx errors from Ollama (e.g., using an embedding-only model for chat) were causing node exclusion after 3 failures. The changes appropriately preserve the HTTP status code, classify 4xx as is_client_error: true, and pass that information to the server. The implementation is clean, backward-compatible, and well-documented. No blocking issues were found.

Blocking Issues

None. The code is correct and addresses the bug as described.

Non-blocking Suggestions

  1. Function naming (src/runtime/executor.rs:307)
    classify_ollama_error is a misnomer because it handles all NodeTokenError variants, not just Ollama-specific ones. Since it is only called from the Ollama execution path, it is functionally correct, but renaming to classify_execution_error or task_result_from_error would improve readability and avoid confusion if the function is reused elsewhere.

  2. Missing test coverage for NodeTokenError::Network
    The unit tests cover HttpError (4xx/5xx) but omit the Network variant and the catch-all other arm. Even though these code paths are straightforward, adding a test for a Network error would strengthen regression coverage.

  3. Integration test assertion is overly broad (tests/integration_error_scenarios.rs)
    The assertion now accepts the string "HttpError" in the error output. This is acceptable because the error variant changed, but the test would also pass if the error message just accidentally contained that substring. A more precise check (e.g., using format!("{:?}", ...) and matching on the actual variant) would be better, but that would require access to the error type, which the integration test may not have. This is a minor point.

  4. Serialization of is_client_error
    The field is always serialized (no skip_serializing_if). This adds a few extra bytes per failure message to the server. While not a problem, you could add #[serde(skip_serializing_if = "is_false")] to reduce payload, but this is purely cosmetic.

  5. Documentation of the other arm in classify_ollama_error
    The fallback arm uses other.to_string(), which relies on NodeTokenError implementing Display. The existing code compiled, so this must be true. However, the other arm will also catch potential NodeTokenError::Json errors from JSON parsing (unlikely in this exact code path). It would be slightly more robust to explicitly match on NodeTokenError::Json and log a distinct code, but it's not necessary.

Suggested Tests

  • Unit test for NodeTokenError::Network in src/runtime/executor.rs:

    #[test]
    fn test_classify_network_error() {
        let err = NodeTokenError::Network("connection refused".to_string());
        match classify_ollama_error(&err) {
            NodeTaskResult::Failed { code, is_client_error, .. } => {
                assert_eq!(code, "ollama_network");
                assert!(!is_client_error);
            }
            _ => panic!("expected Failed"),
        }
    }
  • (Optional) Unit test for an unknown NodeTokenError variant (e.g., if a new variant is added later) – not strictly required, but would improve coverage.

Conclusion

The changes are well-structured, follow the stated design, and fix a real production bug. The code is maintainable, backward-compatible, and the new tests are appropriate. No blocking issues. The non-blocking suggestions are minor and can be addressed in a follow-up.

Ready to merge.


Trigger: @bot review
Note: This review is generated from the PR diff and may not reflect code outside the visible changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants