Merged
40 changes: 28 additions & 12 deletions docs/features/thinking_budget.md
@@ -2,7 +2,9 @@

## Overview

-`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>` segment. When the budget is reached, it forces a line break token and then the `</think>` token to terminate the thinking section.
+`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>`
+segment. When the budget is reached, it terminates thinking by forcing `</think>`. If
+`think_stop_sentence` is configured, it forces the custom sentence first and then `</think>`.
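
A minimal sketch of the enforcement idea described above (not FastDeploy's actual implementation; token ids and function names here are made up): once the count of tokens after `<think>` reaches the budget, every logit except the forced next token is masked out, so sampling must emit the stop sentence (if any) and then `</think>`.

```python
# Hypothetical illustration of budget enforcement; ids are invented.
THINK_END_ID = 2

def force_token(logits, token_id):
    # Mask every position except token_id so greedy/sampled decode picks it.
    forced = [float("-inf")] * len(logits)
    forced[token_id] = 0.0
    return forced

def apply_budget(logits, tokens_after_start, thinking_budget, stop_ids, stop_pos):
    # Returns (possibly masked logits, next position in the stop sentence).
    if tokens_after_start < thinking_budget:
        return logits, stop_pos            # within budget: pass through
    if stop_pos < len(stop_ids):
        return force_token(logits, stop_ids[stop_pos]), stop_pos + 1
    return force_token(logits, THINK_END_ID), stop_pos

logits = [0.1, 0.5, 0.2, 0.9]
under, _ = apply_budget(logits, 3, 8, stop_ids=[], stop_pos=0)   # untouched
over, _ = apply_budget(logits, 8, 8, stop_ids=[], stop_pos=0)    # forces </think>
```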

## When to Use

@@ -11,19 +13,22 @@

## How It Works

-1. **CPU precompute (DataProcessor)**: when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it already ended, and how many tokens are already inside the thinking section.
+1. **Request-side precompute (DataProcessor)**: when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it already ended, and how many tokens are already inside the thinking section.
 2. **Per-step update**: during decoding, the processor tracks `last_token_id` and `tokens_after_start`.
-3. **Budget enforcement**: once the budget is reached, it forces a line break and then the thinking end token.
+3. **Budget enforcement**: once the budget is reached, it forces `</think>` directly. If `think_stop_sentence`
+is configured, it forces that sentence first and then `</think>`.
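
The per-step bookkeeping in steps 1–2 can be re-implemented illustratively as follows (class and attribute names are assumptions for this sketch, not FastDeploy internals; token ids are invented):

```python
# Tracks whether thinking has started/ended and counts tokens in between.
THINK_START_ID, THINK_END_ID = 1, 2

class ThinkingState:
    def __init__(self):
        self.started = False
        self.ended = False
        self.tokens_after_start = 0
        self.last_token_id = None

    def update_state(self, token_id):
        if self.ended:
            pass                      # thinking already closed: nothing to count
        elif token_id == THINK_START_ID:
            self.started = True
        elif token_id == THINK_END_ID:
            self.ended = True
        elif self.started:
            self.tokens_after_start += 1
        self.last_token_id = token_id

state = ThinkingState()
for tok in [1, 7, 8, 9, 2, 5]:        # <think> t t t </think> answer-token
    state.update_state(tok)
```

After the loop, three tokens were counted inside the thinking section and the section is marked ended.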

## Requirements

-- The model must provide valid token ids for `think_start_id`, `think_end_id`, and `line_break_id` (via `ModelConfig`).
-- If any of these ids are invalid, the processor is disabled and `thinking_budget` will not take effect.
+- The model must provide valid token ids for `think_start_id` and `think_end_id` (via `ModelConfig`).
+- If either of these ids is invalid, the processor is disabled and `thinking_budget` will not take effect.
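
The disable rule can be sketched as a simple guard (the validity criterion here is an assumption; example ids are made up):

```python
# If either required id is missing or invalid, the processor turns itself
# off and `thinking_budget` is silently ignored.
def processor_enabled(think_start_id, think_end_id):
    def valid(tid):
        return isinstance(tid, int) and tid >= 0
    return valid(think_start_id) and valid(think_end_id)

ok = processor_enabled(100273, 100274)   # both ids present -> enabled
off = processor_enabled(100273, None)    # missing </think> id -> disabled
```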

## Request Parameters

-- `thinking_budget` (int, required to enable): maximum number of tokens after `<think>` before forced termination.
-- `think_stop_sentence` (string, optional): a stop sentence that will be tokenized on the CPU side and enforced near the budget boundary.
+- `thinking_budget` (int, required to enable): maximum number of decode-time tokens after `<think>` before forced
+termination.
+- `think_stop_sentence` (string, optional): a literal custom sentence that will be tokenized on the request side
+and enforced near the budget boundary.
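
A sketch of a request body carrying both parameters (the model name and prompt are placeholders; field placement inside `logits_processors_args` follows the guidance later in this document):

```python
import json

request_body = {
    "model": "my-ernie-model",                       # placeholder
    "messages": [{"role": "user", "content": "Solve 12 * 7 step by step."}],
    "logits_processors_args": {
        "thinking_budget": 256,                      # cap on tokens after <think>
        "think_stop_sentence": "Okay, time to answer.",  # optional literal sentence
    },
}
payload = json.dumps(request_body)  # body as sent over HTTP
```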

## Operator-Level vs LogitsProcessor

@@ -41,16 +46,25 @@ FastDeploy has two ways to limit thinking length:
In short:

- If you only need a hard cap on thinking length, prefer `reasoning_max_tokens`.
-- If you need custom behavior (for example, injecting custom sentence tokens), use `ThinkingBudgetLogitsProcessor`.
+- If you need custom behavior (for example, inserting a custom sentence before `</think>`), use
+`ThinkingBudgetLogitsProcessor`.

## Practical guidance

`reasoning_max_tokens` and `thinking_budget` are not mutually exclusive in the current implementation.
If both are configured for the same request, both constraints can take effect, and whichever triggers first will end the thinking phase.

-- To use **operator-level-only** behavior: this is request-level config only. Set `enable_thinking=true` and `reasoning_max_tokens` in request, and do not set `thinking_budget`.
-- To use **logits-processor-only** behavior (especially with `think_stop_sentence`): this requires service-level + request-level config. Start service with `--logits-processors ThinkingBudgetLogitsProcessor`, and set `thinking_budget` (and optional `think_stop_sentence`) in `logits_processors_args`; leave `reasoning_max_tokens` unset.
-- Avoid enabling both for strict custom sentence insertion requirements, because operator-level termination may cut the custom sentence path earlier.
+- To use **operator-level-only** behavior: this is request-level config only. Set
+`enable_thinking=true` and `reasoning_max_tokens` in the request, and do not set `thinking_budget`.
+- To use **logits-processor-only** behavior (especially with `think_stop_sentence`): this requires
+service-level and request-level config. Start the service with `--logits-processors ThinkingBudgetLogitsProcessor`,
+and set `thinking_budget` (and the optional `think_stop_sentence`) in `logits_processors_args`; leave
+`reasoning_max_tokens` unset.
+- `thinking_budget` itself does not require `enable_thinking=true`.
+- If an ERNIE chat template already appends `<think>` to the prompt, `thinking_budget` should still take effect; it
+does not require the model to emit another `<think>` during decoding.
+- Avoid enabling both when the custom sentence must be inserted in full, because operator-level
+termination may end thinking before the custom sentence completes.
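
The two configurations above can be sketched side by side as request dicts (all values are illustrative, and the placement of `enable_thinking` under `chat_template_kwargs` is an assumption; check your deployment's chat-template options). The second variant additionally requires launching the service with `--logits-processors ThinkingBudgetLogitsProcessor`.

```python
# Operator-level-only: request-level config, no logits processor involved.
operator_level_request = {
    "messages": [{"role": "user", "content": "..."}],
    "chat_template_kwargs": {"enable_thinking": True},   # placement assumed
    "reasoning_max_tokens": 512,
    # no thinking_budget: the logits processor stays out of the picture
}

# Logits-processor-only: budget and stop sentence ride in logits_processors_args.
logits_processor_request = {
    "messages": [{"role": "user", "content": "..."}],
    "logits_processors_args": {
        "thinking_budget": 512,
        "think_stop_sentence": "Wrapping up my reasoning now.",
    },
    # reasoning_max_tokens deliberately unset
}
```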

## Online Usage

Expand Down Expand Up @@ -120,4 +134,6 @@ print(outputs[0].outputs.text)

## Performance Note

-This processor runs `update_state` and `apply` on every decode step. If you only need a hard thinking-length cap and care most about throughput, consider the operator-level reasoning-length controls instead of per-step logits processing.
+This processor runs `update_state` and `apply` on every decode step. If you only need a hard
+thinking-length cap and care most about throughput, consider the operator-level reasoning-length
+controls instead of per-step logits processing.
31 changes: 20 additions & 11 deletions docs/zh/features/thinking_budget.md
@@ -2,7 +2,9 @@

## Overview

-`ThinkingBudgetLogitsProcessor` limits the generation length of the `<think> ... </think>` span. When the budget threshold is reached, it forces a line-break token and then `</think>` to end the thinking section.
+`ThinkingBudgetLogitsProcessor` limits the generation length of the `<think> ... </think>` span. When the budget
+threshold is reached, it forces `</think>` directly to end the thinking section; if `think_stop_sentence` is
+configured, it first forces that custom sentence, then `</think>`.

## When to Use

@@ -11,19 +13,20 @@

## How It Works

-1. **CPU-side precompute (DataProcessor)**: when the request includes `thinking_budget`, the prompt token ids are used to compute whether the thinking section has started, whether it has ended, and the existing thinking length.
+1. **Request-side precompute (DataProcessor)**: when the request includes `thinking_budget`, the prompt token ids are used to compute whether the thinking section has started, whether it has ended, and the existing thinking length.
 2. **Per-step update**: `last_token_id` and `tokens_after_start` are tracked during decoding.
-3. **Budget enforcement**: once the budget is reached, a line break and then the thinking-end token are forced in turn.
+3. **Budget enforcement**: once the budget is reached, `</think>` is forced directly by default; if `think_stop_sentence`
+is configured, that sentence is forced token by token first, then `</think>`.

## Requirements

-- The model must provide valid `think_start_id`, `think_end_id`, and `line_break_id` (from `ModelConfig`).
-- If any of these ids is invalid, the processor is disabled and `thinking_budget` does not take effect.
+- The model must provide valid `think_start_id` and `think_end_id` (from `ModelConfig`).
+- If either of these ids is invalid, the processor is disabled and `thinking_budget` does not take effect.

## Request Parameters

-- `thinking_budget` (int, required to enable): the maximum number of tokens allowed after `<think>`.
-- `think_stop_sentence` (string, optional): the string is encoded into token ids on the CPU side and forced near the budget boundary.
+- `thinking_budget` (int, required to enable): the maximum number of decode-stage tokens allowed after `<think>`.
+- `think_stop_sentence` (string, optional): a custom stop sentence encoded literally as token ids and forced near the budget boundary.

## Operator-Level Limits vs LogitsProcessor

@@ -35,21 +38,27 @@ FastDeploy currently has two ways to control thinking length:
  - Suitable for simple scenarios that only need to limit the thinking length.
- **`ThinkingBudgetLogitsProcessor`** (`logits_processors_args.thinking_budget`):
  - Implemented as per-step Python-side logits processing.
-  - Supports more flexible behavior, such as `think_stop_sentence` (inserting custom wording before the end).
+  - Supports more flexible behavior, such as `think_stop_sentence`.
  - Usually has higher overhead under high concurrency than the operator-level limit.

Choose according to the following principles:

- If you only need to limit the thinking length: prefer `reasoning_max_tokens`.
-- If you need more flexible control (such as inserting custom wording): use `ThinkingBudgetLogitsProcessor`.
+- If you need more flexible control (such as inserting a custom sentence before `</think>`): use `ThinkingBudgetLogitsProcessor`.

## Practical Guidance

In the current implementation, `reasoning_max_tokens` and `thinking_budget` are not mutually exclusive.
If both are configured for the same request, both constraints can take effect, and whichever triggers first ends the thinking section.

- **Operator-level limit only**: this is request-level configuration. Only set `enable_thinking=true` + `reasoning_max_tokens` in the request, and do not pass `thinking_budget`.
- **LogitsProcessor only** (especially when using `think_stop_sentence`): this is two-level "service startup + request parameter" configuration. The service must be started with `--logits-processors ThinkingBudgetLogitsProcessor`, and the request must pass `thinking_budget` (and optionally `think_stop_sentence`) via `logits_processors_args`; also do not set `reasoning_max_tokens`.
+- `thinking_budget` itself does not depend on `enable_thinking=true`.
+- If the ERNIE chat template already appends `<think>` to the prompt, `thinking_budget` should still take effect; the model is not required to emit `<think>` again during decoding.
- If the custom sentence must always be inserted in full, do not enable this together with the operator-level limit, otherwise the operator-level limit may truncate it early.

## Online Usage
9 changes: 9 additions & 0 deletions fastdeploy/input/ernie4_5_processor.py
@@ -107,6 +107,11 @@ def process_request_dict(self, request, max_model_len=None):
         bad_words_token_ids = self.update_bad_words(bad_words, bad_words_token_ids)
         request["bad_words_token_ids"] = bad_words_token_ids

+        logits_processors_args = self._prepare_think_stop_sentence(
+            request.get("logits_processors_args") or {}, max_model_len
+        )
+        request["logits_processors_args"] = logits_processors_args
+
         # processing prompt_token_ids
         if not request.get("prompt_token_ids"):
             if request.get("prompt"):
@@ -143,6 +148,10 @@ def process_request_dict(self, request, max_model_len=None):
         # truncate prompts that exceed the length limit
         if max_model_len is not None and len(request["prompt_token_ids"]) > max_model_len:
             request["prompt_token_ids"] = request["prompt_token_ids"][: max_model_len - 1]
+        logits_processors_args = self._update_thinking_prompt_state(
+            request["prompt_token_ids"], request.get("logits_processors_args") or {}
+        )
+        request["logits_processors_args"] = logits_processors_args
         max_tokens = max_model_len - len(request["prompt_token_ids"])
         if request.get("max_tokens") is None:
             request["max_tokens"] = max(1, max_tokens)
@@ -210,6 +210,11 @@ def process_request_dict(self, request, max_model_len=None):
         bad_words_token_ids = self.update_bad_words(bad_words, bad_words_token_ids)
         request["bad_words_token_ids"] = bad_words_token_ids

+        logits_processors_args = self._prepare_think_stop_sentence(
+            request.get("logits_processors_args") or {}, max_model_len
+        )
+        request["logits_processors_args"] = logits_processors_args
+
         if request.get("prompt_token_ids"):
             messages = request.get("messages")
             if messages:
@@ -257,6 +262,10 @@ def process_request_dict(self, request, max_model_len=None):
         # truncate prompts that exceed the length limit
         if max_model_len is not None and len(request["prompt_token_ids"]) > max_model_len:
             request["prompt_token_ids"] = request["prompt_token_ids"][: max_model_len - 1]
+        logits_processors_args = self._update_thinking_prompt_state(
+            request["prompt_token_ids"], request.get("logits_processors_args") or {}
+        )
+        request["logits_processors_args"] = logits_processors_args

         max_tokens = max_model_len - len(request["prompt_token_ids"])
         if request.get("max_tokens") is None: