Merged
40 changes: 28 additions & 12 deletions docs/features/thinking_budget.md
@@ -2,7 +2,9 @@

## Overview

-`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>` segment. When the budget is reached, it forces a line break token and then the `</think>` token to terminate the thinking section.
+`ThinkingBudgetLogitsProcessor` limits the number of tokens generated inside the `<think> ... </think>`
+segment. When the budget is reached, it terminates thinking by forcing `</think>`. If
+`think_stop_sentence` is configured, it forces the custom sentence first and then `</think>`.
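
A minimal sketch of the enforcement idea described above (not FastDeploy's actual implementation; token ids and function names here are made up): once the count of tokens after `<think>` reaches the budget, every logit except the forced next token is masked out, so sampling must emit the stop sentence (if any) and then `</think>`.

```python
# Hypothetical illustration of budget enforcement; ids are invented.
THINK_END_ID = 2

def force_token(logits, token_id):
    # Mask every position except token_id so greedy/sampled decode picks it.
    forced = [float("-inf")] * len(logits)
    forced[token_id] = 0.0
    return forced

def apply_budget(logits, tokens_after_start, thinking_budget, stop_ids, stop_pos):
    # Returns (possibly masked logits, next position in the stop sentence).
    if tokens_after_start < thinking_budget:
        return logits, stop_pos            # within budget: pass through
    if stop_pos < len(stop_ids):
        return force_token(logits, stop_ids[stop_pos]), stop_pos + 1
    return force_token(logits, THINK_END_ID), stop_pos

logits = [0.1, 0.5, 0.2, 0.9]
under, _ = apply_budget(logits, 3, 8, stop_ids=[], stop_pos=0)   # untouched
over, _ = apply_budget(logits, 8, 8, stop_ids=[], stop_pos=0)    # forces </think>
```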

## When to Use

@@ -11,19 +13,22 @@

## How It Works

-1. **CPU precompute (DataProcessor)**: when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it already ended, and how many tokens are already inside the thinking section.
+1. **Request-side precompute (DataProcessor)**: when a request includes `thinking_budget`, the prompt token ids are scanned to determine whether thinking has started, whether it already ended, and how many tokens are already inside the thinking section.
 2. **Per-step update**: during decoding, the processor tracks `last_token_id` and `tokens_after_start`.
-3. **Budget enforcement**: once the budget is reached, it forces a line break and then the thinking end token.
+3. **Budget enforcement**: once the budget is reached, it forces `</think>` directly. If `think_stop_sentence`
+is configured, it forces that sentence first and then `</think>`.
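
The per-step bookkeeping in steps 1–2 can be re-implemented illustratively as follows (class and attribute names are assumptions for this sketch, not FastDeploy internals; token ids are invented):

```python
# Tracks whether thinking has started/ended and counts tokens in between.
THINK_START_ID, THINK_END_ID = 1, 2

class ThinkingState:
    def __init__(self):
        self.started = False
        self.ended = False
        self.tokens_after_start = 0
        self.last_token_id = None

    def update_state(self, token_id):
        if self.ended:
            pass                      # thinking already closed: nothing to count
        elif token_id == THINK_START_ID:
            self.started = True
        elif token_id == THINK_END_ID:
            self.ended = True
        elif self.started:
            self.tokens_after_start += 1
        self.last_token_id = token_id

state = ThinkingState()
for tok in [1, 7, 8, 9, 2, 5]:        # <think> t t t </think> answer-token
    state.update_state(tok)
```

After the loop, three tokens were counted inside the thinking section and the section is marked ended.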

## Requirements

-- The model must provide valid token ids for `think_start_id`, `think_end_id`, and `line_break_id` (via `ModelConfig`).
-- If any of these ids are invalid, the processor is disabled and `thinking_budget` will not take effect.
+- The model must provide valid token ids for `think_start_id` and `think_end_id` (via `ModelConfig`).
+- If either of these ids is invalid, the processor is disabled and `thinking_budget` will not take effect.
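
The disable rule can be sketched as a simple guard (the validity criterion here is an assumption; example ids are made up):

```python
# If either required id is missing or invalid, the processor turns itself
# off and `thinking_budget` is silently ignored.
def processor_enabled(think_start_id, think_end_id):
    def valid(tid):
        return isinstance(tid, int) and tid >= 0
    return valid(think_start_id) and valid(think_end_id)

ok = processor_enabled(100273, 100274)   # both ids present -> enabled
off = processor_enabled(100273, None)    # missing </think> id -> disabled
```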

## Request Parameters

-- `thinking_budget` (int, required to enable): maximum number of tokens after `<think>` before forced termination.
-- `think_stop_sentence` (string, optional): a stop sentence that will be tokenized on the CPU side and enforced near the budget boundary.
+- `thinking_budget` (int, required to enable): maximum number of decode-time tokens after `<think>` before forced
+termination.
+- `think_stop_sentence` (string, optional): a literal custom sentence that will be tokenized on the request side
+and enforced near the budget boundary.
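
A sketch of a request body carrying both parameters (the model name and prompt are placeholders; field placement inside `logits_processors_args` follows the guidance later in this document):

```python
import json

request_body = {
    "model": "my-ernie-model",                       # placeholder
    "messages": [{"role": "user", "content": "Solve 12 * 7 step by step."}],
    "logits_processors_args": {
        "thinking_budget": 256,                      # cap on tokens after <think>
        "think_stop_sentence": "Okay, time to answer.",  # optional literal sentence
    },
}
payload = json.dumps(request_body)  # body as sent over HTTP
```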

## Operator-Level vs LogitsProcessor

@@ -41,16 +46,25 @@ FastDeploy has two ways to limit thinking length:
In short:

- If you only need a hard cap on thinking length, prefer `reasoning_max_tokens`.
-- If you need custom behavior (for example, injecting custom sentence tokens), use `ThinkingBudgetLogitsProcessor`.
+- If you need custom behavior (for example, inserting a custom sentence before `</think>`), use
+`ThinkingBudgetLogitsProcessor`.

## Practical guidance

`reasoning_max_tokens` and `thinking_budget` are not mutually exclusive in the current implementation.
If both are configured for the same request, both constraints can take effect, and whichever triggers first will end the thinking phase.

-- To use **operator-level-only** behavior: this is request-level config only. Set `enable_thinking=true` and `reasoning_max_tokens` in request, and do not set `thinking_budget`.
-- To use **logits-processor-only** behavior (especially with `think_stop_sentence`): this requires service-level + request-level config. Start service with `--logits-processors ThinkingBudgetLogitsProcessor`, and set `thinking_budget` (and optional `think_stop_sentence`) in `logits_processors_args`; leave `reasoning_max_tokens` unset.
-- Avoid enabling both for strict custom sentence insertion requirements, because operator-level termination may cut the custom sentence path earlier.
+- To use **operator-level-only** behavior: this is request-level config only. Set
+`enable_thinking=true` and `reasoning_max_tokens` in the request, and do not set `thinking_budget`.
+- To use **logits-processor-only** behavior (especially with `think_stop_sentence`): this requires
+service-level and request-level config. Start the service with `--logits-processors ThinkingBudgetLogitsProcessor`,
+and set `thinking_budget` (and the optional `think_stop_sentence`) in `logits_processors_args`; leave
+`reasoning_max_tokens` unset.
+- `thinking_budget` itself does not require `enable_thinking=true`.
+- If an ERNIE chat template already appends `<think>` to the prompt, `thinking_budget` should still take effect; it
+does not require the model to emit another `<think>` during decoding.
+- Avoid enabling both when the custom sentence must be inserted in full, because operator-level
+termination may end thinking before the custom sentence completes.
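
The two configurations above can be sketched side by side as request dicts (all values are illustrative, and the placement of `enable_thinking` under `chat_template_kwargs` is an assumption; check your deployment's chat-template options). The second variant additionally requires launching the service with `--logits-processors ThinkingBudgetLogitsProcessor`.

```python
# Operator-level-only: request-level config, no logits processor involved.
operator_level_request = {
    "messages": [{"role": "user", "content": "..."}],
    "chat_template_kwargs": {"enable_thinking": True},   # placement assumed
    "reasoning_max_tokens": 512,
    # no thinking_budget: the logits processor stays out of the picture
}

# Logits-processor-only: budget and stop sentence ride in logits_processors_args.
logits_processor_request = {
    "messages": [{"role": "user", "content": "..."}],
    "logits_processors_args": {
        "thinking_budget": 512,
        "think_stop_sentence": "Wrapping up my reasoning now.",
    },
    # reasoning_max_tokens deliberately unset
}
```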

## Online Usage

Expand Down Expand Up @@ -120,4 +134,6 @@ print(outputs[0].outputs.text)

## Performance Note

-This processor runs `update_state` and `apply` on every decode step. If you only need a hard thinking-length cap and care most about throughput, consider the operator-level reasoning-length controls instead of per-step logits processing.
+This processor runs `update_state` and `apply` on every decode step. If you only need a hard
+thinking-length cap and care most about throughput, consider the operator-level reasoning-length
+controls instead of per-step logits processing.
31 changes: 20 additions & 11 deletions docs/zh/features/thinking_budget.md
@@ -2,7 +2,9 @@

## Overview

-`ThinkingBudgetLogitsProcessor` limits the generation length of the `<think> ... </think>` span. When the budget threshold is reached, it forces a line-break token and then `</think>` to end the thinking section.
+`ThinkingBudgetLogitsProcessor` limits the generation length of the `<think> ... </think>` span. When the budget
+threshold is reached, it forces `</think>` directly to end the thinking section; if `think_stop_sentence` is
+configured, it first forces that custom sentence, then `</think>`.

## When to Use

@@ -11,19 +13,20 @@

## How It Works

-1. **CPU-side precompute (DataProcessor)**: when the request includes `thinking_budget`, the prompt token ids are used to compute whether the thinking section has started, whether it has ended, and the existing thinking length.
+1. **Request-side precompute (DataProcessor)**: when the request includes `thinking_budget`, the prompt token ids are used to compute whether the thinking section has started, whether it has ended, and the existing thinking length.
 2. **Per-step update**: `last_token_id` and `tokens_after_start` are tracked during decoding.
-3. **Budget enforcement**: once the budget is reached, a line break and then the thinking-end token are forced in turn.
+3. **Budget enforcement**: once the budget is reached, `</think>` is forced directly by default; if `think_stop_sentence`
+is configured, that sentence is forced token by token first, then `</think>`.

## Requirements

-- The model must provide valid `think_start_id`, `think_end_id`, and `line_break_id` (from `ModelConfig`).
-- If any of these ids is invalid, the processor is disabled and `thinking_budget` does not take effect.
+- The model must provide valid `think_start_id` and `think_end_id` (from `ModelConfig`).
+- If either of these ids is invalid, the processor is disabled and `thinking_budget` does not take effect.

## Request Parameters

-- `thinking_budget` (int, required to enable): the maximum number of tokens allowed after `<think>`.
-- `think_stop_sentence` (string, optional): the string is encoded into token ids on the CPU side and forced near the budget boundary.
+- `thinking_budget` (int, required to enable): the maximum number of decode-stage tokens allowed after `<think>`.
+- `think_stop_sentence` (string, optional): a custom stop sentence encoded literally as token ids and forced near the budget boundary.

## Operator-Level Limits vs LogitsProcessor

@@ -35,21 +38,27 @@ FastDeploy currently has two ways to control thinking length:
  - Suitable for simple scenarios that only need to limit the thinking length.
- **`ThinkingBudgetLogitsProcessor`** (`logits_processors_args.thinking_budget`):
  - Implemented as per-step Python-side logits processing.
-  - Supports more flexible behavior, such as `think_stop_sentence` (inserting custom wording before the end).
+  - Supports more flexible behavior, such as `think_stop_sentence`.
  - Usually has higher overhead under high concurrency than the operator-level limit.

Choose according to the following principles:

- If you only need to limit the thinking length: prefer `reasoning_max_tokens`.
-- If you need more flexible control (such as inserting custom wording): use `ThinkingBudgetLogitsProcessor`.
+- If you need more flexible control (such as inserting a custom sentence before `</think>`): use `ThinkingBudgetLogitsProcessor`.

## Practical Guidance

In the current implementation, `reasoning_max_tokens` and `thinking_budget` are not mutually exclusive.
If both are configured for the same request, both constraints can take effect, and whichever triggers first ends the thinking section.

- **Operator-level limit only**: this is request-level configuration. Only set `enable_thinking=true` + `reasoning_max_tokens` in the request, and do not pass `thinking_budget`.
- **LogitsProcessor only** (especially when using `think_stop_sentence`): this is two-level "service startup + request parameter" configuration. The service must be started with `--logits-processors ThinkingBudgetLogitsProcessor`, and the request must pass `thinking_budget` (and optionally `think_stop_sentence`) via `logits_processors_args`; also do not set `reasoning_max_tokens`.
+- `thinking_budget` itself does not depend on `enable_thinking=true`.
+- If the ERNIE chat template already appends `<think>` to the prompt, `thinking_budget` should still take effect; the model is not required to emit `<think>` again during decoding.
- If the custom sentence must always be inserted in full, do not enable this together with the operator-level limit, otherwise the operator-level limit may truncate it early.

## Online Usage
9 changes: 9 additions & 0 deletions fastdeploy/input/ernie4_5_processor.py
@@ -107,6 +107,11 @@ def process_request_dict(self, request, max_model_len=None):
         bad_words_token_ids = self.update_bad_words(bad_words, bad_words_token_ids)
         request["bad_words_token_ids"] = bad_words_token_ids

+        logits_processors_args = self._prepare_think_stop_sentence(
+            request.get("logits_processors_args") or {}, max_model_len
+        )
+        request["logits_processors_args"] = logits_processors_args
+
         # processing prompt_token_ids
         if not request.get("prompt_token_ids"):
             if request.get("prompt"):
@@ -143,6 +148,10 @@ def process_request_dict(self, request, max_model_len=None):
         # truncate prompts that exceed the length limit
         if max_model_len is not None and len(request["prompt_token_ids"]) > max_model_len:
             request["prompt_token_ids"] = request["prompt_token_ids"][: max_model_len - 1]
+        logits_processors_args = self._update_thinking_prompt_state(
+            request["prompt_token_ids"], request.get("logits_processors_args") or {}
+        )
+        request["logits_processors_args"] = logits_processors_args
         max_tokens = max_model_len - len(request["prompt_token_ids"])
         if request.get("max_tokens") is None:
             request["max_tokens"] = max(1, max_tokens)
@@ -210,6 +210,11 @@ def process_request_dict(self, request, max_model_len=None):
         bad_words_token_ids = self.update_bad_words(bad_words, bad_words_token_ids)
         request["bad_words_token_ids"] = bad_words_token_ids

+        logits_processors_args = self._prepare_think_stop_sentence(
+            request.get("logits_processors_args") or {}, max_model_len
+        )
+        request["logits_processors_args"] = logits_processors_args
+
         if request.get("prompt_token_ids"):
             messages = request.get("messages")
             if messages:
@@ -257,6 +262,10 @@ def process_request_dict(self, request, max_model_len=None):
         # truncate prompts that exceed the length limit
         if max_model_len is not None and len(request["prompt_token_ids"]) > max_model_len:
             request["prompt_token_ids"] = request["prompt_token_ids"][: max_model_len - 1]
+        logits_processors_args = self._update_thinking_prompt_state(
+            request["prompt_token_ids"], request.get("logits_processors_args") or {}
+        )
+        request["logits_processors_args"] = logits_processors_args

         max_tokens = max_model_len - len(request["prompt_token_ids"])
         if request.get("max_tokens") is None: