normalized logprobs when using keep sampling mask #6966
DesmonDay wants to merge 1 commit into PaddlePaddle:release/2.4 from
Conversation
Thanks for your contribution!
Pull request overview
This PR renormalizes the returned token-level logprobs when keep_sampling_mask (top-k/top-p truncation of the candidate set) is enabled, so that they match the truncated sampling distribution, and threads each sample's logZ_K through SamplerOutput.
Changes:
- In `_compute_sampling_mask()`, additionally compute and return per-sample `logZ_K` (the log of the summed candidate-set probabilities), and add a new `logz_per_batch` field to `SamplerOutput` to carry it.
- In `post_process_normal()` / `post_process_specualate()`, renormalize `logprobs_tensors.logprobs` based on `logZ_K`.
- Change the return type of `_compute_sampling_mask()` and update its callers so it returns both the sparse mask and `logz_per_batch`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/worker/output.py | Add a `logz_per_batch` field to `SamplerOutput`, used to renormalize logprobs during post-processing. |
| fastdeploy/model_executor/pre_and_post_process.py | Renormalize logprobs by subtracting `logZ_K` in the normal/speculative post-process. |
| fastdeploy/model_executor/layers/sample/sampler.py | Compute and return `logZ_K` while building the sampling mask, and update the call chain. |
```python
# Z_K = sum(probs[i] * final_mask[i]) for each request i
# logZ_K = log(Z_K), with small constant to avoid log(0)
# ------------------------------------------------------------------
candidate_probs = paddle.where(
    final_mask,
    probs,
    paddle.zeros_like(probs)
```
final_mask is constructed in the sorted space of sorted_probs/renorm_sorted_probs (aligned with sorted_indices), but using it here to mask the original, unsorted probs misaligns the selected token probabilities, producing a wrong Z_K/logZ_K and a systematic bias in the subsequent logprobs renormalization. Consider computing Z_K in the sorted space (e.g., summing sorted_probs under final_mask), or scattering final_mask back to the original vocab order before aligning it with probs.
```diff
-# Z_K = sum(probs[i] * final_mask[i]) for each request i
+# Z_K = sum(sorted_probs[i] * final_mask[i]) for each request i
 # logZ_K = log(Z_K), with small constant to avoid log(0)
 # ------------------------------------------------------------------
 candidate_probs = paddle.where(
     final_mask,
-    probs,
-    paddle.zeros_like(probs)
+    sorted_probs,
+    paddle.zeros_like(sorted_probs)
```
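To see why the spaces must match, here is a minimal NumPy stand-in with hypothetical values: `final_mask` is built in sorted space, so applying it to the unsorted `probs` sums the wrong entries.

```python
import numpy as np

probs = np.array([0.1, 0.5, 0.2, 0.2])            # original vocab order
sorted_idx = np.argsort(-probs)                    # descending sort order
sorted_probs = probs[sorted_idx]                   # [0.5, 0.2, 0.2, 0.1]
final_mask = np.array([True, True, False, False])  # top-2 kept, in SORTED space

# Wrong: masking unsorted probs with a sorted-space mask
# picks probs[0] + probs[1] = 0.1 + 0.5 = 0.6.
z_wrong = probs[final_mask].sum()

# Right: mask in the sorted space the mask was built in;
# the true top-2 mass is 0.5 + 0.2 = 0.7.
z_right = sorted_probs[final_mask].sum()
```

The two sums differ whenever the kept tokens are not already the first ones in vocab order, which is exactly the situation top-k/top-p sorting creates.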
There is indeed a subtlety here:
- The sampled token: it is inside the candidate set K, so logZ_K is correct for it and its normalized value is correct ✓
- Top-k candidates: they are inside K, so logZ_K is correct for them and their normalized values are correct ✓
- Tokens outside the top-k set: even if logZ_K is subtracted from them, it does not matter, because those values are never returned.
Since keep sampling mask currently only needs to return a single logprob, the logprobs inside the candidate set are guaranteed to be normalized correctly, and the remaining values can be ignored.
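As a sanity check on the renormalization itself, a small NumPy sketch with hypothetical numbers: subtracting log Z_K from the full-distribution logprobs turns the candidate-set entries back into a proper probability distribution.

```python
import numpy as np

probs = np.array([0.5, 0.2, 0.2, 0.1])        # full softmax distribution
logprobs = np.log(probs)
mask = np.array([True, True, False, False])   # candidate set K (top-2)

# Z_K is the probability mass kept by the truncation: 0.5 + 0.2 = 0.7.
log_z_k = np.log(probs[mask].sum())

# log pi_masked = log pi_full - log Z_K
renormed = logprobs - log_z_k

# Inside K the renormalized probabilities sum to 1 again.
in_k_mass = np.exp(renormed[mask]).sum()
```

Entries outside K also become finite after the subtraction, which is harmless as long as they are never returned, and is exactly the concern raised for the multi-logprob case below.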
```python
logprobs = sampler_output.logprobs_tensors.logprobs
# logz_per_batch: [B], log(sum(probs in candidate set K)) for each request
logz = paddle.to_tensor(sampler_output.logz_per_batch, dtype=logprobs.dtype)
# Renormalize: log π_masked = log π_full - log Z_K
# Only normalize valid candidates; padding positions use -inf
valid_mask = paddle.isfinite(logprobs)
normalized_logprobs = paddle.where(
    valid_mask,
    logprobs - logz.unsqueeze(1),  # broadcast subtraction
    paddle.full_like(logprobs, float('-inf'))
```
logz = paddle.to_tensor(...) here does not specify a place. In the current implementation, the non-speculative path's gather_logprobs() moves logprobs to CPU (.cpu()), while the default device is usually GPU; this makes logprobs - logz hit a place mismatch (or incur the overhead of an implicit copy). Consider using place=logprobs.place (or explicitly logz = logz.astype(logprobs.dtype).to(logprobs.place), etc.) to keep it on the same device as logprobs.
```python
# Only normalize valid candidates; padding positions use -inf
valid_mask = paddle.isfinite(logprobs)
normalized_logprobs = paddle.where(
    valid_mask,
    logprobs - logz.unsqueeze(1),  # broadcast subtraction
    paddle.full_like(logprobs, float('-inf'))
```
Using paddle.isfinite(logprobs) as the "valid candidate" test is inaccurate: gather_logprobs() takes the top-k over the full vocab, so when a request's num_logprobs exceeds the size of the candidate set left after top_p/top_k truncation, the logprobs of tokens outside the candidate set also get logZ_K subtracted and end up finite; under the truncated sampling distribution those tokens should have probability 0 (logprob = -inf). Consider combining sampler_output.sampling_mask (the candidate token ids) to set positions outside the candidate set to -inf, or computing/truncating the top-logprobs against the candidate set directly at sampling time.
```diff
-# Only normalize valid candidates; padding positions use -inf
-valid_mask = paddle.isfinite(logprobs)
-normalized_logprobs = paddle.where(
-    valid_mask,
-    logprobs - logz.unsqueeze(1),  # broadcast subtraction
-    paddle.full_like(logprobs, float('-inf'))
+# Start from finite positions (exclude paddings / invalid values).
+valid_mask = paddle.isfinite(logprobs)
+# If sampling_mask is available, further restrict to candidate tokens.
+if getattr(sampler_output, "sampling_mask", None) is not None:
+    # logprob_token_ids: [B, max_num_logprobs + 1]
+    logprob_token_ids = sampler_output.logprobs_tensors.logprob_token_ids
+    # Initialize candidate mask as all False.
+    candidate_mask = paddle.zeros_like(logprobs, dtype="bool")
+    # sampling_mask is List[np.ndarray], one per batch/request.
+    for b_idx, mask_ids in enumerate(sampler_output.sampling_mask):
+        if mask_ids is None or len(mask_ids) == 0:
+            continue
+        # Convert candidate ids for this batch to tensor.
+        candidate_ids = paddle.to_tensor(mask_ids, dtype=logprob_token_ids.dtype)
+        # token_row: [1, N], candidate_ids.unsqueeze(1): [K, 1]
+        # After broadcasting, eq: [K, N]; any over K gives [N].
+        token_row = logprob_token_ids[b_idx : b_idx + 1, :]
+        eq = token_row == candidate_ids.unsqueeze(1)
+        row_mask = paddle.any(eq, axis=0)
+        candidate_mask[b_idx] = row_mask
+    # Only positions that are both finite and in the candidate set remain valid.
+    valid_mask = paddle.logical_and(valid_mask, candidate_mask)
+normalized_logprobs = paddle.where(
+    valid_mask,
+    logprobs - logz.unsqueeze(1),  # broadcast subtraction
+    paddle.full_like(logprobs, float("-inf")),
```
```python
# Renormalize logprobs to match truncated sampling distribution (when enabled).
if (
    sampler_output.logprobs_tensors is not None
    and sampler_output.logz_per_batch is not None
```
This renormalization currently runs as soon as logprobs_tensors is non-None, but in gpu_model_runner there is a case where enable_logprob=True and no request in the batch needs logprobs, in which an empty LogprobsTensors is created on CPU (paddle.empty_like, contents uninitialized). If keep_sampling_mask is also enabled, the subtraction here would operate on uninitialized data and produce random values/NaN in the output. Consider additionally checking sampling_metadata.max_num_logprobs is not None (or another explicit "this batch really returns logprobs" flag) before renormalizing.
```diff
-    and sampler_output.logz_per_batch is not None
+    and sampler_output.logz_per_batch is not None
+    # Only renormalize when this batch actually requests logprobs.
+    and getattr(getattr(sampler_output, "sampling_metadata", None), "max_num_logprobs", None)
+    is not None
```
```python
logprobs = sampler_output.logprobs_tensors.logprobs
logz = paddle.to_tensor(sampler_output.logz_per_batch, dtype=logprobs.dtype)
valid_mask = paddle.isfinite(logprobs)
normalized_logprobs = paddle.where(
    valid_mask,
    logprobs - logz.unsqueeze(1),
    paddle.full_like(logprobs, float('-inf'))
)
```
On the speculative path, sampler_output.logprobs_tensors.logprobs is most likely still on GPU (that branch's gather_logprobs() has no .cpu()), while the tensor created here by paddle.to_tensor(sampler_output.logz_per_batch) with the default place may land on CPU or the default device, easily causing a place mismatch or extra D2H/H2D copies. Consider explicitly creating/moving logz on logprobs.place.
```python
valid_mask = paddle.isfinite(logprobs)
normalized_logprobs = paddle.where(
    valid_mask,
    logprobs - logz.unsqueeze(1),
    paddle.full_like(logprobs, float('-inf'))
)
sampler_output.logprobs_tensors = LogprobsTensors(
```
Same as the normal branch: isfinite(logprobs) cannot distinguish tokens inside the candidate set from those outside it; when the returned top-logprobs include out-of-candidate tokens, they get normalized to finite values instead of -inf. If the goal is to match the truncated sampling distribution, consider using the corresponding tokens' sampling_mask to set out-of-candidate logprobs to -inf.
```python
sampling_mask: Optional[List[np.ndarray]] = None
# logZ_K for each request: log(sum(probs in candidate set K))
# Used for renormalizing logprobs to match the truncated sampling distribution.
# Shape: [num_reqs]
```
The comment on logz_per_batch says one entry per request (Shape: [num_reqs]), but on the speculative-decoding path _compute_sampling_mask() computes/returns one entry per accepted token (length total_accepted), and post-processing also broadcasts it against the first dimension of logprobs. Consider extending the comment to "non-spec: [num_reqs]; spec: [total_accepted_tokens]" to avoid misuse.
```diff
-# Shape: [num_reqs]
+# Shape:
+#   - Non-speculative decoding: [num_reqs]
+#   - Speculative decoding: [total_accepted_tokens], aligned with the first
+#     dimension of logprobs / sampling_mask and regrouped by request in
+#     post-processing.
```
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##           release/2.4    #6966   +/-   ##
==============================================
  Coverage             ?   56.34%
==============================================
  Files                ?      333
  Lines                ?    42524
  Branches             ?     6475
==============================================
  Hits                 ?    23959
  Misses               ?    16685
  Partials             ?     1880
```
Flags with carried forward coverage won't be shown.
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least one module tag: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.