Multi turn rollout #193
Code Review
This pull request introduces a comprehensive framework for multi-turn agentic rollouts with trajectory compression, including a new MultiTurnCondenseRollout class, chunking utilities, and a keyword-based condenser. It also adds support for per-token oracle bonuses, entropy bonuses, and improved logging diagnostics for GRPO training. Additionally, it patches the Qwen3 chat template to resolve parsing issues with orphan </think> tags and includes a dataset builder for condensed SFT training.
…ti-turn-rollout # Conflicts: # src/twinkle/template/__init__.py # src/twinkle/template/base.py # src/twinkle/utils/transformers_utils.py
clamped_ratios = torch.clamp(ratio, max=1 + self.epsilon).detach()

# Two-sided IS clamp with asymmetric epsilon, matching MiniMax CISPO spec.
clamped_ratios = torch.clamp(ratio, min=1 - self.epsilon, max=1 + self.epsilon_high).detach()
per_token_logps = torch.stack(per_token_logps)
if return_entropy:
    return per_token_logps, torch.stack(per_token_entropy)
return per_token_logps
Suggest unifying the return type: when return_entropy=False,
return (per_token_logps, None) instead of just per_token_logps.
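A minimal sketch of the unified return shape suggested here, using plain Python lists in place of tensors so it runs without torch. The function name `compute_logps` and its inputs are hypothetical; only `return_entropy`, `per_token_logps`, and `per_token_entropy` come from the diff.

```python
from typing import List, Optional, Tuple


def compute_logps(
    logps: List[float],
    entropies: List[float],
    return_entropy: bool = False,
) -> Tuple[List[float], Optional[List[float]]]:
    """Hypothetical stand-in for the reviewed function.

    Always returns a 2-tuple so callers can unpack unconditionally.
    """
    per_token_logps = list(logps)
    # The second slot is None when entropy was not requested.
    per_token_entropy = list(entropies) if return_entropy else None
    return per_token_logps, per_token_entropy


# Callers unpack the same way in both modes.
logps_only, no_entropy = compute_logps([-0.1, -0.2], [0.5, 0.6])
logps_both, entropy = compute_logps([-0.1, -0.2], [0.5, 0.6], return_entropy=True)
```

With a fixed tuple shape, call sites no longer need to branch on `return_entropy` before unpacking, at the cost of updating any existing callers that expect a bare value.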
self.trace_dir = trace_dir
self.trace_callback = trace_callback
self.success_callback = success_callback
os.makedirs(self.trace_dir, exist_ok=True)
Should this operation be moved into call and handled after the if self.trace_dir check?
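One way to read this suggestion, sketched under assumptions: defer directory creation until a trace is actually written, and skip it when `trace_dir` is unset. The class name `TraceWriter`, the `__call__` signature, and the file naming are hypothetical; only `trace_dir` and the `os.makedirs(..., exist_ok=True)` call come from the diff.

```python
import json
import os
import tempfile
from typing import Optional


class TraceWriter:
    """Hypothetical holder for the trace settings shown in the diff."""

    def __init__(self, trace_dir: Optional[str] = None):
        # No filesystem side effects at construction time.
        self.trace_dir = trace_dir

    def __call__(self, name: str, payload: dict) -> Optional[str]:
        if not self.trace_dir:
            return None  # tracing disabled, nothing to create
        # Create the directory only when a trace is about to be written.
        os.makedirs(self.trace_dir, exist_ok=True)
        path = os.path.join(self.trace_dir, f"{name}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(payload, f)
        return path


base = tempfile.mkdtemp()
target = os.path.join(base, "traces")
writer = TraceWriter(target)
created_at_init = os.path.isdir(target)  # directory not created yet
written = writer("step0", {"reward": 1.0})
disabled = TraceWriter(None)("step0", {})
```

If `trace_dir` may be `None`, an unconditional `os.makedirs(self.trace_dir, exist_ok=True)` in the constructor raises a `TypeError`, so guarding the call also covers that case.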
PR type
PR information
Features
Bug Fix
Experiment results
Paste your experiment results here (if needed).