Multi turn rollout #193

Open
tastelikefeet wants to merge 54 commits into modelscope:main from tastelikefeet:feat/multi-turn-rollout

Conversation

@tastelikefeet (Collaborator) commented May 18, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Features

  1. Add the twinkle_agent package for multi-turn rollouts and tool calling
  2. Add a notifier that alerts the user when training fails
  3. Add GRPO entropy loss and reference-model log-prob (ref logps) loss
  4. Add GRPO-series metrics
  5. Add a patch for Qwen3-series models to fix a bug where special tokens caused a Jinja encoding error
  6. Support sampling with the base model when enable_lora is True
  7. Support using a LoRA model ID in sampling
  8. Support tool-call parsing and cleaning for Qwen models
  9. Support returning entropy from selective_log_softmax
  10. Support API/vLLM multi-turn rollouts and trace saving
  11. Support a tool manager
  12. [Experimental] Support a passage chunker, a condenser, and a tool
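Item 3 names an entropy loss and a reference log-prob loss without spelling out their form. The following is a minimal sketch of one common formulation, assuming a per-token k3 estimator of the KL divergence to the reference policy (the estimator used in the GRPO paper) plus an entropy bonus that is subtracted from the loss. The function names and the coefficients kl_coef and entropy_coef are hypothetical; the PR may weight or combine these terms differently.

```python
import math
from typing import Sequence


def k3_kl(logp: float, ref_logp: float) -> float:
    """Per-token k3 estimate of KL(pi || pi_ref); always >= 0."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


def regularized_loss(
    policy_losses: Sequence[float],
    logps: Sequence[float],
    ref_logps: Sequence[float],
    entropies: Sequence[float],
    mask: Sequence[int],
    kl_coef: float = 0.04,       # hypothetical default
    entropy_coef: float = 0.01,  # hypothetical default
) -> float:
    """Mean over unmasked tokens of: policy loss + KL penalty - entropy bonus."""
    total, count = 0.0, 0
    for pl, lp, ref, ent, m in zip(policy_losses, logps, ref_logps, entropies, mask):
        if not m:
            continue  # skip prompt / padding tokens
        total += pl + kl_coef * k3_kl(lp, ref) - entropy_coef * ent
        count += 1
    return total / max(count, 1)
```

Subtracting the entropy term rewards higher-entropy (more exploratory) token distributions, while the KL term pulls the policy back toward the reference model.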

Bug Fix

  1. Fix dataset multi-process preprocessing
  2. Fix a bug where multimodal (mm) model templates caused cache failures

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist (Contributor, Bot) left a comment

Code Review

This pull request introduces a comprehensive framework for multi-turn agentic rollouts with trajectory compression, including a new MultiTurnCondenseRollout class, chunking utilities, and a keyword-based condenser. It also adds support for per-token oracle bonuses, entropy bonuses, and improved logging diagnostics for GRPO training. Additionally, it patches the Qwen3 chat template to resolve parsing issues with orphan </think> tags and includes a dataset builder for condensed SFT training.
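The summary mentions chunking utilities and a keyword-based condenser but does not describe how they work. The following is a minimal sketch of the general idea, assuming fixed-size character windows with overlap and a condenser that keeps the chunks with the most query-keyword hits. chunk_passage and condense are hypothetical names, not the PR's API; the real implementation may chunk by tokens or sentences and score differently.

```python
import re
from typing import List


def chunk_passage(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping fixed-size character windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("require 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def condense(chunks: List[str], query: str, top_k: int = 2) -> List[str]:
    """Keep the top_k chunks with the most query-keyword hits, in original order."""
    keywords = set(re.findall(r"\w+", query.lower()))
    scored = []
    for i, chunk in enumerate(chunks):
        hits = sum(1 for tok in re.findall(r"\w+", chunk.lower()) if tok in keywords)
        scored.append((hits, i))
    # Rank by hit count (descending), breaking ties by earlier position.
    best = sorted(scored, key=lambda s: (-s[0], s[1]))[:top_k]
    return [chunks[i] for i in sorted(i for _, i in best)]
```

Restoring the original order after ranking keeps the condensed context readable as a passage rather than a shuffled list of snippets.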

@tastelikefeet changed the title from "[WIP]Multi turn rollout" to "Multi turn rollout" on May 18, 2026
Comment thread src/twinkle/loss/grpo.py
```diff
- clamped_ratios = torch.clamp(ratio, max=1 + self.epsilon).detach()

+ # Two-sided IS clamp with asymmetric epsilon, matching MiniMax CISPO spec.
+ clamped_ratios = torch.clamp(ratio, min=1 - self.epsilon, max=1 + self.epsilon_high).detach()
```
Collaborator


check it
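The two clamp lines in this thread differ only in whether the importance-sampling ratio also gets a lower bound. The following is a minimal plain-Python sketch for comparing them, assuming a CISPO-style per-token objective in which the clipped IS weight is treated as a constant and multiplies advantage times log-prob. cispo_is_weight, cispo_loss, and the default epsilons are hypothetical; passing eps_low=None reproduces the upper-only clamp.

```python
import math
from typing import Optional, Sequence


def cispo_is_weight(logp: float, old_logp: float,
                    eps_low: Optional[float], eps_high: float) -> float:
    """Clipped importance-sampling weight; eps_low=None means no lower bound."""
    ratio = math.exp(logp - old_logp)
    lower = -math.inf if eps_low is None else 1.0 - eps_low
    return min(max(ratio, lower), 1.0 + eps_high)


def cispo_loss(logps: Sequence[float], old_logps: Sequence[float],
               advantages: Sequence[float], eps_low: Optional[float] = 0.2,
               eps_high: float = 0.28) -> float:
    """Token-mean CISPO-style loss (defaults are hypothetical)."""
    total = 0.0
    for lp, old, adv in zip(logps, old_logps, advantages):
        # In an autograd framework this weight would be .detach()-ed,
        # so the gradient flows only through lp.
        w = cispo_is_weight(lp, old, eps_low, eps_high)
        total += -w * adv * lp
    return total / len(logps)
```

With eps_low=None, a token whose ratio has fallen well below 1 keeps its small weight; with a two-sided clamp, that weight is raised to 1 - eps_low.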

```python
per_token_logps = torch.stack(per_token_logps)
if return_entropy:
    return per_token_logps, torch.stack(per_token_entropy)
return per_token_logps
```
Collaborator


Suggest unifying the return type: when return_entropy=False, return (per_token_logps, None) instead of just per_token_logps.
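The suggested contract can be sketched in plain Python. This stand-in for selective_log_softmax is an assumption (the PR's version operates on torch tensors and stacks results with torch.stack); it only illustrates the unified return type, where the function always returns a (logps, entropy) pair and entropy is None when return_entropy=False.

```python
import math
from typing import List, Optional, Sequence, Tuple


def _log_softmax(row: Sequence[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def selective_log_softmax(
    logits: Sequence[Sequence[float]],
    index: Sequence[int],
    return_entropy: bool = False,
) -> Tuple[List[float], Optional[List[float]]]:
    """Log-prob of the selected token at each position, plus optional entropy."""
    per_token_logps: List[float] = []
    per_token_entropy: List[float] = []
    for row, idx in zip(logits, index):
        logp_row = _log_softmax(row)
        per_token_logps.append(logp_row[idx])
        if return_entropy:
            # H = -sum_v p(v) * log p(v) over the full vocabulary row.
            per_token_entropy.append(-sum(math.exp(lp) * lp for lp in logp_row))
    # Unified return type: always a 2-tuple, with None in place of entropy.
    return per_token_logps, (per_token_entropy if return_entropy else None)
```

Callers that do not need entropy can then unpack uniformly, for example `logps, _ = selective_log_softmax(logits, index)`, instead of branching on the shape of the return value.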

```python
self.trace_dir = trace_dir
self.trace_callback = trace_callback
self.success_callback = success_callback
os.makedirs(self.trace_dir, exist_ok=True)
```
Collaborator


Should this operation be moved into call and done only after an if self.trace_dir check?
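The reviewer's question can be illustrated with a minimal sketch that defers directory creation to call time and runs it only when trace_dir is set. If trace_dir may be None, calling os.makedirs(None) in the constructor raises a TypeError; the lazy version avoids that and has no filesystem side effects at construction. RolloutTracer, its __call__ signature, and the one-JSON-file-per-trace layout are hypothetical; the PR's class and trace format may differ.

```python
import json
import os
from typing import Any, Callable, Dict, Optional


class RolloutTracer:
    """Hypothetical stand-in for the rollout class that saves traces."""

    def __init__(
        self,
        trace_dir: Optional[str] = None,
        trace_callback: Optional[Callable[[Dict[str, Any]], None]] = None,
        success_callback: Optional[Callable[[Dict[str, Any]], None]] = None,
    ) -> None:
        self.trace_dir = trace_dir
        self.trace_callback = trace_callback
        self.success_callback = success_callback  # handling omitted in this sketch
        # No filesystem side effects at construction time.

    def __call__(self, trace_id: str, trace: Dict[str, Any]) -> Optional[str]:
        path = None
        if self.trace_dir:
            # Create the directory lazily, only when traces will actually be written.
            os.makedirs(self.trace_dir, exist_ok=True)
            path = os.path.join(self.trace_dir, f"{trace_id}.json")
            with open(path, "w", encoding="utf-8") as f:
                json.dump(trace, f, ensure_ascii=False)
        if self.trace_callback is not None:
            self.trace_callback(trace)
        return path
```

Because exist_ok=True makes repeated calls cheap and safe, creating the directory on every traced call is acceptable here.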


3 participants