Fix failed rollouts counted as completed in AgentModeDaemon by Mr-Neutr0n · Pull Request #480 · microsoft/agent-lightning

Mr-Neutr0n · 2026-02-09T21:52:04Z

Summary

Fixes #355

When vLLM returns HTTP 400 errors, rollouts end up with None reward and empty triplets. These were previously counted as "completed" in _async_run_until_finished, causing the completed count to exceed _total_tasks_queued (e.g. "Completed 33/32 tasks") and triggering assertion failures in get_train_data_batch.
Added an _is_failed_rollout() helper that detects rollouts with no reward and no triplets and treats them as backend failures rather than real results.
Failed rollouts are now skipped instead of stored, allowing tasks to be re-claimed and retried by the store's timeout mechanism.
For v1 mode, wait_for_rollouts now only polls for rollout IDs that have not yet successfully completed, so successful retries are properly picked up.
Removed the early "already processed" skip that could prevent a successful retry from overwriting a stale failed entry.

Test plan

Deploy with a vLLM backend that intermittently returns HTTP 400 errors and verify that failed rollouts are retried rather than counted as completed
Verify get_train_data_batch no longer hits the assert len(self._completed_rollouts_v0) == self._total_tasks_queued assertion
Confirm that "Completed N/M tasks" log messages never show N > M
Run existing training loop end-to-end to verify no regression in normal (no-error) flow
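
One way to check the "Completed N/M tasks" item is to scan a captured log for any line where N exceeds M. The sketch below assumes the log line contains exactly the "Completed N/M tasks" wording quoted in the summary. The find_overcounts helper is hypothetical and not part of the repository.

```python
import re
from typing import Iterable, List, Tuple

# Assumed progress-message format, taken from the example "Completed 33/32 tasks".
_PROGRESS = re.compile(r"Completed (\d+)/(\d+) tasks")


def find_overcounts(log_lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (completed, total) pairs where completed exceeds total."""
    overcounts = []
    for line in log_lines:
        match = _PROGRESS.search(line)
        if match:
            done, total = int(match.group(1)), int(match.group(2))
            if done > total:
                overcounts.append((done, total))
    return overcounts
```

An empty return value over a full training run would support the claim that the counter no longer overshoots.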

When vLLM returns HTTP 400 errors, the resulting rollouts have None reward and empty triplets. Previously these were still counted as completed, which caused the task count to exceed the total queued (e.g. "Completed 33/32 tasks") and triggered assertion failures in get_train_data_batch.

This change:
- Adds _is_failed_rollout() to detect rollouts with no reward and no triplets, indicating a backend error rather than a real result
- Skips failed rollouts instead of storing them, allowing the task to be re-claimed and retried
- For v1 mode, only polls for pending (not yet completed) rollout IDs so that successful retries are properly picked up
- Removes the early "already processed" skip that could prevent a successful retry from overwriting a stale entry

Fixes microsoft#355

Mr-Neutr0n · 2026-02-12T18:11:38Z

Friendly bump! Let me know if there's anything I should update or improve to help move this forward.

ultmaster · 2026-02-28T02:56:32Z

            else:
+                # Only wait for rollout IDs that have not yet successfully completed,
+                # so that re-runs of previously failed tasks are picked up.
+                pending_ids = [


I think we already retry failed rollouts. Why is this change needed?

Mr-Neutr0n · 2026-06-02T02:25:38Z

@ultmaster — good question. The bug is more subtle than just "should we count failed rollouts": without this change, the first failed rollout is added to _completed_rollouts_v0 and then the wait_for_rollouts call passes list(self._task_id_to_original_sample.keys()) (all IDs, including ones already in the completed set) on the next loop. The retry infrastructure in the store is fine — it re-enqueues the failed task — but the loop is racing with it: the next wait_for_rollouts returns the same failed rollout, the loop sees it as already-completed, and the "Completed X/Y" counter increments past Y.

Concretely: the symptom in the PR description is "Completed 33/32 tasks", which only happens when failed rollouts have been added to the completed set. So the change is doing two things at once:

1. pending_ids filters out already-completed IDs, so the loop doesn't re-consider them.
2. _is_failed_rollout skips counting a failed result, so the retry actually has room to deliver a successful result.

Happy to split into two PRs if that's clearer; the two changes do interact (without #1, #2 alone doesn't help; without #2, #1 just shifts the off-by-one to the next failure).
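
To illustrate how the two changes interact, here is a small self-contained simulation. Everything in it is hypothetical: FakeStore, its wait_for_rollouts signature, the tuple-shaped results, and run_until_finished are stand-ins for illustration and do not reflect the actual store API or daemon code.

```python
from typing import Dict, List, Optional, Tuple

# (rollout_id, reward, triplets): a simplified stand-in for a rollout result.
Result = Tuple[str, Optional[float], List[dict]]


class FakeStore:
    """Hypothetical store: each poll returns the next scripted result per ID."""

    def __init__(self, script: Dict[str, List[Result]]) -> None:
        self._script = {rid: list(results) for rid, results in script.items()}

    def wait_for_rollouts(self, rollout_ids: List[str]) -> List[Result]:
        out = []
        for rid in rollout_ids:
            remaining = self._script[rid]
            # Once only one result is left, keep returning it on every poll.
            out.append(remaining.pop(0) if len(remaining) > 1 else remaining[0])
        return out


def is_failed(result: Result) -> bool:
    """No reward and no triplets is treated as a backend failure."""
    _, reward, triplets = result
    return reward is None and not triplets


def run_until_finished(
    store: FakeStore, task_ids: List[str], max_polls: int = 10
) -> Dict[str, Result]:
    completed: Dict[str, Result] = {}
    for _ in range(max_polls):
        # Change 1: only poll IDs that have not completed successfully.
        pending_ids = [rid for rid in task_ids if rid not in completed]
        if not pending_ids:
            break
        for result in store.wait_for_rollouts(pending_ids):
            # Change 2: skip failed results so a later retry can fill the slot.
            if is_failed(result):
                continue
            completed[result[0]] = result
    return completed
```

In this sketch a task whose first attempt fails and whose retry succeeds ends up in completed exactly once, and a task that never succeeds is simply absent, so the completed count stays at or below the number of task IDs.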

Mr-Neutr0n wants to merge 1 commit into microsoft:main from Mr-Neutr0n:fix/rollout-count-failed-tasks