
fix(redo): wait for queue manager before recovery, prevent silent task loss #1237

Open
yc111233 wants to merge 1 commit into volcengine:main from yc111233:fix/redo-recovery-non-blocking


yc111233 (Contributor) commented Apr 5, 2026

Summary

  • Redo recovery runs immediately at start(), but the queue manager may not be initialized yet. _enqueue_semantic silently returns when queue_manager is None, and mark_done then removes the redo marker, so the task is permanently lost and its memories are never extracted.
  • The VLM timeout during recovery is 60s, so each redo task wastes a full minute when the VLM is down, and several pending tasks delay service readiness by minutes.

Root cause

Three issues compound:

  1. _recover_pending_redo starts before queue manager is ready
  2. _enqueue_semantic silently drops work when queue_manager is None
  3. mark_done runs after the silent drop → redo marker deleted → task lost forever

Log evidence:

21:44:24 - Recovering pending redo task: 36bfde37-...
21:44:45 - Redo: memory extraction failed ('NoneType' object has no attribute 'enqueue_embedding_msg'), falling back to queue
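
To make the interaction of the three issues concrete, here is a toy model of the failure mode. It is a sketch only: the `RedoLog` and `Service` classes are hypothetical stand-ins, not the project's code, and only the method names `_enqueue_semantic`, `_recover_pending_redo`, and `mark_done` are borrowed from the description above.

```python
# Toy model of the failure mode. Every class here is a hypothetical
# stand-in and is not taken from the project's code.


class RedoLog:
    """Stands in for the persisted redo markers."""

    def __init__(self, task_ids):
        self.pending = set(task_ids)

    def mark_done(self, task_id):
        self.pending.discard(task_id)


class Service:
    def __init__(self, redo_log):
        self.redo_log = redo_log
        self.queue_manager = None  # not initialized yet when start() runs
        self.enqueued = []

    def _enqueue_semantic(self, task_id):
        if self.queue_manager is None:
            return  # issue 2: silent drop, the caller cannot tell
        self.enqueued.append(task_id)

    def _recover_pending_redo(self):
        for task_id in list(self.redo_log.pending):
            self._enqueue_semantic(task_id)   # issue 1: runs too early
            self.redo_log.mark_done(task_id)  # issue 3: marker removed anyway


redo_log = RedoLog(["task-a"])
service = Service(redo_log)
service._recover_pending_redo()

# Nothing was enqueued, yet no marker remains to retry from.
lost = not service.enqueued and not redo_log.pending
print("task lost:", lost)  # prints "task lost: True"
```

Removing any one of the three issues is enough to keep the marker in this model, which is why the fix addresses both the ordering and the silent return.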

Fix

  • Wait up to 30s for queue manager to become available before starting redo recovery
  • Reduce VLM timeout from 60s → 15s during recovery (best-effort, not critical path)
  • Make _enqueue_semantic raise RuntimeError instead of silently returning when queue_manager is None
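
The first and third bullets might look roughly like the following. This is a minimal sketch under stated assumptions, not the PR's diff: `wait_for_queue_manager`, `enqueue_semantic`, `recover_pending_redo`, the `get_queue_manager` callable, and the polling interval are all hypothetical, and a plain list stands in for the real queue manager.

```python
import asyncio
import time


async def wait_for_queue_manager(get_queue_manager, timeout=30.0, poll_interval=0.5):
    """Poll until get_queue_manager() returns a non-None value or timeout elapses.

    Returns the queue manager, or None if it never became available.
    """
    deadline = time.monotonic() + timeout
    while True:
        queue_manager = get_queue_manager()
        if queue_manager is not None:
            return queue_manager
        if time.monotonic() >= deadline:
            return None
        await asyncio.sleep(poll_interval)


def enqueue_semantic(queue_manager, task_id):
    """Raise instead of returning silently when there is nowhere to enqueue."""
    if queue_manager is None:
        raise RuntimeError(f"queue manager unavailable, cannot enqueue {task_id}")
    queue_manager.append(task_id)  # a list stands in for the real queue (assumption)


async def recover_pending_redo(pending, get_queue_manager, timeout=30.0, poll_interval=0.5):
    """Return the set of task ids whose redo markers must be kept."""
    queue_manager = await wait_for_queue_manager(get_queue_manager, timeout, poll_interval)
    if queue_manager is None:
        # Defer everything to the next restart; no marker is removed.
        return set(pending)
    remaining = set(pending)
    for task_id in sorted(pending):
        try:
            enqueue_semantic(queue_manager, task_id)
        except RuntimeError:
            continue  # keep the marker for this task
        remaining.discard(task_id)  # only now is it safe to mark the task done
    return remaining
```

The design point is that the marker is removed only after the enqueue call returns without raising, so a missing queue manager can no longer be mistaken for accepted work.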

Test plan

  • Verify service starts normally with no pending redo tasks
  • Verify redo recovery waits for queue manager and succeeds
  • Verify redo task is NOT marked done when enqueue fails
  • Verify service starts within 30s even with unreachable VLM
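
Two of the test-plan items could be exercised with something like the sketch below. It assumes a hypothetical `recover_one` helper that bounds the VLM call with `asyncio.wait_for` and falls back to the queue; the helper, the `hung_vlm` coroutine, and the test names are illustrative and do not come from the PR.

```python
import asyncio
from unittest.mock import Mock


async def recover_one(task_id, extract, enqueue, mark_done, vlm_timeout=15.0):
    """Hypothetical per-task recovery step.

    Tries extraction with a bounded timeout, falls back to the queue on any
    failure, and calls mark_done only when one of the two paths succeeded.
    """
    try:
        await asyncio.wait_for(extract(task_id), timeout=vlm_timeout)
    except Exception:
        # Covers the timeout as well as extraction errors.
        try:
            enqueue(task_id)
        except RuntimeError:
            return False  # marker stays, task is retried later
    mark_done(task_id)
    return True


async def hung_vlm(task_id):
    """Simulates an unreachable VLM that never answers."""
    await asyncio.sleep(3600)


def test_marker_kept_when_enqueue_fails():
    mark_done = Mock()
    enqueue = Mock(side_effect=RuntimeError("queue manager unavailable"))
    ok = asyncio.run(recover_one("t1", hung_vlm, enqueue, mark_done, vlm_timeout=0.01))
    assert ok is False
    mark_done.assert_not_called()


def test_marker_removed_when_fallback_succeeds():
    mark_done = Mock()
    enqueue = Mock()
    ok = asyncio.run(recover_one("t2", hung_vlm, enqueue, mark_done, vlm_timeout=0.01))
    assert ok is True
    enqueue.assert_called_once_with("t2")
    mark_done.assert_called_once_with("t2")
```

The tests shrink the timeout to 0.01s so they finish quickly; the 15s default mirrors the value proposed in the Fix section.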

🤖 Generated with Claude Code

…task loss

Three issues in _recover_pending_redo / _enqueue_semantic:

1. Redo recovery runs immediately at start() but the queue manager may
   not be initialized yet.  _enqueue_semantic silently returns when
   queue_manager is None, then mark_done removes the redo marker →
   the task is permanently lost and memories are never extracted.

2. The VLM timeout during recovery is 60 s.  If VLM is unavailable,
   each redo task wastes a full minute, and with multiple tasks this
   delays service readiness by several minutes.

3. _enqueue_semantic silently drops work when queue_manager is None.
   Any caller that assumes the work was accepted will proceed and
   delete its redo marker, losing the task.

Fixes:
- Wait up to 30 s for queue manager to become available before
  starting redo recovery; if it never appears, defer to next restart.
- Reduce VLM timeout during recovery from 60 s to 15 s.
- Make _enqueue_semantic raise RuntimeError instead of silently
  returning when queue_manager is None so callers cannot accidentally
  mark the task as done.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Apr 5, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected


github-actions bot commented Apr 5, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

