fix(dgw): fix adaptive frame skipping during EOF waits in session shadowing#1678
fix(dgw): fix adaptive frame skipping during EOF waits in session shadowing#1678irvingoujAtDevolution wants to merge 1 commit intomasterfrom
Conversation
Let maintainers know that an action is required on their side
|
…dowing Fixes playback failures in session shadowing where adaptive frame skipping permanently stalls video streaming after EOF retry waits. The ratio used for adaptive frame skip decisions (media_time / wall_time) was corrupted by time spent waiting for the recording file to grow during EOF retries. A 20-second EOF wait would result in a ratio of ~0.006, far below the 1.0 threshold, causing all subsequent frames to be decode-only with no encoded output reaching the player. The fix changes the ratio to only measure active processing time (media_time / accumulated_processing_time) instead of total wall time, excluding idle waits. Additionally, EOF waits now time out after 3 seconds to limit the maximum stall duration. Issue: DGW-341
a51ceca to
16733fc
Compare
| let _ = tx.send(WhenEofControlFlow::Break); | ||
| }, | ||
| _ = tokio::time::sleep(std::time::Duration::from_secs(3)) => { | ||
| info!("EOF wait timed out, retrying"); |
There was a problem hiding this comment.
issue: Probably not a good idea to use an info! log at this place. I would suggest trace! unless you think debug! is better
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical bug in session shadowing where adaptive frame skipping would permanently stall video streaming after EOF retry waits. The issue occurred because the realtime ratio calculation (used to decide whether to skip frames) included idle time spent waiting for file growth, causing the ratio to drop far below the 1.0 threshold and triggering perpetual decode-only mode.
Changes:
- Modified the realtime ratio calculation to track only active processing time, excluding EOF wait periods
- Added a 3-second timeout to EOF waits to prevent indefinite stalls
- Increased MAX_RETRY_COUNT from 3 to 25 to accommodate the new timeout-based retry strategy
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| crates/video-streamer/src/streamer/tag_writers.rs | Added processing_time tracking to measure only active processing (excluding idle waits), updated current_realtime_ratio to use processing_time instead of wall time, added Duration import |
| crates/video-streamer/src/streamer/mod.rs | Added 3-second timeout to EOF waits, increased MAX_RETRY_COUNT from 3 to 25 to accommodate timeout-based retries |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -99,7 +99,7 @@ pub fn webm_stream( | |||
| } | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
The MAX_RETRY_COUNT increase from 3 to 25 combined with the 3-second timeout per retry creates a maximum potential wait time of 75 seconds (25 × 3s). This seems excessive for a live streaming scenario. Consider either:
- Reducing MAX_RETRY_COUNT to a lower value (e.g., 10 for 30s max wait)
- Adding a comment explaining why 75 seconds is an acceptable maximum wait time
- Implementing a total elapsed time check instead of just a retry count
The original issue describes a 20-second EOF wait corrupting the ratio, but the fix allows up to 75 seconds of total wait time, which could still cause significant delays in live streaming scenarios.
| // NOTE: MAX_RETRY_COUNT is intentionally set to 25. With the 3-second delay | |
| // between retries, this yields a worst-case wait of 75 seconds before we | |
| // give up on the current stream. This is acceptable in our live-streaming | |
| // context because downstream components already enforce a stricter overall | |
| // timeout, and tolerating longer temporary input stalls here reduces | |
| // unnecessary reconnect churn on brief network or encoder hiccups. |
| @@ -371,11 +377,11 @@ where | |||
|
|
|||
There was a problem hiding this comment.
Consider adding a comment to the current_realtime_ratio method to clarify that it measures the ratio of media time to processing time (excluding idle waits), not media time to wall time. This would help future maintainers understand the distinction, especially given that the fix specifically addresses the issue of idle time corrupting this ratio.
Example:
// Calculates the ratio of media time advanced to active processing time.
// This excludes idle time spent waiting for file growth during EOF retries,
// preventing temporary stalls from permanently corrupting the frame skip decision.
| // Calculates the ratio of media time advanced to active processing time. | |
| // This uses `processing_time`, which is intended to exclude idle waits (for example, | |
| // time spent waiting for file growth during EOF retries), so that temporary stalls | |
| // do not permanently corrupt the frame skip decision. |
Fixes playback failures in session shadowing where adaptive frame skipping permanently stalls video streaming after EOF retry waits.
The ratio used for adaptive frame skip decisions (media_time / wall_time) was corrupted by time spent waiting for the recording file to grow during EOF retries. A 20-second EOF wait would result in a ratio of ~0.006, far below the 1.0 threshold, causing all subsequent frames to be decode-only with no encoded output reaching the player.
The fix changes the ratio to only measure active processing time (media_time / accumulated_processing_time) instead of total wall time, excluding idle waits. Additionally, EOF waits now time out after 3 seconds to limit the maximum stall duration.
Issue: DGW-341