Handle exceptions in data reload loop to prevent silent data staleness #7087
Open
bzantium wants to merge 1 commit into tensorflow:master from
The Reloader thread/process in LocalDataIngester crashes on any unhandled exception (e.g. transient network errors when reading from remote filesystems like GCS). Once the reload loop dies, TensorBoard continues serving stale data with no indication to the user. Wrap the reload loop body in a try/except so that transient errors are logged and the next reload cycle proceeds normally.
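The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_reload_loop`, `reload_once`, `stop_event`, and `demo` are hypothetical names, and the stop event exists only so the example terminates.

```python
import logging
import threading

logger = logging.getLogger(__name__)


def run_reload_loop(reload_once, interval_secs, stop_event):
    """Call reload_once repeatedly until stop_event is set.

    Hypothetical helper for illustration; not TensorBoard's actual API.
    """
    while not stop_event.is_set():
        try:
            reload_once()
        except Exception:
            # Log the traceback and fall through, so the next cycle still runs.
            logger.error("Error during data reload; will retry", exc_info=True)
        # Sleep between cycles; returns early once a stop is requested.
        stop_event.wait(interval_secs)


def demo():
    """Run three cycles where the first one fails; return the call count."""
    calls = []
    stop = threading.Event()

    def flaky_reload():
        calls.append(len(calls))
        if len(calls) == 1:
            raise OSError("simulated transient network error")
        if len(calls) >= 3:
            stop.set()

    run_reload_loop(flaky_reload, interval_secs=0, stop_event=stop)
    return len(calls)
```

Catching `Exception` rather than using a bare `except:` lets `KeyboardInterrupt` and `SystemExit` still propagate, so the loop survives transient failures without blocking shutdown.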
f814a01 to 2c78489
Summary
Fixes #7086
The `_reload` function in `LocalDataIngester` has no exception handling, so any transient error (e.g., a network timeout when reading from GCS) kills the Reloader thread permanently. TensorBoard then silently serves stale data with no way to recover short of a restart.
This PR wraps the reload loop body in `try/except Exception` so that:
- transient errors are logged via `logger.error`
- the next reload cycle proceeds normally
Changes
- `data_ingester.py`: Wrap the reload loop body in try/except and log errors with `exc_info=True`
- `data_ingester_test.py`: Add tests verifying that the reload loop survives exceptions from both `AddRunsFromDirectory` and `Reload`
Test plan
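A test of the kind listed under Changes might look like the sketch below. It does not use TensorBoard's real classes: `reload_cycles` and `check_loop_survives` are hypothetical stand-ins, the multiplexer is a `unittest.mock.Mock`, and the loop is bounded to a fixed number of cycles so the check can finish.

```python
import logging
from unittest import mock

logger = logging.getLogger("reload_sketch")


def reload_cycles(multiplexer, path_to_run, cycles):
    """Run a fixed number of reload cycles, logging errors instead of raising.

    Hypothetical stand-in for the ingester's loop, bounded for testing.
    """
    for _ in range(cycles):
        try:
            for path, name in path_to_run.items():
                multiplexer.AddRunsFromDirectory(path, name)
            multiplexer.Reload()
        except Exception:
            logger.error("Error during data reload", exc_info=True)


def check_loop_survives(failing_method):
    """Fail failing_method once, then return (AddRuns calls, Reload calls)."""
    multiplexer = mock.Mock()
    # The first call raises; the two later calls succeed.
    getattr(multiplexer, failing_method).side_effect = [
        OSError("simulated transient error"),
        None,
        None,
    ]
    reload_cycles(multiplexer, {"/logs": "run"}, cycles=3)
    return (
        multiplexer.AddRunsFromDirectory.call_count,
        multiplexer.Reload.call_count,
    )
```

When `AddRunsFromDirectory` fails in the first cycle, `Reload` is skipped for that cycle only; both methods are called again in the cycles that follow, which is the behavior the tests aim to verify.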