Handle exceptions in data reload loop to prevent silent data staleness #7087

Open
bzantium wants to merge 1 commit into tensorflow:master from bzantium:fix/reload-error-handling

@bzantium bzantium commented Apr 4, 2026

Summary

Fixes #7086

The _reload function in LocalDataIngester has no exception handling, so any transient error (e.g., network timeout when reading from GCS) kills the Reloader thread permanently. TensorBoard then silently serves stale data with no way to recover short of a restart.

This PR wraps the reload loop body in try/except Exception so that:

  • Transient errors are logged with full traceback via logger.error
  • The reload loop continues to the next cycle instead of crashing
  • TensorBoard automatically recovers once the transient issue resolves
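The behaviour described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `run_reload_loop`, `reload_once`, `max_cycles`, and the injectable `sleep` are hypothetical names introduced so the example is self-contained, and the only details taken from the description are the `try/except Exception` around the loop body and the `logger.error(..., exc_info=True)` call.

```python
import logging
import time

logger = logging.getLogger(__name__)


def run_reload_loop(reload_once, interval_secs, max_cycles=None, sleep=time.sleep):
    """Call reload_once() repeatedly and keep going when a cycle raises.

    Hypothetical helper for illustration only. max_cycles and sleep exist
    so the loop can be bounded and exercised without real waiting.
    Returns the number of cycles that raised an exception.
    """
    failures = 0
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        cycle += 1
        try:
            reload_once()
        except Exception:
            # Log the full traceback, then fall through to the next cycle
            # instead of letting the exception end the loop.
            failures += 1
            logger.error("Reload cycle %d failed; retrying next cycle", cycle, exc_info=True)
        sleep(interval_secs)
    return failures
```

Catching `Exception` rather than using a bare `except` leaves `KeyboardInterrupt` and `SystemExit` free to propagate, so the loop can still be stopped deliberately.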

Changes

  • data_ingester.py: Wrap reload loop body in try/except, log errors with exc_info=True
  • data_ingester_test.py: Add tests verifying the reload loop survives exceptions from both AddRunsFromDirectory and Reload
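One way such tests could be structured is sketched below, using `unittest.mock` to make a multiplexer method fail once and then succeed. This is an assumption about the approach, not the tests added in this PR: `reload_cycles` is a hypothetical stand-in for the guarded loop, `check_survives` is a hypothetical helper, and the logdir path and run name are placeholders. Only the method names `AddRunsFromDirectory` and `Reload` come from the description.

```python
import logging
from unittest import mock

logger = logging.getLogger(__name__)


def reload_cycles(multiplexer, path_to_run, cycles):
    """Hypothetical stand-in for the guarded reload loop (not TensorBoard's code)."""
    for _ in range(cycles):
        try:
            for path, name in path_to_run.items():
                multiplexer.AddRunsFromDirectory(path, name)
            multiplexer.Reload()
        except Exception:
            logger.error("Reload failed; continuing", exc_info=True)


def check_survives(failing_method):
    """Make one multiplexer method raise on its first call, then run three cycles."""
    multiplexer = mock.Mock()
    # A list side_effect is consumed one item per call; exception instances
    # are raised, other values are returned.
    getattr(multiplexer, failing_method).side_effect = [
        OSError("simulated timeout"),
        None,
        None,
    ]
    reload_cycles(multiplexer, {"/tmp/logdir": "run"}, cycles=3)
    return multiplexer
```

When `AddRunsFromDirectory` fails in the first cycle, `Reload` is skipped for that cycle only, so asserting on the call counts of both methods shows that later cycles ran in full.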

Test plan

  • Added unit tests for exception handling in reload loop
  • Verified manually with GCS logdir + simulated network interruption

The Reloader thread/process in LocalDataIngester crashes on any
unhandled exception (e.g. transient network errors when reading from
remote filesystems like GCS). Once the reload loop dies, TensorBoard
continues serving stale data with no indication to the user.

Wrap the reload loop body in a try/except so that transient errors are
logged and the next reload cycle proceeds normally.
@bzantium bzantium force-pushed the fix/reload-error-handling branch from f814a01 to 2c78489 Compare April 4, 2026 16:12

Development

Successfully merging this pull request may close these issues.

Reloader thread crashes on transient errors, causing silent data staleness