Use targeted glob for cloud logdir subdirectory discovery #7089

Open
bzantium wants to merge 2 commits into tensorflow:master from bzantium:perf/cloud-logdir-glob

@bzantium bzantium commented Apr 4, 2026

Summary

Fixes #7088

Replace the level-by-level globbing approach in GetLogdirSubdirectories for cloud filesystems (GCS, S3) with a single recursive glob for *tfevents* files.

The previous method called ListRecursivelyViaGlobbing, which globs *, */*, */*/*, etc., listing all files at every directory level. This is extremely slow when the directory tree contains many non-event files (e.g., model checkpoints).
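
To illustrate why this is costly, here is a minimal simulation of level-by-level globbing over a flat list of object paths. It is a sketch only: `level_by_level_listing` and the `gs://bucket/logs` paths are hypothetical and are not the code of ListRecursivelyViaGlobbing.

```python
def level_by_level_listing(top, paths):
    """Simulate globbing top/*, top/*/*, ... over a flat list of object paths.

    Hypothetical helper for illustration. Yields (pattern, matches) for each
    level and stops at the first level that matches nothing.
    """
    prefix = top.rstrip("/") + "/"
    relative = [p[len(prefix):] for p in paths if p.startswith(prefix)]
    depth = 1
    while True:
        # Every file or directory that sits exactly `depth` components deep.
        matches = sorted(
            {"/".join(r.split("/")[:depth]) for r in relative
             if len(r.split("/")) >= depth}
        )
        if not matches:
            return
        pattern = prefix + "/".join(["*"] * depth)
        yield pattern, [prefix + m for m in matches]
        depth += 1


objects = [
    "gs://bucket/logs/run1/events.out.tfevents.1",
    "gs://bucket/logs/run1/ckpt/model-1.bin",
    "gs://bucket/logs/run1/ckpt/model-2.bin",
]
levels = list(level_by_level_listing("gs://bucket/logs", objects))
for pattern, matches in levels:
    print(pattern, len(matches))
```

Every level returns all entries at that depth, so the checkpoint files are listed even though only the event file matters for run discovery.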

The new approach uses a single **/*tfevents* glob to directly find event files and derives containing directories from the results.
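
The "derive containing directories" step can be sketched as follows. This assumes the recursive glob has already returned a list of matching paths; the filesystem call itself is omitted, and `dirs_containing_event_files` is a hypothetical name, not the function added in this PR.

```python
import fnmatch
import posixpath


def dirs_containing_event_files(paths, pattern="*tfevents*"):
    """Return the sorted, de-duplicated directories that hold event files.

    Hypothetical helper: `paths` stands in for the results of a recursive
    glob. The basename check is a defensive filter in case the listing
    contains entries that are not event files.
    """
    event_files = [
        p for p in paths
        if fnmatch.fnmatchcase(posixpath.basename(p), pattern)
    ]
    return sorted({posixpath.dirname(p) for p in event_files})


glob_results = [
    "gs://bucket/logs/run1/events.out.tfevents.1",
    "gs://bucket/logs/run1/events.out.tfevents.2",
    "gs://bucket/logs/run2/eval/events.out.tfevents.3",
    "gs://bucket/logs/run2/ckpt/model-1.bin",
]
run_dirs = dirs_containing_event_files(glob_results)
print(run_dirs)
```

Using a set collapses multiple event files in the same directory into a single run directory.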

Benchmark

Tested on a real GCS directory with ~26,000 checkpoint files and 8 event files:

| Method | Time |
| --- | --- |
| Before (level-by-level glob) | ~101s |
| After (targeted **/*tfevents* glob) | ~13s |

Scope

  • Cloud paths only (gs://, s3://): uses the new targeted glob
  • Local paths: unchanged, still uses ListRecursivelyViaWalking
  • ListRecursivelyViaGlobbing is left intact (not removed) to avoid breaking any external consumers
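
The scope rules above amount to a dispatch on the path scheme. A minimal sketch, assuming a simple prefix check (the exact check used in the change is not shown here, and `is_cloud_path` and `choose_strategy` are hypothetical names):

```python
# Assumption: cloud paths are recognized by these two scheme prefixes only.
CLOUD_PREFIXES = ("gs://", "s3://")


def is_cloud_path(path):
    """Return True if the path uses one of the cloud schemes in scope."""
    return path.startswith(CLOUD_PREFIXES)


def choose_strategy(path):
    """Name the discovery strategy that applies to this logdir path."""
    return "targeted_glob" if is_cloud_path(path) else "walk"


print(choose_strategy("gs://bucket/logs"))
print(choose_strategy("/tmp/logs"))
```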

Test plan

  • Verified on a real GCS bucket with multiple experiments containing checkpoints and TensorBoard event files
  • Confirmed TensorBoard correctly discovers all runs and loads scalar data
  • Confirmed local filesystem paths are unaffected by this change

bzantium added 2 commits April 5, 2026 01:10
Replace the level-by-level globbing approach for cloud filesystems
(GCS, S3) with a single recursive glob for *tfevents* files. The
previous method listed all files at every directory level, which is
extremely slow when the directory tree contains many non-event files
such as model checkpoints.

For a test case with ~26,000 checkpoint files alongside 8 event files,
this reduces discovery time from ~100s to ~13s.

Local filesystem paths are unaffected; they continue to use the
walk-based approach.

Development

Successfully merging this pull request may close these issues.

Slow logdir discovery on cloud filesystems due to level-by-level globbing
