Speed up local 'get_data_patterns' by avoiding repeated recursive scans #8014

Open
AsymptotaX wants to merge 2 commits into huggingface:main from AsymptotaX:perf/local-get-data-patterns-single-scan

AsymptotaX commented on Feb 21, 2026

This PR speeds up get_data_patterns for local paths.

Problem

For load_dataset("imagefolder", data_dir=...), get_data_patterns rescanned the same local directory tree once for each split pattern it tried (train, test, etc.).
On large folders this became very slow, as already reported in earlier performance discussions and issues.

Change

In get_data_patterns (local paths only):

  • scan files once (resolve_pattern("**", ...))
  • then match split patterns in memory

Remote paths keep the old behavior.
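
The idea can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: os.walk stands in for the single resolve_pattern("**", ...) scan, fnmatch stands in for the in-memory matching, and SPLIT_PATTERNS, scan_once and match_splits are hypothetical names with deliberately simplified patterns.

```python
import fnmatch
import os

# Hypothetical, simplified split patterns in fnmatch syntax (where "*" also
# matches "/"). The patterns actually used by `datasets` are more elaborate.
SPLIT_PATTERNS = {
    "train": ["train/*", "*/train/*"],
    "test": ["test/*", "*/test/*"],
}


def scan_once(root):
    """Walk the tree a single time and return sorted, "/"-separated relative paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def match_splits(rel_paths, split_patterns=SPLIT_PATTERNS):
    """Match every split pattern against the in-memory listing (no further disk access)."""
    matches = {}
    for split, patterns in split_patterns.items():
        hits = [
            path
            for path in rel_paths
            if any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)
        ]
        if hits:
            matches[split] = hits
    return matches
```

Calling match_splits(scan_once(data_dir)) touches the filesystem once regardless of how many patterns are tried, whereas a per-pattern scan pays one full tree walk for each pattern.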

No API changes.

Environment

  • Mac mini (M4), 16 GB RAM
  • Python 3.13
  • Datasets version: 4.5.0

Benchmarks

imagefolder with local .jpg files

data_dir_only:

  • 10k: 4.35s -> 1.40s (3.10x)
  • 100k: 33.48s -> 9.77s (3.43x)
  • 300k: 160.20s -> 35.81s (4.47x)
  • 1M: 1877.70s -> 164.17s (11.44x) 🔥

explicit_data_files:

  • 10k: 0.75s -> 0.79s
  • 100k: 7.23s -> 7.29s
  • 300k: 25.44s -> 24.26s
  • 1M: 115.85s -> 112.66s

As expected, the improvement is confined to data_dir_only (the auto pattern detection path); the explicit_data_files timings are essentially unchanged.
Memory usage showed no consistent regression in these runs and stayed within normal run-to-run variance for this benchmark setup.
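
For anyone who wants to reproduce the comparison, the two modes could be timed roughly as follows. This is only a sketch under assumptions: the PR does not show the benchmark script, so time_call, load_with_data_dir and load_with_explicit_files are hypothetical helpers, and the data_files glob is a guess at how explicit_data_files was configured.

```python
import time


def time_call(fn):
    """Return the wall-clock time in seconds taken by a single call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def load_with_data_dir(data_dir):
    # data_dir_only mode: split patterns are detected automatically from the tree.
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset("imagefolder", data_dir=data_dir)


def load_with_explicit_files(data_dir):
    # explicit_data_files mode (assumed configuration): files are named up front.
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset("imagefolder", data_files={"train": f"{data_dir}/**/*.jpg"})
```

A run would then look like time_call(lambda: load_with_data_dir("/path/to/images")). Since load_dataset caches prepared datasets, repeated runs on the same folder may not be comparable with the first one.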

lhoestq (Member) left a comment

lgtm! Can you run make style to fix the CI before we merge?

AsymptotaX (Author) commented

> lgtm! Can you run make style to fix the CI before we merge?

Done. All good, no warnings.

AsymptotaX requested a review from lhoestq on February 25, 2026, 20:07