
perf: use deque for async map task/index queues #8033

Closed
giulio-leone wants to merge 1 commit into huggingface:main from giulio-leone:fix/async-map-deque-performance

@giulio-leone

Problem

arrow_dataset.py and iterable_dataset.py drain the indices and tasks lists front-to-back via .pop(0) during parallel async map operations. With n = MAX_NUM_RUNNING_ASYNC_MAP_FUNCTIONS_IN_PARALLEL items in flight, each .pop(0) is O(n) because the remaining elements are shifted, making the drain loop O(n²).

Solution

Switch both indices and tasks to collections.deque with .popleft() for O(1) front removal.
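As a minimal sketch of the pattern (not the actual code from either file; `drain_with_list` and `drain_with_deque` are hypothetical names used only for illustration), both loops below yield items in the same order, but only the deque version removes from the front in O(1):

```python
from collections import deque


def drain_with_list(items):
    """Drain front-to-back with list.pop(0); each pop shifts the remaining elements."""
    queue = list(items)
    drained = []
    while queue:
        drained.append(queue.pop(0))  # O(len(queue)) per call
    return drained


def drain_with_deque(items):
    """Drain front-to-back with deque.popleft(); each pop is O(1)."""
    queue = deque(items)
    drained = []
    while queue:
        drained.append(queue.popleft())
    return drained


sample = list(range(1000))
# Same FIFO order either way; only the cost per removal differs.
assert drain_with_list(sample) == drain_with_deque(sample) == sample
```

Because the ordering is identical, the swap should be behavior-preserving wherever the queues are only appended to at the back and popped from the front.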

Changes

  • src/datasets/arrow_dataset.py: Import deque, type tasks and indices as deque, use .popleft()
  • src/datasets/iterable_dataset.py: Import deque, type tasks, indices, and _owned_loops_and_tasks entry as deque, use .popleft()

Testing

  • Syntax verified via ast.parse() on both files
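A syntax check of this kind might look like the sketch below. The `syntax_ok` helper and the inline source strings are assumptions for illustration; the PR does not show the exact command that was run.

```python
import ast


def syntax_ok(source: str) -> bool:
    """Return True if `source` parses as valid Python, False on a SyntaxError."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Hypothetical snippets standing in for the edited files.
good = "from collections import deque\ntasks = deque()\ntasks.append(1)\n"
bad = "tasks = deque(\n"  # unclosed parenthesis

print(syntax_ok(good), syntax_ok(bad))  # prints: True False
```

Note that ast.parse() only confirms the files parse; it does not execute the modified code paths.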

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@giulio-leone
Author

Friendly ping — CI is green and this is ready for review. Happy to address any feedback. Thanks!

2 participants