feat: add multi-process data loader for GIL-bound insert paths #777

Open
wyfanxiao wants to merge 1 commit into zilliztech:main from wyfanxiao:multiprocess-loader
@wyfanxiao
Contributor

Hi, thanks for the detailed review on #769! I've split the PR as suggested and addressed all three bugs you identified (partial load detection, timeout enforcement in queue wait, and PerformanceTimeoutError signature).

Summary

  • Add MultiprocessInsertRunner for parallel data loading across worker processes
  • Add --load-processes CLI option for explicit multi-process control
  • Auto-switch to multi-process loader when a client declares thread_safe=False and --load-concurrency > 1
  • OceanBase: declare thread_safe=False with pickle support
  • Improve concurrent_runner log message for thread_safe=False fallback
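
To illustrate what "pickle support" can look like for a client that is handed to worker processes, here is a minimal sketch. It is not the OceanBase client from this PR: the class name `PicklableClient`, its `db_config` argument, and the use of a `threading.Lock` as a stand-in for a live connection are all assumptions made for illustration. The idea is only that the unpicklable handle is dropped in `__getstate__` and each process reconnects on its own.

```python
import pickle
import threading


class PicklableClient:
    """Hypothetical client used for illustration; not the PR's OceanBase client."""

    # Declares that a single instance must not be shared across threads.
    thread_safe = False

    def __init__(self, db_config: dict):
        self.db_config = db_config
        self._conn = None  # created lazily, once per process

    def connect(self) -> None:
        # Stand-in for a real driver connection. A Lock is used here only
        # because, like a live connection, it cannot be pickled.
        self._conn = threading.Lock()

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        state["_conn"] = None  # drop the live handle before pickling
        return state

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)


client = PicklableClient({"host": "127.0.0.1", "port": 2881})
client.connect()

# Round-trip through pickle, as would happen when passing the client to a worker.
clone = pickle.loads(pickle.dumps(client))
```

The clone carries the configuration but no connection, so a worker would call `connect()` itself before inserting.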

On benchmark fairness

Great point on fairness — I've been thinking about this too. One thing I noticed is that SQL-based clients (OceanBase, PgVector, VectorChord, Doris, SeekDB) currently fall back to max_workers=1 due to thread_safe=False, while thread-safe clients like Milvus and Elasticsearch can take advantage of multi-threaded loading. So it seems like there's already some difference in load parallelism across databases today.

The multi-process loader is an attempt to help bridge that gap. It's designed to be opt-in — only activated via explicit --load-processes or when thread_safe=False combined with --load-concurrency > 1. Default behavior stays the same for all existing clients.
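
The activation rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `choose_loader` is a hypothetical name, the return values are placeholder labels rather than real classes, and an unset `--load-processes` is assumed to be represented as `None`.

```python
from typing import Optional


def choose_loader(
    thread_safe: bool,
    load_concurrency: int,
    load_processes: Optional[int] = None,
) -> str:
    """Return a label for the loader that would be used (hypothetical helper)."""
    # Explicit opt-in: --load-processes was passed on the command line.
    if load_processes is not None:
        return "multiprocess"
    # Auto-switch: the client is not thread-safe and concurrency was requested.
    if not thread_safe and load_concurrency > 1:
        return "multiprocess"
    # Otherwise nothing changes relative to today's behavior.
    return "default"


print(choose_loader(thread_safe=True, load_concurrency=4))
print(choose_loader(thread_safe=False, load_concurrency=4))
print(choose_loader(thread_safe=False, load_concurrency=1))
```

A strictly explicit variant, with no auto-switch, would simply drop the second branch.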

Of course, I totally understand if you'd prefer a more conservative approach — for example, making it strictly explicit with no auto-switch at all. Would love to hear what you think works best for the project!

- Add MultiprocessInsertRunner for parallel data loading across worker processes
- Add --load-processes CLI option for explicit multi-process control
- Auto-switch to multi-process loader when a client declares thread_safe=False and --load-concurrency > 1
- OceanBase: declare thread_safe=False with pickle support for multi-process loading
- Improve concurrent_runner log message for thread_safe=False fallback
- Fix: verify inserted == produced after shutdown to prevent partial load
- Fix: check timeout inside _put_with_interruptible_wait to prevent indefinite blocking
- Fix: use PerformanceTimeoutError without message arg to match its __init__ signature
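
For readers following the three fixes, here is a rough sketch of the ideas behind them. It is not the code in this PR: `LoadTimeout` is a hypothetical stand-in for `PerformanceTimeoutError`, and `put_with_deadline` and `verify_complete_load` are illustrative names. The queue wait polls in short intervals and re-checks a deadline on each pass, and the row counts are compared once loading has finished.

```python
import queue
import time


class LoadTimeout(Exception):
    """Hypothetical stand-in for the project's PerformanceTimeoutError."""


def put_with_deadline(q, item, deadline: float, poll_interval: float = 0.1) -> None:
    """Enqueue `item`, giving up once `deadline` (a time.monotonic() value) passes."""
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Raised without a message argument, mirroring the third fix.
            raise LoadTimeout
        try:
            # Wait briefly so the deadline is re-checked instead of blocking forever.
            q.put(item, timeout=min(poll_interval, remaining))
            return
        except queue.Full:
            continue


def verify_complete_load(produced: int, inserted: int) -> None:
    """Fail loudly if fewer (or more) rows were inserted than were produced."""
    if inserted != produced:
        raise RuntimeError(f"partial load: inserted {inserted} of {produced} rows")


q = queue.Queue(maxsize=1)
put_with_deadline(q, "batch-0", deadline=time.monotonic() + 1.0)

try:
    # The queue is full and nothing consumes it, so this hits the deadline.
    put_with_deadline(q, "batch-1", deadline=time.monotonic() + 0.2)
    timed_out = False
except LoadTimeout:
    timed_out = True
```

The same loop shape works with `multiprocessing.Queue`, whose `put` also raises `queue.Full` on timeout.
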
@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wyfanxiao
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
