feat: add multi-process data loader for GIL-bound insert paths by wyfanxiao · Pull Request #777 · zilliztech/VectorDBBench

wyfanxiao · 2026-05-14T03:50:06Z

Hi, thanks for the detailed review on #769! I've split the PR as suggested and addressed all three bugs you identified (partial load detection, timeout enforcement in queue wait, and PerformanceTimeoutError signature).

Summary

- Add MultiprocessInsertRunner for parallel data loading across worker processes
- Add --load-processes CLI option for explicit multi-process control
- Auto-switch to multi-process loader when a client declares thread_safe=False and --load-concurrency > 1
- OceanBase: declare thread_safe=False with pickle support
- Improve concurrent_runner log message for thread_safe=False fallback
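As a rough illustration of what "pickle support" can mean for a client that is handed to worker processes, here is a minimal sketch. `SketchClient`, its `dsn` field, and `connect()` are hypothetical names invented for this example and are not taken from the OceanBase client in this PR; a `threading.Lock` stands in for a live connection because it cannot be pickled.

```python
import pickle
import threading


class SketchClient:
    """Hypothetical DB client used only for illustration."""

    # Signals that a single instance must not be shared across threads.
    thread_safe = False

    def __init__(self, dsn):
        self.dsn = dsn
        # Stand-in for a live connection handle, which cannot be pickled.
        self._conn = threading.Lock()

    def __getstate__(self):
        # Copy the picklable configuration and drop the live handle.
        state = self.__dict__.copy()
        state["_conn"] = None
        return state

    def __setstate__(self, state):
        # Restore configuration only; the worker reconnects lazily.
        self.__dict__.update(state)

    def connect(self):
        # Each worker process opens its own connection on first use.
        if self._conn is None:
            self._conn = threading.Lock()
        return self._conn
```

With this shape, each worker process receives only the configuration and opens its own connection, so no connection object is ever shared between processes.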

On benchmark fairness

Great point on fairness; I've been thinking about this too. One thing I noticed is that SQL-based clients (OceanBase, PgVector, VectorChord, Doris, SeekDB) currently fall back to max_workers=1 because they declare thread_safe=False, while thread-safe clients like Milvus and Elasticsearch can take advantage of multi-threaded loading. So load parallelism already differs across databases today.

The multi-process loader is an attempt to help bridge that gap. It's designed to be opt-in: it activates only via an explicit --load-processes, or when a client declares thread_safe=False and --load-concurrency > 1. Default behavior stays the same for all existing clients.
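The activation rule described above could be sketched roughly as follows. `LoaderPlan` and `plan_loader` are hypothetical names, and two details are assumptions rather than facts from the PR: that `--load-processes` is an optional integer, and that the auto-switch reuses the `--load-concurrency` value as the process count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoaderPlan:
    mode: str     # "serial", "threads", or "processes"
    workers: int


def plan_loader(
    thread_safe: bool,
    load_concurrency: int = 1,
    load_processes: Optional[int] = None,
) -> LoaderPlan:
    """Pick a load strategy; a sketch of the opt-in rule, not the PR's code."""
    # An explicit --load-processes always selects the multi-process loader.
    if load_processes is not None:
        return LoaderPlan("processes", load_processes)
    # Auto-switch: a non-thread-safe client that asked for concurrency.
    if not thread_safe and load_concurrency > 1:
        return LoaderPlan("processes", load_concurrency)  # assumed worker count
    # Thread-safe clients keep the existing multi-threaded path.
    if thread_safe and load_concurrency > 1:
        return LoaderPlan("threads", load_concurrency)
    # Everything else keeps the single-worker default.
    return LoaderPlan("serial", 1)
```

Under these assumptions, `plan_loader(False, 4)` selects four processes, while `plan_loader(False, 1)` and `plan_loader(True, 1)` leave the default single-worker path untouched.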

Of course, I totally understand if you'd prefer a more conservative approach, for example making it strictly explicit with no auto-switch at all. Would love to hear what you think works best for the project!

- Add MultiprocessInsertRunner for parallel data loading across worker processes
- Add --load-processes CLI option for explicit multi-process control
- Auto-switch to multi-process loader when client declares thread_safe=False
- OceanBase: declare thread_safe=False with pickle support for multi-process loading
- Improve concurrent_runner log message for thread_safe=False fallback
- Fix: verify inserted == produced after shutdown to prevent partial load
- Fix: check timeout inside _put_with_interruptible_wait to prevent indefinite blocking
- Fix: use PerformanceTimeoutError without message arg to match its __init__ signature
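The three fixes in this commit message could look roughly like the sketch below. It is not the PR's implementation: `LoadTimeoutError` is a hypothetical stand-in for `PerformanceTimeoutError`, and the function names and the `poll_interval` parameter are invented for illustration. The helper accepts any queue whose `put(item, timeout=...)` raises `queue.Full`, which holds for both `queue.Queue` and `multiprocessing.Queue`.

```python
import queue
import time


class LoadTimeoutError(TimeoutError):
    """Hypothetical stand-in for the project's PerformanceTimeoutError."""


def put_with_interruptible_wait(q, item, deadline, poll_interval=0.1):
    """Put item on q, re-checking the deadline between short waits.

    deadline is an absolute time.monotonic() value.
    """
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Raised without a message argument, mirroring the third fix.
            raise LoadTimeoutError
        try:
            # Wait briefly so a full queue cannot block past the deadline.
            q.put(item, timeout=min(poll_interval, remaining))
            return True
        except queue.Full:
            continue


def verify_full_load(produced, inserted):
    """Fail loudly if workers inserted fewer rows than were produced."""
    if inserted != produced:
        raise RuntimeError(
            f"partial load: produced {produced} rows, inserted {inserted}"
        )
```

Checking the deadline inside the wait loop bounds the time a producer can spend blocked on a stalled consumer, and comparing the two counters after shutdown turns a silent partial load into an explicit failure.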

sre-ci-robot · 2026-05-14T03:50:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wyfanxiao
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wyfanxiao mentioned this pull request May 14, 2026

feat(oceanbase): add multi-process loader, configurable index/partition params, and HNSW_BQ cosine support #769

Open

wyfanxiao wants to merge 1 commit into zilliztech:main from wyfanxiao:multiprocess-loader
