@Denise2004
Contributor

This PR adds full-text search (FTS) support to VectorDBBench so that the tool can benchmark Milvus's BM25 full-text search performance. The test cases are built on the MS MARCO dataset and cover several dataset scales.

Main Achievements

Adapted 3 FTS Performance Test Cases: Covers the 100K, 5M, and 8.8M MS MARCO datasets. Only the 100K version is submitted in this PR.

Complete FTS Dataset Management: Reads, parses, and batch-processes TSV-format data files.

Milvus FTS Client Integration: Implements full-text document insertion and BM25 search.

FTS-Specific Evaluation Metrics: Computes Recall@K, NDCG@K, MRR, and other metrics.

Frontend Interface Support: Adds FTS test case configuration and parameter settings to the Web UI.

Core Features

1. Dataset Support

  • Added FtsDatasetManager to manage MS MARCO datasets
  • Supports TSV-format collection, queries, and qrels files
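For readers unfamiliar with the three file types, here is a minimal sketch of how such TSV files could be parsed and batched. It is not the FtsDatasetManager code from this PR. The function names (`iter_id_text`, `read_qrels`, `batched`) are hypothetical, and the column layouts are assumptions: two columns (id, text) for collection and queries, and four columns (query id, iteration, document id, relevance) for qrels.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterator, List, Set, TextIO, Tuple


def iter_id_text(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (id, text) pairs from a two-column TSV stream (assumed layout)."""
    # QUOTE_NONE keeps quote characters inside passages as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


def read_qrels(handle: TextIO) -> Dict[str, Set[str]]:
    """Map each query id to the set of document ids judged relevant.

    Assumes four columns: query id, iteration, document id, relevance.
    """
    relevant: Dict[str, Set[str]] = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue
        qid, _iteration, doc_id, label = row[:4]
        if int(label) > 0:
            relevant[qid].add(doc_id)
    return dict(relevant)


def batched(pairs: Iterator[Tuple[str, str]], size: int) -> Iterator[List[Tuple[str, str]]]:
    """Group an iterator into lists of at most `size` items for batch insertion."""
    batch: List[Tuple[str, str]] = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# Tiny in-memory stand-ins for the real files.
collection = io.StringIO("0\tfirst passage\n1\tsecond passage\n2\tthird passage\n")
qrels = io.StringIO("q1\t0\t1\t1\nq1\t0\t2\t0\n")

batches = list(batched(iter_id_text(collection), size=2))
relevant_docs = read_qrels(qrels)
```

Streaming the collection through a generator keeps memory use flat, which matters more at the 5M and 8.8M scales than at 100K.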

2. Milvus FTS Integration

  • Implemented insert_fulltext() method: Batch insertion of full-text documents
  • Implemented search_fulltext() method: Full-text search based on the BM25 algorithm
  • Supports 8 configuration parameters, including index algorithm, BM25 parameters, tokenizer, and stop words
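To make the BM25 parameters concrete, the sketch below scores documents with the textbook Okapi BM25 formula in plain Python. It is an illustration only, not how Milvus or this PR computes scores. The whitespace tokenizer, the smoothed IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example. `k1` controls how quickly repeated terms saturate, and `b` controls how strongly long documents are penalized.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Return one Okapi BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so that a collection of empty documents does not divide by zero.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores: List[float] = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Length normalization: b=0 disables it, b=1 applies it fully.
        norm = 1.0 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
        scores.append(score)
    return scores


docs = [
    "milvus supports full text search",
    "vector search with milvus",
    "cooking pasta at home",
]
scores = bm25_scores("full text search", docs)
```

With these inputs, the first document matches all three query terms, the second matches only "search", and the third matches none, so the scores fall in that order.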

3. Test Execution Engine

  • Added SerialFtsInsertRunner: Executes FTS document insertion
  • Extended SerialSearchRunner and MultiProcessingSearchRunner to support FTS search testing
  • Supports both serial and concurrent search modes
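The difference between the two search modes can be sketched as follows. This is not the runner code from this PR. It uses a thread pool instead of multiprocessing to stay short, and `fake_search`, `run_serial`, and `run_concurrent` are hypothetical names, with `fake_search` standing in for a real client call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

SearchFn = Callable[[str], List[str]]


def timed_call(search: SearchFn, query: str) -> float:
    """Run one search and return its latency in seconds."""
    start = time.perf_counter()
    search(query)
    return time.perf_counter() - start


def run_serial(search: SearchFn, queries: Sequence[str]) -> Tuple[List[float], float]:
    """Issue queries one at a time; return per-query latencies and QPS."""
    start = time.perf_counter()
    latencies = [timed_call(search, q) for q in queries]
    elapsed = time.perf_counter() - start
    return latencies, (len(queries) / elapsed if elapsed > 0 else 0.0)


def run_concurrent(search: SearchFn, queries: Sequence[str], workers: int) -> Tuple[List[float], float]:
    """Issue queries from `workers` threads; return per-query latencies and QPS."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda q: timed_call(search, q), queries))
    elapsed = time.perf_counter() - start
    return latencies, (len(queries) / elapsed if elapsed > 0 else 0.0)


def fake_search(query: str) -> List[str]:
    """Hypothetical stand-in for a real search call; sleeps to simulate work."""
    time.sleep(0.01)
    return [query]


queries = [f"query {i}" for i in range(8)]
serial_latencies, serial_qps = run_serial(fake_search, queries)
concurrent_latencies, concurrent_qps = run_concurrent(fake_search, queries, workers=4)
```

Serial mode is the natural place to measure per-query latency and result quality, while concurrent mode measures throughput under load.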

4. Evaluation System

  • Added calc_recall_fts() and calc_ndcg_fts() functions
  • Support for Recall@K, NDCG@K, MRR, QPS, Latency and other metrics
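As a reference for the ranking metrics, here is a minimal sketch with binary relevance labels. It does not reproduce calc_recall_fts() or calc_ndcg_fts(), whose signatures are not shown here. The names `recall_at_k`, `ndcg_at_k`, and `reciprocal_rank` are hypothetical, and the sketch assumes that each ranked list contains unique document ids.

```python
import math
from typing import List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG@K with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], qrels: List[Set[str]]) -> float:
    """MRR: the reciprocal rank averaged over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, q) for r, q in zip(runs, qrels)) / len(runs)


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
mrr = mean_reciprocal_rank([ranked, ["d9"]], [relevant, {"d5"}])
```

In this example only one of the two relevant documents appears in the top 3, and it sits at rank 2, so both recall and NDCG fall below 1.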

@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Denise2004
To complete the pull request process, please assign xuanyang-cn after the PR has been reviewed.
You can assign the PR to them by writing /assign @xuanyang-cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhuwenxing
Collaborator

https://github.com/allenai/ir_datasets/

Can we use this Python library to manage the datasets for FTS? It contains most IR datasets!
