feat(logosdb): add LogosDB vector database integration #782
- Add LogosDB embedded HNSW client (local file-based, mmap, hnswlib)
- Config: LogosDBConfig (uri path) + LogosDBIndexConfig (metric type)
- Supports COSINE, L2, and IP distance metrics
- Uses put_batch for efficient bulk insert; metadata IDs stored as text
- Register DB.LogosDB enum, init_cls, config_cls, case_config_cls
- Register 'logosdb' CLI command in vectordbbench
- Add logosdb optional extra in pyproject.toml

Benchmark result (50K OpenAI 1536-dim, COSINE): recall@100=0.9347 ndcg=0.9464 p99=4.6ms p95=4.0ms
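The three metrics listed in the commit message can be illustrated with a small pure-Python sketch. It only shows the standard definitions and why normalizing vectors lets cosine similarity be computed as an inner product; it makes no claim about how LogosDB or hnswlib implement them internally, and all function names here are hypothetical.

```python
import math


def inner_product(a, b):
    """IP metric: plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """L2 metric: Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """COSINE metric: dot product divided by the product of the norms."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization, the inner product equals the cosine similarity.
print(cosine_similarity(a, b), inner_product(normalize(a), normalize(b)))
```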
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jose-compu
Can you please review, @sre-ci-robot @jkatz @javiervegas @claude?
XuanYang-cn left a comment
I found blockers in the current LogosDB integration. CI is red on the changed files, and the default command still enables a concurrent-search mode that conflicts with LogosDB's single-process database-path constraint.
@cli.command()
@click_parameter_decorators_from_typed_dict(LogosDBTypedDict)
def LogosDB(**parameters: Unpack[LogosDBTypedDict]):
must-change: LogosDB inherits search_concurrent=True from CommonTypedDict, but LogosDB documents one DB directory as single-process while VDBBench concurrent search starts multiple ProcessPoolExecutor workers against the same --uri. The default command can fail or report invalid concurrent-search results after loading. Set parameters["search_concurrent"] = False or reject --search-concurrent for LogosDB until a supported single-process concurrent runner exists.
Thanks for catching this. Fixed in the latest commit by hard-setting parameters["search_concurrent"] = False in the CLI handler.
Quick note: I did test multi-process concurrent reads empirically (4 Pool workers opening the same DB path and running 50 searches each) and all succeeded without errors (LogosDB's memory-mapped storage appears safe for concurrent readers). That said, since the official docs declare it single-process, disabling concurrent search is the right conservative call for now. Can revisit if/when LogosDB formally documents multi-reader support.
Fixed here: b932872
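A minimal sketch of the fix discussed in this thread, assuming the CLI handler receives its options as a plain dict. `force_single_process_search` is a hypothetical helper used only for illustration and is not part of VectorDBBench; the only name taken from the thread is the `search_concurrent` key.

```python
import logging

log = logging.getLogger(__name__)


def force_single_process_search(parameters: dict) -> dict:
    """Return a copy of the CLI parameters with concurrent search disabled."""
    updated = dict(parameters)  # copy so the caller's dict is left untouched
    if updated.get("search_concurrent"):
        log.warning("search_concurrent is not supported for this backend; disabling it")
    updated["search_concurrent"] = False
    return updated


original = {"uri": "/tmp/vdbbench_logosdb", "search_concurrent": True}
params = force_single_process_search(original)
print(params["search_concurrent"])
```

Overriding the value rather than rejecting the flag keeps the default command usable; rejecting it with an error would be the stricter alternative the reviewer also mentions.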
self.uri = db_config["uri"]
self.db = None

if drop_old and os.path.exists(self.uri):
must-change: this os.path.exists() call fails the repo ruff gate with PTH110, so all PR test jobs are red before unit tests run. Replace it with Path(self.uri).exists() and update the imports.
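A sketch of the requested change, assuming the `drop_old` branch deletes the existing database directory (the diff snippet does not show what follows the check, so the `shutil.rmtree` call is an assumption). `reset_db_dir` is a hypothetical helper name.

```python
import shutil
import tempfile
from pathlib import Path


def reset_db_dir(uri: str, drop_old: bool) -> bool:
    """Remove the directory at `uri` when drop_old is set.

    Returns True if a directory was removed, False otherwise.
    """
    path = Path(uri)
    # Path.exists() replaces os.path.exists(), which ruff flags as PTH110.
    if drop_old and path.exists():
        shutil.rmtree(path)  # assumption: the old data lives in a directory
        return True
    return False


# Demonstrate on a throwaway directory.
demo_dir = Path(tempfile.mkdtemp()) / "logosdb_demo"
demo_dir.mkdir()
removed = reset_db_dir(str(demo_dir), drop_old=True)
print(removed, demo_dir.exists())
```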
from ..backend.clients.endee.cli import Endee
from ..backend.clients.hologres.cli import HologresHGraph
from ..backend.clients.lancedb.cli import LanceDB
from ..backend.clients.logosdb.cli import LogosDB
must-change: this import placement fails ruff I001, so CI stays red after adding the command. Run ruff check --fix vectordb_bench/cli/vectordbbench.py or move the import to the order ruff expects.
…10 ruff rule Co-authored-by: Cursor <cursoragent@cursor.com>
… ruff I001 Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@XuanYang-cn the benchmark is running locally now; can you try CI again, please?
New results:
Processed from run
Test configuration
Results
Notes
Summary
- Implements the VectorDB interface: __init__, init context manager, insert_embeddings (via put_batch), search_embedding, and optimize.
- Registers DB.LogosDB in the enum and wires up init_cls, config_cls, and case_config_cls.
- Adds a logosdb CLI subcommand with a --uri flag (local directory path).
- Adds logosdb as an optional extra in pyproject.toml.

Design notes
- … --uri.
- … MetricType at runtime (COSINE/L2/IP). COSINE is the default and auto-normalizes vectors.
- IDs are stored as text metadata (str(id)) and parsed back on search, since LogosDB's internal row IDs are independent of the benchmark ID space.
- optimize() is a no-op with a log message.

Benchmark result
Tested on Performance1536D50K (OpenAI embeddings, 50K vectors, 1536 dim, COSINE) on Apple M-series:

Test plan
- pip install logosdb (binary wheels for Linux x86_64/aarch64 and macOS x86_64/arm64, CPython 3.9-3.13)
- vectordbbench logosdb --uri /tmp/vdbbench_logosdb --case-type Performance1536D50K --skip-search-concurrent
- … vectordb_bench/results/LogosDB/
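The design note about storing benchmark IDs as text and parsing them back on search can be sketched as follows. `InMemoryStore` is a hypothetical stand-in with a brute-force inner-product search; it does not reproduce the LogosDB API and only illustrates the `str(id)` / `int(text)` round trip.

```python
class InMemoryStore:
    """Hypothetical store that keeps a text payload next to each vector."""

    def __init__(self) -> None:
        self._rows = []  # list of (vector, text) pairs; position is the internal row ID

    def add(self, vectors, texts) -> None:
        self._rows.extend(zip(vectors, texts))

    def top_k_texts(self, query, k):
        """Return the text payloads of the k rows with the highest inner product."""
        def score(row):
            vector, _ = row
            return sum(q * v for q, v in zip(query, vector))

        ranked = sorted(self._rows, key=score, reverse=True)
        return [text for _, text in ranked[:k]]


def insert_with_ids(store, vectors, benchmark_ids) -> None:
    # Benchmark IDs travel as text because internal row IDs are unrelated to them.
    store.add(vectors, [str(i) for i in benchmark_ids])


def search_ids(store, query, k):
    # Parse the text payloads back into the benchmark's integer ID space.
    return [int(text) for text in store.top_k_texts(query, k)]


store = InMemoryStore()
insert_with_ids(store, [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], [1001, 1002, 1003])
print(search_ids(store, [1.0, 0.0], k=2))
```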