feat: use _source keyword to select the fields we want#138
feat: use _source keyword to select the fields we want#138hnwyllmm merged 23 commits intooceanbase:developfrom
Conversation
Add strict validation and normalization for return_fields in Collection.hybrid_search and HybridSearch: require List[str] or None, disallow empty strings, deduplicate while preserving order, and treat an explicit empty list as ["_id"]. Preserve deepcopy semantics where used. Reject legacy _source keyword in call sites.
- Default hybrid_search now infers a minimal GET_SQL _source allowlist
from include (avoid fetching large columns like embedding unless
requested)
- Keep explicit return_fields as an override
- Fix integration test inserts to target v2 table names
- Add unit + integration coverage for return_fields=None inference
📝 WalkthroughWalkthroughThis change adds an Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant BaseClient
participant SQLBuilder
participant Database
Client->>BaseClient: hybrid_search(query, include)
BaseClient->>BaseClient: _build_search_parm(include)
BaseClient->>SQLBuilder: render GET_SQL with _search_parm(_source)
SQLBuilder->>Database: execute SQL
Database-->>SQLBuilder: rows
SQLBuilder-->>BaseClient: result rows
BaseClient-->>Client: filtered result (per include)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@tests/unit_tests/test_hybrid_search_return_fields.py`:
- Around line 10-18: The three import lines for BaseClient, Collection, and
HybridSearch contain unused " # noqa: E402" markers; remove the trailing "#
noqa: E402" from the import statements (the lines importing
pyseekdb.client.client_base.BaseClient, pyseekdb.client.collection.Collection,
and pyseekdb.client.hybrid_search.HybridSearch) so the file no longer has those
unused noqa directives and lint noise.
- Around line 140-170: Several stub methods in the test class have unused
parameters causing lint warnings; rename unused parameters by prefixing them
with an underscore. Update _build_source_fields(include) to
_build_source_fields(_include), _build_search_parm(query, knn, rank, n_results,
return_fields=None, dimension=None, **kwargs) to use _query, _knn, _rank,
_n_results, _return_fields, _dimension or at least prefix unused ones,
_execute_query_with_cursor(conn, sql, params, use_context_manager=True) to
_execute_query_with_cursor(_conn, sql, _params, _use_context_manager=True)
(keeping sql as-is if used), and _transform_sql_result(result_rows, include) to
_transform_sql_result(_result_rows, _include); leave _ensure_connection as-is if
it has no parameters to change. Ensure you only rename parameters (not usages)
so the stubs continue returning the same values.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/unit_tests/test_hybrid_search_return_fields.py`:
- Around line 11-19: The imports of BaseClient, Collection, and HybridSearch
occur before the test adds the local src to sys.path, so the path change has no
effect; move the sys.path manipulation (creation of project_root/src_root and
sys.path.insert(0, str(src_root))) to run before importing pyseekdb symbols
(BaseClient, Collection, HybridSearch) and ensure the test file imports Path and
sys prior to that manipulation (i.e., relocate the
project_root/src_root/sys.path.insert block above the three pyseekdb import
lines).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/client_base.py`:
- Around line 3108-3127: The _build_source_fields method currently rejects
singular aliases and non-source include items; update it to accept aliases and
ignore unrelated entries: keep the TypeError check on include, normalize include
items to lowercase, map singular aliases ("document"->"documents",
"metadata"->"metadatas", "embedding"->"embeddings") into the requested set, then
intersect that requested set with the allowed keys
{"documents","metadatas","embeddings"} (do not raise on other items like
"distances" or "ids"), and build the source list (starting with "_id") by
appending allowed[key] for each key in ("documents","metadatas","embeddings")
present in the canonicalized requested set; remove the ValueError path that
rejected unknown entries.
Summary
close #103
Solution Description
In hybrid_search, users use "include" on the SDK side to control the returned fields. However, "include" only filters data on the SDK side; in reality, the database returns data for all fields, which causes additional overhead. Therefore, "_source" is introduced to control the actual fields returned by the database. The main purpose is to reduce overhead, not to add a new filtering option. For users who are already using "include", efforts should be made to reduce overhead while keeping the change as unnoticeable as possible. The SDK will derive the minimal "_source" based on the "include" parameter to ensure that users are not affected while reducing overhead.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.