Skip to content

feat: use _source keyword to select the fields we want#138

Merged
hnwyllmm merged 23 commits intooceanbase:developfrom
WSL0809:feat/issue-103
Jan 31, 2026
Merged

feat: use _source keyword to select the fields we want#138
hnwyllmm merged 23 commits intooceanbase:developfrom
WSL0809:feat/issue-103

Conversation

@WSL0809
Copy link
Copy Markdown
Contributor

@WSL0809 WSL0809 commented Jan 23, 2026

Summary

close #103

Solution Description

In hybrid_search, users use "include" on the SDK side to control the returned fields. However, "include" only filters data on the SDK side; in reality, the database returns data for all fields, which causes additional overhead. Therefore, "_source" is introduced to control the actual fields returned by the database. The main purpose is to reduce overhead, not to add a new filtering option. For users who are already using "include", efforts should be made to reduce overhead while keeping the change as unnoticeable as possible. The SDK will derive the minimal "_source" based on the "include" parameter to ensure that users are not affected while reducing overhead.

Summary by CodeRabbit

  • New Features
    • Added an include parameter to hybrid search so users can request specific fields (documents, metadata, embeddings); results now always include the record id and return only requested fields in a deterministic order.
  • Tests
    • New unit and integration tests validating include-driven result shapes, SQL generation, and various include combinations.

✏️ Tip: You can customize this high-level summary in your review settings.

Add strict validation and normalization for return_fields in
Collection.hybrid_search and HybridSearch: require List[str] or None,
disallow empty strings, deduplicate while preserving order, and treat an
explicit empty list as ["_id"]. Preserve deepcopy semantics where used.
Reject legacy _source keyword in call sites.
  - Default hybrid_search now infers a minimal GET_SQL _source allowlist
    from include (avoid fetching large columns like embedding unless
    requested)
  - Keep explicit return_fields as an override
  - Fix integration test inserts to target v2 table names
  - Add unit + integration coverage for return_fields=None inference
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jan 23, 2026

📝 Walkthrough

Walkthrough

This change adds an include parameter to hybrid search flow, threads it into _build_search_parm, and deterministically constructs a _source allowlist (starting with _id) based on include, affecting GET_SQL field selection.

Changes

Cohort / File(s) Summary
Core Implementation
src/pyseekdb/client/client_base.py
Added include parameter to _collection_hybrid_search and _build_search_parm. Implemented _build_source_fields to validate include, produce a deterministic _source list (always starting with _id) with aliases for document/metadata/embedding, and ensure _search_parm["_source"] is set.
Integration Tests
tests/integration_tests/test_collection_hybrid_search_source_inference.py
New integration tests creating collections, inserting documents/embeddings/metadata, generating/executing GET_SQL, and asserting returned columns and embedding dimensions for various include values (None, [], single, combined).
Unit Tests
tests/unit_tests/test_hybrid_search_source_inference.py
New unit tests validating forwarding of include, _build_search_parm output _source under defaults and empty include, and _build_source_fields behavior including alias handling and embedding inclusion.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BaseClient
    participant SQLBuilder
    participant Database

    Client->>BaseClient: hybrid_search(query, include)
    BaseClient->>BaseClient: _build_search_parm(include)
    BaseClient->>SQLBuilder: render GET_SQL with _search_parm(_source)
    SQLBuilder->>Database: execute SQL
    Database-->>SQLBuilder: rows
    SQLBuilder-->>BaseClient: result rows
    BaseClient-->>Client: filtered result (per include)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐇 I nibble fields with careful cheer,
"_id" first, then what you hold dear.
Include the docs, or metadata's song,
Or embeddings small — no extra throng.
Query light, the rabbit hops along.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: using the _source keyword to select specific fields in hybrid_search, reducing unnecessary data transfer.
Linked Issues check ✅ Passed The PR successfully implements the core requirement from issue #103: adding _source support to select minimal fields in hybrid_search, reducing data transfer for large columns like embeddings.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing _source inference: updates to _collection_hybrid_search and _build_search_parm signatures, _source field logic, and corresponding unit/integration tests. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tests/unit_tests/test_hybrid_search_return_fields.py`:
- Around line 10-18: The three import lines for BaseClient, Collection, and
HybridSearch contain unused "  # noqa: E402" markers; remove the trailing "#
noqa: E402" from the import statements (the lines importing
pyseekdb.client.client_base.BaseClient, pyseekdb.client.collection.Collection,
and pyseekdb.client.hybrid_search.HybridSearch) so the file no longer has those
unused noqa directives and lint noise.
- Around line 140-170: Several stub methods in the test class have unused
parameters causing lint warnings; rename unused parameters by prefixing them
with an underscore. Update _build_source_fields(include) to
_build_source_fields(_include), _build_search_parm(query, knn, rank, n_results,
return_fields=None, dimension=None, **kwargs) to use _query, _knn, _rank,
_n_results, _return_fields, _dimension or at least prefix unused ones,
_execute_query_with_cursor(conn, sql, params, use_context_manager=True) to
_execute_query_with_cursor(_conn, sql, _params, _use_context_manager=True)
(keeping sql as-is if used), and _transform_sql_result(result_rows, include) to
_transform_sql_result(_result_rows, _include); leave _ensure_connection as-is if
it has no parameters to change. Ensure you only rename parameters (not usages)
so the stubs continue returning the same values.

Comment thread tests/unit_tests/test_hybrid_search_return_fields.py Outdated
Comment thread tests/unit_tests/test_hybrid_search_return_fields.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/unit_tests/test_hybrid_search_return_fields.py`:
- Around line 11-19: The imports of BaseClient, Collection, and HybridSearch
occur before the test adds the local src to sys.path, so the path change has no
effect; move the sys.path manipulation (creation of project_root/src_root and
sys.path.insert(0, str(src_root))) to run before importing pyseekdb symbols
(BaseClient, Collection, HybridSearch) and ensure the test file imports Path and
sys prior to that manipulation (i.e., relocate the
project_root/src_root/sys.path.insert block above the three pyseekdb import
lines).

Comment thread tests/unit_tests/test_hybrid_search_return_fields.py Outdated
@WSL0809 WSL0809 changed the title feat: add return_fields to hybrid_search feat: use _source keyword to select the fields we want Jan 28, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/client_base.py`:
- Around line 3108-3127: The _build_source_fields method currently rejects
singular aliases and non-source include items; update it to accept aliases and
ignore unrelated entries: keep the TypeError check on include, normalize include
items to lowercase, map singular aliases ("document"->"documents",
"metadata"->"metadatas", "embedding"->"embeddings") into the requested set, then
intersect that requested set with the allowed keys
{"documents","metadatas","embeddings"} (do not raise on other items like
"distances" or "ids"), and build the source list (starting with "_id") by
appending allowed[key] for each key in ("documents","metadatas","embeddings")
present in the canonicalized requested set; remove the ValueError path that
rejected unknown entries.

Comment thread src/pyseekdb/client/client_base.py
@hnwyllmm hnwyllmm merged commit 703fd58 into oceanbase:develop Jan 31, 2026
7 checks passed
@WSL0809 WSL0809 deleted the feat/issue-103 branch January 31, 2026 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement]: use _source keyword to select the fields we want

2 participants