Skip to content

feat: add serialization helpers for SearchIndex and AsyncSearchIndex#525

Open
thakoreh wants to merge 2 commits intoredis:mainfrom
thakoreh:feat/searchindex-serialization
Open

feat: add serialization helpers for SearchIndex and AsyncSearchIndex#525
thakoreh wants to merge 2 commits intoredis:mainfrom
thakoreh:feat/searchindex-serialization

Conversation

@thakoreh
Copy link

@thakoreh thakoreh commented Mar 3, 2026

Summary

Features

to_dict(include_connection=False)

Serializes the index configuration to a dictionary:

config = index.to_dict()
new_index = SearchIndex.from_dict(config)

to_yaml(path, include_connection=False)

Writes the configuration to a YAML file:

index.to_yaml("schemas/my_index.yaml")

Security

By default, connection parameters are not included for security:

  • Passwords are never serialized
  • Only safe connection keys are included when include_connection=True:
    • ssl, decode_responses, socket_timeout, socket_connect_timeout

Testing

Added comprehensive unit tests in tests/unit/test_index_serialization.py:

  • test_to_dict_without_connection - Schema only
  • test_to_dict_with_connection - Includes safe connection params
  • test_to_yaml - YAML file output
  • test_roundtrip_dict - Round-trip serialization

Definition of Done

  • Round-trip config serialization is supported and tested
  • Sensitive connection details can be omitted/redacted
  • Existing index creation and usage remain non-breaking

Note

Low Risk
Additive serialization methods plus unit tests; default behavior avoids leaking sensitive connection data, so overall impact is low risk.

Overview
Adds to_dict(include_connection=False) and to_yaml(path, include_connection=False) to SearchIndex and AsyncSearchIndex to export index schema/config, optionally including only a whitelisted subset of connection settings (and excluding secrets like passwords by default).

Introduces new unit coverage for serialization round-trips and YAML output, and unflakes test_text_query_word_weights by asserting presence of key query fragments instead of exact token ordering.

Written by Cursor Bugbot for commit 9ebc8c0. This will update automatically on new commits. Configure here.

thakoreh added 2 commits March 3, 2026 00:02
…ions

The test was flaky because it relied on exact string matching with
specific token ordering. Dict iteration order could cause tokens to
appear in different positions across Python versions.

Fix: Replace exact string match with component-based assertions that
check for expected query parts without relying on ordering.

Closes redis#497
Add to_dict() and to_yaml() methods to both SearchIndex and AsyncSearchIndex
classes for improved automation and reproducibility.

Features:
- to_dict() serializes schema configuration to dictionary
- to_yaml() writes configuration to YAML file
- Optional include_connection parameter with security-safe defaults
- Sensitive fields (passwords) are never serialized
- Only safe connection keys are included (ssl, socket_timeout, etc.)

Closes redis#493
Copilot AI review requested due to automatic review settings March 3, 2026 05:10
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

"""
config = self.schema.to_dict()
if include_connection:
config["_redis_url"] = self._redis_url
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis URL can leak passwords despite security claim

High Severity

The _redis_url is serialized as-is when include_connection=True, but Redis URLs commonly embed passwords (e.g., redis://:secret@host:6379 or redis://user:password@host:6379). The _connection_kwargs are carefully filtered through safe_keys, but _redis_url receives no sanitization. This contradicts the PR's stated guarantee that "Passwords are never serialized."

Additional Locations (1)

Fix in Cursor Fix in Web

import yaml
config = self.to_dict(include_connection=include_connection)
with open(path, "w") as f:
yaml.dump(config, f, default_flow_style=False, sort_keys=False)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identical serialization methods duplicated across both index classes

Low Severity

to_dict and to_yaml are copy-pasted identically in both SearchIndex and AsyncSearchIndex. Both subclasses have self._redis_url, self._connection_kwargs, and self.schema, and neither method is async. These could live in BaseSearchIndex, avoiding the maintenance risk of the safe_keys set or serialization logic diverging between the two copies.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds to_dict() and to_yaml() serialization helpers to both SearchIndex and AsyncSearchIndex, allowing index configurations to be persisted and restored. It also fixes a flaky test in test_query_types.py by replacing an exact-string assertion (which was sensitive to non-deterministic token ordering) with individual token presence checks.

Changes:

  • Adds to_dict(include_connection=False) and to_yaml(path, include_connection=False) methods to SearchIndex and AsyncSearchIndex for serializing index configuration with optional connection parameters (excluding sensitive fields like passwords).
  • Adds new unit tests in tests/unit/test_index_serialization.py covering schema-only, connection-included, YAML output, and round-trip scenarios for both sync and async index classes.
  • Fixes the flaky test_text_query_word_weights test in test_query_types.py by removing the pytest.mark.skip and replacing the brittle exact-string assertion with individual component assertions that are not sensitive to token ordering.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

File Description
redisvl/index/index.py Adds to_dict and to_yaml serialization methods to both SearchIndex and AsyncSearchIndex
tests/unit/test_index_serialization.py New test file with unit tests for the serialization methods on both index classes
tests/unit/test_query_types.py Removes pytest.mark.skip and replaces exact-string assertion with token-presence checks to fix the flaky test

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"""
config = self.schema.to_dict()
if include_connection:
config["_redis_url"] = self._redis_url
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When include_connection=True is passed and _redis_url is None (e.g., when the index was initialized with a redis_client directly rather than a URL), the serialized dict will contain "_redis_url": None. This None value is then written to YAML, providing no useful connection info and potentially surprising users who call to_dict(include_connection=True). The code should either omit the _redis_url key when it is None, or document this behavior explicitly.

Suggested change
config["_redis_url"] = self._redis_url
if self._redis_url is not None:
config["_redis_url"] = self._redis_url

Copilot uses AI. Check for mistakes.
Example:
>>> index.to_yaml("schemas/my_index.yaml")
"""
import yaml
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The yaml module is imported inline inside both to_yaml methods (import yaml within the function body), instead of at the module level. The schema.py module (which index.py already imports from) imports yaml at the top of the file. Since pyyaml is a declared project dependency, it is always available, and there is no need for a lazy import here. Importing at the top of the module is the established pattern in this codebase and avoids repeating the import on every call.

Copilot uses AI. Check for mistakes.
Comment on lines +97 to +108
def test_roundtrip_dict(self, sample_schema):
"""Test that to_dict() and from_dict() roundtrip correctly."""
original_index = SearchIndex(
schema=sample_schema,
connection_kwargs={"ssl": True},
)

config = original_index.to_dict()
restored_index = SearchIndex.from_dict(config)

assert restored_index.schema.index.name == original_index.schema.index.name
assert restored_index.schema.index.prefix == original_index.schema.index.prefix
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_roundtrip_dict tests (for both SearchIndex and AsyncSearchIndex) only call to_dict() without include_connection=True, so they never exercise the connection-parameter serialization path. As a result, the fact that from_dict silently discards _redis_url and _connection_kwargs (because IndexSchema.from_dict/model_validate ignores unknown keys) is not caught by any test. A test that serializes with include_connection=True and then calls from_dict, checking that _redis_url and _connection_kwargs are restored on the reconstructed index, is needed to validate the full round-trip contract.

Copilot uses AI. Check for mistakes.
Comment on lines +2324 to +2338
def to_yaml(self, path: str, include_connection: bool = False) -> None:
"""Serialize the index configuration to a YAML file.

Args:
path (str): Path to write the YAML file.
include_connection (bool, optional): Whether to include connection
parameters. Defaults to False for security.

Example:
>>> await index.to_yaml("schemas/my_index.yaml")
"""
import yaml
config = self.to_dict(include_connection=include_connection)
with open(path, "w") as f:
yaml.dump(config, f, default_flow_style=False, sort_keys=False)
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SearchIndex.to_yaml and AsyncSearchIndex.to_yaml methods always silently overwrite an existing file, but the existing IndexSchema.to_yaml method has an overwrite=True parameter that allows raising a FileExistsError when overwriting is not desired. For API consistency, the new to_yaml methods should also accept an overwrite parameter with the same semantics as IndexSchema.to_yaml.

Copilot uses AI. Check for mistakes.
parameters. Defaults to False for security.

Example:
>>> await index.to_yaml("schemas/my_index.yaml")
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The to_yaml docstring for AsyncSearchIndex incorrectly uses await in the example, suggesting it is an async method. However, to_yaml is defined as a regular synchronous method (def to_yaml, not async def to_yaml). Calling await index.to_yaml(...) will raise a TypeError at runtime. The example should be >>> index.to_yaml("schemas/my_index.yaml") without await.

Suggested change
>>> await index.to_yaml("schemas/my_index.yaml")
>>> index.to_yaml("schemas/my_index.yaml")

Copilot uses AI. Check for mistakes.
Comment on lines +2297 to +2338
def to_dict(self, include_connection: bool = False) -> Dict[str, Any]:
"""Serialize the index configuration to a dictionary.

Args:
include_connection (bool, optional): Whether to include connection
parameters. Defaults to False for security (passwords/URLs
are excluded by default).

Returns:
Dict[str, Any]: Dictionary representation of the index configuration.

Example:
>>> config = index.to_dict()
>>> new_index = AsyncSearchIndex.from_dict(config)
"""
config = self.schema.to_dict()
if include_connection:
config["_redis_url"] = self._redis_url
# Note: connection_kwargs may contain sensitive info
# Only include non-sensitive keys
safe_keys = {"decode_responses", "ssl", "socket_timeout", "socket_connect_timeout"}
config["_connection_kwargs"] = {
k: v for k, v in self._connection_kwargs.items()
if k in safe_keys
}
return config

def to_yaml(self, path: str, include_connection: bool = False) -> None:
"""Serialize the index configuration to a YAML file.

Args:
path (str): Path to write the YAML file.
include_connection (bool, optional): Whether to include connection
parameters. Defaults to False for security.

Example:
>>> await index.to_yaml("schemas/my_index.yaml")
"""
import yaml
config = self.to_dict(include_connection=include_connection)
with open(path, "w") as f:
yaml.dump(config, f, default_flow_style=False, sort_keys=False)
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both to_dict and to_yaml methods are implemented identically in SearchIndex (lines 1297–1338) and AsyncSearchIndex (lines 2297–2338). Since BaseSearchIndex already houses shared methods like from_dict, from_yaml, key, etc., these two methods should be moved to BaseSearchIndex to eliminate the duplicate code and make future maintenance easier. Any change to the serialization logic would currently need to be applied in two places.

Copilot uses AI. Check for mistakes.
"""
config = self.schema.to_dict()
if include_connection:
config["_redis_url"] = self._redis_url
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When include_connection=True is passed and _redis_url is None (e.g., when the index was initialized with a redis_client directly rather than a URL), the serialized dict will contain "_redis_url": None. This None value is then written to YAML, providing no useful connection info and potentially surprising users who call to_dict(include_connection=True). The code should either omit the _redis_url key when it is None, or document this behavior explicitly.

Suggested change
config["_redis_url"] = self._redis_url
# Only include the Redis URL if it is available; when the index was
# initialized with a Redis client directly, there may be no URL.
if self._redis_url is not None:
config["_redis_url"] = self._redis_url

Copilot uses AI. Check for mistakes.
Comment on lines +1312 to +1322
config = self.schema.to_dict()
if include_connection:
config["_redis_url"] = self._redis_url
# Note: connection_kwargs may contain sensitive info
# Only include non-sensitive keys
safe_keys = {"decode_responses", "ssl", "socket_timeout", "socket_connect_timeout"}
config["_connection_kwargs"] = {
k: v for k, v in self._connection_kwargs.items()
if k in safe_keys
}
return config
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from_dict method in BaseSearchIndex (line 375) passes the entire dict to IndexSchema.from_dict, which only parses the schema portion (index, fields, version). The _redis_url and _connection_kwargs keys are silently discarded by Pydantic's model validation. This means that even when serialized with include_connection=True, the connection parameters are lost during deserialization — the round-trip with connection data is incomplete.

To fix this, from_dict should be updated to extract _redis_url and _connection_kwargs from the dict before passing it to IndexSchema.from_dict, and then use those to initialize the index (e.g., pass them as redis_url and connection_kwargs kwargs). The existing test_roundtrip_dict tests don't catch this because they call to_dict() without include_connection=True.

Copilot uses AI. Check for mistakes.
Example:
>>> await index.to_yaml("schemas/my_index.yaml")
"""
import yaml
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The yaml module is imported inline inside both to_yaml methods (import yaml within the function body), instead of at the module level. The schema.py module (which index.py already imports from) imports yaml at the top of the file. Since pyyaml is a declared project dependency, it is always available, and there is no need for a lazy import here. Importing at the top of the module is the established pattern in this codebase and avoids repeating the import on every call.

Copilot uses AI. Check for mistakes.
Comment on lines +1324 to +1338
def to_yaml(self, path: str, include_connection: bool = False) -> None:
"""Serialize the index configuration to a YAML file.

Args:
path (str): Path to write the YAML file.
include_connection (bool, optional): Whether to include connection
parameters. Defaults to False for security.

Example:
>>> index.to_yaml("schemas/my_index.yaml")
"""
import yaml
config = self.to_dict(include_connection=include_connection)
with open(path, "w") as f:
yaml.dump(config, f, default_flow_style=False, sort_keys=False)
Copy link

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SearchIndex.to_yaml and AsyncSearchIndex.to_yaml methods always silently overwrite an existing file, but the existing IndexSchema.to_yaml method has an overwrite=True parameter that allows raising a FileExistsError when overwriting is not desired. For API consistency, the new to_yaml methods should also accept an overwrite parameter with the same semantics as IndexSchema.to_yaml.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add SearchIndex/AsyncSearchIndex serialization helpers

2 participants