feat: add serialization helpers for SearchIndex and AsyncSearchIndex#525
feat: add serialization helpers for SearchIndex and AsyncSearchIndex#525thakoreh wants to merge 2 commits intoredis:mainfrom
Conversation
…ions The test was flaky because it relied on exact string matching with specific token ordering. Dict iteration order could cause tokens to appear in different positions across Python versions. Fix: Replace exact string match with component-based assertions that check for expected query parts without relying on ordering. Closes redis#497
Add to_dict() and to_yaml() methods to both SearchIndex and AsyncSearchIndex classes for improved automation and reproducibility. Features: - to_dict() serializes schema configuration to dictionary - to_yaml() writes configuration to YAML file - Optional include_connection parameter with security-safe defaults - Sensitive fields (passwords) are never serialized - Only safe connection keys are included (ssl, socket_timeout, etc.) Closes redis#493
| """ | ||
| config = self.schema.to_dict() | ||
| if include_connection: | ||
| config["_redis_url"] = self._redis_url |
There was a problem hiding this comment.
Redis URL can leak passwords despite security claim
High Severity
The _redis_url is serialized as-is when include_connection=True, but Redis URLs commonly embed passwords (e.g., redis://:secret@host:6379 or redis://user:password@host:6379). The _connection_kwargs are carefully filtered through safe_keys, but _redis_url receives no sanitization. This contradicts the PR's stated guarantee that "Passwords are never serialized."
Additional Locations (1)
| import yaml | ||
| config = self.to_dict(include_connection=include_connection) | ||
| with open(path, "w") as f: | ||
| yaml.dump(config, f, default_flow_style=False, sort_keys=False) |
There was a problem hiding this comment.
Identical serialization methods duplicated across both index classes
Low Severity
to_dict and to_yaml are copy-pasted identically in both SearchIndex and AsyncSearchIndex. Both subclasses have self._redis_url, self._connection_kwargs, and self.schema, and neither method is async. These could live in BaseSearchIndex, avoiding the maintenance risk of the safe_keys set or serialization logic diverging between the two copies.
Additional Locations (1)
There was a problem hiding this comment.
Pull request overview
This PR adds to_dict() and to_yaml() serialization helpers to both SearchIndex and AsyncSearchIndex, allowing index configurations to be persisted and restored. It also fixes a flaky test in test_query_types.py by replacing an exact-string assertion (which was sensitive to non-deterministic token ordering) with individual token presence checks.
Changes:
- Adds
to_dict(include_connection=False)andto_yaml(path, include_connection=False)methods toSearchIndexandAsyncSearchIndexfor serializing index configuration with optional connection parameters (excluding sensitive fields like passwords). - Adds new unit tests in
tests/unit/test_index_serialization.pycovering schema-only, connection-included, YAML output, and round-trip scenarios for both sync and async index classes. - Fixes the flaky
test_text_query_word_weightstest intest_query_types.pyby removing thepytest.mark.skipand replacing the brittle exact-string assertion with individual component assertions that are not sensitive to token ordering.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
redisvl/index/index.py |
Adds to_dict and to_yaml serialization methods to both SearchIndex and AsyncSearchIndex |
tests/unit/test_index_serialization.py |
New test file with unit tests for the serialization methods on both index classes |
tests/unit/test_query_types.py |
Removes pytest.mark.skip and replaces exact-string assertion with token-presence checks to fix the flaky test |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """ | ||
| config = self.schema.to_dict() | ||
| if include_connection: | ||
| config["_redis_url"] = self._redis_url |
There was a problem hiding this comment.
When include_connection=True is passed and _redis_url is None (e.g., when the index was initialized with a redis_client directly rather than a URL), the serialized dict will contain "_redis_url": None. This None value is then written to YAML, providing no useful connection info and potentially surprising users who call to_dict(include_connection=True). The code should either omit the _redis_url key when it is None, or document this behavior explicitly.
| config["_redis_url"] = self._redis_url | |
| if self._redis_url is not None: | |
| config["_redis_url"] = self._redis_url |
| Example: | ||
| >>> index.to_yaml("schemas/my_index.yaml") | ||
| """ | ||
| import yaml |
There was a problem hiding this comment.
The yaml module is imported inline inside both to_yaml methods (import yaml within the function body), instead of at the module level. The schema.py module (which index.py already imports from) imports yaml at the top of the file. Since pyyaml is a declared project dependency, it is always available, and there is no need for a lazy import here. Importing at the top of the module is the established pattern in this codebase and avoids repeating the import on every call.
| def test_roundtrip_dict(self, sample_schema): | ||
| """Test that to_dict() and from_dict() roundtrip correctly.""" | ||
| original_index = SearchIndex( | ||
| schema=sample_schema, | ||
| connection_kwargs={"ssl": True}, | ||
| ) | ||
|
|
||
| config = original_index.to_dict() | ||
| restored_index = SearchIndex.from_dict(config) | ||
|
|
||
| assert restored_index.schema.index.name == original_index.schema.index.name | ||
| assert restored_index.schema.index.prefix == original_index.schema.index.prefix |
There was a problem hiding this comment.
The test_roundtrip_dict tests (for both SearchIndex and AsyncSearchIndex) only call to_dict() without include_connection=True, so they never exercise the connection-parameter serialization path. As a result, the fact that from_dict silently discards _redis_url and _connection_kwargs (because IndexSchema.from_dict/model_validate ignores unknown keys) is not caught by any test. A test that serializes with include_connection=True and then calls from_dict, checking that _redis_url and _connection_kwargs are restored on the reconstructed index, is needed to validate the full round-trip contract.
| def to_yaml(self, path: str, include_connection: bool = False) -> None: | ||
| """Serialize the index configuration to a YAML file. | ||
|
|
||
| Args: | ||
| path (str): Path to write the YAML file. | ||
| include_connection (bool, optional): Whether to include connection | ||
| parameters. Defaults to False for security. | ||
|
|
||
| Example: | ||
| >>> await index.to_yaml("schemas/my_index.yaml") | ||
| """ | ||
| import yaml | ||
| config = self.to_dict(include_connection=include_connection) | ||
| with open(path, "w") as f: | ||
| yaml.dump(config, f, default_flow_style=False, sort_keys=False) |
There was a problem hiding this comment.
The SearchIndex.to_yaml and AsyncSearchIndex.to_yaml methods always silently overwrite an existing file, but the existing IndexSchema.to_yaml method has an overwrite=True parameter that allows raising a FileExistsError when overwriting is not desired. For API consistency, the new to_yaml methods should also accept an overwrite parameter with the same semantics as IndexSchema.to_yaml.
| parameters. Defaults to False for security. | ||
|
|
||
| Example: | ||
| >>> await index.to_yaml("schemas/my_index.yaml") |
There was a problem hiding this comment.
The to_yaml docstring for AsyncSearchIndex incorrectly uses await in the example, suggesting it is an async method. However, to_yaml is defined as a regular synchronous method (def to_yaml, not async def to_yaml). Calling await index.to_yaml(...) will raise a TypeError at runtime. The example should be >>> index.to_yaml("schemas/my_index.yaml") without await.
| >>> await index.to_yaml("schemas/my_index.yaml") | |
| >>> index.to_yaml("schemas/my_index.yaml") |
| def to_dict(self, include_connection: bool = False) -> Dict[str, Any]: | ||
| """Serialize the index configuration to a dictionary. | ||
|
|
||
| Args: | ||
| include_connection (bool, optional): Whether to include connection | ||
| parameters. Defaults to False for security (passwords/URLs | ||
| are excluded by default). | ||
|
|
||
| Returns: | ||
| Dict[str, Any]: Dictionary representation of the index configuration. | ||
|
|
||
| Example: | ||
| >>> config = index.to_dict() | ||
| >>> new_index = AsyncSearchIndex.from_dict(config) | ||
| """ | ||
| config = self.schema.to_dict() | ||
| if include_connection: | ||
| config["_redis_url"] = self._redis_url | ||
| # Note: connection_kwargs may contain sensitive info | ||
| # Only include non-sensitive keys | ||
| safe_keys = {"decode_responses", "ssl", "socket_timeout", "socket_connect_timeout"} | ||
| config["_connection_kwargs"] = { | ||
| k: v for k, v in self._connection_kwargs.items() | ||
| if k in safe_keys | ||
| } | ||
| return config | ||
|
|
||
| def to_yaml(self, path: str, include_connection: bool = False) -> None: | ||
| """Serialize the index configuration to a YAML file. | ||
|
|
||
| Args: | ||
| path (str): Path to write the YAML file. | ||
| include_connection (bool, optional): Whether to include connection | ||
| parameters. Defaults to False for security. | ||
|
|
||
| Example: | ||
| >>> await index.to_yaml("schemas/my_index.yaml") | ||
| """ | ||
| import yaml | ||
| config = self.to_dict(include_connection=include_connection) | ||
| with open(path, "w") as f: | ||
| yaml.dump(config, f, default_flow_style=False, sort_keys=False) |
There was a problem hiding this comment.
Both to_dict and to_yaml methods are implemented identically in SearchIndex (lines 1297–1338) and AsyncSearchIndex (lines 2297–2338). Since BaseSearchIndex already houses shared methods like from_dict, from_yaml, key, etc., these two methods should be moved to BaseSearchIndex to eliminate the duplicate code and make future maintenance easier. Any change to the serialization logic would currently need to be applied in two places.
| """ | ||
| config = self.schema.to_dict() | ||
| if include_connection: | ||
| config["_redis_url"] = self._redis_url |
There was a problem hiding this comment.
When include_connection=True is passed and _redis_url is None (e.g., when the index was initialized with a redis_client directly rather than a URL), the serialized dict will contain "_redis_url": None. This None value is then written to YAML, providing no useful connection info and potentially surprising users who call to_dict(include_connection=True). The code should either omit the _redis_url key when it is None, or document this behavior explicitly.
| config["_redis_url"] = self._redis_url | |
| # Only include the Redis URL if it is available; when the index was | |
| # initialized with a Redis client directly, there may be no URL. | |
| if self._redis_url is not None: | |
| config["_redis_url"] = self._redis_url |
| config = self.schema.to_dict() | ||
| if include_connection: | ||
| config["_redis_url"] = self._redis_url | ||
| # Note: connection_kwargs may contain sensitive info | ||
| # Only include non-sensitive keys | ||
| safe_keys = {"decode_responses", "ssl", "socket_timeout", "socket_connect_timeout"} | ||
| config["_connection_kwargs"] = { | ||
| k: v for k, v in self._connection_kwargs.items() | ||
| if k in safe_keys | ||
| } | ||
| return config |
There was a problem hiding this comment.
The from_dict method in BaseSearchIndex (line 375) passes the entire dict to IndexSchema.from_dict, which only parses the schema portion (index, fields, version). The _redis_url and _connection_kwargs keys are silently discarded by Pydantic's model validation. This means that even when serialized with include_connection=True, the connection parameters are lost during deserialization — the round-trip with connection data is incomplete.
To fix this, from_dict should be updated to extract _redis_url and _connection_kwargs from the dict before passing it to IndexSchema.from_dict, and then use those to initialize the index (e.g., pass them as redis_url and connection_kwargs kwargs). The existing test_roundtrip_dict tests don't catch this because they call to_dict() without include_connection=True.
| Example: | ||
| >>> await index.to_yaml("schemas/my_index.yaml") | ||
| """ | ||
| import yaml |
There was a problem hiding this comment.
The yaml module is imported inline inside both to_yaml methods (import yaml within the function body), instead of at the module level. The schema.py module (which index.py already imports from) imports yaml at the top of the file. Since pyyaml is a declared project dependency, it is always available, and there is no need for a lazy import here. Importing at the top of the module is the established pattern in this codebase and avoids repeating the import on every call.
| def to_yaml(self, path: str, include_connection: bool = False) -> None: | ||
| """Serialize the index configuration to a YAML file. | ||
|
|
||
| Args: | ||
| path (str): Path to write the YAML file. | ||
| include_connection (bool, optional): Whether to include connection | ||
| parameters. Defaults to False for security. | ||
|
|
||
| Example: | ||
| >>> index.to_yaml("schemas/my_index.yaml") | ||
| """ | ||
| import yaml | ||
| config = self.to_dict(include_connection=include_connection) | ||
| with open(path, "w") as f: | ||
| yaml.dump(config, f, default_flow_style=False, sort_keys=False) |
There was a problem hiding this comment.
The SearchIndex.to_yaml and AsyncSearchIndex.to_yaml methods always silently overwrite an existing file, but the existing IndexSchema.to_yaml method has an overwrite=True parameter that allows raising a FileExistsError when overwriting is not desired. For API consistency, the new to_yaml methods should also accept an overwrite parameter with the same semantics as IndexSchema.to_yaml.


Summary
to_dict()andto_yaml()methods to both SearchIndex and AsyncSearchIndexFeatures
to_dict(include_connection=False)Serializes the index configuration to a dictionary:
to_yaml(path, include_connection=False)Writes the configuration to a YAML file:
Security
By default, connection parameters are not included for security:
include_connection=True:ssl,decode_responses,socket_timeout,socket_connect_timeoutTesting
Added comprehensive unit tests in
tests/unit/test_index_serialization.py:test_to_dict_without_connection- Schema onlytest_to_dict_with_connection- Includes safe connection paramstest_to_yaml- YAML file outputtest_roundtrip_dict- Round-trip serializationDefinition of Done
Note
Low Risk
Additive serialization methods plus unit tests; default behavior avoids leaking sensitive connection data, so overall impact is low risk.
Overview
Adds
to_dict(include_connection=False)andto_yaml(path, include_connection=False)toSearchIndexandAsyncSearchIndexto export index schema/config, optionally including only a whitelisted subset of connection settings (and excluding secrets like passwords by default).Introduces new unit coverage for serialization round-trips and YAML output, and unflakes
test_text_query_word_weightsby asserting presence of key query fragments instead of exact token ordering.Written by Cursor Bugbot for commit 9ebc8c0. This will update automatically on new commits. Configure here.