feat: add metadata inspection Tools to support agent-driven document retrieval #11000

@sjrl

Summary

Add new pre-made tools that an Agent can use to inspect the metadata structure and values of documents in a document store. This would give the agent the information it needs to construct meaningful filters before retrieval.

Motivation

When an agent needs to retrieve documents with metadata filters, it has no way of knowing what fields exist or what values they contain. Without this information it must guess, so in practice we don't expose filters as an option to the Agent when using a retrieval pipeline tool.

Implementation

Each tool would be a Tool subclass that accepts any DocumentStore instance. We would utilize the newly introduced methods:

  • get_metadata_fields_info() - returns all metadata fields and their types
  • get_metadata_field_unique_values(field) - returns distinct values for a field
  • get_metadata_field_min_max(field) - returns the numeric range for a field
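
To make the shape of these calls concrete, here is a minimal sketch that uses a hypothetical in-memory stand-in (`ToyStore`) instead of a real DocumentStore. The return shapes shown (a mapping of field name to `{"type": ...}`, a list of distinct values, and a `{"min": ..., "max": ...}` dict) are assumptions for illustration only; actual document stores may return different structures.

```python
from typing import Any


class ToyStore:
    """Hypothetical in-memory stand-in; NOT a real Haystack DocumentStore."""

    def __init__(self, metas: list[dict[str, Any]]) -> None:
        self.metas = metas

    def get_metadata_fields_info(self) -> dict[str, dict[str, str]]:
        # Assumed shape: {field_name: {"type": <Python type name>}}
        info: dict[str, dict[str, str]] = {}
        for meta in self.metas:
            for key, value in meta.items():
                info.setdefault(key, {"type": type(value).__name__})
        return info

    def get_metadata_field_unique_values(self, field: str) -> list[Any]:
        # Distinct values, sorted so the order is stable
        return sorted({meta[field] for meta in self.metas if field in meta})

    def get_metadata_field_min_max(self, field: str) -> dict[str, Any]:
        # Assumed shape: {"min": ..., "max": ...}
        values = [meta[field] for meta in self.metas if field in meta]
        return {"min": min(values), "max": max(values)}


store = ToyStore(
    [
        {"category": "news", "year": 2021},
        {"category": "blog", "year": 2023},
        {"category": "news", "year": 2019},
    ]
)
fields = store.get_metadata_fields_info()
categories = store.get_metadata_field_unique_values("category")
year_range = store.get_metadata_field_min_max("year")
```

With this toy data, an agent would learn that `category` is a string field with two distinct values and that `year` spans a bounded integer range, which is the information it needs before writing a filter.
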

from typing import Any

from haystack.core.serialization import generate_qualified_class_name
from haystack.document_stores.types import DocumentStore
from haystack.tools import Tool
from haystack.utils.deserialization import deserialize_component_inplace


class ListMetadataFieldsTool(Tool):
    """Tool that lists all metadata fields and their types from a document store."""

    def __init__(self, document_store: DocumentStore) -> None:
        self.document_store = document_store
        super().__init__(
            name="list_metadata_fields",
            description=(
                "Returns all metadata fields available on documents and their types "
                "(e.g. keyword, long, date). Call this first to understand what fields "
                "you can filter on."
            ),
            parameters={"type": "object", "properties": {}},
            function=self._list_metadata_fields,
        )

    def _list_metadata_fields(self) -> dict:
        return self.document_store.get_metadata_fields_info()

    def to_dict(self) -> dict[str, Any]:
        return {
            "type": generate_qualified_class_name(type(self)),
            "data": {"document_store": self.document_store.to_dict()},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ListMetadataFieldsTool":
        inner_data = data["data"]
        deserialize_component_inplace(inner_data, key="document_store")
        return cls(**inner_data)


class GetMetadataFieldValuesTool(Tool):
    """Tool that returns the distinct values for a given metadata field."""

    def __init__(self, document_store: DocumentStore) -> None:
        self.document_store = document_store
        super().__init__(
            name="get_metadata_field_values",
            description=(
                "Returns the distinct values present for a given metadata field. "
                "Use this to understand what values a field can take before building a filter."
            ),
            parameters={
                "type": "object",
                "properties": {
                    "field": {"type": "string", "description": "The metadata field name."}
                },
                "required": ["field"],
            },
            function=self._get_metadata_field_values,
        )

    def _get_metadata_field_values(self, field: str) -> list:
        return self.document_store.get_metadata_field_unique_values(field=field)

    def to_dict(self) -> dict[str, Any]:
        return {
            "type": generate_qualified_class_name(type(self)),
            "data": {"document_store": self.document_store.to_dict()},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "GetMetadataFieldValuesTool":
        inner_data = data["data"]
        deserialize_component_inplace(inner_data, key="document_store")
        return cls(**inner_data)


class GetMetadataFieldRangeTool(Tool):
    """Tool that returns the min and max values for a numeric metadata field."""

    def __init__(self, document_store: DocumentStore) -> None:
        self.document_store = document_store
        super().__init__(
            name="get_metadata_field_range",
            description=(
                "Returns the minimum and maximum values for a numeric metadata field. "
                "Use this for fields with continuous values such as dates or counts."
            ),
            parameters={
                "type": "object",
                "properties": {
                    "field": {"type": "string", "description": "The numeric metadata field name."}
                },
                "required": ["field"],
            },
            function=self._get_metadata_field_range,
        )

    def _get_metadata_field_range(self, field: str) -> dict:
        return self.document_store.get_metadata_field_min_max(field=field)

    def to_dict(self) -> dict[str, Any]:
        return {
            "type": generate_qualified_class_name(type(self)),
            "data": {"document_store": self.document_store.to_dict()},
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "GetMetadataFieldRangeTool":
        inner_data = data["data"]
        deserialize_component_inplace(inner_data, key="document_store")
        return cls(**inner_data)

Usage:

from haystack.components.agents import Agent
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

document_store = OpenSearchDocumentStore(...)

agent = Agent(
    chat_generator=...,
    tools=[
        # Metadata inspection tools proposed in this issue
        ListMetadataFieldsTool(document_store),
        GetMetadataFieldValuesTool(document_store),
        GetMetadataFieldRangeTool(document_store),
        # Existing retrieval tool, assumed to be defined elsewhere
        retriever_tool,
    ],
)
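
Once the agent has inspected the metadata, it can assemble a filter for the retriever. The sketch below shows, in plain Python, how inspected values could be turned into a filter in Haystack's dictionary filter syntax (`field`/`operator`/`value` conditions combined under a logical operator). The helper `build_filter`, its arguments, and the `{"min": ..., "max": ...}` shape of the range are illustrative assumptions, not part of the proposal.

```python
from typing import Any


def build_filter(
    field: str,
    value: Any,
    known_values: list[Any],
    range_field: str,
    lower: Any,
    known_range: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical helper: check choices against inspected metadata, then build a filter."""
    # Reject values the store does not contain, instead of guessing
    if value not in known_values:
        raise ValueError(f"{value!r} is not a known value for {field!r}")
    # Clamp the lower bound to the observed minimum (assumed "min" key)
    lower = max(lower, known_range["min"])
    return {
        "operator": "AND",
        "conditions": [
            {"field": f"meta.{field}", "operator": "==", "value": value},
            {"field": f"meta.{range_field}", "operator": ">=", "value": lower},
        ],
    }


metadata_filter = build_filter(
    field="category",
    value="news",
    known_values=["blog", "news"],  # e.g. from get_metadata_field_values
    range_field="year",
    lower=2000,
    known_range={"min": 2019, "max": 2023},  # e.g. from get_metadata_field_range
)
```
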

Validation

Before merging, these tools should be tested against a reasonably sized corpus to verify that the default tool names, descriptions, and parameter descriptions are sufficient for LLMs to use them correctly without additional prompting. For example, we want to verify that the model reliably calls list_metadata_fields before attempting to construct a filter and that it correctly interprets the field type information returned.
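
One way such a check could be automated is sketched below: given the ordered tool names from a recorded agent run, it verifies that list_metadata_fields was called before any retrieval call. The function name, the trace format (a plain list of tool names), and the retrieval tool name `"retriever"` are hypothetical; a real evaluation would extract the calls from the agent's messages.

```python
def calls_list_fields_first(tool_calls: list[str], retrieval_tool: str = "retriever") -> bool:
    """Return True if no retrieval call happens before list_metadata_fields."""
    seen_listing = False
    for name in tool_calls:
        if name == "list_metadata_fields":
            seen_listing = True
        elif name == retrieval_tool and not seen_listing:
            # The agent tried to retrieve without inspecting the fields first
            return False
    return True


# Hypothetical traces of tool names, in call order
good_trace = ["list_metadata_fields", "get_metadata_field_values", "retriever"]
bad_trace = ["retriever", "list_metadata_fields"]

good_ok = calls_list_fields_first(good_trace)
bad_ok = calls_list_fields_first(bad_trace)
```
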


👋 Hello there! This issue will be handled internally and isn't open for external contributions. If you'd like to contribute, please take a look at issues labeled contributions welcome or good first issue. We'd really appreciate it!

Labels: P1 (High priority, add to the next sprint)