feat: Add FirecrawlWebSearch component #2888

Open
saakshigupta2002 wants to merge 4 commits into deepset-ai:main from saakshigupta2002:feat/firecrawl-websearch

@saakshigupta2002
Contributor

Related Issues

Proposed Changes:

This PR adds a new FirecrawlWebSearch component to the Firecrawl integration, enabling web search queries using the Firecrawl Search API. This follows the pattern established by Haystack's existing SearchApiWebSearch and SerperDevWebSearch components, as suggested in the PR review for #2859.

Many use cases require starting from a user search query rather than from a predefined URL list. While the existing FirecrawlCrawler handles crawling known URLs, this new component allows users to search the web and retrieve results as Haystack Documents — making it straightforward to build search-augmented pipelines.

What's included:

  • FirecrawlWebSearch component at haystack_integrations.components.websearch.firecrawl

    • Follows the standard Haystack WebSearch interface: accepts a query string, returns documents and links
    • Supports both synchronous (run()) and asynchronous (run_async()) execution
    • Configurable top_k for result limiting and search_params for Firecrawl-specific options (time filters, location, scrape options, etc.)
    • Full serialization support via to_dict() / from_dict()
    • Graceful error handling with logging (returns empty results on failure)
    • Handles both search-only results (title/description/url) and scraped document results (full markdown content when scrapeOptions are provided)
  • Comprehensive test suite (16 unit tests + 2 integration tests)

    • Tests for initialization, serialization round-trip, search execution, parameter overrides, top_k truncation, async execution, error handling, warm-up behavior, and edge cases (empty/null results)
  • Configuration updates

    • Updated pyproject.toml with "Web Search" keyword and mypy target for the new module
    • Updated pydoc/config_docusaurus.yml to include the new component in API documentation generation

Usage example:

from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
from haystack.utils import Secret

websearch = FirecrawlWebSearch(
    api_key=Secret.from_env_var("FIRECRAWL_API_KEY"),
    top_k=5,
)
websearch.warm_up()

result = websearch.run(query="What is Haystack by deepset?")
documents = result["documents"]  # List of Document objects with search result content
links = result["links"]          # List of URL strings

Design decisions:

  • Uses the Firecrawl Python SDK (firecrawl-py) directly rather than raw HTTP requests, consistent with the existing FirecrawlCrawler implementation. No new dependencies needed since firecrawl-py>=4.0.0 already includes the search() method.
  • Placed under components/websearch/firecrawl/ to follow Haystack's namespace convention for web search components, separate from the existing crawler under components/fetchers/firecrawl/.
  • top_k maps to Firecrawl's limit parameter when not explicitly set in search_params, providing a familiar interface while allowing full control through search_params.
  • Response parsing handles both response types: SearchResultWeb objects (search metadata only) and Firecrawl Document objects (full markdown content when scrapeOptions are provided), using attribute detection rather than type imports to avoid coupling to SDK internals.
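
The top_k to limit mapping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: resolve_search_params is a hypothetical helper name, and the precedence shown (run-time parameters replace init-time ones, an explicit limit wins over top_k) is an assumption inferred from the description.

```python
from typing import Any, Optional


def resolve_search_params(
    top_k: Optional[int],
    init_params: Optional[dict[str, Any]] = None,
    runtime_params: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge search parameters and fall back to top_k for the result limit."""
    # Assumption: run-time parameters, when given, replace the init-time ones.
    params = dict(runtime_params if runtime_params is not None else init_params or {})
    # Only inject `limit` when the caller has not set it explicitly.
    if "limit" not in params and top_k is not None:
        params["limit"] = top_k
    return params


# top_k alone becomes the limit; an explicit limit is left untouched.
only_top_k = resolve_search_params(5)
explicit_limit = resolve_search_params(5, {"limit": 3})
```

Keeping the fallback in one place means that both run() and run_async() could share the same resolution logic.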

How did you test it?

  • All 16 unit tests pass (hatch run test:unit) — covering initialization, serialization round-trip, search execution with mocked clients, runtime parameter overrides, top_k truncation, async execution, error handling, warm-up behavior, and edge cases (empty/null web results)
  • All existing FirecrawlCrawler tests continue to pass (14/14)
  • Lint and format checks pass (hatch run fmt-check)
  • Integration tests included (require FIRECRAWL_API_KEY environment variable): test_run_integration and test_run_async_integration

Notes for the reviewer

  • The new component lives under components/websearch/firecrawl/ (separate from the existing components/fetchers/firecrawl/) to follow the same namespace convention as Haystack core's SearchApiWebSearch and SerperDevWebSearch
  • The _parse_search_response method handles two types of results from the Firecrawl SDK: basic SearchResultWeb objects (when no scrape options are set) and full Document objects with markdown content (when scrapeOptions are provided). This is handled via hasattr checks rather than importing SDK-internal types.
  • No new dependencies were added — firecrawl-py>=4.0.0 (already a dependency) includes the search() method
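
The attribute-detection approach mentioned in the notes above might look roughly like this. It is a sketch under stated assumptions, not the PR's _parse_search_response: ParsedResult is a hypothetical stand-in for a Haystack Document, parse_result is a made-up name, and where a scraped result keeps its URL and title (a nested metadata object here) is assumed for illustration only.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass
class ParsedResult:
    """Hypothetical stand-in for a Haystack Document."""

    content: str
    meta: dict[str, Any]


def parse_result(result: Any) -> ParsedResult:
    """Convert one search result using attribute detection instead of SDK types."""
    markdown = getattr(result, "markdown", None)
    if markdown:
        # Scraped result: use the full page content.
        # Assumption: URL and title live on a nested `metadata` object.
        metadata = getattr(result, "metadata", None)
        url = getattr(metadata, "url", None) or ""
        title = getattr(metadata, "title", None) or ""
        return ParsedResult(content=markdown, meta={"url": url, "title": title})
    # Search-only result: fall back to the short description.
    url = getattr(result, "url", None) or ""
    title = getattr(result, "title", None) or ""
    content = getattr(result, "description", None) or ""
    return ParsedResult(content=content, meta={"url": url, "title": title})


# Stand-in objects that mimic the two result shapes.
search_only = SimpleNamespace(url="https://a.example", title="A", description="snippet")
scraped = SimpleNamespace(
    markdown="# Page",
    metadata=SimpleNamespace(url="https://b.example", title="B"),
)
```

The trade-off is that attribute detection avoids importing SDK types but gives up static type checking on the response.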

Checklist

Add a new FirecrawlWebSearch component that enables web search queries
using the Firecrawl Search API. The component follows the standard
Haystack WebSearch interface (query input, documents + links output)
and supports both synchronous and asynchronous execution.

Closes deepset-ai#2870
@saakshigupta2002 saakshigupta2002 requested a review from a team as a code owner February 27, 2026 13:36
@saakshigupta2002 saakshigupta2002 requested review from anakin87 and removed request for a team February 27, 2026 13:36
@github-actions github-actions bot added integration:firecrawl type:documentation Improvements or additions to documentation labels Feb 27, 2026
Add missing py.typed marker file to the websearch package so mypy
can type check the module.
@anakin87
Member

@bogdankostic could you help review this PR?

Remember that integration tests do not run on PRs from forks, so we need to verify locally that they are correct.

Contributor

@bogdankostic bogdankostic left a comment

Thanks for the PR @saakshigupta2002! I left a few comments on how to further improve it :)

    api_key=Secret.from_env_var("FIRECRAWL_API_KEY"),
    top_k=5,
)
websearch.warm_up()

warm_up is not needed as the component is automatically warmed up on the first call of run.

Suggested change
- websearch.warm_up()
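
For context, the lazy warm-up pattern that makes an explicit call unnecessary can be sketched like this. LazySearchComponent is a hypothetical class written for illustration; it assumes the component guards client creation inside run(), which is one common way to get this behavior and not necessarily how the PR implements it.

```python
from typing import Any, Optional


class LazySearchComponent:
    """Hypothetical component that builds its client on first use."""

    def __init__(self) -> None:
        self._client: Optional[Any] = None
        self.warm_up_calls = 0  # counts real initializations, for demonstration

    def warm_up(self) -> None:
        # Idempotent: the client is created only once.
        if self._client is None:
            self._client = object()  # placeholder for a real SDK client
            self.warm_up_calls += 1

    def run(self, query: str) -> dict[str, list[str]]:
        # run() warms up on demand, so callers need not call warm_up() first.
        self.warm_up()
        return {"documents": [], "links": []}


component = LazySearchComponent()
component.run("first query")
component.run("second query")
```

Because warm_up() is idempotent, calling it explicitly remains harmless; it is just redundant in example code.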

Comment on lines +82 to +94
def to_dict(self) -> dict[str, Any]:
    """Serializes the component to a dictionary."""
    return default_to_dict(
        self,
        api_key=self.api_key.to_dict(),
        top_k=self.top_k,
        search_params=self.search_params,
    )

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FirecrawlWebSearch":
    """Deserializes the component from a dictionary."""
    return default_from_dict(cls, data)

These methods should not be needed anymore. Serialization should be covered by default serialization of components (see https://docs.haystack.deepset.ai/docs/serialization#default-serialization-behavior).
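
As a rough illustration of why hand-written methods can become unnecessary, a default serializer can introspect the constructor and read attributes of the same name. The sketch below is not Haystack's implementation: sketch_default_to_dict and ToyWebSearch are hypothetical names, the output shape is only modeled on the type/init_parameters layout, and how Secret values such as api_key are handled is deliberately left out.

```python
import inspect
from typing import Any, Optional


def sketch_default_to_dict(obj: Any) -> dict[str, Any]:
    """Illustrative only: collect init parameters from same-named attributes."""
    init_parameters: dict[str, Any] = {}
    for name in inspect.signature(type(obj).__init__).parameters:
        if name == "self":
            continue
        # Raises AttributeError if the value was not stored under the same name.
        init_parameters[name] = getattr(obj, name)
    cls = type(obj)
    return {
        "type": f"{cls.__module__}.{cls.__qualname__}",
        "init_parameters": init_parameters,
    }


class ToyWebSearch:
    """Hypothetical component that stores every init argument on self."""

    def __init__(
        self,
        top_k: Optional[int] = 10,
        search_params: Optional[dict[str, Any]] = None,
    ) -> None:
        self.top_k = top_k
        self.search_params = search_params


data = sketch_default_to_dict(ToyWebSearch(top_k=5))
```

The practical consequence of this style of introspection is that each constructor argument should be kept on the instance under its original name.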

Comment on lines +56 to +58
:param top_k:
    Maximum number of documents to return.
    Defaults to 10.

Let's add here that this can be overridden by the "limit" parameter in search_params.

Comment on lines +132 to +134
if self.top_k is not None:
    documents = documents[: self.top_k]
    links = links[: self.top_k]

I think we can remove this, Firecrawl should return only "limit" search results.

Comment on lines +174 to +176
if self.top_k is not None:
    documents = documents[: self.top_k]
    links = links[: self.top_k]

I think we can remove this, Firecrawl should return only "limit" search results.

return {"documents": documents, "links": links}

@staticmethod
def _parse_search_response(search_response: Any) -> tuple[list[Document], list[str]]:

Let's not use Any type here, this should be of type SearchData.

documents: list[Document] = []
links: list[str] = []

web_results = getattr(search_response, "web", None) or []

Using the SearchData type for search_response, the following should work:

Suggested change
- web_results = getattr(search_response, "web", None) or []
+ web_results = search_response.web or []

Comment on lines +203 to +205
url = getattr(result, "url", "") or ""
title = getattr(result, "title", "") or ""
content = getattr(result, "description", "") or ""

Suggested change
- url = getattr(result, "url", "") or ""
- title = getattr(result, "title", "") or ""
- content = getattr(result, "description", "") or ""
+ url = getattr(result, "url", "")
+ title = getattr(result, "title", "")
+ content = getattr(result, "description", "")
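
One detail worth checking before applying this kind of simplification: the two forms differ when an attribute exists but holds None. The snippet below demonstrates the difference on a stand-in object; whether the Firecrawl SDK can actually return None for these fields is not stated in this thread, so treat it as an illustration of Python semantics only.

```python
from types import SimpleNamespace

# Stand-in result: `title` is present but None, `description` is missing.
result = SimpleNamespace(url="https://example.com", title=None)

# Attribute present but None: the getattr default is NOT used.
title_plain = getattr(result, "title", "")
title_guarded = getattr(result, "title", "") or ""

# Attribute missing: the default is used in both forms.
description_plain = getattr(result, "description", "")
description_guarded = getattr(result, "description", "") or ""
```

If the fields are typed as optional strings, the plain form can pass None downstream, whereas the guarded form always yields a string.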

- Remove warm_up() from docstring example (auto-called on first run)
- Remove custom to_dict()/from_dict() in favor of default serialization
- Update top_k docstring to note it can be overridden by limit in search_params
- Remove client-side top_k truncation (Firecrawl respects limit server-side)
- Use SearchData type instead of Any for _parse_search_response
- Remove redundant or "" from getattr() calls
- Use search_response.web directly instead of getattr()
- Remove corresponding tests for removed functionality
@saakshigupta2002
Contributor Author

> Thanks for the PR @saakshigupta2002! I left a few comments on how to further improve it :)

Thank you @bogdankostic for such elaborate feedback, pushed the updates! :)

Labels

integration:firecrawl type:documentation Improvements or additions to documentation

Development

Successfully merging this pull request may close these issues.

Firecrawl: Add websearch functionality

3 participants