feat: Add FirecrawlWebSearch component #2888
saakshigupta2002 wants to merge 4 commits into deepset-ai:main
Add a new FirecrawlWebSearch component that enables web search queries using the Firecrawl Search API. The component follows the standard Haystack WebSearch interface (query input, documents + links output) and supports both synchronous and asynchronous execution. Closes deepset-ai#2870
Add missing py.typed marker file to the websearch package so mypy can type check the module.
@bogdankostic could you help review this PR? Remember that integration tests do not run on PRs from forks, so we need to verify locally that they are correct.
bogdankostic left a comment
Thanks for the PR @saakshigupta2002! I left a few comments on how to further improve it :)
    api_key=Secret.from_env_var("FIRECRAWL_API_KEY"),
    top_k=5,
)
websearch.warm_up()
warm_up is not needed as the component is automatically warmed up on the first call of run.
Suggested change: remove websearch.warm_up()
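For readers unfamiliar with the pattern the comment relies on, here is a minimal sketch of warm-up on first run. It is not the code from this PR: LazyClientComponent, its placeholder client, and the warm_up_calls counter are hypothetical stand-ins used only to illustrate why an explicit warm_up() call in the docstring example is redundant.

```python
from typing import Any, Optional


class LazyClientComponent:
    """Hypothetical stand-in illustrating warm-up on first run."""

    def __init__(self) -> None:
        self._client: Optional[Any] = None
        self.warm_up_calls = 0  # only here to make the behavior observable

    def warm_up(self) -> None:
        # Idempotent: the client is built at most once.
        if self._client is None:
            self._client = object()  # placeholder for a real SDK client
            self.warm_up_calls += 1

    def run(self, query: str) -> dict[str, list[str]]:
        # Warm up on demand, so callers never have to call warm_up() themselves.
        if self._client is None:
            self.warm_up()
        return {"links": [f"https://example.com/search?q={query}"]}
```

Calling run() twice on a fresh instance builds the placeholder client exactly once, which is the behavior the review comment assumes for the real component.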
def to_dict(self) -> dict[str, Any]:
    """Serializes the component to a dictionary."""
    return default_to_dict(
        self,
        api_key=self.api_key.to_dict(),
        top_k=self.top_k,
        search_params=self.search_params,
    )

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FirecrawlWebSearch":
    """Deserializes the component from a dictionary."""
    return default_from_dict(cls, data)
These methods should not be needed anymore. Serialization should be covered by default serialization of components (see https://docs.haystack.deepset.ai/docs/serialization#default-serialization-behavior).
:param top_k:
    Maximum number of documents to return.
    Defaults to 10.
Let's add here that this can be overridden by the "limit" parameter in search_params.
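The precedence being documented here can be sketched as follows. This is an illustration under assumptions, not the component's actual code: build_search_kwargs is a hypothetical helper name, and the "location" key is only an example of a Firecrawl-specific option passed through search_params.

```python
from typing import Any, Optional


def build_search_kwargs(
    top_k: Optional[int], search_params: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Hypothetical helper: top_k supplies `limit` unless search_params sets it."""
    # Copy so the caller's dictionary is never mutated.
    kwargs = dict(search_params or {})
    if top_k is not None:
        # setdefault keeps an explicit "limit" from search_params untouched.
        kwargs.setdefault("limit", top_k)
    return kwargs
```

With this shape, build_search_kwargs(10, {"limit": 3}) keeps the explicit limit of 3, while build_search_kwargs(10, None) falls back to top_k.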
if self.top_k is not None:
    documents = documents[: self.top_k]
    links = links[: self.top_k]
I think we can remove this, Firecrawl should return only "limit" search results.
if self.top_k is not None:
    documents = documents[: self.top_k]
    links = links[: self.top_k]
I think we can remove this, Firecrawl should return only "limit" search results.
    return {"documents": documents, "links": links}

@staticmethod
def _parse_search_response(search_response: Any) -> tuple[list[Document], list[str]]:
Let's not use Any type here, this should be of type SearchData.
documents: list[Document] = []
links: list[str] = []

web_results = getattr(search_response, "web", None) or []
Using the SearchData type for search_response, the following should work:
Suggested change:
- web_results = getattr(search_response, "web", None) or []
+ web_results = search_response.web or []
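To make the effect of the typed signature concrete, here is a self-contained sketch. WebResult and SearchResponse are hypothetical stand-ins for the SDK's result and SearchData types (their real field sets are not shown in this thread), and plain dictionaries stand in for Haystack Documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebResult:
    """Hypothetical stand-in for the SDK's web search result type."""

    url: str
    title: Optional[str] = None
    description: Optional[str] = None


@dataclass
class SearchResponse:
    """Hypothetical stand-in for SearchData; `web` may be None."""

    web: Optional[list[WebResult]] = None


def parse_search_response(
    search_response: SearchResponse,
) -> tuple[list[dict[str, Optional[str]]], list[str]]:
    documents: list[dict[str, Optional[str]]] = []
    links: list[str] = []
    # With a typed argument, plain attribute access replaces getattr();
    # `or []` still covers the case where `web` is None.
    for result in search_response.web or []:
        documents.append(
            {"content": result.description, "title": result.title, "url": result.url}
        )
        links.append(result.url)
    return documents, links
```

A response with web=None yields two empty lists, so the `or []` guard is still needed even once getattr() is gone.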
url = getattr(result, "url", "") or ""
title = getattr(result, "title", "") or ""
content = getattr(result, "description", "") or ""
Suggested change:
- url = getattr(result, "url", "") or ""
- title = getattr(result, "title", "") or ""
- content = getattr(result, "description", "") or ""
+ url = getattr(result, "url", "")
+ title = getattr(result, "title", "")
+ content = getattr(result, "description", "")
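One detail worth keeping in mind when applying this suggestion: the two forms behave the same when an attribute is missing, but differ when the attribute exists and holds None. The sketch below uses a SimpleNamespace as a hypothetical result object; whether the Firecrawl SDK can actually return None for these fields is not established in this thread.

```python
from types import SimpleNamespace

# Hypothetical result object: `title` exists but is None, `description` is absent.
result = SimpleNamespace(url="https://example.com", title=None)

# Attribute present but None: the getattr() default is NOT used.
title_plain = getattr(result, "title", "")

# Adding `or ""` also normalizes None to an empty string.
title_normalized = getattr(result, "title", "") or ""

# Attribute missing entirely: the default is used in either form.
description = getattr(result, "description", "")
```

If the SDK types guarantee non-None strings for these fields, the shorter form is equivalent and the extra `or ""` is redundant, as the suggestion assumes.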
- Remove warm_up() from docstring example (auto-called on first run)
- Remove custom to_dict()/from_dict() in favor of default serialization
- Update top_k docstring to note it can be overridden by limit in search_params
- Remove client-side top_k truncation (Firecrawl respects limit server-side)
- Use SearchData type instead of Any for _parse_search_response
- Remove redundant or "" from getattr() calls
- Use search_response.web directly instead of getattr()
- Remove corresponding tests for removed functionality
Thank you @bogdankostic for such elaborate feedback, pushed the updates! :)
Related Issues
Proposed Changes:
This PR adds a new FirecrawlWebSearch component to the Firecrawl integration, enabling web search queries using the Firecrawl Search API. This follows the pattern established by Haystack's existing SearchApiWebSearch and SerperDevWebSearch components, as suggested in the PR review for #2859.

Many use cases require starting from a user search query rather than from a predefined URL list. While the existing FirecrawlCrawler handles crawling known URLs, this new component allows users to search the web and retrieve results as Haystack Documents, making it straightforward to build search-augmented pipelines.

What's included:

- FirecrawlWebSearch component at haystack_integrations.components.websearch.firecrawl
  - Accepts a query string, returns documents and links
  - Supports synchronous (run()) and asynchronous (run_async()) execution
  - top_k for result limiting and search_params for Firecrawl-specific options (time filters, location, scrape options, etc.)
  - Serialization via to_dict()/from_dict()
  - Full markdown content (when scrapeOptions are provided)
- Comprehensive test suite (16 unit tests + 2 integration tests)
  - Covers initialization, serialization round-trip, search execution with mocked clients, runtime parameter overrides, top_k truncation, async execution, error handling, warm-up behavior, and edge cases (empty/null results)
- Configuration updates
  - pyproject.toml with "Web Search" keyword and mypy target for the new module
  - pydoc/config_docusaurus.yml to include the new component in API documentation generation

Usage example:
Design decisions:
- Uses the Firecrawl SDK (firecrawl-py) directly rather than raw HTTP requests, consistent with the existing FirecrawlCrawler implementation. No new dependencies needed since firecrawl-py>=4.0.0 already includes the search() method.
- Placed under components/websearch/firecrawl/ to follow Haystack's namespace convention for web search components, separate from the existing crawler under components/fetchers/firecrawl/.
- top_k maps to Firecrawl's limit parameter when not explicitly set in search_params, providing a familiar interface while allowing full control through search_params.
- Handles both SearchResultWeb objects (search metadata only) and Firecrawl Document objects (full markdown content when scrapeOptions are provided), using attribute detection rather than type imports to avoid coupling to SDK internals.

How did you test it?
- Unit tests (hatch run test:unit), covering initialization, serialization round-trip, search execution with mocked clients, runtime parameter overrides, top_k truncation, async execution, error handling, warm-up behavior, and edge cases (empty/null web results)
- FirecrawlCrawler tests continue to pass (14/14)
- Format check (hatch run fmt-check)
- Integration tests (require the FIRECRAWL_API_KEY environment variable): test_run_integration and test_run_async_integration

Notes for the reviewer
- The component is placed under components/websearch/firecrawl/ (separate from the existing components/fetchers/firecrawl/) to follow the same namespace convention as Haystack core's SearchApiWebSearch and SerperDevWebSearch.
- The _parse_search_response method handles two types of results from the Firecrawl SDK: basic SearchResultWeb objects (when no scrape options are set) and full Document objects with markdown content (when scrapeOptions are provided). This is handled via hasattr checks rather than importing SDK-internal types.
- No new dependencies: firecrawl-py>=4.0.0 (already a dependency) includes the search() method.

Checklist
- The PR title uses one of the conventional commit types: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.