Commit 04f4b76

Add FirecrawlWebSearch component to firecrawl integration page (#415)
1 parent 78335ce commit 04f4b76

1 file changed: integrations/firecrawl.md

Lines changed: 103 additions & 11 deletions
@@ -1,7 +1,7 @@
 ---
 layout: integration
 name: Firecrawl
-description: Crawl websites and extract LLM-ready content using Firecrawl
+description: Crawl websites, search the web, and extract LLM-ready content using Firecrawl
 authors:
   - name: deepset
     socials:
@@ -22,13 +22,17 @@ toc: true
 - [Overview](#overview)
 - [Installation](#installation)
 - [Usage](#usage)
+  - [FirecrawlCrawler](#firecrawlcrawler)
+  - [FirecrawlWebSearch](#firecrawlwebsearch)
 - [License](#license)
 
 ## Overview
 
 [Firecrawl](https://firecrawl.dev) turns websites into LLM-ready data. It handles JavaScript rendering, anti-bot bypassing, and outputs clean Markdown.
 
-This integration provides a [`FirecrawlCrawler`](https://docs.haystack.deepset.ai/docs/firecrawlcrawler) component that crawls one or more URLs and returns the content as Haystack `Document` objects. Crawling starts from each given URL and follows links to discover subpages, up to a configurable limit.
+This integration provides two components:
+- [`FirecrawlCrawler`](https://docs.haystack.deepset.ai/docs/firecrawlcrawler): Crawls one or more URLs and follows links to discover subpages, returning extracted content as Haystack `Document` objects.
+- [`FirecrawlWebSearch`](https://docs.haystack.deepset.ai/docs/firecrawlwebsearch): Searches the web using a query, scrapes the resulting pages, and returns the structured content as Haystack `Document` objects.
 
 You need a Firecrawl API key to use this integration. You can get one at [firecrawl.dev](https://firecrawl.dev).
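As a note on where that key goes at runtime, the components read it from the `FIRECRAWL_API_KEY` environment variable by default. A minimal sketch, with a placeholder key value:

```python
import os

# The integration reads the API key from this environment variable by
# default; the value below is a placeholder, not a real key.
os.environ["FIRECRAWL_API_KEY"] = "your-api-key"

print(os.environ["FIRECRAWL_API_KEY"])
```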

@@ -40,13 +44,9 @@ pip install firecrawl-haystack
 
 ## Usage
 
-### Components
+### FirecrawlCrawler
 
-This integration provides the following component:
-
-- **`FirecrawlCrawler`**: Crawls URLs and their subpages, returning extracted content as Haystack Documents.
-
-### Basic Example
+#### Basic Example
 
 ```python
 from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler
@@ -69,12 +69,12 @@ crawler = FirecrawlCrawler(
 )
 ```
 
-### Parameters
+#### Parameters
 
 - **`api_key`**: API key for Firecrawl. Defaults to the `FIRECRAWL_API_KEY` environment variable.
 - **`params`**: Parameters for the crawl request. Defaults to `{"limit": 1, "scrape_options": {"formats": ["markdown"]}}`. See the [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post) for all available parameters. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
 
-### Async Support
+#### Async Support
 
 The component supports asynchronous execution via `run_async`:
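To make the `params` default above concrete, here is a pure-Python sketch (no network; `merged_params` is a hypothetical helper for illustration, not part of the integration) of shallow-merging user-supplied params over the documented defaults. Whether the component itself merges or replaces them wholesale is not specified here:

```python
# Default crawl parameters documented for FirecrawlCrawler.
DEFAULT_PARAMS = {"limit": 1, "scrape_options": {"formats": ["markdown"]}}

def merged_params(user_params=None):
    """Hypothetical helper: shallow-merge user params over the defaults."""
    return {**DEFAULT_PARAMS, **(user_params or {})}

# Raise the page limit while keeping the default markdown output:
print(merged_params({"limit": 5}))
# {'limit': 5, 'scrape_options': {'formats': ['markdown']}}
```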

@@ -91,6 +91,98 @@ async def main():
 asyncio.run(main())
 ```
 
+### FirecrawlWebSearch
+
+`FirecrawlWebSearch` searches the web using the Firecrawl Search API, scrapes the resulting pages, and returns the structured text as Haystack `Document` objects along with the result URLs. Because Firecrawl actively scrapes and structures page content into LLM-friendly formats, you generally don't need an additional component like `LinkContentFetcher` to read the web pages.
+
+#### Basic Example
+
+```python
+from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
+
+web_search = FirecrawlWebSearch(
+    top_k=5,
+    search_params={"scrape_options": {"formats": ["markdown"]}},
+)
+
+result = web_search.run(query="What is Haystack by deepset?")
+
+for doc in result["documents"]:
+    print(doc.content)
+```
+
+#### In a Pipeline
+
+Here is an example of a RAG pipeline that uses `FirecrawlWebSearch` to look up an answer on the web. Because Firecrawl returns the actual text of the scraped pages, you can pass the `documents` output directly into a `ChatPromptBuilder` to give the LLM the necessary context.
+
+```python
+from haystack import Pipeline
+from haystack.utils import Secret
+from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
+from haystack.components.generators.chat import OpenAIChatGenerator
+from haystack.dataclasses import ChatMessage
+from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
+
+web_search = FirecrawlWebSearch(
+    top_k=2,
+    search_params={"scrape_options": {"formats": ["markdown"]}},
+)
+
+prompt_template = [
+    ChatMessage.from_system("You are a helpful assistant."),
+    ChatMessage.from_user(
+        "Given the information below:\n"
+        "{% for document in documents %}{{ document.content }}\n{% endfor %}\n"
+        "Answer the following question: {{ query }}.\nAnswer:",
+    ),
+]
+
+prompt_builder = ChatPromptBuilder(
+    template=prompt_template,
+    required_variables={"query", "documents"},
+)
+
+llm = OpenAIChatGenerator(
+    api_key=Secret.from_env_var("OPENAI_API_KEY"),
+    model="gpt-4o-mini",
+)
+
+pipe = Pipeline()
+pipe.add_component("search", web_search)
+pipe.add_component("prompt_builder", prompt_builder)
+pipe.add_component("llm", llm)
+
+pipe.connect("search.documents", "prompt_builder.documents")
+pipe.connect("prompt_builder.prompt", "llm.messages")
+
+query = "What is Haystack by deepset?"
+result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})
+print(result["llm"]["replies"][0].content)
+```
+
+#### Parameters
+
+- **`api_key`**: API key for Firecrawl. Defaults to the `FIRECRAWL_API_KEY` environment variable.
+- **`top_k`**: Maximum number of documents to return. Defaults to 10. Can be overridden by the `"limit"` parameter in `search_params`.
+- **`search_params`**: Additional parameters for the Firecrawl Search API (e.g., time filters, location, scrape options). See the [Firecrawl Search API reference](https://docs.firecrawl.dev/api-reference/endpoint/search) for all available parameters.
+
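The `top_k` bullet above states that a `"limit"` key in `search_params` takes precedence; a tiny pure-Python sketch of that precedence rule (`effective_limit` is a hypothetical helper for illustration, not part of the integration):

```python
def effective_limit(top_k=10, search_params=None):
    """Hypothetical helper: a 'limit' in search_params overrides top_k."""
    return (search_params or {}).get("limit", top_k)

print(effective_limit())                                     # 10 (default top_k)
print(effective_limit(top_k=5))                              # 5
print(effective_limit(top_k=5, search_params={"limit": 2}))  # 2 ("limit" wins)
```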
+#### Async Support
+
+The component supports asynchronous execution via `run_async`:
+
+```python
+import asyncio
+from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
+
+async def main():
+    web_search = FirecrawlWebSearch(top_k=3)
+
+    result = await web_search.run_async(query="What is Haystack by deepset?")
+    print(f"Found {len(result['documents'])} documents")
+
+asyncio.run(main())
+```
+
 ### License
 
 `firecrawl-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
