description: Crawl websites and extract LLM-ready content using Firecrawl
+
description: Crawl websites, search the web, and extract LLM-ready content using Firecrawl
authors:
- name: deepset
socials:
@@ -22,13 +22,17 @@ toc: true
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
+
- [FirecrawlCrawler](#firecrawlcrawler)
+
- [FirecrawlWebSearch](#firecrawlwebsearch)
- [License](#license)

## Overview
-
[Firecrawl](https://firecrawl.dev) turns websites into LLM-ready data. It handles JavaScript rendering, anti-bot bypassing, and outputs clean Markdown.
+
[Firecrawl](https://firecrawl.dev) turns websites into LLM-ready data. It handles JavaScript rendering, anti-bot bypassing, and outputs clean Markdown.
-
This integration provides a [`FirecrawlCrawler`](https://docs.haystack.deepset.ai/docs/firecrawlcrawler) component that crawls one or more URLs and returns the content as Haystack `Document` objects. Crawling starts from each given URL and follows links to discover subpages, up to a configurable limit.
+
This integration provides two components:
+
- [`FirecrawlCrawler`](https://docs.haystack.deepset.ai/docs/firecrawlcrawler): Crawls one or more URLs and follows links to discover subpages, returning extracted content as Haystack `Document` objects.
+
- [`FirecrawlWebSearch`](https://docs.haystack.deepset.ai/docs/firecrawlwebsearch): Searches the web using a query, scrapes the resulting pages, and returns the structured content as Haystack `Document` objects.

You need a Firecrawl API key to use this integration. You can get one at [firecrawl.dev](https://firecrawl.dev).
@@ -40,13 +44,9 @@ pip install firecrawl-haystack
## Usage
-
### Components
+
### FirecrawlCrawler
-
This integration provides the following component:
-
- **`FirecrawlCrawler`**: Crawls URLs and their subpages, returning extracted content as Haystack Documents.
-
### Basic Example
+
#### Basic Example

```python
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler
@@ -69,12 +69,12 @@ crawler = FirecrawlCrawler(
)
```
-
### Parameters
+
#### Parameters

- **`api_key`**: API key for Firecrawl. Defaults to the `FIRECRAWL_API_KEY` environment variable.
- **`params`**: Parameters for the crawl request. Defaults to `{"limit": 1, "scrape_options": {"formats": ["markdown"]}}`. See the [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post) for all available parameters. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
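
Because an unbounded crawl can consume credits quickly, it may help to build the `params` dictionary in one place with an explicit limit. The sketch below is only an illustration: `crawl_params` is a hypothetical helper, not part of the integration, and it simply mirrors the default shape quoted above.

```python
from typing import Any


def crawl_params(limit: int = 1, formats: tuple[str, ...] = ("markdown",)) -> dict[str, Any]:
    """Hypothetical helper: build a crawl `params` dict with an explicit page limit."""
    if limit < 1:
        # Refuse values that would not bound the crawl in a meaningful way.
        raise ValueError("limit must be a positive integer")
    return {"limit": limit, "scrape_options": {"formats": list(formats)}}


params = crawl_params(limit=5)
print(params)
# The resulting dict would then be passed as FirecrawlCrawler(params=params).
```

Keeping the limit as a required, validated value makes it harder to start a crawl without a page cap by accident.
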
-
### Async Support
+
#### Async Support

The component supports asynchronous execution via `run_async`:

@@ -91,6 +91,98 @@ async def main():
asyncio.run(main())
```
+
### FirecrawlWebSearch
+
`FirecrawlWebSearch` searches the web using the Firecrawl Search API, scrapes the resulting pages, and returns the structured text as Haystack `Document` objects along with the result URLs. Because Firecrawl actively scrapes and structures page content into LLM-friendly formats, you generally don't need an additional component like `LinkContentFetcher` to read the web pages.
+
#### Basic Example
+
```python
+
from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
result = web_search.run(query="What is Haystack by deepset?")
+
for doc in result["documents"]:
+
    print(doc.content)
+
```
+
#### In a Pipeline
+
Here is an example of a RAG pipeline that uses `FirecrawlWebSearch` to look up an answer on the web. Because Firecrawl returns the actual text of the scraped pages, you can pass the `documents` output directly into a `ChatPromptBuilder` to give the LLM the necessary context.
+
```python
+
from haystack import Pipeline
+
from haystack.utils import Secret
+
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
+
from haystack.components.generators.chat import OpenAIChatGenerator
+
from haystack.dataclasses import ChatMessage
+
from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})
+
print(result["llm"]["replies"][0].content)
+
```
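
To make the data flow concrete, here is a rough, library-free sketch of what handing the `documents` output to a prompt-building step amounts to. It is not the pipeline from this diff: `Doc` and `build_prompt` are hypothetical stand-ins, and the prompt wording is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a Haystack Document; only the `content` field is modelled."""
    content: str


def build_prompt(query: str, documents: list[Doc]) -> str:
    """Join document contents into a context section followed by the question."""
    context = "\n\n".join(doc.content for doc in documents)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [Doc("Haystack is an open-source framework."), Doc("It is developed by deepset.")]
print(build_prompt("What is Haystack by deepset?", docs))
```

The point of the sketch is that the scraped page text itself becomes the context, so no separate fetching step sits between the search and the prompt.
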
+
#### Parameters
+
- **`api_key`**: API key for Firecrawl. Defaults to the `FIRECRAWL_API_KEY` environment variable.
+
- **`top_k`**: Maximum number of documents to return. Defaults to 10. Can be overridden by the `"limit"` parameter in `search_params`.
+
- **`search_params`**: Additional parameters for the Firecrawl Search API (e.g., time filters, location, scrape options). See the [Firecrawl Search API reference](https://docs.firecrawl.dev/api-reference/endpoint/search) for all available parameters.
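
The interaction between `top_k` and a `"limit"` entry in `search_params` can be pictured with a few lines of plain Python. This is an illustration of the precedence described above, under the assumption that an explicit `"limit"` always wins; `effective_limit` is a hypothetical function and not the component's actual code.

```python
from typing import Any, Optional


def effective_limit(top_k: int = 10, search_params: Optional[dict[str, Any]] = None) -> int:
    """Hypothetical illustration: a "limit" in search_params takes precedence over top_k."""
    return (search_params or {}).get("limit", top_k)


print(effective_limit())                                     # 10
print(effective_limit(top_k=3))                              # 3
print(effective_limit(top_k=3, search_params={"limit": 5}))  # 5
```

In practice this means that setting both values is redundant, and picking one of them keeps the intended result count unambiguous.
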
+
#### Async Support
+
The component supports asynchronous execution via `run_async`:
+
```python
+
import asyncio
+
from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
+
async def main():
+
    web_search = FirecrawlWebSearch(top_k=3)
+
    result = await web_search.run_async(query="What is Haystack by deepset?")