
Commit c8cf749

Add Firecrawl integration (#402)

- Add Firecrawl integration page
- Update Firecrawl integration documentation to include a link to the FirecrawlCrawler component and remove unnecessary warm-up calls in examples.

1 parent 978d0e7 commit c8cf749

2 files changed: 96 additions & 0 deletions

integrations/firecrawl.md
---
layout: integration
name: Firecrawl
description: Crawl websites and extract LLM-ready content using Firecrawl
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/firecrawl-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations
type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/firecrawl.png
version: Haystack 2.0
toc: true
---
### Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [License](#license)
## Overview

[Firecrawl](https://firecrawl.dev) turns websites into LLM-ready data. It handles JavaScript rendering and anti-bot measures, and outputs clean Markdown.

This integration provides a [`FirecrawlCrawler`](https://docs.haystack.deepset.ai/docs/firecrawlcrawler) component that crawls one or more URLs and returns the content as Haystack `Document` objects. Crawling starts from each given URL and follows links to discover subpages, up to a configurable limit.

You need a Firecrawl API key to use this integration. You can get one at [firecrawl.dev](https://firecrawl.dev).
## Installation

```bash
pip install firecrawl-haystack
```
## Usage

### Components

This integration provides the following component:

- **`FirecrawlCrawler`**: Crawls URLs and their subpages, returning extracted content as Haystack Documents.
### Basic Example

```python
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(params={"limit": 5})

result = crawler.run(urls=["https://docs.haystack.deepset.ai/docs/intro"])
documents = result["documents"]
```
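Each item in `documents` is a Haystack `Document`, whose page text lives on the `content` attribute and whose crawl metadata lives in `meta`. A minimal sketch of iterating over results, using a plain dataclass as a stand-in for `Document` so it runs without an API key; the exact `meta` keys (such as `url`) depend on Firecrawl's response and are an assumption here:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    # Stand-in for haystack.Document with just the fields used below.
    content: str
    meta: dict = field(default_factory=dict)

# Shaped like the crawler's output: {"documents": [...]}
result = {
    "documents": [
        Doc(
            content="# Intro\n\nWelcome to Haystack.",
            meta={"url": "https://docs.haystack.deepset.ai/docs/intro"},
        )
    ]
}

for doc in result["documents"]:
    print(doc.meta.get("url"), "->", len(doc.content), "chars")
```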
By default, the component reads the API key from the `FIRECRAWL_API_KEY` environment variable. You can also pass it explicitly:

```python
from haystack.utils import Secret
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(
    api_key=Secret.from_token("your-api-key"),
    params={"limit": 10, "scrape_options": {"formats": ["markdown"]}},
)
```
### Parameters

- **`api_key`**: API key for Firecrawl. Defaults to the `FIRECRAWL_API_KEY` environment variable.
- **`params`**: Parameters for the crawl request. Defaults to `{"limit": 1, "scrape_options": {"formats": ["markdown"]}}`. See the [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post) for all available parameters. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
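The documented defaults suggest that any `params` you pass take precedence key by key. A pure-Python sketch of that override behavior (an illustration of the merge semantics, not the component's actual implementation; `merge_params` is a hypothetical helper):

```python
DEFAULT_PARAMS = {"limit": 1, "scrape_options": {"formats": ["markdown"]}}

def merge_params(user_params=None):
    # Hypothetical helper: shallow-merge user overrides onto the defaults,
    # so an explicit "limit" wins while untouched defaults remain.
    return {**DEFAULT_PARAMS, **(user_params or {})}

print(merge_params({"limit": 10}))
# {'limit': 10, 'scrape_options': {'formats': ['markdown']}}
```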
### Async Support

The component supports asynchronous execution via `run_async`:

```python
import asyncio
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

async def main():
    crawler = FirecrawlCrawler(params={"limit": 5})

    result = await crawler.run_async(urls=["https://docs.haystack.deepset.ai/docs/intro"])
    print(f"Crawled {len(result['documents'])} documents")

asyncio.run(main())
```
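Because `run_async` is a coroutine, several crawls can be fanned out concurrently with `asyncio.gather`. A sketch of that pattern; the `StubCrawler` below stands in for `FirecrawlCrawler` (same `run_async` shape, no network calls) so the example runs without an API key:

```python
import asyncio

class StubCrawler:
    """Stand-in for FirecrawlCrawler: same run_async shape, no network I/O."""
    async def run_async(self, urls):
        await asyncio.sleep(0)  # simulate waiting on the crawl API
        return {"documents": [f"<doc for {u}>" for u in urls]}

async def crawl_many(crawler, url_groups):
    # Launch one crawl per group of start URLs and await them all together.
    results = await asyncio.gather(
        *(crawler.run_async(urls=group) for group in url_groups)
    )
    return [doc for r in results for doc in r["documents"]]

docs = asyncio.run(
    crawl_many(StubCrawler(), [["https://a.example"], ["https://b.example"]])
)
print(f"Crawled {len(docs)} documents")
```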
## License

`firecrawl-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.

logos/firecrawl.png (2.6 MB)
