Commit 7f496a6

add docs
1 parent 490b645 commit 7f496a6

9 files changed

Lines changed: 445 additions & 15 deletions

docs/guides/ai_crawler.mdx

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
---
id: ai-crawler
title: AI crawler
description: Learn how to use AiCrawler to extract structured data from HTML pages with an LLM.
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';

import BasicExample from '!!raw-loader!./code_examples/ai_crawler/basic_example.py';
import AdditionalInstructionsExample from '!!raw-loader!./code_examples/ai_crawler/additional_instructions_example.py';
import CustomDistillerExample from '!!raw-loader!./code_examples/ai_crawler/custom_distiller_example.py';
import SelectorExtractorExample from '!!raw-loader!./code_examples/ai_crawler/selector_extractor_example.py';
import UsageLimitExample from '!!raw-loader!./code_examples/ai_crawler/usage_limit_example.py';

An <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> extracts structured data from a page with an LLM. It fetches each page over plain HTTP and parses it with Parsel, then exposes an <ApiLink to="class/ExtractFunction">`extract`</ApiLink> helper: pass a Pydantic model and get a validated instance back. Instead of writing CSS selectors for every field, you describe the data with a schema and the model fills it in.

The model layer is [Pydantic AI](https://ai.pydantic.dev/), so any provider it supports (OpenAI, Anthropic, Gemini, Ollama, ...) works through the `model` argument. The context is an <ApiLink to="class/AiCrawlingContext">`AiCrawlingContext`</ApiLink>, which extends the <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink>, so the manual <ApiLink to="class/ParselCrawlingContext#selector">`selector`</ApiLink> and <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> stay available next to <ApiLink to="class/ExtractFunction">`extract`</ApiLink>.

:::caution Experimental

<ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> is experimental. Its public API may change in future releases.

:::

## When to use AiCrawler

Use <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> when:

- Selectors are unknown or brittle. The model reads the content, so it tolerates markup that varies or changes.
- One schema spans many layouts. A single Pydantic model fits differently structured pages, with no per-page selectors.
- Rapid prototyping. You describe the data with a schema instead of writing selectors.

For pages with a stable, known structure, a plain <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> is cheaper, since it runs no model calls.

<ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> fetches pages over plain HTTP and does not render JavaScript. For pages that need a browser, or for complex multi-step interactions, use <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink>. See the [Stagehand crawler guide](./stagehand-crawler).

## Installation

<ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> requires the `ai` optional dependency group:

```bash
pip install 'crawlee[ai]'
```

or with uv:

```bash
uv add 'crawlee[ai]'
```

The `ai` extra installs the OpenAI integration by default. To use another provider, add the matching [pydantic-ai-slim](https://ai.pydantic.dev/install/#use-with-pydantic-ai-slim) extra. For example, for Anthropic:

```bash
pip install 'crawlee[ai]' 'pydantic-ai-slim[anthropic]'
```

## Basic usage

Provide a `model` and call <ApiLink to="class/AiCrawlingContext#extract">`context.extract`</ApiLink> with a Pydantic model inside the handler. The example below extracts an article and pushes it to the dataset.

<CodeBlock className="language-python">
    {BasicExample}
</CodeBlock>

The `model` builds the crawler's default extractor, an <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink>. With neither `model` nor `extractor`, a default OpenAI model is used.

The `model` argument accepts a provider-prefixed name or a Pydantic AI `Model` instance.

```python
# A provider-prefixed name reads credentials from the provider's environment variable (e.g. OPENAI_API_KEY).
crawler = AiCrawler(model='openai:gpt-5.4-nano')

# A Model instance takes credentials explicitly.
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIChatModel('gpt-5.4-nano', provider=OpenAIProvider(api_key='...'))
crawler = AiCrawler(model=model)
```

## Extractors

An extractor turns a page into your schema. Extractors implement different strategies for working with the LLM, and each one uses an <ApiLink to="class/AiHtmlDistiller">`AiHtmlDistiller`</ApiLink> to shape the model's input. Crawlee ships two.

### AiDirectExtractor

<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> sends the distilled page to the model in one call. The schema is the model's output type. Pydantic AI validates the result; on a mismatch, it sends the error back to the model to fix, bounded by `retries`.

It reads each page on its own, so extraction is accurate per page. It accepts schemas of any shape: nested models, lists, dictionaries, unions, and deep nesting. The cost is one model call per page, which scales poorly on a large site.

Use `additional_instructions` to focus the model on the data you want:

<CodeBlock className="language-python">
    {AdditionalInstructionsExample}
</CodeBlock>
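
The validate-and-retry behavior described for `AiDirectExtractor` can be pictured with a small standard-library sketch. This is not Pydantic AI's implementation: `extract_with_retries`, `validate_article`, `ValidationFailed`, and `fake_model` are hypothetical names, and the model is a stub that answers wrongly until it receives feedback.

```python
from __future__ import annotations

from collections.abc import Callable


class ValidationFailed(Exception):
    """Raised by a validator when the model output does not fit the schema."""


def extract_with_retries(
    ask_model: Callable[[str, str | None], dict],
    validate: Callable[[dict], dict],
    page: str,
    retries: int = 1,
) -> dict:
    """Ask once, then let the model fix invalid output at most `retries` times."""
    feedback: str | None = None
    for _attempt in range(retries + 1):
        raw = ask_model(page, feedback)
        try:
            return validate(raw)
        except ValidationFailed as exc:
            # Send the validation error back so the next attempt can correct it.
            feedback = str(exc)
    raise ValidationFailed(f'output still invalid after {retries} retries: {feedback}')


def validate_article(raw: dict) -> dict:
    """Toy schema check (assumption): an article needs a string title."""
    if not isinstance(raw.get('title'), str):
        raise ValidationFailed('title must be a string')
    return raw


calls: list[str | None] = []


def fake_model(page: str, feedback: str | None) -> dict:
    """Stub model: wrong on the first call, correct once it sees feedback."""
    calls.append(feedback)
    return {'title': page.strip()} if feedback else {'title': None}


result = extract_with_retries(fake_model, validate_article, ' Hello ', retries=1)
print(result)
print(calls)
```

With `retries=0` the same stub never gets a second attempt, so the sketch raises instead of returning.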

### AiSelectorExtractor

<ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> asks the model for reusable CSS selectors on the first page of a route, caches them, and reuses them with no model call on later pages of the same layout, so it scales to large sites. When a page matches none of the cached selectors (a different markup variant), it generates and caches a new set, so one bucket can hold several variants. If selector generation fails, or the schema shape is unsupported, it degrades to the `fallback` extractor when one is set, and raises otherwise.

Selectors are bucketed by `cache_tag`, which defaults to the request label, so each route keeps its own set. The cache is persisted to a <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, so a later run reuses selectors learned earlier.

<CodeBlock className="language-python">
    {SelectorExtractorExample}
</CodeBlock>
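
The caching behavior described for `AiSelectorExtractor` can be sketched as follows. This is a toy model under stated assumptions, not the library's code: `SelectorCache`, `fake_generate`, and the dict-based "page" (selector mapped to the text it would match) are hypothetical, the fallback is a plain callable instead of an extractor, and persistence to a key-value store is left out.

```python
from __future__ import annotations

from collections import defaultdict
from collections.abc import Callable

# Toy stand-ins: a "page" maps a selector to the text it would match.
Page = dict[str, str]
SelectorSet = dict[str, str]  # field name -> selector


class SelectorCache:
    """Hypothetical cache of selector sets, bucketed by cache tag."""

    def __init__(
        self,
        generate: Callable[[Page], SelectorSet | None],
        fallback: Callable[[Page], dict[str, str]] | None = None,
    ) -> None:
        self._generate = generate
        self._fallback = fallback
        self._buckets: defaultdict[str, list[SelectorSet]] = defaultdict(list)
        self.model_calls = 0

    def extract(self, page: Page, cache_tag: str) -> dict[str, str]:
        # Reuse the first cached variant whose selectors all match this page.
        for selectors in self._buckets[cache_tag]:
            if all(selector in page for selector in selectors.values()):
                return {field: page[selector] for field, selector in selectors.items()}

        # No variant fits: ask the (stubbed) model for a new selector set.
        self.model_calls += 1
        selectors = self._generate(page)
        if selectors is None:
            if self._fallback is None:
                raise RuntimeError('selector generation failed and no fallback is set')
            return self._fallback(page)

        self._buckets[cache_tag].append(selectors)
        return {field: page[selector] for field, selector in selectors.items()}


def fake_generate(page: Page) -> SelectorSet | None:
    """Stub for the model call: pick whichever title selector the page has."""
    for selector in ('h1.title', 'header h2'):
        if selector in page:
            return {'title': selector}
    return None


cache = SelectorCache(fake_generate, fallback=lambda page: {'title': ''})
first = cache.extract({'h1.title': 'First'}, 'article')    # model call, selectors cached
second = cache.extract({'h1.title': 'Second'}, 'article')  # cache hit, no model call
third = cache.extract({'header h2': 'Third'}, 'article')   # new variant, second model call
print(first, second, third, cache.model_calls)
```

Three pages cost two stubbed model calls here, because the second page reuses the selectors learned from the first.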

It supports schemas built from scalar fields, lists of scalars, lists of items, and a single nested item, one level deep. For shapes it cannot serve (such as a `dict` field), set a `fallback` or use <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink>.

Both extractors share two more knobs. `retries` caps how many times the model may fix output that fails schema validation (default 1 for <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink>, 3 for <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink>). `instructions` replaces the base task instructions entirely.

## Distillers

A distiller reduces raw HTML to a compact representation that costs the model fewer tokens to read. Each extractor uses one. To swap it, pass your own through the extractor's `distiller` argument (the crawler itself has no `distiller` argument).

<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> defaults to an <ApiLink to="class/AiCleanHtmlDistiller">`AiCleanHtmlDistiller`</ApiLink>: cleaned, structure-preserving HTML that keeps the full page text. <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> uses an <ApiLink to="class/AiSkeletonDistiller">`AiSkeletonDistiller`</ApiLink> internally to ask the model for selectors; you rarely set it yourself.
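
To make the idea of distilling concrete, here is a minimal standard-library sketch that drops scripts, styles, comments, and attributes while keeping the tag structure and text. It only illustrates the concept and is not how `AiCleanHtmlDistiller` works internally; the `distill` function and `_Cleaner` class are hypothetical, and a real distiller would likely keep attributes that carry data, such as `href`.

```python
from html import escape
from html.parser import HTMLParser

# Elements whose content carries no extractable page text.
DROPPED_TAGS = {'script', 'style', 'noscript'}


class _Cleaner(HTMLParser):
    """Collects a stripped-down copy of the document while parsing."""

    def __init__(self) -> None:
        super().__init__()  # convert_charrefs=True, so text arrives unescaped
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
        elif not self._skip_depth:
            # Keep the structure, drop every attribute.
            self.parts.append(f'<{tag}>')

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> produce a single bare tag.
        if not self._skip_depth and tag not in DROPPED_TAGS:
            self.parts.append(f'<{tag}>')

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            # Re-escape so the output stays valid HTML.
            self.parts.append(escape(text, quote=False))


def distill(html: str) -> str:
    """Return a compact version of `html` (comments are dropped by default)."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return ''.join(cleaner.parts)


raw = (
    '<html><head><style>p{color:red}</style></head>'
    '<body><script>track()</script><!-- ad slot -->'
    '<p class="lead">Fish &amp; chips</p></body></html>'
)
compact = distill(raw)
print(compact)
print(len(raw), '->', len(compact))
```

The output keeps the paragraph text and the element nesting, while everything the model does not need for extraction is gone.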

### Custom distiller

Subclass <ApiLink to="class/BaseAiHtmlDistiller">`BaseAiHtmlDistiller`</ApiLink> and implement <ApiLink to="class/BaseAiHtmlDistiller#distill">`distill`</ApiLink> to send a different representation. Set `prompt_notes` so the model knows the input format. The extractor appends the notes to its instructions.

The example below converts the cleaned page to Markdown with [html-to-markdown](https://pypi.org/project/html-to-markdown/), an extra dependency:

```bash
pip install html-to-markdown
```

<CodeBlock className="language-python">
    {CustomDistillerExample}
</CodeBlock>

## Extract options

<ApiLink to="class/AiCrawlingContext#extract">`context.extract`</ApiLink> takes options alongside the schema:

- `scope` - a CSS selector that restricts extraction to the first matching subtree (e.g. `main` or `article.post`). It saves tokens and keeps the model away from unrelated parts of the page.
- `cache_tag` - the bucket for cached selectors. It defaults to the request label.
- `additional_instructions` - extra instructions for this call, appended to the base instructions. With <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> they steer the one-time selector generation, not each extraction, so use them to point the model at the right region.
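
How the instruction pieces mentioned in this guide might combine can be sketched in a few lines. The function `build_instructions`, the ordering of the parts, the blank-line separator, and the base text are all assumptions made for illustration; the guide only states that `prompt_notes` and `additional_instructions` are appended and that `instructions` replaces the base instructions.

```python
from __future__ import annotations


def build_instructions(
    base: str,
    prompt_notes: str = '',
    additional_instructions: str = '',
    instructions: str | None = None,
) -> str:
    """Join the instruction parts; `instructions`, when given, replaces `base`."""
    task = instructions if instructions is not None else base
    parts = [task, prompt_notes, additional_instructions]
    # Skip empty parts so unused options leave no blank paragraphs.
    return '\n\n'.join(part for part in parts if part)


# Hypothetical base text; the other two strings come from the examples in this commit.
BASE = 'Extract the requested fields from the document.'

prompt = build_instructions(
    base=BASE,
    prompt_notes='The document is Markdown converted from the HTML page.',
    additional_instructions='Extract only the top five posts on the page.',
)
print(prompt)

custom = build_instructions(base=BASE, instructions='Return only the page title.')
print(custom)
```

The second call shows the difference between the two options: `additional_instructions` extends the task text, while `instructions` swaps it out.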

## Usage and cost

Token usage accumulates on <ApiLink to="class/AiCrawlingContext#ai_usage">`context.ai_usage`</ApiLink>, and on <ApiLink to="class/AiCrawler#ai_usage">`crawler.ai_usage`</ApiLink> for the whole crawl. The accumulator is an <ApiLink to="class/AiUsageStats">`AiUsageStats`</ApiLink> with <ApiLink to="class/AiUsageStats#requests">`requests`</ApiLink>, <ApiLink to="class/AiUsageStats#input_tokens">`input_tokens`</ApiLink>, <ApiLink to="class/AiUsageStats#output_tokens">`output_tokens`</ApiLink>, and <ApiLink to="class/AiUsageStats#total_tokens">`total_tokens`</ApiLink>.

To cap spend, pass `usage_limits` (a pydantic-ai `UsageLimits`) to an extractor. It applies to every model run, and <ApiLink to="class/ExtractFunction">`extract`</ApiLink> raises `UsageLimitExceeded` when a page needs more. The example below caps each extraction, logs and skips pages that exceed it, and stops the whole crawl once a token budget is spent.

<CodeBlock className="language-python">
    {UsageLimitExample}
</CodeBlock>
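
The bookkeeping behind a per-page cap and a crawl-wide budget can be illustrated without any model calls. The sketch below is a simplification under stated assumptions: `UsageTotals` and `run_crawl` are hypothetical names that only mirror the field names listed above, the token costs are made-up numbers, and a skipped page is treated as costing nothing, which a real run would not guarantee.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical accumulator with the same field names as the guide lists."""

    requests: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    def add(self, input_tokens: int, output_tokens: int) -> None:
        self.requests += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


def run_crawl(
    pages: dict[str, tuple[int, int]],
    per_page_limit: int,
    crawl_budget: int,
) -> tuple[UsageTotals, list[str]]:
    """Walk the pages in order, skipping oversized ones and stopping on budget."""
    totals = UsageTotals()
    skipped: list[str] = []
    for url, (input_tokens, output_tokens) in pages.items():
        if totals.total_tokens >= crawl_budget:
            break  # budget spent: stop the whole crawl
        if input_tokens + output_tokens > per_page_limit:
            skipped.append(url)  # over the per-page cap: note it and move on
            continue
        totals.add(input_tokens, output_tokens)
    return totals, skipped


# Made-up (input, output) token costs per page.
pages = {'/a': (400, 50), '/b': (5000, 100), '/c': (600, 80), '/d': (300, 40)}
totals, skipped = run_crawl(pages, per_page_limit=2000, crawl_budget=1000)
print(totals, totals.total_tokens, skipped)
```

Page `/b` is skipped for exceeding the per-page cap, and `/d` is never reached because the budget check runs before each page.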

## Conclusion

This guide introduced <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> and its <ApiLink to="class/ExtractFunction">`extract`</ApiLink> helper, the <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> and <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> strategies, the built-in and custom distillers, the extract options, and how failures and cost are handled. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

docs/guides/architecture_overview.mdx

Lines changed: 10 additions & 1 deletion
@@ -49,6 +49,8 @@ class ParselCrawler
 
 class BeautifulSoupCrawler
 
+class AiCrawler
+
 class PlaywrightCrawler
 
 class AdaptivePlaywrightCrawler
@@ -65,18 +67,20 @@ BasicCrawler --|> AdaptivePlaywrightCrawler
 AbstractHttpCrawler --|> HttpCrawler
 AbstractHttpCrawler --|> ParselCrawler
 AbstractHttpCrawler --|> BeautifulSoupCrawler
+AbstractHttpCrawler --|> AiCrawler
 PlaywrightCrawler --|> StagehandCrawler
 ```
 
 ### HTTP crawlers
 
 HTTP crawlers use HTTP clients to fetch pages and parse them with HTML parsing libraries. They are fast and efficient for sites that do not require JavaScript rendering. HTTP clients are Crawlee components that wrap around HTTP libraries like [httpx](https://www.python-httpx.org/), [curl-impersonate](https://github.com/lwthiker/curl-impersonate) or [impit](https://apify.github.io/impit) and handle HTTP communication for requests and responses. You can learn more about them in the [HTTP clients guide](./http-clients).
 
-HTTP crawlers inherit from <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> and there are three crawlers that belong to this category:
+HTTP crawlers inherit from <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> and there are four crawlers that belong to this category:
 
 - <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> utilizes the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) HTML parser.
 - <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> utilizes [Parsel](https://github.com/scrapy/parsel) for parsing HTML.
 - <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> does not parse HTTP responses at all and is used when no content parsing is required.
+- <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> parses HTML with Parsel and uses an LLM to extract structured data into a validated Pydantic model.
 
 You can learn more about HTTP crawlers in the [HTTP crawlers guide](./http-crawlers).
 
@@ -120,6 +124,8 @@ class ParselCrawlingContext
 
 class BeautifulSoupCrawlingContext
 
+class AiCrawlingContext
+
 class PlaywrightPreNavCrawlingContext
 
 class PlaywrightCrawlingContext
@@ -148,6 +154,8 @@ ParsedHttpCrawlingContext --|> ParselCrawlingContext
 
 ParsedHttpCrawlingContext --|> BeautifulSoupCrawlingContext
 
+ParselCrawlingContext --|> AiCrawlingContext
+
 BasicCrawlingContext --|> PlaywrightPreNavCrawlingContext
 
 PlaywrightPreNavCrawlingContext --|> PlaywrightCrawlingContext
@@ -168,6 +176,7 @@ They have a similar inheritance structure as the crawlers, with the base class b
 - <ApiLink to="class/ParsedHttpCrawlingContext">`ParsedHttpCrawlingContext`</ApiLink> for HTTP crawlers with parsed responses.
 - <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink> for HTTP crawlers that use [Parsel](https://github.com/scrapy/parsel) for parsing.
 - <ApiLink to="class/BeautifulSoupCrawlingContext">`BeautifulSoupCrawlingContext`</ApiLink> for HTTP crawlers that use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing.
+- <ApiLink to="class/AiCrawlingContext">`AiCrawlingContext`</ApiLink> for the AI crawler, extending the Parsel context with an `extract` helper.
 - <ApiLink to="class/PlaywrightPreNavCrawlingContext">`PlaywrightPreNavCrawlingContext`</ApiLink> for Playwright crawlers before the page is navigated.
 - <ApiLink to="class/PlaywrightCrawlingContext">`PlaywrightCrawlingContext`</ApiLink> for Playwright crawlers.
 - <ApiLink to="class/AdaptivePlaywrightPreNavCrawlingContext">`AdaptivePlaywrightPreNavCrawlingContext`</ApiLink> for Adaptive Playwright crawlers before the page is navigated.
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import AiCrawler, AiCrawlingContext


class Post(BaseModel):
    """Model representing a single post."""

    title: str
    url: str


class Posts(BaseModel):
    """Model representing the extracted list of posts."""

    posts: list[Post]


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    crawler = AiCrawler(model=model, max_requests_per_crawl=5)

    @crawler.router.default_handler
    async def handler(context: AiCrawlingContext) -> None:
        # The instruction narrows what the model returns from the page.
        posts = await context.extract(
            Posts,
            additional_instructions='Extract only the top five posts on the page.',
        )

        await context.push_data(posts.model_dump())

    await crawler.run(['https://news.ycombinator.com'])


if __name__ == '__main__':
    asyncio.run(main())
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
import asyncio

from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import AiCrawler, AiCrawlingContext


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    short_text: str


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        # Set the provider with the API key explicitly.
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )

    crawler = AiCrawler(model=model, max_requests_per_crawl=5)

    @crawler.router.default_handler
    async def handler(context: AiCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Pass a Pydantic model and get a validated instance back.
        article = await context.extract(Article)

        await context.push_data(article.model_dump())

        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
import asyncio

from html_to_markdown import convert
from lxml_html_clean import Cleaner
from pydantic import BaseModel
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

from crawlee.crawlers import (
    AiCrawler,
    AiCrawlingContext,
    AiDirectExtractor,
    BaseAiHtmlDistiller,
    get_basic_ai_cleaner,
)

# Notes appended to the model instructions so it knows the input format.
MARKDOWN_PROMPT_NOTES = 'The document is Markdown converted from the HTML page.'


class MarkdownDistiller(BaseAiHtmlDistiller):
    """Distiller that cleans the page HTML and converts it to Markdown."""

    def __init__(self, cleaner: Cleaner | None = None) -> None:
        super().__init__(prompt_notes=MARKDOWN_PROMPT_NOTES)

        # Strip scripts, styles, and other noise before the conversion.
        self._cleaner = cleaner or get_basic_ai_cleaner()

    def distill(self, html: str) -> str:
        return convert(self._cleaner.clean_html(html)).content or ''


class Article(BaseModel):
    """Model representing the extracted data for an article."""

    title: str
    short_text: str


async def main() -> None:
    model = OpenAIChatModel(
        'gpt-5.4-nano',
        # Set the provider with the API key explicitly.
        provider=OpenAIProvider(api_key='your-openai-api-key'),
    )
    crawler = AiCrawler(
        # Use the custom distiller to convert the page to Markdown before extraction.
        extractor=AiDirectExtractor(model=model, distiller=MarkdownDistiller()),
        max_requests_per_crawl=5,
    )

    @crawler.router.default_handler
    async def handler(context: AiCrawlingContext) -> None:
        # Pass a Pydantic model and get a validated instance back.
        article = await context.extract(Article)
        await context.push_data(article.model_dump())

        # Enqueue links as usual, the distillation and extraction don't affect
        # the rest of the crawling logic.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
