| 1 | +--- |
| 2 | +id: ai-crawler |
| 3 | +title: AI crawler |
| 4 | +description: Learn how to use AiCrawler to extract structured data from HTML pages with an LLM. |
| 5 | +--- |
| 6 | + |
| 7 | +import ApiLink from '@site/src/components/ApiLink'; |
| 8 | +import CodeBlock from '@theme/CodeBlock'; |
| 9 | + |
| 10 | +import BasicExample from '!!raw-loader!./code_examples/ai_crawler/basic_example.py'; |
| 11 | +import AdditionalInstructionsExample from '!!raw-loader!./code_examples/ai_crawler/additional_instructions_example.py'; |
| 12 | +import CustomDistillerExample from '!!raw-loader!./code_examples/ai_crawler/custom_distiller_example.py'; |
| 13 | +import SelectorExtractorExample from '!!raw-loader!./code_examples/ai_crawler/selector_extractor_example.py'; |
| 14 | +import UsageLimitExample from '!!raw-loader!./code_examples/ai_crawler/usage_limit_example.py'; |
| 15 | + |
| 16 | +An <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> uses an LLM to extract structured data from a page. It fetches each page over plain HTTP and parses it with Parsel, then exposes an <ApiLink to="class/ExtractFunction">`extract`</ApiLink> helper: pass a Pydantic model class as the schema and get a validated instance back. Instead of writing CSS selectors for every field, you describe the data with a schema and the LLM fills it in. |
| 17 | + |
| 18 | +The model layer is [Pydantic AI](https://ai.pydantic.dev/), so any provider it supports (OpenAI, Anthropic, Gemini, Ollama, ...) works through the `model` argument. The context is an <ApiLink to="class/AiCrawlingContext">`AiCrawlingContext`</ApiLink>, which extends the <ApiLink to="class/ParselCrawlingContext">`ParselCrawlingContext`</ApiLink>, so the manual <ApiLink to="class/ParselCrawlingContext#selector">`selector`</ApiLink> and <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> stay available next to <ApiLink to="class/ExtractFunction">`extract`</ApiLink>. |
| 19 | + |
| 20 | +:::caution Experimental |
| 21 | + |
| 22 | +<ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> is experimental. Its public API may change in future releases. |
| 23 | + |
| 24 | +::: |
| 25 | + |
| 26 | +## When to use AiCrawler |
| 27 | + |
| 28 | +Use <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> when: |
| 29 | + |
| 30 | +- Selectors are unknown or brittle. The model reads the content, so it tolerates markup that varies or changes. |
| 31 | +- One schema spans many layouts. A single Pydantic model fits differently structured pages, with no per-page selectors. |
| 32 | +- Rapid prototyping. You describe the data with a schema instead of writing selectors. |
| 33 | + |
| 34 | +For pages with a stable, known structure, a plain <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> is cheaper, since it runs no model calls. |
| 35 | + |
| 36 | +<ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> fetches pages over plain HTTP and does not render JavaScript. For pages that need a browser, or for complex multi-step interactions, use <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink>. See the [Stagehand crawler guide](./stagehand-crawler). |
| 37 | + |
| 38 | +## Installation |
| 39 | + |
| 40 | +<ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> requires the `ai` optional dependency group: |
| 41 | + |
| 42 | +```bash |
| 43 | +pip install 'crawlee[ai]' |
| 44 | +``` |
| 45 | + |
| 46 | +or with uv: |
| 47 | + |
| 48 | +```bash |
| 49 | +uv add 'crawlee[ai]' |
| 50 | +``` |
| 51 | + |
| 52 | +The `ai` extra installs the OpenAI integration by default. To use another provider, add the matching [pydantic-ai-slim](https://ai.pydantic.dev/install/#use-with-pydantic-ai-slim) extra. For example, for Anthropic: |
| 53 | + |
| 54 | +```bash |
| 55 | +pip install 'crawlee[ai]' 'pydantic-ai-slim[anthropic]' |
| 56 | +``` |
| 57 | + |
| 58 | +## Basic usage |
| 59 | + |
| 60 | +Provide a `model` and call <ApiLink to="class/AiCrawlingContext#extract">`context.extract`</ApiLink> with a Pydantic model inside the handler. The example below extracts an article and pushes it to the dataset. |
| 61 | + |
| 62 | +<CodeBlock className="language-python"> |
| 63 | + {BasicExample} |
| 64 | +</CodeBlock> |
| 65 | + |
| 66 | +The `model` builds the crawler's default extractor, an <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink>. With neither `model` nor `extractor`, a default OpenAI model is used. |
| 67 | + |
| 68 | +The `model` argument accepts a provider-prefixed name or a Pydantic AI `Model` instance. |
| 69 | + |
| 70 | +```python |
| 71 | +# A provider-prefixed name reads credentials from the provider's environment variable (e.g. OPENAI_API_KEY). |
| 72 | +crawler = AiCrawler(model='openai:gpt-5.4-nano') |
| 73 | + |
| 74 | +# A Model instance takes credentials explicitly. |
| 75 | +from pydantic_ai.models.openai import OpenAIChatModel |
| 76 | +from pydantic_ai.providers.openai import OpenAIProvider |
| 77 | + |
| 78 | +model = OpenAIChatModel('gpt-5.4-nano', provider=OpenAIProvider(api_key='...')) |
| 79 | +crawler = AiCrawler(model=model) |
| 80 | +``` |
| 81 | + |
| 82 | +## Extractors |
| 83 | + |
| 84 | +An extractor turns a page into your schema. Extractors implement different strategies for working with the LLM, and each one uses an <ApiLink to="class/AiHtmlDistiller">`AiHtmlDistiller`</ApiLink> to shape the model's input. Crawlee ships two. |
| 85 | + |
| 86 | +### AiDirectExtractor |
| 87 | + |
| 88 | +<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> sends the distilled page to the model in one call. The schema is the model's output type. Pydantic AI validates the result; on a mismatch, it sends the error back to the model to fix, bounded by `retries`. |
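The validate-and-retry loop described above can be sketched in plain Python. This is only an illustration of the idea, not Pydantic AI's or Crawlee's implementation: `fake_model`, `validate_article`, and `extract_with_retries` are hypothetical names, and the "model" is a stub that corrects its output once it receives an error message.

```python
from __future__ import annotations


class ValidationFailed(Exception):
    """Raised when the output still fails validation after all retries."""


def validate_article(data: dict) -> list[str]:
    """Return validation errors for a hypothetical Article schema (empty list = valid)."""
    errors = []
    if not isinstance(data.get('title'), str):
        errors.append('title must be a string')
    if not isinstance(data.get('word_count'), int):
        errors.append('word_count must be an integer')
    return errors


def fake_model(page: str, feedback: list[str]) -> dict:
    """Stand-in for an LLM call: returns a wrong type first, fixes it after feedback."""
    if feedback:
        return {'title': 'Hello', 'word_count': 2}
    return {'title': 'Hello', 'word_count': 'two'}


def extract_with_retries(page: str, retries: int = 1) -> tuple[dict, int]:
    """Call the model, validate, and feed errors back at most `retries` times."""
    feedback: list[str] = []
    for attempt in range(retries + 1):
        output = fake_model(page, feedback)
        feedback = validate_article(output)
        if not feedback:
            # Validated output plus the number of model calls it took.
            return output, attempt + 1
    raise ValidationFailed('; '.join(feedback))


result, calls = extract_with_retries('<h1>Hello</h1><p>Hello world</p>', retries=1)
```

With `retries=1` the stub needs two calls to produce valid output; with `retries=0` the same page raises `ValidationFailed`, which shows how the bound trades robustness against extra model calls.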
| 89 | + |
| 90 | +It reads each page on its own, so extraction is accurate per page. It accepts schemas of any shape: nested models, lists, dictionaries, unions, and deep nesting. The cost is one model call per page, which scales poorly on a large site. |
| 91 | + |
| 92 | +Use `additional_instructions` to focus the model on the data you want: |
| 93 | + |
| 94 | +<CodeBlock className="language-python"> |
| 95 | + {AdditionalInstructionsExample} |
| 96 | +</CodeBlock> |
| 97 | + |
| 98 | +### AiSelectorExtractor |
| 99 | + |
| 100 | +<ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> asks the model for reusable CSS selectors on the first page of a route, caches them, and reuses them with no model call on later pages of the same layout, so it scales to large sites. When a page matches none of the cached selectors (a different markup variant), it generates and caches a new set, so one bucket can hold several variants. If selector generation fails, or the schema shape is unsupported, it degrades to the `fallback` extractor when one is set, and raises otherwise. Selectors are bucketed by `cache_tag`, which defaults to the request label, so each route keeps its own set. The cache is persisted to a <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink>, so a later run reuses selectors learned earlier. |
| 101 | + |
| 102 | +<CodeBlock className="language-python"> |
| 103 | + {SelectorExtractorExample} |
| 104 | +</CodeBlock> |
| 105 | + |
| 106 | +It supports schemas built from scalar fields, lists of scalars, lists of items, and a single nested item, one level deep. For shapes it cannot serve (such as a `dict` field), set a `fallback` or use <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink>. |
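The bucketing behavior described in this section can be modeled with a small, dependency-free sketch. It is an assumption-laden illustration, not the library's code: pages are represented as `{css_selector: text}` mappings instead of real HTML, and `selector_cache`, `apply_selectors`, `generate_selectors`, and `extract` are hypothetical names.

```python
from __future__ import annotations

from collections import defaultdict

# cache_tag -> list of selector sets, one per markup variant seen in that bucket
selector_cache: dict[str, list[dict[str, str]]] = defaultdict(list)
model_calls = 0


def apply_selectors(page: dict[str, str], selectors: dict[str, str]) -> dict[str, str] | None:
    """Return the extracted fields, or None when any selector matches nothing."""
    if not all(sel in page for sel in selectors.values()):
        return None
    return {field: page[sel] for field, sel in selectors.items()}


def generate_selectors(page: dict[str, str]) -> dict[str, str]:
    """Stand-in for the one-time model call that proposes selectors for a layout."""
    global model_calls
    model_calls += 1
    # Hypothetical heuristic: pick the selector whose name mentions "title".
    return {'title': next(sel for sel in page if 'title' in sel)}


def extract(page: dict[str, str], cache_tag: str) -> dict[str, str]:
    """Reuse a cached selector set for this bucket, or learn and cache a new one."""
    for selectors in selector_cache[cache_tag]:
        result = apply_selectors(page, selectors)
        if result is not None:
            return result  # cache hit: no model call
    selectors = generate_selectors(page)
    selector_cache[cache_tag].append(selectors)
    return {field: page[sel] for field, sel in selectors.items()}


pages = [
    {'h1.title': 'First post'},
    {'h1.title': 'Second post'},       # same layout: served from the cache
    {'div.post-title': 'Third post'},  # new variant: one more model call
]
titles = [extract(page, cache_tag='ARTICLE')['title'] for page in pages]
```

Three pages cost two stubbed model calls here, and the `ARTICLE` bucket ends up holding two selector sets, one per markup variant.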
| 107 | + |
| 108 | +Both extractors share two more knobs. `retries` caps how many times the model may fix output that fails schema validation (default 1 for <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink>, 3 for <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink>). `instructions` replaces the base task instructions entirely. |
| 109 | + |
| 110 | +## Distillers |
| 111 | + |
| 112 | +A distiller reduces raw HTML to a compact representation the model reads cheaply. Each extractor uses one. Replace it with the extractor's `distiller` argument (the crawler itself has no `distiller` argument). |
| 113 | + |
| 114 | +<ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> defaults to an <ApiLink to="class/AiCleanHtmlDistiller">`AiCleanHtmlDistiller`</ApiLink>: cleaned, structure-preserving HTML that keeps the full page text. <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> uses an <ApiLink to="class/AiSkeletonDistiller">`AiSkeletonDistiller`</ApiLink> internally to ask the model for selectors; you rarely set it yourself. |
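To make "compact representation" concrete, here is a minimal sketch of what a distiller might do, using only the standard library. It is not the algorithm of `AiCleanHtmlDistiller`; the `SimpleDistiller` class, the dropped tags, and the kept attributes are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

DROPPED_TAGS = {'script', 'style', 'noscript'}  # assumed noise for the model
KEPT_ATTRS = {'class', 'id', 'href'}            # assumed useful attributes


class SimpleDistiller(HTMLParser):
    """Hypothetical distiller: drops scripts/styles and most attributes, keeps text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
        elif not self._skip_depth:
            kept = ''.join(f' {k}="{v}"' for k, v in attrs if k in KEPT_ATTRS and v)
            self.parts.append(f'<{tag}{kept}>')

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        text = ' '.join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(text)


def distill(html: str) -> str:
    """Return a compact version of `html` with the structure and text preserved."""
    parser = SimpleDistiller()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


raw = '<div style="color:red" class="post"><script>track()</script><h1>  Hello\n world </h1></div>'
distilled = distill(raw)
```

The distilled string keeps the `div` and `h1` structure and the visible text while dropping the inline style and the script, which is the kind of reduction that lowers token cost per page.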
| 115 | + |
| 116 | +### Custom distiller |
| 117 | + |
| 118 | +Subclass <ApiLink to="class/BaseAiHtmlDistiller">`BaseAiHtmlDistiller`</ApiLink> and implement <ApiLink to="class/BaseAiHtmlDistiller#distill">`distill`</ApiLink> to send a different representation. Set `prompt_notes` so the model knows the input format. The extractor appends the notes to its instructions. |
| 119 | + |
| 120 | +The example below converts the cleaned page to Markdown with [html-to-markdown](https://pypi.org/project/html-to-markdown/), an extra dependency: |
| 121 | + |
| 122 | +```bash |
| 123 | +pip install html-to-markdown |
| 124 | +``` |
| 125 | + |
| 126 | +<CodeBlock className="language-python"> |
| 127 | + {CustomDistillerExample} |
| 128 | +</CodeBlock> |
| 129 | + |
| 130 | +## Extract options |
| 131 | + |
| 132 | +<ApiLink to="class/AiCrawlingContext#extract">`context.extract`</ApiLink> takes options alongside the schema: |
| 133 | + |
| 134 | +- `scope` - a CSS selector that restricts extraction to the first matching subtree (e.g. `main` or `article.post`). It saves tokens and keeps the model away from unrelated parts of the page. |
| 135 | +- `cache_tag` - the bucket for cached selectors. It defaults to the request label. |
| 136 | +- `additional_instructions` - extra instructions for this call, appended to the base instructions. With <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> they steer the one-time selector generation, not each extraction, so use them to point the model at the right region. |
| 137 | + |
| 138 | +## Usage and cost |
| 139 | + |
| 140 | +Token usage accumulates on <ApiLink to="class/AiCrawlingContext#ai_usage">`context.ai_usage`</ApiLink>, and on <ApiLink to="class/AiCrawler#ai_usage">`crawler.ai_usage`</ApiLink> for the whole crawl. The accumulator is an <ApiLink to="class/AiUsageStats">`AiUsageStats`</ApiLink> with <ApiLink to="class/AiUsageStats#requests">`requests`</ApiLink>, <ApiLink to="class/AiUsageStats#input_tokens">`input_tokens`</ApiLink>, <ApiLink to="class/AiUsageStats#output_tokens">`output_tokens`</ApiLink>, and <ApiLink to="class/AiUsageStats#total_tokens">`total_tokens`</ApiLink>. |
| 141 | + |
| 142 | +To cap spend, pass `usage_limits` (a pydantic-ai `UsageLimits`) to an extractor. It applies to every model run, and <ApiLink to="class/ExtractFunction">`extract`</ApiLink> raises `UsageLimitExceeded` when a page needs more. The example below caps each extraction, logs and skips pages that exceed it, and stops the whole crawl once a token budget is spent. |
| 143 | + |
| 144 | +<CodeBlock className="language-python"> |
| 145 | + {UsageLimitExample} |
| 146 | +</CodeBlock> |
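The accumulate-then-check pattern behind a crawl-wide token budget can also be shown without any Crawlee or pydantic-ai imports. The sketch below is hypothetical: `UsageTotals` only mirrors the field names listed above, `total_tokens` is assumed to be the sum of input and output tokens, and the token counts are made-up inputs to the simulation.

```python
from dataclasses import dataclass


@dataclass
class UsageTotals:
    """Hypothetical accumulator mirroring the usage fields listed above."""

    requests: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        # Assumption: the total is simply input plus output tokens.
        return self.input_tokens + self.output_tokens

    def add(self, input_tokens: int, output_tokens: int) -> None:
        """Record one model call."""
        self.requests += 1
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens


TOKEN_BUDGET = 5_000  # assumed crawl-wide budget
crawl_usage = UsageTotals()
processed: list[str] = []

# (url, input_tokens, output_tokens) for each simulated model call
simulated_calls = [('/a', 1800, 200), ('/b', 2500, 300), ('/c', 1900, 100), ('/d', 900, 100)]
for url, tokens_in, tokens_out in simulated_calls:
    if crawl_usage.total_tokens >= TOKEN_BUDGET:
        break  # budget spent: stop scheduling further extractions
    crawl_usage.add(tokens_in, tokens_out)
    processed.append(url)
```

Because the check runs before each call, the last accepted call can overshoot the budget; a per-call limit is what bounds the size of that overshoot.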
| 147 | + |
| 148 | +## Conclusion |
| 149 | + |
| 150 | +This guide introduced <ApiLink to="class/AiCrawler">`AiCrawler`</ApiLink> and its <ApiLink to="class/ExtractFunction">`extract`</ApiLink> helper, the <ApiLink to="class/AiDirectExtractor">`AiDirectExtractor`</ApiLink> and <ApiLink to="class/AiSelectorExtractor">`AiSelectorExtractor`</ApiLink> strategies, the built-in and custom distillers, the extract options, and how token usage is tracked and capped. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/crawlee-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! |