| 1 | +--- |
| 2 | +id: stagehand-crawler-guide |
| 3 | +title: "StagehandCrawler guide" |
| 4 | +sidebar_label: "StagehandCrawler" |
| 5 | +description: AI-powered web crawling with natural language browser automation |
| 6 | +--- |
| 7 | + |
| 8 | +import ApiLink from '@site/src/components/ApiLink'; |
| 9 | +import Tabs from '@theme/Tabs'; |
| 10 | +import TabItem from '@theme/TabItem'; |
| 11 | +import CodeBlock from '@theme/CodeBlock'; |
| 12 | + |
| 13 | +import StagehandBasicSource from '!!raw-loader!./stagehand_crawler_basic.ts'; |
| 14 | +import StagehandExtractSource from '!!raw-loader!./stagehand_crawler_extract.ts'; |
| 15 | +import StagehandCombinedSource from '!!raw-loader!./stagehand_crawler_combined.ts'; |
| 16 | + |
| 17 | +​<ApiLink to="stagehand-crawler/class/StagehandCrawler">`StagehandCrawler`</ApiLink> combines Crawlee's powerful crawling infrastructure with [Stagehand's](https://github.com/browserbase/stagehand) AI-powered browser automation. Instead of writing CSS selectors or XPath queries, you can interact with web pages using natural language instructions. |
| 18 | + |
| 19 | +## What is Stagehand |
| 20 | + |
| 21 | +[Stagehand](https://github.com/browserbase/stagehand) is an AI-powered browser automation library from Browserbase. It allows you to control a browser using natural language commands like "click the login button" or "extract the product price". Under the hood, Stagehand uses large language models (OpenAI, Anthropic, or Google) to understand the page structure and execute your instructions. |
| 22 | + |
| 23 | +## How StagehandCrawler works |
| 24 | + |
| 25 | +StagehandCrawler extends <ApiLink to="browser-crawler/class/BrowserCrawler">`BrowserCrawler`</ApiLink> and enhances each page with AI-powered methods. Here's the architecture: |
| 26 | + |
| 27 | +1. **Stagehand launches the browser** - When a new browser is needed, Stagehand initializes and launches a Chromium browser |
| 28 | +2. **Playwright connects via CDP** - Crawlee connects Playwright to the same browser using Chrome DevTools Protocol (CDP) |
| 29 | +3. **Pages are enhanced with AI methods** - Each page gets `act()`, `extract()`, `observe()`, and `agent()` methods |
| 30 | +4. **BrowserPool manages scaling** - Crawlee's BrowserPool handles browser lifecycle, retries, and concurrency |
| 31 | + |
| 32 | +``` |
| 33 | +┌─────────────────────────────────────────────────────────┐ |
| 34 | +│ StagehandCrawler │ |
| 35 | +├─────────────────────────────────────────────────────────┤ |
| 36 | +│ BrowserPool (manages browser lifecycle & concurrency) │ |
| 37 | +├─────────────────────────────────────────────────────────┤ |
| 38 | +│ Stagehand Instance │ |
| 39 | +│ ├── Launches Chromium browser │ |
| 40 | +│ ├── Provides CDP endpoint │ |
| 41 | +│ └── Handles AI operations (act/extract/observe) │ |
| 42 | +├─────────────────────────────────────────────────────────┤ |
| 43 | +│ Playwright (connected via CDP) │ |
| 44 | +│ └── Standard page operations (goto, click, type, etc.) │ |
| 45 | +└─────────────────────────────────────────────────────────┘ |
| 46 | +``` |
| 47 | + |
| 48 | +## Key features |
| 49 | + |
| 50 | +The enhanced page object provides four AI-powered methods: |
| 51 | + |
| 52 | +### `page.act(instruction)` - Perform actions |
| 53 | + |
| 54 | +Execute actions on the page using natural language. See [Stagehand act() documentation](https://docs.stagehand.dev/reference/act) for more details. |
| 55 | + |
| 56 | +```ts |
| 57 | +await page.act('Click the "Add to Cart" button'); |
| 58 | +await page.act('Fill in the email field with test@example.com'); |
| 59 | +await page.act('Scroll down to load more products'); |
| 60 | +``` |
| 61 | + |
| 62 | +### `page.extract(instruction, schema)` - Extract structured data |
| 63 | + |
| 64 | +Extract data from the page using a Zod schema for type safety. See [Stagehand extract() documentation](https://docs.stagehand.dev/reference/extract) for more details. |
| 65 | + |
| 66 | +```ts |
| 67 | +import { z } from 'zod'; |
| 68 | + |
| 69 | +const productSchema = z.object({ |
| 70 | + title: z.string(), |
| 71 | + price: z.number(), |
| 72 | + description: z.string(), |
| 73 | +}); |
| 74 | + |
| 75 | +const product = await page.extract('Get the product details', productSchema); |
| 76 | +// product is typed as { title: string, price: number, description: string } |
| 77 | +``` |
| 78 | + |
| 79 | +### `page.observe()` - Discover page actions |
| 80 | + |
| 81 | +Analyze the page and get AI-suggested actions. This is useful for exploring unfamiliar pages or building adaptive scrapers. See [Stagehand observe() documentation](https://docs.stagehand.dev/reference/observe) for more details. |
| 82 | + |
| 83 | +```ts |
| 84 | +const actions = await page.observe(); |
| 85 | +// Returns available actions like: |
| 86 | +// [ |
| 87 | +// { action: 'click', element: 'Load More button', selector: '.load-more' }, |
| 88 | +// { action: 'click', element: 'Next Page link', selector: 'a.pagination-next' }, |
| 89 | +// { action: 'fill', element: 'Search input', selector: '#search' }, |
| 90 | +// ] |
| 91 | + |
| 92 | +// Use observe to find pagination dynamically |
| 93 | +for (const action of actions) { |
| 94 | + if (action.element?.toLowerCase().includes('next page')) { |
| 95 | + await page.act(`Click the ${action.element}`); |
| 96 | + break; |
| 97 | + } |
| 98 | +} |
| 99 | +``` |
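
The pagination loop above can be factored into a small reusable helper. The following is a minimal sketch, assuming the observed actions have the shape shown in the example output (`action`, `element`, `selector`); the `ObservedAction` interface and `findActionByKeyword` function are hypothetical names used for illustration and are not part of the Stagehand or Crawlee API.

```typescript
// Hypothetical shape, mirroring the fields in the example output above.
interface ObservedAction {
    action: string;
    element?: string;
    selector?: string;
}

// Returns the first observed action whose element description
// contains any of the given keywords (case-insensitive).
function findActionByKeyword(
    actions: ObservedAction[],
    keywords: string[],
): ObservedAction | undefined {
    const needles = keywords.map((keyword) => keyword.toLowerCase());
    return actions.find((candidate) => {
        const label = candidate.element?.toLowerCase() ?? '';
        return needles.some((needle) => label.includes(needle));
    });
}

// Sample data copied from the example output above.
const sampleActions: ObservedAction[] = [
    { action: 'click', element: 'Load More button', selector: '.load-more' },
    { action: 'click', element: 'Next Page link', selector: 'a.pagination-next' },
    { action: 'fill', element: 'Search input', selector: '#search' },
];

const nextPageAction = findActionByKeyword(sampleActions, ['next page']);
```

Inside a request handler, the result could then be passed to `page.act()` in the same way as in the loop above. Keeping the matching logic in a pure function makes it possible to unit-test it without calling an LLM.
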
| 100 | + |
| 101 | +### `page.agent(config)` - Autonomous agents |
| 102 | + |
| 103 | +Create an autonomous agent for complex multi-step workflows. Unlike `act()`, which executes a single action, `agent()` can plan and execute multiple steps on its own to achieve a goal. See [Stagehand agent() documentation](https://docs.stagehand.dev/reference/agent) for more details. |
| 104 | + |
| 105 | +```ts |
| 106 | +const agent = page.agent({ model: 'gpt-4.1-mini' }); |
| 107 | +const result = await agent.execute('Find the cheapest laptop and add it to cart'); |
| 108 | +``` |
| 109 | + |
| 110 | +**When to use `act()` vs `agent()`:** |
| 111 | +- Use `act()` for single, discrete actions ("click this button", "fill this form") |
| 112 | +- Use `agent()` for goals requiring multiple steps with decision-making ("find and purchase the cheapest item") |
| 113 | + |
| 114 | +Note that `agent()` makes multiple LLM calls and can be slower and more expensive than sequential `act()` calls where you control the flow. |
| 115 | + |
| 116 | +## Requirements |
| 117 | + |
| 118 | +StagehandCrawler requires an API key for the AI model provider. The recommended way is to use the `apiKey` option: |
| 119 | + |
| 120 | +```ts |
| 121 | +const crawler = new StagehandCrawler({ |
| 122 | + stagehandOptions: { |
| 123 | + model: 'openai/gpt-4.1-mini', |
| 124 | + apiKey: 'sk-...', // Your OpenAI API key |
| 125 | + }, |
| 126 | +}); |
| 127 | +``` |
| 128 | + |
| 129 | +Alternatively, you can use environment variables (used as fallback when `apiKey` is not provided): |
| 130 | + |
| 131 | +- **OpenAI**: `OPENAI_API_KEY` |
| 132 | +- **Anthropic**: `ANTHROPIC_API_KEY` |
| 133 | +- **Google**: `GOOGLE_API_KEY` |
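
The fallback order described above can be sketched as a small lookup. This is an illustration only, assuming the provider is the part of the model name before the slash (as in `openai/gpt-4.1-mini`); `resolveApiKey` and `PROVIDER_ENV_VARS` are hypothetical names, and the crawler's actual resolution logic may differ.

```typescript
type Env = Record<string, string | undefined>;

// Provider prefix -> environment variable, as listed above.
const PROVIDER_ENV_VARS: Record<string, string> = {
    openai: 'OPENAI_API_KEY',
    anthropic: 'ANTHROPIC_API_KEY',
    google: 'GOOGLE_API_KEY',
};

// An explicit apiKey wins; otherwise fall back to the provider's
// environment variable. Returns undefined if neither is available.
function resolveApiKey(
    model: string,
    explicitKey: string | undefined,
    env: Env,
): string | undefined {
    if (explicitKey) return explicitKey;
    // Assumption: model names look like "<provider>/<model-name>".
    const [provider = ''] = model.split('/');
    const envVarName = PROVIDER_ENV_VARS[provider];
    return envVarName ? env[envVarName] : undefined;
}
```

In a Node.js process, `process.env` would be passed as the `env` argument.
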
| 134 | + |
| 135 | +## Limitations |
| 136 | + |
| 137 | +Some Crawlee features work differently or are unavailable with StagehandCrawler: |
| 138 | + |
| 139 | +### Chromium only |
| 140 | + |
| 141 | +Stagehand uses Chrome DevTools Protocol (CDP), so only Chromium browsers are supported. The `launcher` option is ignored - you cannot use Firefox or WebKit. |
| 142 | + |
| 143 | +### Reduced fingerprinting control |
| 144 | + |
| 145 | +Since Stagehand controls the browser launch process, Crawlee's advanced fingerprinting features are limited: |
| 146 | + |
| 147 | +- **Browser fingerprints** - Basic fingerprinting (viewport, user-agent) is applied, but low-level browser properties cannot be modified |
| 148 | +- **`launchOptions`** - Only a subset of Playwright launch options is passed through to Stagehand (`headless`, `args`, `executablePath`, `proxy`, `viewport`) |
| 149 | +- **Browser context options** - Custom context configurations are not fully supported since Stagehand manages the browser context |
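
To check ahead of time which of your launch options would be forwarded, you could filter them against the list above. This is a minimal sketch, not the crawler's internal implementation; `splitLaunchOptions` and `SUPPORTED_LAUNCH_OPTIONS` are hypothetical names, and the sketch assumes that unsupported options are simply not forwarded.

```typescript
// Launch options listed above as being passed through to Stagehand.
const SUPPORTED_LAUNCH_OPTIONS: readonly string[] = [
    'headless',
    'args',
    'executablePath',
    'proxy',
    'viewport',
];

// Splits an options object into the part that would be forwarded
// and the names of the options that would not be.
function splitLaunchOptions(options: Record<string, unknown>): {
    forwarded: Record<string, unknown>;
    dropped: string[];
} {
    const forwarded: Record<string, unknown> = {};
    const dropped: string[] = [];
    for (const [key, value] of Object.entries(options)) {
        if (SUPPORTED_LAUNCH_OPTIONS.includes(key)) {
            forwarded[key] = value;
        } else {
            dropped.push(key);
        }
    }
    return { forwarded, dropped };
}

const launchCheck = splitLaunchOptions({
    headless: true,
    args: ['--no-sandbox'],
    slowMo: 50, // a Playwright launch option that is not in the supported list
});
```

Logging `launchCheck.dropped` during development makes it obvious when a configuration relies on an option that has no effect here.
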
| 150 | + |
| 151 | +Stagehand provides its own anti-detection measures, but you have less granular control compared to PlaywrightCrawler. |
| 152 | + |
| 153 | +## When to use StagehandCrawler |
| 154 | + |
| 155 | +**Use StagehandCrawler when:** |
| 156 | +- Pages have complex, dynamic structures that are hard to scrape with selectors |
| 157 | +- You need to interact with pages in ways that are difficult to express programmatically |
| 158 | +- You want to quickly prototype scrapers without writing detailed selectors |
| 159 | +- The target website frequently changes its structure |
| 160 | + |
| 161 | +**Consider alternatives when:** |
| 162 | +- You need maximum performance (use CheerioCrawler or PlaywrightCrawler) |
| 163 | +- You need to minimize costs (LLM API calls add up) |
| 164 | +- You need fine-grained browser control (use PlaywrightCrawler) |
| 165 | +- You need Firefox or WebKit support (use PlaywrightCrawler) |
| 166 | + |
| 167 | +## Basic example |
| 168 | + |
| 169 | +Here's a simple example that extracts code examples from the Crawlee website: |
| 170 | + |
| 171 | +<CodeBlock language="ts" title="src/main.ts">{StagehandBasicSource}</CodeBlock> |
| 172 | + |
| 173 | +## Data extraction example |
| 174 | + |
| 175 | +Here's an example showing structured data extraction with Zod schemas: |
| 176 | + |
| 177 | +<CodeBlock language="ts" title="src/main.ts">{StagehandExtractSource}</CodeBlock> |
| 178 | + |
| 179 | +## Configuration options |
| 180 | + |
| 181 | +### Stagehand options |
| 182 | + |
| 183 | +Configure the AI behavior through `stagehandOptions`: |
| 184 | + |
| 185 | +```ts |
| 186 | +const crawler = new StagehandCrawler({ |
| 187 | + stagehandOptions: { |
| 188 | + // Environment: 'LOCAL' or 'BROWSERBASE' |
| 189 | + env: 'LOCAL', |
| 190 | + |
| 191 | + // AI model to use (e.g., 'openai/gpt-4.1-mini', 'anthropic/claude-sonnet-4-20250514') |
| 192 | + model: 'openai/gpt-4.1-mini', |
| 193 | + |
| 194 | + // API key for the LLM provider (falls back to environment variables when omitted)
| 195 | + apiKey: process.env.OPENAI_API_KEY, |
| 196 | + |
| 197 | + // Logging verbosity: 0 (minimal), 1 (standard), 2 (debug) |
| 198 | + verbose: 1, |
| 199 | + |
| 200 | + // Enable automatic error recovery |
| 201 | + selfHeal: true, |
| 202 | + |
| 203 | + // Timeout for DOM to stabilize (ms) |
| 204 | + domSettleTimeout: 30000, |
| 205 | + }, |
| 206 | +}); |
| 207 | +``` |
| 208 | + |
| 209 | +### Environment variables |
| 210 | + |
| 211 | +Stagehand options can alternatively be set via environment variables. Programmatic options always take precedence over environment variables: |
| 212 | + |
| 213 | +| Environment variable | Option | Notes | |
| 214 | +|---------------------|--------|-------| |
| 215 | +| `OPENAI_API_KEY` | `apiKey` | Fallback for OpenAI models | |
| 216 | +| `ANTHROPIC_API_KEY` | `apiKey` | Fallback for Anthropic models | |
| 217 | +| `GOOGLE_API_KEY` | `apiKey` | Fallback for Google models | |
| 218 | +| `STAGEHAND_ENV` | `env` | | |
| 219 | +| `STAGEHAND_MODEL` | `model` | | |
| 220 | +| `STAGEHAND_VERBOSE` | `verbose` | | |
| 221 | +| `STAGEHAND_API_KEY` | `apiKey` | Browserbase API key | |
| 222 | +| `STAGEHAND_PROJECT_ID` | `projectId` | Browserbase project ID | |
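
The precedence rule above (programmatic options first, environment variables second) can be illustrated with a short merge function. This is a sketch under stated assumptions: `mergeWithEnv` and `ResolvedOptions` are hypothetical names, only three of the variables from the table are covered, and the way the crawler actually parses these values (for example `STAGEHAND_VERBOSE`) is not specified here.

```typescript
type StagehandEnvName = 'LOCAL' | 'BROWSERBASE';
type Verbosity = 0 | 1 | 2;

interface ResolvedOptions {
    env: StagehandEnvName | undefined;
    model: string | undefined;
    verbose: Verbosity | undefined;
}

// Programmatic options take precedence; environment variables fill the gaps.
function mergeWithEnv(
    options: Partial<ResolvedOptions>,
    env: Record<string, string | undefined>,
): ResolvedOptions {
    const rawEnv = env['STAGEHAND_ENV'];
    const rawVerbose = env['STAGEHAND_VERBOSE'];

    // Assumption: unrecognized values are ignored rather than rejected.
    const envFromVar: StagehandEnvName | undefined =
        rawEnv === 'LOCAL' || rawEnv === 'BROWSERBASE' ? rawEnv : undefined;
    const verboseFromVar: Verbosity | undefined =
        rawVerbose === '0' ? 0 : rawVerbose === '1' ? 1 : rawVerbose === '2' ? 2 : undefined;

    return {
        env: options.env ?? envFromVar,
        model: options.model ?? env['STAGEHAND_MODEL'],
        verbose: options.verbose ?? verboseFromVar,
    };
}

const mergedOptions = mergeWithEnv(
    { model: 'openai/gpt-4.1-mini' },
    { STAGEHAND_MODEL: 'anthropic/claude-sonnet-4-20250514', STAGEHAND_VERBOSE: '2' },
);
```

Here `model` comes from the programmatic option, `verbose` comes from the environment variable, and `env` stays undefined because neither source sets it.
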
| 223 | + |
| 224 | +## Using with Browserbase |
| 225 | + |
| 226 | +For cloud browser infrastructure, you can use [Browserbase](https://browserbase.com/): |
| 227 | + |
| 228 | +```ts |
| 229 | +const crawler = new StagehandCrawler({ |
| 230 | + stagehandOptions: { |
| 231 | + env: 'BROWSERBASE', |
| 232 | + apiKey: process.env.BROWSERBASE_API_KEY, |
| 233 | + projectId: process.env.BROWSERBASE_PROJECT_ID, |
| 234 | + model: 'openai/gpt-4.1-mini', |
| 235 | + }, |
| 236 | +}); |
| 237 | +``` |
| 238 | + |
| 239 | +## Combining AI and standard methods |
| 240 | + |
| 241 | +You can mix AI-powered methods with standard Playwright methods: |
| 242 | + |
| 243 | +<CodeBlock language="ts" title="src/main.ts">{StagehandCombinedSource}</CodeBlock> |
| 244 | + |
| 245 | +## Further reading |
| 246 | + |
| 247 | +- [Stagehand documentation](https://docs.stagehand.dev/) |
| 248 | +- [Browserbase documentation](https://docs.browserbase.com/) |
| 249 | +- <ApiLink to="stagehand-crawler/class/StagehandCrawler">`StagehandCrawler` API reference</ApiLink> |