
Commit fb9a3f5

docs: Add deployment guide "deploy on AWS" (#1679)

### Description

Add deployment guides for AWS Lambda with HTTP-based crawler and Playwright crawler.

### Issues

- Closes: #698
1 parent c019003 commit fb9a3f5

File tree

5 files changed: +365 −8 lines


docs/deployment/aws_lambda.mdx

Lines changed: 190 additions & 0 deletions
---
id: aws-lambda
title: Deploy on AWS Lambda
description: Prepare your crawler to run on AWS Lambda.
---

import ApiLink from '@site/src/components/ApiLink';

import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupCrawlerLambda from '!!raw-loader!./code_examples/aws/beautifulsoup_crawler_lambda.py';
import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py';
import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile';

[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that lets you run code without provisioning or managing servers. This guide covers deploying <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> and <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.

The code examples are based on the [BeautifulSoupCrawler example](../examples/beautifulsoup-crawler).

## BeautifulSoupCrawler on AWS Lambda

For simple crawlers that don't require browser rendering, you can deploy using a ZIP archive.

### Updating the code

When instantiating a crawler, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead.

Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute.

:::important

Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance.

**TL;DR: Keep your Lambda stateless.**

:::
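To see why this matters, here is a minimal, crawler-free sketch (plain Python, no AWS or Crawlee calls; the handler names are illustrative) of how module-level state survives between warm invocations:

```python
from typing import Any

# BAD: created once per *process*. On a warm start AWS reuses the same
# Python process, so this list is shared by every subsequent invocation.
_results: list[str] = []


def bad_handler(event: dict[str, Any], _context: Any) -> dict[str, Any]:
    _results.append(event['url'])  # Leaks into the next invocation.
    return {'seen': list(_results)}


def good_handler(event: dict[str, Any], _context: Any) -> dict[str, Any]:
    results: list[str] = [event['url']]  # Fresh state on every invocation.
    return {'seen': results}
```

Calling `bad_handler` twice in the same process returns an ever-growing list; `good_handler` always starts clean, which is the behavior you want from a stateless Lambda.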
Finally, return the scraped data from the Lambda when the crawler run ends.

<CodeBlock language="python" title="lambda_function.py">
{BeautifulSoupCrawlerLambda}
</CodeBlock>

### Preparing the environment

Lambda requires all dependencies to be included in the deployment package. Create a virtual environment and install dependencies:

```bash
python3.14 -m venv .venv
source .venv/bin/activate
pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools'
```

[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Including it in your dependencies is recommended to avoid version mismatches with the copy bundled in the Lambda runtime.

### Creating the ZIP archive

Create a ZIP archive from your project, including dependencies from the virtual environment:

```bash
cd .venv/lib/python3.14/site-packages
zip -r ../../../../package.zip .
cd ../../../../
zip package.zip lambda_function.py
```

:::note Large dependencies?

AWS limits direct uploads to 50 MB and the unzipped deployment package to 250 MB.

A better way to manage dependencies is to use Lambda Layers. With Layers, you can share dependencies between multiple Lambda functions and keep the function code itself as slim as possible.

To create a Lambda Layer:

1. Create a `python/` folder and copy dependencies from `site-packages` into it
2. Create a zip archive: `zip -r layer.zip python/`
3. Create a new Lambda Layer from the archive (you may need to upload it to S3 first)
4. Attach the Layer to your Lambda function

:::
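The folder layout in the archive matters: Lambda unpacks layers into `/opt` and expects Python packages under a top-level `python/` directory. A small standard-library sketch of steps 1 and 2 (the paths and function name are illustrative):

```python
import zipfile
from pathlib import Path


def build_layer_zip(site_packages: Path, out: Path) -> Path:
    """Zip the contents of site_packages under a top-level 'python/' dir."""
    with zipfile.ZipFile(out, 'w', zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(site_packages.rglob('*')):
            if file.is_file():
                # Python layers must ship packages under "python/" so the
                # runtime can find them on sys.path after extraction.
                arcname = Path('python') / file.relative_to(site_packages)
                zf.write(file, arcname.as_posix())
    return out
```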
### Creating the Lambda function

Create the Lambda function in the AWS Lambda Console:

1. Navigate to `Lambda` in the [AWS Management Console](https://aws.amazon.com/console/).
2. Click **Create function**.
3. Select **Author from scratch**.
4. Enter a **Function name**, for example `BeautifulSoupTest`.
5. Choose a **Python runtime** that matches the version used in your virtual environment (for example, Python 3.14).
6. Click **Create function** to finish.

Once created, upload `package.zip` as the code source in the AWS Lambda Console using the **Upload from** button.

In Lambda Runtime Settings, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can use the default value `lambda_function.lambda_handler`.
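The handler string follows the `<module>.<function>` convention. As a naming sanity check, this minimal, self-contained skeleton matches the default handler string (illustrative only; it does no crawling):

```python
# lambda_function.py -- matches the handler string
# "lambda_function.lambda_handler" (<module>.<function>).
import json
from typing import Any


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    # Lambda calls this with the invocation event and a runtime context
    # object; return a JSON-serializable response.
    return {'statusCode': 200, 'body': json.dumps({'received': event})}
```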
:::tip Configuration

In the Configuration tab, you can adjust:

- **Memory**: Memory size can greatly affect execution speed. A minimum of 256–512 MB is recommended.
- **Timeout**: Set according to the size of the website you are scraping (1 minute for the example code).
- **Ephemeral storage**: Size of the `/tmp` directory.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::

After the Lambda deploys, you can test it by clicking the **Test** button. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler.
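For example, a small sketch of reading start URLs from the event with a fallback (the `start_urls` key and helper name are an illustrative convention, not a field AWS itself defines):

```python
from typing import Any

DEFAULT_START_URLS = ['https://crawlee.dev']


def start_urls_from_event(event: dict[str, Any]) -> list[str]:
    """Return a validated list of start URLs from the event, or a default."""
    urls = event.get('start_urls')
    if isinstance(urls, list) and all(isinstance(u, str) for u in urls):
        return urls
    return DEFAULT_START_URLS
```

Inside the handler, the result would be passed to `crawler.run(...)` in place of the hard-coded URL list.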
## PlaywrightCrawler on AWS Lambda

For crawlers that require browser rendering, you need to deploy using Docker container images because Playwright and browser binaries exceed Lambda's ZIP deployment size limits.

### Updating the code

As with <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and wrap the logic in a `lambda_handler` function. Additionally, configure `browser_launch_options` with flags optimized for serverless environments. These flags disable sandboxing and GPU features that aren't available in Lambda's containerized runtime.

<CodeBlock language="python" title="main.py">
{PlaywrightCrawlerLambda}
</CodeBlock>

### Installing and configuring AWS CLI

Install the AWS CLI for your operating system following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

Authenticate by running:

```bash
aws login
```

### Preparing the project

Initialize the project by running `uvx 'crawlee[cli]' create`.

Or use a single command if you don't need interactive mode:

```bash
uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install
```

Add the following dependencies:

```bash
uv add awslambdaric aws-lambda-powertools boto3
```

[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Use it if your function integrates with any other AWS services.

The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding `ENTRYPOINT` and updating `CMD`:

<CodeBlock language="dockerfile" title="Dockerfile">
{PlaywrightCrawlerDockerfile}
</CodeBlock>

### Building and pushing the Docker image

Create a repository `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) in the same region where your Lambda function will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html).

Navigate to the created repository and click the **View push commands** button. This opens a window with console commands for uploading the Docker image to your repository. Execute them.

Example:

```bash
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data}
docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright .
docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest
docker push {user-specific-data}/lambda/aws-playwright:latest
```

### Creating the Lambda function

1. Navigate to `Lambda` in the [AWS Management Console](https://aws.amazon.com/console/).
2. Click **Create function**.
3. Select **Container image**.
4. Browse and select your ECR image.
5. Click **Create function** to finish.

:::tip Configuration

In the Configuration tab, you can adjust resources. Playwright crawlers require more resources than BeautifulSoup crawlers:

- **Memory**: Minimum 1024 MB recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages.
- **Timeout**: Set according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls.
- **Ephemeral storage**: The default 512 MB is usually sufficient unless you download large files.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::

After the Lambda deploys, click the **Test** button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler.
docs/deployment/code_examples/aws/beautifulsoup_crawler_lambda.py

Lines changed: 61 additions & 0 deletions
```python
import asyncio
import json
from datetime import timedelta
from typing import Any

from aws_lambda_powertools.utilities.typing import LambdaContext

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset, RequestQueue


async def main() -> str:
    # highlight-start
    # Disable writing storage data to the file system
    storage_client = MemoryStorageClient()
    # highlight-end

    # Initialize storages
    dataset = await Dataset.open(storage_client=storage_client)
    request_queue = await RequestQueue.open(storage_client=storage_client)

    crawler = BeautifulSoupCrawler(
        storage_client=storage_client,
        max_request_retries=1,
        request_handler_timeout=timedelta(seconds=30),
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'h1s': [h1.text for h1 in context.soup.find_all('h1')],
            'h2s': [h2.text for h2 in context.soup.find_all('h2')],
            'h3s': [h3.text for h3 in context.soup.find_all('h3')],
        }

        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

    # Extract data saved in `Dataset`
    data = await crawler.get_data()

    # Clean up storages after the crawl
    await dataset.drop()
    await request_queue.drop()

    # Serialize the list of scraped items to JSON string
    return json.dumps(data.items)


def lambda_handler(_event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]:
    result = asyncio.run(main())
    # Return the response with results
    return {'statusCode': 200, 'body': result}
```
docs/deployment/code_examples/aws/playwright_crawler_lambda.py

Lines changed: 73 additions & 0 deletions
```python
import asyncio
import json
from datetime import timedelta
from typing import Any

from aws_lambda_powertools.utilities.typing import LambdaContext

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import Dataset, RequestQueue


async def main() -> str:
    # highlight-start
    # Disable writing storage data to the file system
    storage_client = MemoryStorageClient()
    # highlight-end

    # Initialize storages
    dataset = await Dataset.open(storage_client=storage_client)
    request_queue = await RequestQueue.open(storage_client=storage_client)

    crawler = PlaywrightCrawler(
        storage_client=storage_client,
        max_request_retries=1,
        request_handler_timeout=timedelta(seconds=30),
        max_requests_per_crawl=10,
        # highlight-start
        # Configure Playwright to run in AWS Lambda environment
        browser_launch_options={
            'args': [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
                '--disable-gpu',
                '--single-process',
            ]
        },
        # highlight-end
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'h1s': await context.page.locator('h1').all_text_contents(),
            'h2s': await context.page.locator('h2').all_text_contents(),
            'h3s': await context.page.locator('h3').all_text_contents(),
        }

        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

    # Extract data saved in `Dataset`
    data = await crawler.get_data()

    # Clean up storages after the crawl
    await dataset.drop()
    await request_queue.drop()

    # Serialize the list of scraped items to JSON string
    return json.dumps(data.items)


def lambda_handler(_event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]:
    result = asyncio.run(main())
    # Return the response with results
    return {'statusCode': 200, 'body': result}
```
docs/deployment/code_examples/aws/playwright_dockerfile

Lines changed: 36 additions & 0 deletions
```dockerfile
FROM apify/actor-python-playwright:3.14

RUN apt update && apt install -yq git && rm -rf /var/lib/apt/lists/*

RUN pip install -U pip setuptools \
    && pip install 'uv<1'

ENV UV_PROJECT_ENVIRONMENT="/usr/local"

COPY pyproject.toml uv.lock ./

RUN echo "Python version:" \
    && python --version \
    && echo "Installing dependencies:" \
    && PLAYWRIGHT_INSTALLED=$(pip freeze | grep -q playwright && echo "true" || echo "false") \
    && if [ "$PLAYWRIGHT_INSTALLED" = "true" ]; then \
        echo "Playwright already installed, excluding from uv sync" \
        && uv sync --frozen --no-install-project --no-editable -q --no-dev --inexact --no-install-package playwright; \
    else \
        echo "Playwright not found, installing all dependencies" \
        && uv sync --frozen --no-install-project --no-editable -q --no-dev --inexact; \
    fi \
    && echo "All installed Python packages:" \
    && pip freeze

COPY . ./

RUN python -m compileall -q .

# highlight-start
# AWS Lambda entrypoint
ENTRYPOINT [ "/usr/local/bin/python3", "-m", "awslambdaric" ]

# Lambda handler function
CMD [ "aws_playwright.main.lambda_handler" ]
# highlight-end
```

website/sidebars.js

Lines changed: 5 additions & 8 deletions
```diff
@@ -54,14 +54,11 @@ module.exports = {
       id: 'deployment/apify-platform',
       label: 'Deploy on Apify',
     },
-    // {
-    //     type: 'category',
-    //     label: 'Deploy on AWS',
-    //     items: [
-    //         'deployment/aws-cheerio',
-    //         'deployment/aws-browsers',
-    //     ],
-    // },
+    {
+      type: 'doc',
+      id: 'deployment/aws-lambda',
+      label: 'Deploy on AWS Lambda'
+    },
     {
       type: 'category',
       label: 'Deploy to Google Cloud',
```
