Commit 975f9f9

add deploy on AWS

1 parent dc380a5 commit 975f9f9

6 files changed: 378 additions & 8 deletions

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
---
id: aws-lambda-beautifulsoup
title: BeautifulSoup crawler on AWS Lambda
description: Prepare your BeautifulSoupCrawler to run in Lambda functions on Amazon Web Services.
---

import ApiLink from '@site/src/components/ApiLink';

import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupCrawlerLambda from '!!raw-loader!./code_examples/aws/beautifulsoup_crawler_lambda.py';

[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that lets you run code without provisioning or managing servers. It is well suited for deploying simple crawlers that don't require browser rendering. For such projects, you can deploy using a ZIP archive.

## Updating the code

For the project foundation, use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> as described in this [example](../examples/beautifulsoup-crawler).

When instantiating a crawler, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead.

Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute.

:::important

Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance.

**TLDR: Keep your Lambda stateless.**

:::

Finally, return the scraped data from the Lambda when the crawler run ends.

<CodeBlock language="python" title="lambda_function.py">
    {BeautifulSoupCrawlerLambda}
</CodeBlock>

## Deploying the project

### Preparing the environment

Lambda requires all dependencies to be included in the deployment package. Create a virtual environment and install the dependencies:

```bash
python3.14 -m venv .venv
source .venv/bin/activate
pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools'
```

### Creating the ZIP archive

Create a ZIP archive from your project, including the dependencies from the virtual environment:

```bash
cd .venv/lib/python3.14/site-packages
zip -r ../../../../package.zip .
cd ../../../../
zip package.zip lambda_function.py
```

:::note Large dependencies?

AWS has a limit of 50 MB for direct upload and 250 MB for the unzipped deployment package size.

A better way to manage dependencies is to use Lambda Layers. With Layers, you can share dependencies between multiple Lambda functions and keep the function code itself as slim as possible.

To create a Lambda Layer:

1. Create a `python/` folder and copy the dependencies from `site-packages` into it
2. Create a ZIP archive: `zip -r layer.zip python/`
3. Create a new Lambda Layer from the archive (you may need to upload it to S3 first)
4. Attach the Layer to your Lambda function
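
Under the assumptions above, a minimal sketch of these steps from the command line (hypothetical layer and function names; adjust to your setup) might look like:

```bash
# Copy the installed dependencies into the layout Lambda Layers expect
mkdir -p python
cp -r .venv/lib/python3.14/site-packages/* python/

# Package the layer
zip -r layer.zip python/

# Publish the layer (for archives over 50 MB, upload to S3 and pass
# --content S3Bucket=...,S3Key=... instead of --zip-file)
aws lambda publish-layer-version \
    --layer-name crawlee-deps \
    --zip-file fileb://layer.zip

# Attach the published layer version (use the LayerVersionArn returned above)
aws lambda update-function-configuration \
    --function-name crawlee-bs4-crawler \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:crawlee-deps:1
```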

:::

### Uploading and configuring

Upload `package.zip` as the code source in the AWS Lambda Console using the "Upload from" button.

In Lambda Runtime Settings, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can use the default value `lambda_function.lambda_handler`.
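
If you prefer the AWS CLI to the console, a rough equivalent (a sketch with a hypothetical function name and execution role, assuming the `python3.14` runtime used above is available in your region) could be:

```bash
# Create the function from the ZIP archive (role ARN and name are placeholders)
aws lambda create-function \
    --function-name crawlee-bs4-crawler \
    --runtime python3.14 \
    --handler lambda_function.lambda_handler \
    --zip-file fileb://package.zip \
    --role arn:aws:iam::123456789012:role/lambda-basic-execution \
    --timeout 60 \
    --memory-size 512

# Push updated code later with:
aws lambda update-function-code \
    --function-name crawlee-bs4-crawler \
    --zip-file fileb://package.zip
```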

:::tip Configuration

In the Configuration tab, you can adjust:

- **Memory**: Memory size can greatly affect execution speed. A minimum of 256-512 MB is recommended.
- **Timeout**: Set it according to the size of the website you are scraping (1 minute is enough for the example code).
- **Ephemeral storage**: The size of the `/tmp` directory.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::

After the Lambda deploys, you can test it by clicking the "Test" button. The event contents don't matter for a basic test, but you can parameterize your crawler by analyzing the event object that AWS passes as the first argument to the handler.
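
For instance, a hypothetical variant of the handler (not part of the example above, and assuming `main()` is adjusted to accept a list of start URLs) could read them from the event:

```python
import asyncio
from typing import Any

from aws_lambda_powertools.utilities.typing import LambdaContext


def lambda_handler(event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]:
    # Fall back to a default start URL when the event does not provide any.
    start_urls = event.get('start_urls', ['https://crawlee.dev'])
    # `main()` refers to the function from the example above, modified to take
    # the start URLs as an argument.
    result = asyncio.run(main(start_urls))
    return {'statusCode': 200, 'body': result}
```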
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
---
id: aws-lambda-playwright
title: PlaywrightCrawler on AWS Lambda
description: Prepare your PlaywrightCrawler to run in Lambda functions on Amazon Web Services.
---

import ApiLink from '@site/src/components/ApiLink';

import CodeBlock from '@theme/CodeBlock';

import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py';
import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile';

[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that runs code without provisioning or managing servers. For crawlers that require browser rendering, you need to deploy using Docker container images because Playwright and browser binaries exceed Lambda's ZIP deployment size limits.

## Updating the code

For the project foundation, use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> as described in this [example](../examples/beautifulsoup-crawler). We will update it to work with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.

When instantiating a crawler, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to use in-memory storage instead.

Replace `BeautifulSoupCrawler` with `PlaywrightCrawler` and configure `browser_launch_options` with flags optimized for serverless environments. These flags disable sandboxing and GPU features that aren't available in Lambda's containerized runtime.

Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute.

:::important

Make sure to always instantiate a new crawler for every Lambda invocation. AWS keeps the environment running for some time after the first execution (to reduce cold-start times), so subsequent calls may access an already-used crawler instance.

**TLDR: Keep your Lambda stateless.**

:::

Finally, return the scraped data from the Lambda when the crawler run ends.

<CodeBlock language="python" title="main.py">
    {PlaywrightCrawlerLambda}
</CodeBlock>

## Deploying the project

### Installing and configuring AWS CLI

Install the AWS CLI by following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) for your operating system.

Authenticate by running:

```bash
aws login
```

### Preparing the project

Initialize the project by running `uvx 'crawlee[cli]' create`.

Or use a single command if you don't need interactive mode:

```bash
uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install
```

Add additional dependencies:

```bash
uv add awslambdaric aws-lambda-powertools boto3
```

The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding `ENTRYPOINT` and updating `CMD`:

<CodeBlock language="dockerfile" title="Dockerfile">
    {PlaywrightCrawlerDockerfile}
</CodeBlock>

### Building and pushing the Docker image

Create a repository `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) in the same region where your Lambda functions will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html).

Navigate to the created repository and click the "View push commands" button. This will open a window with console commands for uploading the Docker image to your repository. Execute them.

Example:

```bash
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data}
docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright .
docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest
docker push {user-specific-data}/lambda/aws-playwright:latest
```

### Creating the Lambda function

1. In the AWS Lambda Console, click "Create function"
2. Select "Container image"
3. Browse and select your ECR image
4. Configure the function settings
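
Alternatively, a rough AWS CLI equivalent of these steps (a sketch only; the function name, account ID, region, and role are hypothetical placeholders) could be:

```bash
# Create the function from the container image pushed to ECR
aws lambda create-function \
    --function-name aws-playwright \
    --package-type Image \
    --code ImageUri=123456789012.dkr.ecr.us-east-1.amazonaws.com/lambda/aws-playwright:latest \
    --role arn:aws:iam::123456789012:role/lambda-basic-execution \
    --timeout 300 \
    --memory-size 2048
```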

:::tip Configuration

Playwright crawlers require more resources than HTTP-based crawlers:

- **Memory**: A minimum of 1024 MB is recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages.
- **Timeout**: Set it according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls.
- **Ephemeral storage**: The default 512 MB is usually sufficient unless you are downloading large files.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::

## Testing the function

After the Lambda deploys, click the "Test" button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler.
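
For example, invoking the deployed function from the AWS CLI with a custom payload (a hypothetical event shape; the example handler above ignores the event contents) might look like:

```bash
aws lambda invoke \
    --function-name aws-playwright \
    --cli-binary-format raw-in-base64-out \
    --payload '{"start_urls": ["https://crawlee.dev"]}' \
    response.json

cat response.json
```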
Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
1+
import asyncio
2+
import json
3+
from datetime import timedelta
4+
from typing import Any
5+
6+
from aws_lambda_powertools.utilities.typing import LambdaContext
7+
8+
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
9+
from crawlee.storage_clients import MemoryStorageClient
10+
from crawlee.storages import Dataset, RequestQueue
11+
12+
13+
async def main() -> str:
14+
# highlight-start
15+
# Disable writing storage data to the file system
16+
storage_client = MemoryStorageClient()
17+
# highlight-end
18+
19+
# Initialize storages
20+
dataset = await Dataset.open(storage_client=storage_client)
21+
request_queue = await RequestQueue.open(storage_client=storage_client)
22+
23+
crawler = BeautifulSoupCrawler(
24+
storage_client=storage_client,
25+
max_request_retries=1,
26+
request_handler_timeout=timedelta(seconds=30),
27+
max_requests_per_crawl=10,
28+
)
29+
30+
@crawler.router.default_handler
31+
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
32+
context.log.info(f'Processing {context.request.url} ...')
33+
34+
data = {
35+
'url': context.request.url,
36+
'title': context.soup.title.string if context.soup.title else None,
37+
'h1s': [h1.text for h1 in context.soup.find_all('h1')],
38+
'h2s': [h2.text for h2 in context.soup.find_all('h2')],
39+
'h3s': [h3.text for h3 in context.soup.find_all('h3')],
40+
}
41+
42+
await context.push_data(data)
43+
await context.enqueue_links()
44+
45+
await crawler.run(['https://crawlee.dev'])
46+
47+
# Extract data saved in `Dataset`
48+
data = await crawler.get_data()
49+
50+
# Clean up storages after the crawl
51+
await dataset.drop()
52+
await request_queue.drop()
53+
54+
# Serialize the list of scraped items to JSON string
55+
return json.dumps(data.items)
56+
57+
58+
def lambda_handler(_event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]:
59+
result = asyncio.run(main())
60+
# Return the response with results
61+
return {'statusCode': 200, 'body': result}
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
import asyncio
2+
import json
3+
from datetime import timedelta
4+
from typing import Any
5+
6+
from aws_lambda_powertools.utilities.typing import LambdaContext
7+
8+
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
9+
from crawlee.storage_clients import MemoryStorageClient
10+
from crawlee.storages import Dataset, RequestQueue
11+
12+
13+
async def main() -> str:
14+
# highlight-start
15+
# Disable writing storage data to the file system
16+
storage_client = MemoryStorageClient()
17+
# highlight-end
18+
19+
# Initialize storages
20+
dataset = await Dataset.open(storage_client=storage_client)
21+
request_queue = await RequestQueue.open(storage_client=storage_client)
22+
23+
crawler = PlaywrightCrawler(
24+
storage_client=storage_client,
25+
max_request_retries=1,
26+
request_handler_timeout=timedelta(seconds=30),
27+
max_requests_per_crawl=10,
28+
# highlight-start
29+
# Configure Playwright to run in AWS Lambda environment
30+
browser_launch_options={
31+
'args': [
32+
'--no-sandbox',
33+
'--disable-setuid-sandbox',
34+
'--disable-dev-shm-usage',
35+
'--disable-gpu',
36+
'--single-process',
37+
]
38+
},
39+
# highlight-end
40+
)
41+
42+
@crawler.router.default_handler
43+
async def request_handler(context: PlaywrightCrawlingContext) -> None:
44+
context.log.info(f'Processing {context.request.url} ...')
45+
46+
data = {
47+
'url': context.request.url,
48+
'title': await context.page.title(),
49+
'h1s': await context.page.locator('h1').all_text_contents(),
50+
'h2s': await context.page.locator('h2').all_text_contents(),
51+
'h3s': await context.page.locator('h3').all_text_contents(),
52+
}
53+
54+
await context.push_data(data)
55+
await context.enqueue_links()
56+
57+
await crawler.run(['https://crawlee.dev'])
58+
59+
# Extract data saved in `Dataset`
60+
data = await crawler.get_data()
61+
62+
# Clean up storages after the crawl
63+
await dataset.drop()
64+
await request_queue.drop()
65+
66+
# Serialize the list of scraped items to JSON string
67+
return json.dumps(data.items)
68+
69+
70+
def lambda_handler(_event: dict[str, Any], _context: LambdaContext) -> dict[str, Any]:
71+
result = asyncio.run(main())
72+
# Return the response with results
73+
return {'statusCode': 200, 'body': result}
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
FROM apify/actor-python-playwright:3.14
2+
3+
RUN apt update && apt install -yq git && rm -rf /var/lib/apt/lists/*
4+
5+
RUN pip install -U pip setuptools \
6+
&& pip install 'uv<1'
7+
8+
ENV UV_PROJECT_ENVIRONMENT="/usr/local"
9+
10+
COPY pyproject.toml uv.lock ./
11+
12+
RUN echo "Python version:" \
13+
&& python --version \
14+
&& echo "Installing dependencies:" \
15+
&& PLAYWRIGHT_INSTALLED=$(pip freeze | grep -q playwright && echo "true" || echo "false") \
16+
&& if [ "$PLAYWRIGHT_INSTALLED" = "true" ]; then \
17+
echo "Playwright already installed, excluding from uv sync" \
18+
&& uv sync --frozen --no-install-project --no-editable -q --no-dev --inexact --no-install-package playwright; \
19+
else \
20+
echo "Playwright not found, installing all dependencies" \
21+
&& uv sync --frozen --no-install-project --no-editable -q --no-dev --inexact; \
22+
fi \
23+
&& echo "All installed Python packages:" \
24+
&& pip freeze
25+
26+
COPY . ./
27+
28+
RUN python -m compileall -q .
29+
30+
# highlight-start
31+
# AWS Lambda entrypoint
32+
ENTRYPOINT [ "/usr/local/bin/python3", "-m", "awslambdaric" ]
33+
34+
# Lambda handler function
35+
CMD [ "aws_playwright.main.lambda_handler" ]
36+
# highlight-end

website/sidebars.js

Lines changed: 8 additions & 8 deletions
@@ -54,14 +54,14 @@ module.exports = {
                     id: 'deployment/apify-platform',
                     label: 'Deploy on Apify',
                 },
-                // {
-                //     type: 'category',
-                //     label: 'Deploy on AWS',
-                //     items: [
-                //         'deployment/aws-cheerio',
-                //         'deployment/aws-browsers',
-                //     ],
-                // },
+                {
+                    type: 'category',
+                    label: 'Deploy on AWS',
+                    items: [
+                        'deployment/aws-lambda-beautifulsoup',
+                        'deployment/aws-lambda-playwright',
+                    ],
+                },
                 {
                     type: 'category',
                     label: 'Deploy to Google Cloud',
