---
id: aws-lambda-playwright
title: PlaywrightCrawler on AWS Lambda
description: Prepare your PlaywrightCrawler to run in Lambda functions on Amazon Web Services.
---

import ApiLink from '@site/src/components/ApiLink';

import CodeBlock from '@theme/CodeBlock';

import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py';
import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile';
[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that runs code without provisioning or managing servers. For crawlers that require browser rendering, you need to deploy using Docker container images, because Playwright and the browser binaries exceed Lambda's ZIP deployment size limits.

## Updating the code

For the project foundation, use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> as described in this [example](../examples/beautifulsoup-crawler). We will update it to work with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.

When instantiating the crawler, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only except for `/tmp`, so `MemoryStorageClient` tells Crawlee to keep all storage in memory instead. Replace `BeautifulSoupCrawler` with `PlaywrightCrawler` and configure `browser_launch_options` with flags optimized for serverless environments. These flags disable the sandboxing and GPU features that aren't available in Lambda's containerized runtime.

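As a rough sketch, such a flag set might look like the following. The exact flags are an assumption for illustration; the appropriate set can vary with the Chromium version, so treat this as a starting point rather than the definitive configuration:

```python
# Hypothetical launch options for Chromium inside a Lambda container.
# Each flag below addresses a restriction of the Lambda runtime.
browser_launch_options = {
    'args': [
        '--no-sandbox',             # Lambda containers lack the privileges the sandbox needs
        '--disable-gpu',            # no GPU is available in the Lambda runtime
        '--disable-dev-shm-usage',  # /dev/shm is tiny in containers; fall back to /tmp
        '--single-process',         # reduce the memory footprint of the browser
    ],
    'headless': True,               # no display server exists in Lambda
}
```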
Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute.

:::important

Always instantiate a new crawler for every Lambda invocation. AWS keeps the execution environment alive for some time after a run (to reduce cold-start times), so a subsequent invocation may otherwise reuse an already-used crawler instance and its leftover state.

**TL;DR: Keep your Lambda stateless.**

:::

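To see why warm containers make module-level state dangerous, consider this self-contained simulation of two "warm" invocations. The handler names and event shape are hypothetical, chosen only to illustrate the pattern:

```python
# Anti-pattern: module-level state survives between warm invocations,
# because the Python process is reused by the next call.
leaked_results = []  # lives for the whole lifetime of the container

def handler_with_shared_state(event, context=None):
    leaked_results.append(event['url'])  # grows across invocations!
    return list(leaked_results)

# Stateless version: everything the handler touches is created per call.
def stateless_handler(event, context=None):
    results = []  # fresh on every invocation
    results.append(event['url'])
    return results

# Simulate two invocations hitting the same warm container:
first = handler_with_shared_state({'url': 'https://example.com'})
second = handler_with_shared_state({'url': 'https://crawlee.dev'})
# `second` now contains both URLs, even though only one was requested.
```

The same leak happens with a crawler (or its storage) created at module level, which is why the guide instantiates everything inside the handler.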
Finally, return the scraped data from the Lambda when the crawler run ends.

<CodeBlock language="python" title="main.py">
    {PlaywrightCrawlerLambda}
</CodeBlock>

## Deploying the project

### Installing and configuring AWS CLI

Install the AWS CLI for your operating system by following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

Configure your credentials by running:

```bash
aws configure
```

### Preparing the project

Initialize the project by running `uvx 'crawlee[cli]' create`.

Or use a single command if you don't need interactive mode:

```bash
uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install
```

Add the additional dependencies:

```bash
uv add awslambdaric aws-lambda-powertools boto3
```

The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding an `ENTRYPOINT` and updating `CMD`:

<CodeBlock language="dockerfile" title="Dockerfile">
    {PlaywrightCrawlerDockerfile}
</CodeBlock>

### Building and pushing the Docker image

Create a repository named `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html), in the same region where your Lambda functions will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html).

Navigate to the created repository and click the "View push commands" button. This opens a window with console commands for building and uploading the Docker image to your repository. Execute them.

Example:

```bash
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data}
docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright .
docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest
docker push {user-specific-data}/lambda/aws-playwright:latest
```

### Creating the Lambda function

1. In the AWS Lambda Console, click "Create function".
2. Select "Container image".
3. Browse and select your ECR image.
4. Configure the function settings.

:::tip Configuration

Playwright crawlers require more resources than HTTP-based crawlers:

- **Memory**: Minimum 1024 MB recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages.
- **Timeout**: Set according to crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls.
- **Ephemeral storage**: The default 512 MB is usually sufficient unless you download large files.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::

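Lambda bills compute as memory multiplied by duration, so doubling memory doubles the per-second cost (though more memory can also shorten runtimes). A rough back-of-the-envelope sketch; the per-GB-second rate below is an illustrative assumption, so check current AWS pricing for real numbers:

```python
def estimate_compute_cost(
    memory_mb: int,
    duration_s: float,
    invocations: int,
    price_per_gb_second: float = 0.0000166667,  # illustrative rate, not authoritative
) -> float:
    """Rough Lambda compute cost: GB-seconds times the per-GB-second rate."""
    memory_gb = memory_mb / 1024
    return memory_gb * duration_s * invocations * price_per_gb_second

# e.g. a single 5-minute crawl at 2048 MB: roughly a cent at the assumed rate.
cost = estimate_compute_cost(2048, duration_s=300, invocations=1)
```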
## Testing the function

After the Lambda deploys, click the "Test" button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler.
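A minimal sketch of such parameterization. The event shape (`start_urls`) and the helper name are assumptions for illustration, not an AWS or Crawlee convention:

```python
DEFAULT_START_URLS = ['https://crawlee.dev']

def extract_start_urls(event) -> list[str]:
    """Pull start URLs out of the invocation event, falling back to a default."""
    if isinstance(event, dict):
        urls = event.get('start_urls')  # hypothetical event key
        if isinstance(urls, list) and all(isinstance(u, str) for u in urls):
            return urls
    return DEFAULT_START_URLS

# Inside lambda_handler you would pass the result to `crawler.run(...)`.
```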
0 commit comments