---
id: aws-lambda
title: Deploy on AWS Lambda
description: Prepare your crawler to run on AWS Lambda.
---

import ApiLink from '@site/src/components/ApiLink';

import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupCrawlerLambda from '!!raw-loader!./code_examples/aws/beautifulsoup_crawler_lambda.py';
import PlaywrightCrawlerLambda from '!!raw-loader!./code_examples/aws/playwright_crawler_lambda.py';
import PlaywrightCrawlerDockerfile from '!!raw-loader!./code_examples/aws/playwright_dockerfile';

[AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html) is a serverless compute service that lets you run code without provisioning or managing servers. This guide covers deploying <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> and <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>.

The code examples are based on the [BeautifulSoupCrawler example](../examples/beautifulsoup-crawler).

## BeautifulSoupCrawler on AWS Lambda

For simple crawlers that don't require browser rendering, you can deploy using a ZIP archive.

### Updating the code

When instantiating a crawler, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink>. By default, Crawlee uses file-based storage, but the Lambda filesystem is read-only (except for `/tmp`). Using `MemoryStorageClient` tells Crawlee to keep all storage in memory instead.

Wrap the crawler logic in a `lambda_handler` function. This is the entry point that AWS will execute.

:::important

Always instantiate a new crawler for every Lambda invocation. AWS keeps the execution environment alive for some time after the first invocation (to reduce cold-start times), so subsequent calls may otherwise reach an already-used crawler instance.

**TL;DR: Keep your Lambda stateless.**

:::

Finally, return the scraped data from the Lambda when the crawler run ends.

<CodeBlock language="python" title="lambda_function.py">
    {BeautifulSoupCrawlerLambda}
</CodeBlock>

### Preparing the environment

Lambda requires all dependencies to be included in the deployment package. Create a virtual environment and install the dependencies:

```bash
python3.14 -m venv .venv
source .venv/bin/activate
pip install 'crawlee[beautifulsoup]' 'boto3' 'aws-lambda-powertools'
```

[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Including it in your dependencies is recommended to avoid version mismatches with the copy bundled in the Lambda runtime.

### Creating the ZIP archive

Create a ZIP archive from your project, including the dependencies from the virtual environment:

```bash
cd .venv/lib/python3.14/site-packages
zip -r ../../../../package.zip .
cd ../../../../
zip package.zip lambda_function.py
```
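
A frequent cause of `Runtime.ImportModuleError` is `lambda_function.py` ending up nested inside a subfolder instead of at the archive root. A quick local sanity check before uploading (a sketch; the helper name is our own, not part of any AWS tooling):

```python
import zipfile


def has_root_handler(zip_path: str, module: str = 'lambda_function.py') -> bool:
    """Return True if the handler module sits at the root of the archive.

    Lambda imports `lambda_function` from the package root, so a nested
    path like `project/lambda_function.py` fails at invocation time.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return module in archive.namelist()
```

If the check fails, re-run the `zip` commands from the directories shown above rather than zipping the parent folder.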

:::note Large dependencies?

AWS limits direct uploads to 50 MB and the unzipped deployment package to 250 MB.

A better way to manage dependencies is Lambda Layers. With Layers, you can share dependencies between multiple Lambda functions and keep the function code itself as slim as possible.

To create a Lambda Layer:

1. Create a `python/` folder and copy the dependencies from `site-packages` into it
2. Create a ZIP archive: `zip -r layer.zip python/`
3. Create a new Lambda Layer from the archive (you may need to upload it to S3 first)
4. Attach the Layer to your Lambda function
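
The steps above can be sketched as shell commands (the layer name and Python version here are assumptions; adjust them to your setup):

```bash
# 1. Stage dependencies under the python/ prefix that Lambda Layers expect.
mkdir -p python
cp -r .venv/lib/python3.14/site-packages/. python/

# 2. Create the ZIP archive.
zip -r layer.zip python/

# 3. Publish the layer (upload layer.zip to S3 first if it exceeds 50 MB).
aws lambda publish-layer-version \
    --layer-name crawlee-dependencies \
    --zip-file fileb://layer.zip \
    --compatible-runtimes python3.14
```

Step 4 can then be done in the console, or with `aws lambda update-function-configuration --function-name <name> --layers <layer-arn>`.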

:::

### Creating the Lambda function

Create the Lambda function in the AWS Lambda Console:

1. Navigate to `Lambda` in the [AWS Management Console](https://aws.amazon.com/console/).
2. Click **Create function**.
3. Select **Author from scratch**.
4. Enter a **Function name**, for example `BeautifulSoupTest`.
5. Choose a **Python runtime** that matches the version used in your virtual environment (for example, Python 3.14).
6. Click **Create function** to finish.

Once created, upload `package.zip` as the code source using the **Upload from** button in the AWS Lambda Console.

In the Lambda **Runtime settings**, set the handler. Since the file is named `lambda_function.py` and the function is `lambda_handler`, you can keep the default value `lambda_function.lambda_handler`.
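
If you prefer the CLI over the console, the same upload can be scripted (the function name is taken from the example above):

```bash
aws lambda update-function-code \
    --function-name BeautifulSoupTest \
    --zip-file fileb://package.zip
```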

:::tip Configuration

In the **Configuration** tab, you can adjust:

- **Memory**: Memory size can greatly affect execution speed. A minimum of 256-512 MB is recommended.
- **Timeout**: Set it according to the size of the website you are scraping (1 minute is enough for the example code).
- **Ephemeral storage**: The size of the `/tmp` directory.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::

After the Lambda deploys, you can test it by clicking the **Test** button. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler.
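
For example, a small helper can pull start URLs out of the event, falling back to a default when the field is missing (the `start_urls` field name is our own convention, not anything AWS defines):

```python
import json

DEFAULT_START_URLS = ['https://crawlee.dev']


def start_urls_from_event(event: dict) -> list[str]:
    """Extract start URLs from a Lambda invocation event.

    Handles both a direct (test) invocation, where the event is a plain
    dict, and an API Gateway proxy integration, where the payload
    arrives as a JSON string under the 'body' key.
    """
    if isinstance(event.get('body'), str):
        event = json.loads(event['body'])
    urls = event.get('start_urls', DEFAULT_START_URLS)
    # Keep only well-formed HTTP(S) URLs.
    return [url for url in urls if url.startswith(('http://', 'https://'))]
```

Inside `lambda_handler`, you would pass the result to `crawler.run(...)` instead of a hard-coded URL list.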

## PlaywrightCrawler on AWS Lambda

For crawlers that require browser rendering, you need to deploy using a Docker container image, because Playwright and the browser binaries exceed Lambda's ZIP deployment size limits.

### Updating the code

As with <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, use <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> and wrap the logic in a `lambda_handler` function. Additionally, configure `browser_launch_options` with flags optimized for serverless environments. These flags disable sandboxing and GPU features that aren't available in Lambda's containerized runtime.

<CodeBlock language="python" title="main.py">
    {PlaywrightCrawlerLambda}
</CodeBlock>

### Installing and configuring AWS CLI

Install the AWS CLI for your operating system by following the [official documentation](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

Configure your credentials by running:

```bash
aws configure
```

### Preparing the project

Initialize the project by running `uvx 'crawlee[cli]' create`.

Or use a single command if you don't need interactive mode:

```bash
uvx 'crawlee[cli]' create aws_playwright --crawler-type playwright --http-client impit --package-manager uv --no-apify --start-url 'https://crawlee.dev' --install
```

Add the following dependencies:

```bash
uv add awslambdaric aws-lambda-powertools boto3
```

[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is the AWS SDK for Python. Use it if your function integrates with other AWS services.

The project is created with a Dockerfile that needs to be modified for AWS Lambda by adding an `ENTRYPOINT` and updating `CMD`:

<CodeBlock language="dockerfile" title="Dockerfile">
    {PlaywrightCrawlerDockerfile}
</CodeBlock>

### Building and pushing the Docker image

Create a repository named `lambda/aws-playwright` in [Amazon Elastic Container Registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html), in the same region where your Lambda functions will run. To learn more, refer to the [official documentation](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html).

Navigate to the created repository and click the **View push commands** button. This opens a window with console commands for uploading the Docker image to your repository. Execute them.

Example:

```bash
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin {user-specific-data}
docker build --platform linux/amd64 --provenance=false -t lambda/aws-playwright .
docker tag lambda/aws-playwright:latest {user-specific-data}/lambda/aws-playwright:latest
docker push {user-specific-data}/lambda/aws-playwright:latest
```

### Creating the Lambda function

1. Navigate to `Lambda` in the [AWS Management Console](https://aws.amazon.com/console/).
2. Click **Create function**.
3. Select **Container image**.
4. Browse and select your ECR image.
5. Click **Create function** to finish.

:::tip Configuration

In the **Configuration** tab, you can adjust resources. Playwright crawlers require more resources than BeautifulSoup crawlers:

- **Memory**: A minimum of 1024 MB is recommended. Browser operations are memory-intensive, so 2048 MB or more may be needed for complex pages.
- **Timeout**: Set it according to the crawl size. Browser startup adds overhead, so allow at least 5 minutes even for simple crawls.
- **Ephemeral storage**: The default 512 MB is usually sufficient unless you download large files.

See the [official documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) to learn how performance and cost scale with memory.

:::
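
These settings can also be applied from the CLI (the function name here is an assumption; memory is in MB, timeout in seconds):

```bash
aws lambda update-function-configuration \
    --function-name aws-playwright \
    --memory-size 2048 \
    --timeout 300
```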

After the Lambda deploys, click the **Test** button to invoke it. The event contents don't matter for a basic test, but you can parameterize your crawler by parsing the event object that AWS passes as the first argument to the handler.