This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This repository builds AWS Lambda layers containing Tesseract OCR libraries and the tesseract command-line binary. It supports Amazon Linux 2023 (recommended) and Amazon Linux 2 (deprecated) runtimes, with ready-to-use binaries and Docker-based build processes.
- All repository configuration is managed through .projenrc.ts
- Run npx projen to regenerate configuration files after editing .projenrc.ts
- The project uses projen to manage nested subprojects (example/cdk, continous-integration/lambda-handlers/node)
Two Dockerfiles build the Tesseract layer for different Amazon Linux versions:
- Dockerfile.al2023 - Amazon Linux 2023 (recommended - Python 3.12+, Node.js 20+, Ruby 3.2+, Java 17+)
- Dockerfile.al2 - Amazon Linux 2 (deprecated - Python 3.8-3.11, Node.js 18, Ruby 2.7, Java 8/11)
Amazon Linux 1 support has been removed.
The Dockerfiles compile Tesseract OCR and Leptonica from source with configurable build arguments:
- TESSERACT_VERSION - Tesseract version to build
- LEPTONICA_VERSION - Leptonica version to build
- OCR_LANG - additional language data to include (beyond eng/osd)
- TESSERACT_DATA_SUFFIX - traineddata variant: empty (default), _best, or _fast
- TESSERACT_DATA_VERSION - version of trained models (currently 4.1.0)
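As an illustration of how these build arguments map onto a `docker build` invocation, here is a minimal Python sketch. The helper name `docker_build_command` and the default tag are assumptions made for this example; the repository itself drives builds through the shell commands and CDK bundling described elsewhere in this file.

```python
from typing import Dict, List, Optional

# Suffix values listed above: empty (default), "_best", or "_fast".
ALLOWED_DATA_SUFFIXES = ("", "_best", "_fast")


def docker_build_command(
    dockerfile: str,
    tag: str = "tesseract-lambda-layer",
    build_args: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the argv for `docker build` with the given build arguments.

    Hypothetical helper for illustration; not part of the repository.
    """
    build_args = build_args or {}
    suffix = build_args.get("TESSERACT_DATA_SUFFIX", "")
    if suffix not in ALLOWED_DATA_SUFFIXES:
        raise ValueError(f"unsupported TESSERACT_DATA_SUFFIX: {suffix!r}")

    command = ["docker", "build"]
    # Sort names so the generated command line is deterministic.
    for name in sorted(build_args):
        command += ["--build-arg", f"{name}={build_args[name]}"]
    command += ["-t", tag, "-f", dockerfile, "."]
    return command


if __name__ == "__main__":
    print(" ".join(docker_build_command(
        "Dockerfile.al2023",
        build_args={"TESSERACT_VERSION": "5.5.2", "OCR_LANG": "fra"},
    )))
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues for values such as multi-language `OCR_LANG` strings.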
Located in continous-integration/:
- main.ts - CDK app that builds layers using Docker bundling and creates test Lambda functions
- lambda-handlers/python/ - Python handler for testing
- lambda-handlers/node/ - Node.js handler for testing
The CI/CD flow:
- npx cdk synth synthesizes the stack and bundles layer artifacts via Docker
- AWS SAM CLI locally invokes test functions with the built layer
- Test output is checked for errors
- Successful builds copy artifacts to ready-to-use/amazonlinux-2023/ and ready-to-use/amazonlinux-2/
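The "checked for errors" step could look roughly like the Python sketch below. This is an assumption about one reasonable check, not the repository's actual CI logic: it relies only on the fact that Lambda reports an unhandled error as a JSON object containing an `errorMessage` key. The function name `invocation_failed` is hypothetical.

```python
import json
from typing import Any


def invocation_failed(raw_output: str) -> bool:
    """Return True if the invoke output looks like a failed Lambda call.

    Hypothetical check for illustration; the real CI may test differently.
    """
    try:
        payload: Any = json.loads(raw_output)
    except json.JSONDecodeError:
        # Output that is not valid JSON is treated as a failure.
        return True
    # Unhandled Lambda errors are returned as an object with "errorMessage".
    return isinstance(payload, dict) and "errorMessage" in payload
```

A caller would capture the stdout of the local invocation and exit non-zero when this function returns `True`.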
The ready-to-use/ directory contains pre-built layer binaries:
- amazonlinux-2023/ - Recommended layer contents for AL2023-based runtimes
- amazonlinux-2/ - Deprecated layer contents for AL2-based runtimes (will be removed)
- These are deployed to /opt when attached to a Lambda function
# Install dependencies (runs across all subprojects)
npm ci
# Build the project (compile TypeScript)
npm run build
# Run unit tests
npm test
# Synthesize CDK stack (triggers Docker build of layer)
npm run synth
# Run integration tests (requires Docker and SAM CLI, AL2023 only)
npm run test:integration # All AL2023 tests
npm run test:integration:al2023 # AL2023 tests (Python 3.12 + Node.js 20)
npm run test:integration:python312 # Python 3.12 (AL2023) test
npm run test:integration:node20 # Node.js 20 (AL2023) test
# Bundle ready-to-use artifacts after synth
npm run bundle:binary
# Create release assets (zip files)
npm run package

# Run ESLint
npm run eslint

# Build AL2023 layer (recommended)
docker build --build-arg TESSERACT_VERSION=5.5.2 \
--build-arg OCR_LANG=fra \
-t tesseract-lambda-layer \
-f Dockerfile.al2023 .
# Build AL2 layer (deprecated)
docker build -t tesseract-lambda-layer -f Dockerfile.al2 .
# Extract built artifacts from container
export CONTAINER=$(docker run -d tesseract-lambda-layer false)
docker cp $CONTAINER:/opt/build-dist layer
docker rm $CONTAINER
unset CONTAINER

# Upgrade all dependencies (JavaScript)
npm run upgrade
# Upgrade Python dependencies in CI handlers
npm run upgrade:ci:py
# Upgrade all subprojects
npm run upgrade:subprojects

This repository uses CDK's Docker bundling feature extensively:
// Building a layer from Docker
const layer = new lambda.LayerVersion(stack, 'layer', {
  code: Code.fromAsset(pathToSource, {
    bundling: {
      image: DockerImage.fromBuild(path, { file: 'Dockerfile.al2' }),
      command: ['/bin/bash', '-c', 'cp -r /opt/build-dist/. /asset-output/'],
    },
  }),
});

The bundling happens during cdk synth, not during cdk deploy. Artifacts are cached in cdk.out/.
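To inspect what a synth produced, a small script can walk cdk.out/ and report the size of each bundled asset. The sketch below assumes the usual CDK layout in which staged assets live in directories named `asset.<hash>`; the function name `asset_sizes` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def asset_sizes(cdk_out: str = "cdk.out") -> Dict[str, int]:
    """Map each staged asset directory name to its total size in bytes."""
    sizes: Dict[str, int] = {}
    root = Path(cdk_out)
    if not root.is_dir():
        # Nothing has been synthesized yet.
        return sizes
    for entry in sorted(root.iterdir()):
        # Assumption: CDK stages directory assets as "asset.<hash>".
        if entry.is_dir() and entry.name.startswith("asset."):
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return sizes


if __name__ == "__main__":
    for name, size in asset_sizes().items():
        print(f"{name}: {size / 1_000_000:.1f} MB")
```

This can help confirm that a layer stays within Lambda's size limits before the artifacts are copied to ready-to-use/.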
Two example projects demonstrate layer usage:
- Serverless Framework example
  - References ready-to-use/amazonlinux-2/ via layers.path in serverless.yml
  - Uses serverless-python-requirements plugin with Docker
- CDK example (example/cdk)
  - Creates LayerVersion from ready-to-use/amazonlinux-2/ using Code.fromAsset()
  - Separate projen subproject with its own dependencies
- Library files are stripped during build (using strip -s) to reduce size
- Stripping can cause issues if the build runtime differs from the Lambda runtime - use matching base images
- The layer includes tesseract binaries in /opt/bin and libraries in /opt/lib
- Ready-to-use artifacts are tracked in git for convenience (large binary files)
- Release workflow is scheduled annually (Jan 1) via projen releaseTrigger
- Dependency upgrades run weekly via GitHub Actions with projen credentials from a GitHub App
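For orientation, a function that uses the layer might call the binary roughly as in the Python sketch below. It is not the repository's test handler: the event shape (`image_path`, `lang`) and the function names are assumptions for illustration, and only the /opt/bin location comes from the notes above.

```python
import subprocess
from typing import List

# Binary location inside the layer, per the notes above.
TESSERACT_BIN = "/opt/bin/tesseract"


def tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Build argv that runs OCR on image_path and prints the text to stdout."""
    # Passing "stdout" as the output base makes tesseract print the result
    # instead of writing an output file.
    return [TESSERACT_BIN, image_path, "stdout", "-l", lang]


def handler(event, context):
    """Hypothetical handler; expects an event like {"image_path": "/tmp/x.png"}."""
    result = subprocess.run(
        tesseract_command(event["image_path"], event.get("lang", "eng")),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return {"text": result.stdout}
```

Lambda adds /opt/bin to PATH and /opt/lib to LD_LIBRARY_PATH, so the shared libraries resolve without extra configuration. Depending on where the layer places its trained data, the TESSDATA_PREFIX environment variable may also need to be set; that is not shown here.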