The cBioPortal datahub repository currently uses GitHub's Git LFS (Large File Storage) to manage large genomics data files. GitHub charges for LFS storage and bandwidth via data packs ($5/month per 50GB storage + 50GB bandwidth). As the datahub grows, these costs are becoming prohibitive.
This document describes the AWS infrastructure required to replace GitHub's LFS storage with an S3-backed alternative. The solution keeps GitHub as the frontend for code review and pull requests, but routes all LFS object storage to S3 via a lightweight Lambda function.
Install the AWS CLI v2 if you don't have it:
# macOS
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
# Verify
aws --version

You need version 2.x. If you have an older version, reinstall using the command above.
Configure a named profile for the AWS account you'll be deploying to:
aws configure --profile YOUR_PROFILE_NAME

You'll be prompted for:
- AWS Access Key ID
- AWS Secret Access Key
- Default region (us-east-1)
- Default output format (json)
Verify it works:
aws sts get-caller-identity --profile YOUR_PROFILE_NAME

All AWS CLI commands in this document use --profile YOUR_PROFILE_NAME.
Set the PROFILE variable in the Makefile to match:
PROFILE = YOUR_PROFILE_NAME

Note for MSK users: If you authenticate via saml2aws, your credentials
are temporary and need to be refreshed periodically. Run saml2aws login
before running any AWS CLI commands. The --profile argument should match
your saml2aws profile name.
Git LFS uses a protocol called the Batch API. When a curator runs git push
or a researcher runs git pull, git-lfs makes an HTTP request asking "where
do I upload/download this file?" Normally GitHub answers that question. This
solution replaces GitHub's LFS server with a Lambda function that answers
instead, generating pre-signed S3 URLs that allow git-lfs to upload and
download files directly to/from S3.
The Lambda function never touches the actual file data — it only brokers pre-signed URLs. All heavy lifting (storage, bandwidth) is handled by S3.
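To make the exchange concrete, the following sketch builds the JSON body that a git-lfs client posts to the Batch API endpoint (the configured LFS URL followed by /objects/batch) for a single download. The batch_request helper and the sample OID are illustrative and not part of this repository; the request shape follows the public Git LFS Batch API specification.

```shell
# Build a Git LFS Batch API request body for one object.
# Usage: batch_request <operation: upload|download> <oid> <size-in-bytes>
batch_request() {
  printf '{"operation":"%s","transfers":["basic"],"objects":[{"oid":"%s","size":%s}]}\n' \
    "$1" "$2" "$3"
}

# Hypothetical OID (a SHA-256 hex digest) and size, for illustration only.
OID="5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"
BODY=$(batch_request download "$OID" 6)
echo "$BODY"

# To send it to a deployed broker, replace the host with your Function URL:
# curl -s -X POST "https://<your-broker-url>.lambda-url.us-east-1.on.aws/objects/batch" \
#   -H "Accept: application/vnd.git-lfs+json" \
#   -H "Content-Type: application/vnd.git-lfs+json" \
#   -d "$BODY"
```

Per the specification, a successful response lists each object with an actions.download.href (or actions.upload.href) field. In this design that href would be the pre-signed S3 URL.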
Curator: git push -> GitHub (pointer files) + Lambda -> S3 (file content)
Researcher: git pull -> GitHub (pointer files) + Lambda -> S3 (file content)
GitHub stores only tiny pointer files (a few lines of text per file). All actual file content lives in S3.
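For reference, a pointer file in the Git LFS v1 format has three lines: a version URL, the object's SHA-256 OID, and its size in bytes. The sketch below regenerates that text for a local file. make_pointer is a hypothetical helper, and it assumes sha256sum (or shasum on macOS) is installed.

```shell
# Print the Git LFS pointer text for a file.
make_pointer() {
  file="$1"
  if command -v sha256sum >/dev/null 2>&1; then
    oid=$(sha256sum "$file" | cut -d ' ' -f 1)
  else
    oid=$(shasum -a 256 "$file" | cut -d ' ' -f 1)   # macOS fallback
  fi
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size"
}

# Example with a throwaway six-byte file.
TMPFILE=$(mktemp)
printf 'hello\n' > "$TMPFILE"
POINTER=$(make_pointer "$TMPFILE")
echo "$POINTER"
rm -f "$TMPFILE"
```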
A single S3 bucket serves two purposes:
- LFS storage (lfs/objects/*): stores LFS objects in a content-addressable layout. Private; accessible only via pre-signed URLs generated by the Lambda.
- Snapshot storage (public/*): stores a human-readable mirror of the repository with files at their original paths. Publicly readable; allows direct download without git or LFS tooling.
YOUR_BUCKET_NAME/
├── lfs/
│ └── objects/ <- private, Lambda access only
│ └── ab/cd/abcd... <- LFS content-addressable objects
└── public/ <- public read
└── brca_tcga/
├── data_mutations.txt
└── meta_study.txt
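
The ab/cd/abcd... layout above mirrors how git-lfs shards its local object store: the first two and the next two hex characters of the OID become directory levels. A minimal sketch of that mapping, assuming the default lfs/objects prefix (lfs_object_key is an illustrative name, not a function from this project):

```shell
# Map an LFS OID to its S3 key under a given prefix.
# Usage: lfs_object_key <oid> [prefix]
lfs_object_key() {
  oid="$1"
  prefix="${2:-lfs/objects}"
  a=$(printf '%s' "$oid" | cut -c1-2)   # first shard directory
  b=$(printf '%s' "$oid" | cut -c3-4)   # second shard directory
  printf '%s/%s/%s/%s\n' "$prefix" "$a" "$b" "$oid"
}

KEY=$(lfs_object_key "abcd1234ef")   # shortened OID for readability
echo "$KEY"
```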
This design is well suited for the AWS Open Data Sponsorship Program, which may limit sponsored projects to a single bucket. The bucket serves two audiences with appropriate access controls for each.
Note on AWS Open Data Program: cBioPortal is a strong candidate for the AWS Open Data Sponsorship Program at https://aws.amazon.com/opendata/open-data-sponsorship-program/ which covers S3 storage and egress costs entirely for qualifying open source scientific datasets. An application should be submitted in parallel with this infrastructure work. If accepted, all S3 costs are eliminated.
aws s3api create-bucket \
--bucket YOUR_BUCKET_NAME \
--region us-east-1 \
--profile YOUR_PROFILE_NAME

Before applying the bucket policy, disable S3 Block Public Access on the
bucket. This is required to allow the selective public policy on public/*.
Note that this does not make the bucket fully public — the bucket policy
controls exactly what is and isn't accessible:
aws s3api put-public-access-block \
--bucket YOUR_BUCKET_NAME \
--public-access-block-configuration \
"BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false" \
--profile YOUR_PROFILE_NAME

The bucket policy in docs/bucket-policy.json makes snapshot files publicly
readable while restricting LFS objects to the Lambda role only. Replace
YOUR_BUCKET_NAME and YOUR_ACCOUNT_ID in the file before running:
aws s3api put-bucket-policy \
--bucket YOUR_BUCKET_NAME \
--policy file://docs/bucket-policy.json \
--profile YOUR_PROFILE_NAME

The policy has two statements:
- PublicReadSnapshot: allows anyone to download files under public/*
- DenyPublicReadLFS: blocks public access to lfs/* except for the Lambda role. Pre-signed URLs for LFS downloads still work because they are signed with the Lambda role's credentials.
Or use the Makefile which handles both steps automatically:
make configure-bucket

Purpose: Allow the Lambda function to generate pre-signed S3 URLs for uploads and downloads, and to read curator API keys from Secrets Manager.
Role name: github-lfs-lambda-role
Trust policy — defines that the Lambda service can assume this role.
The policy document is in docs/trust-policy.json:
# Create the role
aws iam create-role \
--role-name github-lfs-lambda-role \
--assume-role-policy-document file://docs/trust-policy.json \
--profile YOUR_PROFILE_NAME
# Attach basic execution policy (CloudWatch logging)
aws iam attach-role-policy \
--role-name github-lfs-lambda-role \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole \
--profile YOUR_PROFILE_NAME

Reasoning for CloudWatch logging: The Lambda is a critical piece of infrastructure. If uploads or downloads fail, CloudWatch logs are the only way to diagnose the problem. At the volume this Lambda will be invoked (only during git push/pull operations), logging costs are negligible.
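
For orientation, a trust policy that lets the Lambda service assume a role generally has the shape below. This is the standard AWS pattern written to a scratch file, not a copy of docs/trust-policy.json, whose exact contents may differ.

```shell
# Write an example Lambda trust policy to a scratch file (the path is arbitrary).
TRUST_FILE=$(mktemp)
cat > "$TRUST_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
cat "$TRUST_FILE"
```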
Inline policy — grants S3 and Secrets Manager access, scoped to the
specific resources. The policy document is in docs/inline-policy.json.
Replace YOUR_BUCKET_NAME and YOUR_ACCOUNT_ID in that file before running:
aws iam put-role-policy \
--role-name github-lfs-lambda-role \
--policy-name github-lfs-s3-access \
--policy-document file://docs/inline-policy.json \
--profile YOUR_PROFILE_NAME

Reasoning: The role is intentionally scoped to only GetObject and
PutObject on the specific bucket, and GetSecretValue on the specific
secret. Minimal permissions reduce the blast radius of any misconfiguration.
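
The placeholder replacement called for above can be scripted rather than done by hand. The sketch below runs sed over a stand-in policy fragment and writes the result to a new file. The sample statements, bucket name, and account ID are made up for illustration, and the real docs/inline-policy.json may be structured differently.

```shell
# Stand-in for a policy template containing the documented placeholders.
TEMPLATE=$(mktemp)
cat > "$TEMPLATE" <<'EOF'
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
},
{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:us-east-1:YOUR_ACCOUNT_ID:secret:github-lfs-api-keys-*"
}
EOF

# Hypothetical values; substitute your own.
BUCKET="example-datahub-bucket"
ACCOUNT_ID="123456789012"

# Render to a separate file so the template stays reusable.
RENDERED=$(mktemp)
sed -e "s/YOUR_BUCKET_NAME/${BUCKET}/g" \
    -e "s/YOUR_ACCOUNT_ID/${ACCOUNT_ID}/g" \
    "$TEMPLATE" > "$RENDERED"
cat "$RENDERED"
```

Rendering into a separate file leaves the checked-in template untouched, which avoids accidentally committing account-specific values.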
Note on multiple buckets: If this Lambda is reused for multiple
repositories or S3 buckets, add additional Resource ARNs to
docs/inline-policy.json rather than creating separate roles.
Note on extra snapshot buckets: If using EXTRA_SNAPSHOT_BUCKETS, add
an s3:PutObject permission for each additional bucket to docs/inline-policy.json.
For buckets in a second AWS account, also apply docs/cross-account-bucket-policy.json
to the destination bucket — see the "Using Extra Snapshot Buckets in a Second AWS Account"
section below.
Purpose: Store curator API keys as a JSON object mapping curator names to keys. The Lambda reads this secret to authenticate upload requests.
See SECRETS_MANAGER.md for full instructions on creating and managing the secret. At a minimum, create the secret before deploying the Lambda:
aws secretsmanager create-secret \
--name github-lfs-api-keys \
--description "API keys for git-lfs-s3 upload authorization" \
--secret-string '{"curator-name":"REPLACE_WITH_KEY"}' \
--region us-east-1 \
--profile YOUR_PROFILE_NAME

Generate a key for each curator:
openssl rand -hex 32

Purpose: Implement the Git LFS Batch API. Receives requests from git-lfs clients, authenticates upload requests, and returns pre-signed S3 URLs for direct upload/download.
Configuration:
- Function name: github-lfs
- Runtime: provided.al2023
- Architecture: x86_64
- Timeout: 30 seconds
- IAM role: github-lfs-lambda-role
Environment variables:
| Variable | Value | Purpose |
|---|---|---|
| S3_BUCKET | YOUR_BUCKET_NAME | Target S3 bucket name |
| LFS_SECRET_NAME | github-lfs-api-keys | Secrets Manager secret name |
| LFS_PATH_PREFIX | lfs/objects (default) | S3 key prefix for LFS objects |
Note: Do not set AWS_REGION — this is reserved by the Lambda runtime.
Deploy:
make create-broker

Purpose: Reconstruct a human-readable snapshot of the repository in the
public/ prefix of the S3 bucket on every PR merge. Triggered by a GitHub
Action. Supports both full and incremental modes.
Configuration:
- Function name: github-lfs-snapshot
- Runtime: provided.al2023
- Architecture: x86_64
- Timeout: 300 seconds
- IAM role: github-lfs-lambda-role
Environment variables:
| Variable | Value | Purpose |
|---|---|---|
| LFS_BUCKET | YOUR_BUCKET_NAME | Source bucket for LFS objects |
| SNAPSHOT_BUCKET | YOUR_BUCKET_NAME | Primary destination for snapshot files |
| SNAPSHOT_PREFIX | public/ | Prefix for snapshot files (empty = bucket root) |
| LFS_PATH_PREFIX | lfs/objects (default) | S3 key prefix for LFS objects |
| EXTRA_SNAPSHOT_BUCKETS | (optional) | Comma-separated list of additional buckets to sync to |
Note: LFS_BUCKET and SNAPSHOT_BUCKET are the same value — a single
bucket is used for both LFS storage and snapshot storage. EXTRA_SNAPSHOT_BUCKETS
is optional — if empty or unset only the primary bucket is written to.
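
As a rough illustration of how these variables combine, the sketch below joins SNAPSHOT_PREFIX with a repository path to form a destination key, and splits EXTRA_SNAPSHOT_BUCKETS on commas. It is a guess at the behaviour described in the table, not the Lambda's actual code, and both helper names are hypothetical.

```shell
# Build the destination key for a repository path.
# An empty prefix places the file at the bucket root.
snapshot_key() {
  printf '%s%s\n' "$1" "$2"
}

# Print every destination bucket, one per line: the primary first, then any extras.
snapshot_targets() {
  primary="$1"
  extras="$2"
  printf '%s\n' "$primary"
  if [ -n "$extras" ]; then
    printf '%s\n' "$extras" | tr ',' '\n'
  fi
}

snapshot_key "public/" "brca_tcga/meta_study.txt"
snapshot_key "" "brca_tcga/meta_study.txt"
TARGETS=$(snapshot_targets "YOUR_BUCKET_NAME" "bucket-one,bucket-two")
echo "$TARGETS"
```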
Deploy:
make create-snapshot

Both Lambda functions require a public HTTPS endpoint. The Makefile
create-broker and create-snapshot targets handle Function URL creation
and public access permissions automatically as part of the deployment.
make create-broker
make create-snapshot

If you need to set up the Function URLs manually for any reason, the full CLI commands are:
# lfs-broker Function URL
aws lambda create-function-url-config \
--function-name github-lfs \
--auth-type NONE \
--region us-east-1 \
--profile YOUR_PROFILE_NAME
aws lambda add-permission \
--function-name github-lfs \
--statement-id FunctionURLAllowPublicAccess \
--action lambda:InvokeFunctionUrl \
--principal "*" \
--function-url-auth-type NONE \
--region us-east-1 \
--profile YOUR_PROFILE_NAME
aws lambda add-permission \
--function-name github-lfs \
--statement-id FunctionURLAllowInvoke \
--action lambda:InvokeFunction \
--principal "*" \
--region us-east-1 \
--profile YOUR_PROFILE_NAME
# lfs-snapshot Function URL
aws lambda create-function-url-config \
--function-name github-lfs-snapshot \
--auth-type NONE \
--region us-east-1 \
--profile YOUR_PROFILE_NAME
aws lambda add-permission \
--function-name github-lfs-snapshot \
--statement-id FunctionURLAllowPublicAccess \
--action lambda:InvokeFunctionUrl \
--principal "*" \
--function-url-auth-type NONE \
--region us-east-1 \
--profile YOUR_PROFILE_NAME
aws lambda add-permission \
--function-name github-lfs-snapshot \
--statement-id FunctionURLAllowInvoke \
--action lambda:InvokeFunction \
--principal "*" \
--region us-east-1 \
--profile YOUR_PROFILE_NAME

Get the Function URLs after creation:
make url-broker
make url-snapshot

| Resource | Name | Purpose |
|---|---|---|
| S3 Bucket | YOUR_BUCKET_NAME | LFS storage + snapshot storage |
| IAM Role | github-lfs-lambda-role | Lambda execution permissions |
| Secrets Manager Secret | github-lfs-api-keys | Curator API keys |
| Lambda Function | github-lfs | LFS Batch API broker |
| Lambda Function | github-lfs-snapshot | Repository snapshot |
| Lambda Function URLs | auto-generated | Public HTTPS endpoints |
Resources must be created in this order due to dependencies:
1. S3 bucket: no dependencies
2. Disable Block Public Access: required before applying the bucket policy
3. Bucket policy: requires Block Public Access to be disabled
4. IAM role and policies: no dependencies, but needed before the Lambda functions
5. Secrets Manager secret: no dependencies, but needed before the Lambda functions
6. Lambda functions: require the IAM role ARN, bucket name, and secret name
7. Function URLs and permissions: require the Lambda functions to exist
Updating Lambda code:
make deploy-broker
make deploy-snapshot

The Function URLs do not change when code is updated.
Monitoring:
- Lambda metrics: AWS Console -> Lambda -> Monitor tab
- Logs: CloudWatch -> Log groups -> /aws/lambda/github-lfs (broker) and /aws/lambda/github-lfs-snapshot (snapshot)
- S3 storage: AWS Console -> S3 -> your bucket -> Metrics tab
Once AWS infrastructure is in place, add .lfsconfig to the datahub
repository root:
[lfs]
url = https://<your-broker-url>.lambda-url.us-east-1.on.aws/

The existing .gitattributes in datahub already defines which file types
are managed by LFS and does not need to change:
*.tar.gz filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
data*.txt filter=lfs diff=lfs merge=lfs -text
*.seg filter=lfs diff=lfs merge=lfs -text
*-validation.html filter=lfs diff=lfs merge=lfs -text
Files matching these patterns go to S3. All other files remain in GitHub as regular git objects and show full content diffs in pull requests.
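
To confirm which paths a pattern set routes through LFS, git check-attr reports the filter attribute for any path, whether or not the file exists. The sketch below does this in a throwaway repository using two of the patterns above. The scratch directory and variable names are only examples.

```shell
# Create a scratch repository with a subset of the .gitattributes rules.
SCRATCH=$(mktemp -d)
git init -q "$SCRATCH" 2>/dev/null
cat > "$SCRATCH/.gitattributes" <<'EOF'
data*.txt filter=lfs diff=lfs merge=lfs -text
*.seg filter=lfs diff=lfs merge=lfs -text
EOF

# Ask git which filter applies to each path.
LFS_RESULT=$(git -C "$SCRATCH" check-attr filter -- brca_tcga/data_mutations.txt)
PLAIN_RESULT=$(git -C "$SCRATCH" check-attr filter -- brca_tcga/meta_study.txt)
echo "$LFS_RESULT"
echo "$PLAIN_RESULT"
rm -rf "$SCRATCH"
```

A path reported as "filter: lfs" is stored in S3 via the broker; "filter: unspecified" means the file stays in GitHub as a regular git object.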
See BACKFILL.md for full instructions. At a high level:
git clone --no-checkout https://github.com/cBioPortal/datahub.git
cd datahub
git lfs install --skip-smudge
git checkout master
git lfs pull
aws s3 sync .git/lfs/objects/ s3://YOUR_BUCKET_NAME/lfs/objects/ \
--profile YOUR_PROFILE_NAME

Each curator needs to store the API key once in their git credential manager.
After this one-time setup, git push works without any prompts:
git credential approve <<EOF
protocol=https
host=<your-broker-url>.lambda-url.us-east-1.on.aws
username=lfs
password=YOUR_LFS_API_KEY
EOF

See SECRETS_MANAGER.md for instructions on generating and distributing API keys.
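
Once the credential is stored, git-lfs presents it to the broker as an HTTP Basic Authorization header. The sketch below shows how that header value is derived from the username and key, which can help when testing the endpoint by hand with curl. The key shown is a placeholder and basic_auth_header is a hypothetical helper; how the Lambda validates the header is not specified in this document.

```shell
# Derive the HTTP Basic Authorization header for a username and API key.
basic_auth_header() {
  # tr strips the line wrapping that some base64 implementations add to long input.
  token=$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')
  printf 'Authorization: Basic %s\n' "$token"
}

# Placeholder key; a real key would come from `openssl rand -hex 32`.
HEADER=$(basic_auth_header "lfs" "secret")
echo "$HEADER"
```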
The EXTRA_SNAPSHOT_BUCKETS environment variable allows the snapshot Lambda
to write to additional buckets alongside the primary snapshot bucket. These
extra buckets can live in a different AWS account — the streaming
GetObject + PutObject pattern used by the Lambda works across account
boundaries without any code changes.
aws s3api create-bucket \
--bucket SECOND_ACCOUNT_BUCKET \
--region us-east-1 \
--profile SECOND_ACCOUNT_PROFILE

The policy in docs/cross-account-bucket-policy.json grants the Lambda
role from the first account permission to write to the bucket in the second
account. Replace FIRST_ACCOUNT_ID and SECOND_ACCOUNT_BUCKET in the
file before running.
Note: this is a bucket policy applied to the destination bucket in the second account — it is separate from the trust policy and inline policy on the IAM role in the first account.
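
As a point of reference, a cross-account bucket policy of this kind usually names the writing role as the principal and limits it to s3:PutObject on the destination bucket. The sketch below writes such a policy to a scratch file using the same placeholders. It is an assumed shape following the common AWS pattern, not the contents of docs/cross-account-bucket-policy.json, and the Sid is invented.

```shell
# Write an example cross-account bucket policy to a scratch file (the path is arbitrary).
CROSS_FILE=$(mktemp)
cat > "$CROSS_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSnapshotLambdaWrite",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::FIRST_ACCOUNT_ID:role/github-lfs-lambda-role" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::SECOND_ACCOUNT_BUCKET/*"
    }
  ]
}
EOF
cat "$CROSS_FILE"
```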
aws s3api put-bucket-policy \
--bucket SECOND_ACCOUNT_BUCKET \
--policy file://docs/cross-account-bucket-policy.json \
--profile SECOND_ACCOUNT_PROFILE

Add the second account's bucket to EXTRA_SNAPSHOT_BUCKETS in the Makefile:

EXTRA_SNAPSHOT_BUCKETS = SECOND_ACCOUNT_BUCKET

For multiple extra buckets, use a comma-separated list:

EXTRA_SNAPSHOT_BUCKETS = bucket-one,bucket-two

Then update the Lambda configuration. Note that --environment replaces the function's entire set of environment variables, so include every variable the function needs:
aws lambda update-function-configuration \
--function-name github-lfs-snapshot \
--environment Variables="{LFS_BUCKET=YOUR_BUCKET_NAME,SNAPSHOT_BUCKET=YOUR_BUCKET_NAME,SNAPSHOT_PREFIX=public/,EXTRA_SNAPSHOT_BUCKETS=SECOND_ACCOUNT_BUCKET}" \
--region us-east-1 \
--profile YOUR_PROFILE_NAME

Add s3:PutObject permission for the extra bucket to docs/inline-policy.json:
{
"Effect": "Allow",
"Action": "s3:PutObject",
"Resource": "arn:aws:s3:::SECOND_ACCOUNT_BUCKET/*"
}

Then reapply the policy:
aws iam put-role-policy \
--role-name github-lfs-lambda-role \
--policy-name github-lfs-s3-access \
--policy-document file://docs/inline-policy.json \
--profile YOUR_PROFILE_NAME

Trigger a full snapshot and verify files appear in both buckets:
make full-snapshot
# Check primary bucket
aws s3 ls s3://YOUR_BUCKET_NAME/public/ --recursive --profile YOUR_PROFILE_NAME
# Check second account bucket
aws s3 ls s3://SECOND_ACCOUNT_BUCKET/ --recursive --profile SECOND_ACCOUNT_PROFILE
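
Comparing the two listings by eye gets tedious for a large repository. One possible approach is to reduce each listing to a sorted list of keys and diff them, as sketched below with sample text standing in for the real aws s3 ls output. It assumes that extra buckets receive files at the bucket root (as the listing command above suggests) and that keys contain no spaces; list_keys is a hypothetical helper.

```shell
# Extract object keys from `aws s3 ls --recursive` output, dropping an optional prefix.
list_keys() {
  awk -v p="$1" '{ k = $4; if (p != "" && index(k, p) == 1) k = substr(k, length(p) + 1); print k }' | sort
}

# Sample listings standing in for the output of the two commands above.
PRIMARY_LISTING='2024-01-01 12:00:00       1024 public/brca_tcga/data_mutations.txt
2024-01-01 12:00:01         64 public/brca_tcga/meta_study.txt'
SECOND_LISTING='2024-01-01 12:05:00       1024 brca_tcga/data_mutations.txt'

P_KEYS=$(mktemp)
S_KEYS=$(mktemp)
printf '%s\n' "$PRIMARY_LISTING" | list_keys "public/" > "$P_KEYS"
printf '%s\n' "$SECOND_LISTING" | list_keys "" > "$S_KEYS"

# Keys present in the primary snapshot but missing from the second bucket.
MISSING=$(comm -23 "$P_KEYS" "$S_KEYS")
echo "$MISSING"
rm -f "$P_KEYS" "$S_KEYS"
```

An empty result would indicate that every key in the primary snapshot also exists in the second bucket.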