Offload Git LFS read bandwidth to the Hugging Face mirror by skearnes · Pull Request #244 · open-reaction-database/ord-data

skearnes · 2026-06-06T23:54:12Z

Problem

Public clones dominate our GitHub Git LFS bandwidth. Over the last month (measured from the org billing usage API), GitHub LFS served ~675 GB, of which ~585 GB (~87%) was clones/forks and only ~90 GB was CI. The monthly total includes ~314 GB served on days with zero CI runs and a 262 GB single-day spike. Forks and anonymous clones cannot be access-restricted on a public repo, and the datasets are already mirrored to Hugging Face.
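The percentages follow directly from the measured totals. A quick check using only the figures quoted in this description (the variable names are illustrative, not taken from any script in the repo):

```python
# Approximate figures quoted in this PR description, in GB.
total_gb = 675
clone_gb = 585
ci_gb = total_gb - clone_gb  # remainder attributed to CI

clone_share = clone_gb / total_gb
ci_share = ci_gb / total_gb

print(f"clones/forks: {clone_gb} GB ({clone_share:.0%})")
print(f"CI: {ci_gb} GB ({ci_share:.0%})")
```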

Approach

Redirect LFS reads to the existing HF mirror while keeping GitHub as the source of truth for the bytes. GitHub still holds the canonical, complete copy of every LFS object; HF is a read-only replica kept current by the existing mirror job.

| Layer | Change |
| --- | --- |
| `.lfsconfig` (new) | `lfs.url` → HF for clone/fetch. No `pushurl` (a fixed HTTPS pushurl breaks SSH pushers; a fixed SSH pushurl breaks CI). |
| `.gitattributes` | LFS scoped to `data/`, so a new submission staged at the repo root is an ordinary git file. |
| `validation.yml` | Read each matrix shard's LFS objects from GitHub, sparsely (was: `checkout lfs:true` pulled the whole dataset in all 11 jobs). |
| `submission.yml` | Read LFS from GitHub, and pull only the changed datasets (a sparse `--include` built from `changed_data_files.txt`), so submissions validate before their bytes are mirrored without fetching the whole repo. |
| `huggingface_mirror.yml` | Point LFS reads at GitHub (objects aren't on HF yet at mirror time). |
| `CONTRIBUTING.md` / `README.md` | Document the design and the rare fork-edit override. |

CI and the mirror override lfs.url back to GitHub at runtime (git config lfs.url …) because freshly pushed objects are not on HF until the post-merge mirror job runs.

This PR also removes the count_reactions job and the reactions-count badge: it added a bot "Update badges" commit to every PR (extra friction and rebase churn) and required a full-dataset LFS pull, for little value.

Why this is safe for submissions

The data/ scoping makes the LFS boundary coincide with the submission boundary. A contributor stages a new dataset at the repo root (a plain git file) and pushes from a fork with no LFS setup. The submission "Update" step renames it into data/…, at which point it becomes an LFS object pushed to GitHub by CI (which can write). Validation reads objects from GitHub, so submissions are validated before they're ever expected on HF — no chicken-and-egg.

The sparse submission pull is safe because process_dataset.py (ord-schema v0.6.3) reads only the changed inputs listed in changed_data_files.txt and smudges base revisions of modified files on demand via lfs.url; it never scans unchanged datasets.
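As a rough illustration of how a sparse `--include` value could be derived from `changed_data_files.txt`, here is a minimal sketch. It is not the code in `submission.yml`; it assumes the file lists one repo-relative path per line, and the function name and example paths are hypothetical.

```python
import tempfile
from pathlib import Path


def lfs_include_arg(changed_files_path: str, lfs_prefix: str = "data/") -> str:
    """Build a comma-separated value for `git lfs pull --include`.

    Only paths under `lfs_prefix` are kept: root-level files are plain git
    files in this design, so they have no LFS object to fetch.
    Assumes one repo-relative path per line (an assumption, not confirmed).
    """
    lines = Path(changed_files_path).read_text().splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    return ",".join(p for p in paths if p.startswith(lfs_prefix))


# Demo with hypothetical paths written to a temporary listing file.
with tempfile.TemporaryDirectory() as tmp:
    listing = Path(tmp) / "changed_data_files.txt"
    listing.write_text("data/ab/example_dataset.pb.gz\nnew_submission.pb.gz\n")
    include = lfs_include_arg(str(listing))

# The command is only assembled here, not executed.
command = ["git", "lfs", "pull", f"--include={include}"]
print(command)
```

A caller would probably want to skip the pull entirely when the result is empty, rather than pass an empty filter to `git lfs pull`.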

Edge cases (documented)

- Editing an existing data/ object from a fork still produces an LFS object; pushing it needs a one-line git config lfs.pushurl … (see CONTRIBUTING.md). Corrections that re-stage at the root avoid this.
- Root files > 100 MB can't be pushed un-LFS'd (GitHub hard limit); such large datasets are maintainer/LFS territory anyway (e.g. the 1 GB USPTO parquet).
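The storage rules behind these edge cases can be summarized in a few lines. This is a simplified sketch: it treats every path under `data/` as LFS-tracked (the real `.gitattributes` patterns may be narrower) and takes GitHub's per-file limit as 100 MiB. The function and constant names are hypothetical.

```python
# Assumption: GitHub's per-file hard limit, taken here as 100 MiB.
GITHUB_MAX_FILE_BYTES = 100 * 1024 * 1024


def classify(path: str, size_bytes: int) -> str:
    """Return how a staged file would be stored under the data/ scoping."""
    if path.startswith("data/"):
        # Inside the LFS boundary: pushing from a fork needs lfs.pushurl.
        return "lfs"
    if size_bytes > GITHUB_MAX_FILE_BYTES:
        # Outside LFS but over the limit: cannot be pushed as a plain file.
        return "too-large-for-plain-git"
    return "plain-git"


print(classify("new_submission.pb.gz", 5_000_000))
print(classify("data/ab/example_dataset.pb.gz", 5_000_000))
print(classify("huge_dataset.parquet", 1_000_000_000))
```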

Verification

All 595 LFS objects on main were confirmed present on the HF LFS store (anonymous download batch API), so redirected clones resolve. Workflow YAML validated; new .gitattributes patterns confirmed via git check-attr (data/ → lfs, root → unspecified); SSH push confirmed working. The validation matrix self-tests the sparse pulls on this PR (it touches validation.yml).
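For readers unfamiliar with the batch API, the sketch below shows the shape of such a presence check: parse an LFS pointer file, then build the JSON body for a `download` batch request. It does not send anything and is not the script used for this verification; the pointer contents are a made-up example.

```python
import json


def parse_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into the oid/size pair the batch API expects."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unexpected oid algorithm: {algo}")
    return {"oid": digest, "size": int(fields["size"])}


def batch_download_request(objects: list) -> str:
    """Build the JSON body for a Git LFS batch API download request."""
    body = {"operation": "download", "transfers": ["basic"], "objects": objects}
    return json.dumps(body)


# Made-up pointer file for illustration (the oid is not a real object).
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0" * 64 + "\n"
    "size 12345\n"
)

obj = parse_pointer(pointer)
payload = batch_download_request([obj])
print(payload)
```

The body would be POSTed to the `objects/batch` path under the LFS endpoint with `Accept` and `Content-Type` set to `application/vnd.git-lfs+json`. In the response, an object that the store holds carries a `download` action, while a missing one carries an `error` entry.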

Expected impact

GitHub LFS bandwidth drops from ~675 GB/mo toward the low tens of GB (CI only), with the ~585 GB of clone traffic moving to HF's CDN.

🤖 Generated with Claude Code

Greptile Summary

This PR offloads Git LFS read bandwidth from GitHub to the existing Hugging Face mirror by committing a .lfsconfig that redirects lfs.url to HF, while keeping GitHub as the authoritative LFS store. All CI workflows override the URL back to GitHub at runtime so freshly-pushed objects (not yet mirrored) are accessible.

- .lfsconfig + .gitattributes: LFS reads are transparently redirected to HF for clones/forks; LFS tracking is scoped to data/ so root-level new submissions are plain git objects, removing the need for fork contributors to configure LFS at all.
- Workflow rewrites: validation.yml replaces full lfs:true checkouts with sparse per-shard pulls from GitHub; submission.yml removes the count_reactions bot-commit job and adds a sparse pull scoped to changed dataset paths; huggingface_mirror.yml drops the now-redundant GIT_LFS_SKIP_SMUDGE env var and adds an unconditional step to point LFS reads at GitHub before uploading.
- Docs: README.md and CONTRIBUTING.md are updated to explain the HF mirror design, the pushurl workaround for contributors editing existing data/ files, and the no-setup path for new submissions.

Confidence Score: 5/5

Safe to merge; the LFS redirect, sparse-pull logic, and CI overrides are all correctly wired up.

The design is carefully thought through end-to-end. New root-level submissions correctly register in NUM_CHANGED_FILES (via the full-diff grep on all changed files with dataset extensions), so lfs.url is always set to GitHub before any LFS push in the submission job. The validation and mirror workflows set the override unconditionally before their LFS operations. The no-pushurl decision in .lfsconfig is sound and well-documented. All 595 existing LFS objects were pre-verified on HF. No regressions were introduced to the submission or validation paths.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .lfsconfig | New config redirecting LFS reads to Hugging Face mirror; no pushurl by design with clear documentation of the reasoning |
| .gitattributes | LFS scoped to data/** patterns; intentional change enables root-level submissions without LFS setup |
| .github/workflows/validation.yml | Replaces full lfs:true checkout with sparse per-shard LFS pulls from GitHub; pb filters double as valid globs, parquet job uses explicit lfs_include/lfs_exclude fields |
| .github/workflows/submission.yml | Removes count_reactions job; converts to lfs:false checkout + sparse LFS pull; lfs.url correctly set to GitHub for all dataset-touching runs (root submissions register as NUM_CHANGED_FILES≥1 via grep on full diff) |
| .github/workflows/huggingface_mirror.yml | Removes now-redundant GIT_LFS_SKIP_SMUDGE env var; adds unconditional step to point LFS reads at GitHub before mirroring newly-pushed objects |
| CONTRIBUTING.md | Adds clear LFS note documenting the HF mirror redirect and the pushurl workaround for contributors editing existing data/ files |
| README.md | Removes reactions badge; reorganises Getting the Data section with clone-first option now that LFS redirect makes it seamless; adds full Git LFS / HF mirror design section |
| badges/reactions.svg | Deleted as part of removing the count_reactions job that kept it updated via bot commits on every PR |

Comments Outside Diff (1)

.github/workflows/huggingface_mirror.yml, lines 40-44

Redundant GIT_LFS_SKIP_SMUDGE alongside lfs: false

The GIT_LFS_SKIP_SMUDGE: 1 env var on the checkout step is now redundant. lfs: false already tells actions/checkout not to fetch LFS objects or configure LFS hooks — the smudge filter never runs. The env var is harmless but was meaningful in the old workflow where lfs: false was absent; it can now be removed to reduce confusion.

Reviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."

Public clones account for ~87% of the GitHub LFS bandwidth quota (measured ~585 GB of ~675 GB over the last month; the rest is CI). Redirect LFS reads to the existing HF mirror while keeping GitHub as the source of truth.
- .lfsconfig: lfs.url -> HF for clone/fetch. No pushurl is committed (a fixed HTTPS pushurl breaks SSH pushers and a fixed SSH pushurl breaks CI); the actors that write LFS objects set lfs.url back to GitHub themselves.
- CI and the mirror override lfs.url to GitHub at runtime, because newly pushed objects are not on HF until the post-merge mirror job runs.
- validation.yml: pull only each matrix shard's LFS objects from GitHub (was: actions/checkout lfs:true pulled the whole dataset in all 11 jobs).
- submission.yml: read LFS from GitHub so fork/branch submissions are validated before their bytes are mirrored to HF on merge to main.
- .gitattributes: scope LFS to data/ so a new submission staged at the repo root is an ordinary git file, pushable from a fork with no LFS setup; the submission workflow turns it into an LFS object when it moves it into data/.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Greptile review caught that the comment still credited a GitHub pushurl from .lfsconfig for the Update step's LFS uploads; .lfsconfig has no pushurl, so uploads use the earlier lfs.url override instead.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…README
- validation.yml: document that matrix.filter doubles as the LFS --include glob for pb shards (and why parquet needs a separate lfs_include).
- submission.yml: move the process_submission LFS pull after change detection and gate it on NUM_CHANGED_FILES so non-data PRs skip the full-repo pull.
- README.md: wrap the new Git LFS section to ~80 columns for consistency.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

count_reactions existed only to regenerate badges/reactions.svg. It added a bot "Update badges" commit to every PR (extra friction and rebase churn) and required a full-repository LFS pull. The badge adds little value, so remove the job, the README badge, and the generated SVG.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

process_dataset reads only the changed inputs (and smudges base revisions of modified files on demand via lfs.url), so the submission job no longer needs the whole dataset. Build the lfs pull --include list from changed_data_files.txt instead of pulling everything.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

skearnes and others added 4 commits June 6, 2026 19:53

Update badges

852e8f6

Document the Git LFS / Hugging Face mirror setup in the README

e71e196

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

skearnes marked this pull request as ready for review June 7, 2026 00:09

greptile-apps Bot reviewed Jun 7, 2026

Comment thread .github/workflows/submission.yml Outdated

skearnes and others added 4 commits June 6, 2026 20:18

Drop redundant GIT_LFS_SKIP_SMUDGE from the mirror checkout

2a2ebfe

With lfs: false, actions/checkout never runs the LFS smudge filter, so the GIT_LFS_SKIP_SMUDGE env var was a no-op. (Greptile review.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

skearnes requested review from bdeadman and connorcoley June 7, 2026 00:36

bdeadman approved these changes Jun 10, 2026

Merge remote-tracking branch 'origin/main' into lfs-offload-reads-to-hf

3ebd642

# Conflicts: # badges/reactions.svg

skearnes merged commit 2e40c8e into main Jun 12, 2026
16 checks passed

skearnes deleted the lfs-offload-reads-to-hf branch June 12, 2026 01:46

skearnes mentioned this pull request Jun 12, 2026

Delete unused reactions badge script open-reaction-database/ord-schema#819

Merged

bdeadman added a commit that referenced this pull request Jun 18, 2026

Merge main into #243, resolving badges/reactions.svg conflict

3fcbea4

badges/ was removed from main in #244; accepted deletion to resolve conflict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

skearnes merged 9 commits into main from lfs-offload-reads-to-hf
