Offload Git LFS read bandwidth to the Hugging Face mirror #244
Merged
Public clones account for ~87% of the GitHub LFS bandwidth quota (measured ~585 GB of ~675 GB over the last month; the rest is CI). Redirect LFS reads to the existing HF mirror while keeping GitHub as the source of truth.
- .lfsconfig: lfs.url -> HF for clone/fetch. No pushurl is committed (a fixed HTTPS pushurl breaks SSH pushers and a fixed SSH pushurl breaks CI); the actors that write LFS objects set lfs.url back to GitHub themselves.
- CI and the mirror override lfs.url to GitHub at runtime, because newly pushed objects are not on HF until the post-merge mirror job runs.
- validation.yml: pull only each matrix shard's LFS objects from GitHub (was: actions/checkout lfs:true pulled the whole dataset in all 11 jobs).
- submission.yml: read LFS from GitHub so fork/branch submissions are validated before their bytes are mirrored to HF on merge to main.
- .gitattributes: scope LFS to data/ so a new submission staged at the repo root is an ordinary git file, pushable from a fork with no LFS setup; the submission workflow turns it into an LFS object when it moves it into data/.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile review caught that the comment still credited a GitHub pushurl from .lfsconfig for the Update step's LFS uploads; .lfsconfig has no pushurl, so uploads use the earlier lfs.url override instead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…README
- validation.yml: document that matrix.filter doubles as the LFS --include glob for pb shards (and why parquet needs a separate lfs_include).
- submission.yml: move the process_submission LFS pull after change detection and gate it on NUM_CHANGED_FILES so non-data PRs skip the full-repo pull.
- README.md: wrap the new Git LFS section to ~80 columns for consistency.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
count_reactions existed only to regenerate badges/reactions.svg. It added a bot "Update badges" commit to every PR (extra friction and rebase churn) and required a full-repository LFS pull. The badge adds little value, so remove the job, the README badge, and the generated SVG. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
process_dataset reads only the changed inputs (and smudges base revisions of modified files on demand via lfs.url), so the submission job no longer needs the whole dataset. Build the lfs pull --include list from changed_data_files.txt instead of pulling everything. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
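A minimal sketch of how such an include list might be assembled. It assumes changed_data_files.txt holds one repository-relative path per line; the helper name build_lfs_pull_command is hypothetical and is not taken from the workflow. The --include option of git lfs pull accepts a comma-separated list of paths or globs.

```python
from pathlib import Path
from typing import List


def build_lfs_pull_command(changed_list: Path) -> List[str]:
    """Return a `git lfs pull` argv limited to the paths named in changed_list.

    Assumes one repository-relative path per line (an assumption about the
    format of changed_data_files.txt). Returns an empty list when no data
    files changed, so the caller can skip the pull entirely.
    """
    paths = [
        line.strip()
        for line in changed_list.read_text().splitlines()
        if line.strip()
    ]
    if not paths:
        return []
    # git lfs pull takes --include as a comma-separated list of paths/globs.
    return ["git", "lfs", "pull", "--include", ",".join(paths)]
```

Because the list is comma-joined, this sketch would not handle paths that themselves contain commas; a real workflow would need to reject or escape those.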
With lfs: false, actions/checkout never runs the LFS smudge filter, so the GIT_LFS_SKIP_SMUDGE env var was a no-op. (Greptile review.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bdeadman approved these changes on Jun 10, 2026
# Conflicts: # badges/reactions.svg
bdeadman added a commit that referenced this pull request on Jun 18, 2026
badges/ was removed from main in #244; accepted deletion to resolve conflict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
Public clones dominate our GitHub Git LFS bandwidth. Over the last month (measured from the org billing usage API) GitHub LFS served ~675 GB, of which ~585 GB (~87%) was clones/forks and only ~90 GB was CI. The total includes ~314 GB served on days with zero CI runs and a 262 GB single-day spike. Forks and anonymous clones cannot be access-restricted on a public repo, and the datasets are already mirrored to Hugging Face.
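The split quoted above can be checked with a few lines of arithmetic. The inputs are the rounded figures from this description; the variable names are illustrative only.

```python
# Rounded figures from the billing measurement quoted above (GB over one month).
total_gb = 675
clone_gb = 585

ci_gb = total_gb - clone_gb        # remainder attributed to CI
clone_share = clone_gb / total_gb  # fraction of LFS bandwidth from clones/forks

print(f"CI traffic: ~{ci_gb} GB")
print(f"Clone share: ~{clone_share:.0%}")
```

This reproduces the ~90 GB CI remainder and the ~87% clone share.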
Approach
Redirect LFS reads to the existing HF mirror while keeping GitHub as the source of truth for the bytes. GitHub still holds the canonical, complete copy of every LFS object; HF is a read-only replica kept current by the existing mirror job.
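The read/write split described in this section can be pictured with a small model of how an LFS endpoint is chosen. This is a simplified sketch, not Git LFS's actual code: it assumes that values set with git config take precedence over the committed .lfsconfig, and that uploads use lfs.pushurl when it is set and lfs.url otherwise. It ignores per-remote settings. The function name and both URLs are placeholders invented for illustration.

```python
from typing import Dict

# Placeholder endpoints for illustration only; the real URLs are not shown here.
HF_MIRROR = "https://hf.example/mirror.git/info/lfs"
GITHUB = "https://github.example/repo.git/info/lfs"


def resolve_lfs_endpoint(
    operation: str,
    lfsconfig: Dict[str, str],
    git_config: Dict[str, str],
) -> str:
    """Pick the LFS endpoint for 'download' or 'upload' in this simplified model."""
    # git config values shadow same-named keys from the committed .lfsconfig.
    merged = {**lfsconfig, **git_config}
    if operation == "upload" and "lfs.pushurl" in merged:
        return merged["lfs.pushurl"]
    return merged["lfs.url"]


committed = {"lfs.url": HF_MIRROR}  # what every fresh clone sees

# Anonymous clone: reads are served by the mirror.
print(resolve_lfs_endpoint("download", committed, {}))

# CI after overriding lfs.url: reads and writes both go to GitHub.
ci = {"lfs.url": GITHUB}
print(resolve_lfs_endpoint("download", committed, ci))
print(resolve_lfs_endpoint("upload", committed, ci))

# Fork contributor with only lfs.pushurl set: reads stay on the mirror,
# pushes are redirected.
fork = {"lfs.pushurl": GITHUB}
print(resolve_lfs_endpoint("upload", committed, fork))
```

In this model, committing no pushurl means an unconfigured clone would also try to upload to the mirror, which is why every actor that writes LFS objects has to set an override itself.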
- .lfsconfig (new): lfs.url → HF for clone/fetch. No pushurl (a fixed HTTPS pushurl breaks SSH pushers; a fixed SSH pushurl breaks CI).
- .gitattributes: scope LFS to data/, so a new submission staged at the repo root is an ordinary git file.
- validation.yml: pull only each matrix shard's LFS objects from GitHub (was: checkout lfs:true pulled the whole dataset in all 11 jobs).
- submission.yml: read LFS from GitHub via a sparse pull (--include built from changed_data_files.txt), so submissions validate before their bytes are mirrored without fetching the whole repo.
- huggingface_mirror.yml: point LFS reads at GitHub before uploading to HF.
- CONTRIBUTING.md / README.md: document the HF mirror and the lfs.pushurl workaround for edits to existing data/ files.
CI and the mirror override lfs.url back to GitHub at runtime (git config lfs.url …) because freshly pushed objects are not on HF until the post-merge mirror job runs.
This PR also removes the count_reactions job and the reactions-count badge: it added a bot "Update badges" commit to every PR (extra friction and rebase churn) and required a full-dataset LFS pull, for little value.
Why this is safe for submissions
The data/ scoping makes the LFS boundary coincide with the submission boundary. A contributor stages a new dataset at the repo root (a plain git file) and pushes from a fork with no LFS setup. The submission "Update" step renames it into data/…, at which point it becomes an LFS object pushed to GitHub by CI (which can write). Validation reads objects from GitHub, so submissions are validated before they're ever expected on HF. There is no chicken-and-egg problem.
The sparse submission pull is safe because process_dataset.py (ord-schema v0.6.3) reads only the changed inputs listed in changed_data_files.txt and smudges base revisions of modified files on demand via lfs.url; it never scans unchanged datasets.
Edge cases (documented)
- Editing an existing data/ object from a fork still produces an LFS object; pushing it needs a one-line git config lfs.pushurl … (see CONTRIBUTING.md). Corrections that re-stage at the root avoid this.
- Root files > 100 MB can't be pushed un-LFS'd (GitHub hard limit); such large datasets are maintainer/LFS territory anyway (e.g. the 1 GB USPTO parquet).
Verification
All 595 LFS objects on main were confirmed present on the HF LFS store (anonymous download batch API), so redirected clones resolve. Workflow YAML validated; new .gitattributes patterns confirmed via git check-attr (data/ → lfs, root → unspecified); SSH push confirmed working. The validation matrix self-tests the sparse pulls on this PR (it touches validation.yml).
Expected impact
GitHub LFS bandwidth drops from ~675 GB/mo toward the low tens of GB (CI only), with the ~585 GB of clone traffic moving to HF's CDN.
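The git check-attr confirmation mentioned under Verification could be scripted along these lines. This is a sketch under assumptions: both helper names are hypothetical, and the parser relies on the default `<path>: <attribute>: <info>` output format of git check-attr.

```python
import subprocess
from typing import Dict, List


def parse_check_attr(output: str) -> Dict[str, str]:
    """Map each path to its attribute value from `git check-attr` output."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split from the right so a ': ' inside the path does not confuse the parser.
        path, _attr, value = line.rsplit(": ", 2)
        result[path] = value
    return result


def lfs_filter_for(paths: List[str]) -> Dict[str, str]:
    """Ask git which `filter` attribute applies to each path (needs a git checkout)."""
    proc = subprocess.run(
        ["git", "check-attr", "filter", "--", *paths],
        check=True, capture_output=True, text=True,
    )
    return parse_check_attr(proc.stdout)
```

Run inside a checkout with the new .gitattributes, lfs_filter_for would be expected to report lfs for paths under data/ and unspecified for files at the repo root, matching the result reported under Verification.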
🤖 Generated with Claude Code
Greptile Summary
This PR offloads Git LFS read bandwidth from GitHub to the existing Hugging Face mirror by committing a .lfsconfig that redirects lfs.url to HF, while keeping GitHub as the authoritative LFS store. All CI workflows override the URL back to GitHub at runtime so freshly-pushed objects (not yet mirrored) are accessible.
- .lfsconfig + .gitattributes: LFS reads are transparently redirected to HF for clones/forks; LFS tracking is scoped to data/ so root-level new submissions are plain git objects, removing the need for fork contributors to configure LFS at all.
- validation.yml replaces full lfs:true checkouts with sparse per-shard pulls from GitHub; submission.yml removes the count_reactions bot-commit job and adds a sparse pull scoped to changed dataset paths; huggingface_mirror.yml drops the now-redundant GIT_LFS_SKIP_SMUDGE env var and adds an unconditional step to point LFS reads at GitHub before uploading.
- README.md and CONTRIBUTING.md are updated to explain the HF mirror design, the pushurl workaround for contributors editing existing data/ files, and the no-setup path for new submissions.
Confidence Score: 5/5
Safe to merge; the LFS redirect, sparse-pull logic, and CI overrides are all correctly wired up.
The design is carefully thought through end-to-end. New root-level submissions correctly register in NUM_CHANGED_FILES (via the full-diff grep on all changed files with dataset extensions), so lfs.url is always set to GitHub before any LFS push in the submission job. The validation and mirror workflows set the override unconditionally before their LFS operations. The no-pushurl decision in .lfsconfig is sound and well-documented. All 595 existing LFS objects were pre-verified on HF. No regressions were introduced to the submission or validation paths.
No files require special attention.
Important Files Changed
Comments Outside Diff (1)
.github/workflows/huggingface_mirror.yml, line 40-44
GIT_LFS_SKIP_SMUDGE alongside lfs: false
The GIT_LFS_SKIP_SMUDGE: 1 env var on the checkout step is now redundant. lfs: false already tells actions/checkout not to fetch LFS objects or configure LFS hooks, so the smudge filter never runs. The env var is harmless but was meaningful in the old workflow where lfs: false was absent; it can now be removed to reduce confusion.
Reviews (3): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."