
Offload Git LFS read bandwidth to the Hugging Face mirror#244

Merged
skearnes merged 9 commits into main from lfs-offload-reads-to-hf on Jun 12, 2026

@skearnes skearnes commented Jun 6, 2026

Problem

Public clones dominate our GitHub Git LFS bandwidth. Over the last month (measured from the org billing usage API), GitHub LFS served ~675 GB: ~585 GB (~87%) went to clones/forks and only ~90 GB to CI. The total includes ~314 GB served on days with zero CI runs and a 262 GB single-day spike. Forks and anonymous clones cannot be access-restricted on a public repo, and the datasets are already mirrored to Hugging Face.
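The shares quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the rounded figures from the paragraph, not a reproduction of the billing query; the variable names are illustrative.

```python
# Rounded figures quoted in the Problem paragraph (GB over the last month).
total_gb = 675
clone_gb = 585

# Whatever is not clone/fork traffic is attributed to CI.
ci_gb = total_gb - clone_gb

# Share of the total that clones and forks account for.
clone_share = clone_gb / total_gb

print(f"CI: ~{ci_gb} GB, clones/forks: ~{clone_share:.0%} of total")
```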

Approach

Redirect LFS reads to the existing HF mirror while keeping GitHub as the source of truth for the bytes. GitHub still holds the canonical, complete copy of every LFS object; HF is a read-only replica kept current by the existing mirror job.

| Layer | Change |
| --- | --- |
| .lfsconfig (new) | lfs.url → HF for clone/fetch. No pushurl (a fixed HTTPS pushurl breaks SSH pushers; a fixed SSH pushurl breaks CI). |
| .gitattributes | LFS scoped to data/, so a new submission staged at the repo root is an ordinary git file. |
| validation.yml | Read each matrix shard's LFS objects from GitHub, sparsely (was: checkout lfs:true pulled the whole dataset in all 11 jobs). |
| submission.yml | Read LFS from GitHub, and pull only the changed datasets (a sparse --include built from changed_data_files.txt), so submissions validate before their bytes are mirrored without fetching the whole repo. |
| huggingface_mirror.yml | Point LFS reads at GitHub (objects aren't on HF yet at mirror time). |
| CONTRIBUTING.md / README.md | Document the design and the rare fork-edit override. |

CI and the mirror override lfs.url back to GitHub at runtime (git config lfs.url …) because freshly pushed objects are not on HF until the post-merge mirror job runs.
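As a rough sketch of the two pieces described above: the committed .lfsconfig carries only a read URL, and CI points lfs.url back at GitHub before any LFS operation. The URLs below are placeholders (the PR text elides the real ones), and the helper names are hypothetical rather than taken from the workflows.

```python
import subprocess

# Placeholder endpoints: assumptions for illustration, not the repository's real URLs.
HF_LFS_URL = "https://hf.example.invalid/mirror.git/info/lfs"
GITHUB_LFS_URL = "https://github.example.invalid/owner/repo.git/info/lfs"


def render_lfsconfig(read_url: str) -> str:
    """Return .lfsconfig text that sets only lfs.url (no pushurl, by design)."""
    return f"[lfs]\n\turl = {read_url}\n"


def override_lfs_url_cmd(url: str) -> list[str]:
    """Build the repo-local override that CI would run before touching LFS."""
    return ["git", "config", "lfs.url", url]


def apply_override(url: str, repo_dir: str = ".") -> None:
    """Run the override inside a checkout; requires git on PATH."""
    subprocess.run(override_lfs_url_cmd(url), cwd=repo_dir, check=True)
```

The override works because a value set with git config in the local repository is expected to take precedence over the same key in .lfsconfig, so the committed file can keep pointing ordinary clones at the mirror.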

This PR also removes the count_reactions job and the reactions-count badge: it added a bot "Update badges" commit to every PR (extra friction and rebase churn) and required a full-dataset LFS pull, for little value.

Why this is safe for submissions

The data/ scoping makes the LFS boundary coincide with the submission boundary. A contributor stages a new dataset at the repo root (a plain git file) and pushes from a fork with no LFS setup. The submission "Update" step renames it into data/…, at which point it becomes an LFS object pushed to GitHub by CI (which can write). Validation reads objects from GitHub, so submissions are validated before they're ever expected on HF — no chicken-and-egg.
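The boundary can be spot-checked with git check-attr, which prints one `<path>: <attribute>: <value>` line per queried path. The sketch below parses such lines; the sample paths are invented, and the helpers are hypothetical rather than part of the repository.

```python
def parse_check_attr(line: str) -> tuple[str, str, str]:
    """Split one `git check-attr` output line into (path, attribute, value)."""
    # Split from the right so a path containing ": " stays intact.
    path, attribute, value = line.rstrip("\n").rsplit(": ", 2)
    return path, attribute, value


def is_lfs_tracked(line: str) -> bool:
    """True when the `filter` attribute reported for the path is `lfs`."""
    _, attribute, value = parse_check_attr(line)
    return attribute == "filter" and value == "lfs"


# Illustrative output of `git check-attr filter -- <paths>` (paths are invented).
sample = [
    "data/ab/example.pb.gz: filter: lfs",
    "example.pb.gz: filter: unspecified",
]
results = [is_lfs_tracked(line) for line in sample]
```

A file under data/ reports the lfs filter, while the same file name at the repo root reports unspecified and is therefore stored as a plain git blob.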

The sparse submission pull is safe because process_dataset.py (ord-schema v0.6.3) reads only the changed inputs listed in changed_data_files.txt and smudges base revisions of modified files on demand via lfs.url; it never scans unchanged datasets.
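A minimal sketch of how such an --include list could be built, assuming changed_data_files.txt holds one repository-relative path per line (the file format is not shown in this PR text). The function names are hypothetical, and the real workflow may construct the list differently.

```python
def build_lfs_include(changed_paths: list[str], lfs_prefix: str = "data/") -> str:
    """Return a comma-separated --include value for the LFS-tracked changed paths."""
    seen: set[str] = set()
    kept: list[str] = []
    for raw in changed_paths:
        path = raw.strip()
        # Skip blanks, duplicates, and paths outside the LFS-scoped directory.
        if not path or path in seen or not path.startswith(lfs_prefix):
            continue
        seen.add(path)
        kept.append(path)
    return ",".join(kept)


def lfs_pull_cmd(include: str) -> list[str]:
    """Build the sparse pull command; an empty include means nothing to pull."""
    return ["git", "lfs", "pull", f"--include={include}"] if include else []
```

Returning an empty command for an empty list is a deliberate choice in this sketch: it lets the caller skip the pull instead of passing an empty filter. Paths containing commas or glob metacharacters would need extra handling that is omitted here.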

Edge cases (documented)

  • Editing an existing data/ object from a fork still produces an LFS object; pushing it needs a one-line git config lfs.pushurl … (see CONTRIBUTING.md). Corrections that re-stage at the root avoid this.
  • Root files > 100 MB cannot be pushed as plain git files (GitHub hard limit); such large datasets are maintainer/LFS territory anyway (e.g. the 1 GB USPTO parquet).
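A contributor could check for the size edge case before pushing with something like the following sketch. It assumes the limit is 100 MiB, as GitHub documents for ordinary git files, and the helper name is hypothetical.

```python
import os

# Assumption: GitHub's per-file limit for ordinary git pushes is 100 MiB.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_root_files(repo_dir: str, limit: int = GITHUB_FILE_LIMIT_BYTES) -> list[str]:
    """List regular files directly under repo_dir that exceed the limit."""
    too_big = []
    for entry in os.scandir(repo_dir):
        # Only look at regular files at the top level; subdirectories are ignored.
        if entry.is_file(follow_symlinks=False) and entry.stat().st_size > limit:
            too_big.append(entry.name)
    return sorted(too_big)
```

Any file this reports would have to go through the maintainer/LFS path rather than being staged at the repo root.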

Verification

All 595 LFS objects on main were confirmed present on the HF LFS store (anonymous download batch API), so redirected clones resolve. Workflow YAML validated; new .gitattributes patterns confirmed via git check-attr (data/ → lfs, root → unspecified); SSH push confirmed working. The validation matrix self-tests the sparse pulls on this PR (it touches validation.yml).
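The presence check can be sketched against the Git LFS batch API, which accepts a JSON download request at `<lfs.url>/objects/batch` and returns either a download action or an error per object. The helpers below only build the request body and interpret a response; they are hypothetical, make no network calls, and are not the script used for this PR.

```python
import json


def batch_download_request(objects: list[tuple[str, int]]) -> str:
    """Serialize a Git LFS batch 'download' request for (oid, size) pairs."""
    payload = {
        "operation": "download",
        "transfers": ["basic"],
        "objects": [{"oid": oid, "size": size} for oid, size in objects],
    }
    return json.dumps(payload)


def missing_oids(response_body: str) -> list[str]:
    """Return OIDs the server did not offer a download action for."""
    response = json.loads(response_body)
    missing = []
    for obj in response.get("objects", []):
        has_download = "download" in obj.get("actions", {})
        # An object counts as missing if it carries an error or lacks a download link.
        if "error" in obj or not has_download:
            missing.append(obj["oid"])
    return missing
```

When sending the request, the batch API expects both the Accept and Content-Type headers to be application/vnd.git-lfs+json. An empty result from missing_oids over all objects on main would correspond to the "all present" outcome reported above.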

Expected impact

GitHub LFS bandwidth drops from ~675 GB/mo toward the low tens of GB (CI only), with the ~585 GB of clone traffic moving to HF's CDN.

🤖 Generated with Claude Code

Greptile Summary

This PR offloads Git LFS read bandwidth from GitHub to the existing Hugging Face mirror by committing a .lfsconfig that redirects lfs.url to HF, while keeping GitHub as the authoritative LFS store. All CI workflows override the URL back to GitHub at runtime so freshly-pushed objects (not yet mirrored) are accessible.

  • .lfsconfig + .gitattributes: LFS reads are transparently redirected to HF for clones/forks; LFS tracking is scoped to data/ so root-level new submissions are plain git objects, removing the need for fork contributors to configure LFS at all.
  • Workflow rewrites: validation.yml replaces full lfs:true checkouts with sparse per-shard pulls from GitHub; submission.yml removes the count_reactions bot-commit job and adds a sparse pull scoped to changed dataset paths; huggingface_mirror.yml drops the now-redundant GIT_LFS_SKIP_SMUDGE env var and adds an unconditional step to point LFS reads at GitHub before uploading.
  • Docs: README.md and CONTRIBUTING.md are updated to explain the HF mirror design, the pushurl workaround for contributors editing existing data/ files, and the no-setup path for new submissions.

Confidence Score: 5/5

Safe to merge; the LFS redirect, sparse-pull logic, and CI overrides are all correctly wired up.

The design is carefully thought through end-to-end. New root-level submissions correctly register in NUM_CHANGED_FILES (via the full-diff grep on all changed files with dataset extensions), so lfs.url is always set to GitHub before any LFS push in the submission job. The validation and mirror workflows set the override unconditionally before their LFS operations. The no-pushurl decision in .lfsconfig is sound and well-documented. All 595 existing LFS objects were pre-verified on HF. No regressions were introduced to the submission or validation paths.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .lfsconfig | New config redirecting LFS reads to Hugging Face mirror; no pushurl by design with clear documentation of the reasoning |
| .gitattributes | LFS scoped to data/** patterns; intentional change enables root-level submissions without LFS setup |
| .github/workflows/validation.yml | Replaces full lfs:true checkout with sparse per-shard LFS pulls from GitHub; pb filters double as valid globs, parquet job uses explicit lfs_include/lfs_exclude fields |
| .github/workflows/submission.yml | Removes count_reactions job; converts to lfs:false checkout + sparse LFS pull; lfs.url correctly set to GitHub for all dataset-touching runs (root submissions register as NUM_CHANGED_FILES≥1 via grep on full diff) |
| .github/workflows/huggingface_mirror.yml | Removes now-redundant GIT_LFS_SKIP_SMUDGE env var; adds unconditional step to point LFS reads at GitHub before mirroring newly-pushed objects |
| CONTRIBUTING.md | Adds clear LFS note documenting the HF mirror redirect and the pushurl workaround for contributors editing existing data/ files |
| README.md | Removes reactions badge; reorganises Getting the Data section with clone-first option now that LFS redirect makes it seamless; adds full Git LFS / HF mirror design section |
| badges/reactions.svg | Deleted as part of removing the count_reactions job that kept it updated via bot commits on every PR |

Comments Outside Diff (1)

  1. .github/workflows/huggingface_mirror.yml, line 40-44 (link)

    P2 Redundant GIT_LFS_SKIP_SMUDGE alongside lfs: false

    The GIT_LFS_SKIP_SMUDGE: 1 env var on the checkout step is now redundant. lfs: false already tells actions/checkout not to fetch LFS objects or configure LFS hooks — the smudge filter never runs. The env var is harmless but was meaningful in the old workflow where lfs: false was absent; it can now be removed to reduce confusion.


skearnes and others added 4 commits June 6, 2026 19:53
Public clones account for ~87% of the GitHub LFS bandwidth quota (measured
~585 GB of ~675 GB over the last month; the rest is CI). Redirect LFS reads
to the existing HF mirror while keeping GitHub as the source of truth.

- .lfsconfig: lfs.url -> HF for clone/fetch. No pushurl is committed (a fixed
  HTTPS pushurl breaks SSH pushers and a fixed SSH pushurl breaks CI); the
  actors that write LFS objects set lfs.url back to GitHub themselves.
- CI and the mirror override lfs.url to GitHub at runtime, because newly
  pushed objects are not on HF until the post-merge mirror job runs.
- validation.yml: pull only each matrix shard's LFS objects from GitHub
  (was: actions/checkout lfs:true pulled the whole dataset in all 11 jobs).
- submission.yml: read LFS from GitHub so fork/branch submissions are
  validated before their bytes are mirrored to HF on merge to main.
- .gitattributes: scope LFS to data/ so a new submission staged at the repo
  root is an ordinary git file, pushable from a fork with no LFS setup; the
  submission workflow turns it into an LFS object when it moves it into data/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile review caught that the comment still credited a GitHub pushurl from
.lfsconfig for the Update step's LFS uploads; .lfsconfig has no pushurl, so
uploads use the earlier lfs.url override instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@skearnes skearnes marked this pull request as ready for review June 7, 2026 00:09
skearnes and others added 4 commits June 6, 2026 20:18
…README

- validation.yml: document that matrix.filter doubles as the LFS --include
  glob for pb shards (and why parquet needs a separate lfs_include).
- submission.yml: move the process_submission LFS pull after change detection
  and gate it on NUM_CHANGED_FILES so non-data PRs skip the full-repo pull.
- README.md: wrap the new Git LFS section to ~80 columns for consistency.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
count_reactions existed only to regenerate badges/reactions.svg. It added a
bot "Update badges" commit to every PR (extra friction and rebase churn) and
required a full-repository LFS pull. The badge adds little value, so remove the
job, the README badge, and the generated SVG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
process_dataset reads only the changed inputs (and smudges base revisions of
modified files on demand via lfs.url), so the submission job no longer needs
the whole dataset. Build the lfs pull --include list from changed_data_files.txt
instead of pulling everything.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With lfs: false, actions/checkout never runs the LFS smudge filter, so the
GIT_LFS_SKIP_SMUDGE env var was a no-op. (Greptile review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@skearnes skearnes requested review from bdeadman and connorcoley June 7, 2026 00:36
@skearnes skearnes merged commit 2e40c8e into main Jun 12, 2026
16 checks passed
@skearnes skearnes deleted the lfs-offload-reads-to-hf branch June 12, 2026 01:46
bdeadman added a commit that referenced this pull request Jun 18, 2026
badges/ was removed from main in #244; accepted deletion to resolve conflict.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>