Offload Git LFS read bandwidth to the Hugging Face mirror (#244)

skearnes · claude · github-actions · web-flow · commit 2e40c8e8e02c · 2026-06-11T21:46:26.000-04:00
* Offload Git LFS read bandwidth to the Hugging Face mirror

Public clones account for ~87% of the GitHub LFS bandwidth quota (measured
~585 GB of ~675 GB over the last month; the rest is CI). Redirect LFS reads
to the existing HF mirror while keeping GitHub as the source of truth.

- .lfsconfig: lfs.url -> HF for clone/fetch. No pushurl is committed (a fixed
  HTTPS pushurl breaks SSH pushers and a fixed SSH pushurl breaks CI); the
  actors that write LFS objects set lfs.url back to GitHub themselves.
- CI and the mirror override lfs.url to GitHub at runtime, because newly
  pushed objects are not on HF until the post-merge mirror job runs.
- validation.yml: pull only each matrix shard's LFS objects from GitHub
  (was: actions/checkout lfs:true pulled the whole dataset in all 11 jobs).
- submission.yml: read LFS from GitHub so fork/branch submissions are
  validated before their bytes are mirrored to HF on merge to main.
- .gitattributes: scope LFS to data/ so a new submission staged at the repo
  root is an ordinary git file, pushable from a fork with no LFS setup; the
  submission workflow turns it into an LFS object when it moves it into data/.
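
As a rough illustration of the .gitattributes scoping above, the sketch below
approximates which paths end up as LFS objects. It is not git's own attribute
matcher; `is_lfs_tracked` and the example file names are hypothetical, and the
check simply assumes that `data/**/<pattern>` means "any depth under data/ with
a matching suffix".

```python
# Suffixes taken from the three data/**/ patterns in .gitattributes.
LFS_SUFFIXES = (".pb", ".pb.gz", ".parquet")


def is_lfs_tracked(path):
    """Approximate the data/-scoped LFS rules for a repo-relative POSIX path."""
    return path.startswith("data/") and path.endswith(LFS_SUFFIXES)


# A submission staged at the repository root is an ordinary git file ...
root_submission = is_lfs_tracked("ord_dataset-example.pb.gz")
# ... and becomes an LFS object once it lives under data/.
published = is_lfs_tracked("data/4d/ord_dataset-example.pb.gz")
# Files under data/ with other extensions are not covered by the patterns.
other_file = is_lfs_tracked("data/README.md")

print(root_submission, published, other_file)
```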

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update badges

* Document the Git LFS / Hugging Face mirror setup in the README

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix stale submission.yml comment referencing a removed pushurl

Greptile review caught that the comment still credited a GitHub pushurl from
.lfsconfig for the Update step's LFS uploads; .lfsconfig has no pushurl, so
uploads use the earlier lfs.url override instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Address review: clarify FILTER reuse, gate submission LFS pull, wrap README

- validation.yml: document that matrix.filter doubles as the LFS --include
  glob for pb shards (and why parquet needs a separate lfs_include).
- submission.yml: move the process_submission LFS pull after change detection
  and gate it on NUM_CHANGED_FILES so non-data PRs skip the full-repo pull.
- README.md: wrap the new Git LFS section to ~80 columns for consistency.
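
The matrix.filter reuse in validation.yml can be checked with a small sketch
like the one below. It assumes validate_dataset.py applies the filter as a
regex search (not confirmed here) and uses Python's fnmatch as a stand-in for
git-lfs's own include matching, which differs in details such as how `*`
treats `/`. The dataset file names are made up.

```python
import fnmatch
import re

FILTER = "data/[0-4][0-4]"  # example shard filter from the workflow comment
LFS_GLOB = FILTER + "/*.pb*"  # how the workflow builds the --include pattern


def selected_by_regex(path):
    """Treat the filter as a regular expression (assumed search semantics)."""
    return re.search(FILTER, path) is not None


def selected_by_glob(path):
    """Treat the same filter as a path glob (fnmatch approximation)."""
    return fnmatch.fnmatchcase(path, LFS_GLOB)


paths = [
    "data/04/ord_dataset-example.pb.gz",  # inside the shard
    "data/4d/ord_dataset-example.pb.gz",  # second character is not 0-4
    "data/59/ord_dataset-example.pb.gz",  # first character is not 0-4
]
results = [(selected_by_regex(p), selected_by_glob(p)) for p in paths]
print(results)
```

For these pb paths the two readings agree, which is what lets one string serve
both purposes; a lookahead regex such as the parquet filter has no glob
equivalent, hence the separate lfs_include.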

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop the reactions-count badge and its CI job

count_reactions existed only to regenerate badges/reactions.svg. It added a
bot "Update badges" commit to every PR (extra friction and rebase churn) and
required a full-repository LFS pull. The badge adds little value, so remove the
job, the README badge, and the generated SVG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Pull only changed datasets in submission CI

process_dataset reads only the changed inputs (and smudges base revisions of
modified files on demand via lfs.url), so the submission job no longer needs
the whole dataset. Build the lfs pull --include list from changed_data_files.txt
instead of pulling everything.
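
For reference, the include list described above can be sketched in Python as
follows. The workflow itself does this with awk and paste; `build_lfs_include`
and the sample file names are hypothetical, and the sketch assumes (as the
workflow comment states) that each line is `git diff --name-status` output
whose last field is the current path.

```python
def build_lfs_include(name_status_lines):
    """Join the last field of each name-status line into an --include value."""
    paths = []
    for line in name_status_lines:
        # Whitespace split, like awk's default; paths containing spaces would
        # be truncated here just as they would with `awk '{print $NF}'`.
        fields = line.split()
        if fields:  # blank lines are skipped (a small deviation from awk)
            paths.append(fields[-1])
    return ",".join(paths)


example_lines = [
    "A\tord_dataset-new.pb.gz",  # root submission: no LFS object yet
    "M\tdata/4d/ord_dataset-aaaa.pb.gz",  # modified dataset
    "R100\tdata/0a/ord_dataset-bbbb.pb.gz\tdata/0b/ord_dataset-bbbb.pb.gz",
]
include = build_lfs_include(example_lines)
print(include)
```

For a rename the new path is the one selected, since it is the last field on
the line.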

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop redundant GIT_LFS_SKIP_SMUDGE from the mirror checkout

With lfs: false, actions/checkout never runs the LFS smudge filter, so the
GIT_LFS_SKIP_SMUDGE env var was a no-op. (Greptile review.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions <github-actions@github.com>
diff --git a/.gitattributes b/.gitattributes
@@ -1,4 +1,10 @@
-ord_data-* filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pb.gz filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
+# Published datasets under data/ are stored in Git LFS; clones fetch the objects
+# from the Hugging Face mirror (see .lfsconfig) to conserve GitHub bandwidth.
+#
+# LFS is scoped to data/ on purpose: a new submission staged at the repository
+# root is an ordinary git file with no LFS filter, so it can be pushed from a
+# fork without any LFS configuration. The submission workflow renames it into
+# data/ (process_dataset --update), at which point it becomes an LFS object.
+data/**/*.pb filter=lfs diff=lfs merge=lfs -text
+data/**/*.pb.gz filter=lfs diff=lfs merge=lfs -text
+data/**/*.parquet filter=lfs diff=lfs merge=lfs -text
diff --git a/.github/workflows/huggingface_mirror.yml b/.github/workflows/huggingface_mirror.yml
@@ -39,8 +39,6 @@ jobs:
         fetch-depth: 0
         lfs: false
         ref: ${{ github.event.pull_request.head.sha || github.sha }}
-      env:
-        GIT_LFS_SKIP_SMUDGE: 1
 
     - id: range
       env:
@@ -69,6 +67,11 @@ jobs:
 
     - run: pip install -r scripts/requirements.txt
 
+    # .lfsconfig points clone/fetch LFS reads at this same HF mirror, but the
+    # objects we are about to upload are not on HF yet — fetch them from GitHub.
+    - name: Point Git LFS reads at GitHub
+      run: git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+
     - name: Mirror to Hugging Face
       env:
         HF_TOKEN: ${{ secrets.HF_TOKEN }}
diff --git a/.github/workflows/submission.yml b/.github/workflows/submission.yml
@@ -24,41 +24,6 @@ env:
   ORD_SCHEMA_TAG: v0.6.3
 
 jobs:
-  count_reactions:
-    if: ${{ ! github.event.pull_request.head.repo.fork }}
-    runs-on: ubuntu-latest
-    steps:
-    - name: Checkout ord-data
-      uses: actions/checkout@v4
-      with:
-        ref: ${{ github.event.pull_request.head.sha }}
-        lfs: true
-    - name: Checkout ord-schema
-      uses: actions/checkout@v4
-      with:
-        repository: Open-Reaction-Database/ord-schema
-        ref: ${{ env.ORD_SCHEMA_TAG }}
-        path: ord-schema
-    - uses: actions/setup-python@v5
-      with:
-        python-version: '3.11'
-    - name: Install ord_schema
-      run: |
-        cd "${GITHUB_WORKSPACE}/ord-schema"
-        python -m pip install --upgrade pip
-        python -m pip install wheel
-        python -m pip install .
-    - name: Update reactions badge
-      run: |
-        cd "${GITHUB_WORKSPACE}"
-        python ord-schema/badges/reactions.py --root=data --output=badges/reactions.svg
-        git add badges/reactions.svg
-        git config user.name github-actions
-        git config user.email github-actions@github.com
-        # Fail gracefully if there is nothing to commit.
-        git commit -a -m "Update badges" || (( $? == 1 ))
-        git push "https://${GITHUB_ACTOR}:${GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" "HEAD:${GITHUB_HEAD_REF}"
-
   check_file_types:
     runs-on: ubuntu-latest
     steps:
@@ -96,7 +61,7 @@ jobs:
       uses: actions/checkout@v4
       with:
         ref: ${{ github.event.pull_request.head.sha }}
-        lfs: true
+        lfs: false
     - name: Add upstream for comparisons to HEAD
       run: |
         cd "${GITHUB_WORKSPACE}"
@@ -125,6 +90,25 @@ jobs:
         echo "NUM_CHANGED_FILES=${LOCAL_NUM_CHANGED}" >> $GITHUB_ENV
         echo "Found ${LOCAL_NUM_CHANGED} changed dataset files"
         cat changed_data_files.txt
+    # Read LFS from GitHub, not the HF mirror that .lfsconfig points clones at:
+    # submissions are validated before their bytes are mirrored to HF on merge
+    # to main, and the "Update submission" step's uploads reuse this lfs.url
+    # override (.lfsconfig has no pushurl). Only the changed datasets are
+    # pulled: process_dataset reads just these inputs, and base revisions of
+    # modified files are smudged on demand through the same lfs.url, so the
+    # rest of the repository is never needed. Skipped entirely when no dataset
+    # files changed.
+    - name: Fetch changed LFS objects from GitHub
+      if: env.NUM_CHANGED_FILES != '0'
+      run: |
+        cd "${GITHUB_WORKSPACE}"
+        git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+        # changed_data_files.txt holds `git diff --name-status` lines; the last
+        # field is the current path (the new path for renames). Paths without an
+        # LFS object at HEAD (root submissions, deletions) are no-ops here.
+        INCLUDE="$(awk '{print $NF}' changed_data_files.txt | paste -sd, -)"
+        echo "Pulling LFS objects for changed datasets: ${INCLUDE}"
+        git lfs pull --include="${INCLUDE}"
     - uses: actions/setup-python@v5
       with:
         python-version: '3.11'
diff --git a/.github/workflows/validation.yml b/.github/workflows/validation.yml
@@ -42,7 +42,23 @@ jobs:
     - name: Checkout ord-data
       uses: actions/checkout@v4
       with:
-        lfs: true
+        lfs: false
+    # .lfsconfig redirects clone/fetch LFS reads to the Hugging Face mirror to
+    # save GitHub bandwidth, but CI reads from GitHub: on a push to main the
+    # just-merged objects are not on HF yet, and pulling only this shard keeps
+    # the transfer tiny instead of fetching the whole dataset in every job.
+    # The checkout step already configured GitHub credentials for the pull.
+    #
+    # matrix.filter (e.g. data/[0-4][0-4]) is intentionally written to be valid
+    # both as the validate_dataset.py regex below and as an LFS path glob, so it
+    # doubles as the --include pattern here. (The parquet job needs a separate
+    # lfs_include because its filter is a lookahead regex, not a glob.)
+    - name: Fetch LFS shard from GitHub
+      env:
+        FILTER: ${{ matrix.filter }}
+      run: |
+        git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+        git lfs pull --include="${FILTER}/*.pb*"
     - name: Checkout ord-schema
       uses: actions/checkout@v4
       with:
@@ -79,14 +95,31 @@ jobs:
           # validate_dataset.py.
           - name: uspto
             filter: 'ord_dataset-1158e351757f315b93cbcbe7bc55f38e\.parquet$'
+            lfs_include: 'data/*/ord_dataset-1158e351757f315b93cbcbe7bc55f38e.parquet'
+            lfs_exclude: ''
           # Everything else (negative lookahead on the USPTO parquet id).
           - name: other
             filter: '^(?!.*ord_dataset-1158e351757f315b93cbcbe7bc55f38e).*\.parquet$'
+            lfs_include: 'data/*/*.parquet'
+            lfs_exclude: 'data/*/ord_dataset-1158e351757f315b93cbcbe7bc55f38e.parquet'
     steps:
     - name: Checkout ord-data
       uses: actions/checkout@v4
       with:
-        lfs: true
+        lfs: false
+    # See validate_pb: read this shard's LFS objects from GitHub rather than the
+    # Hugging Face mirror that .lfsconfig points clones at.
+    - name: Fetch LFS shard from GitHub
+      env:
+        LFS_INCLUDE: ${{ matrix.lfs_include }}
+        LFS_EXCLUDE: ${{ matrix.lfs_exclude }}
+      run: |
+        git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+        if [[ -n "${LFS_EXCLUDE}" ]]; then
+          git lfs pull --include="${LFS_INCLUDE}" --exclude="${LFS_EXCLUDE}"
+        else
+          git lfs pull --include="${LFS_INCLUDE}"
+        fi
     - name: Checkout ord-schema
       uses: actions/checkout@v4
       with:
diff --git a/.lfsconfig b/.lfsconfig
@@ -0,0 +1,24 @@
+# Git LFS endpoint configuration.
+#
+# Clone/fetch reads are served by the Hugging Face mirror so that public clones
+# and forks do not consume GitHub's LFS bandwidth quota. GitHub remains the
+# source of truth: objects are written to GitHub and mirrored to HF after each
+# merge to main.
+#
+# There is intentionally NO `pushurl` here. A committed HTTPS pushurl breaks
+# pushes from SSH remotes (git-lfs would try HTTPS auth), and a committed SSH
+# pushurl breaks CI (no SSH key). Instead, the actors that push LFS objects set
+# the write endpoint themselves:
+#   * CI and the mirror run `git config lfs.url <github>` (this also points
+#     their reads at GitHub, where freshly pushed objects live before they are
+#     mirrored to HF).
+#   * A contributor editing an existing data/ object from a fork runs
+#     `git config lfs.pushurl https://github.com/<user>/ord-data.git/info/lfs`.
+# New submissions are staged at the repository root (not LFS; see
+# .gitattributes), so the common contribution path needs none of this.
+#
+# locksverify is disabled because the HF mirror does not implement the LFS
+# locking API.
+[lfs]
+	url = https://huggingface.co/datasets/open-reaction-database/ord-data.git/info/lfs
+	locksverify = false
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -19,7 +19,22 @@ Great idea! Please [create a feature request](https://github.com/Open-Reaction-D
 Excellent! Please follow the
 [Submission Workflow](https://ord-schema.readthedocs.io/en/latest/submissions.html)
 in the documentation.
-   
+
+**Note on large files (Git LFS):** Published datasets under `data/` are stored
+with [Git LFS](https://git-lfs.com/), and clones fetch the objects from the
+[Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data)
+to conserve GitHub bandwidth. A new submission staged at the repository root is
+an ordinary file, so you can push it from a fork with no LFS setup; the
+submission workflow moves it into `data/` for you.
+
+Only if you push changes to a file that already lives under `data/` (i.e. an LFS
+object) do you need to point LFS uploads at your own fork first, since you cannot
+write to the canonical repository's LFS store:
+
+```
+git config lfs.pushurl https://github.com/<your-username>/ord-data.git/info/lfs
+```
+
 ## Terms of Use
 
 By submitting Contributions (as defined below) to this project, you agree that
diff --git a/README.md b/README.md
@@ -1,43 +1,53 @@
 # ord-data
 
 ![](https://github.com/Open-Reaction-Database/ord-data/workflows/Validation/badge.svg)
-![](https://raw.githubusercontent.com/Open-Reaction-Database/ord-data/main/badges/reactions.svg)
 [![DOI](https://zenodo.org/badge/283813042.svg)](https://zenodo.org/badge/latestdoi/283813042)
 
 ## Getting the Data
 
-**We recommend downloading the dataset from
-[Hugging Face](https://huggingface.co/datasets/open-reaction-database/ord-data)
-instead of cloning this repository with Git LFS.** GitHub LFS bandwidth is a
-shared, limited resource, and heavy cloning traffic can exhaust our monthly
-quota and block downloads for everyone. The Hugging Face mirror has no such
-limit.
+The datasets live under [`data/`](data) and are stored with
+[Git LFS](https://git-lfs.com/). LFS reads are redirected to the
+[Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data)
+via [`.lfsconfig`](.lfsconfig), so dataset objects are fetched from Hugging
+Face's CDN rather than from GitHub's shared (and limited) LFS bandwidth. This is
+automatic — you do not need to configure anything.
 
-### Option 1 (recommended): Download from Hugging Face
+### Option 1: Clone the repository
+
+```bash
+git clone https://github.com/open-reaction-database/ord-data.git
+```
+
+With [Git LFS](https://git-lfs.com/) installed, this pulls every dataset object
+from the Hugging Face mirror and gives you the full Git history with the data in
+place.
+
+### Option 2: Download only the data (a subset, or without Git history)
 
 ```bash
 pip install -r scripts/requirements.txt
 python scripts/download_from_huggingface.py
 ```
 
-The script mirrors the `data/` directory from the Hugging Face dataset into
-your local checkout. Pass `--allow-pattern 'data/4d/*.pb.gz'` (repeatable) to
-download only a subset, or `--output-dir <path>` to write somewhere other
-than the repository root. If you don't need the Git history, you can also
-clone this repo *without* LFS objects and then run the script:
+The script mirrors the `data/` directory from the Hugging Face dataset into your
+local checkout. Pass `--allow-pattern 'data/4d/*.pb.gz'` (repeatable) to download
+only a subset, or `--output-dir <path>` to write somewhere other than the
+repository root. To skip LFS entirely during the clone and fetch the data
+afterward:
 
 ```bash
 GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/open-reaction-database/ord-data.git
 cd ord-data
 python scripts/download_from_huggingface.py
 ```
 
-### Option 2: Clone with Git LFS
+You can also browse and download datasets directly from the
+[Hugging Face dataset page](https://huggingface.co/datasets/open-reaction-database/ord-data).
 
-If you have access to Git LFS bandwidth and need the `.pb.gz` files in place
-as part of a normal clone, install [Git LFS](https://git-lfs.github.com)
-before cloning. Please prefer Option 1 when possible so we don't exhaust the
-shared LFS quota.
+For how this LFS / Hugging Face mirror setup works (and what it means for
+contributors), see
+[Git LFS and the Hugging Face mirror](#git-lfs-and-the-hugging-face-mirror)
+below.
 
 ## Data Manipulation
 
@@ -92,6 +102,55 @@ rxn_json = json.loads(
 print(f"We have converted the {input_fname} to JSON format shown as below, \n{rxn_json}")
 ```
 
+## Git LFS and the Hugging Face mirror
+
+Dataset files under [`data/`](data) are stored with Git LFS. Clone and fork
+traffic was dominating GitHub's shared LFS bandwidth quota, so the repository is
+configured to keep that traffic off GitHub while leaving GitHub authoritative
+for the data:
+
+- **Reads come from Hugging Face.** [`.lfsconfig`](.lfsconfig) points `lfs.url`
+  at the
+  [Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data),
+  so clones and forks fetch LFS objects from HF's CDN instead of GitHub.
+- **GitHub remains the source of truth.** LFS objects are always written to
+  GitHub (storage there is fine; only download bandwidth was the problem), and
+  the [mirror workflow](.github/workflows/huggingface_mirror.yml) copies them to
+  Hugging Face after every merge to `main`. Hugging Face is purely a read
+  replica — every object is always retrievable from GitHub.
+- **LFS is scoped to `data/`** (see [`.gitattributes`](.gitattributes)). A new
+  dataset staged at the repository root is an ordinary Git file, so submissions
+  can be pushed from a fork with no LFS configuration; the submission workflow
+  turns the file into an LFS object when it moves it into `data/`.
+
+### For contributors
+
+- **Submitting a new dataset:** nothing special is required — stage your file at
+  the repository root and open a PR (see [CONTRIBUTING.md](CONTRIBUTING.md) and
+  the
+  [Submission Workflow](https://docs.open-reaction-database.org/en/latest/submissions.html)).
+- **Editing a file that already lives under `data/` from a fork:** that file is
+  an LFS object, so point LFS uploads at your own fork once before pushing (you
+  cannot write to the canonical repository's LFS store):
+
+  ```bash
+  git config lfs.pushurl https://github.com/<your-username>/ord-data.git/info/lfs
+  ```
+
+### For maintainers (CI)
+
+Freshly pushed objects are not on the Hugging Face mirror until the post-merge
+mirror job runs, so CI and the mirror override the read endpoint back to GitHub
+at runtime (`git config lfs.url …`):
+
+- [`validation.yml`](.github/workflows/validation.yml) pulls only each matrix
+  shard's objects from GitHub, sparsely, instead of the whole dataset in every
+  job.
+- [`submission.yml`](.github/workflows/submission.yml) reads from GitHub so fork
+  and branch submissions are validated before their bytes reach Hugging Face.
+- [`huggingface_mirror.yml`](.github/workflows/huggingface_mirror.yml) reads the
+  to-be-mirrored objects from GitHub.
+
 ## Contributing
 
 Please see the [Submission Workflow](https://docs.open-reaction-database.org/en/latest/submissions.html) documentation. Make sure to review the [license](https://github.com/open-reaction-database/ord-data/blob/main/LICENSE) and [terms of use](https://github.com/open-reaction-database/ord-data/blob/main/CONTRIBUTING.md#terms-of-use).
diff --git a/badges/reactions.svg b/badges/reactions.svg