Commit 2e40c8e

skearnes, claude, and github-actions authored
Offload Git LFS read bandwidth to the Hugging Face mirror (#244)
* Offload Git LFS read bandwidth to the Hugging Face mirror

  Public clones account for ~87% of the GitHub LFS bandwidth quota (measured ~585 GB of ~675 GB over the last month; the rest is CI). Redirect LFS reads to the existing HF mirror while keeping GitHub as the source of truth.

  - .lfsconfig: lfs.url -> HF for clone/fetch. No pushurl is committed (a fixed HTTPS pushurl breaks SSH pushers and a fixed SSH pushurl breaks CI); the actors that write LFS objects set lfs.url back to GitHub themselves.
  - CI and the mirror override lfs.url to GitHub at runtime, because newly pushed objects are not on HF until the post-merge mirror job runs.
  - validation.yml: pull only each matrix shard's LFS objects from GitHub (was: actions/checkout lfs:true pulled the whole dataset in all 11 jobs).
  - submission.yml: read LFS from GitHub so fork/branch submissions are validated before their bytes are mirrored to HF on merge to main.
  - .gitattributes: scope LFS to data/ so a new submission staged at the repo root is an ordinary git file, pushable from a fork with no LFS setup; the submission workflow turns it into an LFS object when it moves it into data/.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update badges

* Document the Git LFS / Hugging Face mirror setup in the README

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix stale submission.yml comment referencing a removed pushurl

  Greptile review caught that the comment still credited a GitHub pushurl from .lfsconfig for the Update step's LFS uploads; .lfsconfig has no pushurl, so uploads use the earlier lfs.url override instead.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Address review: clarify FILTER reuse, gate submission LFS pull, wrap README

  - validation.yml: document that matrix.filter doubles as the LFS --include glob for pb shards (and why parquet needs a separate lfs_include).
  - submission.yml: move the process_submission LFS pull after change detection and gate it on NUM_CHANGED_FILES so non-data PRs skip the full-repo pull.
  - README.md: wrap the new Git LFS section to ~80 columns for consistency.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop the reactions-count badge and its CI job

  count_reactions existed only to regenerate badges/reactions.svg. It added a bot "Update badges" commit to every PR (extra friction and rebase churn) and required a full-repository LFS pull. The badge adds little value, so remove the job, the README badge, and the generated SVG.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Pull only changed datasets in submission CI

  process_dataset reads only the changed inputs (and smudges base revisions of modified files on demand via lfs.url), so the submission job no longer needs the whole dataset. Build the lfs pull --include list from changed_data_files.txt instead of pulling everything.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop redundant GIT_LFS_SKIP_SMUDGE from the mirror checkout

  With lfs: false, actions/checkout never runs the LFS smudge filter, so the GIT_LFS_SKIP_SMUDGE env var was a no-op. (Greptile review.)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions <github-actions@github.com>
1 parent e190a1b commit 2e40c8e

8 files changed

Lines changed: 187 additions & 64 deletions

.gitattributes

Lines changed: 10 additions & 4 deletions
@@ -1,4 +1,10 @@
-ord_data-* filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pb.gz filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
+# Published datasets under data/ are stored in Git LFS; clones fetch the objects
+# from the Hugging Face mirror (see .lfsconfig) to conserve GitHub bandwidth.
+#
+# LFS is scoped to data/ on purpose: a new submission staged at the repository
+# root is an ordinary git file with no LFS filter, so it can be pushed from a
+# fork without any LFS configuration. The submission workflow renames it into
+# data/ (process_dataset --update), at which point it becomes an LFS object.
+data/**/*.pb filter=lfs diff=lfs merge=lfs -text
+data/**/*.pb.gz filter=lfs diff=lfs merge=lfs -text
+data/**/*.parquet filter=lfs diff=lfs merge=lfs -text

.github/workflows/huggingface_mirror.yml

Lines changed: 5 additions & 2 deletions
@@ -39,8 +39,6 @@ jobs:
           fetch-depth: 0
           lfs: false
           ref: ${{ github.event.pull_request.head.sha || github.sha }}
-        env:
-          GIT_LFS_SKIP_SMUDGE: 1
 
       - id: range
         env:
@@ -69,6 +67,11 @@ jobs:
 
       - run: pip install -r scripts/requirements.txt
 
+      # .lfsconfig points clone/fetch LFS reads at this same HF mirror, but the
+      # objects we are about to upload are not on HF yet — fetch them from GitHub.
+      - name: Point Git LFS reads at GitHub
+        run: git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+
       - name: Mirror to Hugging Face
         env:
           HF_TOKEN: ${{ secrets.HF_TOKEN }}
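
The same GitHub LFS endpoint string is written out in each workflow that overrides `lfs.url`. As a rough sketch of how that URL is derived from the repository slug, the helper below builds it from an `owner/repo` argument. The function name `github_lfs_url` is hypothetical and not part of the commit; the URL shape is taken from the workflow steps above. A value set with `git config lfs.url` in the local repository takes precedence over the one committed in `.lfsconfig`, which is why a single `git config` call is enough to redirect reads.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the GitHub LFS endpoint for an "owner/repo" slug.
# The URL form mirrors the one used in the workflow steps in this commit.
github_lfs_url() {
  printf 'https://github.com/%s.git/info/lfs\n' "$1"
}

# In a workflow, GITHUB_REPOSITORY would supply the slug, for example:
#   git config lfs.url "$(github_lfs_url "${GITHUB_REPOSITORY}")"
github_lfs_url "Open-Reaction-Database/ord-data"
```
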

.github/workflows/submission.yml

Lines changed: 20 additions & 36 deletions
@@ -24,41 +24,6 @@ env:
   ORD_SCHEMA_TAG: v0.6.3
 
 jobs:
-  count_reactions:
-    if: ${{ ! github.event.pull_request.head.repo.fork }}
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout ord-data
-        uses: actions/checkout@v4
-        with:
-          ref: ${{ github.event.pull_request.head.sha }}
-          lfs: true
-      - name: Checkout ord-schema
-        uses: actions/checkout@v4
-        with:
-          repository: Open-Reaction-Database/ord-schema
-          ref: ${{ env.ORD_SCHEMA_TAG }}
-          path: ord-schema
-      - uses: actions/setup-python@v5
-        with:
-          python-version: '3.11'
-      - name: Install ord_schema
-        run: |
-          cd "${GITHUB_WORKSPACE}/ord-schema"
-          python -m pip install --upgrade pip
-          python -m pip install wheel
-          python -m pip install .
-      - name: Update reactions badge
-        run: |
-          cd "${GITHUB_WORKSPACE}"
-          python ord-schema/badges/reactions.py --root=data --output=badges/reactions.svg
-          git add badges/reactions.svg
-          git config user.name github-actions
-          git config user.email github-actions@github.com
-          # Fail gracefully if there is nothing to commit.
-          git commit -a -m "Update badges" || (( $? == 1 ))
-          git push "https://${GITHUB_ACTOR}:${GITHUB_TOKEN}@github.com/${GITHUB_REPOSITORY}.git" "HEAD:${GITHUB_HEAD_REF}"
-
   check_file_types:
     runs-on: ubuntu-latest
     steps:
@@ -96,7 +61,7 @@ jobs:
         uses: actions/checkout@v4
         with:
           ref: ${{ github.event.pull_request.head.sha }}
-          lfs: true
+          lfs: false
       - name: Add upstream for comparisons to HEAD
         run: |
           cd "${GITHUB_WORKSPACE}"
@@ -125,6 +90,25 @@ jobs:
           echo "NUM_CHANGED_FILES=${LOCAL_NUM_CHANGED}" >> $GITHUB_ENV
           echo "Found ${LOCAL_NUM_CHANGED} changed dataset files"
           cat changed_data_files.txt
+      # Read LFS from GitHub, not the HF mirror that .lfsconfig points clones at:
+      # submissions are validated before their bytes are mirrored to HF on merge
+      # to main, and the "Update submission" step's uploads reuse this lfs.url
+      # override (.lfsconfig has no pushurl). Only the changed datasets are
+      # pulled: process_dataset reads just these inputs, and base revisions of
+      # modified files are smudged on demand through the same lfs.url, so the
+      # rest of the repository is never needed. Skipped entirely when no dataset
+      # files changed.
+      - name: Fetch changed LFS objects from GitHub
+        if: env.NUM_CHANGED_FILES != '0'
+        run: |
+          cd "${GITHUB_WORKSPACE}"
+          git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+          # changed_data_files.txt holds `git diff --name-status` lines; the last
+          # field is the current path (the new path for renames). Paths without an
+          # LFS object at HEAD (root submissions, deletions) are no-ops here.
+          INCLUDE="$(awk '{print $NF}' changed_data_files.txt | paste -sd, -)"
+          echo "Pulling LFS objects for changed datasets: ${INCLUDE}"
+          git lfs pull --include="${INCLUDE}"
       - uses: actions/setup-python@v5
         with:
           python-version: '3.11'
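
The `INCLUDE` list in the new step can be tried outside CI. The sketch below feeds a few made-up `git diff --name-status` lines through the same `awk | paste` pipeline. The function name `build_lfs_include` and the dataset paths are hypothetical. One deliberate variation: it splits on tabs with `-F'\t'`, whereas the workflow uses awk's default whitespace splitting; assuming the input is tab-separated, as `git diff --name-status` output normally is, this keeps a path containing spaces intact.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: read `git diff --name-status` lines on stdin and print
# the last tab-separated field of each line (the current path, which is the
# new path for renames) as one comma-separated list.
build_lfs_include() {
  awk -F'\t' '{print $NF}' | paste -sd, -
}

# Made-up sample input: one added, one modified, and one renamed file.
INCLUDE="$(
  printf 'A\tdata/4d/ord_dataset-aaa.pb.gz\nM\tdata/0f/ord_dataset-bbb.pb.gz\nR100\tnew_submission.pb.gz\tdata/7c/ord_dataset-ccc.pb.gz\n' \
    | build_lfs_include
)"
echo "${INCLUDE}"
```

In the workflow the input would come from the file instead, for example `build_lfs_include < changed_data_files.txt`, and the result would be passed to `git lfs pull --include`.
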

.github/workflows/validation.yml

Lines changed: 35 additions & 2 deletions
@@ -42,7 +42,23 @@ jobs:
       - name: Checkout ord-data
         uses: actions/checkout@v4
         with:
-          lfs: true
+          lfs: false
+      # .lfsconfig redirects clone/fetch LFS reads to the Hugging Face mirror to
+      # save GitHub bandwidth, but CI reads from GitHub: on a push to main the
+      # just-merged objects are not on HF yet, and pulling only this shard keeps
+      # the transfer tiny instead of fetching the whole dataset in every job.
+      # The checkout step already configured GitHub credentials for the pull.
+      #
+      # matrix.filter (e.g. data/[0-4][0-4]) is intentionally written to be valid
+      # both as the validate_dataset.py regex below and as an LFS path glob, so it
+      # doubles as the --include pattern here. (The parquet job needs a separate
+      # lfs_include because its filter is a lookahead regex, not a glob.)
+      - name: Fetch LFS shard from GitHub
+        env:
+          FILTER: ${{ matrix.filter }}
+        run: |
+          git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+          git lfs pull --include="${FILTER}/*.pb*"
       - name: Checkout ord-schema
         uses: actions/checkout@v4
         with:
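
The comment above says a filter such as `data/[0-4][0-4]` is valid both as a regular expression and as a path glob. A small sketch of that dual reading follows, using the shell's `case` patterns and `grep -E` as stand-ins. These are assumptions for illustration only: Git LFS applies its own gitignore-style matcher and `validate_dataset.py` uses Python regular expressions, so the stand-ins only approximate both. The function names and sample paths are hypothetical.

```bash
#!/usr/bin/env bash
set -eu

FILTER='data/[0-4][0-4]'

# Stand-in for the LFS --include glob: FILTER plus the "/*.pb*" suffix that the
# workflow appends, matched with the shell's own pattern rules.
matches_as_glob() {
  case "$1" in
    ${FILTER}/*.pb*) return 0 ;;
    *) return 1 ;;
  esac
}

# Stand-in for the validate_dataset.py filter: FILTER read as an unanchored
# extended regular expression.
matches_as_regex() {
  printf '%s\n' "$1" | grep -Eq "${FILTER}"
}

# Hypothetical paths: the first directory is inside the shard, the second is not.
for path in data/03/ord_dataset-aaa.pb.gz data/4d/ord_dataset-bbb.pb.gz; do
  if matches_as_glob "${path}" && matches_as_regex "${path}"; then
    echo "${path}: in shard"
  else
    echo "${path}: not in shard"
  fi
done
```

The bracket expressions mean the same thing under both readings, which is what lets one string serve both purposes. A lookahead filter like the one in the parquet matrix has no glob equivalent, so that job carries a separate `lfs_include`.
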
@@ -79,14 +95,31 @@ jobs:
           # validate_dataset.py.
           - name: uspto
             filter: 'ord_dataset-1158e351757f315b93cbcbe7bc55f38e\.parquet$'
+            lfs_include: 'data/*/ord_dataset-1158e351757f315b93cbcbe7bc55f38e.parquet'
+            lfs_exclude: ''
           # Everything else (negative lookahead on the USPTO parquet id).
           - name: other
             filter: '^(?!.*ord_dataset-1158e351757f315b93cbcbe7bc55f38e).*\.parquet$'
+            lfs_include: 'data/*/*.parquet'
+            lfs_exclude: 'data/*/ord_dataset-1158e351757f315b93cbcbe7bc55f38e.parquet'
     steps:
       - name: Checkout ord-data
         uses: actions/checkout@v4
         with:
-          lfs: true
+          lfs: false
+      # See validate_pb: read this shard's LFS objects from GitHub rather than the
+      # Hugging Face mirror that .lfsconfig points clones at.
+      - name: Fetch LFS shard from GitHub
+        env:
+          LFS_INCLUDE: ${{ matrix.lfs_include }}
+          LFS_EXCLUDE: ${{ matrix.lfs_exclude }}
+        run: |
+          git config lfs.url "https://github.com/${GITHUB_REPOSITORY}.git/info/lfs"
+          if [[ -n "${LFS_EXCLUDE}" ]]; then
+            git lfs pull --include="${LFS_INCLUDE}" --exclude="${LFS_EXCLUDE}"
+          else
+            git lfs pull --include="${LFS_INCLUDE}"
+          fi
       - name: Checkout ord-schema
         uses: actions/checkout@v4
         with:
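
The parquet step branches on whether `LFS_EXCLUDE` is empty, presumably so that an empty `--exclude` is never passed. One way to express the same choice without repeating the command is to assemble the argument list first. The sketch below is a dry run that only prints the command it would execute; the function name `lfs_pull_command` is hypothetical, and the include and exclude values are copied from the matrix above.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical dry-run helper: print the `git lfs pull` command for one shard.
# --exclude is appended only when a non-empty pattern is supplied.
lfs_pull_command() {
  include="$1"
  exclude="${2:-}"
  set -- "--include=${include}"
  if [ -n "${exclude}" ]; then
    set -- "$@" "--exclude=${exclude}"
  fi
  echo "git lfs pull $*"
}

# The two parquet shards from the matrix: uspto (no exclude) and other.
lfs_pull_command 'data/*/ord_dataset-1158e351757f315b93cbcbe7bc55f38e.parquet' ''
lfs_pull_command 'data/*/*.parquet' 'data/*/ord_dataset-1158e351757f315b93cbcbe7bc55f38e.parquet'
```

Printing rather than running keeps the sketch usable without a repository or Git LFS installed. The explicit `if`/`else` in the workflow does the same job and is arguably easier to read inside a YAML `run` block.
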

.lfsconfig

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Git LFS endpoint configuration.
+#
+# Clone/fetch reads are served by the Hugging Face mirror so that public clones
+# and forks do not consume GitHub's LFS bandwidth quota. GitHub remains the
+# source of truth: objects are written to GitHub and mirrored to HF after each
+# merge to main.
+#
+# There is intentionally NO `pushurl` here. A committed HTTPS pushurl breaks
+# pushes from SSH remotes (git-lfs would try HTTPS auth), and a committed SSH
+# pushurl breaks CI (no SSH key). Instead, the actors that push LFS objects set
+# the write endpoint themselves:
+# * CI and the mirror run `git config lfs.url <github>` (this also points
+# their reads at GitHub, where freshly pushed objects live before they are
+# mirrored to HF).
+# * A contributor editing an existing data/ object from a fork runs
+# `git config lfs.pushurl https://github.com/<user>/ord-data.git/info/lfs`.
+# New submissions are staged at the repository root (not LFS; see
+# .gitattributes), so the common contribution path needs none of this.
+#
+# locksverify is disabled because the HF mirror does not implement the LFS
+# locking API.
+[lfs]
+url = https://huggingface.co/datasets/open-reaction-database/ord-data.git/info/lfs
+locksverify = false

CONTRIBUTING.md

Lines changed: 16 additions & 1 deletion
@@ -19,7 +19,22 @@ Great idea! Please [create a feature request](https://github.com/Open-Reaction-D
 Excellent! Please follow the
 [Submission Workflow](https://ord-schema.readthedocs.io/en/latest/submissions.html)
 in the documentation.
-
+
+**Note on large files (Git LFS):** Published datasets under `data/` are stored
+with [Git LFS](https://git-lfs.com/), and clones fetch the objects from the
+[Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data)
+to conserve GitHub bandwidth. A new submission staged at the repository root is
+an ordinary file, so you can push it from a fork with no LFS setup; the
+submission workflow moves it into `data/` for you.
+
+Only if you push changes to a file that already lives under `data/` (i.e. an LFS
+object) do you need to point LFS uploads at your own fork first, since you cannot
+write to the canonical repository's LFS store:
+
+```
+git config lfs.pushurl https://github.com/<your-username>/ord-data.git/info/lfs
+```
+
 ## Terms of Use
 
 By submitting Contributions (as defined below) to this project, you agree that

README.md

Lines changed: 77 additions & 18 deletions
@@ -1,43 +1,53 @@
 # ord-data
 
 ![](https://github.com/Open-Reaction-Database/ord-data/workflows/Validation/badge.svg)
-![](https://raw.githubusercontent.com/Open-Reaction-Database/ord-data/main/badges/reactions.svg)
 [![DOI](https://zenodo.org/badge/283813042.svg)](https://zenodo.org/badge/latestdoi/283813042)
 
 ## Getting the Data
 
-**We recommend downloading the dataset from
-[Hugging Face](https://huggingface.co/datasets/open-reaction-database/ord-data)
-instead of cloning this repository with Git LFS.** GitHub LFS bandwidth is a
-shared, limited resource, and heavy cloning traffic can exhaust our monthly
-quota and block downloads for everyone. The Hugging Face mirror has no such
-limit.
+The datasets live under [`data/`](data) and are stored with
+[Git LFS](https://git-lfs.com/). LFS reads are redirected to the
+[Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data)
+via [`.lfsconfig`](.lfsconfig), so dataset objects are fetched from Hugging
+Face's CDN rather than from GitHub's shared (and limited) LFS bandwidth. This is
+automatic — you do not need to configure anything.
 
-### Option 1 (recommended): Download from Hugging Face
+### Option 1: Clone the repository
+
+```bash
+git clone https://github.com/open-reaction-database/ord-data.git
+```
+
+With [Git LFS](https://git-lfs.com/) installed, this pulls every dataset object
+from the Hugging Face mirror and gives you the full Git history with the data in
+place.
+
+### Option 2: Download only the data (a subset, or without Git history)
 
 ```bash
 pip install -r scripts/requirements.txt
 python scripts/download_from_huggingface.py
 ```
 
-The script mirrors the `data/` directory from the Hugging Face dataset into
-your local checkout. Pass `--allow-pattern 'data/4d/*.pb.gz'` (repeatable) to
-download only a subset, or `--output-dir <path>` to write somewhere other
-than the repository root. If you don't need the Git history, you can also
-clone this repo *without* LFS objects and then run the script:
+The script mirrors the `data/` directory from the Hugging Face dataset into your
+local checkout. Pass `--allow-pattern 'data/4d/*.pb.gz'` (repeatable) to download
+only a subset, or `--output-dir <path>` to write somewhere other than the
+repository root. To skip LFS entirely during the clone and fetch the data
+afterward:
 
 ```bash
 GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/open-reaction-database/ord-data.git
 cd ord-data
 python scripts/download_from_huggingface.py
 ```
 
-### Option 2: Clone with Git LFS
+You can also browse and download datasets directly from the
+[Hugging Face dataset page](https://huggingface.co/datasets/open-reaction-database/ord-data).
 
-If you have access to Git LFS bandwidth and need the `.pb.gz` files in place
-as part of a normal clone, install [Git LFS](https://git-lfs.github.com)
-before cloning. Please prefer Option 1 when possible so we don't exhaust the
-shared LFS quota.
+For how this LFS / Hugging Face mirror setup works (and what it means for
+contributors), see
+[Git LFS and the Hugging Face mirror](#git-lfs-and-the-hugging-face-mirror)
+below.
 
 ## Data Manipulation
 
@@ -92,6 +102,55 @@ rxn_json = json.loads(
 print(f"We have converted the {input_fname} to JSON format shown as below, \n{rxn_json}")
 ```
 
+## Git LFS and the Hugging Face mirror
+
+Dataset files under [`data/`](data) are stored with Git LFS. Clone and fork
+traffic was dominating GitHub's shared LFS bandwidth quota, so the repository is
+configured to keep that traffic off GitHub while leaving GitHub authoritative
+for the data:
+
+- **Reads come from Hugging Face.** [`.lfsconfig`](.lfsconfig) points `lfs.url`
+  at the
+  [Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data),
+  so clones and forks fetch LFS objects from HF's CDN instead of GitHub.
+- **GitHub remains the source of truth.** LFS objects are always written to
+  GitHub (storage there is fine; only download bandwidth was the problem), and
+  the [mirror workflow](.github/workflows/huggingface_mirror.yml) copies them to
+  Hugging Face after every merge to `main`. Hugging Face is purely a read
+  replica — every object is always retrievable from GitHub.
+- **LFS is scoped to `data/`** (see [`.gitattributes`](.gitattributes)). A new
+  dataset staged at the repository root is an ordinary Git file, so submissions
+  can be pushed from a fork with no LFS configuration; the submission workflow
+  turns the file into an LFS object when it moves it into `data/`.
+
+### For contributors
+
+- **Submitting a new dataset:** nothing special is required — stage your file at
+  the repository root and open a PR (see [CONTRIBUTING.md](CONTRIBUTING.md) and
+  the
+  [Submission Workflow](https://docs.open-reaction-database.org/en/latest/submissions.html)).
+- **Editing a file that already lives under `data/` from a fork:** that file is
+  an LFS object, so point LFS uploads at your own fork once before pushing (you
+  cannot write to the canonical repository's LFS store):
+
+  ```bash
+  git config lfs.pushurl https://github.com/<your-username>/ord-data.git/info/lfs
+  ```
+
+### For maintainers (CI)
+
+Freshly pushed objects are not on the Hugging Face mirror until the post-merge
+mirror job runs, so CI and the mirror override the read endpoint back to GitHub
+at runtime (`git config lfs.url …`):
+
+- [`validation.yml`](.github/workflows/validation.yml) pulls only each matrix
+  shard's objects from GitHub, sparsely, instead of the whole dataset in every
+  job.
+- [`submission.yml`](.github/workflows/submission.yml) reads from GitHub so fork
+  and branch submissions are validated before their bytes reach Hugging Face.
+- [`huggingface_mirror.yml`](.github/workflows/huggingface_mirror.yml) reads the
+  to-be-mirrored objects from GitHub.
+
 ## Contributing
 
 Please see the [Submission Workflow](https://docs.open-reaction-database.org/en/latest/submissions.html) documentation. Make sure to review the [license](https://github.com/open-reaction-database/ord-data/blob/main/LICENSE) and [terms of use](https://github.com/open-reaction-database/ord-data/blob/main/CONTRIBUTING.md#terms-of-use).

badges/reactions.svg

Lines changed: 0 additions & 1 deletion
This file was deleted.
