|
1 | 1 | # ord-data |
2 | 2 |
|
3 | 3 |  |
4 | | - |
5 | 4 | [DOI](https://zenodo.org/badge/latestdoi/283813042) |
6 | 5 |
|
7 | 6 | ## Getting the Data |
8 | 7 |
|
9 | | -**We recommend downloading the dataset from |
10 | | -[Hugging Face](https://huggingface.co/datasets/open-reaction-database/ord-data) |
11 | | -instead of cloning this repository with Git LFS.** GitHub LFS bandwidth is a |
12 | | -shared, limited resource, and heavy cloning traffic can exhaust our monthly |
13 | | -quota and block downloads for everyone. The Hugging Face mirror has no such |
14 | | -limit. |
| 8 | +The datasets live under [`data/`](data) and are stored with |
| 9 | +[Git LFS](https://git-lfs.com/). LFS reads are redirected to the |
| 10 | +[Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data) |
| 11 | +via [`.lfsconfig`](.lfsconfig), so dataset objects are fetched from Hugging |
| 12 | +Face's CDN rather than from GitHub's shared (and limited) LFS bandwidth. This is |
| 13 | +automatic — you do not need to configure anything. |
15 | 14 |
|
16 | | -### Option 1 (recommended): Download from Hugging Face |
| 15 | +### Option 1: Clone the repository |
| 16 | + |
| 17 | +```bash |
| 18 | +git clone https://github.com/open-reaction-database/ord-data.git |
| 19 | +``` |
| 20 | + |
| 21 | +With [Git LFS](https://git-lfs.com/) installed, this pulls every dataset object |
| 22 | +from the Hugging Face mirror and gives you the full Git history with the data in |
| 23 | +place. |
| 24 | + |
| 25 | +### Option 2: Download only the data (a subset, or without Git history) |
17 | 26 |
|
18 | 27 | ```bash |
19 | 28 | pip install -r scripts/requirements.txt |
20 | 29 | python scripts/download_from_huggingface.py |
21 | 30 | ``` |
22 | 31 |
|
23 | | -The script mirrors the `data/` directory from the Hugging Face dataset into |
24 | | -your local checkout. Pass `--allow-pattern 'data/4d/*.pb.gz'` (repeatable) to |
25 | | -download only a subset, or `--output-dir <path>` to write somewhere other |
26 | | -than the repository root. If you don't need the Git history, you can also |
27 | | -clone this repo *without* LFS objects and then run the script: |
| 32 | +The script mirrors the `data/` directory from the Hugging Face dataset into your |
| 33 | +local checkout. Pass `--allow-pattern 'data/4d/*.pb.gz'` (repeatable) to download |
| 34 | +only a subset, or `--output-dir <path>` to write somewhere other than the |
| 35 | +repository root. To skip LFS entirely during the clone and fetch the data |
| 36 | +afterward: |
28 | 37 |
|
29 | 38 | ```bash |
30 | 39 | GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/open-reaction-database/ord-data.git |
31 | 40 | cd ord-data |
32 | 41 | python scripts/download_from_huggingface.py |
33 | 42 | ``` |
34 | 43 |
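After a clone made with `GIT_LFS_SKIP_SMUDGE=1`, files under `data/` are small Git LFS pointer stubs until the real bytes are downloaded. Below is a minimal sketch for checking a single file, assuming the standard LFS pointer format (a short text file whose first line starts with `version https://git-lfs.github.com/spec/v1`). The helper name `is_lfs_pointer` is hypothetical and not part of this repository.

```bash
# Hypothetical helper: exits 0 if the file is a Git LFS pointer stub,
# non-zero if it holds anything else (for example real .pb.gz bytes).
is_lfs_pointer() {
  # The pointer header fits in the first 64 bytes, so read only those.
  head -c 64 "$1" | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Example (the path is a placeholder, not a real dataset name):
#   is_lfs_pointer "data/<subdir>/<file>.pb.gz" && echo "still a pointer stub"
```

If the check still succeeds after the download script has run, the file was probably not covered by the patterns you passed.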
|
35 | | -### Option 2: Clone with Git LFS |
| 44 | +You can also browse and download datasets directly from the |
| 45 | +[Hugging Face dataset page](https://huggingface.co/datasets/open-reaction-database/ord-data). |
36 | 46 |
|
37 | | -If you have access to Git LFS bandwidth and need the `.pb.gz` files in place |
38 | | -as part of a normal clone, install [Git LFS](https://git-lfs.github.com) |
39 | | -before cloning. Please prefer Option 1 when possible so we don't exhaust the |
40 | | -shared LFS quota. |
| 47 | +For how this LFS / Hugging Face mirror setup works (and what it means for |
| 48 | +contributors), see |
| 49 | +[Git LFS and the Hugging Face mirror](#git-lfs-and-the-hugging-face-mirror) |
| 50 | +below. |
41 | 51 |
|
42 | 52 | ## Data Manipulation |
43 | 53 |
|
@@ -92,6 +102,55 @@ rxn_json = json.loads( |
92 | 102 | print(f"We have converted the {input_fname} to JSON format shown as below, \n{rxn_json}") |
93 | 103 | ``` |
94 | 104 |
|
| 105 | +## Git LFS and the Hugging Face mirror |
| 106 | + |
| 107 | +Dataset files under [`data/`](data) are stored with Git LFS. Clone and fork |
| 108 | +traffic was dominating GitHub's shared LFS bandwidth quota, so the repository is |
| 109 | +configured to keep that traffic off GitHub while leaving GitHub authoritative |
| 110 | +for the data: |
| 111 | + |
| 112 | +- **Reads come from Hugging Face.** [`.lfsconfig`](.lfsconfig) points `lfs.url` |
| 113 | + at the |
| 114 | + [Hugging Face mirror](https://huggingface.co/datasets/open-reaction-database/ord-data), |
| 115 | + so clones and forks fetch LFS objects from HF's CDN instead of GitHub. |
| 116 | +- **GitHub remains the source of truth.** LFS objects are always written to |
| 117 | + GitHub (storage there is fine; only download bandwidth was the problem), and |
| 118 | + the [mirror workflow](.github/workflows/huggingface_mirror.yml) copies them to |
| 119 | + Hugging Face after every merge to `main`. Hugging Face is purely a read |
| 120 | + replica — every object is always retrievable from GitHub. |
| 121 | +- **LFS is scoped to `data/`** (see [`.gitattributes`](.gitattributes)). A new |
| 122 | + dataset staged at the repository root is an ordinary Git file, so submissions |
| 123 | + can be pushed from a fork with no LFS configuration; the submission workflow |
| 124 | + turns the file into an LFS object when it moves it into `data/`. |
| 125 | + |
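Because `.lfsconfig` uses the ordinary Git config file format, the read endpoint described in the list above can be inspected without fetching any LFS objects. This is a minimal sketch. The helper name `lfs_read_url` is hypothetical, and what it prints depends entirely on the `.lfsconfig` in your checkout.

```bash
# Hypothetical helper: print the lfs.url value stored in a checkout's .lfsconfig.
# The argument is the path to the checkout (defaults to the current directory).
lfs_read_url() {
  git config --file "${1:-.}/.lfsconfig" --get lfs.url
}

# Example: run from the repository root.
#   lfs_read_url .
```

This reads only the committed file. A value set at runtime with `git config lfs.url` lives in `.git/config` instead and is not shown here.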
| 126 | +### For contributors |
| 127 | + |
| 128 | +- **Submitting a new dataset:** nothing special is required — stage your file at |
| 129 | + the repository root and open a PR (see [CONTRIBUTING.md](CONTRIBUTING.md) and |
| 130 | + the |
| 131 | + [Submission Workflow](https://docs.open-reaction-database.org/en/latest/submissions.html)). |
| 132 | +- **Editing a file that already lives under `data/` from a fork:** that file is |
| 133 | + an LFS object, so point LFS uploads at your own fork once before pushing (you |
| 134 | + cannot write to the canonical repository's LFS store): |
| 135 | + |
| 136 | + ```bash |
| 137 | + git config lfs.pushurl https://github.com/<your-username>/ord-data.git/info/lfs |
| 138 | + ``` |
| 139 | + |
| 140 | +### For maintainers (CI) |
| 141 | + |
| 142 | +Freshly pushed objects are not on the Hugging Face mirror until the post-merge |
| 143 | +mirror job runs, so CI and the mirror workflow override the read endpoint back to GitHub
| 144 | +at runtime (`git config lfs.url …`): |
| 145 | + |
| 146 | +- [`validation.yml`](.github/workflows/validation.yml) pulls only each matrix |
| 147 | + shard's objects from GitHub, sparsely, instead of the whole dataset in every |
| 148 | + job. |
| 149 | +- [`submission.yml`](.github/workflows/submission.yml) reads from GitHub so fork |
| 150 | + and branch submissions are validated before their bytes reach Hugging Face. |
| 151 | +- [`huggingface_mirror.yml`](.github/workflows/huggingface_mirror.yml) reads the |
| 152 | + to-be-mirrored objects from GitHub. |
| 153 | + |
95 | 154 | ## Contributing |
96 | 155 |
|
97 | 156 | Please see the [Submission Workflow](https://docs.open-reaction-database.org/en/latest/submissions.html) documentation. Make sure to review the [license](https://github.com/open-reaction-database/ord-data/blob/main/LICENSE) and [terms of use](https://github.com/open-reaction-database/ord-data/blob/main/CONTRIBUTING.md#terms-of-use). |
|