|
| 1 | +# AGENTS.md |
| 2 | + |
| 3 | +`scikit-hep-testdata` distributes example HEP files (mostly ROOT and LHE) for testing downstream packages like `uproot` and `pylhe`. It is primarily a *data* package with a thin Python helper layer that resolves a filename to an absolute path, downloading and caching files on demand when they aren't present locally. |
| 4 | + |
| 5 | +## Commands |
| 6 | + |
| 7 | +```bash |
| 8 | +pip install -e .[test] # editable install pulls in test deps; also keeps data files local |
| 9 | +pytest # run the suite (config in pyproject.toml [tool.pytest]) |
| 10 | +pytest tests/test_local_files.py::test_data_path # single test |
| 11 | +prek -a --quiet # lint (ruff + black + mypy), preferred over pre-commit run -a |
| 12 | +uv run pytest # if working inside a uv-managed env |
| 13 | +``` |
| 14 | + |
| 15 | +Tests run with `filterwarnings = ["error"]`, so any warning fails the suite. mypy runs in strict mode over `src` only. |
| 16 | + |
| 17 | +## How file resolution works |
| 18 | + |
| 19 | +`data_path(filename, raise_missing=True, cache_dir=None)` in `local_files.py` is the entry point. The resolution order: |
| 20 | + |
| 21 | +1. **Remote files** (`remote_files.is_known_remote`): scoped names like `cms_hep_2012_tutorial/data.root`, defined in `remote_datasets.yml`. Downloaded (and tar-extracted) on first access into the cache dir. See "Remote datasets" below. |
| 22 | +2. **Local files**: names in `known_files`. If the file isn't physically present (sdist/wheel install strips data), it's downloaded from the `main` branch on GitHub raw and cached. |
| 23 | +3. Otherwise raise `FileNotFoundError` (unless `raise_missing=False`). |
| 24 | + |
| 25 | +Cache directory defaults to `~/.local/skhepdata` (`data.cache_path`), overridable via `cache_dir` / the CLI `--dir` flag. |
| 26 | + |
| 27 | +`known_files` is loaded from `src/skhep_testdata/data/file_list.txt`, which is **generated by `setup.py` at build time** by scanning `data/` for `.root/.lhe/.gz/.json/.hdf5`. Don't hand-edit `file_list.txt`. |
| 28 | + |
| 29 | +## The data-stripping build (setup.py) |
| 30 | + |
| 31 | +The package data files are large, so they are **excluded from the sdist/wheel by default**. `setup.py`: |
| 32 | +- Generates `file_list.txt` from the contents of `data/`. |
| 33 | +- Uses a custom `SDist` command and `exclude_package_data` to strip the actual data files, unless `SKHEP_DATA=1` is set in the environment or the install is editable. |
| 34 | + |
| 35 | +This is why an end user installing from PyPI gets the helper code + `file_list.txt` but downloads actual files lazily. Keep this dual local/remote behavior in mind when changing install or path-resolution logic. |
| 36 | + |
| 37 | +## Remote datasets |
| 38 | + |
| 39 | +`remote_datasets.yml` maps a dataset name → `{url, files}`. `files` is either a list or a `{output_name: path_inside_tar}` map. `RemoteDatasetList` (a classmethod-only singleton with a `_all_files` class cache) flattens these into scoped `dataset/filename` keys. Tests use a separate `tests/test_remote_datasets.yml` loaded explicitly via `load_remote_configs(path)`. |
| 40 | + |
| 41 | +## Adding files |
| 42 | + |
| 43 | +- Drop the file into `src/skhep_testdata/data/`. It becomes a "local" file automatically (extension must be `.root/.lhe/.gz/.json/.hdf5` to be picked up by the build). No code change needed. |
| 44 | +- Large files (>~25 MB) should go to an external open-access repo and be wired in through `remote_datasets.yml` instead. |
| 45 | +- Scripts/notes used to *generate* ROOT test files live in `dev/make-root/` (ROOT C++ macros and a few Python scripts). |
| 46 | +- The `check-added-large-files` pre-commit hook flags oversized additions. |
| 47 | +- It is good practice to add a `.readme` file for files taken from elsewhere, using the same name(s) as the file(s) being added. |
| 48 | + |
| 49 | +## Packaging notes |
| 50 | + |
| 51 | +- Version is managed by `setuptools_scm` and written to `src/skhep_testdata/version.py` (generated; not committed/edited). |
| 52 | +- Two console scripts (`scikit-hep-testdata`, `skhep-testdata`) both point at `skhep_testdata.__main__:main`, equivalent to `python -m skhep_testdata`. |
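
The resolution order described under "How file resolution works" and the key flattening described under "Remote datasets" can be modelled in a few lines. This is a minimal sketch, not the package's implementation: `flatten_remote` and `resolve` are hypothetical names, the `example.org` URL is a placeholder, no download is performed, and the value returned when `raise_missing=False` is an assumption because the text above does not state it.

```python
import tempfile
from pathlib import Path


def flatten_remote(datasets):
    """Flatten {dataset: {"url", "files"}} into scoped 'dataset/filename' keys.

    Hypothetical helper: `files` may be a list of names or a mapping of
    output_name -> path inside the tarball; only the output names become keys.
    """
    scoped = {}
    for dataset, spec in datasets.items():
        files = spec["files"]
        names = files if isinstance(files, list) else list(files)
        for name in names:
            scoped[f"{dataset}/{name}"] = spec["url"]
    return scoped


def resolve(filename, remote, known_files, data_dir, cache_dir, raise_missing=True):
    """Return (source, path) following the documented order; downloads nothing."""
    # 1. Remote files: scoped names are checked first and live in the cache dir.
    if filename in remote:
        return "remote", Path(cache_dir) / filename
    # 2. Local files: use the packaged copy if present, else the cached copy
    #    that would be fetched from GitHub raw.
    if filename in known_files:
        local = Path(data_dir) / filename
        if local.exists():
            return "local", local
        return "github-raw", Path(cache_dir) / filename
    # 3. Unknown name.
    if raise_missing:
        raise FileNotFoundError(filename)
    return "missing", None  # assumption: the real return value is not documented here


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir, cache_dir = Path(tmp, "data"), Path(tmp, "cache")
        data_dir.mkdir()
        (data_dir / "present.root").touch()
        remote = flatten_remote(
            {"cms_hep_2012_tutorial": {"url": "https://example.org/t.tar.gz", "files": ["data.root"]}}
        )
        known = {"present.root", "stripped.root"}
        for name in ("cms_hep_2012_tutorial/data.root", "present.root", "stripped.root"):
            print(name, resolve(name, remote, known, data_dir, cache_dir)[0])
```

Returning the source label alongside the path keeps the sketch easy to test without network access; the real `data_path` returns only a path.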