GenomeCF is a counterfactual validation standard for DNA sequence models.
It is built around a practical question:
If a DNA sequence model scores well on a held-out split, is it also stable under counterfactual perturbations, confounder control, split changes, and external biological transfer?
GenomeCF answers that question with a reusable benchmark, a release registry, a reporting checklist, and a reproducible evaluation workflow.
This folder is the GitHub-ready repository root.
The manuscript is intentionally not stored in this GitHub repository. Keep the paper sources and PDFs in a private, non-versioned sibling folder (for example ../paper/).
Large local-only runtime assets (for example cached embeddings and checkpoints) are kept outside the public repo in:
../local_runtime_assets/
GenomeCF includes:
- core real-data benchmark tasks
- external biological validation tasks
- MPRA variant-effect tasks
- GenomeCF-Synth shortcut-mechanism tasks
- model manifests, task manifests, split manifests, and perturbation manifests
- a canonical results registry
- a local documentation site and reporting checklist
- reproducible CLI commands for evaluation, summarization, validation, and artifact builds
- package_src/genomecf/: installable Python package
  - CLI
  - benchmark and release helpers
- configs/: task, model, split, perturbation, and synthetic configs
- data/: lightweight public task bundles shipped with the repo
  - see data/README.md for large local-only benchmark assets
- docs/: user docs, reproducibility docs, release notes, and static website output
- envs/: optional environment definitions, including the Caduceus CUDA path
- external/: lightweight helper assets that are safe to ship publicly
  - local model checkpoints are excluded from Git and live under ../local_runtime_assets/
- figures/: generated figures used by the benchmark website and local manuscript workflow
- release/: release manifest, checksums, reproduction commands, and expected outputs
- results/: canonical registry, release summaries, validation reports, publication tables, and traceability outputs
  - heavyweight embedding caches are excluded from Git and live under ../local_runtime_assets/
- scripts/: reproducibility helpers
- src/: artifact-generation scripts
- tests/: fast tests and release tests
Key top-level metadata files:
- pyproject.toml
- LICENSE
- CITATION.cff
- codemeta.json
- .zenodo.json
- CHANGELOG.md
GenomeCF is not trying to prove that a model "understands biology." Instead, it measures whether a model behaves consistently under checks that matter for interpretation and transfer.
The current release covers:
- held-out AUROC and related standard metrics
- reverse-complement instability and RC flip behavior
- mononucleotide and dinucleotide shuffle sensitivity
- calibration on original and perturbed inputs
- chromosome-grouped split robustness
- matched-negative evaluation
- GC-bin worst-group robustness
- external biological validation
- MPRA variant-effect reliability
- controlled synthetic shortcut-conflict tasks
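To make one of these checks concrete, here is a minimal sketch of a reverse-complement instability measurement. It is not GenomeCF's implementation: the function names, the 0.5 decision threshold, and the toy scorers are assumptions chosen for illustration. Any callable that maps a DNA string to a score in [0, 1] could stand in for a trained model.

```python
from typing import Callable, Sequence, Tuple

# Complement table for DNA bases (N stays N), upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_instability(
    score: Callable[[str], float],
    seqs: Sequence[str],
    threshold: float = 0.5,  # assumed decision threshold, illustrative only
) -> Tuple[float, float]:
    """Compare scores on each sequence and its reverse complement.

    Returns (mean absolute score gap, fraction of sequences whose
    thresholded label flips between the two strands).
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    total_gap = 0.0
    flips = 0
    for seq in seqs:
        fwd = score(seq)
        rev = score(reverse_complement(seq))
        total_gap += abs(fwd - rev)
        flips += (fwd >= threshold) != (rev >= threshold)
    return total_gap / len(seqs), flips / len(seqs)


# Toy scorers standing in for a model (hypothetical, not GenomeCF models).
def gc_fraction(seq: str) -> float:
    """Strand-symmetric: G pairs with C, so the GC count is preserved."""
    return sum(base in "GCgc" for base in seq) / len(seq)


def a_fraction(seq: str) -> float:
    """Strand-dependent: A on one strand becomes T on the other."""
    return seq.upper().count("A") / len(seq)
```

A strand-symmetric scorer such as gc_fraction yields a zero gap and no flips, while a strand-dependent scorer such as a_fraction does not. That contrast is the kind of behavior an RC instability check is meant to surface.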
pip install -e .
pip install -e .[benchmark,dev]

Verify the installation:

genomecf --help

The default local path is enough for:
- quickstart reproduction
- release validation
- registry summaries
- most documentation and website builds
- lightweight smoke tests
Some larger foundation-model runs, especially caduceus_ph, use the documented CUDA route.
See:
- docs/CADUCEUS_SETUP.md
- envs/caduceus.yml
The fastest end-to-end check is:
python -m pip install -e .[benchmark,dev]
genomecf reproduce-quickstart

Expected artifact:
results/release/quickstart/quickstart_report.json
The GitHub repo intentionally excludes large local-only files that make version control fragile or impossible to publish cleanly:
- raw per-sequence benchmark text directories for the large core tasks
- local foundation-model checkpoints such as external/dnabert2_local/
- embedding caches under results/cache/
- temporary runtime outputs under results/tmp/
When those assets exist in the sibling folder ../local_runtime_assets/, GenomeCF will automatically pick them up locally. This keeps the public repository small while preserving the full local workflow on your machine.
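The lookup can be pictured with a small sketch. This is only an illustration of the idea, not GenomeCF's actual resolution logic: the function name resolve_asset and the lookup order (repository first, then the sibling folder) are assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_asset(
    relative: str,
    repo_root: Path,
    sibling_name: str = "local_runtime_assets",
) -> Optional[Path]:
    """Return the first existing location of an optional asset, or None.

    Assumed order (illustrative only): inside the repository, then the
    sibling folder next to the repository root.
    """
    candidates = (
        repo_root / relative,
        repo_root.parent / sibling_name / relative,
    )
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None
```

For example, resolve_asset("results/cache", Path.cwd()) would return the in-repo cache directory if it exists, fall back to ../local_runtime_assets/results/cache, and return None when neither is present.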
GenomeCF is designed to be driven primarily through the CLI.
genomecf --help

genomecf evaluate \
  --task human_nontata_promoters \
  --model kmer_logistic_regression \
  --split official \
  --mode frozen

genomecf external \
  --task gue_human_tf_0 \
  --model dnabert2

genomecf variant \
  --task mpra_bcl11a_enhancer \
  --model dnabert2

genomecf synth \
  --task gc_conflict \
  --model caduceus_ph

genomecf summarize --suite core
genomecf summarize --suite nature_methods

genomecf validate-results
genomecf check-report --results results/release/benchmark_registry.csv

genomecf reproduce-quickstart
genomecf reproduce-focal
genomecf reproduce-external

genomecf build-website

genomecf reproduce-quickstart
genomecf validate-results
genomecf summarize --suite nature_methods

Then open:

- results/release/benchmark_summary.csv
- results/release/external_validation_summary.csv
- results/release/biological_case_study.csv

genomecf build-website

Open:

docs/site/index.html
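If you prefer to drive the quickstart, validation, and summary commands from a script rather than typing them one by one, a thin wrapper is enough. The sketch below is an assumption about how one might do this, not part of the GenomeCF package; run_workflow and its runner parameter are hypothetical names, and only the three commands themselves come from this README.

```python
import subprocess
from typing import Callable, List, Sequence

# The three commands listed above, in the same order.
WORKFLOW: List[List[str]] = [
    ["genomecf", "reproduce-quickstart"],
    ["genomecf", "validate-results"],
    ["genomecf", "summarize", "--suite", "nature_methods"],
]


def run_workflow(
    commands: Sequence[Sequence[str]] = WORKFLOW,
    runner: Callable[..., object] = subprocess.run,
) -> int:
    """Run each command in order and stop at the first failure.

    Returns the number of commands that completed. With the default
    runner, check=True turns a non-zero exit code into
    subprocess.CalledProcessError, which aborts the remaining steps.
    """
    completed = 0
    for command in commands:
        runner(list(command), check=True)
        completed += 1
    return completed
```

Passing a custom runner makes it possible to dry-run the sequence (for example, by printing each command) without invoking the CLI.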
For the full local benchmark workflow, the optional runtime assets can also live at:
../local_runtime_assets/
- human_nontata_promoters
- human_enhancers_cohn
- human_enhancers_ensembl
- human_ocr_ensembl

- dummy_mouse_enhancers_ensembl
- drosophila_enhancers_stark

- TF-binding tasks
- histone-mark tasks
- MPRA variant-effect tasks

- gc_correlated
- gc_matched
- gc_conflict
- two_motif_grammar
- motif_position_conflict

Main models:
- kmer_logistic_regression
- small_cnn
- small_cnn_rc_aug
- dnabert2
- caduceus_ph
Diagnostic baselines:
- gc_only
- cpg_only
- repeat_only
- length_only
Appendix-only diagnostic foundation baseline:
nucleotide_transformer_v2
Main release artifacts:
- registry CSV: results/release/benchmark_registry.csv
- registry JSONL: results/release/benchmark_registry.jsonl
- release summary: results/release/benchmark_summary.csv
- model-task matrix: results/release/model_task_matrix.csv
- validation report: results/release/validation_report.json
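Because the registry is a plain CSV, it can be inspected with the standard library before committing to any schema. The sketch below makes no claim about the real column names (see docs/RESULT_SCHEMA.md for those); the helper names and the "model" column used in the usage note are hypothetical.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def registry_overview(lines: Iterable[str]) -> Tuple[List[str], int]:
    """Return (column names, number of data rows) for CSV text.

    `lines` may be an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(lines)
    n_rows = sum(1 for _ in reader)
    return list(reader.fieldnames or []), n_rows


def count_by(lines: Iterable[str], column: str) -> Dict[str, int]:
    """Count rows per distinct value of one column."""
    reader = csv.DictReader(lines)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in CSV header")
    return dict(Counter(row[column] for row in reader))
```

Typical use would be to open results/release/benchmark_registry.csv with open(path, newline="") and pass the file object to registry_overview to see which columns are present, then call count_by with one of those columns (for instance a "model" column, if the registry has one).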
Release-bundle files:
- release/GenomeCF_v1_manifest.json
- release/GenomeCF_v1_checksums.txt
- release/GenomeCF_v1_reproduction_commands.sh
- release/GenomeCF_v1_expected_outputs.md
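A checksum file is only useful if it is actually checked. The following sketch shows one way to do that, under two loudly stated assumptions: that release/GenomeCF_v1_checksums.txt uses SHA-256, and that each line follows the sha256sum layout of a hex digest followed by a path. Neither is confirmed by this README, and the function names are hypothetical, so adjust the hash and parser to the real file.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def parse_checksums(lines: Iterable[str]) -> Dict[str, str]:
    """Parse '<hex digest>  <path>' lines (assumed sha256sum-style layout)."""
    entries: Dict[str, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, path = line.split(maxsplit=1)
        # sha256sum marks binary-mode entries with a leading '*'.
        entries[path.lstrip("*")] = digest.lower()
    return entries


def find_mismatches(
    entries: Dict[str, str],
    read_bytes: Callable[[str], bytes] = lambda p: Path(p).read_bytes(),
) -> List[str]:
    """Return paths whose SHA-256 digest differs or that cannot be read."""
    bad: List[str] = []
    for path, expected in entries.items():
        try:
            actual = hashlib.sha256(read_bytes(path)).hexdigest()
        except OSError:
            actual = None  # missing or unreadable file counts as a mismatch
        if actual != expected:
            bad.append(path)
    return bad
```

The default reader loads each file fully into memory, which is fine for small release artifacts; for large files a chunked read would be the safer choice.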
The manuscript sources and PDFs are kept in a private folder outside this repository and are not pushed to GitHub.
Start here:
- docs/QUICKSTART.md
- docs/PROTOCOL.md
- docs/REPRODUCIBILITY_PROTOCOL.md
Benchmark and methods docs:
- docs/BENCHMARK.md
- docs/TASKS.md
- docs/MODELS.md
- docs/METRICS.md
- docs/SPLITS.md
- docs/RESULT_SCHEMA.md
- docs/REPORTING_STANDARD.md
- docs/EXTERNAL_VALIDATION.md
- docs/BIOLOGICAL_CASE_STUDY.md
- docs/SYNTHETIC_TASKS.md
- docs/MOTIF_ANALYSIS.md
- docs/GC_BIN_ROBUSTNESS.md
Availability and release docs:
- docs/CODE_AVAILABILITY.md
- docs/DATA_AVAILABILITY.md
- docs/BENCHMARK_AVAILABILITY.md
- docs/MODEL_AVAILABILITY.md
- docs/ENVIRONMENT_AVAILABILITY.md
- docs/ARTIFACT_EVALUATION.md
This folder is the version intended for GitHub.
Before pushing:
- run genomecf validate-results
- run python -m pytest
- inspect docs/RELEASE_SCOPE.md
- inspect release/GenomeCF_v1_manifest.json
- confirm that the private manuscript companion remains outside this repo in ../paper/
- initialize Git in this folder if you want a fresh repository:

git init
git add .
git commit -m "Initial GenomeCF release"

- Docker definitions are included, but Docker cannot be smoke-tested locally on a machine where Docker is not installed.
- The heaviest Caduceus runs use the documented WSL2/Linux CUDA path rather than the default CPU path.
- Some supplement tables can still trigger non-fatal LaTeX overfull or underfull warnings during builds.
For the public release-facing description of what is included in this repo, read:
docs/RELEASE_SCOPE.md
Internal upgrade logs, private manuscript planning notes, and submission-management files are intentionally kept outside the public repo.
- citation metadata: CITATION.cff
- machine-readable metadata: codemeta.json, .zenodo.json
- license: LICENSE