Commit 067d42d

initial recipe add
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent e62a7bb commit 067d42d

101 files changed

Lines changed: 35523 additions & 0 deletions

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -194,3 +194,15 @@ coverage.xml
 Thumbs.db

 .python_history
+
+# Any training results
+results/
+job_output/
+wandb/
+
+# Any model checkpoints
+*.safetensors
+checkpoint_export/
+
+# Hydra outputs
+outputs/
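As a rough illustration of what the new ignore rules cover, the sketch below approximates them in Python. It is not how Git evaluates `.gitignore` (real matching also handles negation, anchoring, and nested ignore files), and the names `IGNORED_DIRS`, `IGNORED_GLOBS`, and `is_ignored` are hypothetical helpers introduced only for this example.

```python
from pathlib import PurePosixPath

# Directory names and file globs added to .gitignore in this commit.
IGNORED_DIRS = {"results", "job_output", "wandb", "checkpoint_export", "outputs"}
IGNORED_GLOBS = ("*.safetensors",)


def is_ignored(path: str) -> bool:
    """Approximate whether a file path would be covered by the new rules."""
    p = PurePosixPath(path)
    # A pattern such as `results/` matches a directory of that name at any depth,
    # so only the parent components are checked, not the file name itself.
    if any(part in IGNORED_DIRS for part in p.parts[:-1]):
        return True
    # A pattern such as `*.safetensors` matches the file name at any depth.
    return any(p.match(glob) for glob in IGNORED_GLOBS)
```

For an authoritative answer on a real checkout, `git check-ignore -v <path>` reports which rule, if any, matches a path.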

README.md

Lines changed: 6 additions & 0 deletions
@@ -15,6 +15,12 @@ expert-level support.

 BioNeMo Framework is part of a larger ecosystem of NVIDIA Biopharma products. Get notified of new releases, bug fixes, critical security updates, and more for biopharma. [Subscribe.](https://www.nvidia.com/en-us/clara/biopharma/product-updates/)

+> [!NOTE]
+> BioNeMo Recipes are now available, which demonstrate high-performance model training outside of the NeMo Framework.
+> The recipes show how to train models that derive from HuggingFace `PreTrainedModel` classes, and use
+> [NVIDIA TransformerEngine](https://github.com/NVIDIA/TransformerEngine) layers for optimized attention kernels. For
+> more information, see the [BioNeMo Recipes README](./bionemo-recipes.md).
+
 ## Structure of the Framework

 The `bionemo-framework` is organized into independently installable namespace packages. These are located under the

bionemo-recipes.md

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
# BioNeMo Recipes

BioNeMo Recipes provides an easy path for the biological foundation model training community to scale up transformer-based models efficiently. Rather than offering a batteries-included training framework, we provide **model checkpoints** with TransformerEngine layers and **training recipes** that demonstrate how to achieve maximum throughput with popular open-source frameworks.

## Overview

The biological AI community is actively prototyping model architectures and needs tooling that prioritizes extensibility, interoperability, and ease of use alongside performance. BioNeMo Recipes addresses this by offering:

- **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
- **Framework compatibility**: Works with popular frameworks like Hugging Face Accelerate, PyTorch Lightning, and vanilla PyTorch
- **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
- **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments

### Use Cases

- **Foundation Model Developers**: AI researchers and ML engineers developing novel biological foundation models who need to scale up prototypes efficiently
- **Foundation Model Customizers**: Domain scientists looking to fine-tune existing models with proprietary data for drug discovery and biological research

## Repository Structure

This repository contains two types of components:

### Models (`models/`)

Hugging Face-compatible `PreTrainedModel` classes that use TransformerEngine layers internally. These are designed to be:

- **Distributed via Hugging Face Hub**: Pre-converted checkpoints available at [huggingface.co/nvidia](https://huggingface.co/nvidia)
- **Drop-in replacements**: Compatible with `AutoModel.from_pretrained()` without additional dependencies
- **Performance optimized**: Leverage TransformerEngine features like FP8 training and context parallelism

Example models include ESM-2, Geneformer, and AMPLIFY.

### Recipes (`recipes/`)

Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:

- **Framework examples**: Vanilla PyTorch, Hugging Face Accelerate, PyTorch Lightning
- **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
- **Scaling strategies**: Single-GPU to multi-node training patterns
- **Benchmarked performance**: Validated throughput and convergence metrics

Recipes are **not pip-installable packages** but serve as reference implementations that users can adapt for their own research.

## Quick Start

### Using Models

```python
from transformers import AutoModel, AutoTokenizer

# Load a BioNeMo model directly from Hugging Face
model = AutoModel.from_pretrained("nvidia/AMPLIFY_120M")
tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")
```

### Running Recipes

```bash
# Navigate to a recipe
cd recipes/esm2_native_te_nvfsdp

# Build and run
docker build -t esm2_recipe .
docker run --rm -it --gpus all esm2_recipe python train.py
```

______________________________________________________________________

## Developer Guide

### Setting Up Development Environment

1. **Install pre-commit hooks:**

   ```bash
   pre-commit install
   ```

   Run hooks manually:

   ```bash
   pre-commit run --all-files
   ```

2. **Test your changes:**
   Each model and recipe has its own build and test setup following this pattern:

   ```bash
   cd models/my_model # or recipes/my_recipe
   docker build . -t my_tag
   docker run --rm -it --gpus all my_tag pytest -v .
   ```

### Coding Guidelines

We prioritize **readability and simplicity** over comprehensive feature coverage:

- **KISS over DRY**: It's better to have clear, duplicated code than complex abstractions
- **One thing well**: Each recipe should demonstrate specific features clearly rather than trying to cover everything
- **Self-contained**: Recipes cannot depend on cutting-edge code from other parts of the repository

### Testing Strategy

We use a three-tier testing approach:

#### L0 Tests (Pre-merge)

- **Purpose**: Fast validation that code works
- **Runtime**: \<10 minutes, single GPU
- **Frequency**: Run automatically on PRs
- **Scope**: Basic functionality, checkpoint creation/loading

#### L1 Tests (Performance Monitoring)

- **Purpose**: Performance benchmarking and partial convergence validation
- **Runtime**: Up to 4 hours, up to 16 GPUs
- **Frequency**: Nightly/weekly
- **Scope**: Throughput metrics, scaling validation

#### L2 Tests (Release Validation)

- **Purpose**: Full convergence and large-scale validation
- **Runtime**: Multiple days, hundreds of GPUs
- **Frequency**: Monthly or before releases
- **Scope**: Complete model convergence, cross-platform validation
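
The three tiers above can be written down as data, which makes it easy to see which tier a given trigger would run. This is only a sketch: the `Tier` class, the `TIERS` table, the `tiers_for` helper, and the trigger names (`pull_request`, `nightly`, `weekly`, `monthly`, `release`) are assumptions made for illustration and do not describe the repository's actual CI configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    purpose: str
    max_runtime: str
    hardware: str
    triggers: tuple[str, ...]  # hypothetical trigger names, see lead-in


# Values copied from the tier descriptions; only the trigger names are invented.
TIERS = (
    Tier("L0", "Fast validation that code works", "<10 minutes", "single GPU", ("pull_request",)),
    Tier(
        "L1",
        "Performance benchmarking and partial convergence validation",
        "up to 4 hours",
        "up to 16 GPUs",
        ("nightly", "weekly"),
    ),
    Tier(
        "L2",
        "Full convergence and large-scale validation",
        "multiple days",
        "hundreds of GPUs",
        ("monthly", "release"),
    ),
)


def tiers_for(trigger: str) -> list[str]:
    """Return the names of the tiers that run for a given trigger."""
    return [tier.name for tier in TIERS if trigger in tier.triggers]
```

With this table, `tiers_for("pull_request")` selects only the fast L0 tier, while the expensive tiers are reserved for scheduled or release runs.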
126+
127+
### Adding New Components
128+
129+
#### Adding a New Model
130+
131+
Models should be pip-installable packages that can export checkpoints to Hugging Face. See the
132+
[models README](models/README.md) for detailed guidelines on:
133+
134+
- Package structure and conventions
135+
- Checkpoint export procedures
136+
- Testing requirements
137+
- CI/CD integration
138+
139+
#### Adding a New Recipe
140+
141+
Recipes should be self-contained Docker environments demonstrating specific training patterns. See
142+
the [recipes README](recipes/README.md) for guidance on:
143+
144+
- Directory structure and naming
145+
- Hydra configuration management
146+
- Docker best practices
147+
- SLURM integration examples
148+
149+
### CI/CD Contract
150+
151+
All components must pass this basic validation:
152+
153+
```bash
154+
docker build -t {component_tag} .
155+
docker run --rm -it --gpus all {component_tag} pytest -v .
156+
```
157+
158+
#### Running CI/CD
159+
160+
To run the CI/CD pipeline locally, run the following command:
161+
162+
```bash
163+
./ci/build_and_test.py
164+
```
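
To make the contract concrete, here is a minimal sketch of how the two contract commands could be assembled and run for one component. It is not the content of `ci/build_and_test.py`, which is not shown in this excerpt; the helpers `contract_commands` and `run_contract`, and the choice of the directory name as the image tag, are assumptions. The sketch also drops `-it` by default, because allocating a TTY typically fails in non-interactive CI jobs.

```python
import shlex
import subprocess
from pathlib import Path


def contract_commands(component_dir: str, interactive: bool = False) -> list[list[str]]:
    """Build the `docker build` and `docker run` argument lists for one component."""
    # Assumption: use the directory name as the tag (Docker image names must be lowercase).
    tag = Path(component_dir).name.lower()
    run = ["docker", "run", "--rm"]
    if interactive:
        run.append("-it")  # only useful when a terminal is attached
    run += ["--gpus", "all", tag, "pytest", "-v", "."]
    return [["docker", "build", "-t", tag, "."], run]


def run_contract(component_dir: str) -> None:
    """Run both commands from inside the component directory, stopping on the first failure."""
    for cmd in contract_commands(component_dir):
        print("+", shlex.join(cmd))
        subprocess.run(cmd, cwd=component_dir, check=True)
```

Calling `run_contract("recipes/esm2_native_te_nvfsdp")` would build the image and run its tests, raising `subprocess.CalledProcessError` if either step fails.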

### Performance Expectations

We aim to provide the fastest available training implementations for biological foundation models, with documented benchmarks across NVIDIA hardware (A100, H100, H200, B100, B200, etc.).

## Contributing

We welcome contributions that advance the state of biological foundation model training. Please ensure your contributions:

1. Follow our coding guidelines emphasizing clarity
2. Include appropriate tests (L0 minimum, L1/L2 as applicable)
3. Provide clear documentation and examples
4. Maintain compatibility with our supported frameworks

For detailed contribution guidelines, see our individual component READMEs:

- [Models Development Guide](models/README.md)
- [Recipes Development Guide](recipes/README.md)

## License

[Add appropriate license information]

## Support

For technical support and questions:

- Check existing issues before opening a new one
- Review our training recipes for implementation examples
- Consult the TransformerEngine and nvFSDP documentation for underlying technologies

recipes/.ruff.toml

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
line-length = 119
target-version = "py312"

[lint]
ignore = ["D100", "E501", "N811", "N814"]
select = [
    "C",    # flake8-comprehensions (C4) and mccabe complexity (C90)
    "D",    # pydocstyle: documentation formatting
    "E",    # pycodestyle errors: style, whitespace
    "F",    # important pyflakes lints
    "I",    # import sorting
    "RUF",  # Some Ruff-specific lints, unused noqas, etc.
    "W",    # pycodestyle warnings
    "N",    # pep8-naming
    "NPY",  # NumPy-specific rules
    "PERF", # Perflint
    "PLE",  # Pylint errors
    "PLW",  # Pylint warnings
    "FURB", # refurb
]

# Allow fix for all enabled rules (when `--fix` is provided).
fixable = ["ALL"]
unfixable = []

# Exclude a variety of commonly ignored directories.
exclude = [
    ".bzr",
    ".direnv",
    ".eggs",
    ".git",
    ".git-rewrite",
    ".hg",
    ".mypy_cache",
    ".nox",
    ".pants.d",
    ".pytype",
    ".ruff_cache",
    ".svn",
    ".tox",
    ".venv",
    "__pypackages__",
    "_build",
    "buck-out",
    "build",
    "dist",
    "node_modules",
    "venv",
    "packages/nvFSDP/",
]

# Ignore import violations in all `__init__.py` files.
[lint.per-file-ignores]
"__init__.py" = ["D104", "E402", "F401", "F403", "F811"]
"test_*.py" = ["D"]
"conftest.py" = ["D"]
"scripts/*.py" = ["D"]
"**/*.ipynb" = ["D"]

[lint.isort]
lines-after-imports = 2
known-third-party = ["wandb"]

[lint.pydocstyle]
convention = "google"

[format]
# Like Black, use double quotes for strings.
quote-style = "double"

# Like Black, indent with spaces, rather than tabs.
indent-style = "space"

# Like Black, respect magic trailing commas.
skip-magic-trailing-comma = false

# Like Black, automatically detect the appropriate line ending.
line-ending = "auto"
