Commit 1942554
Adds codon-fm-native-te recipe (#1531)
### Summary
- Add bionemo-recipes/models/codonfm/ — a HuggingFace-compatible CodonFM
model using TransformerEngine, following the ESM2 pattern
- Add bionemo-recipes/recipes/codonfm_native_te/ — a self-contained
FSDP2 training recipe for CodonFM with FP8/FP4 quantization
support
- Add golden value regression tests cross-validated against the
codonfm_ptl_te non-exact (standard TETransformerLayer) implementation
### models/codonfm/ (HF-compatible model)
The model code in models/codonfm/modeling_codonfm_te.py is the source of
truth, auto-synced to the recipe via check_copied_files.py.
- CodonFMConfig(PretrainedConfig) — HF-compatible config with 4 presets
(200k, 80M, 600M, 1B)
- CodonFMPreTrainedModel(PreTrainedModel) — base class with MAGNETO
initialization (xavier_normal with scaled gain), meta device
support
- CodonFMForMaskedLM — returns MaskedLMOutput, supports
output_hidden_states, per-layer FP8/FP4 precision via layer_precision
config
- CodonEmbedding — token + post-LayerNorm embedding
- CodonFMEncoder — stack of transformer_engine.pytorch.TransformerLayer
with RoPE, BSHD and THD attention formats
- CodonFMLMHead — Dense + GELU + LayerNormLinear (quantization disabled
for numerical stability)
- CodonTokenizer — 3-mer codon tokenizer (69 tokens: 5 special + 64
codons)
- dataset.py — BSHD/THD collators, synthetic and parquet dataset classes
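To make the 69-token vocabulary concrete, here is a minimal sketch of a 3-mer codon tokenizer. The special-token names and ordering, the `ACGT` alphabet, and the function names are assumptions for illustration; the actual `CodonTokenizer` may differ on all of these.

```python
from itertools import product

# Hypothetical special tokens; the real CodonTokenizer's names and ids may differ.
SPECIAL_TOKENS = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"]
# All 4^3 = 64 codons over an assumed DNA alphabet.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
# 5 special tokens followed by 64 codons gives 69 ids in total.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}


def tokenize(sequence: str) -> list[str]:
    """Split a nucleotide sequence into non-overlapping 3-mers, dropping trailing bases."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def encode(sequence: str) -> list[int]:
    """Map each codon to its id, falling back to <unk> for codons outside the vocabulary."""
    unk = VOCAB["<unk>"]
    return [VOCAB.get(codon, unk) for codon in tokenize(sequence)]
```

Dropping an incomplete trailing codon is one possible design choice; a real tokenizer might instead raise an error or emit an unknown token.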
73 tests (36 pass, 10 skip, 25 xfail, 2 xpass):
- Forward/backward smoke tests (BSHD + THD)
- FP8/FP4 quantization tests (DelayedScaling, Float8CurrentScaling,
Float8BlockScaling, MXFP8BlockScaling, NVFP4BlockScaling)
- Meta device and CUDA initialization tests
- Golden value regression tests — weights generated from codonfm_ptl_te
non-exact model, state dict mapped to native_te key format,
cross-model logit equivalence verified
- BSHD ↔ THD equivalence test — same weights, both formats, outputs
compared
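The idea behind the BSHD ↔ THD equivalence test can be illustrated without any tensors. BSHD keeps one padded row per sequence, while THD packs all real tokens into a single dimension and records sequence boundaries as cumulative lengths. The sketch below uses plain Python lists; the function names and the pad id are hypothetical, and it assumes right padding with a pad id that never occurs inside a sequence.

```python
def bshd_to_thd(batch, pad_id):
    """Pack a right-padded batch into a flat token list plus cumulative sequence lengths."""
    tokens = []
    cu_seq_lens = [0]
    for row in batch:
        unpadded = [t for t in row if t != pad_id]
        tokens.extend(unpadded)
        cu_seq_lens.append(cu_seq_lens[-1] + len(unpadded))
    return tokens, cu_seq_lens


def thd_to_bshd(tokens, cu_seq_lens, pad_id):
    """Rebuild a right-padded batch from packed tokens and cumulative sequence lengths."""
    lengths = [end - start for start, end in zip(cu_seq_lens, cu_seq_lens[1:])]
    max_len = max(lengths, default=0)
    return [
        tokens[start:start + n] + [pad_id] * (max_len - n)
        for start, n in zip(cu_seq_lens, lengths)
    ]
```

An equivalence test then amounts to running the same weights on both layouts and comparing outputs at the positions that hold real tokens.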
### recipes/codonfm_native_te/ (training recipe)
Self-contained FSDP2 training recipe with:
- Hydra config (defaults.yaml, L0_sanity.yaml)
- train_fsdp2.py — FSDP2 training loop with gradient clipping, LR
scheduling
- checkpoint.py — FSDP2 checkpoint save/load with save_pretrained
support
- perf_logger.py — WandB + stdout logging (loss, perplexity, tokens/sec,
GPU memory)
- quantization.py — FP8/FP4 recipe utilities
- Sample train.parquet for testing
63 tests covering model, tokenizer, quantization, and end-to-end
training.
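For reference, two of the logged metrics follow directly from quantities the training loop already has. This is a minimal sketch, assuming the loss is a mean cross-entropy in nats and that throughput is measured over a single wall-clock interval; the function names are hypothetical and `perf_logger.py` may compute these differently.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput over one measurement interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds
```

As a sanity check, a model that predicts uniformly over the 69-token vocabulary has a loss of ln(69) and therefore a perplexity of 69.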
### CI integration
- ci/scripts/check_copied_files.py — added entries to sync:
- models/codonfm/modeling_codonfm_te.py →
recipes/codonfm_native_te/modeling_codonfm_te.py
- models/esm2/tests/common/ → models/codonfm/tests/common/
- .gitignore — added negation rule for golden value safetensors test
fixtures
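The sync check can be pictured as a byte-for-byte comparison between each source file and its copy. The sketch below is an assumption about the general approach, not the contents of `check_copied_files.py`: the `find_out_of_sync` function and its mapping argument are hypothetical, and the real script may also handle directories, headers, or automatic fixing.

```python
from pathlib import Path

# Source -> destination pair taken from the description above.
COPIED_FILES = {
    "models/codonfm/modeling_codonfm_te.py": "recipes/codonfm_native_te/modeling_codonfm_te.py",
}


def find_out_of_sync(root: Path, pairs: dict[str, str]) -> list[str]:
    """Return destinations that are missing or whose bytes differ from their source."""
    stale = []
    for src, dst in pairs.items():
        src_path, dst_path = root / src, root / dst
        if not dst_path.is_file() or src_path.read_bytes() != dst_path.read_bytes():
            stale.append(dst)
    return stale
```

A CI job built on this idea would fail whenever the returned list is non-empty, which keeps the model directory as the single source of truth.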
### What's left out / future work
- No HF Hub checkpoint yet: the published TE checkpoints
(nvidia/NV-CodonFM-Encodon-TE-80M-v1) use the "exact" EncodonTELayer,
which has extra post-attention/post-MLP LayerNorms that are not present
in the standard TETransformerLayer. Golden values will be updated once a
native_te checkpoint is trained and uploaded.
- Conversion tests skipped — CodonFM is natively TE; there is no HF
variant to convert to/from.
- THD padding tests skipped — pad_to_multiple_of and
cu_seq_lens_q_padded not yet implemented for CodonFM's tokenizer.
- _do_not_quantize patterns — currently ("lm_head.dense",
"lm_head.layer_norm_linear"). May need tuning as quantization recipes
evolve.
- Dockerfile — recipe does not yet include a Dockerfile for
containerized training.
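To illustrate how the `_do_not_quantize` patterns listed above might be applied, here is a minimal sketch that filters module names. The matching rule (exact match or nested under the pattern) and the `is_quantizable` helper are assumptions; the recipe's actual matching logic is not shown in this description.

```python
# Patterns quoted from the description above.
DO_NOT_QUANTIZE = ("lm_head.dense", "lm_head.layer_norm_linear")


def is_quantizable(module_name: str, patterns=DO_NOT_QUANTIZE) -> bool:
    """Return False when the module equals a pattern or is nested under it (assumed rule)."""
    return not any(
        module_name == pattern or module_name.startswith(pattern + ".")
        for pattern in patterns
    )
```

Under this rule the LM head stays in higher precision while encoder layers remain eligible for FP8/FP4, which matches the stated goal of keeping the head numerically stable.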
### Test plan
- cd bionemo-recipes/models/codonfm && pytest -v tests/ — 36 pass, 10
skip, 25 xfail, 2 xpass
- cd bionemo-recipes/recipes/codonfm_native_te && pytest -v tests/ — 63
pass
- python ci/scripts/check_copied_files.py — no diffs
- pre-commit run --all-files — clean
---------- END OF DESCRIPTION----
#### Authorizing CI Runs
We use
[copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation)
to manage authorization of CI
runs on NVIDIA's compute resources.
- If a pull request is opened by a trusted user and contains only
trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source
repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted
changes, an NVIDIA org member must leave an
`/ok to test` comment on the pull request to trigger CI. This will need
to be done for each new commit.
#### Triggering CodeRabbit AI Review
To trigger a code review from CodeRabbit, comment on a pull request
with one of these commands:
- @coderabbitai review - Triggers a standard review
- @coderabbitai full review - Triggers a comprehensive review
See https://docs.coderabbit.ai/reference/review-commands for a full list
of commands.
### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->
- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
- Added a CodonFM training recipe: model, tokenizer, datasets/collators,
dataloaders, distributed training entrypoint, checkpointing, scheduler,
perf logging, and configurable FP8/FP4 quantization and debug stats.
* **Tests**
- Comprehensive test suite and utilities including golden-value
generation, conversion/golden regression tests, FP8/THD coverage, and
sanity training runs.
* **Documentation**
- Recipe and shared test-library READMEs added.
* **Chores**
- .gitignore adjusted to allow tracking of golden_state_dict.safetensors
fixtures.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Jonathan Mitchell <jomitchell@nvidia.com>

1 parent c301311, commit 1942554
42 files changed
Lines changed: 7184 additions & 0 deletions
File tree
- bionemo-recipes
  - models/codonfm
    - tests
      - common
  - recipes/codonfm_native_te
    - hydra_config
    - tests
- ci/scripts