Commit e6468fc

gagank1 and claude committed
Consolidate MFU tracking into perf_logger, address PR review feedback
Reworks MFU tracking per reviewer feedback on #1548:

- Delete per-recipe flops.py, test_flops.py, and the CLI entirely
- Inline ~30-line FLOPs helper into each recipe's existing perf_logger.py
- MFU metrics (train/tflops_per_gpu, train/mfu_pct) flow through the existing torchmetrics -> WANDB path, respecting logging_frequency
- Drop comm-overhead estimation; will be a separate future PR

The new formula is per_token_flops(seq_len) * num_unpadded_tokens_on_rank. The unpadded-tokens counter (already used by tokens_per_second_per_gpu) is per-rank after DP/CP sharding and accumulated across grad-acc micro-batches, so the formula works uniformly across DDP/FSDP2/FSDP2+CP/DDP+CP/mFSDP and across BSHD and THD (sequence packing) with no per-strategy factors.

Net: -3000 / +300 lines. Training scripts lose all MFU scaffolding; the only change per script is one extra kwarg on the PerfLogger constructor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
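
The formula in the commit message can be sketched roughly as below. The helper that was actually inlined into `perf_logger.py` is not shown on this page, so the per-layer breakdown here is an assumption: bidirectional multi-head attention, a two-matmul FFN, an LM head, and a backward pass counted as twice the forward. Apart from `per_token_flops` and `num_unpadded_tokens_on_rank`, every name (`ModelShape`, its fields, `step_flops_on_rank`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical holder for the config values the formula needs."""

    num_layers: int
    hidden_size: int
    ffn_hidden_size: int
    vocab_size: int


def per_token_flops(shape: ModelShape, seq_len: int) -> int:
    """Approximate training FLOPs for one token (assumed dense transformer)."""
    h, f = shape.hidden_size, shape.ffn_hidden_size
    attn_proj = 8 * h * h        # Q, K, V, O projections: 4 matmuls x 2*h*h
    attn_core = 4 * seq_len * h  # QK^T and attn @ V: 2 matmuls x 2*seq_len*h
    ffn = 4 * h * f              # up and down projections: 2 matmuls x 2*h*f
    lm_head = 2 * h * shape.vocab_size
    forward = shape.num_layers * (attn_proj + attn_core + ffn) + lm_head
    # Assumption: backward costs about 2x forward, so training is 3x forward.
    return 3 * forward


def step_flops_on_rank(
    shape: ModelShape, seq_len: int, num_unpadded_tokens_on_rank: int
) -> int:
    """FLOPs done by one rank, scaled by the tokens it actually processed."""
    return per_token_flops(shape, seq_len) * num_unpadded_tokens_on_rank
```

Because the token count is already per-rank and summed over micro-batches, a sketch like this needs no extra factor for the data-parallel size or the number of gradient-accumulation steps.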
1 parent 27255a1 commit e6468fc

29 files changed

Lines changed: 601 additions & 4458 deletions

bionemo-recipes/recipes/codonfm_native_te/README.md

Lines changed: 8 additions & 6 deletions
@@ -185,14 +185,16 @@ Enable per-step Model FLOPs Utilization (MFU) logging during training by adding
 torchrun --nproc_per_node=1 train_fsdp2.py --config-name encodon_1b log_mfu=true
 ```
 
-This logs MFU (%), TFLOPS/GPU, and step time at each optimizer step. The module auto-detects model architecture from the model config.
+This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
+stdout:
 
-The `flops.py` CLI provides standalone utilities:
+- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
+- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
 
-```bash
-python flops.py gpu-info # Show GPU and peak TFLOPS
-torchrun --nproc_per_node=2 flops.py bandwidth # Measure P2P GPU bandwidth
-```
+The FLOPs formula auto-detects model architecture from the model config (MHA, standard FFN,
+vocabulary size) and scales with the actual unpadded token count on each rank. This means it
+naturally handles gradient accumulation, data parallelism, BSHD, and THD (sequence packing)
+without per-strategy code paths. The implementation lives in `perf_logger.py`.
 
 ## Developer Guide
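
For readers of this diff, the two metrics named in the new README text relate to each other roughly as in the sketch below. It is only an illustration: the function names, the `elapsed_s` window (assumed here to be the wall-clock time of one logging interval), and the peak value in the example are assumptions, not code from `perf_logger.py`.

```python
def tflops_per_gpu(flops_on_rank: float, elapsed_s: float) -> float:
    """Achieved TFLOPS on one GPU over a window of elapsed_s seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return flops_on_rank / elapsed_s / 1e12


def mfu_pct(achieved_tflops: float, peak_tflops: float) -> float:
    """MFU as a percentage of the GPU's peak dense BF16 TFLOPS."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return 100.0 * achieved_tflops / peak_tflops


# Illustrative numbers only: 5e14 FLOPs in one second on a GPU whose
# (hypothetical) peak is 1000 dense BF16 TFLOPS.
achieved = tflops_per_gpu(flops_on_rank=5e14, elapsed_s=1.0)
utilization = mfu_pct(achieved, peak_tflops=1000.0)
```

In practice the peak figure would come from the vendor's datasheet for the specific GPU, using the dense (non-sparse) BF16 number.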
