Commit 423eab7

docs: update MFU tracking sections in recipe READMEs
Reflect the modern two-pair metric layout (unpadded useful-work vs padded hardware view) and the peak-memory reporting fix. Applied identically to all four MFU-tracking recipes.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
1 parent 909c1d7 commit 423eab7

4 files changed

Lines changed: 50 additions & 36 deletions

bionemo-recipes/recipes/codonfm_native_te/README.md

Lines changed: 11 additions & 9 deletions
@@ -179,22 +179,24 @@ A final model suitable for uploading to the Hugging Face Hub can be exported at

 ## MFU Tracking

-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:

 ```bash
 torchrun --nproc_per_node=1 train_fsdp2.py --config-name encodon_1b log_mfu=true
 ```

-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:

-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view (HFU-like). Counts
+every slot the GPU processes, including BSHD row padding.

-The FLOPs formula auto-detects model architecture from the model config (MHA, standard FFN,
-vocabulary size) and scales with the actual unpadded token count on each rank. This means it
-naturally handles gradient accumulation, data parallelism, BSHD, and THD (sequence packing)
-without per-strategy code paths. The implementation lives in `perf_logger.py`.
+Non-attention uses the unpadded/padded token count respectively; attention uses `Σ(Lᵢ²)` from
+`cu_seq_lens_q` (THD) or per-row `attention_mask.sum()` (BSHD) for the unpadded variant and
+`cu_seq_lens_q_padded` / full `B·S²` for the padded variant. Implementation in `perf_logger.py`.
+
+Memory: `train/gpu_memory_allocated_max_gb` is the true transient peak per window; `_mean_gb` is
+the post-step resting footprint.

 ## Developer Guide

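The attention terms named in the added lines above (`Σ(Lᵢ²)` versus the padded `B·S²`) can be illustrated with a small sketch. This is not the code in `perf_logger.py`; the helper names and the toy inputs are hypothetical, plain Python lists stand in for tensors, and the model-dependent multiplier (hidden size, layer count, forward/backward factor) is deliberately left out, so the numbers are only the quadratic sequence-length terms.

```python
from typing import Sequence


def sum_sq_from_cu_seq_lens(cu_seq_lens: Sequence[int]) -> int:
    """Sum of L_i^2 over segments delimited by cumulative offsets (THD layout)."""
    return sum((end - start) ** 2 for start, end in zip(cu_seq_lens, cu_seq_lens[1:]))


def sum_sq_from_attention_mask(attention_mask: Sequence[Sequence[int]]) -> int:
    """Sum of L_i^2 where L_i is the number of real tokens in row i (BSHD layout)."""
    return sum(sum(row) ** 2 for row in attention_mask)


def padded_bshd_term(attention_mask: Sequence[Sequence[int]]) -> int:
    """Full B * S^2 term: every slot counts, padding included."""
    batch = len(attention_mask)
    seq = len(attention_mask[0]) if batch else 0
    return batch * seq * seq


# THD toy example (hypothetical values): lengths 3, 5, 2, padded to 4, 8, 4.
cu_seq_lens_q = [0, 3, 8, 10]
cu_seq_lens_q_padded = [0, 4, 12, 16]
thd_unpadded = sum_sq_from_cu_seq_lens(cu_seq_lens_q)        # 9 + 25 + 4
thd_padded = sum_sq_from_cu_seq_lens(cu_seq_lens_q_padded)   # 16 + 64 + 16

# BSHD toy example: two rows of width 4 holding 4 and 2 real tokens.
attention_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
bshd_unpadded = sum_sq_from_attention_mask(attention_mask)   # 16 + 4
bshd_padded = padded_bshd_term(attention_mask)               # 2 * 4 * 4
```

In both layouts the padded term is never smaller than the unpadded one, and the two coincide only when no padding is present.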
bionemo-recipes/recipes/esm2_native_te/README.md

Lines changed: 12 additions & 9 deletions
@@ -376,22 +376,25 @@ output = model(**inputs)

 ## MFU Tracking

-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:

 ```bash
 torchrun --nproc_per_node=2 train_fsdp2.py --config-name L1_3B log_mfu=true
 ```

-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:

-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+Non-attention uses the unpadded token count; attention uses `Σ(Lᵢ²)` from `cu_seq_lens_q` (THD)
+or per-row `attention_mask.sum()` (BSHD).
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view. Counts every slot the
+GPU processes, including CP-zigzag and BSHD row padding. HFU-like.

-The FLOPs formula auto-detects model architecture from the HF config (MHA vs. GQA, gated vs.
-standard FFN, LM head presence) and scales with the actual unpadded token count on each rank. This
-means it naturally handles data parallelism, context parallelism, BSHD, and THD (sequence packing)
-without per-strategy code paths. The implementation lives in `perf_logger.py`.
+The two pairs agree when the batch has no padding. The formula is CP-aware and auto-detects
+MHA/GQA and FFN layout from the HF config. Implementation in `perf_logger.py`.
+
+Memory: `train/gpu_memory_allocated_max_gb` is the true transient peak per window
+(`torch.cuda.max_memory_allocated()` + `reset_peak_memory_stats()`); `_mean_gb` is resting.

 ## Developer Guide

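To make the relationship between the two metric pairs concrete, here is a minimal sketch of how a FLOP count and a step time turn into a TFLOPS figure and an MFU percentage. The function names, the FLOP counts, the step time, and the peak value are all hypothetical placeholders, not values from the recipes; a real peak figure would come from the GPU's published dense BF16 specification.

```python
def tflops_per_gpu(flops_per_step: float, step_time_s: float) -> float:
    """Achieved TFLOPS on one GPU for a single step."""
    return flops_per_step / step_time_s / 1e12


def mfu_pct(achieved_tflops: float, peak_tflops: float) -> float:
    """Utilization as a percentage of the assumed hardware peak."""
    return 100.0 * achieved_tflops / peak_tflops


# Hypothetical inputs: the padded count includes work spent on padding slots.
useful_flops = 4.0e14   # unpadded (useful-work) FLOPs in one step
padded_flops = 5.0e14   # padded (hardware-view) FLOPs in the same step
step_time_s = 1.0
peak_tflops = 1000.0    # placeholder peak, not a real GPU spec

useful_tflops = tflops_per_gpu(useful_flops, step_time_s)
padded_tflops = tflops_per_gpu(padded_flops, step_time_s)
useful_mfu = mfu_pct(useful_tflops, peak_tflops)
padded_mfu = mfu_pct(padded_tflops, peak_tflops)
padding_gap = padded_mfu - useful_mfu  # zero when the batch has no padding
```

With these made-up numbers the gap between the two percentages is the share of peak throughput spent on padding, which is why the pairs agree for unpadded batches.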
bionemo-recipes/recipes/llama3_native_te/README.md

Lines changed: 14 additions & 9 deletions
@@ -414,22 +414,27 @@ vllm serve path/to/hf_converted_model

 ## MFU Tracking

-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:

 ```bash
 torchrun --nproc_per_node=2 train_fsdp2_cp.py --config-name L2_lingua_1b log_mfu=true
 ```

-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:

-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+Non-attention uses the unpadded token count; attention uses `Σ(Lᵢ²)` from `cu_seq_lens_q` (THD)
+or per-row `attention_mask.sum()` (BSHD).
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view. Counts every slot the
+GPU processes, including CP-zigzag and BSHD row padding. HFU-like.

-The FLOPs formula auto-detects model architecture from the HF config (GQA vs. MHA, SwiGLU vs.
-standard FFN, LM head presence) and scales with the actual unpadded token count on each rank. This
-means it naturally handles gradient accumulation, data parallelism, context parallelism, BSHD, and
-THD (sequence packing) without per-strategy code paths. The implementation lives in `perf_logger.py`.
+The two pairs agree when the batch has no padding (e.g. dense single-doc THD packs). The formula
+is CP-aware (global `Σ(Lᵢ²)` divided by `cp_size`) and auto-detects GQA/MHA and SwiGLU/standard
+FFN from the HF config. Implementation in `perf_logger.py`.
+
+Memory metrics: `train/gpu_memory_allocated_max_gb` is the true transient peak per logging window
+(via `torch.cuda.max_memory_allocated()` + `reset_peak_memory_stats()`); `_mean_gb` is the
+post-step resting footprint.

 ## Developer Guide

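The CP-aware rule stated in the added lines above (global `Σ(Lᵢ²)` divided by `cp_size`) can be sketched as follows. The function name and the sequence lengths are hypothetical, and the comparison rests on an assumption the diff does not spell out: that under context parallelism each rank holds `L / cp_size` query tokens which still attend over keys from the whole sequence.

```python
from typing import Sequence


def per_rank_attention_term(global_seq_lens: Sequence[int], cp_size: int) -> float:
    """Per-rank share of the quadratic attention term: global sum(L_i^2) / cp_size."""
    if cp_size < 1:
        raise ValueError("cp_size must be a positive integer")
    return sum(length * length for length in global_seq_lens) / cp_size


# Hypothetical batch: two documents, sharded across 2 context-parallel ranks.
global_seq_lens = [8192, 4096]
cp_size = 2

per_rank = per_rank_attention_term(global_seq_lens, cp_size)

# Squaring only the local shard length would undercount by a factor of cp_size,
# because it ignores that local queries attend to keys held on other ranks.
naive_local = sum((length // cp_size) ** 2 for length in global_seq_lens)
undercount_factor = per_rank / naive_local
```

Under that assumption, deriving the term from the global lengths keeps the per-rank figure consistent as `cp_size` changes, whereas a local-length formula would shrink quadratically.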
bionemo-recipes/recipes/opengenome2_llama_native_te/README.md

Lines changed: 13 additions & 9 deletions
@@ -413,22 +413,26 @@ Control evaluation frequency with `validation.eval_interval` and `validation.num

 ## MFU Tracking

-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:

 ```bash
 torchrun --nproc_per_node=2 train_fsdp2_cp.py log_mfu=true
 ```

-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:

-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+Non-attention uses the unpadded token count; attention uses `Σ(Lᵢ²)` from `cu_seq_lens_q` (THD)
+or per-row `attention_mask.sum()` (BSHD).
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view. Counts every slot the
+GPU processes, including CP-zigzag and BSHD row padding. HFU-like.

-The FLOPs formula auto-detects model architecture from the HF config (GQA vs. MHA, SwiGLU vs.
-standard FFN, LM head presence) and scales with the actual unpadded token count on each rank. This
-means it naturally handles gradient accumulation, data parallelism, context parallelism, BSHD, and
-THD (sequence packing) without per-strategy code paths. The implementation lives in `perf_logger.py`.
+The two pairs agree when the batch has no padding (e.g. dense single-doc THD packs — common for
+genomic data windowed to `max_seq_length`). The formula is CP-aware and auto-detects GQA/SwiGLU
+from the HF config. Implementation in `perf_logger.py`.
+
+Memory: `train/gpu_memory_allocated_max_gb` is the true transient peak per window
+(`torch.cuda.max_memory_allocated()` + `reset_peak_memory_stats()`); `_mean_gb` is resting.

 ## Developer Guide

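The peak-versus-resting distinction in the memory lines above can be sketched with the two PyTorch calls the diff names. This is an illustration only, not the `perf_logger.py` code: the helper name is hypothetical, the division by `1024**3` is an assumed unit convention, and a single `torch.cuda.memory_allocated()` reading stands in for whatever averaging produces `_mean_gb`. The sketch returns `None` when PyTorch or CUDA is unavailable.

```python
try:
    import torch
except ImportError:  # PyTorch not installed: the sketch degrades to a no-op
    torch = None


def read_window_memory_gb():
    """Return (peak_gb, resting_gb) since the last reset, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    # Highest allocation seen since the previous reset (the transient peak).
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    # Allocation right now, i.e. the footprint after the step has finished.
    resting_gb = torch.cuda.memory_allocated() / 1024**3
    # Start a fresh window so the next reading is not dominated by old peaks.
    torch.cuda.reset_peak_memory_stats()
    return peak_gb, resting_gb


# Hypothetical usage: call once per logging interval, after the training step.
window_memory = read_window_memory_gb()
```

Resetting after each reading is what makes the peak a per-window value; without the reset, `max_memory_allocated()` reports the maximum since the start of the process.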