docs: update MFU tracking sections in recipe READMEs

gagank1 · gagank1 · commit 423eab74233f · 2026-04-23T11:17:33.000-07:00
Reflect the current two-pair metric layout (unpadded useful-work view vs.
padded hardware view) and the peak-memory reporting fix. Applied to all
four MFU-tracking recipes, with minor recipe-specific wording.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
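
Reviewer note: the README text below describes the attention term as `Σ(Lᵢ²)` for the unpadded pair and `cu_seq_lens_q_padded` / full `B·S²` for the padded pair. The following is a minimal sketch of that counting only, not the code in `perf_logger.py`. The function names are hypothetical, plain Python lists stand in for tensors, and the per-layer and per-head scale factors of the real FLOPs formula are omitted.

```python
def sum_sq_from_cu_seqlens(cu_seqlens):
    """Sum of squared segment lengths, given cumulative sequence boundaries (THD)."""
    return sum((end - start) ** 2 for start, end in zip(cu_seqlens, cu_seqlens[1:]))


def bshd_attention_terms(attention_mask):
    """Return (unpadded, padded) attention terms for a BSHD batch.

    attention_mask is a list of rows of 0/1 values, one row per sequence.
    """
    batch = len(attention_mask)
    seq = len(attention_mask[0])
    # Unpadded: square each row's real token count (the per-row mask sum).
    unpadded = sum(sum(row) ** 2 for row in attention_mask)
    # Padded: every row is treated as a full-length sequence, i.e. B * S^2.
    padded = batch * seq**2
    return unpadded, padded


# BSHD: two rows of length 4, the second row has two padding slots.
bshd_unpadded, bshd_padded = bshd_attention_terms([[1, 1, 1, 1], [1, 1, 0, 0]])

# THD: three packed sequences of real lengths 3, 2, 4, each padded to length 4.
thd_unpadded = sum_sq_from_cu_seqlens([0, 3, 5, 9])
thd_padded = sum_sq_from_cu_seqlens([0, 4, 8, 12])
```

With no padding in the batch the two terms coincide, which matches the README statement that the two metric pairs agree in that case.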
diff --git a/bionemo-recipes/recipes/codonfm_native_te/README.md b/bionemo-recipes/recipes/codonfm_native_te/README.md
@@ -179,22 +179,24 @@ A final model suitable for uploading to the Hugging Face Hub can be exported at
 
 ## MFU Tracking
 
-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:
 
 ```bash
 torchrun --nproc_per_node=1 train_fsdp2.py --config-name encodon_1b log_mfu=true
 ```
 
-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:
 
-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view (HFU-like). Counts
+  every slot the GPU processes, including BSHD row padding.
 
-The FLOPs formula auto-detects model architecture from the model config (MHA, standard FFN,
-vocabulary size) and scales with the actual unpadded token count on each rank. This means it
-naturally handles gradient accumulation, data parallelism, BSHD, and THD (sequence packing)
-without per-strategy code paths. The implementation lives in `perf_logger.py`.
+Non-attention uses the unpadded/padded token count respectively; attention uses `Σ(Lᵢ²)` from
+`cu_seq_lens_q` (THD) or per-row `attention_mask.sum()` (BSHD) for the unpadded variant and
+`cu_seq_lens_q_padded` / full `B·S²` for the padded variant. Implementation in `perf_logger.py`.
+
+Memory: `train/gpu_memory_allocated_max_gb` is the true transient peak per window; `_mean_gb` is
+the post-step resting footprint.
 
 ## Developer Guide
 
diff --git a/bionemo-recipes/recipes/esm2_native_te/README.md b/bionemo-recipes/recipes/esm2_native_te/README.md
@@ -376,22 +376,25 @@ output = model(**inputs)
 
 ## MFU Tracking
 
-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:
 
 ```bash
 torchrun --nproc_per_node=2 train_fsdp2.py --config-name L1_3B log_mfu=true
 ```
 
-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:
 
-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+  Non-attention uses the unpadded token count; attention uses `Σ(Lᵢ²)` from `cu_seq_lens_q` (THD)
+  or per-row `attention_mask.sum()` (BSHD).
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view. Counts every slot the
+  GPU processes, including CP-zigzag and BSHD row padding. HFU-like.
 
-The FLOPs formula auto-detects model architecture from the HF config (MHA vs. GQA, gated vs.
-standard FFN, LM head presence) and scales with the actual unpadded token count on each rank. This
-means it naturally handles data parallelism, context parallelism, BSHD, and THD (sequence packing)
-without per-strategy code paths. The implementation lives in `perf_logger.py`.
+The two pairs agree when the batch has no padding. The formula is CP-aware and auto-detects
+MHA/GQA and FFN layout from the HF config. Implementation in `perf_logger.py`.
+
+Memory: `train/gpu_memory_allocated_max_gb` is the true transient peak per window
+(`torch.cuda.max_memory_allocated()` + `reset_peak_memory_stats()`); `_mean_gb` is resting.
 
 ## Developer Guide
 
diff --git a/bionemo-recipes/recipes/llama3_native_te/README.md b/bionemo-recipes/recipes/llama3_native_te/README.md
@@ -414,22 +414,27 @@ vllm serve path/to/hf_converted_model
 
 ## MFU Tracking
 
-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:
 
 ```bash
 torchrun --nproc_per_node=2 train_fsdp2_cp.py --config-name L2_lingua_1b log_mfu=true
 ```
 
-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:
 
-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+  Non-attention uses the unpadded token count; attention uses `Σ(Lᵢ²)` from `cu_seq_lens_q` (THD)
+  or per-row `attention_mask.sum()` (BSHD).
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view. Counts every slot the
+  GPU processes, including CP-zigzag and BSHD row padding. HFU-like.
 
-The FLOPs formula auto-detects model architecture from the HF config (GQA vs. MHA, SwiGLU vs.
-standard FFN, LM head presence) and scales with the actual unpadded token count on each rank. This
-means it naturally handles gradient accumulation, data parallelism, context parallelism, BSHD, and
-THD (sequence packing) without per-strategy code paths. The implementation lives in `perf_logger.py`.
+The two pairs agree when the batch has no padding (e.g. dense single-doc THD packs). The formula
+is CP-aware (global `Σ(Lᵢ²)` divided by `cp_size`) and auto-detects GQA/MHA and SwiGLU/standard
+FFN from the HF config. Implementation in `perf_logger.py`.
+
+Memory metrics: `train/gpu_memory_allocated_max_gb` is the true transient peak per logging window
+(via `torch.cuda.max_memory_allocated()` + `reset_peak_memory_stats()`); `_mean_gb` is the
+post-step resting footprint.
 
 ## Developer Guide
 
diff --git a/bionemo-recipes/recipes/opengenome2_llama_native_te/README.md b/bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
@@ -413,22 +413,26 @@ Control evaluation frequency with `validation.eval_interval` and `validation.num
 
 ## MFU Tracking
 
-Enable per-step Model FLOPs Utilization (MFU) logging during training by adding `log_mfu=true`:
+Enable per-step MFU logging by adding `log_mfu=true`:
 
 ```bash
 torchrun --nproc_per_node=2 train_fsdp2_cp.py log_mfu=true
 ```
 
-This adds two metrics at each logging interval, emitted alongside existing metrics via WANDB and
-stdout:
+Two pairs of metrics are emitted per logging interval:
 
-- `train/tflops_per_gpu` — achieved BF16 TFLOPS per GPU
-- `train/mfu_pct` — MFU as a percentage of the GPU's peak dense BF16 TFLOPS
+- `train/mfu_pct` / `train/tflops_per_gpu` — useful-work rate. Excludes padding of all kinds.
+  Non-attention uses the unpadded token count; attention uses `Σ(Lᵢ²)` from `cu_seq_lens_q` (THD)
+  or per-row `attention_mask.sum()` (BSHD).
+- `train/mfu_padded_pct` / `train/tflops_per_gpu_padded` — hardware view. Counts every slot the
+  GPU processes, including CP-zigzag and BSHD row padding. HFU-like.
 
-The FLOPs formula auto-detects model architecture from the HF config (GQA vs. MHA, SwiGLU vs.
-standard FFN, LM head presence) and scales with the actual unpadded token count on each rank. This
-means it naturally handles gradient accumulation, data parallelism, context parallelism, BSHD, and
-THD (sequence packing) without per-strategy code paths. The implementation lives in `perf_logger.py`.
+The two pairs agree when the batch has no padding (e.g. dense single-doc THD packs — common for
+genomic data windowed to `max_seq_length`). The formula is CP-aware and auto-detects GQA/SwiGLU
+from the HF config. Implementation in `perf_logger.py`.
+
+Memory: `train/gpu_memory_allocated_max_gb` is the true transient peak per window
+(`torch.cuda.max_memory_allocated()` + `reset_peak_memory_stats()`); `_mean_gb` is resting.
 
 ## Developer Guide