sae: config-gated dead-latent/FVU training fixes + inter-shard shuffle & pre-bias init shard sampling (NVIDIA-BioNeMo#1619)

polinabinder1 · claude · web-flow · commit 6aa51037463b · 2026-06-09T20:42:49.000Z
## Why

Training TopK SAEs on Evo2 activations hit a severe **dead-latent** problem (a large fraction of features never fired, wasting capacity). `normalize_input` (already merged) fixed most of it; this PR adds the remaining **training-dynamics fixes we found necessary for Evo2 SAE training**.

**Every change defaults to the previous behavior and is opt-in**, so you can reproduce or continue prior training runs **exactly as before** and enable each fix only when you want it. The training recipe opts in; both `topk` options serialize in the checkpoint config so a reloaded SAE keeps its behavior.

## Changes: default = previous behavior, opt in per flag

**1. Dead-latent inactivity counted in *total* tokens**: `dead_count_global` (default `False` = previous per-rank count)

The auxk revival fires once a latent has been inactive for `dead_tokens_threshold` (10M) tokens, but the counter advanced by *this rank's* micro-batch only. Under DDP it therefore ran `world_size`× too slow and revival kicked in `world_size`× too late (≈80M effective tokens on 8 GPUs). Opt in with `dead_count_global=True` to count total tokens (× world_size); the `all_reduce(MIN)` still means "fired on any rank ⇒ reset."

**2. Aggregate FVU + auxk loss**: `aggregate_loss` (bool, default `False` = previous per-token)

The per-token loss ratio `mean_t(mse_t / var_t)` down-weights rare high-variance tokens, starving the latents that specialize on them (notably Evo2's heavy-tailed **sink tokens**) → they die. Opt in with `aggregate_loss=True` for a batch-level ratio (which also matches the reported `var_exp` metric). This single bool also fixes the **auxk residual** end-to-end: `False` keeps the previous `x - recon + pre_bias`; `True` uses the corrected `x - recon` (the true error, not `pre_bias`-dominated).

**3. Shuffle + blend shards**: `mix_shards` (int, default `1` = previous)

Shards are written in corpus order (all prokaryota, then all eukaryota). A contiguous per-rank slice trains a rank on one kingdom then switches mid-epoch → a visible **FVU cliff**. `mix_shards=1` (default) = previous behavior (one shard at a time, contiguous slice). Set `mix_shards=N>1` to **globally shuffle the shard list** before the per-rank split (so each rank gets a cross-section) **and** buffer/blend N shards per batch (≈N shards of peak RAM).

**4. Spread the pre-bias-init sample**: `sample(num_shards=…)` (default `1` = previous single shard)

`pre_bias` is initialized to the geometric median of a sample of activations (so the SAE starts centered). A single-shard sample biases it toward whatever is first in corpus order (one kingdom) → mis-centered init → more dead latents. Set `num_shards>1` to draw the sample across that many random shards spanning the store (≈one shard of peak RAM, since each shard is sub-sampled then freed).

## How to opt in (what the Evo2 recipe sets)

```python
TopKSAE(..., aggregate_loss=True, dead_count_global=True)
store.get_streaming_dataloader(..., mix_shards=8)  # shuffle + blend 8 shards
pre_bias0 = geometric_median(store.sample(n, num_shards=8))  # sample across 8 shards
```

The training recipe (separate PR) exposes these as CLI flags (`--aggregate-loss`, `--dead-count-global`, `--mix-shards`, `--presample-shards`).

## Opt-in summary

| behavior | knob | default | opt in |
|---|---|---|---|
| global dead-token count | `dead_count_global` (bool) | `False` | `True` |
| aggregate FVU + auxk loss | `aggregate_loss` (bool) | `False` | `True` |
| shard shuffle + blending | `mix_shards` (int) | `1` | `>1` |
| spread pre-init sample | `sample(num_shards=)` | `1` | `>1` |

## Tests: `sae/tests/test_topk.py` (CPU, no GPU)

Covers global-vs-local dead-token counting, the aggregate-FVU formula (`mse.mean()/var.mean()`), and that the opted-in flags round-trip through `_get_config()`.

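To make the difference between the two reductions in change 2 concrete, the sketch below compares them on a made-up four-token batch. It is a dependency-free illustration, not the code in `TopKSAE.loss`: the helper names `per_token_fvu` and `aggregate_fvu`, the plain lists standing in for tensors, and the numbers are all assumptions chosen for readability.

```python
# Illustrative only: plain-Python stand-ins for the tensor ops in the loss.
EPS = 1e-8


def per_token_fvu(mse, var):
    """mean_t(mse_t / var_t): the previous (aggregate_loss=False) reduction."""
    return sum(m / (v + EPS) for m, v in zip(mse, var)) / len(mse)


def aggregate_fvu(mse, var):
    """mean(mse) / mean(var): the batch-level (aggregate_loss=True) reduction."""
    mean_mse = sum(mse) / len(mse)
    mean_var = sum(var) / len(var)
    return mean_mse / (mean_var + EPS)


# Hypothetical batch: three ordinary tokens reconstructed well, plus one
# high-variance token (think: a sink token) reconstructed badly.
var = [1.0, 1.0, 1.0, 100.0]
mse = [0.1, 0.1, 0.1, 50.0]

print(per_token_fvu(mse, var))  # about 0.2
print(aggregate_fvu(mse, var))  # about 0.488
```

In the per-token form the high-variance token's squared error is divided by its own variance (100), so a unit of error on it counts 100× less than a unit of error on the ordinary tokens. In the aggregate form every token's squared error shares one denominator, so the badly reconstructed token dominates the loss as intended.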
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Polina Binder <pbinder@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
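
The shard-assignment change in `get_streaming_dataloader` can be pictured with a small stand-alone sketch. The function name `assign_shards` and the 8-shard, 2-rank layout are hypothetical, and the standard-library `random.Random` is swapped in for the `np.random.default_rng` call used in the diff, so the exact permutation will differ from the real code.

```python
import random


def assign_shards(n_total, rank, world_size, mix_shards=1, seed=0):
    """Return the shard indices one rank would read (illustrative sketch)."""
    indices = list(range(n_total))
    if mix_shards > 1:
        # Every rank uses the same seed, so all ranks agree on one permutation
        # and the per-rank slices stay disjoint.
        random.Random(seed).shuffle(indices)
    per_rank = n_total // world_size  # remainder is dropped to keep ranks in sync
    return indices[rank * per_rank : (rank + 1) * per_rank]


# Hypothetical store: shards 0-3 hold one kingdom, shards 4-7 the other.
contiguous = [assign_shards(8, r, 2) for r in range(2)]
mixed = [assign_shards(8, r, 2, mix_shards=8, seed=0) for r in range(2)]

print(contiguous)  # [[0, 1, 2, 3], [4, 5, 6, 7]]: each rank sees one kingdom
print(mixed)       # two disjoint slices of one shared permutation
```

The key design point is that the shuffle happens before the split: shuffling only within each rank's slice would leave every rank confined to its original contiguous block of the corpus.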
diff --git a/bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/activation_store.py b/bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/activation_store.py
@@ -332,18 +332,25 @@ def get_streaming_dataloader(
         rank: int = 0,
         world_size: int = 1,
         max_shards: Optional[int] = None,
+        mix_shards: int = 1,
     ) -> DataLoader:
         """Get a streaming DataLoader that reads one shard at a time from disk.
 
-        Each rank gets a disjoint slice of shards. Peak RAM per rank is ~1 shard.
+        Each rank gets a disjoint slice of shards. Peak RAM per rank is ~mix_shards shards.
 
         Args:
             batch_size: Batch size for training
-            shuffle: Whether to shuffle shard order and within-shard data
+            shuffle: Whether to shuffle within-shard (and within-buffer) data
             seed: Random seed for reproducibility
             rank: This rank's index (0-indexed)
             world_size: Total number of ranks
             max_shards: Limit total shards used (for subsampling). None = all.
+            mix_shards: How many shards to blend together. 1 (default) = previous behavior
+                (one shard at a time, contiguous per-rank slice, no global shuffle). >1
+                globally shuffles the shard list before the per-rank split (so each rank gets
+                a cross-section — found needed for Evo2 training, where shards are kingdom-
+                ordered, to avoid an fvu cliff) AND buffers/mixes that many shards per batch
+                (at ~mix_shards shards of peak RAM).
 
         Returns:
             DataLoader yielding [batch_size, hidden_dim] tensors
@@ -357,8 +364,16 @@ def get_streaming_dataloader(
         if n_total > 1 and pq.read_metadata(last_shard_path).num_rows < shard_size:
             n_total -= 1
 
-        # Assign equal shards to each rank (drop remainder to keep DDP in sync)
+        # When mixing (mix_shards > 1), shuffle the shard list BEFORE splitting across ranks
+        # so each rank gets a random cross-section of the whole parquet, not a contiguous
+        # slice. Found needed for Evo2 training, where shards are sequence-ordered (e.g. all
+        # prok then all euk): a contiguous per-rank slice trains a rank on one kingdom then
+        # switches, causing an fvu cliff. Deterministic across ranks via the shared seed.
+        # mix_shards == 1 keeps the previous contiguous behavior. Then assign equal shards per
+        # rank (drop remainder to keep DDP in sync).
         all_indices = list(range(n_total))
+        if mix_shards > 1:
+            np.random.default_rng(seed if seed is not None else 0).shuffle(all_indices)
         per_rank = n_total // world_size
         my_indices = all_indices[rank * per_rank : (rank + 1) * per_rank]
 
@@ -368,11 +383,46 @@ def get_streaming_dataloader(
             batch_size=batch_size,
             shuffle=shuffle,
             seed=seed,
+            mix_shards=mix_shards,
         )
 
         # batch_size=None: dataset already yields pre-formed batches
         return DataLoader(dataset, batch_size=None, num_workers=0)
 
+    def sample(self, n: int, seed: int = 0, num_shards: int = 1) -> torch.Tensor:
+        """Return ~n activation rows for pre-bias (geometric-median) init.
+
+        Defaults to a single shard, i.e. the previous behavior. Set ``num_shards`` > 1 to
+        draw from that many random shards spanning the whole parquet: we found this needed
+        for Evo2 SAE training, where shards are written in corpus order (e.g. all prokaryota
+        then all eukaryota), so a single-shard sample biases the geometric-median pre-bias
+        toward one kingdom and worsens dead latents. Peak RAM ~one shard (each is sub-sampled
+        then freed before the next).
+
+        Args:
+            n: Number of activation rows to return.
+            seed: RNG seed for the shard/row sampling and the final permutation; sampling is
+                deterministic given this seed.
+            num_shards: Number of shards to sample across, clamped to ``[1, self.n_shards]``.
+                1 (default) = previous single-shard behavior; >1 spreads the sample.
+
+        Returns:
+            A float ``torch.Tensor`` of shape ``(m, D)`` with ``m <= n`` (``m`` can fall short
+            of ``n`` when chosen shards hold fewer than ``ceil(n / k)`` rows), built with
+            ``torch.from_numpy`` on slices from ``self._load_shard``; deterministic per ``seed``.
+        """
+        rng = np.random.default_rng(seed)
+        k = min(self.n_shards, max(1, num_shards))
+        chosen = rng.choice(self.n_shards, size=k, replace=False)
+        per = -(-n // k)  # ceil(n / k)
+        parts = []
+        for i in chosen:
+            shard = self._load_shard(int(i))
+            take = min(per, len(shard))
+            parts.append(shard[rng.choice(len(shard), size=take, replace=False)])
+        rows = torch.from_numpy(np.concatenate(parts)).float()
+        return rows[torch.randperm(len(rows), generator=torch.Generator().manual_seed(seed))][:n]
+
     def get_dataloader(
         self,
         batch_size: int = 4096,
@@ -491,12 +541,16 @@ def __init__(
         batch_size: int = 4096,
         shuffle: bool = True,
         seed: Optional[int] = None,
+        mix_shards: int = 1,
     ):
         self.store = store
         self.shard_indices = shard_indices
         self.batch_size = batch_size
         self.shuffle = shuffle
         self.seed = seed
+        # Shards to accumulate before flushing batches. >1 mixes rows across that
+        # many shards (true inter-shard shuffling) instead of one shard at a time.
+        self.mix_shards = max(1, mix_shards)
         self.max_batches = None  # Set externally to cap iteration (for DDP sync)
 
         # Approximate length: total tokens in assigned shards / batch_size
@@ -511,20 +565,29 @@ def __iter__(self) -> Iterator[torch.Tensor]:
             rng.shuffle(indices)
 
         buffer = None
+        shards_loaded = 0
         n_yielded = 0
-        for shard_idx in indices:
+        for shard_pos, shard_idx in enumerate(indices):
             shard = torch.from_numpy(self.store._load_shard(shard_idx)).float()
             if self.shuffle:
                 shard = shard[torch.randperm(len(shard))]
-
             buffer = torch.cat([buffer, shard]) if buffer is not None else shard
-
-            while len(buffer) >= self.batch_size:
-                if self.max_batches is not None and n_yielded >= self.max_batches:
-                    return
-                yield buffer[: self.batch_size]
-                buffer = buffer[self.batch_size :]
-                n_yielded += 1
+            shards_loaded += 1
+
+            # Flush only once mix_shards shards are buffered (or this is
+            # the last shard), shuffling the whole buffer first so each batch
+            # mixes rows from that many different parts of the parquet.
+            is_last = shard_pos == len(indices) - 1
+            if shards_loaded >= self.mix_shards or is_last:
+                if self.shuffle and self.mix_shards > 1:
+                    buffer = buffer[torch.randperm(len(buffer))]
+                while len(buffer) >= self.batch_size:
+                    if self.max_batches is not None and n_yielded >= self.max_batches:
+                        return
+                    yield buffer[: self.batch_size]
+                    buffer = buffer[self.batch_size :]
+                    n_yielded += 1
+                shards_loaded = 0
 
         # Yield remainder as a partial batch (skip if capped)
         if self.max_batches is None and buffer is not None and len(buffer) > 0:
diff --git a/bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/architectures/topk.py b/bionemo-recipes/interpretability/sparse_autoencoders/sae/src/sae/architectures/topk.py
@@ -22,6 +22,7 @@
 from typing import Any, Dict, Optional, Tuple
 
 import torch
+import torch.distributed as dist
 import torch.nn as nn
 import torch.nn.functional as F
 
@@ -59,6 +60,14 @@ class TopKSAE(SparseAutoencoder):
         auxk: Number of auxiliary latents for dead latent loss (None = disabled)
         auxk_coef: Coefficient for auxiliary loss (default: 1/32)
         dead_tokens_threshold: Tokens of inactivity before latent is considered dead (default 10M per Gao et al.)
+        aggregate_loss: If False (default), reduce the FVU and AuxK losses per-token (the
+            previous mean-of-per-row ratios). If True, use a single batch-level
+            ``mse.mean() / var.mean()`` ratio, which stops rare high-variance tokens from
+            being down-weighted (and thus their latents dying).
+        dead_count_global: If True, accumulate dead-latent inactivity counts across all DDP
+            ranks (total tokens = micro-batch x world_size); if False (default), count this
+            rank's micro-batch only. True makes the dead-threshold / AuxK revival fire on time
+            under data parallelism.
         init_encoder_from_decoder: If True, initialize encoder weights as transpose
             of decoder weights. From OpenAI paper: this + AuxK → nearly 0% dead latents.
     """
@@ -72,6 +81,8 @@ def __init__(
         auxk: Optional[int] = None,
         auxk_coef: float = 1 / 32,
         dead_tokens_threshold: int = 10_000_000,
+        aggregate_loss: bool = False,
+        dead_count_global: bool = False,
         init_encoder_from_decoder: bool = True,
         init_pre_bias: bool = True,
         decoder_impl: str = "dense",
@@ -94,6 +105,12 @@ def __init__(
         if decoder_impl not in ("dense", "triton"):
             raise ValueError(f"decoder_impl must be 'dense' or 'triton', got {decoder_impl!r}")
         self.decoder_impl = decoder_impl
+        # False (default = previous per-token reduction) | True (batch-level aggregate FVU/auxk
+        # ratio; opt in to fix dead latents starved by the per-token ratio on rare high-var tokens).
+        self.aggregate_loss = aggregate_loss
+        # False (default = previous per-rank count) | True (count inactivity in TOTAL tokens,
+        # x world_size, so dead-latent revival fires on time under DDP; opt in).
+        self.dead_count_global = dead_count_global
 
         # Pre-bias (subtracted from normalized input, added to output before denorm)
         self.pre_bias = nn.Parameter(torch.zeros(input_dim))
@@ -125,6 +142,8 @@ def _get_config(self) -> Dict[str, Any]:
             "auxk": self.auxk,
             "auxk_coef": self.auxk_coef,
             "dead_tokens_threshold": self.dead_tokens_threshold,
+            "aggregate_loss": self.aggregate_loss,
+            "dead_count_global": self.dead_count_global,
         }
 
     def _init_encoder_from_decoder(self) -> None:
@@ -288,8 +307,17 @@ def _update_dead_latent_stats(self, codes: torch.Tensor) -> None:
         # Check which latents were active (any sample in batch had activation > threshold)
         active_mask = (codes.abs() > 1e-3).any(dim=0)  # [hidden_dim]
 
-        # Reset counter for active latents, increment by token count for inactive
-        n_tokens = codes.shape[0]
+        # dead_count_global=True increments by GLOBAL tokens, not this rank's micro-batch:
+        # each of the world_size ranks processes codes.shape[0] tokens per step, so the
+        # inactivity counter must advance by codes.shape[0] * world_size to match
+        # dead_tokens_threshold's intended units (total training tokens). The default
+        # (per-rank count) makes the threshold (and auxk revival) trigger world_size x too
+        # late under DDP. The trainer's all_reduce(MIN) preserves "fired on any rank => reset".
+        if self.dead_count_global and dist.is_available() and dist.is_initialized():
+            world_size = dist.get_world_size()
+        else:
+            world_size = 1
+        n_tokens = codes.shape[0] * world_size
         self.stats_last_nonzero = torch.where(
             active_mask, torch.zeros_like(self.stats_last_nonzero), self.stats_last_nonzero + n_tokens
         )
@@ -331,25 +359,38 @@ def _compute_auxk_loss(
         # Decode auxiliary latents using only dead decoder columns (avoids full-width matmul)
         recon_aux = F.linear(codes_aux, self.decoder.weight[:, dead_indices], self.decoder.bias)
 
-        # Target is the residual (what primary reconstruction missed)
-        # Work in normalized space for the aux loss
+        # Target is the residual (what primary reconstruction missed).
+        # The corrected residual is x - recon (the actual reconstruction error). The legacy
+        # non-normalized form `x - recon + pre_bias` simplifies to `x - decoder(codes)`, whose
+        # norm is dominated by ||pre_bias|| rather than the actual error, weakening the aux
+        # gradient by ~(||pre_bias|| / ||error||)^2. Gated on aggregate_loss so False
+        # reproduces the previous auxk loss end-to-end; True uses the fix.
         if self.normalize_input and norm_info is not None:
-            # Normalize x to match the space where encoding happened
+            # Normalize x to match the space where encoding happened (already correct in both modes)
             x_norm = (x - norm_info["mu"]) / norm_info["std"]
             # Reuse codes from forward pass instead of re-encoding (or a precomputed
             # normalized recon, e.g. from the sparse/triton decode path).
             if recon_norm is None:
                 recon_norm = self.decoder(codes) + self.pre_bias
             residual = x_norm - recon_norm.detach()
+        elif not self.aggregate_loss:
+            residual = x - recon.detach() + self.pre_bias.detach()  # legacy (previous behavior)
         else:
-            residual = x - recon.detach() + self.pre_bias.detach()
-
-        # Normalized MSE: MSE / variance of target
-        mse = (recon_aux - residual).pow(2).mean(dim=-1)  # [batch]
-        target_var = residual.pow(2).mean(dim=-1)  # [batch]
-
-        # Avoid division by zero, use nan_to_num like OpenAI
-        normalized_mse = (mse / (target_var + 1e-8)).mean()
+            residual = x - recon.detach()  # corrected: the true reconstruction error
+
+        # AuxK normalized MSE: how much of the residual the dead latents recover. Default
+        # (aggregate_loss=False) is the legacy per-token ratio (mse_t / target_var_t), which
+        # up-weights already-well-reconstructed (small residual) tokens and down-weights the
+        # big missed structure dead latents should grab — mis-targeting revival and letting
+        # dead latents persist. aggregate_loss=True aggregates over the whole batch instead.
+        if not self.aggregate_loss:
+            mse = (recon_aux - residual).pow(2).mean(dim=-1)
+            target_var = residual.pow(2).mean(dim=-1)
+            normalized_mse = (mse / (target_var + 1e-8)).mean()
+        else:
+            mse = (recon_aux - residual).pow(2).mean()
+            target_var = residual.pow(2).mean()
+            normalized_mse = mse / (target_var + 1e-8)
 
         return normalized_mse
 
@@ -433,13 +474,19 @@ def loss(self, x: torch.Tensor, **kwargs) -> Dict[str, torch.Tensor]:
         # Update dead latent stats
         self._update_dead_latent_stats(codes)
 
-        # Primary reconstruction loss (FVU: fraction of variance unexplained)
-        # Center by pre_bias (learned per-dim mean) so denominator reflects
-        # actual signal variance, consistent with var_exp metric
-        mse = (recon - x).pow(2).mean(dim=-1)  # [batch]
-        x_centered = x - self.pre_bias
-        x_var = x_centered.pow(2).mean(dim=-1)  # [batch]
-        recon_loss = (mse / (x_var + 1e-8)).mean()
+        # Primary reconstruction loss (FVU: fraction of variance unexplained), centered by
+        # pre_bias to match the reported var_exp metric. Default (aggregate_loss=False) is the
+        # legacy per-token ratio mean_t(mse_t / x_var_t), which over-weights low-variance tokens
+        # and down-weights rare high-variance ones, starving the latents specialized on them.
+        # aggregate_loss=True uses a single batch-level mse.mean() / var.mean() ratio instead.
+        if not self.aggregate_loss:
+            mse = (recon - x).pow(2).mean(dim=-1)
+            x_var = (x - self.pre_bias).pow(2).mean(dim=-1)
+            recon_loss = (mse / (x_var + 1e-8)).mean()
+        else:
+            mse = (recon - x).pow(2).mean()
+            x_var = (x - self.pre_bias).pow(2).mean()
+            recon_loss = mse / (x_var + 1e-8)
 
         # Sparsity metric (for logging)
         l0 = (codes != 0).float().sum(dim=-1).mean()
diff --git a/bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_topk.py b/bionemo-recipes/interpretability/sparse_autoencoders/sae/tests/test_topk.py
@@ -0,0 +1,63 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-Apache2
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for TopKSAE training-quality options: loss reduction + global dead-latent counting."""
+
+import torch
+from sae.architectures import topk as topk_mod
+from sae.architectures.topk import TopKSAE
+
+
+def _make_sae(**kw):
+    torch.manual_seed(0)
+    return TopKSAE(input_dim=8, hidden_dim=16, top_k=4, normalize_input=False, **kw)
+
+
+def test_recon_loss_aggregate_matches_batch_fvu():
+    """aggregate_loss=True equals the batch-level FVU mse.mean()/var.mean()."""
+    x = torch.randn(32, 8)
+    sae = _make_sae(aggregate_loss=True)
+    recon = sae.forward_with_aux(x)["recon"]
+    expected = (recon - x).pow(2).mean() / ((x - sae.pre_bias).pow(2).mean() + 1e-8)
+    assert torch.allclose(sae.loss(x)["total"], expected)
+
+
+def test_dead_latent_count_global_vs_local(monkeypatch):
+    """dead_count_global advances the inactivity counter by tokens x world_size; else local."""
+    # Pretend we're in a 4-rank distributed run.
+    monkeypatch.setattr(topk_mod.dist, "is_available", lambda: True)
+    monkeypatch.setattr(topk_mod.dist, "is_initialized", lambda: True)
+    monkeypatch.setattr(topk_mod.dist, "get_world_size", lambda: 4)
+
+    codes = torch.zeros(10, 16)
+    codes[:, 0] = 1.0  # only latent 0 fires
+
+    g = _make_sae(dead_count_global=True)
+    g.stats_last_nonzero.zero_()
+    g._update_dead_latent_stats(codes)
+    assert int(g.stats_last_nonzero[0]) == 0  # fired -> reset
+    assert int(g.stats_last_nonzero[1]) == 10 * 4  # inactive -> tokens x world_size
+
+    loc = _make_sae(dead_count_global=False)
+    loc.stats_last_nonzero.zero_()
+    loc._update_dead_latent_stats(codes)
+    assert int(loc.stats_last_nonzero[1]) == 10  # inactive -> local micro-batch only
+
+
+def test_opted_in_options_round_trip_through_config():
+    """Opted-in (non-default) options serialize in the checkpoint config so a reload keeps them."""
+    cfg = _make_sae(aggregate_loss=True, dead_count_global=True)._get_config()
+    assert cfg["aggregate_loss"] is True
+    assert cfg["dead_count_global"] is True