fix dram_kv.hit_rate_pct normalization (#5777)

Xinyi Wang · facebook-github-bot · commit 4e605fa611a2 · 2026-05-27T09:43:23.000-07:00
Summary:
X-link: facebookresearch/FBGEMM#2706

The `dram_kv.hit_rate_pct` metric in `SSDTableBatchedEmbeddingBags` was computed as `dram_read_hit_count / (dram_read_hit_count + dram_read_miss_count)`. The denominator counts only requests that reached DRAM, i.e. L1 misses. When `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more work, not less, in the cheaper tier.

This diff changes `dram_kv.hit_rate_pct` to be normalized against `num_unique` (total unique indices in the batch, captured from the L1 reporting path):

    hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

Semantically this is now the overall (L1 + DRAM) hit rate: the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between tiers.

The new formula is algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today), i.e. `num_misses == dram_read_hit_count + dram_read_miss_count`. A code comment documents this caveat in case a future SSD-bypass path is added.

Implementation:
- `_report_ssd_l1_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window.
- `l1_hit_rate_pct` and `l2_cache.hit_rate_pct` are untouched.

Differential Revision: D105727013
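The effect described above can be checked with a small standalone sketch. It is not part of the diff: the helper names and the counter values are hypothetical, chosen only so that every L1 miss reaches DRAM (`l1_misses == dram_hits + dram_misses`), which is the assumption the summary states.

```python
# Illustrative sketch only: helper names and counter values are hypothetical
# and are not taken from fbgemm_gpu.


def old_hit_rate_pct(dram_hits: float, dram_misses: float) -> float:
    """Previous formula: DRAM hits over DRAM lookups (L1 misses only)."""
    total = dram_hits + dram_misses
    return 100.0 * dram_hits / total if total > 0 else 0.0


def new_hit_rate_pct(num_unique: float, dram_misses: float) -> float:
    """New formula: unique requests that did not miss at DRAM."""
    if num_unique <= 0:
        return 0.0
    return 100.0 * (num_unique - dram_misses) / num_unique


def expanded_hit_rate_pct(
    num_unique: float, l1_misses: float, dram_hits: float, dram_misses: float
) -> float:
    """Two-term form: L1_hit + (1 - L1_hit) * DRAM_hit_conditional."""
    if num_unique <= 0:
        return 0.0
    l1_hit = (num_unique - l1_misses) / num_unique
    dram_total = dram_hits + dram_misses
    dram_hit_cond = dram_hits / dram_total if dram_total > 0 else 0.0
    return 100.0 * (l1_hit + (1.0 - l1_hit) * dram_hit_cond)


# Two made-up reporting windows with the same workload (1000 unique keys,
# 100 of which miss at DRAM) but different L1 sizes.
SMALL_L1 = dict(num_unique=1000.0, l1_misses=500.0, dram_hits=400.0, dram_misses=100.0)
LARGE_L1 = dict(num_unique=1000.0, l1_misses=200.0, dram_hits=100.0, dram_misses=100.0)

for name, w in (("small L1", SMALL_L1), ("large L1", LARGE_L1)):
    old = old_hit_rate_pct(w["dram_hits"], w["dram_misses"])
    new = new_hit_rate_pct(w["num_unique"], w["dram_misses"])
    expanded = expanded_hit_rate_pct(**w)
    print(f"{name}: old={old:.1f} new={new:.1f} expanded={expanded:.1f}")
```

In these made-up windows the old metric falls as L1 grows while the new one stays put, and the new formula agrees with the two-term form. If `l1_misses` were larger than `dram_hits + dram_misses` (some L1 misses bypassing DRAM), the two forms would no longer agree, which is the caveat the code comment records.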
diff --git a/fbgemm_gpu/fbgemm_gpu/tbe/ssd/training.py b/fbgemm_gpu/fbgemm_gpu/tbe/ssd/training.py
@@ -1233,6 +1233,10 @@ def __init__(
         # 4: N_conflict_unique_misses, 5: N_conflict_misses
         self.last_reported_ssd_stats: list[float] = []
         self.last_reported_step = 0
+        # Stashed by _report_ssd_l1_cache_stats so _report_dram_kv_perf_stats
+        # can normalize DRAM hit rate against total unique indices instead
+        # of only L1-miss lookups. See T272139146.
+        self._last_l1_num_unique: float = 0.0
 
         self.register_buffer(
             "ssd_cache_stats",
@@ -4185,6 +4189,7 @@ def _report_ssd_l1_cache_stats(self) -> None:
         # L1 cache hit rate
         num_unique = ssd_cache_stats_delta[UVMCacheStatsIndex.num_unique_indices]
         num_misses = ssd_cache_stats_delta[UVMCacheStatsIndex.num_unique_misses]
+        self._last_l1_num_unique = num_unique
         if num_unique > 0:
             l1_hit_rate_pct = 100.0 * (num_unique - num_misses) / num_unique
             # Per-TBE L1 hit rate
@@ -4847,8 +4852,23 @@ def _report_dram_kv_perf_stats(self) -> None:
                 data_bytes=dram_read_miss_count,
                 enable_tb_metrics=True,
             )
-            if dram_read_total > 0:
-                hit_rate_pct = 100.0 * dram_read_hit_count / dram_read_total
+            # Hit rate normalized to total unique requests:
+            # (L1 hits + DRAM hits) / num_unique. Stable across l1_cache_size
+            # values; the previous formula (dram_hits / dram_calls) dropped
+            # mechanically as L1 grew because only long-tail keys reached DRAM.
+            # See T272139146.
+            #
+            # Algebraically equivalent to the expanded form
+            #     L1_hit_rate + (1 - L1_hit_rate) * DRAM_hit_rate_conditional
+            # under the assumption that every L1 miss reaches DRAM, i.e.
+            #     num_misses == dram_read_hit_count + dram_read_miss_count.
+            # This holds today since DRAM is the only tier behind L1. If a
+            # future code path lets L1 misses bypass DRAM (e.g. direct SSD
+            # read), this simplified form will silently diverge from the
+            # explicit two-term form — revisit then.
+            num_unique = self._last_l1_num_unique
+            if num_unique > 0:
+                hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique
                 # Per-TBE hit rate
                 stats_reporter.report_data_amount(
                     iteration_step=self.step,