add overall L1→DRAM hit rate metric (#5777) by xywang9334 · Pull Request #5777 · pytorch/FBGEMM

xywang9334 · 2026-05-21T22:33:48Z

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2706

SSDTableBatchedEmbeddingBags already emits per-tier hit rates (ssd_tbe.prefetch.l1_hit_rate_pct, l2_cache.hit_rate_pct, and dram_kv.hit_rate_pct), but each is conditional on the requests that reached that tier. As l1_cache_size grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more, not less, work in the cheaper tier. None of the existing per-tier metrics gives an at-a-glance answer to the question "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?"

This diff adds an ssd_tbe.overall_hit_rate_pct aggregate metric (per-TBE: ssd_tbe.tbe_id{N}.overall_hit_rate_pct) defined as:

overall_hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM.
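To make the definition concrete, here is a minimal sketch rather than the FBGEMM implementation. The function names, the zero-denominator guard, and the request counts are all assumptions chosen for illustration. The conditional DRAM formula is the usual hits / (hits + misses) over requests that reached DRAM.

```python
def overall_hit_rate_pct(num_unique: int, dram_read_miss_count: int) -> float:
    """Percentage of unique requests that did not miss at DRAM."""
    if num_unique <= 0:
        # Assumption: report 0.0 for an empty window instead of dividing by zero.
        return 0.0
    return 100.0 * (num_unique - dram_read_miss_count) / num_unique


def conditional_dram_hit_rate_pct(
    dram_read_hit_count: int, dram_read_miss_count: int
) -> float:
    """DRAM hit rate conditional on requests that reached DRAM (L1 misses)."""
    reached_dram = dram_read_hit_count + dram_read_miss_count
    if reached_dram <= 0:
        return 0.0
    return 100.0 * dram_read_hit_count / reached_dram


# Illustrative (made-up) window: 1000 unique keys, 50 of which miss DRAM.
# Small L1: 400 L1 hits, so 600 requests reach DRAM (550 hits, 50 misses).
small_l1_conditional = conditional_dram_hit_rate_pct(550, 50)
# Large L1: 800 L1 hits, so 200 requests reach DRAM (150 hits, 50 misses).
large_l1_conditional = conditional_dram_hit_rate_pct(150, 50)
# The aggregate depends only on num_unique and the DRAM miss count.
overall = overall_hit_rate_pct(1000, 50)
```

In this made-up window the conditional DRAM rate falls from about 91.7% to 75% as L1 grows, while the aggregate stays at 95% because the same 50 keys still fall through to SSD.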

With the rates expressed as fractions (before scaling by 100), this is algebraically equivalent to the expanded form L1_hit + (1 - L1_hit) * DRAM_hit_conditional, under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.
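The equivalence can be checked in a few lines. The symbols are labels introduced here, not names from the code: $U$ is num_unique, $H_1$ the number of L1 hits, and $H_D$, $M_D$ the DRAM read hit and miss counts. The stated assumption that every L1 miss reaches DRAM means $U - H_1 = H_D + M_D$, and the conditional rate is only defined when $U - H_1 > 0$.

```latex
\begin{aligned}
\text{L1\_hit} &= \frac{H_1}{U}, \qquad
\text{DRAM\_hit\_conditional} = \frac{H_D}{H_D + M_D} = \frac{H_D}{U - H_1}, \\[4pt]
\text{L1\_hit} + (1 - \text{L1\_hit}) \cdot \text{DRAM\_hit\_conditional}
  &= \frac{H_1}{U} + \frac{U - H_1}{U} \cdot \frac{H_D}{U - H_1} \\
  &= \frac{H_1 + H_D}{U} \\
  &= \frac{U - M_D}{U}.
\end{aligned}
```

Multiplying the last line by 100 gives the definition of overall_hit_rate_pct. If some L1 misses bypassed DRAM, $U - H_1 = H_D + M_D$ would no longer hold and the two forms would diverge, which is the caveat the code comment records.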

The existing per-tier metrics (l1_hit_rate_pct, l2_cache.hit_rate_pct, dram_kv.hit_rate_pct) are left unchanged; they remain useful for diagnosing per-tier behavior.

Implementation:

_report_uvm_cache_stats stashes num_unique into _last_l1_num_unique so _report_dram_kv_perf_stats can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same should_report(self.step) cadence, so the stashed value corresponds to the same reporting window.
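The stash-and-reuse pattern described above might look roughly like the following toy class. This is a sketch under stated assumptions, not the code in this PR. The method and attribute names mirror the description, but the class name `ReporterSketch`, the method signatures, the `report_interval` cadence rule, and the `emitted` dictionary are hypothetical.

```python
from typing import Dict, Optional


class ReporterSketch:
    """Toy stand-in for the reporting logic; not the FBGEMM class."""

    def __init__(self, report_interval: int) -> None:
        self.report_interval = report_interval  # hypothetical cadence knob
        self.step = 0
        self._last_l1_num_unique: Optional[int] = None
        self.emitted: Dict[str, float] = {}  # hypothetical metric sink

    def should_report(self, step: int) -> bool:
        # Assumption: report once every `report_interval` steps.
        return step % self.report_interval == 0

    def _report_uvm_cache_stats(self, num_unique: int) -> None:
        if not self.should_report(self.step):
            return
        # Stash the L1-side denominator for the DRAM reporter to reuse.
        self._last_l1_num_unique = num_unique

    def _report_dram_kv_perf_stats(self, dram_read_miss_count: int) -> None:
        if not self.should_report(self.step):
            return
        num_unique = self._last_l1_num_unique
        if not num_unique:
            # Nothing stashed yet, or an empty window: skip the metric.
            return
        self.emitted["ssd_tbe.overall_hit_rate_pct"] = (
            100.0 * (num_unique - dram_read_miss_count) / num_unique
        )


reporter = ReporterSketch(report_interval=10)
reporter.step = 10  # a reporting step under the assumed cadence
reporter._report_uvm_cache_stats(num_unique=1000)
reporter._report_dram_kv_perf_stats(dram_read_miss_count=50)
```

Because both reporters gate on the same `should_report(self.step)` check, the stashed denominator and the DRAM miss count come from the same reporting window. Skipping the metric when nothing has been stashed is a choice made in this sketch only.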

Reviewed By: kausv

Differential Revision: D105727013

meta-codesync · 2026-05-21T22:34:10Z

@xywang9334 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105727013.

Summary:

X-link: facebookresearch/FBGEMM#2706

The `dram_kv.hit_rate_pct` metric in `SSDTableBatchedEmbeddingBags` was computed as `dram_read_hit_count / (dram_read_hit_count + dram_read_miss_count)`; the denominator only counts requests that reached DRAM, i.e. L1 misses. When `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more, not less, work in the cheaper tier.

This diff changes `dram_kv.hit_rate_pct` to be normalized against `num_unique` (total unique indices in the batch, captured from the L1 reporting path):

hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

Semantically this is now the overall (L1 + DRAM) hit rate: the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between tiers.

Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.

Implementation:

- `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window.
- `l1_hit_rate_pct` and `l2_cache.hit_rate_pct` are untouched.

Differential Revision: D105727013


meta-codesync · 2026-06-04T00:09:12Z

This pull request has been merged in cb4a51d.

meta-cla Bot added the cla signed label May 21, 2026

meta-codesync Bot added fb-exported meta-exported labels May 21, 2026

xywang9334 force-pushed the export-D105727013 branch from 1d50176 to c7b12ea Compare May 21, 2026 22:53

meta-codesync Bot changed the title ~~fix dram_kv.hit_rate_pct normalization~~ fix dram_kv.hit_rate_pct normalization (#5777) May 21, 2026

xywang9334 force-pushed the export-D105727013 branch from c7b12ea to 4e605fa Compare May 27, 2026 16:43

xywang9334 force-pushed the export-D105727013 branch from 4e605fa to 59189b8 Compare May 29, 2026 17:55

meta-codesync Bot changed the title ~~fix dram_kv.hit_rate_pct normalization (#5777)~~ add overall L1→DRAM hit rate metric (#5777) May 29, 2026

xywang9334 force-pushed the export-D105727013 branch from 59189b8 to 9fed2a5 Compare June 1, 2026 17:42

xywang9334 force-pushed the export-D105727013 branch from 9fed2a5 to 2269007 Compare June 2, 2026 17:19

xywang9334 force-pushed the export-D105727013 branch from 2269007 to 3b55816 Compare June 2, 2026 18:53

meta-codesync Bot closed this in cb4a51d Jun 4, 2026

facebook-github-tools Bot added the Merged label Jun 4, 2026

gchalump added category:improvement contributor:Meta feature:tbessd labels Jul 2, 2026

xywang9334 wants to merge 1 commit into pytorch:main from xywang9334:export-D105727013
