Skip to content

add overall L1→DRAM hit rate metric (#5777)#5777

Open
xywang9334 wants to merge 1 commit into
pytorch:mainfrom
xywang9334:export-D105727013
Open

add overall L1→DRAM hit rate metric (#5777)#5777
xywang9334 wants to merge 1 commit into
pytorch:mainfrom
xywang9334:export-D105727013

Conversation

@xywang9334
Copy link
Copy Markdown

@xywang9334 xywang9334 commented May 21, 2026

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2706

SSDTableBatchedEmbeddingBags already emits per-tier hit rates — ssd_tbe.prefetch.l1_hit_rate_pct, l2_cache.hit_rate_pct, and dram_kv.hit_rate_pct — but each is conditional on requests that reached that tier. As l1_cache_size grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. None of the existing per-tier metrics give an at-a-glance answer to "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?".

This diff adds an ssd_tbe.overall_hit_rate_pct aggregate metric (per-TBE: ssd_tbe.tbe_id{N}.overall_hit_rate_pct) defined as:

overall_hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM.

Algebraically equivalent to the expanded form L1_hit + (1 - L1_hit) * DRAM_hit_conditional under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.

The existing per-tier metrics (l1_hit_rate_pct, l2_cache.hit_rate_pct, dram_kv.hit_rate_pct) are left unchanged — they remain useful for diagnosing per-tier behavior.

Implementation:

  • _report_uvm_cache_stats stashes num_unique into _last_l1_num_unique so _report_dram_kv_perf_stats can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same should_report(self.step) cadence, so the stashed value corresponds to the same reporting window.

Reviewed By: kausv

Differential Revision: D105727013

@meta-cla meta-cla Bot added the cla signed label May 21, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented May 21, 2026

@xywang9334 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105727013.

xywang9334 pushed a commit to xywang9334/FBGEMM that referenced this pull request May 21, 2026
Summary:

X-link: facebookresearch/FBGEMM#2706

The `dram_kv.hit_rate_pct` metric in `SSDTableBatchedEmbeddingBags` was computed as `dram_read_hit_count / (dram_read_hit_count + dram_read_miss_count)` — the denominator only counts requests that reached DRAM, i.e. L1 misses. When `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier.

This diff changes `dram_kv.hit_rate_pct` to be normalized against `num_unique` (total unique indices in the batch, captured from the L1 reporting path):

    hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

Semantically this is now the overall (L1 + DRAM) hit rate — the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between tiers.

Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.

Implementation:

- `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window.
- `l1_hit_rate_pct` and `l2_cache.hit_rate_pct` are untouched.

Differential Revision: D105727013
@xywang9334 xywang9334 force-pushed the export-D105727013 branch from 1d50176 to c7b12ea Compare May 21, 2026 22:53
@meta-codesync meta-codesync Bot changed the title fix dram_kv.hit_rate_pct normalization fix dram_kv.hit_rate_pct normalization (#5777) May 21, 2026
@xywang9334 xywang9334 force-pushed the export-D105727013 branch from c7b12ea to 4e605fa Compare May 27, 2026 16:43
xywang9334 pushed a commit to xywang9334/FBGEMM that referenced this pull request May 27, 2026
Summary:

X-link: facebookresearch/FBGEMM#2706

The `dram_kv.hit_rate_pct` metric in `SSDTableBatchedEmbeddingBags` was computed as `dram_read_hit_count / (dram_read_hit_count + dram_read_miss_count)` — the denominator only counts requests that reached DRAM, i.e. L1 misses. When `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier.

This diff changes `dram_kv.hit_rate_pct` to be normalized against `num_unique` (total unique indices in the batch, captured from the L1 reporting path):

    hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

Semantically this is now the overall (L1 + DRAM) hit rate — the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between tiers.

Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.

Implementation:

- `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window.
- `l1_hit_rate_pct` and `l2_cache.hit_rate_pct` are untouched.

Differential Revision: D105727013
@xywang9334 xywang9334 force-pushed the export-D105727013 branch from 4e605fa to 59189b8 Compare May 29, 2026 17:55
@meta-codesync meta-codesync Bot changed the title fix dram_kv.hit_rate_pct normalization (#5777) add overall L1→DRAM hit rate metric (#5777) May 29, 2026
xywang9334 pushed a commit to xywang9334/FBGEMM that referenced this pull request May 29, 2026
Summary:

X-link: facebookresearch/FBGEMM#2706

`SSDTableBatchedEmbeddingBags` already emits per-tier hit rates — `ssd_tbe.prefetch.l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, and `dram_kv.hit_rate_pct` — but each is conditional on requests that reached that tier. As `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. None of the existing per-tier metrics give an at-a-glance answer to "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?".

This diff adds an `ssd_tbe.overall_hit_rate_pct` aggregate metric (per-TBE: `ssd_tbe.tbe_id{N}.overall_hit_rate_pct`) defined as:

    overall_hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM.

Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.

The existing per-tier metrics (`l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, `dram_kv.hit_rate_pct`) are left unchanged — they remain useful for diagnosing per-tier behavior.

Implementation:

- `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window.

Differential Revision: D105727013
Summary:

X-link: facebookresearch/FBGEMM#2706

`SSDTableBatchedEmbeddingBags` already emits per-tier hit rates — `ssd_tbe.prefetch.l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, and `dram_kv.hit_rate_pct` — but each is conditional on requests that reached that tier. As `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. None of the existing per-tier metrics give an at-a-glance answer to "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?".

This diff adds an `ssd_tbe.overall_hit_rate_pct` aggregate metric (per-TBE: `ssd_tbe.tbe_id{N}.overall_hit_rate_pct`) defined as:

    overall_hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique

i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM.

Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.

The existing per-tier metrics (`l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, `dram_kv.hit_rate_pct`) are left unchanged — they remain useful for diagnosing per-tier behavior.

Implementation:

- `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window.

Reviewed By: kausv

Differential Revision: D105727013
@xywang9334 xywang9334 force-pushed the export-D105727013 branch from 59189b8 to 9fed2a5 Compare June 1, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant