add overall L1→DRAM hit rate metric (#5777)#5777
Open
xywang9334 wants to merge 1 commit into
Open
Conversation
Contributor
|
@xywang9334 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105727013. |
xywang9334
pushed a commit
to xywang9334/FBGEMM
that referenced
this pull request
May 21, 2026
Summary: X-link: facebookresearch/FBGEMM#2706 The `dram_kv.hit_rate_pct` metric in `SSDTableBatchedEmbeddingBags` was computed as `dram_read_hit_count / (dram_read_hit_count + dram_read_miss_count)` — the denominator only counts requests that reached DRAM, i.e. L1 misses. When `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. This diff changes `dram_kv.hit_rate_pct` to be normalized against `num_unique` (total unique indices in the batch, captured from the L1 reporting path): hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique Semantically this is now the overall (L1 + DRAM) hit rate — the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between tiers. Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added. Implementation: - `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window. - `l1_hit_rate_pct` and `l2_cache.hit_rate_pct` are untouched. Differential Revision: D105727013
1d50176 to
c7b12ea
Compare
c7b12ea to
4e605fa
Compare
xywang9334
pushed a commit
to xywang9334/FBGEMM
that referenced
this pull request
May 27, 2026
Summary: X-link: facebookresearch/FBGEMM#2706 The `dram_kv.hit_rate_pct` metric in `SSDTableBatchedEmbeddingBags` was computed as `dram_read_hit_count / (dram_read_hit_count + dram_read_miss_count)` — the denominator only counts requests that reached DRAM, i.e. L1 misses. When `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. This diff changes `dram_kv.hit_rate_pct` to be normalized against `num_unique` (total unique indices in the batch, captured from the L1 reporting path): hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique Semantically this is now the overall (L1 + DRAM) hit rate — the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between tiers. Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added. Implementation: - `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window. - `l1_hit_rate_pct` and `l2_cache.hit_rate_pct` are untouched. Differential Revision: D105727013
4e605fa to
59189b8
Compare
xywang9334
pushed a commit
to xywang9334/FBGEMM
that referenced
this pull request
May 29, 2026
Summary: X-link: facebookresearch/FBGEMM#2706 `SSDTableBatchedEmbeddingBags` already emits per-tier hit rates — `ssd_tbe.prefetch.l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, and `dram_kv.hit_rate_pct` — but each is conditional on requests that reached that tier. As `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. None of the existing per-tier metrics give an at-a-glance answer to "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?". This diff adds an `ssd_tbe.overall_hit_rate_pct` aggregate metric (per-TBE: `ssd_tbe.tbe_id{N}.overall_hit_rate_pct`) defined as: overall_hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM. Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added. The existing per-tier metrics (`l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, `dram_kv.hit_rate_pct`) are left unchanged — they remain useful for diagnosing per-tier behavior. Implementation: - `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window. Differential Revision: D105727013
Summary: X-link: facebookresearch/FBGEMM#2706 `SSDTableBatchedEmbeddingBags` already emits per-tier hit rates — `ssd_tbe.prefetch.l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, and `dram_kv.hit_rate_pct` — but each is conditional on requests that reached that tier. As `l1_cache_size` grows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. None of the existing per-tier metrics give an at-a-glance answer to "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?". This diff adds an `ssd_tbe.overall_hit_rate_pct` aggregate metric (per-TBE: `ssd_tbe.tbe_id{N}.overall_hit_rate_pct`) defined as: overall_hit_rate_pct = 100.0 * (num_unique - dram_read_miss_count) / num_unique i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM. Algebraically equivalent to the expanded form `L1_hit + (1 - L1_hit) * DRAM_hit_conditional` under the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added. The existing per-tier metrics (`l1_hit_rate_pct`, `l2_cache.hit_rate_pct`, `dram_kv.hit_rate_pct`) are left unchanged — they remain useful for diagnosing per-tier behavior. Implementation: - `_report_uvm_cache_stats` stashes `num_unique` into `_last_l1_num_unique` so `_report_dram_kv_perf_stats` can use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the same `should_report(self.step)` cadence, so the stashed value corresponds to the same reporting window. Reviewed By: kausv Differential Revision: D105727013
59189b8 to
9fed2a5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2706
SSDTableBatchedEmbeddingBagsalready emits per-tier hit rates —ssd_tbe.prefetch.l1_hit_rate_pct,l2_cache.hit_rate_pct, anddram_kv.hit_rate_pct— but each is conditional on requests that reached that tier. Asl1_cache_sizegrows, L1 absorbs more keys and only the long-tail keys fall through to DRAM, so the L1-conditional DRAM hit rate drops mechanically even though the system is doing more — not less — work in the cheaper tier. None of the existing per-tier metrics give an at-a-glance answer to "what fraction of unique requests were served from cache (L1 or DRAM), without paying SSD cost?".This diff adds an
ssd_tbe.overall_hit_rate_pctaggregate metric (per-TBE:ssd_tbe.tbe_id{N}.overall_hit_rate_pct) defined as:i.e. the fraction of unique requests that did not miss at DRAM. The value stays stable as cache sizes shift between L1 and DRAM.
Algebraically equivalent to the expanded form
L1_hit + (1 - L1_hit) * DRAM_hit_conditionalunder the assumption that every L1 miss reaches DRAM (the only path today). A code comment documents this caveat in case a future SSD-bypass path is added.The existing per-tier metrics (
l1_hit_rate_pct,l2_cache.hit_rate_pct,dram_kv.hit_rate_pct) are left unchanged — they remain useful for diagnosing per-tier behavior.Implementation:
_report_uvm_cache_statsstashesnum_uniqueinto_last_l1_num_uniqueso_report_dram_kv_perf_statscan use it as the normalization denominator without re-reading L1 counters. Both reporters fire from the sameshould_report(self.step)cadence, so the stashed value corresponds to the same reporting window.Reviewed By: kausv
Differential Revision: D105727013