[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508
[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508tiankongdeguiji wants to merge 24 commits into
Conversation
…ING) Extend `_calculate_dynamicemb_table_storage_specific_size` and its callers with a `caching: bool` parameter. HYBRID keeps the original split formula (DDR = (1 - cache_ratio) * T); CACHING accounts the full backing store on host (DDR = T) regardless of cache_ratio. HBM accounting and metadata (key + score + digest + bucket header) stay identical between modes. The flag is sourced from `dynamicemb_options.caching` on the ShardingOption in `dynamicemb_calculate_shard_storages`. Subsequent commits wire this into the enumerator (per-table sweep over both modes) and the proposer (2D DP). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `_dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats)`: HYBRID passes the ratio through (hash-partitioned HBM hit probability); CACHING boosts to `min(1.0, x * multiplier)`, default multiplier 10.0 via the `TZREC_CACHING_HIT_RATE_MULTIPLIER` env var. `cache_params.stats`, when provided, takes precedence — `1 - stats.expected_miss_rate(x)` — and is clamped to never fall below the HYBRID ratio. Inject the boost at planning time by monkey-patching `ShardPerfContext.build_shard_perf_contexts` to temporarily replace the sharding option's `cache_params.load_factor` with the effective ratio. The original cache_params is restored on return so the storage estimator (which reads cache_load_factor for HBM/DDR sizing) still sees the unboosted value. Net effect: at the same `cache_load_factor`, CACHING is strictly cheaper in perf and strictly more expensive in DDR. The DP proposer (next commit) trades one for the other against the topology budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the per-dynamicemb-table variant emission into `_emit_dynamicemb_variants(base, dynamicemb_options) -> List[ShardingOption]` and sweep both modes per cache_load_factor. When `cache_params` is unset, the helper emits 20 variants (10 factors × 2 modes); when fixed by the caller, it emits 2 (both modes at the fixed factor). Each variant owns a deep-copied `dynamicemb_options` so per-variant `caching` mutations stay isolated. `cache_params.stats` is preserved across all clones for the perf-side miss-rate override. The downstream 2D DP proposer (next commit) selects per table the (mode, ratio) pair that fits both HBM and host budgets while minimizing perf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite `DynamicProgrammingProposer.feedback` to discretize both per-rank HBM and per-rank DDR into bins (default 100 × 50 per device) and pick per table the (mode, cache_load_factor) pair that minimizes total perf while fitting both topology budgets. State: dp[table][hbm_bin][ddr_bin] = (perf, hbm_actual, ddr_actual) backtrack[table][hbm_bin][ddr_bin] = (opt_id, prev_hbm_bin, prev_ddr_bin) Options whose largest shard exceeds either per-device cap are pruned upfront. Backtracking enumerates feasible (hbm_total, ddr_total) cells in decreasing total memory, yielding Pareto-optimal proposals. The legacy `mem_bins_per_device` kwarg is preserved as an alias for `hbm_bins_per_device` so existing callers still work, and HBM-degenerate (CPU-only) or DDR-degenerate topologies collapse the unused axis to a single bin. Adds `DynamicProgrammingProposer2DTest`: - generous DDR → CACHING wins (cheaper perf, host can fit T) - tight DDR → CACHING rejected, falls back to HYBRID - high cache_load_factor → modes converge Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercise the full planning pipeline on a tiny TestSparseNN backed by a
DynamicEmbParameterConstraints constraint. Asserts:
- EmbeddingEnumerator yields 20 sharding options (2 modes × 10 factors)
- All cache_load_factors and both caching modes are represented
- Storage and perf estimators populate non-zero values for each option
- DynamicProgrammingProposer.propose() returns feasible plans whose
dynamicemb options carry a valid caching flag
Gated on has_dynamicemb + torch.cuda.is_available() since the dynamicemb
sharder requires CUDA at planning time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper (1) Restore the byte-budget framing in `_calculate_dynamicemb_table_storage_specific_size` docstring (`total_value_memory`, `num_buckets`, `hbm_budget`, `ddr_budget`) that was lost in the PR alibaba#508 rewrite, extending `ddr_budget` to cover both HYBRID and CACHING modes in one place. (2) Simplify `_dynamicemb_aware_build_shard_perf_contexts`: single return, no try/finally, no early-return branch. The unconditional `sharding_option.cache_params = original_cache_params` line at the end is a no-op when caching is False (we never mutated the field) and restores the original cache_params when caching is True. If the wrapped call raises, planning aborts and the option is discarded -- no need for try/finally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDR is host-shared across ranks co-located on one machine, so the per-option fit check in `DynamicProgrammingProposer.feedback` should compare `max_shard_ddr` against the largest machine's total DDR pool, not against any single rank's slice. HBM stays per-device (each GPU has its own HBM; no cross-rank sharing). Compute `max_machine_ddr` by summing per-device DDR within each contiguous `local_world_size`-sized window. Update the test fake-topology fixture to carry `local_world_size` (defaults to num_devices). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… copyright (5) Move `PlanUtilDynamicEmbE2ETest` from the standalone `tzrec/utils/plan_util_dynamicemb_e2e_test.py` into `plan_util_test.py` where the rest of the plan_util tests already live. Delete the standalone file. (6) Rename `DynamicProgrammingProposer2DTest` -> `DynamicProgrammingProposerTest`. The proposer class itself stays `DynamicProgrammingProposer`. (7) Bump copyright `2024 -> 2026` in the new file `tzrec/utils/dynamicemb_util_test.py`. The header in plan_util_test.py stays at 2024 since it was a pre-existing file we only extended. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urve Replace the heuristic ``_dynamicemb_effective_cache_ratio`` curve with constants fitted from an on-device perf sweep (``experiments/sweep_20260513-161030/full_a10gpu1.json``: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes: HYBRID @ x=1.0: 0.80 ms (HBM-only fast path) CACHING @ x<1.0: 2.63 ms (~3.3x slower) HYBRID @ x<1.0: 5.44 ms (~6.8x slower) Inverting the linear bw model bw = x_eff*HBM + (1-x_eff)*HBM_TO_DDR (torchrec defaults 897 / 32 GB/s) gives x_eff bases 0.28 (CACHING) and 0.11 (HYBRID). At x = 1.0 the runtime drops the host tier in both modes, so x_eff = 1.0 there. A +0.01*x tiebreaker term keeps the DP's ranking strict within each block. Also extend the ``_dynamicemb_aware_build_shard_perf_contexts`` wrapper to apply the empirical x_eff for HYBRID (previously HYBRID used raw ``cache_load_factor`` as x_eff -- inconsistent with the empirical data that shows HYBRID @ 0.9 is much slower than HYBRID @ 1.0 due to the storage backend switch). Drop the now-unused ``TZREC_CACHING_HIT_RATE_MULTIPLIER`` env knob and its related tests; rewrite ``EffectiveCacheRatioTest`` to assert the empirical strict-block ranking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The same provenance is already covered in ``_dynamicemb_effective_cache_ratio``'s docstring below; the block comment above the constants was redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| result = _orig_build_shard_perf_contexts( | ||
| cls, | ||
| config, | ||
| shard_sizes, | ||
| sharding_option, | ||
| topology, | ||
| constraints, | ||
| sharder, | ||
| *args, | ||
| **kwargs, | ||
| ) | ||
| sharding_option.cache_params = original_cache_params | ||
| return result |
There was a problem hiding this comment.
Restore should be in finally. If _orig_build_shard_perf_contexts raises, sharding_option.cache_params is permanently left as the boosted x_eff clone. Downstream consumers — the storage estimator and _to_sharding_plan at line 435 (cache_params=sharding_option.cache_params) — then see the wrong load factor on every other option that touches the same ShardingOption. Suggested:
| result = _orig_build_shard_perf_contexts( | |
| cls, | |
| config, | |
| shard_sizes, | |
| sharding_option, | |
| topology, | |
| constraints, | |
| sharder, | |
| *args, | |
| **kwargs, | |
| ) | |
| sharding_option.cache_params = original_cache_params | |
| return result | |
| try: | |
| result = _orig_build_shard_perf_contexts( | |
| cls, | |
| config, | |
| shard_sizes, | |
| sharding_option, | |
| topology, | |
| constraints, | |
| sharder, | |
| *args, | |
| **kwargs, | |
| ) | |
| finally: | |
| sharding_option.cache_params = original_cache_params | |
| return result |
| for caching_mode in (False, True): | ||
| for load_factor in load_factors: | ||
| opt = copy.deepcopy(base_option) | ||
| opt.cache_params = CacheParams(load_factor=load_factor, stats=stats) | ||
| # pyre-ignore [16] | ||
| opt.dynamicemb_options = copy.deepcopy(dynamicemb_options) | ||
| opt.dynamicemb_options.caching = caching_mode | ||
| variants.append(opt) |
There was a problem hiding this comment.
Redundant double deep-copy of dynamicemb_options. The caller (line 1035) attaches dynamicemb_options to base_option before calling this function, so copy.deepcopy(base_option) on line 848 already deep-copies dynamicemb_options into opt. Line 851 then deep-copies the original again and replaces it. With 20 variants per table this doubles a non-trivial copy. Either drop line 851 and just set opt.dynamicemb_options.caching = caching_mode, or stop attaching dynamicemb_options on the caller side before this call.
| Given :math:`M` tables each with up to :math:`N` ShardingOptions, pick one | ||
| option per table to minimize total perf while respecting both per-rank HBM | ||
| and per-rank DDR budgets from the topology. | ||
|
|
There was a problem hiding this comment.
Docstring says "per-rank DDR" but the implementation enforces per-machine DDR (lines 349–361 sum DDR across local_world_size co-located ranks and the per-option prune at line 391 compares against max_machine_ddr). The inline comment at 350–352 has the right framing — the class docstring should match: "per-rank HBM and per-machine DDR budgets."
| if x >= 1.0: | ||
| return 1.0 | ||
| base = _DYNAMICEMB_CACHING_X_EFF_BASE if caching else _DYNAMICEMB_HYBRID_X_EFF_BASE | ||
| return base + _DYNAMICEMB_X_EFF_TIEBREAK * x |
There was a problem hiding this comment.
Sharp discontinuity at x=1.0. HYBRID@x=0.99 → 0.1199; HYBRID@x=1.00 → 1.0 — an ~8× jump for a 1% ratio change. The DP will reliably prefer x=1.0 over x=0.99 by a huge perf margin, then prefer CACHING@x=0.5 (0.285) over HYBRID@x=0.9 (0.119) even on workloads where HYBRID@0.9 is plainly faster in reality. If the empirical sweep really shows a step at x=1.0 because the runtime drops the host tier, please call that out as "x=1.0 = HBM-only kernel, not the same algorithm as x=0.99" — and consider whether the enumerator should emit a discrete mode={HBM_ONLY, HYBRID, CACHING} axis rather than packing the discontinuity into the same x knob.
| for table_i in range(1, table_count): | ||
| prev_dp = dp[table_i - 1] | ||
| cur_dp = dp[table_i] | ||
| cur_back = backtrack[table_i] | ||
| hbm_t = hbm_by_fqn[table_i] | ||
| ddr_t = ddr_by_fqn[table_i] | ||
| perf_t = perf_by_fqn[table_i] | ||
| for h in range(hbm_bins): | ||
| prev_dp_h = prev_dp[h] | ||
| for d in range(ddr_bins): | ||
| prev_state = prev_dp_h[d] | ||
| prev_perf = prev_state[0] | ||
| if prev_perf == _INF: | ||
| continue | ||
| prev_h_val = prev_state[1] | ||
| prev_d_val = prev_state[2] | ||
| for opt_j in range(option_count): | ||
| delta_perf = perf_t[opt_j] | ||
| if delta_perf == _INF: | ||
| continue | ||
| new_h = prev_h_val + hbm_t[opt_j] | ||
| new_d = prev_d_val + ddr_t[opt_j] | ||
| if new_h >= hbm_bins or new_d >= ddr_bins: | ||
| continue | ||
| new_h_i = int(new_h) | ||
| new_d_i = int(new_d) | ||
| new_perf = prev_perf + delta_perf | ||
| if cur_dp[new_h_i][new_d_i][0] > new_perf: | ||
| cur_dp[new_h_i][new_d_i] = (new_perf, new_h, new_d) | ||
| cur_back[new_h_i][new_d_i] = (opt_j, h, d) |
There was a problem hiding this comment.
2D DP cost scales with world size — consider clamping total bins. hbm_bins = 100 · num_devices and ddr_bins = 50 · num_devices. On a 16-GPU world the inner triple-loop is 1600 × 800 × option_count pure-Python iterations per table; with 100 dynamicemb tables (20 options each) that's a few-billion-op DP, and dp + backtrack allocate ~256M tuples. Even on smaller worlds the bin counts multiply faster than necessary — a per-device discretization isn't really needed since the budget is a single scalar per axis. Consider clamping totals (e.g. hbm_bins = min(hbm_bins_per_device · num_devices, 2_000)) or vectorizing the transition with NumPy. The defaults are fine for 2–4 GPUs but worth a guardrail before the next user hits this.
| class DynamicProgrammingProposerTest(unittest.TestCase): | ||
| """2D DP across HBM × DDR picks per-table mode under joint budgets.""" | ||
|
|
||
| def _run(self, search_space, topology): | ||
| proposer = DynamicProgrammingProposer( | ||
| hbm_bins_per_device=20, ddr_bins_per_device=20 | ||
| ) | ||
| proposer.load(search_space) | ||
| # First propose returns the lowest-mem-per-table seed. | ||
| proposer.propose() | ||
| proposer.feedback(partitionable=True, storage_constraint=topology) | ||
| proposals = [] | ||
| proposal = proposer.propose() | ||
| while proposal: | ||
| proposals.append(proposal) | ||
| proposer.feedback(partitionable=True, storage_constraint=topology) | ||
| proposal = proposer.propose() | ||
| return proposals | ||
|
|
||
| def test_caching_preferred_when_ddr_is_generous(self): | ||
| # Three options for one table: | ||
| # HYBRID @ x=1.0: hbm = T, ddr = 0, perf = high (HBM-only) | ||
| # HYBRID @ x=0.1: hbm = .1T, ddr = .9T, perf = high (slow) | ||
| # CACHING @ x=0.1: hbm = .1T, ddr = T, perf = low (fast — modeled hits) | ||
| opts = [ | ||
| _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0), | ||
| _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0), | ||
| _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0), | ||
| ] | ||
| topology = _make_topology( | ||
| num_devices=2, hbm_per_device=2000, ddr_per_device=2000 | ||
| ) | ||
| proposals = self._run(opts, topology) | ||
| # Best plan must be the CACHING option (perf=10). | ||
| best = min(proposals, key=lambda p: sum(o.total_perf for o in p)) | ||
| self.assertEqual(best[0].total_perf, 10.0) | ||
|
|
||
| def test_caching_rejected_when_ddr_is_tight(self): | ||
| # Host budget is only 950 — CACHING (ddr=1000) cannot fit; HYBRID can. | ||
| opts = [ | ||
| _FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0), | ||
| _FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0), | ||
| ] | ||
| topology = _make_topology( | ||
| num_devices=1, hbm_per_device=2000, ddr_per_device=950 | ||
| ) | ||
| proposals = self._run(opts, topology) | ||
| # Every proposed plan must pick the HYBRID option (perf=80). | ||
| for p in proposals: | ||
| self.assertEqual(p[0].total_perf, 80.0) | ||
|
|
||
| def test_high_factor_collapses_modes(self): | ||
| # At x=1.0 HYBRID == CACHING in HBM and CACHING.ddr = T = HYBRID.hbm. | ||
| # If we offer just the high-factor options, DP picks one of them. | ||
| opts = [ | ||
| _FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0), # HYBRID x=1.0 | ||
| _FakeShardingOption( | ||
| "table_a", hbm=1000, ddr=1000, perf=50.0 | ||
| ), # CACHING x=1.0 | ||
| ] | ||
| topology = _make_topology( | ||
| num_devices=1, hbm_per_device=1100, ddr_per_device=2000 | ||
| ) | ||
| proposals = self._run(opts, topology) | ||
| # Either option is fine — they're tied. Just verify the proposer | ||
| # returned something feasible. | ||
| self.assertGreater(len(proposals), 0) | ||
| for p in proposals: | ||
| self.assertEqual(p[0].total_perf, 50.0) |
There was a problem hiding this comment.
Missing the headline multi-table 2D-DP scenario. All three DynamicProgrammingProposerTest cases use a single table, so the cross-table DP transition at plan_util.py:434–463 is exercised only at table_i == 0. The PR's value prop is "pick CACHING for one table and HYBRID for another under joint HBM+DDR budgets" — a 2-table fixture where global DDR can only hold one CACHING but plenty of HBM is available would directly cover that. Same gap for the mem_bins_per_device legacy alias (untested) and the all-options-pruned / empty-search-space degenerate paths.
| only_values=only_values, | ||
| bucket_capacity=self.BUCKET_CAPACITY, | ||
| caching=caching, | ||
| ) |
There was a problem hiding this comment.
Direct unit test for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts is missing. Both have non-trivial branches — admission_counter HBM accounting (dynamicemb_util.py:663–672), the compute_device == "cpu" and not is_inference DDR branch (line 697), and the cache_params swap+restore (line 502–514) — that are reachable only via the E2E test, which is gated on has_dynamicemb and cuda. On a CPU dev box none of this is covered; a fake-ed _orig_build_shard_perf_contexts would let you verify the boost is applied and (more importantly) that the original cache_params is restored even when the wrapped call raises.
Review summarySolid PR overall — the 2D DP rewrite is clean, the mode-aware storage formula is well-tested, and the empirical perf-model docstring is refreshingly honest about its provenance. A few things worth addressing before merge (inline above): Should fix
Worth a follow-up
Docs
No security/multi-process concerns: the monkey-patches run under the import lock and each rank is its own process. |
`copy.deepcopy(base_option)` on the first line of each variant body already copies `dynamicemb_options` because the caller in `enumerate()` attaches it onto `base_option` before invoking `_emit_dynamicemb_variants`. The second `copy.deepcopy(dynamicemb_options)` was wasted work — at 20 variants per dynamicemb table it doubled a non-trivial copy. Drop it and mutate the already-fresh per-variant `dynamicemb_options.caching` directly. PR alibaba#508 review R3 (github-actions, 2026-05-13). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… blowup The 2D DP discretizes each memory axis into `bins_per_device * num_devices` bins. On a 16-GPU world that becomes 1600 hbm bins × 800 ddr bins = 1.28M cells per table; with 100 dynamicemb tables and 20 options each, the inner Python loop is a few-billion-op DP and dp + backtrack tuples consume hundreds of MB. The HBM and DDR budgets are scalar, so multiplying bins by num_devices stops adding meaningful precision past a point. Clamp the per-axis bin count at `_DP_AXIS_BIN_CAP` (1024) to keep the multi-host case tractable. NumPy vectorization of the inner DP loop is a bigger refactor; leaving a TODO referencing PR alibaba#508 review R5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation * DynamicProgrammingProposer class docstring: "per-rank DDR" -> "per-rank HBM and per-machine DDR" so it matches the machine-DDR prune introduced in commit 38bd7ad (PR alibaba#508 review R4). * _dynamicemb_effective_cache_ratio: call out that the discontinuity at x=1.0 is a deliberate kernel switch (HBM-only DynamicEmbStorage vs dual-tier HybridStorage/DynamicEmbCache), not a smoothing artifact. Note a future refactor could lift this to a discrete `mode` axis on the enumerator (PR alibaba#508 review R2). * dynamicemb_calculate_shard_storages caching_ratio Args: rewrite the stale "ratio of HBM to DDR memory for UVM caching" framing -- under the dynamicemb modes, host always holds the full backing in CACHING, and caching_ratio sizes the HBM hot-row cache (top-level docs note). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch space Add three DynamicProgrammingProposerTest methods that the original PR alibaba#508 review (R6) flagged as missing: test_two_tables_pick_mixed_modes_under_joint_budget Two tables under a topology where both-HYBRID is HBM-infeasible and both-CACHING is DDR-infeasible. The DP must pick one of each. This exercises the cross-table transition at plan_util.py table_i == 1 which the existing single-table cases never reach. test_legacy_mem_bins_per_device_alias Verifies the back-compat kwarg `mem_bins_per_device` still wins over the new `hbm_bins_per_device` when both are passed. test_empty_search_space_returns_empty_proposal Edge case where `load([])` short-circuits via `table_count == 0`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper tests Combine PR alibaba#508 review R1 + R7. The new ``BuildShardPerfContextsWrapperTest.test_cache_params_restored_on_exception`` exists to pin the bug R1 describes: when ``_orig_build_shard_perf_contexts`` raises, the boosted ``x_eff`` clone of ``sharding_option.cache_params`` must not leak to the downstream storage estimator's view of the same option. The corresponding ``try/finally`` around the wrapped call closes the leak. Reverts the previous "single return, no try/finally" simplification. Also adds: * ``BuildShardPerfContextsWrapperTest`` — direct unit tests for the ShardPerfContext.build_shard_perf_contexts monkey-patch covering: boost applied for caching, boost applied for hybrid, no boost when the option has no dynamicemb_options, cache_params restored on success, and the restore-on-exception case driving R1. * ``DynamicEmbCalcShardStoragesTest.test_admission_counter_increases_hbm`` — direct cover for the admission_counter HBM accounting branch in ``dynamicemb_calculate_shard_storages`` that the CUDA-gated E2E test doesn't exercise on a CPU dev box. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the DP proposer commits a (mode, cache_load_factor) per dynamicemb table, log one INFO line on rank 0 that names the runtime mode the dispatcher in BatchedDynamicEmbeddingTablesV2._create_cache_storage will actually pick: total_value_memory <= local_hbm_for_values -> HBM_ONLY else if caching -> CACHING else -> HYBRID Example: [dynamicemb plan] sparse.ebc.user_emb: mode=CACHING cache_load_factor=0.30 local_hbm_for_values=2.400GiB [dynamicemb plan] sparse.ebc.item_emb: mode=HBM_ONLY cache_load_factor=1.00 local_hbm_for_values=4.000GiB Replaces the previous "have to grep _print_memory_consume from the runtime to find out what mode the planner picked" workflow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream ``_print_memory_consume`` is mode-blind and always computes ``dram = total - hbm``, so for CACHING tables it under-reports the host backing (the actual storage_options keep ``max_capacity`` unchanged, i.e. host holds the full table). Surface the correct expectation in our own log so users can verify CACHING vs HYBRID storage from the planner side: HBM_ONLY: dram = 0 CACHING: dram = total_value_memory # full backing HYBRID: dram = total - local_hbm_for_values # partition Example for cat_0_emb @ 0.20 with Adam: mode=HYBRID local_hbm=0.954GiB local_dram=3.815GiB (= 0.8 * total) mode=CACHING local_hbm=0.954GiB local_dram=4.769GiB (= total) If the runtime memory print shows DRAM = 3.815GiB for a table logged as CACHING here, that is the print's mode-blind formula -- the real host allocation is the log's local_dram_for_values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…omputing The storage estimator (``dynamicemb_calculate_shard_storages`` -> ``_calculate_dynamicemb_storage_specific_sizes`` -> ``_calculate_dynamicemb_table_storage_specific_size``) already populates ``shards[i].storage.ddr`` per option for the chosen mode: HYBRID: ddr_specific = (1 - cache_ratio) * total_value_memory CACHING: ddr_specific = total_value_memory For cuda compute_device, ``Storage.ddr = ddr_specific`` (no I/O overhead added). So the value the planner already computed equals what the log line needs -- no point re-running ``_calculate_dynamicemb_table_storage_specific_size``. Read ``shards[0].storage.ddr`` directly. Identical numbers, less code, no second call to the storage formula. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups on the dynamicemb plan log: * Rename keys: ``local_hbm_for_values`` -> ``local_hbm`` and ``local_dram_for_values`` -> ``local_dram``. Shorter, still unambiguous in context. * Switch HBM source to ``shards[0].storage.hbm`` for symmetry with the DDR side and to surface the true HBM footprint (values + metadata + counter + pipeline I/O), not just the runtime config knob. Upstream ``_print_memory_consume`` strips metadata from its HBM column too; this log reports the honest total. * Bump tzrec to 1.2.13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| if cache_load_factor >= 1.0: | ||
| mode = "HBM_ONLY" | ||
| dram_bytes = 0 # runtime drops the host tier | ||
| else: | ||
| mode = "CACHING" if dynamicemb_options.caching else "HYBRID" | ||
| dram_bytes = shards[0].storage.ddr |
There was a problem hiding this comment.
Storage estimator ↔ runtime ↔ log are not coherent at cache_load_factor >= 1.0.
The log overrides dram_bytes = 0 here because the runtime drops the host tier when the table fits HBM. But the storage estimator at _calculate_dynamicemb_table_storage_specific_size (line ~340) still returns total_value_memory DDR for CACHING mode regardless of cache_ratio — i.e. CACHING@1.0 reports a full backing-store DDR cost that the 2D DP then uses for pruning. Result: the DP can mark CACHING@1.0 plans infeasible on DDR-tight topologies even though they would actually fit at runtime, and the log will then report local_dram=0.000GiB for any plan that does make it through.
Suggest aligning the storage formula with the runtime: when cache_ratio >= 1.0, force CACHING's DDR to 0 too. Then the storage estimator, the DP, and this log all agree.
| # back-compat alias: ``mem_bins_per_device`` mapped to the HBM axis. | ||
| bins_h = ( | ||
| mem_bins_per_device | ||
| if mem_bins_per_device is not None | ||
| else hbm_bins_per_device | ||
| ) | ||
| self._hbm_bins_per_device: int = max(bins_h, 1) |
There was a problem hiding this comment.
When the caller passes both mem_bins_per_device=X and hbm_bins_per_device=Y, the legacy alias silently wins — test_legacy_mem_bins_per_device_alias even codifies this. That's the surprising direction; a user supplying the new name explicitly while a config layer injects the legacy one will get the legacy value with no warning. The class docstring also reads as a one-way alias ("Accepts legacy mem_bins_per_device as an alias").
Suggest either raising TypeError when both are non-None, or letting hbm_bins_per_device win and emitting a DeprecationWarning for mem_bins_per_device. Whatever the choice, document it in the Args: block.
| # Enumerate proposals in decreasing total memory order. Total memory | ||
| # is a tie-break heuristic — Pareto-optimal (perf, hbm, ddr) frontier | ||
| # tends to live at the high end, so larger-mem cells are explored | ||
| # first and small-mem fallbacks come later. | ||
| self._proposal_list = [] | ||
| last_back = backtrack[table_count - 1] | ||
| coords = sorted( | ||
| ( | ||
| (h, d) | ||
| for h in range(hbm_bins) | ||
| for d in range(ddr_bins) | ||
| if last_back[h][d][0] >= 0 | ||
| ), | ||
| key=lambda hd: hd[0] + hd[1], | ||
| reverse=True, | ||
| ) |
There was a problem hiding this comment.
The class docstring (line 255) says backtracking "yields Pareto-optimal proposals," but this loop actually enumerates every reachable (h, d) cell at the last table — up to hbm_bins × ddr_bins ≈ 800k+ proposals at the cap. Downstream consumers iterate propose()/feedback() once per proposal, so a planner run on a large world can churn through hundreds of thousands of plans rather than a small Pareto frontier.
Two minor things compound this:
- The sort key
hd[0] + hd[1]weighs the two axes equally even thoughhbm_bin_sizeandddr_bin_sizediffer — so "decreasing total memory" isn't really sorted by memory. last_back[h][d][0] >= 0is the only filter; nothing prunes dominated cells.
Either tighten to the Pareto frontier (drop (h, d) if some (h', d') with h' ≤ h, d' ≤ d, perf' ≤ perf exists), cap len(self._proposal_list), or update the docstring to say "all reachable plans, larger memory first."
| def _dynamicemb_aware_build_shard_perf_contexts( | ||
| cls, # pyre-ignore [2] | ||
| config, # pyre-ignore [2] | ||
| shard_sizes, # pyre-ignore [2] | ||
| sharding_option, # pyre-ignore [2] | ||
| topology, # pyre-ignore [2] | ||
| constraints, # pyre-ignore [2] | ||
| sharder, # pyre-ignore [2] |
There was a problem hiding this comment.
The wrapper hard-codes the first six positional parameters (config, shard_sizes, sharding_option, topology, constraints, sharder) and forwards everything else via *args, **kwargs. If torchrec renames sharding_option, reorders these six, or splits one into multiple, the wrapper either silently mis-dispatches (the boost goes into the wrong object) or raises an opaque TypeError deep inside the planner with no link back to this file.
Worth capturing inspect.signature(_orig_build_shard_perf_contexts) at patch time, asserting the first six parameter names match, and otherwise logging a warning + installing a pass-through. A pinned/asserted torchrec version range somewhere visible would also help.
| # bins/dev = 1600 bins, * 800 ddr bins = 1.28M cells per table). The | ||
| # budgets are scalar (total HBM, total DDR), so multiplying bins by | ||
| # num_devices stops adding precision past this cap. NumPy vectorization | ||
| # of the inner loop is a follow-up (PR #508 review R5). |
There was a problem hiding this comment.
(PR #508 review R5) looks like an in-flight review artifact — PR numbers and ad-hoc reviewer ticket IDs aren't useful once merged (git blame is the canonical pointer). Suggest dropping the parenthetical or replacing with an actual follow-up issue number.
| def _emit_dynamicemb_variants( | ||
| base_option: ShardingOption, | ||
| dynamicemb_options: Any, | ||
| ) -> List[ShardingOption]: | ||
| """Expand a dynamicemb ShardingOption into HYBRID + CACHING variants. | ||
|
|
||
| Sweeps both placement modes (``caching=False`` and ``caching=True``) and, | ||
| when ``base_option.cache_params`` is unset, ten cache_load_factor values | ||
| (0.1, 0.2, ..., 1.0). The downstream 2D DP proposer picks per table the | ||
| best (mode, ratio) that fits both HBM and host topology budgets. | ||
|
|
||
| Each returned ShardingOption owns a freshly deep-copied | ||
| ``dynamicemb_options`` instance so per-option ``caching`` mutations do not | ||
| bleed across variants. | ||
| """ | ||
| if base_option.cache_params is None: | ||
| load_factors = [(i + 1) / 10 for i in range(10)] | ||
| stats = None | ||
| else: | ||
| load_factors = [base_option.cache_params.load_factor] | ||
| stats = base_option.cache_params.stats | ||
| variants: List[ShardingOption] = [] | ||
| for caching_mode in (False, True): | ||
| for load_factor in load_factors: | ||
| opt = copy.deepcopy(base_option) | ||
| opt.cache_params = CacheParams(load_factor=load_factor, stats=stats) | ||
| # deepcopy(base_option) already produced a fresh dynamicemb_options. | ||
| opt.dynamicemb_options.caching = caching_mode # pyre-ignore [16] | ||
| variants.append(opt) |
There was a problem hiding this comment.
The dynamicemb_options parameter is never read — the function relies entirely on base_option.dynamicemb_options via copy.deepcopy(base_option) and opt.dynamicemb_options.caching = .... The caller also already assigns sharding_option.dynamicemb_options = dynamicemb_options immediately before invocation, so the argument is doubly redundant.
Either drop the parameter, or use it as the canonical source so callers don't have to mutate base_option first.
| for device in storage_constraint.devices: | ||
| max_device_hbm = max(max_device_hbm, device.storage.hbm or 0) | ||
| hbm_total += device.storage.hbm or 0 | ||
| ddr_total += device.storage.ddr or 0 | ||
| # DDR is host-shared across ranks co-located on one machine, so the | ||
| # per-option fit check compares against the largest machine's DDR pool | ||
| # -- not per-device. HBM is GPU-local, so its prune stays per-device. | ||
| per_host = getattr(storage_constraint, "local_world_size", None) or num_devices | ||
| per_host = max(per_host, 1) | ||
| max_machine_ddr = 0 | ||
| for host_start in range(0, num_devices, per_host): |
There was a problem hiding this comment.
This per-machine DDR pruning is the load-bearing reason for the 2D rewrite, but it's not exercised in tests — _make_topology in plan_util_test.py accepts local_world_size but every test falls through to the default (local_world_size or num_devices), so per_host == num_devices in every case. A bug where max_machine_ddr collapses to ddr_total or the window stride is off-by-one wouldn't be caught.
Worth a multi-host test: e.g. num_devices=4, local_world_size=2, ddr_per_device=500 (so each machine has 1000 DDR, total 2000), then assert an option whose shard.ddr=1500 is pruned even though 1500 < ddr_total.
| # Per-table mode log on rank 0. cache_load_factor=1.0 forces | ||
| # the runtime into HBM_ONLY (host tier dropped) regardless of | ||
| # ``caching`` -- mirror that override here so the log matches | ||
| # the runtime, not the planner's recorded (caching, factor). | ||
| if int(os.environ.get("RANK", 0)) == 0: | ||
| cache_load_factor = float(sharding_option.cache_load_factor) | ||
| if cache_load_factor >= 1.0: | ||
| mode = "HBM_ONLY" | ||
| dram_bytes = 0 # runtime drops the host tier | ||
| else: | ||
| mode = "CACHING" if dynamicemb_options.caching else "HYBRID" | ||
| dram_bytes = shards[0].storage.ddr | ||
| hbm_gib = shards[0].storage.hbm / (1 << 30) | ||
| dram_gib = dram_bytes / (1 << 30) | ||
| fqn = f"{sharding_option.path}.{sharding_option.name}" | ||
| logger.info( | ||
| f"[dynamicemb plan] {fqn}: mode={mode} " | ||
| f"cache_load_factor={cache_load_factor:.2f} " | ||
| f"local_hbm={hbm_gib:.3f}GiB " | ||
| f"local_dram={dram_gib:.3f}GiB" | ||
| ) |
There was a problem hiding this comment.
This ~25-line log block has zero coverage in dynamicemb_util_test.py / plan_util_test.py. Three independent decision points worth a test each:
cache_load_factor >= 1.0HBM_ONLY override (float-precision regressions would silently mis-label).int(os.environ.get("RANK", 0)) == 0gating (a regression toos.environ.get("RANK") == "0"would fail on the typical single-rank case whereRANKis unset).1 << 30GiB divisor (a regression to1e9would not be caught).
A mock.patch.object(logger, "info") test driving a synthetic ShardingOption through _to_sharding_plan would cover all three cheaply.
Code review summary (5 reviewer agents)Substantial, well-tested planner change. The 2D DP refactor is clean, the empirical-fit docstring on Correctness / coherence
Robustness
Test gaps (also flagged inline where relevant)
Minor
Performance note (no inline)
🤖 Generated with Claude Code |
Four cleanups to ``DynamicProgrammingProposer`` from PR alibaba#508 review round 2: * **Proposal emission (R3):** the previous backtracking enumerated every reachable ``(h, d)`` cell at the last table -- up to ``hbm_bins × ddr_bins`` ≈ 200k+ entries. For each HBM bin only the lowest-perf DDR cell matters (other cells at the same ``h`` are strictly dominated -- same HBM cost, ≥ perf). Emit one proposal per HBM bin in decreasing HBM order; proposal count caps at ``hbm_bins ≤ _DP_AXIS_BIN_CAP``. * **Drop legacy alias (R2):** delete ``mem_bins_per_device`` kwarg from ``__init__``. No in-tree caller passes it; carrying it with a silent precedence rule was a landmine. * **Drop unused parameter (R6):** ``_emit_dynamicemb_variants`` second parameter was never read; the function uses ``copy.deepcopy(base_option).dynamicemb_options`` instead. Drop the parameter and update the single caller. * **Trim comment (R5):** drop the trailing two lines of the ``_DP_AXIS_BIN_CAP`` block (the "stops adding precision" repeat and the dangling PR-ID parenthetical). Also update class docstring to describe the new "one per HBM bin" emission (was "Pareto-optimal proposals", which was inaccurate). Drop the now-obsolete ``test_legacy_mem_bins_per_device_alias``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The enumerator bounds ``cache_load_factor`` to ``{0.1, 0.2, ..., 1.0}``,
so the HBM_ONLY-override boundary is the exact value 1.0. Using ``>=
1.0`` would silently relabel any leaked-in ``cache_load_factor > 1.0``
(e.g. from a misconfigured DynamicEmbedding proto) as HBM_ONLY instead
of letting the upstream invariant violation surface loudly downstream
(a negative HYBRID DDR via ``1.0 - 1.5``).
PR alibaba#508 review N1-trimmed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **N7 -- multi-host DDR prune.** All prior ``DynamicProgrammingProposerTest`` cases used ``local_world_size == num_devices``, so ``per_host == num_devices`` collapsed the per-machine window to the whole world; the load-bearing multi-host pruning was effectively untested. ``test_per_machine_ddr_prune_on_multi_host_topology`` exercises 4 GPUs across 2 machines (each with 1000 MB DDR, total 2000): asserts an option with ``ddr=1500`` per shard is pruned (exceeds per-machine cap) and an option with ``ddr=900`` survives. * **N8 -- plan log coverage.** Extract the per-table log block into ``_log_dynamicemb_table_plan`` so the three decision points (``cache_load_factor == 1.0`` HBM_ONLY override, RANK env gating, GiB divisor) are testable without reconstructing a full ShardingOption pipeline. New ``PlanLogLineTest`` covers: HYBRID/CACHING at partial factor, HBM_ONLY at x=1.0 in both modes (override path), non-rank-0 silence, and a ``cache_load_factor=1.0001`` regression case that asserts the exact-equality check does NOT swallow a >1.0 escape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ocstring * Drop four test classes from ``dynamicemb_util_test.py``: ``EffectiveCacheRatioTest``, ``BuildShardPerfContextsWrapperTest``, ``DynamicEmbCalcShardStoragesTest``, ``PlanLogLineTest``. Coverage of the perf model, the build_shard_perf_contexts monkey-patch, the storage estimator, and the per-table log line is provided end-to-end by ``PlanUtilDynamicEmbE2ETest`` in ``plan_util_test.py``. Only ``StorageFormulaTest`` (the pure HBM/DDR byte arithmetic) is worth keeping as a unit test -- the rest is e2e territory. * Rewrite ``DynamicProgrammingProposer`` class docstring to match master's style: open with a one-line summary, frame the problem as matrices ``A`` / ``B`` with discretized state ``dp[i][f(k)]``, give the state-transition equation in LaTeX, and end with an ``Args`` block typed in parentheses. Extends master's 1D formulation to the 2D (HBM × DDR) case with cost matrices ``A^h``, ``A^d`` and budgets ``K_h``, ``K_d``, and notes the per-HBM-bin proposal emission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Make tzrec's embedding sharding planner take HBM and host-memory limits from the topology and decide per dynamicemb table whether to use HYBRID (HBM + DRAM hash-partitioned) or CACHING (HBM cache + full DRAM backing) mode. Per the consensus on NVIDIA/recsys-examples#306 (closed), both placement modes are kept and the planner — not the user — picks the best one under the joint budget.
{ HYBRID, CACHING } × { cache_load_factor 0.1, ..., 1.0 }. The base ShardingOption is deep-copied once per variant;dynamicemb_options.cachingis then flipped in-place on the already-fresh per-variant copy.(1 − cache_load_factor) · T, CACHING DDR =T(full backing store). HBM accounting and hash-table metadata stay HBM-side only, matching the existing tzrec convention.experiments/sweep_20260513-161030/full_a10gpu1.json: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes — HYBRID@1.0 = 0.80 ms (HBM-only fast path), CACHING@<1.0 = 2.63 ms (~3.3x slower), HYBRID@<1.0 = 5.44 ms (~6.8x slower). Inverting the linear bw model with torchrec defaults givesx_effbases 0.28 (CACHING) and 0.11 (HYBRID), withx_eff = 1.0atcache_load_factor = 1.0reflecting the runtime kernel switch. A small+0.01·xtiebreaker preserves a strict DP ranking within each block._DP_AXIS_BIN_CAPregardless ofddr_bins._to_sharding_planemits one INFO line per dynamicemb table on rank 0 so the picked mode is visible without grepping the runtime print. Numbers come from the storage estimator'sshards[0].storage.{hbm,ddr}(true HBM footprint including hash-table metadata + counter + pipeline I/O; mode-aware DDR), with an HBM_ONLY override atcache_load_factor == 1.0to reflect what the runtime actually does:_print_memory_consumeis mode-blind (alwaysdram = total − hbm) and strips metadata from HBM; this log line has the honest numbers.Topology.hbm_cap/Topology.ddr_capminus the existing reservation.Test plan
tzrec/utils/dynamicemb_util_test.py— storage formula + empirical perf-model + direct wrapper tests +PlanLogLineTestcovering HBM_ONLY-override / RANK gating / GiB unit /cache_load_factor>1.0regressiontzrec/utils/plan_util_test.py— variant emission + 2D DP unit tests including multi-table mixed-modes, multi-host (4 GPUs × 2 hosts) per-machine DDR prune, empty search space, and the headline DP testsplan_util_test.py)pre-commit run --files <changed>cleancachingflag,local_hbm/local_dramare the true footprints (the upstream runtime print under-reports both)torchrun --nproc-per-node=2 -m tzrec.train_eval ...🤖 Generated with Claude Code