Skip to content

[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508

Open
tiankongdeguiji wants to merge 24 commits into
alibaba:masterfrom
tiankongdeguiji:dynamicemb-hybrid-caching-planner
Open

[feat] planner: pick dynamicemb HYBRID vs CACHING from topology budgets#508
tiankongdeguiji wants to merge 24 commits into
alibaba:masterfrom
tiankongdeguiji:dynamicemb-hybrid-caching-planner

Conversation

@tiankongdeguiji
Copy link
Copy Markdown
Collaborator

@tiankongdeguiji tiankongdeguiji commented May 12, 2026

Summary

Make tzrec's embedding sharding planner take HBM and host-memory limits from the topology and decide per dynamicemb table whether to use HYBRID (HBM + DRAM hash-partitioned) or CACHING (HBM cache + full DRAM backing) mode. Per the consensus on NVIDIA/recsys-examples#306 (closed), both placement modes are kept and the planner — not the user — picks the best one under the joint budget.

  • Enumerator emits 20 ShardingOptions per dynamicemb table: { HYBRID, CACHING } × { cache_load_factor 0.1, ..., 1.0 }. The base ShardingOption is deep-copied once per variant; dynamicemb_options.caching is then flipped in-place on the already-fresh per-variant copy.
  • Storage estimator is mode-aware: HYBRID DDR = (1 − cache_load_factor) · T, CACHING DDR = T (full backing store). HBM accounting and hash-table metadata stay HBM-side only, matching the existing tzrec convention.
  • Perf model is fitted from an on-device dynamicemb sweep (experiments/sweep_20260513-161030/full_a10gpu1.json: 4M-row table, dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency clustered into three regimes — HYBRID@1.0 = 0.80 ms (HBM-only fast path), CACHING@<1.0 = 2.63 ms (~3.3x slower), HYBRID@<1.0 = 5.44 ms (~6.8x slower). Inverting the linear bw model with torchrec defaults gives x_eff bases 0.28 (CACHING) and 0.11 (HYBRID), with x_eff = 1.0 at cache_load_factor = 1.0 reflecting the runtime kernel switch. A small +0.01·x tiebreaker preserves a strict DP ranking within each block.
  • DynamicProgrammingProposer is now 2D: discretizes per-rank HBM (100 bins/dev) and per-machine DDR (50 bins/dev) and picks per-table options that minimize total perf under both budgets. HBM prune is per-device, DDR prune is per-machine (DDR is host-shared across co-located ranks). Per-axis bin count is capped at 1024 to avoid blow-up on multi-host worlds. Backtracking emits one proposal per HBM bin — the perf-best plan across all DDR bins at that level — so the proposal count is at most _DP_AXIS_BIN_CAP regardless of ddr_bins.
  • Per-table planning log. _to_sharding_plan emits one INFO line per dynamicemb table on rank 0 so the picked mode is visible without grepping the runtime print. Numbers come from the storage estimator's shards[0].storage.{hbm,ddr} (true HBM footprint including hash-table metadata + counter + pipeline I/O; mode-aware DDR), with an HBM_ONLY override at cache_load_factor == 1.0 to reflect what the runtime actually does:
    [dynamicemb plan] …cat_0_emb: mode=CACHING cache_load_factor=0.20 local_hbm=1.620GiB local_dram=2.384GiB
    [dynamicemb plan] …cat_5_emb: mode=HBM_ONLY cache_load_factor=1.00 local_hbm=4.000GiB local_dram=0.000GiB
    
    Upstream _print_memory_consume is mode-blind (always dram = total − hbm) and strips metadata from HBM; this log line has the honest numbers.
  • No new proto fields — budgets come straight from Topology.hbm_cap / Topology.ddr_cap minus the existing reservation.
  • Version bump to 1.2.13.

Test plan

  • tzrec/utils/dynamicemb_util_test.py — storage formula + empirical perf-model + direct wrapper tests + PlanLogLineTest covering HBM_ONLY-override / RANK gating / GiB unit / cache_load_factor>1.0 regression
  • tzrec/utils/plan_util_test.py — variant emission + 2D DP unit tests including multi-table mixed-modes, multi-host (4 GPUs × 2 hosts) per-machine DDR prune, empty search space, and the headline DP tests
  • CUDA-gated end-to-end enumerate + DP test (folded into plan_util_test.py)
  • pre-commit run --files <changed> clean
  • Verified the planning log against a 27-table 5-CACHING-table real model run; mode column matches the chosen caching flag, local_hbm / local_dram are the true footprints (the upstream runtime print under-reports both)
  • Smoke test on a real dynamicemb config end-to-end via torchrun --nproc-per-node=2 -m tzrec.train_eval ...

🤖 Generated with Claude Code

tiankongdeguiji and others added 10 commits May 12, 2026 20:28
…ING)

Extend `_calculate_dynamicemb_table_storage_specific_size` and its callers
with a `caching: bool` parameter. HYBRID keeps the original split formula
(DDR = (1 - cache_ratio) * T); CACHING accounts the full backing store on
host (DDR = T) regardless of cache_ratio. HBM accounting and metadata
(key + score + digest + bucket header) stay identical between modes.

The flag is sourced from `dynamicemb_options.caching` on the ShardingOption
in `dynamicemb_calculate_shard_storages`. Subsequent commits wire this into
the enumerator (per-table sweep over both modes) and the proposer (2D DP).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `_dynamicemb_effective_cache_ratio(cache_load_factor, caching, stats)`:
HYBRID passes the ratio through (hash-partitioned HBM hit probability);
CACHING boosts to `min(1.0, x * multiplier)`, default multiplier 10.0 via
the `TZREC_CACHING_HIT_RATE_MULTIPLIER` env var. `cache_params.stats`, when
provided, takes precedence — `1 - stats.expected_miss_rate(x)` — and is
clamped to never fall below the HYBRID ratio.

Inject the boost at planning time by monkey-patching
`ShardPerfContext.build_shard_perf_contexts` to temporarily replace the
sharding option's `cache_params.load_factor` with the effective ratio.
The original cache_params is restored on return so the storage estimator
(which reads cache_load_factor for HBM/DDR sizing) still sees the
unboosted value.

Net effect: at the same `cache_load_factor`, CACHING is strictly cheaper
in perf and strictly more expensive in DDR. The DP proposer (next commit)
trades one for the other against the topology budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extract the per-dynamicemb-table variant emission into
`_emit_dynamicemb_variants(base, dynamicemb_options) -> List[ShardingOption]`
and sweep both modes per cache_load_factor. When `cache_params` is unset,
the helper emits 20 variants (10 factors × 2 modes); when fixed by the
caller, it emits 2 (both modes at the fixed factor). Each variant owns a
deep-copied `dynamicemb_options` so per-variant `caching` mutations stay
isolated. `cache_params.stats` is preserved across all clones for the
perf-side miss-rate override.

The downstream 2D DP proposer (next commit) selects per table the (mode,
ratio) pair that fits both HBM and host budgets while minimizing perf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite `DynamicProgrammingProposer.feedback` to discretize both per-rank
HBM and per-rank DDR into bins (default 100 × 50 per device) and pick per
table the (mode, cache_load_factor) pair that minimizes total perf while
fitting both topology budgets.

State:
  dp[table][hbm_bin][ddr_bin]      = (perf, hbm_actual, ddr_actual)
  backtrack[table][hbm_bin][ddr_bin] = (opt_id, prev_hbm_bin, prev_ddr_bin)

Options whose largest shard exceeds either per-device cap are pruned
upfront. Backtracking enumerates feasible (hbm_total, ddr_total) cells in
decreasing total memory, yielding Pareto-optimal proposals. The legacy
`mem_bins_per_device` kwarg is preserved as an alias for `hbm_bins_per_device`
so existing callers still work, and HBM-degenerate (CPU-only) or
DDR-degenerate topologies collapse the unused axis to a single bin.

Adds `DynamicProgrammingProposer2DTest`:
  - generous DDR → CACHING wins (cheaper perf, host can fit T)
  - tight DDR → CACHING rejected, falls back to HYBRID
  - high cache_load_factor → modes converge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exercise the full planning pipeline on a tiny TestSparseNN backed by a
DynamicEmbParameterConstraints constraint. Asserts:
  - EmbeddingEnumerator yields 20 sharding options (2 modes × 10 factors)
  - All cache_load_factors and both caching modes are represented
  - Storage and perf estimators populate non-zero values for each option
  - DynamicProgrammingProposer.propose() returns feasible plans whose
    dynamicemb options carry a valid caching flag

Gated on has_dynamicemb + torch.cuda.is_available() since the dynamicemb
sharder requires CUDA at planning time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper

(1) Restore the byte-budget framing in `_calculate_dynamicemb_table_storage_specific_size`
docstring (`total_value_memory`, `num_buckets`, `hbm_budget`, `ddr_budget`)
that was lost in the PR alibaba#508 rewrite, extending `ddr_budget` to cover both
HYBRID and CACHING modes in one place.

(2) Simplify `_dynamicemb_aware_build_shard_perf_contexts`: single return,
no try/finally, no early-return branch. The unconditional
`sharding_option.cache_params = original_cache_params` line at the end is a
no-op when caching is False (we never mutated the field) and restores the
original cache_params when caching is True. If the wrapped call raises,
planning aborts and the option is discarded -- no need for try/finally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DDR is host-shared across ranks co-located on one machine, so the
per-option fit check in `DynamicProgrammingProposer.feedback` should
compare `max_shard_ddr` against the largest machine's total DDR pool, not
against any single rank's slice. HBM stays per-device (each GPU has its
own HBM; no cross-rank sharing).

Compute `max_machine_ddr` by summing per-device DDR within each contiguous
`local_world_size`-sized window. Update the test fake-topology fixture to
carry `local_world_size` (defaults to num_devices).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… copyright

(5) Move `PlanUtilDynamicEmbE2ETest` from the standalone
`tzrec/utils/plan_util_dynamicemb_e2e_test.py` into `plan_util_test.py`
where the rest of the plan_util tests already live. Delete the standalone
file.

(6) Rename `DynamicProgrammingProposer2DTest` -> `DynamicProgrammingProposerTest`.
The proposer class itself stays `DynamicProgrammingProposer`.

(7) Bump copyright `2024 -> 2026` in the new file
`tzrec/utils/dynamicemb_util_test.py`. The header in
plan_util_test.py stays at 2024 since it was a pre-existing file we only
extended.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urve

Replace the heuristic ``_dynamicemb_effective_cache_ratio`` curve with
constants fitted from an on-device perf sweep
(``experiments/sweep_20260513-161030/full_a10gpu1.json``: 4M-row table,
dim=128, adam, pow-law alpha=1.05, A10 GPU). Median fwd+bwd latency
clustered into three regimes:

  HYBRID @ x=1.0:    0.80 ms   (HBM-only fast path)
  CACHING @ x<1.0:   2.63 ms   (~3.3x slower)
  HYBRID  @ x<1.0:   5.44 ms   (~6.8x slower)

Inverting the linear bw model bw = x_eff*HBM + (1-x_eff)*HBM_TO_DDR
(torchrec defaults 897 / 32 GB/s) gives x_eff bases 0.28 (CACHING) and
0.11 (HYBRID). At x = 1.0 the runtime drops the host tier in both modes,
so x_eff = 1.0 there. A +0.01*x tiebreaker term keeps the DP's ranking
strict within each block.

Also extend the ``_dynamicemb_aware_build_shard_perf_contexts`` wrapper
to apply the empirical x_eff for HYBRID (previously HYBRID used raw
``cache_load_factor`` as x_eff -- inconsistent with the empirical data
that shows HYBRID @ 0.9 is much slower than HYBRID @ 1.0 due to the
storage backend switch).

Drop the now-unused ``TZREC_CACHING_HIT_RATE_MULTIPLIER`` env knob and
its related tests; rewrite ``EffectiveCacheRatioTest`` to assert the
empirical strict-block ranking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The same provenance is already covered in
``_dynamicemb_effective_cache_ratio``'s docstring below; the block
comment above the constants was redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label May 13, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label May 13, 2026
Comment thread tzrec/utils/dynamicemb_util.py Outdated
Comment on lines +515 to +527
result = _orig_build_shard_perf_contexts(
cls,
config,
shard_sizes,
sharding_option,
topology,
constraints,
sharder,
*args,
**kwargs,
)
sharding_option.cache_params = original_cache_params
return result
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restore should be in finally. If _orig_build_shard_perf_contexts raises, sharding_option.cache_params is permanently left as the boosted x_eff clone. Downstream consumers — the storage estimator and _to_sharding_plan at line 435 (cache_params=sharding_option.cache_params) — then see the wrong load factor on every other option that touches the same ShardingOption. Suggested:

Suggested change
result = _orig_build_shard_perf_contexts(
cls,
config,
shard_sizes,
sharding_option,
topology,
constraints,
sharder,
*args,
**kwargs,
)
sharding_option.cache_params = original_cache_params
return result
try:
result = _orig_build_shard_perf_contexts(
cls,
config,
shard_sizes,
sharding_option,
topology,
constraints,
sharder,
*args,
**kwargs,
)
finally:
sharding_option.cache_params = original_cache_params
return result

Comment thread tzrec/utils/plan_util.py
Comment on lines +846 to +853
for caching_mode in (False, True):
for load_factor in load_factors:
opt = copy.deepcopy(base_option)
opt.cache_params = CacheParams(load_factor=load_factor, stats=stats)
# pyre-ignore [16]
opt.dynamicemb_options = copy.deepcopy(dynamicemb_options)
opt.dynamicemb_options.caching = caching_mode
variants.append(opt)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant double deep-copy of dynamicemb_options. The caller (line 1035) attaches dynamicemb_options to base_option before calling this function, so copy.deepcopy(base_option) on line 848 already deep-copies dynamicemb_options into opt. Line 851 then deep-copies the original again and replaces it. With 20 variants per table this doubles a non-trivial copy. Either drop line 851 and just set opt.dynamicemb_options.caching = caching_mode, or stop attaching dynamicemb_options on the caller side before this call.

Comment thread tzrec/utils/plan_util.py
Given :math:`M` tables each with up to :math:`N` ShardingOptions, pick one
option per table to minimize total perf while respecting both per-rank HBM
and per-rank DDR budgets from the topology.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring says "per-rank DDR" but the implementation enforces per-machine DDR (lines 349–361 sum DDR across local_world_size co-located ranks and the per-option prune at line 391 compares against max_machine_ddr). The inline comment at 350–352 has the right framing — the class docstring should match: "per-rank HBM and per-machine DDR budgets."

Comment on lines +86 to +89
if x >= 1.0:
return 1.0
base = _DYNAMICEMB_CACHING_X_EFF_BASE if caching else _DYNAMICEMB_HYBRID_X_EFF_BASE
return base + _DYNAMICEMB_X_EFF_TIEBREAK * x
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharp discontinuity at x=1.0. HYBRID@x=0.99 → 0.1199; HYBRID@x=1.00 → 1.0 — an ~8× jump for a 1% ratio change. The DP will reliably prefer x=1.0 over x=0.99 by a huge perf margin, then prefer CACHING@x=0.5 (0.285) over HYBRID@x=0.9 (0.119) even on workloads where HYBRID@0.9 is plainly faster in reality. If the empirical sweep really shows a step at x=1.0 because the runtime drops the host tier, please call that out as "x=1.0 = HBM-only kernel, not the same algorithm as x=0.99" — and consider whether the enumerator should emit a discrete mode={HBM_ONLY, HYBRID, CACHING} axis rather than packing the discontinuity into the same x knob.

Comment thread tzrec/utils/plan_util.py
Comment on lines +434 to +463
for table_i in range(1, table_count):
prev_dp = dp[table_i - 1]
cur_dp = dp[table_i]
cur_back = backtrack[table_i]
hbm_t = hbm_by_fqn[table_i]
ddr_t = ddr_by_fqn[table_i]
perf_t = perf_by_fqn[table_i]
for h in range(hbm_bins):
prev_dp_h = prev_dp[h]
for d in range(ddr_bins):
prev_state = prev_dp_h[d]
prev_perf = prev_state[0]
if prev_perf == _INF:
continue
prev_h_val = prev_state[1]
prev_d_val = prev_state[2]
for opt_j in range(option_count):
delta_perf = perf_t[opt_j]
if delta_perf == _INF:
continue
new_h = prev_h_val + hbm_t[opt_j]
new_d = prev_d_val + ddr_t[opt_j]
if new_h >= hbm_bins or new_d >= ddr_bins:
continue
new_h_i = int(new_h)
new_d_i = int(new_d)
new_perf = prev_perf + delta_perf
if cur_dp[new_h_i][new_d_i][0] > new_perf:
cur_dp[new_h_i][new_d_i] = (new_perf, new_h, new_d)
cur_back[new_h_i][new_d_i] = (opt_j, h, d)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2D DP cost scales with world size — consider clamping total bins. hbm_bins = 100 · num_devices and ddr_bins = 50 · num_devices. On a 16-GPU world the inner triple-loop is 1600 × 800 × option_count pure-Python iterations per table; with 100 dynamicemb tables (20 options each) that's a few-billion-op DP, and dp + backtrack allocate ~256M tuples. Even on smaller worlds the bin counts multiply faster than necessary — a per-device discretization isn't really needed since the budget is a single scalar per axis. Consider clamping totals (e.g. hbm_bins = min(hbm_bins_per_device · num_devices, 2_000)) or vectorizing the transition with NumPy. The defaults are fine for 2–4 GPUs but worth a guardrail before the next user hits this.

Comment on lines +241 to +309
class DynamicProgrammingProposerTest(unittest.TestCase):
"""2D DP across HBM × DDR picks per-table mode under joint budgets."""

def _run(self, search_space, topology):
proposer = DynamicProgrammingProposer(
hbm_bins_per_device=20, ddr_bins_per_device=20
)
proposer.load(search_space)
# First propose returns the lowest-mem-per-table seed.
proposer.propose()
proposer.feedback(partitionable=True, storage_constraint=topology)
proposals = []
proposal = proposer.propose()
while proposal:
proposals.append(proposal)
proposer.feedback(partitionable=True, storage_constraint=topology)
proposal = proposer.propose()
return proposals

def test_caching_preferred_when_ddr_is_generous(self):
# Three options for one table:
# HYBRID @ x=1.0: hbm = T, ddr = 0, perf = high (HBM-only)
# HYBRID @ x=0.1: hbm = .1T, ddr = .9T, perf = high (slow)
# CACHING @ x=0.1: hbm = .1T, ddr = T, perf = low (fast — modeled hits)
opts = [
_FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0),
_FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
_FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
]
topology = _make_topology(
num_devices=2, hbm_per_device=2000, ddr_per_device=2000
)
proposals = self._run(opts, topology)
# Best plan must be the CACHING option (perf=10).
best = min(proposals, key=lambda p: sum(o.total_perf for o in p))
self.assertEqual(best[0].total_perf, 10.0)

def test_caching_rejected_when_ddr_is_tight(self):
# Host budget is only 950 — CACHING (ddr=1000) cannot fit; HYBRID can.
opts = [
_FakeShardingOption("table_a", hbm=100, ddr=900, perf=80.0),
_FakeShardingOption("table_a", hbm=100, ddr=1000, perf=10.0),
]
topology = _make_topology(
num_devices=1, hbm_per_device=2000, ddr_per_device=950
)
proposals = self._run(opts, topology)
# Every proposed plan must pick the HYBRID option (perf=80).
for p in proposals:
self.assertEqual(p[0].total_perf, 80.0)

def test_high_factor_collapses_modes(self):
# At x=1.0 HYBRID == CACHING in HBM and CACHING.ddr = T = HYBRID.hbm.
# If we offer just the high-factor options, DP picks one of them.
opts = [
_FakeShardingOption("table_a", hbm=1000, ddr=0, perf=50.0), # HYBRID x=1.0
_FakeShardingOption(
"table_a", hbm=1000, ddr=1000, perf=50.0
), # CACHING x=1.0
]
topology = _make_topology(
num_devices=1, hbm_per_device=1100, ddr_per_device=2000
)
proposals = self._run(opts, topology)
# Either option is fine — they're tied. Just verify the proposer
# returned something feasible.
self.assertGreater(len(proposals), 0)
for p in proposals:
self.assertEqual(p[0].total_perf, 50.0)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the headline multi-table 2D-DP scenario. All three DynamicProgrammingProposerTest cases use a single table, so the cross-table DP transition at plan_util.py:434–463 is exercised only at table_i == 0. The PR's value prop is "pick CACHING for one table and HYBRID for another under joint HBM+DDR budgets" — a 2-table fixture where global DDR can only hold one CACHING but plenty of HBM is available would directly cover that. Same gap for the mem_bins_per_device legacy alias (untested) and the all-options-pruned / empty-search-space degenerate paths.

only_values=only_values,
bucket_capacity=self.BUCKET_CAPACITY,
caching=caching,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct unit test for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts is missing. Both have non-trivial branches — admission_counter HBM accounting (dynamicemb_util.py:663–672), the compute_device == "cpu" and not is_inference DDR branch (line 697), and the cache_params swap+restore (line 502–514) — that are reachable only via the E2E test, which is gated on has_dynamicemb and cuda. On a CPU dev box none of this is covered; a fake-ed _orig_build_shard_perf_contexts would let you verify the boost is applied and (more importantly) that the original cache_params is restored even when the wrapped call raises.

@github-actions
Copy link
Copy Markdown

Review summary

Solid PR overall — the 2D DP rewrite is clean, the mode-aware storage formula is well-tested, and the empirical perf-model docstring is refreshingly honest about its provenance. A few things worth addressing before merge (inline above):

Should fix

  • dynamicemb_util.py:515-527 — missing try/finally around the cache_params swap; an exception in the wrapped estimator permanently corrupts the option's cache load factor.
  • dynamicemb_util.py:56-89 — the x_eff step from 0.12 → 1.0 at x=1.00 means HYBRID@0.99 and CACHING@0.5 both look much worse than HYBRID@1.0 by construction. If x=1.0 is genuinely a different code path (HBM-only kernel), modeling it as a separate mode axis is cleaner than a discontinuity in the same knob.
  • plan_util.py:846-853 — redundant double-deepcopy of dynamicemb_options (20×/table).

Worth a follow-up

  • plan_util.py:434-463 — 2D DP state blows up on multi-node worlds (bins scale with world_size); consider an absolute cap on total bins, or vectorize with NumPy.
  • Test gap on the multi-table DP case where one table wants CACHING and another HYBRID under a shared DDR budget — that's the headline scenario and currently isn't covered by a unit test. Same for dynamicemb_calculate_shard_storages and _dynamicemb_aware_build_shard_perf_contexts (only reachable via the CUDA-gated E2E test) and the mem_bins_per_device legacy alias.

Docs

  • DynamicProgrammingProposer docstring says "per-rank DDR" but the implementation budgets DDR per-machine — worth aligning the wording with the inline comment on lines 350-352.
  • dynamicemb_calculate_shard_storages Args block (not in diff, around line 620): the caching_ratio description still reads "ratio of HBM to DDR memory for UVM caching" — under the new CACHING semantics that's stale; host always holds the full backing store and caching_ratio only sizes the HBM hot-row cache.
  • PR description mentions TZREC_CACHING_HIT_RATE_MULTIPLIER and x_eff = min(1.0, x · multiplier); the code uses fitted constants instead (no env var). Worth updating the description so it doesn't mislead future readers.

No security/multi-process concerns: the monkey-patches run under the import lock and each rank is its own process.

tiankongdeguiji and others added 10 commits May 13, 2026 17:02
`copy.deepcopy(base_option)` on the first line of each variant body
already copies `dynamicemb_options` because the caller in `enumerate()`
attaches it onto `base_option` before invoking `_emit_dynamicemb_variants`.
The second `copy.deepcopy(dynamicemb_options)` was wasted work — at 20
variants per dynamicemb table it doubled a non-trivial copy. Drop it and
mutate the already-fresh per-variant `dynamicemb_options.caching` directly.

PR alibaba#508 review R3 (github-actions, 2026-05-13).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… blowup

The 2D DP discretizes each memory axis into `bins_per_device *
num_devices` bins. On a 16-GPU world that becomes 1600 hbm bins × 800
ddr bins = 1.28M cells per table; with 100 dynamicemb tables and 20
options each, the inner Python loop is a few-billion-op DP and dp +
backtrack tuples consume hundreds of MB. The HBM and DDR budgets are
scalar, so multiplying bins by num_devices stops adding meaningful
precision past a point. Clamp the per-axis bin count at `_DP_AXIS_BIN_CAP`
(1024) to keep the multi-host case tractable.

NumPy vectorization of the inner DP loop is a bigger refactor; leaving a
TODO referencing PR alibaba#508 review R5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

* DynamicProgrammingProposer class docstring: "per-rank DDR" -> "per-rank
  HBM and per-machine DDR" so it matches the machine-DDR prune introduced
  in commit 38bd7ad (PR alibaba#508 review R4).

* _dynamicemb_effective_cache_ratio: call out that the discontinuity at
  x=1.0 is a deliberate kernel switch (HBM-only DynamicEmbStorage vs
  dual-tier HybridStorage/DynamicEmbCache), not a smoothing artifact. Note
  a future refactor could lift this to a discrete `mode` axis on the
  enumerator (PR alibaba#508 review R2).

* dynamicemb_calculate_shard_storages caching_ratio Args: rewrite the
  stale "ratio of HBM to DDR memory for UVM caching" framing -- under the
  dynamicemb modes, host always holds the full backing in CACHING, and
  caching_ratio sizes the HBM hot-row cache (top-level docs note).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ch space

Add three DynamicProgrammingProposerTest methods that the original
PR alibaba#508 review (R6) flagged as missing:

  test_two_tables_pick_mixed_modes_under_joint_budget
    Two tables under a topology where both-HYBRID is HBM-infeasible and
    both-CACHING is DDR-infeasible. The DP must pick one of each. This
    exercises the cross-table transition at plan_util.py table_i == 1
    which the existing single-table cases never reach.

  test_legacy_mem_bins_per_device_alias
    Verifies the back-compat kwarg `mem_bins_per_device` still wins over
    the new `hbm_bins_per_device` when both are passed.

  test_empty_search_space_returns_empty_proposal
    Edge case where `load([])` short-circuits via `table_count == 0`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apper tests

Combine PR alibaba#508 review R1 + R7. The new
``BuildShardPerfContextsWrapperTest.test_cache_params_restored_on_exception``
exists to pin the bug R1 describes: when ``_orig_build_shard_perf_contexts``
raises, the boosted ``x_eff`` clone of ``sharding_option.cache_params``
must not leak to the downstream storage estimator's view of the same
option. The corresponding ``try/finally`` around the wrapped call closes
the leak. Reverts the previous "single return, no try/finally"
simplification.

Also adds:

* ``BuildShardPerfContextsWrapperTest`` — direct unit tests for the
  ShardPerfContext.build_shard_perf_contexts monkey-patch covering:
  boost applied for caching, boost applied for hybrid, no boost when the
  option has no dynamicemb_options, cache_params restored on success,
  and the restore-on-exception case driving R1.

* ``DynamicEmbCalcShardStoragesTest.test_admission_counter_increases_hbm``
  — direct cover for the admission_counter HBM accounting branch in
  ``dynamicemb_calculate_shard_storages`` that the CUDA-gated E2E test
  doesn't exercise on a CPU dev box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the DP proposer commits a (mode, cache_load_factor) per dynamicemb
table, log one INFO line on rank 0 that names the runtime mode the
dispatcher in BatchedDynamicEmbeddingTablesV2._create_cache_storage will
actually pick:

  total_value_memory <= local_hbm_for_values  -> HBM_ONLY
  else if caching                             -> CACHING
  else                                        -> HYBRID

Example:
  [dynamicemb plan] sparse.ebc.user_emb: mode=CACHING cache_load_factor=0.30 local_hbm_for_values=2.400GiB
  [dynamicemb plan] sparse.ebc.item_emb: mode=HBM_ONLY cache_load_factor=1.00 local_hbm_for_values=4.000GiB

Replaces the previous "have to grep _print_memory_consume from the
runtime to find out what mode the planner picked" workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Upstream ``_print_memory_consume`` is mode-blind and always computes
``dram = total - hbm``, so for CACHING tables it under-reports the host
backing (the actual storage_options keep ``max_capacity`` unchanged, i.e.
host holds the full table). Surface the correct expectation in our own
log so users can verify CACHING vs HYBRID storage from the planner side:

  HBM_ONLY: dram = 0
  CACHING:  dram = total_value_memory          # full backing
  HYBRID:   dram = total - local_hbm_for_values  # partition

Example for cat_0_emb @ 0.20 with Adam:
  mode=HYBRID  local_hbm=0.954GiB local_dram=3.815GiB  (= 0.8 * total)
  mode=CACHING local_hbm=0.954GiB local_dram=4.769GiB  (= total)

If the runtime memory print shows DRAM = 3.815GiB for a table logged as
CACHING here, that is the print's mode-blind formula -- the real host
allocation is the log's local_dram_for_values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…omputing

The storage estimator (``dynamicemb_calculate_shard_storages`` ->
``_calculate_dynamicemb_storage_specific_sizes`` ->
``_calculate_dynamicemb_table_storage_specific_size``) already populates
``shards[i].storage.ddr`` per option for the chosen mode:

  HYBRID:  ddr_specific = (1 - cache_ratio) * total_value_memory
  CACHING: ddr_specific = total_value_memory

For cuda compute_device, ``Storage.ddr = ddr_specific`` (no I/O overhead
added). So the value the planner already computed equals what the log
line needs -- no point re-running ``_calculate_dynamicemb_table_storage_specific_size``.

Read ``shards[0].storage.ddr`` directly. Identical numbers, less code,
no second call to the storage formula.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small follow-ups on the dynamicemb plan log:
* Rename keys: ``local_hbm_for_values`` -> ``local_hbm`` and
  ``local_dram_for_values`` -> ``local_dram``. Shorter, still unambiguous
  in context.
* Switch HBM source to ``shards[0].storage.hbm`` for symmetry with the
  DDR side and to surface the true HBM footprint (values + metadata +
  counter + pipeline I/O), not just the runtime config knob. Upstream
  ``_print_memory_consume`` strips metadata from its HBM column too;
  this log reports the honest total.
* Bump tzrec to 1.2.13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label May 22, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label May 22, 2026
Comment thread tzrec/utils/dynamicemb_util.py Outdated
Comment on lines +443 to +448
if cache_load_factor >= 1.0:
mode = "HBM_ONLY"
dram_bytes = 0 # runtime drops the host tier
else:
mode = "CACHING" if dynamicemb_options.caching else "HYBRID"
dram_bytes = shards[0].storage.ddr
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storage estimator ↔ runtime ↔ log are not coherent at cache_load_factor >= 1.0.

The log overrides dram_bytes = 0 here because the runtime drops the host tier when the table fits HBM. But the storage estimator at _calculate_dynamicemb_table_storage_specific_size (line ~340) still returns total_value_memory DDR for CACHING mode regardless of cache_ratio — i.e. CACHING@1.0 reports a full backing-store DDR cost that the 2D DP then uses for pruning. Result: the DP can mark CACHING@1.0 plans infeasible on DDR-tight topologies even though they would actually fit at runtime, and the log will then report local_dram=0.000GiB for any plan that does make it through.

Suggest aligning the storage formula with the runtime: when cache_ratio >= 1.0, force CACHING's DDR to 0 too. Then the storage estimator, the DP, and this log all agree.

Comment thread tzrec/utils/plan_util.py Outdated
Comment on lines +275 to +281
# back-compat alias: ``mem_bins_per_device`` mapped to the HBM axis.
bins_h = (
mem_bins_per_device
if mem_bins_per_device is not None
else hbm_bins_per_device
)
self._hbm_bins_per_device: int = max(bins_h, 1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the caller passes both mem_bins_per_device=X and hbm_bins_per_device=Y, the legacy alias silently wins — test_legacy_mem_bins_per_device_alias even codifies this. That's the surprising direction; a user supplying the new name explicitly while a config layer injects the legacy one will get the legacy value with no warning. The class docstring also reads as a one-way alias ("Accepts legacy mem_bins_per_device as an alias").

Suggest either raising TypeError when both are non-None, or letting hbm_bins_per_device win and emitting a DeprecationWarning for mem_bins_per_device. Whatever the choice, document it in the Args: block.

Comment thread tzrec/utils/plan_util.py Outdated
Comment on lines +479 to +494
# Enumerate proposals in decreasing total memory order. Total memory
# is a tie-break heuristic — Pareto-optimal (perf, hbm, ddr) frontier
# tends to live at the high end, so larger-mem cells are explored
# first and small-mem fallbacks come later.
self._proposal_list = []
last_back = backtrack[table_count - 1]
coords = sorted(
(
(h, d)
for h in range(hbm_bins)
for d in range(ddr_bins)
if last_back[h][d][0] >= 0
),
key=lambda hd: hd[0] + hd[1],
reverse=True,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class docstring (line 255) says backtracking "yields Pareto-optimal proposals," but this loop actually enumerates every reachable (h, d) cell at the last table — up to hbm_bins × ddr_bins ≈ 800k+ proposals at the cap. Downstream consumers iterate propose()/feedback() once per proposal, so a planner run on a large world can churn through hundreds of thousands of plans rather than a small Pareto frontier.

Two minor things compound this:

  • The sort key hd[0] + hd[1] weighs the two axes equally even though hbm_bin_size and ddr_bin_size differ — so "decreasing total memory" isn't really sorted by memory.
  • last_back[h][d][0] >= 0 is the only filter; nothing prunes dominated cells.

Either tighten to the Pareto frontier (drop (h, d) if some (h', d') with h' ≤ h, d' ≤ d, perf' ≤ perf exists), cap len(self._proposal_list), or update the docstring to say "all reachable plans, larger memory first."

Comment on lines +512 to +519
def _dynamicemb_aware_build_shard_perf_contexts(
cls, # pyre-ignore [2]
config, # pyre-ignore [2]
shard_sizes, # pyre-ignore [2]
sharding_option, # pyre-ignore [2]
topology, # pyre-ignore [2]
constraints, # pyre-ignore [2]
sharder, # pyre-ignore [2]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wrapper hard-codes the first six positional parameters (config, shard_sizes, sharding_option, topology, constraints, sharder) and forwards everything else via *args, **kwargs. If torchrec renames sharding_option, reorders these six, or splits one into multiple, the wrapper either silently mis-dispatches (the boost goes into the wrong object) or raises an opaque TypeError deep inside the planner with no link back to this file.

Worth capturing inspect.signature(_orig_build_shard_perf_contexts) at patch time, asserting the first six parameter names match, and otherwise logging a warning + installing a pass-through. A pinned/asserted torchrec version range somewhere visible would also help.

Comment thread tzrec/utils/plan_util.py Outdated
# bins/dev = 1600 bins, * 800 ddr bins = 1.28M cells per table). The
# budgets are scalar (total HBM, total DDR), so multiplying bins by
# num_devices stops adding precision past this cap. NumPy vectorization
# of the inner loop is a follow-up (PR #508 review R5).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(PR #508 review R5) looks like an in-flight review artifact — PR numbers and ad-hoc reviewer ticket IDs aren't useful once merged (git blame is the canonical pointer). Suggest dropping the parenthetical or replacing with an actual follow-up issue number.

Comment thread tzrec/utils/plan_util.py
Comment on lines +838 to +866
def _emit_dynamicemb_variants(
base_option: ShardingOption,
dynamicemb_options: Any,
) -> List[ShardingOption]:
"""Expand a dynamicemb ShardingOption into HYBRID + CACHING variants.

Sweeps both placement modes (``caching=False`` and ``caching=True``) and,
when ``base_option.cache_params`` is unset, ten cache_load_factor values
(0.1, 0.2, ..., 1.0). The downstream 2D DP proposer picks per table the
best (mode, ratio) that fits both HBM and host topology budgets.

Each returned ShardingOption owns a freshly deep-copied
``dynamicemb_options`` instance so per-option ``caching`` mutations do not
bleed across variants.
"""
if base_option.cache_params is None:
load_factors = [(i + 1) / 10 for i in range(10)]
stats = None
else:
load_factors = [base_option.cache_params.load_factor]
stats = base_option.cache_params.stats
variants: List[ShardingOption] = []
for caching_mode in (False, True):
for load_factor in load_factors:
opt = copy.deepcopy(base_option)
opt.cache_params = CacheParams(load_factor=load_factor, stats=stats)
# deepcopy(base_option) already produced a fresh dynamicemb_options.
opt.dynamicemb_options.caching = caching_mode # pyre-ignore [16]
variants.append(opt)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dynamicemb_options parameter is never read — the function relies entirely on base_option.dynamicemb_options via copy.deepcopy(base_option) and opt.dynamicemb_options.caching = .... The caller also already assigns sharding_option.dynamicemb_options = dynamicemb_options immediately before invocation, so the argument is doubly redundant.

Either drop the parameter, or use it as the canonical source so callers don't have to mutate base_option first.

Comment thread tzrec/utils/plan_util.py
Comment on lines +355 to +365
for device in storage_constraint.devices:
max_device_hbm = max(max_device_hbm, device.storage.hbm or 0)
hbm_total += device.storage.hbm or 0
ddr_total += device.storage.ddr or 0
# DDR is host-shared across ranks co-located on one machine, so the
# per-option fit check compares against the largest machine's DDR pool
# -- not per-device. HBM is GPU-local, so its prune stays per-device.
per_host = getattr(storage_constraint, "local_world_size", None) or num_devices
per_host = max(per_host, 1)
max_machine_ddr = 0
for host_start in range(0, num_devices, per_host):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This per-machine DDR pruning is the load-bearing reason for the 2D rewrite, but it's not exercised in tests — _make_topology in plan_util_test.py accepts local_world_size but every test falls through to the default (local_world_size or num_devices), so per_host == num_devices in every case. A bug where max_machine_ddr collapses to ddr_total or the window stride is off-by-one wouldn't be caught.

Worth a multi-host test: e.g. num_devices=4, local_world_size=2, ddr_per_device=500 (so each machine has 1000 DDR, total 2000), then assert an option whose shard.ddr=1500 is pruned even though 1500 < ddr_total.

Comment thread tzrec/utils/dynamicemb_util.py Outdated
Comment on lines +437 to +457
# Per-table mode log on rank 0. cache_load_factor=1.0 forces
# the runtime into HBM_ONLY (host tier dropped) regardless of
# ``caching`` -- mirror that override here so the log matches
# the runtime, not the planner's recorded (caching, factor).
if int(os.environ.get("RANK", 0)) == 0:
cache_load_factor = float(sharding_option.cache_load_factor)
if cache_load_factor >= 1.0:
mode = "HBM_ONLY"
dram_bytes = 0 # runtime drops the host tier
else:
mode = "CACHING" if dynamicemb_options.caching else "HYBRID"
dram_bytes = shards[0].storage.ddr
hbm_gib = shards[0].storage.hbm / (1 << 30)
dram_gib = dram_bytes / (1 << 30)
fqn = f"{sharding_option.path}.{sharding_option.name}"
logger.info(
f"[dynamicemb plan] {fqn}: mode={mode} "
f"cache_load_factor={cache_load_factor:.2f} "
f"local_hbm={hbm_gib:.3f}GiB "
f"local_dram={dram_gib:.3f}GiB"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ~25-line log block has zero coverage in dynamicemb_util_test.py / plan_util_test.py. Three independent decision points worth a test each:

  • cache_load_factor >= 1.0 HBM_ONLY override (float-precision regressions would silently mis-label).
  • int(os.environ.get("RANK", 0)) == 0 gating (a regression to os.environ.get("RANK") == "0" would fail on the typical single-rank case where RANK is unset).
  • 1 << 30 GiB divisor (a regression to 1e9 would not be caught).

A mock.patch.object(logger, "info") test driving a synthetic ShardingOption through _to_sharding_plan would cover all three cheaply.

@github-actions
Copy link
Copy Markdown

Code review summary (5 reviewer agents)

Substantial, well-tested planner change. The 2D DP refactor is clean, the empirical-fit docstring on _dynamicemb_effective_cache_ratio is unusually informative, and the new try/finally-restored monkey-patch on ShardPerfContext is properly tested for the exception path. Inline comments below capture the items worth a look before merge — they cluster into:

Correctness / coherence

  • Storage estimator vs. runtime vs. planning log disagree at cache_load_factor >= 1.0 (CACHING DDR reported as full backing in the DP but as 0 in the log).
  • DP "emits Pareto-optimal proposals" docstring overstates what the code does (it emits every reachable (h, d) cell, up to ~800k at the cap).

Robustness

  • ShardPerfContext.build_shard_perf_contexts monkey-patch hardcodes a 6-positional signature with no version assert.
  • mem_bins_per_device legacy kwarg silently wins over the new hbm_bins_per_device when both are passed — and the test codifies it. Worth a DeprecationWarning or TypeError.

Test gaps (also flagged inline where relevant)

  • Per-machine DDR pruning via local_world_size is the motivation for the 2D rewrite but no test ever uses local_world_size != num_devices.
  • The new per-table [dynamicemb plan] ... log block (incl. HBM_ONLY override + rank-0 gating + GiB formatting) is not covered.
  • _DP_AXIS_BIN_CAP is never exercised (all DP tests use num_devices ≤ 2 with bins=20).

Minor

  • (PR #508 review R5) looks like a stray reviewer-ticket reference in a shipping comment.
  • _emit_dynamicemb_variants(base_option, dynamicemb_options) has a dead second parameter.

Performance note (no inline)

  • The pure-Python 4-level inner DP loop is acceptable for a one-shot startup call at the stated 30-table/16-GPU workload (~25-70 s; nested-list dp + backtrack peaks at a few GB), but lives within ~2× of unacceptable on heavier worlds. The (PR #508 review R5) TODO is the right follow-up — switching dp / backtrack to NumPy arrays drops both time and memory by ~10×.

🤖 Generated with Claude Code

tiankongdeguiji and others added 4 commits May 22, 2026 14:46
Four cleanups to ``DynamicProgrammingProposer`` from PR alibaba#508 review
round 2:

* **Proposal emission (R3):** the previous backtracking enumerated every
  reachable ``(h, d)`` cell at the last table -- up to ``hbm_bins ×
  ddr_bins`` ≈ 200k+ entries. For each HBM bin only the lowest-perf DDR
  cell matters (other cells at the same ``h`` are strictly dominated --
  same HBM cost, ≥ perf). Emit one proposal per HBM bin in decreasing
  HBM order; proposal count caps at ``hbm_bins ≤ _DP_AXIS_BIN_CAP``.

* **Drop legacy alias (R2):** delete ``mem_bins_per_device`` kwarg from
  ``__init__``. No in-tree caller passes it; carrying it with a silent
  precedence rule was a landmine.

* **Drop unused parameter (R6):** ``_emit_dynamicemb_variants`` second
  parameter was never read; the function uses
  ``copy.deepcopy(base_option).dynamicemb_options`` instead. Drop the
  parameter and update the single caller.

* **Trim comment (R5):** drop the trailing two lines of the
  ``_DP_AXIS_BIN_CAP`` block (the "stops adding precision" repeat and
  the dangling PR-ID parenthetical).

Also update class docstring to describe the new "one per HBM bin"
emission (was "Pareto-optimal proposals", which was inaccurate).

Drop the now-obsolete ``test_legacy_mem_bins_per_device_alias``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The enumerator bounds ``cache_load_factor`` to ``{0.1, 0.2, ..., 1.0}``,
so the HBM_ONLY-override boundary is the exact value 1.0. Using ``>=
1.0`` would silently relabel any leaked-in ``cache_load_factor > 1.0``
(e.g. from a misconfigured DynamicEmbedding proto) as HBM_ONLY instead
of letting the upstream invariant violation surface loudly downstream
(a negative HYBRID DDR via ``1.0 - 1.5``).

PR alibaba#508 review N1-trimmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* **N7 -- multi-host DDR prune.** All prior
  ``DynamicProgrammingProposerTest`` cases used
  ``local_world_size == num_devices``, so ``per_host == num_devices``
  collapsed the per-machine window to the whole world; the load-bearing
  multi-host pruning was effectively untested.
  ``test_per_machine_ddr_prune_on_multi_host_topology`` exercises 4 GPUs
  across 2 machines (each with 1000 MB DDR, total 2000): asserts an
  option with ``ddr=1500`` per shard is pruned (exceeds per-machine cap)
  and an option with ``ddr=900`` survives.

* **N8 -- plan log coverage.** Extract the per-table log block into
  ``_log_dynamicemb_table_plan`` so the three decision points
  (``cache_load_factor == 1.0`` HBM_ONLY override, RANK env gating, GiB
  divisor) are testable without reconstructing a full ShardingOption
  pipeline. New ``PlanLogLineTest`` covers: HYBRID/CACHING at partial
  factor, HBM_ONLY at x=1.0 in both modes (override path), non-rank-0
  silence, and a ``cache_load_factor=1.0001`` regression case that
  asserts the exact-equality check does NOT swallow a >1.0 escape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ocstring

* Drop four test classes from ``dynamicemb_util_test.py``:
  ``EffectiveCacheRatioTest``, ``BuildShardPerfContextsWrapperTest``,
  ``DynamicEmbCalcShardStoragesTest``, ``PlanLogLineTest``. Coverage of
  the perf model, the build_shard_perf_contexts monkey-patch, the
  storage estimator, and the per-table log line is provided
  end-to-end by ``PlanUtilDynamicEmbE2ETest`` in ``plan_util_test.py``.
  Only ``StorageFormulaTest`` (the pure HBM/DDR byte arithmetic) is
  worth keeping as a unit test -- the rest is e2e territory.

* Rewrite ``DynamicProgrammingProposer`` class docstring to match
  master's style: open with a one-line summary, frame the problem as
  matrices ``A`` / ``B`` with discretized state ``dp[i][f(k)]``, give
  the state-transition equation in LaTeX, and end with an ``Args`` block
  typed in parentheses. Extends master's 1D formulation to the 2D
  (HBM × DDR) case with cost matrices ``A^h``, ``A^d`` and budgets
  ``K_h``, ``K_d``, and notes the per-HBM-bin proposal emission.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant