Skip to content

[feat] SID: add ResidualQuantizer / BaseSidModel base classes#538

Merged
tiankongdeguiji merged 4 commits into
alibaba:masterfrom
WhiteSwan1:sid-1-abstract
Jun 5, 2026
Merged

[feat] SID: add ResidualQuantizer / BaseSidModel base classes#538
tiankongdeguiji merged 4 commits into
alibaba:masterfrom
WhiteSwan1:sid-1-abstract

Conversation

@WhiteSwan1

@WhiteSwan1 WhiteSwan1 commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

First of three PRs splitting the Semantic-ID models onto a shared base. This one is purely additive — only the backend-agnostic foundation, with no concrete quantizer or model and no edits to existing files. RQ-KMeans follows in PR2, RQ-VAE in PR3.

What this adds:

  • ResidualQuantizer (abstract): owns the state every residual quantizer shares — embed_dim, per-layer codebook sizes (normalize_n_embed), residual-normalization flag, and the layer list — and the shared residual walk. _residual_pass drives the concrete get_codes / decode_codes / output_dim. Subclasses implement just the two per-layer primitives — _quantize_layer (encode) and _lookup_code (decode) — plus forward and get_codebook_embeddings.
  • BaseSidModel (abstract): the shared SID model scaffold — embedding-feature extraction, loss/metric initialization, and shared config parsing — that SidRqkmeans and SidRqvae will subclass.
  • types.py: QuantizeForwardMode, QuantizeOutput, ResidualQuantizerOutput.

Tests: normalize_n_embed, the abstract-base contract (shared state present, backend primitives raise).

No proto changes and no edits to existing modules — the concrete SidRqkmeans / SidRqvae messages, and the BaseModel.on_train_end lifecycle hook, arrive with their models in PR2 / PR3.

@WhiteSwan1 WhiteSwan1 changed the title [refactor] SID: add ResidualQuantizer / BaseSidModel base classes + o… [feat] SID: add ResidualQuantizer / BaseSidModel base classes Jun 2, 2026
@WhiteSwan1 WhiteSwan1 force-pushed the sid-1-abstract branch 3 times, most recently from be47e44 to cb7e846 Compare June 3, 2026 05:22
@WhiteSwan1 WhiteSwan1 added the claude-review Let Claude Review label Jun 3, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jun 3, 2026
Comment thread tzrec/modules/sid_generation/residual_quantizer.py Outdated
Comment thread tzrec/models/sid_model.py Outdated
Comment thread tzrec/modules/sid/residual_quantizer_test.py
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review summary

Clean, well-scoped additive PR — the abstract ResidualQuantizer / BaseSidModel split is sensible, docstrings are thorough and accurate, and update_train_metric correctly wires into the real call site (main.py:438). The detach/accumulate semantics of _residual_pass match the docstring. Nothing blocking.

A few items, left inline:

  • decode_codes hardcodes dtype=torch.float — pins reconstruction to fp32 and silently upcasts under mixed precision; seed the accumulator from the first lookup so dtype/device follow the codebook.
  • _update_unique_sid_ratio divides by B without an empty-batch guard (ZeroDivisionError).
  • Test coverage — only the NotImplementedError paths are exercised; the core residual-walk math (get_codes/decode_codes, the detach invariant, the normalize_residuals=True branch) is untested. A tiny fake subclass would cover it.

Minor polish (no need to block):

  • update_train_metric(predictions: dict, ...) — every sibling/base uses Dict[str, torch.Tensor].
  • # noqa: ANN201 on the abstract forward is a no-op (ANN is not in .ruff.toml lint.select); siblings annotate their forward return.

WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 3, 2026
- decode_codes: seed the accumulator from the first lookup so device AND
  dtype follow the codebook, instead of pinning to fp32 (silently upcasting
  each layer's add under mixed precision). n_layers >= 1 is guaranteed.
- BaseSidModel._update_unique_sid_ratio: guard B == 0 (empty final shard
  under DDP/TorchRec) to avoid ZeroDivisionError.
- residual_quantizer_test: add a fake one-primitive subclass exercising the
  concrete residual walk the base owns — get_codes shape + aggregate ==
  Σ quantized_i, the detach invariant (codebook grad flows, input gets none),
  the normalize_residuals branch, and decode_codes sum + codebook dtype.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label Jun 3, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jun 3, 2026
Comment thread tzrec/modules/sid_generation/residual_quantizer_test.py Outdated
Comment thread tzrec/models/sid_model.py Outdated
Comment thread tzrec/models/sid_model.py Outdated
Comment thread tzrec/modules/sid/residual_quantizer.py
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

Code review summary

Reviewed across code-quality, performance, tests, docs, and security. This is clean, well-documented foundation code for the SID model split — the abstraction boundary (_quantize_layer/_lookup_code primitives vs. the shared _residual_pass walk) is cleanly drawn, the detach/grad semantics are correct and tested, and the decode_codes dtype/device-aware seeding and B == 0 empty-shard guard show good edge-case awareness. No blocking issues.

A few minor, optional items posted inline:

  • Teststest_normalize_residuals_branch only asserts shapes, so it wouldn't catch the normalization branch being removed; and _update_unique_sid_ratio (incl. its empty-shard guard) is pure-tensor logic that's untestable-free and worth a small unit test now.
  • Polishpredictions: dictDict[str, torch.Tensor] for consistency with BaseModel; an optional assert n_layers >= 1 in ResidualQuantizer.__init__ to self-guard the construction boundary for the upcoming PR2/PR3 subclasses.

Docs were verified accurate (the :class: forward-refs to not-yet-existing subclasses don't break the Sphinx build). LGTM as the additive foundation layer.

🤖 Generated with Claude Code

WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 3, 2026
…aba#538

- ResidualQuantizer.__init__: assert n_layers >= 1, making the boundary
  self-guarding (matches the "n_layers >= 1 is guaranteed" comment in
  decode_codes; a future n_layers=0 subclass would otherwise index OOB).
- BaseSidModel.update_train_metric: type predictions as Dict[str, Tensor]
  instead of bare dict, matching BaseModel / sibling models.
- sid_model_test.py (new): unit-test _update_unique_sid_ratio via __new__ —
  empty-batch no-op (guard) and the exact 0.75 ratio on a known-duplicate
  batch.
- residual_quantizer_test: make the normalize_residuals test behavioral —
  toggle the flag on the same input/codebook and assert the assignments
  differ, so the normalization branch is actually validated (the old shape
  assertions passed even with the branch deleted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tzrec/models/sid_model.py Outdated
Comment thread tzrec/models/sid_model.py Outdated
Comment thread tzrec/modules/sid_generation/types.py Outdated
Comment thread tzrec/modules/sid_generation/__init__.py Outdated
WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 3, 2026
…aba#538

- ResidualQuantizer.__init__: assert n_layers >= 1, making the boundary
  self-guarding (matches the "n_layers >= 1 is guaranteed" comment in
  decode_codes; a future n_layers=0 subclass would otherwise index OOB).
- BaseSidModel.update_train_metric: type predictions as Dict[str, Tensor]
  instead of bare dict, matching BaseModel / sibling models.
- sid_model_test.py (new): unit-test _update_unique_sid_ratio via __new__ —
  empty-batch no-op (guard) and the exact 0.75 ratio on a known-duplicate
  batch.
- residual_quantizer_test: make the normalize_residuals test behavioral —
  toggle the flag on the same input/codebook and assert the assignments
  differ, so the normalization branch is actually validated (the old shape
  assertions passed even with the branch deleted).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 3, 2026
- Metrics: switch the shared reconstruction metric to
  torchmetrics.MeanSquaredError (correct (preds, target) aggregation vs the
  biased mean-of-batch-means a MeanMetric gave), and add a UniqueRatio metric
  class (tzrec/metrics/unique_ratio.py) for codebook coverage instead of a
  MeanMetric fed a hand-computed ratio. BaseSidModel._update_unique_sid_ratio
  now delegates to it; the empty-batch guard + DDP reduction live in the
  metric. The RQ-VAE train-path mse moves to MeanSquaredError too.
- __init__.py: revert tzrec/modules/sid_generation/__init__.py to a bare
  package marker (no re-exports) and import ResidualKMeansQuantizer from its
  submodule in sid_rqkmeans, per "avoid adding to __init__.py".

Tests: tzrec/metrics/unique_ratio_test.py (ratio, empty no-op, mean over
batches); sid_model_test verifies the delegation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 3, 2026
- Metrics: switch the shared reconstruction metric to
  torchmetrics.MeanSquaredError (correct (preds, target) aggregation vs the
  biased mean-of-batch-means a MeanMetric gave), and add a UniqueRatio metric
  class (tzrec/metrics/unique_ratio.py) for codebook coverage instead of a
  MeanMetric fed a hand-computed ratio. The empty-batch guard + DDP reduction
  live in the metric; callers update it directly (the one-line
  _update_unique_sid_ratio passthrough is removed). The RQ-VAE train-path mse
  moves to MeanSquaredError too.
- __init__.py: revert tzrec/modules/sid_generation/__init__.py to a bare
  package marker (no re-exports) and import ResidualKMeansQuantizer from its
  submodule in sid_rqkmeans, per "avoid adding to __init__.py".

Tests: tzrec/metrics/unique_ratio_test.py (ratio, empty no-op, mean over
batches).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@WhiteSwan1 WhiteSwan1 added the claude-review Let Claude Review label Jun 3, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Jun 3, 2026
Comment thread tzrec/models/sid_model.py Outdated
Comment thread tzrec/metrics/unique_ratio.py
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review summary

Clean, genuinely additive PR — backend-agnostic SID base classes only, no edits to existing files. The abstraction is well-factored: ResidualQuantizer isolates the shared residual walk from the two backend primitives (_quantize_layer / _lookup_code), and BaseSidModel centralizes config parsing and shared metrics. Docstrings are thorough (the decode_codes dtype/device seeding and the detach invariant are nicely documented), and the _FakeQuantizer-based tests exercise the concrete base logic without pulling in real backends. The _residual_pass / decode_codes paths are AMP-safe with no spurious host-device syncs.

Two inline comments on items worth confirming:

  • sid_model.py — base registers mse/unique_sid_ratio in init_metric but provides no shared update_metric to populate them; docstring says subclasses "extend update_metric" but there's nothing to extend.
  • unique_ratio.py — metric is a mean of per-batch ratios (batch-size sensitive), not global codebook coverage; worth confirming the intended semantic.

Minor / non-blocking notes:

  • init_metric hardcodes the metric set rather than reading proto metrics config like other model families. Fine if SID metrics are intentionally fixed — worth a one-line note so it isn't later mistaken for a bug.
  • The :class: cross-refs to SidRqvae / SidRqkmeans / ResidualVectorQuantizer / ResidualKMeansQuantizer (not yet in the tree) may emit Sphinx "reference target not found" warnings until PR2/PR3 land — expected for the split, just flagging in case the doc build is strict.
  • A couple of low-priority test gaps: the n_layers >= 1 assertion and the temperature forwarding path are untested. The absence of a BaseSidModel test is justified (its proto fields don't exist yet); _extract_feature / init_metric could be unit-tested now with a stub if you want coverage before the subclass PR.

Nice work — no blocking issues.

WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 3, 2026
Addresses github-actions review on PR alibaba#538 (no behavior change):
- UniqueRatio stays the cheap per-batch metric (mean of distinct-rows /
  batch-size, two scalar states) — deliberately kept per-batch, not global
  distinct-SID coverage, to avoid accumulating an O(#distinct-SIDs) set
  during predict. Docstrings now say so plainly (a batch-size-sensitive
  diversity proxy, not global codebook coverage) instead of overselling it as
  "codebook coverage".
- BaseSidModel docstring: subclasses extend init_metric (via super) and
  *implement* update_metric (BaseModel.update_metric raises
  NotImplementedError, so there is nothing to "extend"); update_train_metric
  defaults to a no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First of three PRs splitting the Semantic-ID models onto a shared base. Purely
additive — only the backend-agnostic foundation, no concrete quantizer or
model and no edits to existing files. RQ-KMeans follows in PR2, RQ-VAE in PR3.

What this adds:
- ResidualQuantizer (abstract): owns the shared state (embed_dim, per-layer
  codebook sizes via normalize_n_embed, residual-normalization flag, layer
  list; asserts n_layers >= 1) and the shared residual walk — _residual_pass
  drives the concrete get_codes / decode_codes / output_dim. Subclasses
  implement just _quantize_layer (encode) and _lookup_code (decode), plus
  forward and get_codebook_embeddings.
- BaseSidModel (abstract): the shared SID model scaffold — embedding-feature
  extraction, loss/metric init (reconstruction MSE via
  torchmetrics.MeanSquaredError + codebook coverage via UniqueRatio), and
  shared config parsing — that SidRqkmeans / SidRqvae subclass.
- UniqueRatio (tzrec/metrics/unique_ratio.py): codebook-coverage metric (mean
  per-batch unique-row ratio) with empty-batch guard + DDP reduction.

Tests: normalize_n_embed; the abstract-base contract; the concrete residual
walk via a fake one-primitive subclass; and the UniqueRatio metric.

No proto changes and no edits to existing modules; __init__.py is a bare
package marker (no re-exports). The QuantizeForwardMode / QuantizeOutput /
ResidualQuantizerOutput types and the concrete SidRqkmeans / SidRqvae models
ship with the code that uses them in PR2 / PR3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
WhiteSwan1 and others added 3 commits June 5, 2026 03:47
…bump version to 1.2.16

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji merged commit 7dc1c18 into alibaba:master Jun 5, 2026
7 checks passed
WhiteSwan1 added a commit to WhiteSwan1/TorchEasyRec that referenced this pull request Jun 11, 2026
…_abstract

Brings the reviewed alibaba#539 foundation onto feat/sid_abstract (which already
carries alibaba#538 + an older RQ-VAE/RQ-Kmeans port), and syncs to upstream/master
(alibaba#540, alibaba#541, which alibaba#539 already contains).

Conflict resolutions:
- sid_rqkmeans.py(+test), residual_kmeans_quantizer.py, sid_model.py:
  take alibaba#539's canonical versions (BaseSidModel now hosts both SID models,
  with mse/rel_loss/unique_sid_ratio and the unified x_hat recon key).
- types.py: union — keep alibaba#539's QuantizeOutput, retain feat's
  QuantizeForwardMode enum + ResidualQuantizerOutput (RQ-VAE needs them).
- protos/models/sid_model.proto: union — alibaba#539's typed FaissKmeansConfig +
  clean SidRqkmeans, re-add feat's SinkhornConfig/ClipConfig/SidRqvae;
  drop the now-unused struct.proto import.
- protos/model.proto: enable `SidRqvae sid_rqvae = 600;` (the field alibaba#539
  reserved for this follow-up).
- main.py / model.py on_train_end: take alibaba#539's wording; drop feat's forced
  tail-checkpoint (SID models rely on the final=True tail save).

Transitional state: old modules/sid/kmeans.py still coexists with alibaba#539's
kmeans_quantize.py, and the RQ-VAE stack is still on the old abstraction —
both retired in the follow-up refactor commit. All SID modules import.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants