[refactor] sampler: derive sequence state from feature configs + accept bare sub-feature attr_fields#520
Conversation
Replace the global `_seq_field_delims: Dict[str, str]` (every sequence
input -> delim) with two candidate-side fields derived from the
matching feature config:
_sampler_seq_delim -- "" if `item_id_field` isn't a sequence input;
the parent feature's `sequence_delim`
otherwise (covers top-level
`sequence_id_feature` AND grouped sequence
sub-features).
_sampler_seq_inputs -- candidate-side sequence input names:
{feature.name} for top-level; union of
`sequence_input_names` across all sub-features
sharing `sequence_name` for grouped.
Lookup is `item_id_field in feature.sequence_input_names`, which
returns `[feature.name]` for top-level FG_NONE/FG_BUCKETIZE sequences
and the flattened input names for grouped FG_DAG/FG_NORMAL sub-features
-- single check covers both cases the old `_seq_field_delims` dict did.
Three downstream simplifications in `tzrec/datasets/dataset.py`:
* `launch_sampler_cluster`: outer-list-strip filter switches from
`f.name in self._seq_field_delims` to
`f.name in self._sampler_seq_inputs`, narrowing strip to the
candidate-sequence inputs (excludes unrelated grouped sequences and
non-sequence item-side attrs from the same lookup feature).
* `_apply_negative_sampler`: passes the single `self._sampler_seq_delim`
to `build_sampler_input` instead of the whole dict.
* `_merge_sampled_features`: gates the per-key block-suffix combine on
`k in self._sampler_seq_inputs` (a single set membership check
replacing the per-key dict lookup). Correct for both top-level and
grouped, and for mixed-attrs configs where a lookup feature exposes
both candidate-sequence sub-features and unrelated item-side attrs.
`tzrec/datasets/utils.py`:
* `build_sampler_input(...)` signature change from
`seq_field_delims: Dict[str, str]` to `seq_delim: str`. Docstring
updated to name both top-level `sequence_id_feature` and grouped
sub-feature cases (the master version mentioned only top-level).
DSSM/MIND/TDM behaviour unchanged: when `item_id_field` is a top-level
scalar, `_sampler_seq_delim` stays empty, `_sampler_seq_inputs` stays
empty, and every branch degrades to the pre-refactor scalar path.
Tests:
* `dataset_test.py::test_launch_sampler_cluster_multi_attr_strip_decision_matrix`
-- replace `_seq_field_delims` membership assertions with
`_sampler_seq_delim` / `_sampler_seq_inputs` checks on the same
mixed-attrs lookup_feature case (`cat_map` non-sequence item-side
attr stays excluded; `click_seq__cat_key` grouped sub-feature stays
included).
* `utils_test.py::test_build_sampler_input` -- update
`@parameterized.expand` rows to pass `seq_delim=` (str); rename the
"no entry" case to "empty_seq_delim_passthrough".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…equence candidate
HSTUMatch is the only model whose sampler config carried qualified
`attr_fields` (e.g. `attr_fields: "cand_seq__video_id"`), leaking
DataParser's `{sequence_name}__{sub_feature}` flatten convention into
user-facing config. Every other sampler config (DSSM/MIND/TDM) uses
the bare sub-feature name because their candidate is a top-level
feature where bare-name == flattened-name.
Translate at the dataset boundary via an explicit alias map:
_sampler_bare_attr_to_sequence_input: Dict[str, str]
For grouped-sequence candidates (when item_id_field's matching feature
is `is_grouped_sequence`), the map is built from the candidate-
sequence input names by stripping `feature.sequence_name + feature._underline`:
{"video_id": "cand_seq__video_id", ...}
`_resolve_sampler_attr_field(name) -> str` does
`map.get(name, name)`, and `launch_sampler_cluster` rewrites
`sampler_config.attr_fields` in place over a deep-copy of the
sampler sub-message (so `self._data_config` is not mutated). Three
cases handled by the alias-map's `.get(a, a)` fallthrough:
1. Bare grouped-sequence sub-feature name -> rewritten to qualified.
2. Already-qualified name -> no map entry -> passed through.
3. Top-level item-side attr from the same lookup feature
(e.g. `cat_map` excluded from `sequence_fields`) -> no map entry
-> passed through.
Flatten prefix uses `feature._underline` directly so RTP-safe ("_" vs
"__" never enumerated). Deep-copy guard is `if alias map is non-empty`,
so DSSM/MIND/TDM (alias map empty) skip the copy entirely.
Test config + doc switch to the new convention:
attr_fields: "video_id" (bare sub-feature name)
item_id_field: "cand_seq__video_id" (qualified; doubles as the
sequence_name source)
Add `dataset_test.py::test_launch_sampler_cluster_bare_attr_resolves_against_seq_prefix`
exercising the rewrite path end-to-end: asserts the alias map content,
that `self._data_config` is not mutated, and that the sampler sees the
qualified column name after `launch_sampler_cluster`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-path, flatten loop Addresses /simplify review on top of the prior two commits: * Trim the 17-line block comment above `_sampler_seq_*` init to 4 lines. Original block mostly paraphrased the code (WHAT); keep the load- bearing WHY (HSTUMatch bare-name convention + RTP-safe prefix). * Restore the scalar-config (DSSM/MIND/TDM) fast-path in `launch_sampler_cluster`: skip the `sampler_fields` list-comp when `_sampler_seq_inputs` is empty -- pass through `self.input_fields` unchanged like master did via the old guard. * Flatten the three-level init loop with `continue` guards. * Drop inaccurate `pyre-ignore [16]` on `feature._underline` -- the attribute is publicly defined on `BaseFeature.__init__` and other in-class call sites (e.g. `feature.py:473,749,787`) read it without any ignore. * Annotate `_sampler_seq_inputs` as `Set[str]` (was bare `set`), matching the `Dict[str, str]` style on the next line. No behavior change. Tests stay green (`tzrec.datasets.dataset_test`, `tzrec.datasets.utils_test`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… focused test
The pre-existing `test_launch_sampler_cluster_multi_attr_strip_decision_matrix`
still wrote `attr_fields=["cat_map", "click_seq__cat_key"]` -- the OLD
qualified form for the sequence sub-feature, the very leak this PR is
fixing. Update to `["cat_map", "cat_key"]` (bare for the sub-feature,
top-level item-side `cat_map` unchanged), which both:
* Demonstrates the canonical user-facing form under the new convention.
* Exercises the alias-map rewrite (`cat_key` -> `click_seq__cat_key`)
end-to-end alongside the fallthrough (`cat_map` has no alias entry).
Now-redundant assertions from the focused
`test_launch_sampler_cluster_bare_attr_resolves_against_seq_prefix`
test fold into the multi_attr test:
* `_sampler_bare_attr_to_sequence_input == {"cat_key": "click_seq__cat_key"}`
(alias-map structural check).
* `list(data_config.negative_sampler.attr_fields) == ["cat_map", "cat_key"]`
after launch (deep-copy guard: `self._data_config` not mutated).
* `assertNotIn("cat_key", _sampler._attr_names)` (rewrite replaces,
doesn't append).
Drop the focused test; the multi_attr test is now the comprehensive
end-to-end check (rewrite + fallthrough + outer-list strip in one).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ly grouped item_id_field supported
Domain invariant the prior rounds over-engineered against: when
`item_id_field` is sequence-positive, `attr_fields` are uniformly
sequence too -- they're all item-side per-positive, and the per-row
positive layout makes any non-sequence item-side attr structurally
incoherent. Real configs confirm: every NegativeSampler /
HardNegativeSampler config in `tzrec/tests/configs/` and `examples/`
is either all-scalar (DSSM/MIND/TDM) or uniformly grouped-sequence
(HSTUMatch).
Collapse to two fields driven by that invariant:
_sampler_seq_delim -- "" if not in sequence mode; the matching
grouped sequence's `sequence_delim` otherwise.
_sampler_seq_prefix -- "" if not in sequence mode; the flatten
prefix `f"{sequence_name}{_underline}"`
otherwise. Replaces both `_sampler_seq_inputs`
(set; dropped) and
`_sampler_bare_attr_to_sequence_input` (dict;
dropped) from the prior rounds.
Rewrite rule shrinks to one line -- unconditional prepend:
sampler_config.attr_fields[:] = [
self._sampler_seq_prefix + a
for a in sampler_config.attr_fields
]
No `startswith` guard for the legacy qualified form -- this PR
establishes bare names as the single canonical convention, and the
migration burden is small (HSTUMatch is the only model that ever wrote
the qualified form; that config is flipped in commit ffae57b). Legacy
qualified attr_fields now fail loud at sampler init (the double-
prefixed column doesn't exist).
`launch_sampler_cluster`'s outer-list strip narrows its filter from
`_sampler_seq_inputs` to `set(sampler_config.attr_fields)` -- in every
real config `item_id_field` is one of the `attr_fields` entries
(post-rewrite), so no separate union is needed.
`_merge_sampled_features` collapses to a single delim gate
(`if not self._sampler_seq_delim`); the Round 1 per-key
`_sampler_seq_inputs` membership check was defensive complexity for
the would-be mixed-attrs case that doesn't exist in real configs.
Top-level `sequence_id_feature` as `item_id_field` is now rejected via
an `assert feature.is_grouped_sequence`: this shape is unused by any
current config, and migrating the two `_TestReader` tests that posited
it (`test_dataset_with_sampler_list_item_id`,
`test_dataset_with_hard_negative_sampler_list_item_id`) to the grouped
style matches how multi-positive sampling actually works in
production.
`_resolve_sampler_attr_field` helper dropped (one-line `.get` on a
now-removed dict). `Set` import dropped (no other uses).
`test_launch_sampler_cluster_multi_attr_strip_decision_matrix` renamed
to `test_launch_sampler_cluster_grouped_sequence_strip_and_rewrite`
and reframed: drops the synthetic `cat_map`-in-attr_fields scenario
(mixed-attrs configs don't exist in production), keeps `cat_map` as a
parquet column NOT in attr_fields to verify strip filter precision,
adds the prefix assertion `_sampler_seq_prefix == "click_seq__"`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comments + docstrings I added in earlier rounds had paragraph-style explanations. Trim to one-liners per `feedback_concise_inline_comments`: * `BaseDataset.__init__` sampler-seq state init: 7-line block -> 1 line. * `launch_sampler_cluster` rewrite block: 5-line comment -> 1 line. * `launch_sampler_cluster` strip block: 7-line comment -> 1 line. * `_merge_sampled_features` docstring: 7 lines -> 4 lines. * `test_launch_sampler_cluster_grouped_sequence_strip_and_rewrite` docstring: 12 lines -> 5; trim three WHAT-style assertion comments. No code-behavior change; tests stay green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… sub-feature types
`test_launch_sampler_cluster_grouped_sequence_strip_and_rewrite` used
a single LookupFeature sub with `sequence_fields=["cat_key"]` to
construct one `list<list<int64>>` candidate input. The synthetic
shape (multi-value layered under multi-positive) doesn't reflect
HSTUMatch's actual usage where the candidate sequence carries
multiple sub-features of varied types (id, raw, ...).
Replace with three sub-features under a single `sequence_feature`:
* `id_feature item_id` -> `click_seq__item_id: list<int64>`
* `id_feature cat_id` -> `click_seq__cat_id: list<int64>`
* `raw_feature watch_time` -> `click_seq__watch_time: list<float32>`
`attr_fields=["item_id", "cat_id", "watch_time"]` exercises the
prefix-prepend rewrite across all three; the strip filter scopes
correctly per type (int64 / int64 / float32).
Drops the `sequence_fields` complexity (LookupFeature-only knob) and
the `cat_map` parquet column (the `assertNotIn("cat_map", ...)`
assertion was checking sampler-side state unrelated to the strip
filter's behavior).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f57e06d to
b0213f8
Compare
| assert feature.is_grouped_sequence, ( | ||
| f"item_id_field '{self._sampler_item_id_field}' is a " | ||
| f"sequence input but its matching feature is not a " | ||
| f"grouped sequence; only grouped sequence sub-features " | ||
| f"are supported as item_id_field." | ||
| ) |
There was a problem hiding this comment.
This is user-facing config validation, not an internal invariant — python -O strips the assert, after which feature.sequence_name (defaulted to None at feature.py:427 for non-grouped features) flows into feature.sequence_name + feature._underline two lines below and raises TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'. A confusing failure mode for what's a clear config error. Convert to raise ValueError(...) to match the explicit-raise pattern used elsewhere in this file (e.g. dataset.py:260).
| assert feature.is_grouped_sequence, ( | |
| f"item_id_field '{self._sampler_item_id_field}' is a " | |
| f"sequence input but its matching feature is not a " | |
| f"grouped sequence; only grouped sequence sub-features " | |
| f"are supported as item_id_field." | |
| ) | |
| if not feature.is_grouped_sequence: | |
| raise ValueError( | |
| f"item_id_field '{self._sampler_item_id_field}' is a " | |
| "sequence input but its matching feature is not a " | |
| "grouped sequence; only grouped sequence sub-features " | |
| "are supported as item_id_field." | |
| ) |
Worth pairing with a small negative test (assertRaisesRegex) so the contract is locked.
| @@ -708,23 +715,35 @@ def test_launch_sampler_cluster_multi_attr_strip_decision_matrix(self): | |||
| sequence_delim=";", | |||
| features=[ | |||
| feature_pb2.SeqFeatureConfig( | |||
| lookup_feature=feature_pb2.LookupFeature( | |||
| feature_name="lookup_c", | |||
| map="item:cat_map", | |||
| key="item:cat_key", | |||
| sequence_fields=["cat_key"], | |||
| id_feature=feature_pb2.IdFeature( | |||
| feature_name="item_id", | |||
| expression="item:item_id", | |||
| num_buckets=200, | |||
| embedding_dim=8, | |||
| ) | |||
| ), | |||
| feature_pb2.SeqFeatureConfig( | |||
| id_feature=feature_pb2.IdFeature( | |||
| feature_name="cat_id", | |||
| expression="item:cat_id", | |||
| num_buckets=10, | |||
| embedding_dim=8, | |||
| ) | |||
| ), | |||
| feature_pb2.SeqFeatureConfig( | |||
| raw_feature=feature_pb2.RawFeature( | |||
| feature_name="duration", | |||
| expression="item:duration", | |||
| ) | |||
| ), | |||
| ], | |||
| ) | |||
| ), | |||
| ] | |||
| features = create_features( | |||
| feature_cfgs, | |||
| fg_mode=data_pb2.FgMode.FG_NORMAL, | |||
| neg_fields=["cat_map", "cat_key"], | |||
| neg_fields=["item_id", "cat_id", "duration"], | |||
| force_base_data_group=True, | |||
| ) | |||
| dataset = _TestDataset( | |||
| @@ -736,8 +755,9 @@ def test_launch_sampler_cluster_multi_attr_strip_decision_matrix(self): | |||
| negative_sampler=sampler_pb2.NegativeSampler( | |||
| input_path=f.name, | |||
| num_sample=4, | |||
| attr_fields=["cat_map", "click_seq__cat_key"], | |||
| item_id_field="click_seq__cat_key", | |||
| # bare names; prefix-rewritten at launch | |||
| attr_fields=["item_id", "cat_id", "duration"], | |||
| item_id_field="click_seq__item_id", | |||
| ), | |||
| force_base_data_group=True, | |||
| ), | |||
| @@ -746,22 +766,23 @@ def test_launch_sampler_cluster_multi_attr_strip_decision_matrix(self): | |||
| input_fields=input_fields, | |||
| mode=Mode.TRAIN, | |||
| ) | |||
| # Narrowed _seq_field_delims excludes the non-sequence item-side input. | |||
| self.assertIn("click_seq__cat_key", dataset._seq_field_delims) | |||
| self.assertNotIn("cat_map", dataset._seq_field_delims) | |||
| self.assertEqual(dataset._sampler_seq_delim, ";") | |||
| self.assertEqual(dataset._sampler_seq_prefix, "click_seq__") | |||
|
|
|||
| dataset.launch_sampler_cluster(2) | |||
| # outer guard True (item_id_field is sequence-positive): | |||
| # - cat_key: list<list<int64>> -> list<int64> (one strip). | |||
| # - cat_map: list<string>, not in _seq_field_delims -> unstripped. | |||
| cat_key_idx = dataset._sampler._attr_names.index("click_seq__cat_key") | |||
| cat_map_idx = dataset._sampler._attr_names.index("cat_map") | |||
| self.assertEqual( | |||
| dataset._sampler._attr_types[cat_key_idx], pa.list_(pa.int64()) | |||
| ) | |||
| # Deep-copy guard: data_config not mutated by the rewrite. | |||
| self.assertEqual( | |||
| dataset._sampler._attr_types[cat_map_idx], pa.list_(pa.string()) | |||
| list(dataset._data_config.negative_sampler.attr_fields), | |||
| ["item_id", "cat_id", "duration"], | |||
| ) | |||
| # Each bare entry prefix-rewritten and outer-list stripped. | |||
| for qualified, expected_type in [ | |||
| ("click_seq__item_id", pa.int64()), | |||
| ("click_seq__cat_id", pa.int64()), | |||
| ("click_seq__duration", pa.float32()), | |||
| ]: | |||
| idx = dataset._sampler._attr_names.index(qualified) | |||
| self.assertEqual(dataset._sampler._attr_types[idx], expected_type) | |||
There was a problem hiding this comment.
Two coverage gaps worth a follow-up commit:
-
No negative test for the
assert feature.is_grouped_sequencepath atdataset.py:176. A 5-lineassertRaisesRegexconstruction with a top-levelsequence_id_feature+ matchingitem_id_fieldwould lock the rejection — without it, the assertion can be silently weakened in a future refactor (and would be invisible underpython -O; see the related comment on the assert itself). -
No
large_listcoverage in this test. The strip predicate atdataset.py:227handles bothis_listandis_large_list, but every input field here ispa.list_(...). Adding a singlepa.large_list(pa.int64())sub-feature row would cover the second arm.
Review summaryNet read: clean simplification. Collapsing the Filed 4 inline comments. Highest-value ones:
Doc note at 🤖 Generated with Claude Code |
… -> raise ValueError
Two PR-review fixes on top of the Round 3 prefix collapse:
1. BaseDataset's `assert feature.is_grouped_sequence, ...` is user-
facing config validation, not an internal invariant. Under
`python -O`/`-OO` the assert is stripped and the next line --
`feature.sequence_name + feature._underline` -- runs unconditionally;
for the rejected top-level sequence_id_feature shape `sequence_name`
is None (feature.py:427), producing `TypeError: unsupported operand
type(s) for +: 'NoneType' and 'str'`. Convert to `raise ValueError`
matching the explicit-raise pattern at dataset.py:260.
2. `feature._underline` (feature.py:426) is a private attribute; its
prior call sites at feature.py:473, 749, 787 were all in-class.
The Round 3 cross-module read from dataset.py is the first one
crossing the underscore boundary. Add a public `grouped_sequence_prefix`
property on BaseFeature next to `is_grouped_sequence`:
@Property
def grouped_sequence_prefix(self) -> str:
return f"{self.sequence_name}{self._underline}" if self._is_grouped_seq else ""
Empty-string for non-grouped features so callers can use it
unconditionally. Migrate the dataset.py read AND the three in-class
call sites:
* feature.py:469-475 -- `name` property: the if/else for grouped vs
not collapses to one line via the property's "" fallback.
* feature.py:749 -- _is_sequence_input: prefix construction.
* feature.py:787 -- side_inputs: seq_prefix construction.
The explicit raise from (1) stays as the source of truth for config
validation (the property's "" fallback only protects against the
None+str TypeError; without the explicit raise a non-grouped item_id
would silently flow through with empty prefix + set delim, producing
a half-configured sequence path that fails further downstream).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Refactor
BaseDatasetsonegative_sampler.attr_fieldsuses bare sub-feature names (e.g."video_id") whenitem_id_fieldis a grouped sequence sub-feature, matching the convention every other sampler config (DSSM/MIND/TDM) already uses. HSTUMatch was the outlier writing the qualified flattened form ("cand_seq__video_id"), leakingDataParser's flatten convention into user-facing config.Design
Two
BaseDatasetfields, both empty unlessitem_id_fieldis a grouped sequence sub-feature:_underline(the in-feature separator,_under RTP /__otherwise) is read off the matching feature — never hardcoded — so the rewrite is RTP-safe.launch_sampler_clusterprepends the prefix to everyattr_fieldsentry (deep-copying the sampler sub-message sodata_configisn't mutated):Unconditional prepend — no
startswithguard. Bare is the single canonical form; legacy qualified configs ("cand_seq__video_id"from #506) fail loud at sampler init.The outer-list strip filter scopes to
set(sampler_config.attr_fields)(post-rewrite)._merge_sampled_featuresgates the block-suffix combine on_sampler_seq_delimalone.Domain invariant the design relies on
When
item_id_fieldis a grouped sequence sub-feature,attr_fieldsare uniformly sequence-positive — they're all item-side per-positive. There's no "mixed" case where some entries are sequence and others are top-level scalars; the per-row multi-positive layout makes any non-sequence item-side attr structurally incoherent. Confirmed by grepping every config intzrec/tests/configs/+examples/. Top-levelsequence_id_featureasitem_id_fieldis rejected via anassert feature.is_grouped_sequence(unused by any current config; the two_TestReadertests that posited it are migrated to grouped style).Config + doc update
negative_sampler { input_path: "data/test/kuairand-1k-match-item-gl.txt" - attr_fields: "cand_seq__video_id" + attr_fields: "video_id" item_id_field: "cand_seq__video_id" }docs/source/models/hstu_match.mdgains a one-paragraph note:item_id_fieldis qualified (it doubles as the sequence-prefix source),attr_fieldsis bare.Test plan
python -m unittest tzrec.datasets.dataset_test tzrec.datasets.utils_test tzrec.datasets.sampler_test— green on local A10.python -m unittest tzrec.tests.match_integration_test.MatchIntegrationTest.test_hstu_with_fg_train_eval— green on local A10 (~193s).pre-commit run— clean on all touched files.Source
Slice of #518 (
feat/hstu-match-scalar-item-export), reimplemented from scratch with simplification driven by the domain invariant: a uniform prefix string captures the boundary, no per-attr branching, no alias dict, no defensive set probes. The HSTUMatchItemTower scalar export view from #518 is orthogonal and deferred to a follow-up PR.🤖 Generated with Claude Code