Commit 81033e6
feat(config): add deterministic fingerprint for workflow configs (#587)
* feat(config): add deterministic fingerprint for workflow configs (#584)

  Provides DataDesignerConfig.fingerprint() and a freestanding fingerprint_config() helper that produce a content-addressable sha256 hash of the data-relevant portion of a workflow config. Identical configs hash identically across processes and Python versions; fields that don't affect generated rows (tool_configs, profilers, skip_health_check, max_parallel_requests, timeout, HuggingFace seed token/endpoint) are excluded.

  Custom column generators contribute their registered name and generator_params (L1) by default; opt-in custom_column_source=True also hashes inspect.getsource() of each generator (L2) and degrades gracefully with a warning when the source can't be retrieved. The normalization scheme is versioned via CONFIG_HASH_VERSION so future changes can be detected as "unknown identity" rather than a mismatch.

* test(config): cover constraints, processors, extra_body, provider, and seed strategies in fingerprint tests

  Also document the L1 __name__-collision and L2 whitespace-sensitivity limitations in fingerprint_config(), and drop the json.dumps default=str fallback so non-JSON-native values fail loudly instead of silently degrading determinism.

* feat(config): include tool_configs in fingerprint identity

  The set of MCP tools an LLM column can call (providers, allow_tools, max_tool_call_turns, tool_alias) shapes what the model produces, so tool_configs is identity-relevant. Only timeout_sec is excluded, mirroring how inference_parameters.timeout is treated as a runtime knob rather than a data-identity field.

  Updates the fingerprint_config docstring's Include/Exclude lists, flips the existing tool_configs exclusion test, and adds coverage for tool_alias / providers / allow_tools / max_tool_call_turns inclusion plus timeout_sec exclusion.

  Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
  Made-with: Cursor

* no need to export config hash stuff to config

* refactor(config): tighten fingerprint identity, drop L2 source hashing

  Drops the opt-in `custom_column_source` (L2) source-hashing path and addresses the canonicalization gaps the reviewers found. L2 had several silent footguns: closures with different captured state collapsed to the same hash; the empty `custom_column_sources: []` payload key made L1 and L2 disagree even on configs with no custom columns; `inspect.unwrap()` could raise `ValueError` on `__wrapped__` cycles (uncaught); and same-`__name__` collisions silently came back when `getsource()` failed. Removing it shrinks the public surface, deletes ~50 lines of helper code, and resolves seven review comments at once.

  Strengthens L1 identity for custom columns: the payload now includes `__qualname__`, `__module__`, and the `@custom_column_generator()` decorator metadata (`required_columns`, `side_effect_columns`, `model_aliases`) in addition to `__name__` + `generator_params`. This disambiguates same-`__name__`-different-scope collisions and prevents silently dropping DAG-affecting metadata.

  Canonicalizes alias-keyed lookup tables and optional collections so builder-API and YAML-loaded configs producing identical datasets fingerprint identically:

  * `model_configs` and `tool_configs` are sorted by alias before hashing (column order remains identity, since columns are DAG nodes).
  * `None` and `[]` collapse to "absent" for top-level optional collections (`model_configs`, `tool_configs`, `constraints`, `processors`) and for `tool_configs[*].allow_tools`.

  Consolidates the excluded-fields constants behind a single canonical table comment and drops the Sphinx `:func:`/`:class:` roles in the docstrings to match the rest of the codebase.

  Test coverage adds order-independence tests for `model_configs` and `tool_configs`, parametrized `None`-vs-`[]` equivalence tests for all four optional top-level collections plus `allow_tools`, qualname disambiguation, and decorator-metadata change detection.

  Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
  Made-with: Cursor

* test(fingerprint): rename _hash helper to _compute_hash

  Function names should be action words; `_hash` is a noun. Rename the test-only helper to `_compute_hash` to match its verb-form behavior (it computes a hash from a config). No behavioral change.

  Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
  Made-with: Cursor

* test(fingerprint): pin closure-capture limitation; restore test names

  The previous _hash -> _compute_hash blanket rename also caught the test names that happen to end in "_hash()" (e.g. test_changing_X_changes_hash). "hash" is a noun there: it describes what the test is about, not the helper being called. Restore the original names; only the helper itself stays renamed.

  Add `test_closure_captured_state_is_a_known_limitation` per @johnnygreco's approval follow-up: factory-built closures with different captured state share __name__/__qualname__/__module__/source and so fingerprint identically. Pin that behavior so a future change either keeps the limitation or has to delete the matching docstring paragraph in lockstep.

  Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
  Made-with: Cursor

---------

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
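The hashing scheme the first commit describes (canonical JSON with sorted keys and compact separators, no `default=` fallback) can be sketched independently of the library; the helper name below is illustrative, not part of the package:

```python
import hashlib
import json


def sketch_fingerprint(payload: dict) -> str:
    """Hash a JSON-native payload deterministically (illustrative sketch)."""
    # sort_keys + compact separators yield one canonical byte string per
    # payload; omitting `default=` means non-JSON-native values raise a
    # TypeError instead of silently hashing a repr() that may embed
    # memory addresses and break determinism.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Dict key order doesn't matter once keys are sorted...
assert sketch_fingerprint({"a": 1, "b": 2}) == sketch_fingerprint({"b": 2, "a": 1})
# ...but list order does, which is why column order stays identity-relevant.
assert sketch_fingerprint({"cols": [1, 2]}) != sketch_fingerprint({"cols": [2, 1]})
```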
1 parent 47c72b3 commit 81033e6

3 files changed

Lines changed: 777 additions & 0 deletions

File tree

packages/data-designer-config/src/data_designer/config/data_designer_config.py

Lines changed: 14 additions & 0 deletions
@@ -10,6 +10,7 @@
 from data_designer.config.analysis.column_profilers import ColumnProfilerConfigT
 from data_designer.config.column_types import ColumnConfigT
 from data_designer.config.exportable_config import ExportableConfigBase
+from data_designer.config.fingerprint import fingerprint_config
 from data_designer.config.mcp import ToolConfig
 from data_designer.config.models import ModelConfig
 from data_designer.config.processor_types import ProcessorConfigT
@@ -42,3 +43,16 @@ class DataDesignerConfig(ExportableConfigBase):
     constraints: list[ColumnConstraintInputT] | None = None
     profilers: list[ColumnProfilerConfigT] | None = None
     processors: list[Annotated[ProcessorConfigT, Field(discriminator="processor_type")]] | None = None
+
+    def fingerprint(self) -> dict[str, str | int]:
+        """Compute a deterministic content-addressable fingerprint of this config.
+
+        See `data_designer.config.fingerprint.fingerprint_config` for the full
+        list of identity-relevant and excluded fields, and how custom column
+        generators are identified.
+
+        Returns:
+            A dict with `config_hash`, `config_hash_algo`, and
+            `config_hash_version`.
+        """
+        return fingerprint_config(self)
packages/data-designer-config/src/data_designer/config/fingerprint.py (new file)

Lines changed: 215 additions & 0 deletions
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Deterministic content-addressable fingerprint for a workflow config.

The fingerprint identifies the *data-relevant* portion of a `DataDesignerConfig`
so that two configs producing the same dataset hash to the same value: configs
differing only in environment, runtime, or post-generation analysis fields
hash identically, while any change to a data-relevant field changes the hash.

The hash is computed over a canonical JSON dump of the config (Pydantic
`model_dump(mode="json")`) with non-identity fields removed. Column order is
part of identity (DAG ordering); alias-keyed lookup tables (`model_configs`,
`tool_configs`) are sorted by alias so their internal order is irrelevant.
Empty/`None` optional collections are canonicalized to a single representation
so that builder-API and YAML-loaded configs producing identical datasets
fingerprint identically.

The normalization scheme is versioned via `CONFIG_HASH_VERSION`. Persist the
version alongside the hash so future scheme changes can be detected as
"unknown identity" rather than "definite mismatch".
"""

from __future__ import annotations

import hashlib
import json
from collections.abc import Iterable
from typing import TYPE_CHECKING, Any

from data_designer.config.column_configs import CustomColumnConfig

if TYPE_CHECKING:
    from data_designer.config.data_designer_config import DataDesignerConfig

CONFIG_HASH_VERSION = 1
CONFIG_HASH_ALGO = "sha256"


# ---------------------------------------------------------------------------
# Excluded fields (single canonical table). Each entry is excluded from the
# fingerprint because it doesn't affect generated rows:
#
#   profilers                               : post-generation analysis
#   model_configs[*].skip_health_check      : startup probe, not generation
#   inference_parameters.{max_parallel_requests, timeout}
#                                           : concurrency / timing only
#   tool_configs[*].timeout_sec             : per-call timing knob
#   HuggingFaceSeedSource.{token, endpoint}
#                                           : auth + env, not data identity
# ---------------------------------------------------------------------------
_EXCLUDED_TOP_LEVEL_KEYS: frozenset[str] = frozenset({"profilers"})
_EXCLUDED_MODEL_KEYS: frozenset[str] = frozenset({"skip_health_check"})
_EXCLUDED_INFERENCE_KEYS: frozenset[str] = frozenset({"max_parallel_requests", "timeout"})
_EXCLUDED_TOOL_CONFIG_KEYS: frozenset[str] = frozenset({"timeout_sec"})
_EXCLUDED_HF_SEED_KEYS: frozenset[str] = frozenset({"token", "endpoint"})

# Optional collections whose `None` and `[]` representations must collapse so
# that builder-API and YAML-loaded configs producing identical datasets
# fingerprint identically.
_TOP_LEVEL_OPTIONAL_COLLECTIONS: frozenset[str] = frozenset(
    {"model_configs", "tool_configs", "constraints", "processors"}
)
_TOOL_CONFIG_OPTIONAL_COLLECTIONS: frozenset[str] = frozenset({"allow_tools"})


# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------


def fingerprint_config(config: DataDesignerConfig) -> dict[str, str | int]:
    """Compute a deterministic fingerprint of a workflow config.

    The fingerprint is content-addressable: identical configs (modulo excluded
    fields) produce identical hashes across processes, Python versions, and
    module load orders. Changing any identity-relevant field changes the hash;
    changing an excluded field does not.

    Identity-relevant fields:

    * `columns` - names, types, generator params, processors, validators,
      skip/drop flags. Column order is part of identity (DAG ordering).
    * `model_configs` - alias, model, provider, sampling-relevant inference
      params (temperature, top_p, max_tokens, extra_body). Sorted by alias.
    * `tool_configs` - alias, providers, allow_tools, max_tool_call_turns
      (the set of MCP tools shapes generation). Sorted by tool_alias.
    * `seed_config` - source path, sampling strategy, selection strategy.
    * `constraints`, top-level `processors`.

    See module-level constants for the canonical excluded-fields table.

    Custom column generators contribute their function's `__name__`,
    `__qualname__`, `__module__`, `generator_params`, and the decorator
    metadata set by `@custom_column_generator()` (`required_columns`,
    `side_effect_columns`, `model_aliases`).

    Limitation: closures captured via factory functions (e.g. `make_gen(factor)`
    returning a `gen` whose body references `factor`) share `__name__`,
    `__qualname__`, `__module__`, and source text, so two closures with
    different captured state will fingerprint identically. The fingerprint
    cannot see closure cell values.

    Args:
        config: The workflow config to fingerprint.

    Returns:
        A dict with `config_hash` (`"sha256:..."`), `config_hash_algo`, and
        `config_hash_version` suitable for embedding in dataset metadata.
    """
    payload = _normalize_config_dict(config.to_dict(), config)
    # No `default=` fallback: a non-JSON-native value would silently break
    # determinism (e.g. a repr containing memory addresses).
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "config_hash": f"{CONFIG_HASH_ALGO}:{digest}",
        "config_hash_algo": CONFIG_HASH_ALGO,
        "config_hash_version": CONFIG_HASH_VERSION,
    }


# ---------------------------------------------------------------------------
# Private helpers
# ---------------------------------------------------------------------------


def _drop_keys(source: dict[str, Any], keys: Iterable[str]) -> dict[str, Any]:
    keyset = set(keys)
    return {k: v for k, v in source.items() if k not in keyset}


def _drop_empty_optional(source: dict[str, Any], keys: Iterable[str]) -> dict[str, Any]:
    """Drop keys whose value is `None` or an empty list.

    `None` and `[]` are user-equivalent for optional collection fields; this
    collapses both to "absent" before hashing.
    """
    keyset = set(keys)
    return {k: v for k, v in source.items() if not (k in keyset and (v is None or v == []))}


def _normalize_model_config(model_config: dict[str, Any]) -> dict[str, Any]:
    normalized = _drop_keys(model_config, _EXCLUDED_MODEL_KEYS)
    inference_params = normalized.get("inference_parameters")
    if isinstance(inference_params, dict):
        normalized["inference_parameters"] = _drop_keys(inference_params, _EXCLUDED_INFERENCE_KEYS)
    return normalized


def _normalize_tool_config(tool_config: dict[str, Any]) -> dict[str, Any]:
    normalized = _drop_keys(tool_config, _EXCLUDED_TOOL_CONFIG_KEYS)
    return _drop_empty_optional(normalized, _TOOL_CONFIG_OPTIONAL_COLLECTIONS)


def _normalize_seed_config(seed_config: dict[str, Any]) -> dict[str, Any]:
    normalized = dict(seed_config)
    seed_source = normalized.get("source")
    if isinstance(seed_source, dict) and seed_source.get("seed_type") == "hf":
        normalized["source"] = _drop_keys(seed_source, _EXCLUDED_HF_SEED_KEYS)
    return normalized


def _enrich_custom_columns(config: DataDesignerConfig, columns_dump: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Replace each custom column's serialized `generator_function` (just the
    bare `__name__`) with a richer identity dict that includes `__qualname__`,
    `__module__`, and the `@custom_column_generator()` decorator metadata.

    Walks `config.columns` and `columns_dump` in lockstep so positional
    correspondence is reliable.
    """
    enriched: list[dict[str, Any]] = []
    for col, dumped in zip(config.columns, columns_dump):
        if isinstance(col, CustomColumnConfig):
            fn = col.generator_function
            metadata = getattr(fn, "custom_column_metadata", {}) or {}
            dumped = {
                **dumped,
                "generator_function": {
                    "name": getattr(fn, "__name__", None),
                    "qualname": getattr(fn, "__qualname__", None),
                    "module": getattr(fn, "__module__", None),
                    "metadata": metadata,
                },
            }
        enriched.append(dumped)
    return enriched


def _normalize_config_dict(config_dict: dict[str, Any], config: DataDesignerConfig) -> dict[str, Any]:
    normalized = _drop_keys(config_dict, _EXCLUDED_TOP_LEVEL_KEYS)
    normalized = _drop_empty_optional(normalized, _TOP_LEVEL_OPTIONAL_COLLECTIONS)

    columns = normalized.get("columns")
    if columns:
        normalized["columns"] = _enrich_custom_columns(config, columns)

    model_configs = normalized.get("model_configs")
    if model_configs:
        normalized["model_configs"] = sorted(
            (_normalize_model_config(mc) for mc in model_configs),
            key=lambda mc: mc.get("alias", ""),
        )

    tool_configs = normalized.get("tool_configs")
    if tool_configs:
        normalized["tool_configs"] = sorted(
            (_normalize_tool_config(tc) for tc in tool_configs),
            key=lambda tc: tc.get("tool_alias", ""),
        )

    seed_config = normalized.get("seed_config")
    if seed_config:
        normalized["seed_config"] = _normalize_seed_config(seed_config)

    return normalized
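The `None`-vs-`[]` collapsing and alias sorting can be exercised standalone with plain dicts. The sketch below reimplements the drop-empty helper outside the package to show why builder-API configs (optional fields left as `None`) and YAML-loaded configs (empty lists) normalize to the same payload:

```python
from collections.abc import Iterable
from typing import Any


def drop_empty_optional(source: dict[str, Any], keys: Iterable[str]) -> dict[str, Any]:
    """Standalone mirror of the drop-empty canonicalization step."""
    keyset = set(keys)
    # A key counts as "absent" whether it is None or an empty list.
    return {k: v for k, v in source.items() if not (k in keyset and (v is None or v == []))}


OPTIONAL = {"model_configs", "tool_configs", "constraints", "processors"}

builder_style = {"columns": ["c"], "constraints": None, "processors": None}
yaml_style = {"columns": ["c"], "constraints": [], "processors": []}
assert drop_empty_optional(builder_style, OPTIONAL) == drop_empty_optional(yaml_style, OPTIONAL) == {"columns": ["c"]}

# Alias-keyed tables become order-insensitive once sorted by alias.
mcs = [{"alias": "b"}, {"alias": "a"}]
assert sorted(mcs, key=lambda mc: mc.get("alias", "")) == sorted(reversed(mcs), key=lambda mc: mc.get("alias", ""))
```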

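The closure-capture limitation pinned in the final test commit is easy to reproduce: factory-built closures that differ only in captured state share every name-based identity attribute, so any fingerprint built from those attributes cannot tell them apart (the factory below is illustrative):

```python
def make_gen(factor: int):
    def gen(row: dict) -> int:  # `factor` lives in a closure cell, invisible to name-based identity
        return row["x"] * factor
    return gen


g2, g3 = make_gen(2), make_gen(3)

# All name-based identity attributes coincide...
assert (g2.__name__, g2.__qualname__, g2.__module__) == (g3.__name__, g3.__qualname__, g3.__module__)
# ...yet the generators behave differently.
assert g2({"x": 1}) != g3({"x": 1})
```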