-
Notifications
You must be signed in to change notification settings - Fork 112
Allow custom check messages #1092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
c28d041
6721317
ab9a4ff
05baafb
468f229
e6fe9e5
0c2132f
c08cf31
b46ca2b
10313ce
e13b0ef
580a4af
6ec79a5
050c1d7
7ed09c7
52cd2df
4560ac6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -151,9 +151,11 @@ def _build_result_struct(self, condition: Column, skipped: bool = False) -> Colu | |
| # or use literal run time if explicitly overridden | ||
| run_time_expr = F.current_timestamp() if self.run_time_overwrite is None else F.lit(self.run_time_overwrite) | ||
|
|
||
| message_col = self._build_message_col(condition) | ||
|
|
||
| return F.struct( | ||
| F.lit(self.check.name).alias("name"), | ||
| condition.alias("message"), | ||
| message_col.alias("message"), | ||
| self.check.columns_as_string_expr.alias("columns"), | ||
| F.lit(self.check.filter or None).cast("string").alias("filter"), | ||
| F.lit(self.check.check_func.__name__).alias("function"), | ||
|
|
@@ -167,6 +169,28 @@ def _build_result_struct(self, condition: Column, skipped: bool = False) -> Colu | |
| F.lit(skipped or None).alias("skipped"), | ||
|
mwojtyczka marked this conversation as resolved.
|
||
| ).cast(dq_result_item_schema) | ||
|
|
||
| def _build_message_col(self, condition: Column) -> Column: | ||
| """ | ||
| Builds the message column, using the default message or the user-supplied | ||
| ``message_expr`` from the rule definition. The expression is evaluated as-is — DQX | ||
| does not substitute placeholders. Accepts either a Spark SQL expression string or a | ||
| Spark Column. | ||
|
|
||
| Args: | ||
| condition: Default DQX condition message returned by evaluating the DQX check function | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When Passing as struct would be more flexible, but it will add significant complexity for users when defining the message func so maybe just document this. |
||
|
|
||
| Returns: | ||
| The custom DQX condition message if ``message_expr`` is set on the rule, otherwise the | ||
| default DQX condition message. | ||
| """ | ||
| if self.check.message_expr is None: | ||
| return condition | ||
|
|
||
| custom_message = ( | ||
| self.check.message_expr if isinstance(self.check.message_expr, Column) else F.expr(self.check.message_expr) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Arbitrary Spark SQL evaluation of strings from non-trusted sources.
- check: { function: is_not_null, arguments: { column: id } }
message_expr: "concat('hi ', (select secret_key from secrets.config limit 1))"Not a classic SQL injection (Spark
The project's CLAUDE.md security policy states:
Options:
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. UX trap: a plain string passes through A new user will write: DQRowRule(..., message_expr="Email must not be null")
The docs example correctly shows Options:
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Check if it is a string, to avoid problems with spark Column vs spark connect Column |
||
| ) | ||
| return F.when(condition.isNotNull(), custom_message).otherwise(F.lit(None).cast("string")) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No length cap on rendered message. A pathological expression like Consider wrapping with |
||
|
|
||
| def _get_invalid_cols_message(self) -> str: | ||
| """ | ||
| Returns invalid columns message containing info about invalid columns to check should be applied to or filter. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -165,6 +165,11 @@ class DQRule(abc.ABC, DQRuleTypeMixin, SingleColumnMixin, MultipleColumnsMixin): | |
| * *check_func_args* (optional) - Positional arguments for the check function (excluding *column*). | ||
| * *check_func_kwargs* (optional) - Keyword arguments for the check function (excluding *column*). | ||
| * *user_metadata* (optional) - User-defined key-value pairs added to metadata generated by the check. | ||
| * *message_expr* (optional) - User-defined expression used as the check failure message. Accepts either | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No placeholder support for The PR description lists these as expected context, but the implementation evaluates the expression as-is. Users who want a message like This will be a frequent feature request as a follow-up. Worth either:
|
||
| a Spark SQL expression string or a Spark *Column* expression. The expression is evaluated as-is. | ||
| Any column references, casts, or rule-identifying literals must be supplied directly by the caller | ||
| (e.g., ``F.concat(F.lit('age_positive: value '), F.col('age').cast('string'))`` or | ||
| ``"concat('age_positive: value ', cast(age as string))"``). | ||
| """ | ||
|
|
||
| check_func: Callable | ||
|
|
@@ -176,6 +181,7 @@ class DQRule(abc.ABC, DQRuleTypeMixin, SingleColumnMixin, MultipleColumnsMixin): | |
| check_func_args: list[Any] = field(default_factory=list) | ||
| check_func_kwargs: dict[str, Any] = field(default_factory=dict) | ||
| user_metadata: dict[str, str] | None = None | ||
| message_expr: str | Column | None = None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Overloaded parameter type — consider splitting before this becomes public API.
Not a blocker — your current shape works — but worth deciding now since changing the field name later is a breaking change. |
||
|
|
||
| def __post_init__(self): | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add best-effort A typo'd SQL string ( if isinstance(self.message_expr, str) and self.message_expr:
try:
F.expr(self.message_expr)
except Exception as exc:
raise InvalidParameterError(
f"message_expr is not a valid Spark SQL expression: {exc}. "
f"Note: literal strings must be wrapped in SQL quotes, e.g. \"'my message'\"."
) from excCheap (build-only — no DataFrame eval), and the error message hints at the unquoted-string trap in the previous comment. |
||
| self._validate_rule_type(self.check_func) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. MEDIUM - Type annotation: The field is |
||
|
|
@@ -259,6 +265,10 @@ def to_dict(self) -> dict: | |
|
|
||
| if self.user_metadata: | ||
| metadata["user_metadata"] = self.user_metadata | ||
| # Only string expressions can be round-tripped through metadata; Column objects are | ||
| # in-process Spark expressions with no canonical YAML/JSON representation. | ||
| if isinstance(self.message_expr, str) and self.message_expr: | ||
| metadata["message_expr"] = self.message_expr | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
A user who passes Options:
Logging is friendlier; raising forces the user to make an explicit choice. |
||
| return metadata | ||
|
|
||
| def _initialize_column_if_missing(self): | ||
|
|
@@ -428,6 +438,7 @@ class DQForEachColRule(DQRuleTypeMixin): | |
| check_func_args: list[Any] = field(default_factory=list) | ||
| check_func_kwargs: dict[str, Any] = field(default_factory=dict) | ||
| user_metadata: dict[str, str] | None = None | ||
| message_expr: str | Column | None = None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The integration test correctly notes this ("The same expression is used for every generated rule. To produce a per-column message, reference each column inline"), but the field itself has no docstring or warning. A user expecting per-column message substitution will be surprised when all generated rules emit identical messages — and won't know the workaround. Add a one-line comment on the field describing the shared-expression behaviour, ideally with the inline-reference workaround pointer: # Shared across every generated rule; reference column names inline (e.g. "concat('a=', cast(a as string))")
# if you need per-column messages, or construct rules individually instead of via DQForEachColRule.
message_expr: str | Column | None = None |
||
|
|
||
| def get_rules(self) -> list[DQRule]: | ||
| """Build a list of rules for a set of columns. | ||
|
|
@@ -453,6 +464,7 @@ def get_rules(self) -> list[DQRule]: | |
| criticality=self.criticality, | ||
| filter=self.filter, | ||
| user_metadata=self.user_metadata, | ||
| message_expr=self.message_expr, | ||
| ) | ||
| ) | ||
| else: # default to row-level rule | ||
|
|
@@ -467,6 +479,7 @@ def get_rules(self) -> list[DQRule]: | |
| criticality=self.criticality, | ||
| filter=self.filter, | ||
| user_metadata=self.user_metadata, | ||
| message_expr=self.message_expr, | ||
| ) | ||
| ) | ||
| return rules | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.