Implement column-level PII protection for sample collection #832
Closed
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,132 @@ | ||
| import json | ||
| | ||
| import pytest | ||
| from dbt_project import DbtProject | ||
| | ||
| SENSITIVE_COLUMN = "email" | ||
| SAFE_COLUMN = "order_count" | ||
| | ||
| SAMPLES_QUERY = """ | ||
| with latest_elementary_test_result as ( | ||
| select id | ||
| from {{{{ ref("elementary_test_results") }}}} | ||
| where lower(table_name) = lower('{test_id}') | ||
| order by created_at desc | ||
| limit 1 | ||
| ) | ||
| | ||
| select result_row | ||
| from {{{{ ref("test_result_rows") }}}} | ||
| where elementary_test_results_id in (select * from latest_elementary_test_result) | ||
| """ | ||
| | ||
| TEST_SAMPLE_ROW_COUNT = 5 | ||
| | ||
| | ||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_enabled(test_id: str, dbt_project: DbtProject): | ||
| """Test that PII columns are excluded when column-level PII protection is enabled""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: f"user{i}@example.com", SAFE_COLUMN: None} for i in range(10) | ||
| ] | ||
| | ||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "not_null", | ||
| test_args=dict(column_name=SAFE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_columns": True, | ||
| "pii_column_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
| | ||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
| | ||
| assert len(samples) == TEST_SAMPLE_ROW_COUNT | ||
| for sample in samples: | ||
| assert SENSITIVE_COLUMN not in sample | ||
| assert SAFE_COLUMN in sample | ||
| | ||
| | ||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_disabled(test_id: str, dbt_project: DbtProject): | ||
| """Test that all columns are included when column-level PII protection is disabled""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: f"user{i}@example.com", SAFE_COLUMN: None} for i in range(10) | ||
| ] | ||
| | ||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "not_null", | ||
| test_args=dict(column_name=SAFE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_columns": False, | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
| | ||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
| | ||
| assert len(samples) == TEST_SAMPLE_ROW_COUNT | ||
| for sample in samples: | ||
| assert SENSITIVE_COLUMN in sample | ||
| assert SAFE_COLUMN in sample | ||
| | ||
| | ||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_all_columns_pii(test_id: str, dbt_project: DbtProject): | ||
| """Test behavior when all columns are tagged as PII""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: f"user{i}@example.com", SAFE_COLUMN: i} for i in range(10) | ||
| ] | ||
| | ||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "not_null", | ||
| test_args=dict(column_name=SAFE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_columns": True, | ||
| "pii_column_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "pass" | ||
| | ||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
| | ||
| assert len(samples) == TEST_SAMPLE_ROW_COUNT | ||
| for sample in samples: | ||
| assert "_no_non_pii_columns" in sample | ||
| assert sample["_no_non_pii_columns"] == 1 | ||
| assert SENSITIVE_COLUMN not in sample | ||
| assert SAFE_COLUMN not in sample | ||
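
The quadruple braces in `SAMPLES_QUERY` are there because the string goes through Python's `str.format` before dbt renders it: each `{{{{` collapses to `{{`, so the Jinja `ref(...)` call survives while `{test_id}` is substituted. A minimal sketch of that behavior, using a shortened query and a made-up test id (`QUERY_TEMPLATE`, `render_query`, and `my_test` are assumptions for illustration, not part of the PR):

```python
# Shortened stand-in for SAMPLES_QUERY; only the brace handling matters here.
QUERY_TEMPLATE = """
select id
from {{{{ ref("elementary_test_results") }}}}
where lower(table_name) = lower('{test_id}')
"""


def render_query(test_id: str) -> str:
    """Substitute the test id; str.format turns '{{' into '{' and '}}' into '}'."""
    return QUERY_TEMPLATE.format(test_id=test_id)


rendered = render_query("my_test")  # "my_test" is a hypothetical test id
print(rendered)
```

After formatting, the string contains an ordinary `{{ ref("elementary_test_results") }}` expression, which is what dbt's Jinja renderer expects to see.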
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| {% macro get_pii_columns_from_parent_model(flattened_test) %} | ||
| {% set pii_columns = [] %} | ||
| | ||
| {% if not elementary.get_config_var('disable_samples_on_pii_columns') %} | ||
| {% do return(pii_columns) %} | ||
| {% endif %} | ||
| | ||
| {% set parent_model_unique_id = elementary.insensitive_get_dict_value(flattened_test, 'parent_model_unique_id') %} | ||
| {% set parent_model = elementary.get_node(parent_model_unique_id) %} | ||
| | ||
| {% if not parent_model %} | ||
| {% do return(pii_columns) %} | ||
| {% endif %} | ||
| | ||
| {% set column_nodes = parent_model.get("columns") %} | ||
| {% if not column_nodes %} | ||
| {% do return(pii_columns) %} | ||
| {% endif %} | ||
| | ||
| {% set pii_column_tags = elementary.get_config_var('pii_column_tags') %} | ||
| {% if pii_column_tags is string %} | ||
| {% set pii_column_tags = [pii_column_tags] %} | ||
| {% endif %} | ||
| | ||
| {% for column_node in column_nodes.values() %} | ||
| {% set config_dict = column_node.get('config', {}) %} | ||
| {% set config_tags = config_dict.get('tags', []) %} | ||
| {% set global_tags = column_node.get('tags', []) %} | ||
| {% set meta_dict = column_node.get('meta', {}) %} | ||
| {% set meta_tags = meta_dict.get('tags', []) %} | ||
| {% set all_column_tags = config_tags + global_tags + meta_tags %} | ||
| | ||
| {% for pii_tag in pii_column_tags %} | ||
| {% if pii_tag in all_column_tags %} | ||
| {% do pii_columns.append(column_node.get('name')) %} | ||
| {% break %} | ||
| {% endif %} | ||
| {% endfor %} | ||
| {% endfor %} | ||
| | ||
| {% do return(pii_columns) %} | ||
| {% endmacro %} |
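
The tag-matching logic of the macro above can be easier to follow outside Jinja. The following is a rough Python sketch of the same steps, not the project's implementation: `get_pii_columns` and its `enabled` parameter are hypothetical names standing in for the macro and the `disable_samples_on_pii_columns` config var, and `columns` mimics the `columns` dict of a dbt model node.

```python
from typing import Any, Dict, List, Union


def get_pii_columns(
    columns: Dict[str, Dict[str, Any]],
    pii_column_tags: Union[str, List[str]],
    enabled: bool = True,
) -> List[str]:
    """Return the names of columns that carry at least one PII tag."""
    # Mirrors the early returns: feature disabled or no column metadata.
    if not enabled or not columns:
        return []

    # A single tag given as a string is treated as a one-element list.
    if isinstance(pii_column_tags, str):
        pii_column_tags = [pii_column_tags]

    pii_columns = []
    for column in columns.values():
        # Tags are collected from config.tags, top-level tags, and meta.tags.
        all_tags = (
            column.get("config", {}).get("tags", [])
            + column.get("tags", [])
            + column.get("meta", {}).get("tags", [])
        )
        # Exact, case-sensitive match; one hit is enough (the macro's break).
        if any(tag in all_tags for tag in pii_column_tags):
            pii_columns.append(column.get("name"))
    return pii_columns


# Hypothetical column metadata shaped like the test fixtures in this PR.
columns = {
    "email": {"name": "email", "config": {"tags": ["pii"]}},
    "order_count": {"name": "order_count"},
}
print(get_pii_columns(columns, "pii"))
```

Under these assumptions only `email` is reported, which matches what `test_column_pii_sampling_enabled` expects to be dropped from the samples.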
💡 Verification agent
🧩 Analysis chain
Verify test compatibility with skipped database adapter.
The tests are marked to skip on ClickHouse. Ensure this is necessary and document why this adapter is excluded.
Let me check if there are adapter-specific issues or if this is a broader pattern:
🏁 Script executed:
Length of output: 38846
Document reason for skipping PII sampling on ClickHouse
Please add a comment above the skip decorator to explain why column-level PII sampling tests aren’t supported by the ClickHouse adapter, mirroring the pattern in other tests.
• integration_tests/tests/test_column_pii_sampling.py (around line 26)
Example: