-
Notifications
You must be signed in to change notification settings - Fork 137
Implement column-level PII protection for sample collection #833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 9 commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
5d4dd69
Implement column-level PII protection for sample collection
devin-ai-integration[bot] 3ba7a44
Add disable_samples column configuration flag
devin-ai-integration[bot] a159aa7
Add disable_samples column configuration flag
devin-ai-integration[bot] 9fa0eb1
Fix disable_samples to exclude columns instead of skipping sampling
devin-ai-integration[bot] 6014bac
Resolve merge conflicts - keep column exclusion implementation
devin-ai-integration[bot] d481311
Address GitHub feedback: rename global_tags to column_tags and add te…
devin-ai-integration[bot] c45cba3
Merge remote-tracking branch 'origin/master' into devin-ELE-4850-1753…
arbiv 4907483
unified config vars
arbiv 949b357
fixed to support builtin tests
arbiv 06e9451
using dbt parser
arbiv 3b6854b
changed column logic to not sample if one of the columns is in the te…
arbiv 057a2cc
improved tests and code readability
arbiv 89e49de
Address Itamar's code review comments
devin-ai-integration[bot] c5b9f45
Fix string/list handling in PII column detection macros
devin-ai-integration[bot] 8cc4050
Remove redundant model PII tag checking logic
devin-ai-integration[bot] 812d9c2
improved tests
arbiv 25ddd67
fixed cr comment
arbiv File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,313 @@ | ||
| import json | ||
|
|
||
| import pytest | ||
| from dbt_project import DbtProject | ||
|
|
||
| SENSITIVE_COLUMN = "email" | ||
| SAFE_COLUMN = "order_count" | ||
|
|
||
| SAMPLES_QUERY = """ | ||
| with latest_elementary_test_result as ( | ||
| select id | ||
| from {{{{ ref("elementary_test_results") }}}} | ||
| where lower(table_name) = lower('{test_id}') | ||
| order by created_at desc | ||
| limit 1 | ||
| ) | ||
|
|
||
| select result_row | ||
| from {{{{ ref("test_result_rows") }}}} | ||
| where elementary_test_results_id in (select * from latest_elementary_test_result) | ||
| """ | ||
|
|
||
| TEST_SAMPLE_ROW_COUNT = 5 | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_enabled(test_id: str, dbt_project: DbtProject): | ||
| """Test that PII columns are excluded when column-level PII protection is enabled""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: None} for i in range(10) | ||
| ] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "unique", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| assert len(samples) == 1 | ||
| assert samples[0]["n_records"] == 10 | ||
| assert len(samples[0]) == 1 | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_disabled(test_id: str, dbt_project: DbtProject): | ||
| """Test that all columns are included when column-level PII protection is disabled""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: None} for i in range(10) | ||
| ] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "unique", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": False, | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # sample should be {'unique_field': 'user@example.com', 'n_records': 10} | ||
| assert len(samples) == 1 | ||
| for sample in samples: | ||
| # The original column name is mapped to 'unique_field' in unique tests | ||
| assert "unique_field" in sample | ||
| assert "n_records" in sample | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_tags_exist_but_flag_disabled( | ||
| test_id: str, dbt_project: DbtProject | ||
| ): | ||
| """Test that when PII tags exist but disable_samples_on_pii_tags is false, samples are collected normally""" | ||
| data = [{SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: 1} for i in range(10)] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "unique", | ||
| test_args=dict(column_name=SAFE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_column=None, | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": False, # Flag is disabled | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # When flag is disabled, we get the full sample (not limited by PII filtering) | ||
| assert len(samples) == 1 | ||
| for sample in samples: | ||
| # The original column name is mapped to 'unique_field' in unique tests | ||
| assert "unique_field" in sample | ||
| assert "n_records" in sample | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_all_columns_pii(test_id: str, dbt_project: DbtProject): | ||
| """Test behavior when all columns are tagged as PII""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: f"user{i}@example.com", SAFE_COLUMN: i} for i in range(10) | ||
| ] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "not_null", | ||
| test_args=dict(column_name=SAFE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| ], | ||
| test_column=None, | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "pass" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # When all columns are PII, no samples should be collected | ||
| assert len(samples) == 0 | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_unique_test_column_mapping(test_id: str, dbt_project: DbtProject): | ||
| """Test that column mapping correctly maps unique test columns""" | ||
| data = [{SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: i} for i in range(10)] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "unique", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # Should only contain n_records, not unique_field (which contains PII) | ||
| assert len(samples) == 1 | ||
| assert "n_records" in samples[0] | ||
| assert "unique_field" not in samples[0] | ||
| assert len(samples[0]) == 1 | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_accepted_values_test_column_mapping(test_id: str, dbt_project: DbtProject): | ||
| """Test that column mapping correctly maps accepted_values test columns""" | ||
| data = [{SENSITIVE_COLUMN: "invalid_value", SAFE_COLUMN: i} for i in range(10)] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "accepted_values", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN, values=["valid1", "valid2"]), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # Should only contain n_records, not value (which contains PII) | ||
| assert len(samples) == 1 | ||
| assert "n_records" in samples[0] | ||
| assert "value" not in samples[0] | ||
| assert len(samples[0]) == 1 | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_not_null_test_column_mapping(test_id: str, dbt_project: DbtProject): | ||
| """Test that column mapping correctly handles not_null test columns""" | ||
| data = [{SENSITIVE_COLUMN: None, SAFE_COLUMN: i} for i in range(10)] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "not_null", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # Should only contain _no_non_excluded_columns when all columns are PII | ||
| assert len(samples) == TEST_SAMPLE_ROW_COUNT | ||
| for sample in samples: | ||
| assert "_no_non_excluded_columns" in sample | ||
| assert SENSITIVE_COLUMN not in sample | ||
| assert SAFE_COLUMN not in sample | ||
|
|
||
|
|
||
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_multiple_pii_columns_mapping(test_id: str, dbt_project: DbtProject): | ||
| """Test that column mapping handles multiple PII columns correctly""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: "user@example.com", "phone": "123-456-7890", SAFE_COLUMN: i} | ||
| for i in range(10) | ||
| ] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "unique", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": "phone", "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| # Should only contain n_records, not unique_field or phone (which contain PII) | ||
| assert len(samples) == 1 | ||
| assert "n_records" in samples[0] | ||
| assert "unique_field" not in samples[0] | ||
| assert "phone" not in samples[0] | ||
| assert len(samples[0]) == 1 | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Configuration variable name inconsistency
Line 45 uses
disable_samples_on_pii_tags: Truewhich matches the incorrect variable name in the macro, but according to the PR objectives it should bedisable_samples_on_pii_columns. This test will need to be updated when the macro is fixed.The test logic and assertions are correct for the intended behavior.
🤖 Prompt for AI Agents