Implement column-level PII protection for sample collection#833
Conversation
- Add disable_samples_on_pii_columns and pii_column_tags config variables - Create get_pii_columns_from_parent_model helper macro for column PII detection - Modify query_test_result_rows to dynamically exclude PII columns from SELECT - Add comprehensive integration tests for column-level PII protection - Maintain backward compatibility with existing table-level PII detection - Follow existing patterns for configuration, tag handling, and testing Addresses Linear issue ELE-4850 Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
WalkthroughThis update introduces column-level PII protection for dbt test sample collection. It adds macros to detect PII columns, modifies sampling logic to exclude PII-tagged or explicitly disabled columns, and provides new configuration options. Comprehensive integration tests verify correct behavior for various PII and disable-sample scenarios, ensuring granular control over sample data exposure. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant DbtProject
participant dbt_test_macros
participant SystemUtils
User->>DbtProject: Run test with column config/PII tags
DbtProject->>dbt_test_macros: Execute handle_dbt_test(flattened_test)
dbt_test_macros->>SystemUtils: get_pii_columns_from_parent_model(flattened_test)
SystemUtils-->>dbt_test_macros: Return list of PII columns
dbt_test_macros->>dbt_test_macros: should_disable_sampling_for_pii(flattened_test)
alt Sampling disabled
dbt_test_macros->>dbt_test_macros: Set sample_limit = 0
else Sampling enabled
dbt_test_macros->>dbt_test_macros: get_columns_to_exclude_from_sampling(flattened_test)
end
dbt_test_macros->>DbtProject: Return test results with correct sample rows/columns
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Assessment against linked issues
Assessment against linked issues: Out-of-scope changes
Possibly related PRs
Suggested reviewers
Poem
Note ⚡️ Unit Test Generation is now available in beta!Learn more here, or try it out under "Finishing Touches" below. 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
✨ Finishing Touches🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
|
👋 @devin-ai-integration[bot] |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
macros/edr/system/system_utils/is_pii_column.sql (1)
21-23: Consider using ensure_list helper for cleaner code.The string-to-list conversion logic is functional but could be more robust.
If Elementary has a helper macro for ensuring list types, consider using it:
- {% if pii_column_tags is string %} - {% set pii_column_tags = [pii_column_tags] %} - {% endif %} + {% set pii_column_tags = elementary.ensure_list(pii_column_tags) %}Alternatively, the current approach could handle additional edge cases:
{% set pii_column_tags = elementary.get_config_var('pii_column_tags') %} - {% if pii_column_tags is string %} - {% set pii_column_tags = [pii_column_tags] %} - {% endif %} + {% if pii_column_tags is string %} + {% set pii_column_tags = [pii_column_tags] %} + {% elif not pii_column_tags %} + {% set pii_column_tags = [] %} + {% endif %}macros/edr/materializations/test/test.sql (1)
116-137: LGTM! Robust PII column filtering logic with good edge case handling.The implementation correctly:
- Detects PII columns only when
flattened_testis provided- Queries column names from the test results to build selective queries
- Filters out PII columns while preserving safe columns
- Handles the edge case where all columns are PII with a descriptive dummy column
The
_no_non_pii_columnsdummy column provides clear indication when no safe columns remain.Consider performance optimization for high-frequency tests.
The additional query to retrieve column names adds overhead. For high-frequency test scenarios, consider caching column metadata or optimizing this lookup.
integration_tests/tests/test_column_pii_sampling.py (1)
97-133: LGTM! Good edge case coverage with intentional test design.The test validates the scenario where all columns are tagged as PII by checking for the dummy
_no_non_pii_columnscolumn. The test data is intentionally designed with non-null values to pass the not_null constraint, allowing focus on the sampling behavior rather than test failure mechanics.Consider adding error handling for empty sample results.
While the current tests assume samples will always be present, consider adding assertions to handle cases where the query might return no results due to test execution issues.
samples = [ json.loads(row["result_row"]) for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) ] assert samples, "Expected sample rows but none were found" assert len(samples) == TEST_SAMPLE_ROW_COUNT
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
integration_tests/tests/test_column_pii_sampling.py(1 hunks)macros/edr/materializations/test/test.sql(2 hunks)macros/edr/system/system_utils/get_config_var.sql(1 hunks)macros/edr/system/system_utils/is_pii_column.sql(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
macros/edr/materializations/test/test.sql (3)
Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
Learnt from: haritamar
PR: #825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the agate_to_dict macro in macros/utils/run_queries/agate_to_dict.sql intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original agate_table, while in non-fusion mode it returns agate_table.to_dict(). This is because the content/structure of agate_table is different between the two modes.
🧬 Code Graph Analysis (1)
integration_tests/tests/test_column_pii_sampling.py (2)
integration_tests/tests/conftest.py (2)
dbt_project(88-89)test_id(108-111)integration_tests/tests/dbt_project.py (2)
DbtProject(42-279)run_query(51-58)
🔇 Additional comments (9)
macros/edr/system/system_utils/get_config_var.sql (1)
58-59: LGTM! Configuration variables properly added.The new configuration variables are well-designed:
disable_samples_on_pii_columnsdefaults tofalsefor backward compatibilitypii_column_tagsprovides sensible default tags covering common PII naming conventions- Both variables follow the existing naming and structure patterns in the configuration
macros/edr/system/system_utils/is_pii_column.sql (2)
25-31: LGTM! Comprehensive tag collection approach.The macro correctly collects tags from multiple metadata sources (
config.tags, globaltags, andmeta.tags), ensuring comprehensive PII detection coverage regardless of where users define their PII tags.
4-6: Fix inverted logic condition.The condition logic appears inverted. The macro should return early when PII column sampling is disabled (
disable_samples_on_pii_columnsisfalse), but currently returns early when it's enabled.Apply this diff to fix the logic:
- {% if not elementary.get_config_var('disable_samples_on_pii_columns') %} + {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}Wait, let me reconsider this. The variable name is
disable_samples_on_pii_columns, so when it'sfalse(default), we want to disable the PII protection and return an empty list. When it'strue, we want to enable PII protection and proceed with the detection logic.The current logic
if not disable_samples_on_pii_columnsmeans "if PII protection is enabled", so return early. This is incorrect.- {% if not elementary.get_config_var('disable_samples_on_pii_columns') %} + {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}Actually, let me think through this more carefully:
disable_samples_on_pii_columns=false(default) → PII protection disabled → return empty listdisable_samples_on_pii_columns=true→ PII protection enabled → proceed with detectionSo the condition should be:
if not disable_samples_on_pii_columns(if PII protection is disabled), return empty list.But the current code has
if not elementary.get_config_var('disable_samples_on_pii_columns')which means "if PII protection is disabled", return empty list. This is actually correct!Let me re-examine... No wait, I think I'm confusing myself. Let me look at the PR objectives again.
From the PR objectives:
disable_samples_on_pii_columnsdefaults to false, and when enabled (true), it activates the PII column protection. So:
disable_samples_on_pii_columns=false→ No PII protection, return empty listdisable_samples_on_pii_columns=true→ Enable PII protection, proceed with detectionThe current code:
if not elementary.get_config_var('disable_samples_on_pii_columns')means "if disable_samples_on_pii_columns is false", return empty list. This is correct behavior.Actually, I think I misunderstood the variable name. Let me re-read the PR objectives more carefully...
From the PR: "disable_samples_on_pii_columns (defaulting to false)" and "when PII sampling is enabled". So this variable name is a bit confusing, but it seems like:
disable_samples_on_pii_columns=true→ Enable PII column protection (disable sampling on PII columns)disable_samples_on_pii_columns=false→ Disable PII column protection (allow sampling on PII columns)So the current logic is correct: if
disable_samples_on_pii_columnsis false, return empty list (no protection). If it's true, proceed with detection.Wait, I think I need to be more careful here. The variable name
disable_samples_on_pii_columnsis confusing. Let me interpret it as:
disable_samples_on_pii_columns = truemeans "disable sampling on PII columns" = enable PII protectiondisable_samples_on_pii_columns = falsemeans "don't disable sampling on PII columns" = disable PII protectionSo the current logic
if not disable_samples_on_pii_columnsmeans "if we're not disabling sampling on PII columns", return empty list. This is correct.Actually, let me step back and look at this more systematically. The logic should be:
- If PII protection is disabled, return empty list (no columns to exclude)
- If PII protection is enabled, proceed to find PII columns
The variable name
disable_samples_on_pii_columnswhentruemeans "disable sampling on PII columns" which is equivalent to "enable PII protection".So:
disable_samples_on_pii_columns = false→ Don't disable sampling = no PII protection → return empty listdisable_samples_on_pii_columns = true→ Disable sampling on PII = enable PII protection → proceed with detectionThe current code:
if not disable_samples_on_pii_columnsmeans "if false", return empty list. This is correct!I was getting confused by the double negative. The logic is actually correct.
[actions]
Skip this comment since the logic is actually correct.
-->Likely an incorrect or invalid review comment.
macros/edr/materializations/test/test.sql (3)
54-55: LGTM! Proper parameter passing for PII detection.The
flattened_testargument is correctly passed toquery_test_result_rowsto enable PII column detection. The parameter formatting and line breaking maintain readability.
107-107: LGTM! Function signature properly updated.The optional
flattened_testparameter is correctly added to the function signature, maintaining backward compatibility.
143-143: LGTM! Dynamic select clause properly implemented.The query correctly uses the dynamically built
select_clauseinstead of the hardcoded*, enabling selective column filtering based on PII detection.integration_tests/tests/test_column_pii_sampling.py (3)
1-24: LGTM! Well-structured test module with clear constants and query template.The module setup is clean with appropriate imports, descriptive constants, and a reusable query template. The
SAMPLES_QUERYcorrectly retrieves sample rows from the latest test execution.
26-60: LGTM! Comprehensive test for enabled PII protection.The test correctly validates that:
- PII columns are excluded from sample rows when protection is enabled
- Safe columns remain accessible for debugging
- Test failure behavior is preserved
- Sample count matches expectations
The test data and configuration properly simulate real-world PII scenarios.
62-95: LGTM! Proper validation of disabled PII protection.The test correctly verifies that when PII protection is disabled, all columns (including PII-tagged ones) are included in sample rows, maintaining backward compatibility.
- Add is_sampling_disabled_for_column macro to check column config - Modify query_test_result_rows to skip sampling when disable_samples=true - Add integration tests for disable_samples functionality - disable_samples flag overrides PII detection and works at column level - Preserves existing PII column filtering functionality Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
- Add is_sampling_disabled_for_column macro to check column config - Modify query_test_result_rows to skip sampling when disable_samples=true - Add integration tests for disable_samples functionality - disable_samples flag overrides PII detection and works at column level - Preserves existing PII column filtering functionality Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
macros/edr/materializations/test/test.sql (1)
129-147: Performance and SQL parsing concerns need attentionThe implementation has two significant issues that were flagged in previous reviews:
Double query execution: Lines 133-139 execute the test SQL with
limit 0to get column names, then line 153 executes it again for sampling. This doubles latency and warehouse costs.Unquoted column identifiers: Line 143 joins raw column names without proper quoting, which can break with special characters, reserved words, or mixed-case columns.
These issues were previously identified and solutions were suggested in past reviews.
🧹 Nitpick comments (2)
macros/edr/materializations/test/test.sql (1)
213-256: Consider the fragility of SQL string parsingWhile this dynamic approach is comprehensive, parsing SQL strings with text matching is inherently fragile. Consider these potential issues:
- SQL formatting variations could break pattern matching
- Complex queries with subqueries or CTEs might not be parsed correctly
- Different SQL dialects might use different syntax
The static mapping approach in
get_test_result_column_mappingmight be more reliable for standard dbt tests, with this dynamic approach as a fallback only when needed.integration_tests/tests/test_column_pii_sampling.py (1)
97-134: Potential test duplicationThis test appears to cover the same scenario as
test_column_pii_sampling_disabled- both test cases where PII tags exist butdisable_samples_on_pii_tagsisFalse. Consider consolidating these tests or ensuring they cover distinct scenarios.The test logic itself is correct.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
integration_tests/tests/test_column_pii_sampling.py(1 hunks)integration_tests/tests/test_disable_samples_config.py(1 hunks)macros/edr/materializations/test/test.sql(2 hunks)macros/edr/system/system_utils/is_pii_column.sql(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- macros/edr/system/system_utils/is_pii_column.sql
- integration_tests/tests/test_disable_samples_config.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
📚 Learning: the dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Applied to files:
macros/edr/materializations/test/test.sql
📚 Learning: in the elementary dbt package, the snowflake profile generation macro intentionally always includes ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
Applied to files:
macros/edr/materializations/test/test.sql
📚 Learning: in dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Applied to files:
macros/edr/materializations/test/test.sql
📚 Learning: in dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` i...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original `agate_table`, while in non-fusion mode it returns `agate_table.to_dict()`. This is because the content/structure of `agate_table` is different between the two modes.
Applied to files:
macros/edr/materializations/test/test.sql
🔇 Additional comments (12)
macros/edr/materializations/test/test.sql (5)
67-68: LGTM - Parameter addition enables PII functionalityThe addition of the
flattened_testparameter correctly enables the new PII column detection functionality in thequery_test_result_rowsmacro.
120-120: LGTM - Macro signature correctly updatedThe addition of the
flattened_testparameter enables the new PII column exclusion functionality.
180-211: LGTM - Comprehensive test type mappingThe static column mapping correctly handles the standard dbt test types (unique, accepted_values, relationships, not_null) and their respective column aliases. This provides a good foundation for column exclusion logic.
258-272: LGTM - Clean mapping applicationThe logic correctly applies column mapping to excluded columns, providing a clean interface between exclusion detection and column mapping functionality.
351-367: LGTM - Robust column configuration checkingThe macro correctly navigates the model structure to check column-level
disable_test_samplesconfiguration with appropriate null safety checks.integration_tests/tests/test_column_pii_sampling.py (7)
1-24: LGTM - Clean test setup and constantsThe file structure is well-organized with clear constants and a properly structured sample query for retrieving test results.
61-95: LGTM - Addresses backward compatibility concernThis test correctly validates the scenario mentioned in past review comments: when PII tags exist but the flag is disabled, samples should include PII columns normally. The assertions properly verify both 'unique_field' and 'n_records' are present.
Note: The configuration variable name will need to be updated when the macro is fixed.
136-168: Verify edge case behavior with implementationThis test expects no samples when all columns are PII (
assert len(samples) == 0), but the macro implementation (lines 144-146 in the first file) creates a dummy column"1 as _no_non_excluded_columns"when all columns are excluded.Please verify if the expected behavior should be:
- No samples collected (as this test expects), or
- Samples with only the dummy column (as the implementation suggests)
The test logic is otherwise sound for this important edge case.
171-204: LGTM - Proper unique test column mapping validationThis test correctly validates that unique test column mapping works with PII exclusion - the PII-tagged column mapped to 'unique_field' is properly excluded from samples.
206-239: LGTM - Proper accepted_values test column mapping validationThis test correctly validates that accepted_values test column mapping works with PII exclusion - the PII-tagged column mapped to 'value' is properly excluded from samples.
241-275: LGTM - Validates dummy column behaviorThis test correctly validates the behavior when PII columns are excluded from not_null tests - it expects the
_no_non_excluded_columnsdummy column, which aligns with the macro implementation.
277-314: LGTM - Comprehensive multiple PII columns testThis test properly validates that multiple PII-tagged columns are all excluded from samples, ensuring the exclusion logic works correctly with multiple columns.
| @pytest.mark.skip_targets(["clickhouse"]) | ||
| def test_column_pii_sampling_enabled(test_id: str, dbt_project: DbtProject): | ||
| """Test that PII columns are excluded when column-level PII protection is enabled""" | ||
| data = [ | ||
| {SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: None} for i in range(10) | ||
| ] | ||
|
|
||
| test_result = dbt_project.test( | ||
| test_id, | ||
| "unique", | ||
| test_args=dict(column_name=SENSITIVE_COLUMN), | ||
| data=data, | ||
| columns=[ | ||
| {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}}, | ||
| {"name": SAFE_COLUMN}, | ||
| ], | ||
| test_vars={ | ||
| "enable_elementary_test_materialization": True, | ||
| "test_sample_row_count": TEST_SAMPLE_ROW_COUNT, | ||
| "disable_samples_on_pii_tags": True, | ||
| "pii_tags": ["pii"], | ||
| }, | ||
| ) | ||
| assert test_result["status"] == "fail" | ||
|
|
||
| samples = [ | ||
| json.loads(row["result_row"]) | ||
| for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id)) | ||
| ] | ||
|
|
||
| assert len(samples) == 1 | ||
| assert samples[0]["n_records"] == 10 | ||
| assert len(samples[0]) == 1 | ||
|
|
There was a problem hiding this comment.
Configuration variable name inconsistency
Line 45 uses disable_samples_on_pii_tags: True which matches the incorrect variable name in the macro, but according to the PR objectives it should be disable_samples_on_pii_columns. This test will need to be updated when the macro is fixed.
The test logic and assertions are correct for the intended behavior.
🤖 Prompt for AI Agents
In integration_tests/tests/test_column_pii_sampling.py around lines 26 to 59,
the test uses the configuration variable disable_samples_on_pii_tags which is
inconsistent with the intended variable name disable_samples_on_pii_columns.
Update the test_vars dictionary to replace disable_samples_on_pii_tags with
disable_samples_on_pii_columns to align with the correct macro variable name and
PR objectives.
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (5)
macros/edr/materializations/test/test.sql (1)
154-174: Critical: Fix configuration variable nameLine 161 uses
disable_samples_on_pii_tagsbut according to the PR objectives and past review comments, it should bedisable_samples_on_pii_columns. This mismatch will prevent the PII exclusion feature from working correctly.Apply this fix:
- {% if elementary.get_config_var('disable_samples_on_pii_tags') %} + {% if elementary.get_config_var('disable_samples_on_pii_columns') %}integration_tests/tests/test_column_pii_sampling.py (4)
97-134: Good test for backward compatibility scenarioThis test correctly validates the scenario where PII tags exist but the feature is disabled, ensuring backward compatibility. However, Line 117 needs the same configuration variable name fix.
Apply this fix:
- "disable_samples_on_pii_tags": False, # Flag is disabled + "disable_samples_on_pii_columns": False, # Flag is disabled
136-462: Comprehensive test coverage with consistent configuration issueThe test suite provides excellent coverage of various scenarios:
- All columns tagged as PII
- Different dbt test types (unique, accepted_values, not_null)
- Multiple PII columns
- Custom SQL tests with complex scenarios
However, all tests consistently use
disable_samples_on_pii_tagsinstead ofdisable_samples_on_pii_columnsin their configuration. This needs to be fixed throughout the file.Apply this fix to all test functions:
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
26-59: Critical: Configuration variable name inconsistencyLine 45 uses
disable_samples_on_pii_tags: Truebut according to the PR objectives, this should bedisable_samples_on_pii_columns. This matches the same issue in the macro and needs to be fixed for the tests to work correctly.Apply this fix:
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
61-95: Fix configuration variable nameLine 79 has the same configuration variable name issue - should be
disable_samples_on_pii_columnsinstead ofdisable_samples_on_pii_tags.The test logic correctly validates that PII columns are included in samples when the feature is disabled, ensuring backward compatibility.
🧹 Nitpick comments (1)
macros/edr/materializations/test/test.sql (1)
178-239: Complex but necessary column mapping logicThis macro implements sophisticated column mapping to handle different dbt test types. While complex, the multiple matching strategies (exact match, common patterns, heuristics) are necessary to handle the variety of ways test results are structured.
The approach creates temporary relations for column introspection, which adds some performance overhead but provides accurate column mapping.
Consider adding more detailed logging or comments to explain the matching strategies for future maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
integration_tests/tests/test_column_pii_sampling.py(1 hunks)macros/edr/materializations/test/test.sql(2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
📚 Learning: the dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.
Applied to files:
macros/edr/materializations/test/test.sql
📚 Learning: in dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` i...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original `agate_table`, while in non-fusion mode it returns `agate_table.to_dict()`. This is because the content/structure of `agate_table` is different between the two modes.
Applied to files:
macros/edr/materializations/test/test.sql
📚 Learning: in the elementary dbt package, the snowflake profile generation macro intentionally always includes ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
Applied to files:
macros/edr/materializations/test/test.sql
📚 Learning: in dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Applied to files:
macros/edr/materializations/test/test.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (1.8.0, postgres) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, bigquery) / test
🔇 Additional comments (4)
macros/edr/materializations/test/test.sql (4)
120-152: Improved approach for column-level PII exclusionThe new implementation provides a cleaner approach to excluding PII columns from test samples. The logic properly handles edge cases like when all columns are excluded by falling back to a dummy column.
However, note that creating temporary relations (Line 134) adds some performance overhead compared to static column mapping approaches.
241-256: LGTM - Clean helper for column introspectionThis helper macro properly creates temporary relations for column metadata extraction using the standard
limit 0approach. The implementation integrates well with Elementary's existing temp table utilities.
258-274: Well-implemented column sampling checkThis macro properly retrieves column-level configuration from the parent model and checks for the
disable_test_samplesflag. The null checks and fallback logic are appropriate for handling edge cases.
66-68: LGTM - Proper integration of flattened_test parameterThe addition of the
flattened_testparameter toquery_test_result_rowscorrectly enables the new column exclusion functionality while maintaining backward compatibility.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (12)
integration_tests/tests/test_column_pii_sampling.py (12)
45-45: Configuration variable name inconsistency persists.This line still uses
disable_samples_on_pii_tags: Truewhich is inconsistent with the intended variable namedisable_samples_on_pii_columnsmentioned in the PR objectives. This matches the previous review comment that flagged this issue.Apply this fix to align with the correct variable name:
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
77-77: Configuration variable name inconsistency.Same issue as line 45 - the variable name should be
disable_samples_on_pii_columnsinstead ofdisable_samples_on_pii_tags.- "disable_samples_on_pii_tags": False, + "disable_samples_on_pii_columns": False,
115-115: Configuration variable name inconsistency.Another occurrence of the incorrect variable name.
- "disable_samples_on_pii_tags": False, # Flag is disabled + "disable_samples_on_pii_columns": False, # Flag is disabled
154-154: Configuration variable name inconsistency.Yet another occurrence of the incorrect variable name.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
186-186: Configuration variable name inconsistency.Continuing pattern of incorrect variable name usage.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
217-217: Configuration variable name inconsistency.Same issue in the accepted_values test.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
248-248: Configuration variable name inconsistency.Same issue in the not_null test.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
283-283: Configuration variable name inconsistency.Same issue in the multiple PII columns test.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
313-313: Configuration variable name inconsistency.Same issue in the custom SQL test.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
344-344: Configuration variable name inconsistency.Same issue in the meta tags test.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
382-382: Configuration variable name inconsistency.Same issue in the custom SQL test with multiple PII columns.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
413-413: Configuration variable name inconsistency.Same issue in the subquery test.
- "disable_samples_on_pii_tags": True, + "disable_samples_on_pii_columns": True,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
integration_tests/tests/test_column_pii_sampling.py(1 hunks)macros/edr/materializations/test/test.sql(3 hunks)macros/edr/system/system_utils/is_pii_column.sql(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- macros/edr/system/system_utils/is_pii_column.sql
- macros/edr/materializations/test/test.sql
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, postgres) / test
- GitHub Check: test (1.8.0, postgres) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_official, snowflake) / test
🔇 Additional comments (4)
integration_tests/tests/test_column_pii_sampling.py (4)
1-24: Well-structured test setup with clear constants and query template.The imports, constants, and SQL query template are appropriately defined for comprehensive integration testing of the PII sampling functionality.
338-338: Interesting test case for meta tags.This test validates that PII tags can be specified in the
metasection rather thanconfig.tags, which adds good coverage for different tag placement options.
56-56: Comprehensive test coverage with consistent assertions.The tests effectively validate that when PII protection is enabled, no sample rows are collected (asserting
len(samples) == 0). This comprehensive coverage includes various test types (unique, not_null, accepted_values), edge cases (all columns PII), and different PII tag configurations.Also applies to: 166-166, 197-197, 228-228, 259-259, 293-293, 324-324, 355-355, 392-392, 422-422
87-92: Proper validation of disabled PII protection scenarios.These tests correctly verify that when PII protection is disabled, sample data is collected normally with the expected structure and values. The assertions validate both the presence of expected fields and their correct values.
Also applies to: 127-131
| {% endmacro %} | ||
|
|
||
| {% macro query_test_result_rows(sample_limit=none, ignore_passed_tests=false) %} | ||
| {% macro query_test_result_rows(sample_limit=none, ignore_passed_tests=false, flattened_test=none) %} |
There was a problem hiding this comment.
Is the flattened_test param needed?
| @@ -0,0 +1,47 @@ | |||
| {% macro get_pii_columns_from_parent_model(flattened_test) %} | |||
There was a problem hiding this comment.
Let's rename the file to match the macro?
|
|
||
| {# column -> meta -> tags #} | ||
| {% set meta_dict = column_node.get('meta', {}) %} | ||
| {% set meta_tags = meta_dict.get('tags', []) %} |
There was a problem hiding this comment.
We should also consider the case where the model itself has a PII tag.
There was a problem hiding this comment.
This is implemented here:
|
|
||
| {% for column_node in column_nodes.values() %} | ||
| {# column -> tags #} | ||
| {% set column_tags = column_node.get('tags', []) %} |
There was a problem hiding this comment.
Consider creating a separate macro just to get the tags of a column
- Rename is_pii_column.sql to get_pii_columns_from_parent_model.sql to match macro name - Extract get_column_tags helper macro for better code organization - Add model-level PII tag support - if model has PII tags, all columns are considered PII - Add clarifying comment about intentional over-censoring with * queries in CTEs - Keep flattened_test parameter in query_test_result_rows for future functionality Fixes all outstanding code review feedback from haritamar on PR #833 Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql (1)
59-61: Leverage the locally-defined macro directlyReferencing the same-package macro via the package namespace works, but makes refactors harder (e.g. renaming the package).
Using the bare macro name keeps the call shorter and avoids an unnecessary dependency edge.- {% set all_column_tags_lower = elementary.get_column_tags(column_node) %} + {% set all_column_tags_lower = get_column_tags(column_node) %}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
macros/edr/materializations/test/test.sql(3 hunks)macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- macros/edr/materializations/test/test.sql
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
- Fix get_column_tags to properly handle string tags vs lists - Fix pii_tags processing to handle scalar strings correctly - Apply consistent string/list normalization to model tags - Addresses CodeRabbit comments r2250021209 and r2250021214 The core issue was that Jinja2's 'is iterable' test returns True for strings, causing scalar string tags like 'pii' to be treated as character arrays ['p','i','i'] instead of single-element lists ['pii']. Fixed by using explicit string checking with 'src is string' to properly distinguish between strings and actual lists/arrays. Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
- Remove model-level PII tag checking from get_pii_columns_from_parent_model - This logic is handled elsewhere as noted by Yosef - Simplifies macro to focus only on column-level PII detection Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
integration_tests/tests/test_column_pii_sampling.py (1)
42-46: Config keydisable_samples_on_pii_tagsis correct – earlier warning can be ignoredPast feedback suggested renaming this variable to
disable_samples_on_pii_columns, but the maintainer explicitly confirmed the project intentionally usesdisable_samples_on_pii_tags(see retrieved learning from arbiv).
No change is required here.Also applies to: 70-77, 148-152, 187-191, 218-223, 250-254, 281-285, 318-319, 346-350, 378-381
🧹 Nitpick comments (1)
integration_tests/tests/test_column_pii_sampling.py (1)
1-91: Heavy duplication – consider parametrising the suiteEach test repeats the same setup pattern (data creation,
dbt_project.testcall, sample query, assertions). Pytest’s@pytest.mark.parametrizeor a helper fixture would shrink this file dramatically, speed execution, and simplify future maintenance.Also applies to: 95-167, 170-233, 236-295, 298-359, 362-390
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
integration_tests/tests/test_column_pii_sampling.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: arbiv
PR: elementary-data/dbt-data-reliability#833
File: macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql:20-22
Timestamp: 2025-08-04T04:46:05.802Z
Learning: In the dbt-data-reliability project, the config key for disabling samples on PII tags is intentionally named "disable_samples_on_pii_tags" (not "disable_samples_on_pii_columns" as might be expected), as confirmed by the maintainer arbiv.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
📚 Learning: in the dbt-data-reliability project, the config key for disabling samples on pii tags is intentional...
Learnt from: arbiv
PR: elementary-data/dbt-data-reliability#833
File: macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql:20-22
Timestamp: 2025-08-04T04:46:05.802Z
Learning: In the dbt-data-reliability project, the config key for disabling samples on PII tags is intentionally named "disable_samples_on_pii_tags" (not "disable_samples_on_pii_columns" as might be expected), as confirmed by the maintainer arbiv.
Applied to files:
integration_tests/tests/test_column_pii_sampling.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: test (1.8.0, postgres) / test
- GitHub Check: test (latest_official, clickhouse) / test
- GitHub Check: test (latest_official, redshift) / test
- GitHub Check: test (latest_official, bigquery) / test
- GitHub Check: test (latest_pre, postgres) / test
- GitHub Check: test (latest_official, snowflake) / test
- GitHub Check: test (latest_official, trino) / test
- GitHub Check: test (latest_official, athena) / test
- GitHub Check: test (latest_official, databricks_catalog) / test
- GitHub Check: test (latest_official, postgres) / test
🔇 Additional comments (1)
integration_tests/tests/test_column_pii_sampling.py (1)
110-114: Verify macro variable name for custom PII tag listAll tests pass a
pii_tagsvariable, but the PR description and original design referencepii_column_tags.
Please double-check the macros’var()lookup to ensure the spelling aligns; otherwise these tests may silently bypass the intended override.Also applies to: 249-254, 281-285, 349-350, 379-381
Implement column-level PII protection for sample collection
Summary by CodeRabbit
New Features
Bug Fixes
Tests