Implement column-level PII protection for sample collection by devin-ai-integration[bot] · Pull Request #833 · elementary-data/dbt-data-reliability

devin-ai-integration · 2025-07-30T08:13:49Z

Implement column-level PII protection for sample collection

Summary by CodeRabbit

New Features
- Added support for disabling sample row collection for columns tagged as PII or explicitly configured to disable sampling during tests.
- Introduced configuration options to control sample data exposure for sensitive columns, including custom tags and per-column settings.
- Enhanced test coverage to verify correct handling of PII and sampling exclusion under various configurations.
Bug Fixes
- Ensured that sample rows are properly excluded for tests involving PII or disabled columns, preventing unintended data exposure.
Tests
- Added comprehensive integration tests to validate PII sampling behavior and the "disable_test_samples" configuration across multiple scenarios.

- Add disable_samples_on_pii_columns and pii_column_tags config variables - Create get_pii_columns_from_parent_model helper macro for column PII detection - Modify query_test_result_rows to dynamically exclude PII columns from SELECT - Add comprehensive integration tests for column-level PII protection - Maintain backward compatibility with existing table-level PII detection - Follow existing patterns for configuration, tag handling, and testing Addresses Linear issue ELE-4850 Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>

devin-ai-integration · 2025-07-30T08:13:51Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

linear · 2025-07-30T08:13:52Z

ELE-4850 Extend PII protection to column-level granularity with column exclusion approach

coderabbitai · 2025-07-30T08:13:56Z

Walkthrough

This update introduces column-level PII protection for dbt test sample collection. It adds macros to detect PII columns, modifies sampling logic to exclude PII-tagged or explicitly disabled columns, and provides new configuration options. Comprehensive integration tests verify correct behavior for various PII and disable-sample scenarios, ensuring granular control over sample data exposure.

Changes

Cohort / File(s)	Change Summary
Integration Tests: PII Sampling `integration_tests/tests/test_column_pii_sampling.py`	Adds comprehensive integration tests for column-level PII sampling, covering various configurations and test types, verifying correct inclusion/exclusion of sample rows and columns based on PII tags and sampling settings.
Integration Tests: Disable Samples Config `integration_tests/tests/test_disable_samples_config.py`	Introduces integration tests for the `disable_test_samples` column config, testing its interaction with PII tags and verifying sample row exclusion logic for multiple columns and configurations.
DbtProject Test API `integration_tests/tests/dbt_project.py`	Updates the `DbtProject.test` method to accept a `column_config` parameter, supporting per-column sample disabling and PII tagging in test setup.
Macros: PII Column Detection & Sampling Logic `macros/edr/materializations/test/test.sql`	Adds macros for detecting PII columns, determining when to disable sampling, and excluding columns from samples. Updates test materialization logic to enforce column-level PII and disable-sample configurations.
Macros: System Utils for PII `macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql`	Adds macros to extract tags from columns and parent models, and to identify PII columns based on tags and config, supporting column-level PII detection for sample exclusion logic.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DbtProject
    participant dbt_test_macros
    participant SystemUtils

    User->>DbtProject: Run test with column config/PII tags
    DbtProject->>dbt_test_macros: Execute handle_dbt_test(flattened_test)
    dbt_test_macros->>SystemUtils: get_pii_columns_from_parent_model(flattened_test)
    SystemUtils-->>dbt_test_macros: Return list of PII columns
    dbt_test_macros->>dbt_test_macros: should_disable_sampling_for_pii(flattened_test)
    alt Sampling disabled
        dbt_test_macros->>dbt_test_macros: Set sample_limit = 0
    else Sampling enabled
        dbt_test_macros->>dbt_test_macros: get_columns_to_exclude_from_sampling(flattened_test)
    end
    dbt_test_macros->>DbtProject: Return test results with correct sample rows/columns

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Assessment against linked issues

Objective	Addressed	Explanation
Add new configuration variables for column-level PII protection (ELE-4850)	✅
Create a column PII detection helper macro (ELE-4850)	✅
Modify sample collection logic to exclude PII columns, not just tables (ELE-4850)	✅
Integration tests for column-level PII protection (ELE-4850)	✅
Update documentation (ELE-4850)	❌	Documentation updates are not present in this PR.

Assessment against linked issues: Out-of-scope changes

Code Change	Explanation
None found	All code changes are directly related to the objectives in ELE-4850.

Possibly related PRs

Add configuration to disable sample collection for PII tables #831: Modifies DbtProject.test to accept additional parameters related to disabling sample collection for PII columns and adds integration tests for sample suppression based on PII tagging and disable flags.

Suggested reviewers

michael-myaskovsky

Poem

In the warren where the data flows,
We burrow deep where the PII grows.
Now, with nimble code and clever tags,
Only secrets hide—no more red flags!
Samples hop past columns marked "private,"
Leaving carrots for the rest—oh, how I delight it!
🥕🐇 Data safe, and tests so bright!

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 812d9c2 and 25ddd67.

📒 Files selected for processing (1)

macros/edr/materializations/test/test.sql (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

macros/edr/materializations/test/test.sql

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: test (latest_official, postgres) / test
GitHub Check: test (latest_official, redshift) / test
GitHub Check: test (latest_official, clickhouse) / test
GitHub Check: test (1.8.0, postgres) / test
GitHub Check: test (latest_official, trino) / test
GitHub Check: test (latest_official, bigquery) / test
GitHub Check: test (latest_official, snowflake) / test
GitHub Check: test (latest_pre, postgres) / test
GitHub Check: test (latest_official, databricks_catalog) / test
GitHub Check: test (latest_official, athena) / test

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch devin-ELE-4850-1753859795

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

github-actions · 2025-07-30T08:14:02Z

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

macros/edr/system/system_utils/is_pii_column.sql (1)
21-23: Consider using ensure_list helper for cleaner code.

The string-to-list conversion logic is functional but could be more robust.

If Elementary has a helper macro for ensuring list types, consider using it:
-  {% if pii_column_tags is string %}
-    {% set pii_column_tags = [pii_column_tags] %}
-  {% endif %}
+  {% set pii_column_tags = elementary.ensure_list(pii_column_tags) %}
Alternatively, the current approach could handle additional edge cases:
   {% set pii_column_tags = elementary.get_config_var('pii_column_tags') %}
-  {% if pii_column_tags is string %}
-    {% set pii_column_tags = [pii_column_tags] %}
-  {% endif %}
+  {% if pii_column_tags is string %}
+    {% set pii_column_tags = [pii_column_tags] %}
+  {% elif not pii_column_tags %}
+    {% set pii_column_tags = [] %}
+  {% endif %}
macros/edr/materializations/test/test.sql (1)

116-137: LGTM! Robust PII column filtering logic with good edge case handling.

The implementation correctly:

Detects PII columns only when flattened_test is provided

Queries column names from the test results to build selective queries

Filters out PII columns while preserving safe columns

Handles the edge case where all columns are PII with a descriptive dummy column

The _no_non_pii_columns dummy column provides clear indication when no safe columns remain.

Consider performance optimization for high-frequency tests.

The additional query to retrieve column names adds overhead. For high-frequency test scenarios, consider caching column metadata or optimizing this lookup.
integration_tests/tests/test_column_pii_sampling.py (1)
97-133: LGTM! Good edge case coverage with intentional test design.

The test validates the scenario where all columns are tagged as PII by checking for the dummy _no_non_pii_columns column. The test data is intentionally designed with non-null values to pass the not_null constraint, allowing focus on the sampling behavior rather than test failure mechanics.

Consider adding error handling for empty sample results.

While the current tests assume samples will always be present, consider adding assertions to handle cases where the query might return no results due to test execution issues.
samples = [
    json.loads(row["result_row"])
    for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id))
]
assert samples, "Expected sample rows but none were found"
assert len(samples) == TEST_SAMPLE_ROW_COUNT

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc8fba5 and 5d4dd69.

📒 Files selected for processing (4)

integration_tests/tests/test_column_pii_sampling.py (1 hunks)
macros/edr/materializations/test/test.sql (2 hunks)
macros/edr/system/system_utils/get_config_var.sql (1 hunks)
macros/edr/system/system_utils/is_pii_column.sql (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

macros/edr/materializations/test/test.sql (3)

Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Learnt from: haritamar
PR: #825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the agate_to_dict macro in macros/utils/run_queries/agate_to_dict.sql intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original agate_table, while in non-fusion mode it returns agate_table.to_dict(). This is because the content/structure of agate_table is different between the two modes.

🧬 Code Graph Analysis (1)

integration_tests/tests/test_column_pii_sampling.py (2)

integration_tests/tests/conftest.py (2)

dbt_project (88-89)

test_id (108-111)

integration_tests/tests/dbt_project.py (2)

DbtProject (42-279)

run_query (51-58)

🔇 Additional comments (9)

macros/edr/system/system_utils/get_config_var.sql (1)

58-59: LGTM! Configuration variables properly added.

The new configuration variables are well-designed:

disable_samples_on_pii_columns defaults to false for backward compatibility

pii_column_tags provides sensible default tags covering common PII naming conventions

Both variables follow the existing naming and structure patterns in the configuration
macros/edr/system/system_utils/is_pii_column.sql (2)

25-31: LGTM! Comprehensive tag collection approach.

The macro correctly collects tags from multiple metadata sources (config.tags, global tags, and meta.tags), ensuring comprehensive PII detection coverage regardless of where users define their PII tags.

4-6: Fix inverted logic condition.

The condition logic appears inverted. The macro should return early when PII column sampling is disabled (disable_samples_on_pii_columns is false), but currently returns early when it's enabled.

Apply this diff to fix the logic:
-  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}
+  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}
Wait, let me reconsider this. The variable name is disable_samples_on_pii_columns, so when it's false (default), we want to disable the PII protection and return an empty list. When it's true, we want to enable PII protection and proceed with the detection logic.

The current logic if not disable_samples_on_pii_columns means "if PII protection is enabled", so return early. This is incorrect.
-  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}
+  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}
Actually, let me think through this more carefully:

disable_samples_on_pii_columns = false (default) → PII protection disabled → return empty list

disable_samples_on_pii_columns = true → PII protection enabled → proceed with detection

So the condition should be: if not disable_samples_on_pii_columns (if PII protection is disabled), return empty list.

But the current code has if not elementary.get_config_var('disable_samples_on_pii_columns') which means "if PII protection is disabled", return empty list. This is actually correct!

Let me re-examine... No wait, I think I'm confusing myself. Let me look at the PR objectives again.

From the PR objectives: disable_samples_on_pii_columns defaults to false, and when enabled (true), it activates the PII column protection. So:

disable_samples_on_pii_columns = false → No PII protection, return empty list

disable_samples_on_pii_columns = true → Enable PII protection, proceed with detection

The current code: if not elementary.get_config_var('disable_samples_on_pii_columns') means "if disable_samples_on_pii_columns is false", return empty list. This is correct behavior.

Actually, I think I misunderstood the variable name. Let me re-read the PR objectives more carefully...

From the PR: "disable_samples_on_pii_columns (defaulting to false)" and "when PII sampling is enabled". So this variable name is a bit confusing, but it seems like:

disable_samples_on_pii_columns = true → Enable PII column protection (disable sampling on PII columns)

disable_samples_on_pii_columns = false → Disable PII column protection (allow sampling on PII columns)

So the current logic is correct: if disable_samples_on_pii_columns is false, return empty list (no protection). If it's true, proceed with detection.

Wait, I think I need to be more careful here. The variable name disable_samples_on_pii_columns is confusing. Let me interpret it as:

disable_samples_on_pii_columns = true means "disable sampling on PII columns" = enable PII protection

disable_samples_on_pii_columns = false means "don't disable sampling on PII columns" = disable PII protection

So the current logic if not disable_samples_on_pii_columns means "if we're not disabling sampling on PII columns", return empty list. This is correct.

Actually, let me step back and look at this more systematically. The logic should be:

If PII protection is disabled, return empty list (no columns to exclude)

If PII protection is enabled, proceed to find PII columns

The variable name disable_samples_on_pii_columns when true means "disable sampling on PII columns" which is equivalent to "enable PII protection".

So:

disable_samples_on_pii_columns = false → Don't disable sampling = no PII protection → return empty list

disable_samples_on_pii_columns = true → Disable sampling on PII = enable PII protection → proceed with detection

The current code: if not disable_samples_on_pii_columns means "if false", return empty list. This is correct!

I was getting confused by the double negative. The logic is actually correct.
[actions]
Skip this comment since the logic is actually correct.
-->

Likely an incorrect or invalid review comment.
macros/edr/materializations/test/test.sql (3)

54-55: LGTM! Proper parameter passing for PII detection.

The flattened_test argument is correctly passed to query_test_result_rows to enable PII column detection. The parameter formatting and line breaking maintain readability.

107-107: LGTM! Function signature properly updated.

The optional flattened_test parameter is correctly added to the function signature, maintaining backward compatibility.

143-143: LGTM! Dynamic select clause properly implemented.

The query correctly uses the dynamically built select_clause instead of the hardcoded *, enabling selective column filtering based on PII detection.

integration_tests/tests/test_column_pii_sampling.py (3)

1-24: LGTM! Well-structured test module with clear constants and query template.

The module setup is clean with appropriate imports, descriptive constants, and a reusable query template. The SAMPLES_QUERY correctly retrieves sample rows from the latest test execution.

26-60: LGTM! Comprehensive test for enabled PII protection.

The test correctly validates that:

PII columns are excluded from sample rows when protection is enabled

Safe columns remain accessible for debugging

Test failure behavior is preserved

Sample count matches expectations

The test data and configuration properly simulate real-world PII scenarios.

62-95: LGTM! Proper validation of disabled PII protection.

The test correctly verifies that when PII protection is disabled, all columns (including PII-tagged ones) are included in sample rows, maintaining backward compatibility.

- Add is_sampling_disabled_for_column macro to check column config - Modify query_test_result_rows to skip sampling when disable_samples=true - Add integration tests for disable_samples functionality - disable_samples flag overrides PII detection and works at column level - Preserves existing PII column filtering functionality Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

macros/edr/materializations/test/test.sql (1)

129-147: Performance and SQL parsing concerns need attention

The implementation has two significant issues that were flagged in previous reviews:

Double query execution: Lines 133-139 execute the test SQL with limit 0 to get column names, then line 153 executes it again for sampling. This doubles latency and warehouse costs.

Unquoted column identifiers: Line 143 joins raw column names without proper quoting, which can break with special characters, reserved words, or mixed-case columns.

These issues were previously identified and solutions were suggested in past reviews.

🧹 Nitpick comments (2)

macros/edr/materializations/test/test.sql (1)

213-256: Consider the fragility of SQL string parsing

While this dynamic approach is comprehensive, parsing SQL strings with text matching is inherently fragile. Consider these potential issues:

SQL formatting variations could break pattern matching

Complex queries with subqueries or CTEs might not be parsed correctly

Different SQL dialects might use different syntax

The static mapping approach in get_test_result_column_mapping might be more reliable for standard dbt tests, with this dynamic approach as a fallback only when needed.

integration_tests/tests/test_column_pii_sampling.py (1)

97-134: Potential test duplication

This test appears to cover the same scenario as test_column_pii_sampling_disabled - both test cases where PII tags exist but disable_samples_on_pii_tags is False. Consider consolidating these tests or ensuring they cover distinct scenarios.

The test logic itself is correct.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c45cba3 and 949b357.

📒 Files selected for processing (4)

integration_tests/tests/test_column_pii_sampling.py (1 hunks)
integration_tests/tests/test_disable_samples_config.py (1 hunks)
macros/edr/materializations/test/test.sql (2 hunks)
macros/edr/system/system_utils/is_pii_column.sql (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

macros/edr/system/system_utils/is_pii_column.sql
integration_tests/tests/test_disable_samples_config.py

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

📚 Learning: the dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requ...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Applied to files:

macros/edr/materializations/test/test.sql

📚 Learning: in the elementary dbt package, the snowflake profile generation macro intentionally always includes ...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Applied to files:

macros/edr/materializations/test/test.sql

📚 Learning: in dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Applied to files:

macros/edr/materializations/test/test.sql

📚 Learning: in dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` i...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original `agate_table`, while in non-fusion mode it returns `agate_table.to_dict()`. This is because the content/structure of `agate_table` is different between the two modes.

Applied to files:

macros/edr/materializations/test/test.sql

🔇 Additional comments (12)

macros/edr/materializations/test/test.sql (5)

67-68: LGTM - Parameter addition enables PII functionality

The addition of the flattened_test parameter correctly enables the new PII column detection functionality in the query_test_result_rows macro.

120-120: LGTM - Macro signature correctly updated

The addition of the flattened_test parameter enables the new PII column exclusion functionality.

180-211: LGTM - Comprehensive test type mapping

The static column mapping correctly handles the standard dbt test types (unique, accepted_values, relationships, not_null) and their respective column aliases. This provides a good foundation for column exclusion logic.

258-272: LGTM - Clean mapping application

The logic correctly applies column mapping to excluded columns, providing a clean interface between exclusion detection and column mapping functionality.

351-367: LGTM - Robust column configuration checking

The macro correctly navigates the model structure to check column-level disable_test_samples configuration with appropriate null safety checks.

integration_tests/tests/test_column_pii_sampling.py (7)

1-24: LGTM - Clean test setup and constants

The file structure is well-organized with clear constants and a properly structured sample query for retrieving test results.

61-95: LGTM - Addresses backward compatibility concern

This test correctly validates the scenario mentioned in past review comments: when PII tags exist but the flag is disabled, samples should include PII columns normally. The assertions properly verify both 'unique_field' and 'n_records' are present.

Note: The configuration variable name will need to be updated when the macro is fixed.

136-168: Verify edge case behavior with implementation

This test expects no samples when all columns are PII (assert len(samples) == 0), but the macro implementation (lines 144-146 in the first file) creates a dummy column "1 as _no_non_excluded_columns" when all columns are excluded.

Please verify if the expected behavior should be:

No samples collected (as this test expects), or

Samples with only the dummy column (as the implementation suggests)

The test logic is otherwise sound for this important edge case.

171-204: LGTM - Proper unique test column mapping validation

This test correctly validates that unique test column mapping works with PII exclusion - the PII-tagged column mapped to 'unique_field' is properly excluded from samples.

206-239: LGTM - Proper accepted_values test column mapping validation

This test correctly validates that accepted_values test column mapping works with PII exclusion - the PII-tagged column mapped to 'value' is properly excluded from samples.

241-275: LGTM - Validates dummy column behavior

This test correctly validates the behavior when PII columns are excluded from not_null tests - it expects the _no_non_excluded_columns dummy column, which aligns with the macro implementation.

277-314: LGTM - Comprehensive multiple PII columns test

This test properly validates that multiple PII-tagged columns are all excluded from samples, ensuring the exclusion logic works correctly with multiple columns.

coderabbitai · 2025-08-02T21:17:21Z

+@pytest.mark.skip_targets(["clickhouse"])
+def test_column_pii_sampling_enabled(test_id: str, dbt_project: DbtProject):
+    """Test that PII columns are excluded when column-level PII protection is enabled"""
+    data = [
+        {SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: None} for i in range(10)
+    ]
+
+    test_result = dbt_project.test(
+        test_id,
+        "unique",
+        test_args=dict(column_name=SENSITIVE_COLUMN),
+        data=data,
+        columns=[
+            {"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}},
+            {"name": SAFE_COLUMN},
+        ],
+        test_vars={
+            "enable_elementary_test_materialization": True,
+            "test_sample_row_count": TEST_SAMPLE_ROW_COUNT,
+            "disable_samples_on_pii_tags": True,
+            "pii_tags": ["pii"],
+        },
+    )
+    assert test_result["status"] == "fail"
+
+    samples = [
+        json.loads(row["result_row"])
+        for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id))
+    ]
+
+    assert len(samples) == 1
+    assert samples[0]["n_records"] == 10
+    assert len(samples[0]) == 1
+


⚠️ Potential issue

Configuration variable name inconsistency

Line 45 uses disable_samples_on_pii_tags: True which matches the incorrect variable name in the macro, but according to the PR objectives it should be disable_samples_on_pii_columns. This test will need to be updated when the macro is fixed.

The test logic and assertions are correct for the intended behavior.

🤖 Prompt for AI Agents

In integration_tests/tests/test_column_pii_sampling.py around lines 26 to 59, the test uses the configuration variable disable_samples_on_pii_tags which is inconsistent with the intended variable name disable_samples_on_pii_columns. Update the test_vars dictionary to replace disable_samples_on_pii_tags with disable_samples_on_pii_columns to align with the correct macro variable name and PR objectives.

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (5)

macros/edr/materializations/test/test.sql (1)
154-174: Critical: Fix configuration variable name

Line 161 uses disable_samples_on_pii_tags but according to the PR objectives and past review comments, it should be disable_samples_on_pii_columns. This mismatch will prevent the PII exclusion feature from working correctly.

Apply this fix:
-  {% if elementary.get_config_var('disable_samples_on_pii_tags') %}
+  {% if elementary.get_config_var('disable_samples_on_pii_columns') %}
integration_tests/tests/test_column_pii_sampling.py (4)
97-134: Good test for backward compatibility scenario

This test correctly validates the scenario where PII tags exist but the feature is disabled, ensuring backward compatibility. However, Line 117 needs the same configuration variable name fix.

Apply this fix:
-            "disable_samples_on_pii_tags": False,  # Flag is disabled
+            "disable_samples_on_pii_columns": False,  # Flag is disabled
136-462: Comprehensive test coverage with consistent configuration issue

The test suite provides excellent coverage of various scenarios:

All columns tagged as PII

Different dbt test types (unique, accepted_values, not_null)

Multiple PII columns

Custom SQL tests with complex scenarios

However, all tests consistently use disable_samples_on_pii_tags instead of disable_samples_on_pii_columns in their configuration. This needs to be fixed throughout the file.

Apply this fix to all test functions:
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
26-59: Critical: Configuration variable name inconsistency

Line 45 uses disable_samples_on_pii_tags: True but according to the PR objectives, this should be disable_samples_on_pii_columns. This matches the same issue in the macro and needs to be fixed for the tests to work correctly.

Apply this fix:
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
61-95: Fix configuration variable name

Line 79 has the same configuration variable name issue - should be disable_samples_on_pii_columns instead of disable_samples_on_pii_tags.

The test logic correctly validates that PII columns are included in samples when the feature is disabled, ensuring backward compatibility.

🧹 Nitpick comments (1)

macros/edr/materializations/test/test.sql (1)

178-239: Complex but necessary column mapping logic

This macro implements sophisticated column mapping to handle different dbt test types. While complex, the multiple matching strategies (exact match, common patterns, heuristics) are necessary to handle the variety of ways test results are structured.

The approach creates temporary relations for column introspection, which adds some performance overhead but provides accurate column mapping.

Consider adding more detailed logging or comments to explain the matching strategies for future maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 949b357 and 06e9451.

📒 Files selected for processing (2)

integration_tests/tests/test_column_pii_sampling.py (1 hunks)
macros/edr/materializations/test/test.sql (2 hunks)

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

📚 Learning: the dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requ...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Applied to files:

macros/edr/materializations/test/test.sql

📚 Learning: in dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` i...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original `agate_table`, while in non-fusion mode it returns `agate_table.to_dict()`. This is because the content/structure of `agate_table` is different between the two modes.

Applied to files:

macros/edr/materializations/test/test.sql

📚 Learning: in the elementary dbt package, the snowflake profile generation macro intentionally always includes ...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Applied to files:

macros/edr/materializations/test/test.sql

📚 Learning: in dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private...

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Applied to files:

macros/edr/materializations/test/test.sql

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: test (latest_official, clickhouse) / test
GitHub Check: test (latest_official, postgres) / test
GitHub Check: test (latest_pre, postgres) / test
GitHub Check: test (latest_official, athena) / test
GitHub Check: test (latest_official, snowflake) / test
GitHub Check: test (latest_official, databricks_catalog) / test
GitHub Check: test (latest_official, trino) / test
GitHub Check: test (1.8.0, postgres) / test
GitHub Check: test (latest_official, redshift) / test
GitHub Check: test (latest_official, bigquery) / test

🔇 Additional comments (4)

macros/edr/materializations/test/test.sql (4)

120-152: Improved approach for column-level PII exclusion

The new implementation provides a cleaner approach to excluding PII columns from test samples. The logic properly handles edge cases like when all columns are excluded by falling back to a dummy column.

However, note that creating temporary relations (Line 134) adds some performance overhead compared to static column mapping approaches.

241-256: LGTM - Clean helper for column introspection

This helper macro properly creates temporary relations for column metadata extraction using the standard limit 0 approach. The implementation integrates well with Elementary's existing temp table utilities.

258-274: Well-implemented column sampling check

This macro properly retrieves column-level configuration from the parent model and checks for the disable_test_samples flag. The null checks and fallback logic are appropriate for handling edge cases.

66-68: LGTM - Proper integration of flattened_test parameter

The addition of the flattened_test parameter to query_test_result_rows correctly enables the new column exclusion functionality while maintaining backward compatibility.

…st query

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (12)

integration_tests/tests/test_column_pii_sampling.py (12)
45-45: Configuration variable name inconsistency persists.

This line still uses disable_samples_on_pii_tags: True which is inconsistent with the intended variable name disable_samples_on_pii_columns mentioned in the PR objectives. This matches the previous review comment that flagged this issue.

Apply this fix to align with the correct variable name:
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
77-77: Configuration variable name inconsistency.

Same issue as line 45 - the variable name should be disable_samples_on_pii_columns instead of disable_samples_on_pii_tags.
-            "disable_samples_on_pii_tags": False,
+            "disable_samples_on_pii_columns": False,
115-115: Configuration variable name inconsistency.

Another occurrence of the incorrect variable name.
-            "disable_samples_on_pii_tags": False,  # Flag is disabled
+            "disable_samples_on_pii_columns": False,  # Flag is disabled
154-154: Configuration variable name inconsistency.

Yet another occurrence of the incorrect variable name.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
186-186: Configuration variable name inconsistency.

Continuing pattern of incorrect variable name usage.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
217-217: Configuration variable name inconsistency.

Same issue in the accepted_values test.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
248-248: Configuration variable name inconsistency.

Same issue in the not_null test.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
283-283: Configuration variable name inconsistency.

Same issue in the multiple PII columns test.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
313-313: Configuration variable name inconsistency.

Same issue in the custom SQL test.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
344-344: Configuration variable name inconsistency.

Same issue in the meta tags test.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
382-382: Configuration variable name inconsistency.

Same issue in the custom SQL test with multiple PII columns.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
413-413: Configuration variable name inconsistency.

Same issue in the subquery test.
-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b6854b and 057a2cc.

📒 Files selected for processing (3)

integration_tests/tests/test_column_pii_sampling.py (1 hunks)
macros/edr/materializations/test/test.sql (3 hunks)
macros/edr/system/system_utils/is_pii_column.sql (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

macros/edr/system/system_utils/is_pii_column.sql
macros/edr/materializations/test/test.sql

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: test (latest_official, trino) / test
GitHub Check: test (latest_official, clickhouse) / test
GitHub Check: test (latest_official, athena) / test
GitHub Check: test (latest_official, redshift) / test
GitHub Check: test (latest_official, postgres) / test
GitHub Check: test (1.8.0, postgres) / test
GitHub Check: test (latest_pre, postgres) / test
GitHub Check: test (latest_official, databricks_catalog) / test
GitHub Check: test (latest_official, bigquery) / test
GitHub Check: test (latest_official, snowflake) / test

🔇 Additional comments (4)

integration_tests/tests/test_column_pii_sampling.py (4)

1-24: Well-structured test setup with clear constants and query template.

The imports, constants, and SQL query template are appropriately defined for comprehensive integration testing of the PII sampling functionality.

338-338: Interesting test case for meta tags.

This test validates that PII tags can be specified in the meta section rather than config.tags, which adds good coverage for different tag placement options.

56-56: Comprehensive test coverage with consistent assertions.

The tests effectively validate that when PII protection is enabled, no sample rows are collected (asserting len(samples) == 0). This comprehensive coverage includes various test types (unique, not_null, accepted_values), edge cases (all columns PII), and different PII tag configurations.

Also applies to: 166-166, 197-197, 228-228, 259-259, 293-293, 324-324, 355-355, 392-392, 422-422

87-92: Proper validation of disabled PII protection scenarios.

These tests correctly verify that when PII protection is disabled, sample data is collected normally with the expected structure and values. The assertions validate both the presence of expected fields and their correct values.

Also applies to: 127-131

haritamar · 2025-08-03T14:52:18Z

 {% endmacro %}

-{% macro query_test_result_rows(sample_limit=none, ignore_passed_tests=false) %}
+{% macro query_test_result_rows(sample_limit=none, ignore_passed_tests=false, flattened_test=none) %}


Is the flattened_test param needed?

haritamar · 2025-08-03T14:53:42Z

@@ -0,0 +1,47 @@
+{% macro get_pii_columns_from_parent_model(flattened_test) %}


Let's rename the file to match the macro?

haritamar · 2025-08-03T14:56:46Z

+
+    {# column -> meta -> tags #}
+    {% set meta_dict = column_node.get('meta', {}) %}
+    {% set meta_tags = meta_dict.get('tags', []) %}


We should also consider the case where the model itself has a PII tag.

This is implemented here:

dbt-data-reliability/macros/edr/materializations/test/test.sql

Line 62 in 25ddd67

{% elif elementary.is_pii_table(flattened_test) %}

haritamar · 2025-08-03T14:57:52Z

+
+  {% for column_node in column_nodes.values() %}
+    {# column -> tags #}
+    {% set column_tags = column_node.get('tags', []) %}


Consider creating a separate macro just to get the tags of a column

- Rename is_pii_column.sql to get_pii_columns_from_parent_model.sql to match macro name - Extract get_column_tags helper macro for better code organization - Add model-level PII tag support - if model has PII tags, all columns are considered PII - Add clarifying comment about intentional over-censoring with * queries in CTEs - Keep flattened_test parameter in query_test_result_rows for future functionality Fixes all outstanding code review feedback from haritamar on PR #833 Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql (1)
59-61: Leverage the locally-defined macro directly

Referencing the same-package macro via the package namespace works, but makes refactors harder (e.g. renaming the package).
Using the bare macro name keeps the call shorter and avoids an unnecessary dependency edge.
-    {% set all_column_tags_lower = elementary.get_column_tags(column_node) %}
+    {% set all_column_tags_lower = get_column_tags(column_node) %}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 057a2cc and 89e49de.

📒 Files selected for processing (2)

macros/edr/materializations/test/test.sql (3 hunks)
macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

macros/edr/materializations/test/test.sql

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

- Fix get_column_tags to properly handle string tags vs lists - Fix pii_tags processing to handle scalar strings correctly - Apply consistent string/list normalization to model tags - Addresses CodeRabbit comments r2250021209 and r2250021214 The core issue was that Jinja2's 'is iterable' test returns True for strings, causing scalar string tags like 'pii' to be treated as character arrays ['p','i','i'] instead of single-element lists ['pii']. Fixed by using explicit string checking with 'src is string' to properly distinguish between strings and actual lists/arrays. Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>

- Remove model-level PII tag checking from get_pii_columns_from_parent_model - This logic is handled elsewhere as noted by Yosef - Simplifies macro to focus only on column-level PII detection Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

integration_tests/tests/test_column_pii_sampling.py (1)

42-46: Config key disable_samples_on_pii_tags is correct – earlier warning can be ignored

Past feedback suggested renaming this variable to disable_samples_on_pii_columns, but the maintainer explicitly confirmed the project intentionally uses disable_samples_on_pii_tags (see retrieved learning from arbiv).
No change is required here.

Also applies to: 70-77, 148-152, 187-191, 218-223, 250-254, 281-285, 318-319, 346-350, 378-381

🧹 Nitpick comments (1)

integration_tests/tests/test_column_pii_sampling.py (1)

1-91: Heavy duplication – consider parametrising the suite

Each test repeats the same setup pattern (data creation, dbt_project.test call, sample query, assertions). Pytest’s @pytest.mark.parametrize or a helper fixture would shrink this file dramatically, speed execution, and simplify future maintenance.

Also applies to: 95-167, 170-233, 236-295, 298-359, 362-390

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cc4050 and 812d9c2.

📒 Files selected for processing (1)

integration_tests/tests/test_column_pii_sampling.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: arbiv
PR: elementary-data/dbt-data-reliability#833
File: macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql:20-22
Timestamp: 2025-08-04T04:46:05.802Z
Learning: In the dbt-data-reliability project, the config key for disabling samples on PII tags is intentionally named "disable_samples_on_pii_tags" (not "disable_samples_on_pii_columns" as might be expected), as confirmed by the maintainer arbiv.

Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

📚 Learning: in the dbt-data-reliability project, the config key for disabling samples on pii tags is intentional...

Learnt from: arbiv
PR: elementary-data/dbt-data-reliability#833
File: macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql:20-22
Timestamp: 2025-08-04T04:46:05.802Z
Learning: In the dbt-data-reliability project, the config key for disabling samples on PII tags is intentionally named "disable_samples_on_pii_tags" (not "disable_samples_on_pii_columns" as might be expected), as confirmed by the maintainer arbiv.

Applied to files:

integration_tests/tests/test_column_pii_sampling.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)

GitHub Check: test (1.8.0, postgres) / test
GitHub Check: test (latest_official, clickhouse) / test
GitHub Check: test (latest_official, redshift) / test
GitHub Check: test (latest_official, bigquery) / test
GitHub Check: test (latest_pre, postgres) / test
GitHub Check: test (latest_official, snowflake) / test
GitHub Check: test (latest_official, trino) / test
GitHub Check: test (latest_official, athena) / test
GitHub Check: test (latest_official, databricks_catalog) / test
GitHub Check: test (latest_official, postgres) / test

🔇 Additional comments (1)

integration_tests/tests/test_column_pii_sampling.py (1)

110-114: Verify macro variable name for custom PII tag list

All tests pass a pii_tags variable, but the PR description and original design reference pii_column_tags.
Please double-check the macros’ var() lookup to ensure the spelling aligns; otherwise these tests may silently bypass the intended override.

Also applies to: 249-254, 281-285, 349-350, 379-381

devin-ai-integration Bot temporarily deployed to elementary_test_env July 30, 2025 08:13 Inactive

devin-ai-integration Bot had a problem deploying to elementary_test_env July 30, 2025 08:13 Failure

arbiv mentioned this pull request Jul 30, 2025

Implement column-level PII protection for sample collection #832

Closed

5 tasks

coderabbitai Bot reviewed Jul 30, 2025

View reviewed changes

devin-ai-integration Bot had a problem deploying to elementary_test_env July 30, 2025 12:25 Error

devin-ai-integration Bot temporarily deployed to elementary_test_env July 30, 2025 12:25 Inactive

devin-ai-integration Bot had a problem deploying to elementary_test_env July 30, 2025 12:25 Error

arbiv requested changes Jul 30, 2025

View reviewed changes

Comment thread macros/edr/materializations/test/test.sql Outdated

arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error

coderabbitai Bot reviewed Aug 2, 2025

View reviewed changes

using dbt parser

06e9451

arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure

coderabbitai Bot reviewed Aug 2, 2025

View reviewed changes

arbiv added 2 commits August 3, 2025 11:48

changed column logic to not sample if one of the columns is in the te…

3b6854b

…st query

improved tests and code readability

057a2cc

coderabbitai Bot reviewed Aug 3, 2025

View reviewed changes

Comment thread integration_tests/tests/test_column_pii_sampling.py Outdated

haritamar reviewed Aug 3, 2025

View reviewed changes

coderabbitai Bot reviewed Aug 3, 2025

View reviewed changes

Comment thread macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql

Comment thread macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql

Comment thread macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql Outdated

devin-ai-integration Bot and others added 3 commits August 4, 2025 04:51

improved tests

812d9c2

coderabbitai Bot reviewed Aug 4, 2025

View reviewed changes

fixed cr comment

25ddd67

haritamar approved these changes Aug 4, 2025

View reviewed changes

coderabbitai Bot mentioned this pull request Oct 5, 2025

Ele 4724 dbt fusion support #825

Merged

		@@ -0,0 +1,47 @@
		{% macro get_pii_columns_from_parent_model(flattened_test) %}

Conversation

devin-ai-integration Bot commented Jul 30, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

devin-ai-integration Bot commented Jul 30, 2025

🤖 Devin AI Engineer

Uh oh!

linear Bot commented Jul 30, 2025

Uh oh!

coderabbitai Bot commented Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Assessment against linked issues

Assessment against linked issues: Out-of-scope changes

Possibly related PRs

Suggested reviewers

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

github-actions Bot commented Jul 30, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Aug 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

haritamar Aug 3, 2025

Choose a reason for hiding this comment

Uh oh!

haritamar Aug 3, 2025

Choose a reason for hiding this comment

Uh oh!

haritamar Aug 3, 2025

Choose a reason for hiding this comment

Uh oh!

arbiv Aug 4, 2025

Choose a reason for hiding this comment

Uh oh!

haritamar Aug 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

devin-ai-integration Bot commented Jul 30, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 30, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)