Skip to content

Implement column-level PII protection for sample collection#833

Merged
arbiv merged 17 commits into
masterfrom
devin-ELE-4850-1753859795
Aug 4, 2025
Merged

Implement column-level PII protection for sample collection#833
arbiv merged 17 commits into
masterfrom
devin-ELE-4850-1753859795

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot commented Jul 30, 2025

Implement column-level PII protection for sample collection

Summary by CodeRabbit

  • New Features

    • Added support for disabling sample row collection for columns tagged as PII or explicitly configured to disable sampling during tests.
    • Introduced configuration options to control sample data exposure for sensitive columns, including custom tags and per-column settings.
    • Enhanced test coverage to verify correct handling of PII and sampling exclusion under various configurations.
  • Bug Fixes

    • Ensured that sample rows are properly excluded for tests involving PII or disabled columns, preventing unintended data exposure.
  • Tests

    • Added comprehensive integration tests to validate PII sampling behavior and the "disable_test_samples" configuration across multiple scenarios.

- Add disable_samples_on_pii_columns and pii_column_tags config variables
- Create get_pii_columns_from_parent_model helper macro for column PII detection
- Modify query_test_result_rows to dynamically exclude PII columns from SELECT
- Add comprehensive integration tests for column-level PII protection
- Maintain backward compatibility with existing table-level PII detection
- Follow existing patterns for configuration, tag handling, and testing

Addresses Linear issue ELE-4850

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@linear
Copy link
Copy Markdown

linear Bot commented Jul 30, 2025

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jul 30, 2025

Walkthrough

This update introduces column-level PII protection for dbt test sample collection. It adds macros to detect PII columns, modifies sampling logic to exclude PII-tagged or explicitly disabled columns, and provides new configuration options. Comprehensive integration tests verify correct behavior for various PII and disable-sample scenarios, ensuring granular control over sample data exposure.

Changes

Cohort / File(s) Change Summary
Integration Tests: PII Sampling
integration_tests/tests/test_column_pii_sampling.py
Adds comprehensive integration tests for column-level PII sampling, covering various configurations and test types, verifying correct inclusion/exclusion of sample rows and columns based on PII tags and sampling settings.
Integration Tests: Disable Samples Config
integration_tests/tests/test_disable_samples_config.py
Introduces integration tests for the disable_test_samples column config, testing its interaction with PII tags and verifying sample row exclusion logic for multiple columns and configurations.
DbtProject Test API
integration_tests/tests/dbt_project.py
Updates the DbtProject.test method to accept a column_config parameter, supporting per-column sample disabling and PII tagging in test setup.
Macros: PII Column Detection & Sampling Logic
macros/edr/materializations/test/test.sql
Adds macros for detecting PII columns, determining when to disable sampling, and excluding columns from samples. Updates test materialization logic to enforce column-level PII and disable-sample configurations.
Macros: System Utils for PII
macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql
Adds macros to extract tags from columns and parent models, and to identify PII columns based on tags and config, supporting column-level PII detection for sample exclusion logic.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DbtProject
    participant dbt_test_macros
    participant SystemUtils

    User->>DbtProject: Run test with column config/PII tags
    DbtProject->>dbt_test_macros: Execute handle_dbt_test(flattened_test)
    dbt_test_macros->>SystemUtils: get_pii_columns_from_parent_model(flattened_test)
    SystemUtils-->>dbt_test_macros: Return list of PII columns
    dbt_test_macros->>dbt_test_macros: should_disable_sampling_for_pii(flattened_test)
    alt Sampling disabled
        dbt_test_macros->>dbt_test_macros: Set sample_limit = 0
    else Sampling enabled
        dbt_test_macros->>dbt_test_macros: get_columns_to_exclude_from_sampling(flattened_test)
    end
    dbt_test_macros->>DbtProject: Return test results with correct sample rows/columns
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Assessment against linked issues

Objective Addressed Explanation
Add new configuration variables for column-level PII protection (ELE-4850)
Create a column PII detection helper macro (ELE-4850)
Modify sample collection logic to exclude PII columns, not just tables (ELE-4850)
Integration tests for column-level PII protection (ELE-4850)
Update documentation (ELE-4850) Documentation updates are not present in this PR.

Assessment against linked issues: Out-of-scope changes

Code Change Explanation
None found All code changes are directly related to the objectives in ELE-4850.

Possibly related PRs

Suggested reviewers

  • michael-myaskovsky

Poem

In the warren where the data flows,
We burrow deep where the PII grows.
Now, with nimble code and clever tags,
Only secrets hide—no more red flags!
Samples hop past columns marked "private,"
Leaving carrots for the rest—oh, how I delight it!
🥕🐇 Data safe, and tests so bright!

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 812d9c2 and 25ddd67.

📒 Files selected for processing (1)
  • macros/edr/materializations/test/test.sql (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • macros/edr/materializations/test/test.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test (latest_official, postgres) / test
  • GitHub Check: test (latest_official, redshift) / test
  • GitHub Check: test (latest_official, clickhouse) / test
  • GitHub Check: test (1.8.0, postgres) / test
  • GitHub Check: test (latest_official, trino) / test
  • GitHub Check: test (latest_official, bigquery) / test
  • GitHub Check: test (latest_official, snowflake) / test
  • GitHub Check: test (latest_pre, postgres) / test
  • GitHub Check: test (latest_official, databricks_catalog) / test
  • GitHub Check: test (latest_official, athena) / test
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin-ELE-4850-1753859795

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions
Copy link
Copy Markdown
Contributor

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
macros/edr/system/system_utils/is_pii_column.sql (1)

21-23: Consider using ensure_list helper for cleaner code.

The string-to-list conversion logic is functional but could be more robust.

If Elementary has a helper macro for ensuring list types, consider using it:

-  {% if pii_column_tags is string %}
-    {% set pii_column_tags = [pii_column_tags] %}
-  {% endif %}
+  {% set pii_column_tags = elementary.ensure_list(pii_column_tags) %}

Alternatively, the current approach could handle additional edge cases:

   {% set pii_column_tags = elementary.get_config_var('pii_column_tags') %}
-  {% if pii_column_tags is string %}
-    {% set pii_column_tags = [pii_column_tags] %}
-  {% endif %}
+  {% if pii_column_tags is string %}
+    {% set pii_column_tags = [pii_column_tags] %}
+  {% elif not pii_column_tags %}
+    {% set pii_column_tags = [] %}
+  {% endif %}
macros/edr/materializations/test/test.sql (1)

116-137: LGTM! Robust PII column filtering logic with good edge case handling.

The implementation correctly:

  • Detects PII columns only when flattened_test is provided
  • Queries column names from the test results to build selective queries
  • Filters out PII columns while preserving safe columns
  • Handles the edge case where all columns are PII with a descriptive dummy column

The _no_non_pii_columns dummy column provides clear indication when no safe columns remain.

Consider performance optimization for high-frequency tests.

The additional query to retrieve column names adds overhead. For high-frequency test scenarios, consider caching column metadata or optimizing this lookup.

integration_tests/tests/test_column_pii_sampling.py (1)

97-133: LGTM! Good edge case coverage with intentional test design.

The test validates the scenario where all columns are tagged as PII by checking for the dummy _no_non_pii_columns column. The test data is intentionally designed with non-null values to pass the not_null constraint, allowing focus on the sampling behavior rather than test failure mechanics.

Consider adding error handling for empty sample results.

While the current tests assume samples will always be present, consider adding assertions to handle cases where the query might return no results due to test execution issues.

samples = [
    json.loads(row["result_row"])
    for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id))
]
assert samples, "Expected sample rows but none were found"
assert len(samples) == TEST_SAMPLE_ROW_COUNT
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc8fba5 and 5d4dd69.

📒 Files selected for processing (4)
  • integration_tests/tests/test_column_pii_sampling.py (1 hunks)
  • macros/edr/materializations/test/test.sql (2 hunks)
  • macros/edr/system/system_utils/get_config_var.sql (1 hunks)
  • macros/edr/system/system_utils/is_pii_column.sql (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
macros/edr/materializations/test/test.sql (3)

Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Learnt from: haritamar
PR: #825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the agate_to_dict macro in macros/utils/run_queries/agate_to_dict.sql intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original agate_table, while in non-fusion mode it returns agate_table.to_dict(). This is because the content/structure of agate_table is different between the two modes.

🧬 Code Graph Analysis (1)
integration_tests/tests/test_column_pii_sampling.py (2)
integration_tests/tests/conftest.py (2)
  • dbt_project (88-89)
  • test_id (108-111)
integration_tests/tests/dbt_project.py (2)
  • DbtProject (42-279)
  • run_query (51-58)
🔇 Additional comments (9)
macros/edr/system/system_utils/get_config_var.sql (1)

58-59: LGTM! Configuration variables properly added.

The new configuration variables are well-designed:

  • disable_samples_on_pii_columns defaults to false for backward compatibility
  • pii_column_tags provides sensible default tags covering common PII naming conventions
  • Both variables follow the existing naming and structure patterns in the configuration
macros/edr/system/system_utils/is_pii_column.sql (2)

25-31: LGTM! Comprehensive tag collection approach.

The macro correctly collects tags from multiple metadata sources (config.tags, global tags, and meta.tags), ensuring comprehensive PII detection coverage regardless of where users define their PII tags.


4-6: Fix inverted logic condition.

The condition logic appears inverted. The macro should return early when PII column sampling is disabled (disable_samples_on_pii_columns is false), but currently returns early when it's enabled.

Apply this diff to fix the logic:

-  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}
+  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}

Wait, let me reconsider this. The variable name is disable_samples_on_pii_columns, so when it's false (default), we want to disable the PII protection and return an empty list. When it's true, we want to enable PII protection and proceed with the detection logic.

The current logic if not disable_samples_on_pii_columns means "if PII protection is enabled", so return early. This is incorrect.

-  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}
+  {% if not elementary.get_config_var('disable_samples_on_pii_columns') %}

Actually, let me think through this more carefully:

  • disable_samples_on_pii_columns = false (default) → PII protection disabled → return empty list
  • disable_samples_on_pii_columns = true → PII protection enabled → proceed with detection

So the condition should be: if not disable_samples_on_pii_columns (if PII protection is disabled), return empty list.

But the current code has if not elementary.get_config_var('disable_samples_on_pii_columns') which means "if PII protection is disabled", return empty list. This is actually correct!

Let me re-examine... No wait, I think I'm confusing myself. Let me look at the PR objectives again.

From the PR objectives: disable_samples_on_pii_columns defaults to false, and when enabled (true), it activates the PII column protection. So:

  • disable_samples_on_pii_columns = false → No PII protection, return empty list
  • disable_samples_on_pii_columns = true → Enable PII protection, proceed with detection

The current code: if not elementary.get_config_var('disable_samples_on_pii_columns') means "if disable_samples_on_pii_columns is false", return empty list. This is correct behavior.

Actually, I think I misunderstood the variable name. Let me re-read the PR objectives more carefully...

From the PR: "disable_samples_on_pii_columns (defaulting to false)" and "when PII sampling is enabled". So this variable name is a bit confusing, but it seems like:

  • disable_samples_on_pii_columns = true → Enable PII column protection (disable sampling on PII columns)
  • disable_samples_on_pii_columns = false → Disable PII column protection (allow sampling on PII columns)

So the current logic is correct: if disable_samples_on_pii_columns is false, return empty list (no protection). If it's true, proceed with detection.

Wait, I think I need to be more careful here. The variable name disable_samples_on_pii_columns is confusing. Let me interpret it as:

  • disable_samples_on_pii_columns = true means "disable sampling on PII columns" = enable PII protection
  • disable_samples_on_pii_columns = false means "don't disable sampling on PII columns" = disable PII protection

So the current logic if not disable_samples_on_pii_columns means "if we're not disabling sampling on PII columns", return empty list. This is correct.

Actually, let me step back and look at this more systematically. The logic should be:

  • If PII protection is disabled, return empty list (no columns to exclude)
  • If PII protection is enabled, proceed to find PII columns

The variable name disable_samples_on_pii_columns when true means "disable sampling on PII columns" which is equivalent to "enable PII protection".

So:

  • disable_samples_on_pii_columns = false → Don't disable sampling = no PII protection → return empty list
  • disable_samples_on_pii_columns = true → Disable sampling on PII = enable PII protection → proceed with detection

The current code: if not disable_samples_on_pii_columns means "if false", return empty list. This is correct!

I was getting confused by the double negative. The logic is actually correct.
[actions]
Skip this comment since the logic is actually correct.
-->

Likely an incorrect or invalid review comment.

macros/edr/materializations/test/test.sql (3)

54-55: LGTM! Proper parameter passing for PII detection.

The flattened_test argument is correctly passed to query_test_result_rows to enable PII column detection. The parameter formatting and line breaking maintain readability.


107-107: LGTM! Function signature properly updated.

The optional flattened_test parameter is correctly added to the function signature, maintaining backward compatibility.


143-143: LGTM! Dynamic select clause properly implemented.

The query correctly uses the dynamically built select_clause instead of the hardcoded *, enabling selective column filtering based on PII detection.

integration_tests/tests/test_column_pii_sampling.py (3)

1-24: LGTM! Well-structured test module with clear constants and query template.

The module setup is clean with appropriate imports, descriptive constants, and a reusable query template. The SAMPLES_QUERY correctly retrieves sample rows from the latest test execution.


26-60: LGTM! Comprehensive test for enabled PII protection.

The test correctly validates that:

  • PII columns are excluded from sample rows when protection is enabled
  • Safe columns remain accessible for debugging
  • Test failure behavior is preserved
  • Sample count matches expectations

The test data and configuration properly simulate real-world PII scenarios.


62-95: LGTM! Proper validation of disabled PII protection.

The test correctly verifies that when PII protection is disabled, all columns (including PII-tagged ones) are included in sample rows, maintaining backward compatibility.

- Add is_sampling_disabled_for_column macro to check column config
- Modify query_test_result_rows to skip sampling when disable_samples=true
- Add integration tests for disable_samples functionality
- disable_samples flag overrides PII detection and works at column level
- Preserves existing PII column filtering functionality

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
Comment thread macros/edr/materializations/test/test.sql Outdated
- Add is_sampling_disabled_for_column macro to check column config
- Modify query_test_result_rows to skip sampling when disable_samples=true
- Add integration tests for disable_samples functionality
- disable_samples flag overrides PII detection and works at column level
- Preserves existing PII column filtering functionality

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:11 — with GitHub Actions Error
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
macros/edr/materializations/test/test.sql (1)

129-147: Performance and SQL parsing concerns need attention

The implementation has two significant issues that were flagged in previous reviews:

  1. Double query execution: Lines 133-139 execute the test SQL with limit 0 to get column names, then line 153 executes it again for sampling. This doubles latency and warehouse costs.

  2. Unquoted column identifiers: Line 143 joins raw column names without proper quoting, which can break with special characters, reserved words, or mixed-case columns.

These issues were previously identified and solutions were suggested in past reviews.

🧹 Nitpick comments (2)
macros/edr/materializations/test/test.sql (1)

213-256: Consider the fragility of SQL string parsing

While this dynamic approach is comprehensive, parsing SQL strings with text matching is inherently fragile. Consider these potential issues:

  1. SQL formatting variations could break pattern matching
  2. Complex queries with subqueries or CTEs might not be parsed correctly
  3. Different SQL dialects might use different syntax

The static mapping approach in get_test_result_column_mapping might be more reliable for standard dbt tests, with this dynamic approach as a fallback only when needed.

integration_tests/tests/test_column_pii_sampling.py (1)

97-134: Potential test duplication

This test appears to cover the same scenario as test_column_pii_sampling_disabled - both test cases where PII tags exist but disable_samples_on_pii_tags is False. Consider consolidating these tests or ensuring they cover distinct scenarios.

The test logic itself is correct.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c45cba3 and 949b357.

📒 Files selected for processing (4)
  • integration_tests/tests/test_column_pii_sampling.py (1 hunks)
  • integration_tests/tests/test_disable_samples_config.py (1 hunks)
  • macros/edr/materializations/test/test.sql (2 hunks)
  • macros/edr/system/system_utils/is_pii_column.sql (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • macros/edr/system/system_utils/is_pii_column.sql
  • integration_tests/tests/test_disable_samples_config.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
📚 Learning: the dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Applied to files:

  • macros/edr/materializations/test/test.sql
📚 Learning: in the elementary dbt package, the snowflake profile generation macro intentionally always includes ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Applied to files:

  • macros/edr/materializations/test/test.sql
📚 Learning: in dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Applied to files:

  • macros/edr/materializations/test/test.sql
📚 Learning: in dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` i...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original `agate_table`, while in non-fusion mode it returns `agate_table.to_dict()`. This is because the content/structure of `agate_table` is different between the two modes.

Applied to files:

  • macros/edr/materializations/test/test.sql
🔇 Additional comments (12)
macros/edr/materializations/test/test.sql (5)

67-68: LGTM - Parameter addition enables PII functionality

The addition of the flattened_test parameter correctly enables the new PII column detection functionality in the query_test_result_rows macro.


120-120: LGTM - Macro signature correctly updated

The addition of the flattened_test parameter enables the new PII column exclusion functionality.


180-211: LGTM - Comprehensive test type mapping

The static column mapping correctly handles the standard dbt test types (unique, accepted_values, relationships, not_null) and their respective column aliases. This provides a good foundation for column exclusion logic.


258-272: LGTM - Clean mapping application

The logic correctly applies column mapping to excluded columns, providing a clean interface between exclusion detection and column mapping functionality.


351-367: LGTM - Robust column configuration checking

The macro correctly navigates the model structure to check column-level disable_test_samples configuration with appropriate null safety checks.

integration_tests/tests/test_column_pii_sampling.py (7)

1-24: LGTM - Clean test setup and constants

The file structure is well-organized with clear constants and a properly structured sample query for retrieving test results.


61-95: LGTM - Addresses backward compatibility concern

This test correctly validates the scenario mentioned in past review comments: when PII tags exist but the flag is disabled, samples should include PII columns normally. The assertions properly verify both 'unique_field' and 'n_records' are present.

Note: The configuration variable name will need to be updated when the macro is fixed.


136-168: Verify edge case behavior with implementation

This test expects no samples when all columns are PII (assert len(samples) == 0), but the macro implementation (lines 144-146 in the first file) creates a dummy column "1 as _no_non_excluded_columns" when all columns are excluded.

Please verify if the expected behavior should be:

  • No samples collected (as this test expects), or
  • Samples with only the dummy column (as the implementation suggests)

The test logic is otherwise sound for this important edge case.


171-204: LGTM - Proper unique test column mapping validation

This test correctly validates that unique test column mapping works with PII exclusion - the PII-tagged column mapped to 'unique_field' is properly excluded from samples.


206-239: LGTM - Proper accepted_values test column mapping validation

This test correctly validates that accepted_values test column mapping works with PII exclusion - the PII-tagged column mapped to 'value' is properly excluded from samples.


241-275: LGTM - Validates dummy column behavior

This test correctly validates the behavior when PII columns are excluded from not_null tests - it expects the _no_non_excluded_columns dummy column, which aligns with the macro implementation.


277-314: LGTM - Comprehensive multiple PII columns test

This test properly validates that multiple PII-tagged columns are all excluded from samples, ensuring the exclusion logic works correctly with multiple columns.

Comment on lines +26 to +59
@pytest.mark.skip_targets(["clickhouse"])
def test_column_pii_sampling_enabled(test_id: str, dbt_project: DbtProject):
"""Test that PII columns are excluded when column-level PII protection is enabled"""
data = [
{SENSITIVE_COLUMN: "user@example.com", SAFE_COLUMN: None} for i in range(10)
]

test_result = dbt_project.test(
test_id,
"unique",
test_args=dict(column_name=SENSITIVE_COLUMN),
data=data,
columns=[
{"name": SENSITIVE_COLUMN, "config": {"tags": ["pii"]}},
{"name": SAFE_COLUMN},
],
test_vars={
"enable_elementary_test_materialization": True,
"test_sample_row_count": TEST_SAMPLE_ROW_COUNT,
"disable_samples_on_pii_tags": True,
"pii_tags": ["pii"],
},
)
assert test_result["status"] == "fail"

samples = [
json.loads(row["result_row"])
for row in dbt_project.run_query(SAMPLES_QUERY.format(test_id=test_id))
]

assert len(samples) == 1
assert samples[0]["n_records"] == 10
assert len(samples[0]) == 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Configuration variable name inconsistency

Line 45 uses disable_samples_on_pii_tags: True which matches the incorrect variable name in the macro, but according to the PR objectives it should be disable_samples_on_pii_columns. This test will need to be updated when the macro is fixed.

The test logic and assertions are correct for the intended behavior.

🤖 Prompt for AI Agents
In integration_tests/tests/test_column_pii_sampling.py around lines 26 to 59,
the test uses the configuration variable disable_samples_on_pii_tags which is
inconsistent with the intended variable name disable_samples_on_pii_columns.
Update the test_vars dictionary to replace disable_samples_on_pii_tags with
disable_samples_on_pii_columns to align with the correct macro variable name and
PR objectives.

Comment thread macros/edr/materializations/test/test.sql
Comment thread macros/edr/materializations/test/test.sql Outdated
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
@arbiv arbiv had a problem deploying to elementary_test_env August 2, 2025 21:39 — with GitHub Actions Failure
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (5)
macros/edr/materializations/test/test.sql (1)

154-174: Critical: Fix configuration variable name

Line 161 uses disable_samples_on_pii_tags but according to the PR objectives and past review comments, it should be disable_samples_on_pii_columns. This mismatch will prevent the PII exclusion feature from working correctly.

Apply this fix:

-  {% if elementary.get_config_var('disable_samples_on_pii_tags') %}
+  {% if elementary.get_config_var('disable_samples_on_pii_columns') %}
integration_tests/tests/test_column_pii_sampling.py (4)

97-134: Good test for backward compatibility scenario

This test correctly validates the scenario where PII tags exist but the feature is disabled, ensuring backward compatibility. However, Line 117 needs the same configuration variable name fix.

Apply this fix:

-            "disable_samples_on_pii_tags": False,  # Flag is disabled
+            "disable_samples_on_pii_columns": False,  # Flag is disabled

136-462: Comprehensive test coverage with consistent configuration issue

The test suite provides excellent coverage of various scenarios:

  • All columns tagged as PII
  • Different dbt test types (unique, accepted_values, not_null)
  • Multiple PII columns
  • Custom SQL tests with complex scenarios

However, all tests consistently use disable_samples_on_pii_tags instead of disable_samples_on_pii_columns in their configuration. This needs to be fixed throughout the file.

Apply this fix to all test functions:

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

26-59: Critical: Configuration variable name inconsistency

Line 45 uses disable_samples_on_pii_tags: True but according to the PR objectives, this should be disable_samples_on_pii_columns. This matches the same issue in the macro and needs to be fixed for the tests to work correctly.

Apply this fix:

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

61-95: Fix configuration variable name

Line 79 has the same configuration variable name issue - should be disable_samples_on_pii_columns instead of disable_samples_on_pii_tags.

The test logic correctly validates that PII columns are included in samples when the feature is disabled, ensuring backward compatibility.

🧹 Nitpick comments (1)
macros/edr/materializations/test/test.sql (1)

178-239: Complex but necessary column mapping logic

This macro implements sophisticated column mapping to handle different dbt test types. While complex, the multiple matching strategies (exact match, common patterns, heuristics) are necessary to handle the variety of ways test results are structured.

The approach creates temporary relations for column introspection, which adds some performance overhead but provides accurate column mapping.

Consider adding more detailed logging or comments to explain the matching strategies for future maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 949b357 and 06e9451.

📒 Files selected for processing (2)
  • integration_tests/tests/test_column_pii_sampling.py (1 hunks)
  • macros/edr/materializations/test/test.sql (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
📚 Learning: the dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Applied to files:

  • macros/edr/materializations/test/test.sql
📚 Learning: in dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` i...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the `agate_to_dict` macro in `macros/utils/run_queries/agate_to_dict.sql` intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original `agate_table`, while in non-fusion mode it returns `agate_table.to_dict()`. This is because the content/structure of `agate_table` is different between the two modes.

Applied to files:

  • macros/edr/materializations/test/test.sql
📚 Learning: in the elementary dbt package, the snowflake profile generation macro intentionally always includes ...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Applied to files:

  • macros/edr/materializations/test/test.sql
📚 Learning: in dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private...
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.

Applied to files:

  • macros/edr/materializations/test/test.sql
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test (latest_official, clickhouse) / test
  • GitHub Check: test (latest_official, postgres) / test
  • GitHub Check: test (latest_pre, postgres) / test
  • GitHub Check: test (latest_official, athena) / test
  • GitHub Check: test (latest_official, snowflake) / test
  • GitHub Check: test (latest_official, databricks_catalog) / test
  • GitHub Check: test (latest_official, trino) / test
  • GitHub Check: test (1.8.0, postgres) / test
  • GitHub Check: test (latest_official, redshift) / test
  • GitHub Check: test (latest_official, bigquery) / test
🔇 Additional comments (4)
macros/edr/materializations/test/test.sql (4)

120-152: Improved approach for column-level PII exclusion

The new implementation provides a cleaner approach to excluding PII columns from test samples. The logic properly handles edge cases like when all columns are excluded by falling back to a dummy column.

However, note that creating temporary relations (Line 134) adds some performance overhead compared to static column mapping approaches.


241-256: LGTM - Clean helper for column introspection

This helper macro properly creates temporary relations for column metadata extraction using the standard limit 0 approach. The implementation integrates well with Elementary's existing temp table utilities.


258-274: Well-implemented column sampling check

This macro properly retrieves column-level configuration from the parent model and checks for the disable_test_samples flag. The null checks and fallback logic are appropriate for handling edge cases.


66-68: LGTM - Proper integration of flattened_test parameter

The addition of the flattened_test parameter to query_test_result_rows correctly enables the new column exclusion functionality while maintaining backward compatibility.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (12)
integration_tests/tests/test_column_pii_sampling.py (12)

45-45: Configuration variable name inconsistency persists.

This line still uses disable_samples_on_pii_tags: True which is inconsistent with the intended variable name disable_samples_on_pii_columns mentioned in the PR objectives. This matches the previous review comment that flagged this issue.

Apply this fix to align with the correct variable name:

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

77-77: Configuration variable name inconsistency.

Same issue as line 45 - the variable name should be disable_samples_on_pii_columns instead of disable_samples_on_pii_tags.

-            "disable_samples_on_pii_tags": False,
+            "disable_samples_on_pii_columns": False,

115-115: Configuration variable name inconsistency.

Another occurrence of the incorrect variable name.

-            "disable_samples_on_pii_tags": False,  # Flag is disabled
+            "disable_samples_on_pii_columns": False,  # Flag is disabled

154-154: Configuration variable name inconsistency.

Yet another occurrence of the incorrect variable name.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

186-186: Configuration variable name inconsistency.

Continuing pattern of incorrect variable name usage.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

217-217: Configuration variable name inconsistency.

Same issue in the accepted_values test.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

248-248: Configuration variable name inconsistency.

Same issue in the not_null test.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

283-283: Configuration variable name inconsistency.

Same issue in the multiple PII columns test.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

313-313: Configuration variable name inconsistency.

Same issue in the custom SQL test.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

344-344: Configuration variable name inconsistency.

Same issue in the meta tags test.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

382-382: Configuration variable name inconsistency.

Same issue in the custom SQL test with multiple PII columns.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,

413-413: Configuration variable name inconsistency.

Same issue in the subquery test.

-            "disable_samples_on_pii_tags": True,
+            "disable_samples_on_pii_columns": True,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b6854b and 057a2cc.

📒 Files selected for processing (3)
  • integration_tests/tests/test_column_pii_sampling.py (1 hunks)
  • macros/edr/materializations/test/test.sql (3 hunks)
  • macros/edr/system/system_utils/is_pii_column.sql (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • macros/edr/system/system_utils/is_pii_column.sql
  • macros/edr/materializations/test/test.sql
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/commands/create_elementary_user.sql:29-30
Timestamp: 2025-07-27T11:19:12.019Z
Learning: In dbt macros that generate profile parameters, sensitive values like passwords, tokens, and private keys are typically represented as placeholder strings enclosed in angle brackets (e.g., "<PASSWORD>", "<contents of generated private key>") rather than actual sensitive values. These placeholders are safe to print to console as they are instructional text, not real secrets.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test (latest_official, trino) / test
  • GitHub Check: test (latest_official, clickhouse) / test
  • GitHub Check: test (latest_official, athena) / test
  • GitHub Check: test (latest_official, redshift) / test
  • GitHub Check: test (latest_official, postgres) / test
  • GitHub Check: test (1.8.0, postgres) / test
  • GitHub Check: test (latest_pre, postgres) / test
  • GitHub Check: test (latest_official, databricks_catalog) / test
  • GitHub Check: test (latest_official, bigquery) / test
  • GitHub Check: test (latest_official, snowflake) / test
🔇 Additional comments (4)
integration_tests/tests/test_column_pii_sampling.py (4)

1-24: Well-structured test setup with clear constants and query template.

The imports, constants, and SQL query template are appropriately defined for comprehensive integration testing of the PII sampling functionality.


338-338: Interesting test case for meta tags.

This test validates that PII tags can be specified in the meta section rather than config.tags, which adds good coverage for different tag placement options.


56-56: Comprehensive test coverage with consistent assertions.

The tests effectively validate that when PII protection is enabled, no sample rows are collected (asserting len(samples) == 0). This comprehensive coverage includes various test types (unique, not_null, accepted_values), edge cases (all columns PII), and different PII tag configurations.

Also applies to: 166-166, 197-197, 228-228, 259-259, 293-293, 324-324, 355-355, 392-392, 422-422


87-92: Proper validation of disabled PII protection scenarios.

These tests correctly verify that when PII protection is disabled, sample data is collected normally with the expected structure and values. The assertions validate both the presence of expected fields and their correct values.

Also applies to: 127-131

Comment thread integration_tests/tests/test_column_pii_sampling.py Outdated
{% endmacro %}

{% macro query_test_result_rows(sample_limit=none, ignore_passed_tests=false) %}
{% macro query_test_result_rows(sample_limit=none, ignore_passed_tests=false, flattened_test=none) %}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the flattened_test param needed?

@@ -0,0 +1,47 @@
{% macro get_pii_columns_from_parent_model(flattened_test) %}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename the file to match the macro?


{# column -> meta -> tags #}
{% set meta_dict = column_node.get('meta', {}) %}
{% set meta_tags = meta_dict.get('tags', []) %}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also consider the case where the model itself has a PII tag.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implemented here:

{% elif elementary.is_pii_table(flattened_test) %}


{% for column_node in column_nodes.values() %}
{# column -> tags #}
{% set column_tags = column_node.get('tags', []) %}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating a separate macro just to get the tags of a column

Comment thread macros/edr/materializations/test/test.sql
- Rename is_pii_column.sql to get_pii_columns_from_parent_model.sql to match macro name
- Extract get_column_tags helper macro for better code organization
- Add model-level PII tag support - if model has PII tags, all columns are considered PII
- Add clarifying comment about intentional over-censoring with * queries in CTEs
- Keep flattened_test parameter in query_test_result_rows for future functionality

Fixes all outstanding code review feedback from haritamar on PR #833

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql (1)

59-61: Leverage the locally-defined macro directly

Referencing the same-package macro via the package namespace works, but makes refactors harder (e.g. renaming the package).
Using the bare macro name keeps the call shorter and avoids an unnecessary dependency edge.

-    {% set all_column_tags_lower = elementary.get_column_tags(column_node) %}
+    {% set all_column_tags_lower = get_column_tags(column_node) %}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 057a2cc and 89e49de.

📒 Files selected for processing (2)
  • macros/edr/materializations/test/test.sql (3 hunks)
  • macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • macros/edr/materializations/test/test.sql
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Comment thread macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql Outdated
devin-ai-integration Bot and others added 3 commits August 4, 2025 04:51
- Fix get_column_tags to properly handle string tags vs lists
- Fix pii_tags processing to handle scalar strings correctly
- Apply consistent string/list normalization to model tags
- Addresses CodeRabbit comments r2250021209 and r2250021214

The core issue was that Jinja2's 'is iterable' test returns True for strings,
causing scalar string tags like 'pii' to be treated as character arrays
['p','i','i'] instead of single-element lists ['pii']. Fixed by using
explicit string checking with 'src is string' to properly distinguish
between strings and actual lists/arrays.

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
- Remove model-level PII tag checking from get_pii_columns_from_parent_model
- This logic is handled elsewhere as noted by Yosef
- Simplifies macro to focus only on column-level PII detection

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
integration_tests/tests/test_column_pii_sampling.py (1)

42-46: Config key disable_samples_on_pii_tags is correct – earlier warning can be ignored

Past feedback suggested renaming this variable to disable_samples_on_pii_columns, but the maintainer explicitly confirmed the project intentionally uses disable_samples_on_pii_tags (see retrieved learning from arbiv).
No change is required here.

Also applies to: 70-77, 148-152, 187-191, 218-223, 250-254, 281-285, 318-319, 346-350, 378-381

🧹 Nitpick comments (1)
integration_tests/tests/test_column_pii_sampling.py (1)

1-91: Heavy duplication – consider parametrising the suite

Each test repeats the same setup pattern (data creation, dbt_project.test call, sample query, assertions). Pytest’s @pytest.mark.parametrize or a helper fixture would shrink this file dramatically, speed execution, and simplify future maintenance.

Also applies to: 95-167, 170-233, 236-295, 298-359, 362-390

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cc4050 and 812d9c2.

📒 Files selected for processing (1)
  • integration_tests/tests/test_column_pii_sampling.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: arbiv
PR: elementary-data/dbt-data-reliability#833
File: macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql:20-22
Timestamp: 2025-08-04T04:46:05.802Z
Learning: In the dbt-data-reliability project, the config key for disabling samples on PII tags is intentionally named "disable_samples_on_pii_tags" (not "disable_samples_on_pii_columns" as might be expected), as confirmed by the maintainer arbiv.
Learnt from: haritamar
PR: elementary-data/dbt-data-reliability#827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.
📚 Learning: in the dbt-data-reliability project, the config key for disabling samples on pii tags is intentional...
Learnt from: arbiv
PR: elementary-data/dbt-data-reliability#833
File: macros/edr/system/system_utils/get_pii_columns_from_parent_model.sql:20-22
Timestamp: 2025-08-04T04:46:05.802Z
Learning: In the dbt-data-reliability project, the config key for disabling samples on PII tags is intentionally named "disable_samples_on_pii_tags" (not "disable_samples_on_pii_columns" as might be expected), as confirmed by the maintainer arbiv.

Applied to files:

  • integration_tests/tests/test_column_pii_sampling.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: test (1.8.0, postgres) / test
  • GitHub Check: test (latest_official, clickhouse) / test
  • GitHub Check: test (latest_official, redshift) / test
  • GitHub Check: test (latest_official, bigquery) / test
  • GitHub Check: test (latest_pre, postgres) / test
  • GitHub Check: test (latest_official, snowflake) / test
  • GitHub Check: test (latest_official, trino) / test
  • GitHub Check: test (latest_official, athena) / test
  • GitHub Check: test (latest_official, databricks_catalog) / test
  • GitHub Check: test (latest_official, postgres) / test
🔇 Additional comments (1)
integration_tests/tests/test_column_pii_sampling.py (1)

110-114: Verify macro variable name for custom PII tag list

All tests pass a pii_tags variable, but the PR description and original design reference pii_column_tags.
Please double-check the macros’ var() lookup to ensure the spelling aligns; otherwise these tests may silently bypass the intended override.

Also applies to: 249-254, 281-285, 349-350, 379-381

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants