Implement column-level PII protection for sample collection by devin-ai-integration[bot] · Pull Request #832 · elementary-data/dbt-data-reliability

devin-ai-integration · 2025-07-30T07:26:02Z

Implement column-level PII protection for sample collection

Summary

This PR extends the existing PII protection feature to work at the column level instead of just the table level. Previously, if a table was tagged as PII, no samples were collected from failing rows. Now users can tag specific columns as PII, and only those columns will be excluded from sample data while preserving non-sensitive columns for debugging.

Key Changes:

Added two new configuration variables: disable_samples_on_pii_columns (boolean, default: false) and pii_column_tags (list, default: ['pii', 'personal', 'sensitive'])
Created get_pii_columns_from_parent_model macro to detect PII columns by checking column tags against configured PII tags
Modified query_test_result_rows to dynamically exclude PII columns from SELECT statements instead of using SELECT *
Added comprehensive integration tests for the new functionality
Maintains full backward compatibility - feature is disabled by default

Review & Testing Checklist for Human

End-to-end PII detection testing - Create a dbt model with columns tagged as PII using config: {tags: ['pii']} and verify that test samples exclude those columns while including non-PII columns
Backward compatibility verification - Ensure existing sample collection behavior is unchanged when disable_samples_on_pii_columns: false (the default)
Edge case testing - Test scenarios where all columns are PII-tagged, no columns are PII-tagged, and mixed tag sources (config vs global vs meta tags)
SQL generation robustness - Verify the dynamic SELECT clause construction works correctly across different database adapters and doesn't break with special column names
Configuration validation - Test that pii_column_tags works with both string and list inputs, and that boolean config parsing works correctly

Recommended Test Plan:

Set up a test model with mixed PII/non-PII columns
Enable the feature with disable_samples_on_pii_columns: true
Run a failing test and verify sample data excludes PII columns
Test with feature disabled to ensure backward compatibility
Run the new integration tests: cd integration_tests/tests && pytest test_column_pii_sampling.py -vvv

Diagram

%%{ init : { "theme" : "default" }}%%
graph TD
    A["get_config_var.sql<br/>(config variables)"]:::minor-edit --> B["query_test_result_rows<br/>(sample collection)"]:::major-edit
    C["is_pii_column.sql<br/>(PII detection)"]:::major-edit --> B
    D["handle_dbt_test<br/>(test materialization)"]:::minor-edit --> B
    E["test_column_pii_sampling.py<br/>(integration tests)"]:::major-edit
    F["dbt model columns<br/>(with PII tags)"]:::context --> C
    
    B --> G["test_result_rows table<br/>(filtered samples)"]:::context
    
    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit  
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF

Notes

The implementation follows existing Elementary patterns for configuration management, macro structure, and testing
When all columns are PII-tagged, the query returns a placeholder column _no_non_pii_columns to avoid SQL errors
Column tags are checked from all sources (config, global, meta) following dbt conventions
The feature is disabled by default to ensure zero impact on existing installations

Session Details:

Devin session: https://app.devin.ai/sessions/fe843845565644f9ab9c8045354e1184
Requested by: @arbiv
Addresses Linear issue: ELE-4850

Summary by CodeRabbit

New Features
- Added support for filtering out columns tagged as Personally Identifiable Information (PII) from test result samples, based on configurable tags and settings.
- Introduced new configuration options to control PII column sampling behavior.
Bug Fixes
- Ensured that when all columns are tagged as PII, a special indicator is returned instead of exposing sensitive data.
Tests
- Added integration tests to verify correct handling of PII columns in test result samples under various configurations.

- Add disable_samples_on_pii_columns and pii_column_tags config variables - Create get_pii_columns_from_parent_model helper macro for column PII detection - Modify query_test_result_rows to dynamically exclude PII columns from SELECT - Add comprehensive integration tests for column-level PII protection - Maintain backward compatibility with existing table-level PII detection - Follow existing patterns for configuration, tag handling, and testing Addresses Linear issue ELE-4850 Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>

devin-ai-integration · 2025-07-30T07:26:05Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

linear · 2025-07-30T07:26:06Z

ELE-4850 Extend PII protection to column-level granularity with column exclusion approach

coderabbitai · 2025-07-30T07:26:10Z

Walkthrough

The changes introduce column-level PII protection by adding configuration variables, a macro to detect PII columns based on tags, and logic to exclude such columns from test result samples. Integration tests are added to verify correct behavior when PII sampling is enabled or disabled and when all columns are PII-tagged.

Changes

Cohort / File(s)	Change Summary
Integration Tests for Column-Level PII Sampling `integration_tests/tests/test_column_pii_sampling.py`	Adds three integration test functions to validate column-level PII sampling: verifies exclusion of PII-tagged columns when enabled, inclusion when disabled, and behavior when all columns are PII-tagged. Uses pytest and dbt project fixtures.
Sample Query and Macro Logic `macros/edr/materializations/test/test.sql`	Updates the `handle_dbt_test` and `query_test_result_rows` macros to support column-level PII exclusion. Passes `flattened_test` to enable context-aware filtering. Modifies sample queries to exclude PII columns dynamically and adds logic for fallback when all columns are PII.
Configuration Variables `macros/edr/system/system_utils/get_config_var.sql`	Adds `disable_samples_on_pii_columns` (default: `false`) and `pii_column_tags` (default: `['pii', 'personal', 'sensitive']`) to the default configuration dictionary for controlling column-level PII sampling behavior.
PII Column Detection Macro `macros/edr/system/system_utils/is_pii_column.sql`	Introduces `get_pii_columns_from_parent_model` macro to identify PII columns from a parent model by matching tags from multiple metadata sources against configured PII tags. Returns a list of PII columns or empty if disabled.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant dbt_test
    participant Macro_handle_dbt_test
    participant Macro_query_test_result_rows
    participant Macro_get_pii_columns_from_parent_model
    participant Database

    User->>dbt_test: Run test with config (PII sampling enabled/disabled)
    dbt_test->>Macro_handle_dbt_test: Execute test logic
    Macro_handle_dbt_test->>Macro_query_test_result_rows: Request sample rows (with flattened_test)
    Macro_query_test_result_rows->>Macro_get_pii_columns_from_parent_model: Identify PII columns
    Macro_get_pii_columns_from_parent_model-->>Macro_query_test_result_rows: Return list of PII columns
    Macro_query_test_result_rows->>Database: Query test results (excluding PII columns)
    Database-->>Macro_query_test_result_rows: Return sample rows
    Macro_query_test_result_rows-->>Macro_handle_dbt_test: Return filtered samples
    Macro_handle_dbt_test-->>dbt_test: Return test result with samples
    dbt_test-->>User: Show test outcome and samples

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Assessment against linked issues

Objective	Addressed	Explanation
Add new configuration variables for column-level PII protection (`disable_samples_on_pii_columns`, `pii_column_tags`) (ELE-4850)	✅
Create helper macro to detect if a specific column has PII tags, checking metadata/tags (ELE-4850)	✅
Modify sample collection logic to exclude PII columns, preserving non-PII columns (ELE-4850)	✅
Add integration tests for column-level PII protection (ELE-4850)	✅

Poem

In the warren where data flows free,
A clever new trick for PII!
Columns tagged "sensitive" now hide,
While safe ones show with pride.
With macros and tests, we hop along,
Protecting secrets where they belong.
🐇✨

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch devin/ELE-4850-1753859795

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai generate unit tests to generate unit tests for this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

github-actions · 2025-07-30T07:26:11Z

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

macros/edr/system/system_utils/is_pii_column.sql (1)

25-38: Consider optimizing the nested loop for better performance.

The current implementation uses nested loops which could be optimized for better performance, especially with models having many columns and PII tags.

Consider using set operations for more efficient tag matching:

  {% for column_node in column_nodes.values() %}
    {% set config_dict = column_node.get('config', {}) %}
    {% set config_tags = config_dict.get('tags', []) %}
    {% set global_tags = column_node.get('tags', []) %}
    {% set meta_dict = column_node.get('meta', {}) %}
    {% set meta_tags = meta_dict.get('tags', []) %}
    {% set all_column_tags = config_tags + global_tags + meta_tags %}
    
-   {% for pii_tag in pii_column_tags %}
-     {% if pii_tag in all_column_tags %}
-       {% do pii_columns.append(column_node.get('name')) %}
-       {% break %}
-     {% endif %}
-   {% endfor %}
+   {% if all_column_tags | intersect(pii_column_tags) | length > 0 %}
+     {% do pii_columns.append(column_node.get('name')) %}
+   {% endif %}
  {% endfor %}

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc8fba5 and 5d4dd69.

📒 Files selected for processing (4)

integration_tests/tests/test_column_pii_sampling.py (1 hunks)
macros/edr/materializations/test/test.sql (2 hunks)
macros/edr/system/system_utils/get_config_var.sql (1 hunks)
macros/edr/system/system_utils/is_pii_column.sql (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

macros/edr/materializations/test/test.sql (3)

Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Learnt from: haritamar
PR: #825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the agate_to_dict macro in macros/utils/run_queries/agate_to_dict.sql intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original agate_table, while in non-fusion mode it returns agate_table.to_dict(). This is because the content/structure of agate_table is different between the two modes.

🔇 Additional comments (7)

macros/edr/system/system_utils/get_config_var.sql (1)

58-59: LGTM! Configuration variables follow established patterns.

The new configuration variables are well-named, have sensible defaults, and maintain backward compatibility by being disabled by default. The default PII tags cover common naming conventions.

macros/edr/system/system_utils/is_pii_column.sql (1)

1-42: Well-structured macro with comprehensive PII detection logic.

The macro follows good defensive programming practices with early returns and null checks. The tag collection from multiple sources (config.tags, tags, meta.tags) ensures comprehensive coverage of dbt's column metadata structure.

macros/edr/materializations/test/test.sql (3)

54-55: LGTM! Parameter passing enables PII detection.

The modification correctly passes the flattened_test parameter to enable PII column detection in the query function.

107-137: Solid implementation of PII-aware column filtering.

The logic correctly:

Detects PII columns using the new macro

Dynamically determines available columns

Excludes PII columns from selection

Handles the edge case where all columns are PII with a meaningful placeholder

123-130: No changes needed for column detection

After searching the codebase:

The only use of LIMIT 0 is in macros/edr/materializations/test/test.sql (lines 123–130).

All column metadata comes through elementary.run_query, which wraps dbt.run_query and correctly returns column names across every supported adapter.

There are no adapter-specific overrides or alternate column-detection patterns in the repository.

You can safely rely on this generic approach.

integration_tests/tests/test_column_pii_sampling.py (2)

1-133: Excellent comprehensive test coverage for PII sampling functionality.

The integration tests provide thorough coverage of the new feature including:

Main functionality with PII filtering enabled

Backward compatibility with feature disabled

Edge case handling when all columns are PII

The test structure, data setup, and assertions are well-designed and should catch any regressions.

120-120: Ignore the all-PII sampling status concern
The not_null constraint is applied to SAFE_COLUMN, which contains only non-null integer values, so expecting "pass" is correct. The disable_samples_on_pii_columns flag only suppresses sample row materialization—it does not change the underlying null-check logic. No changes needed.

Likely an incorrect or invalid review comment.

coderabbitai · 2025-07-30T07:33:22Z

+TEST_SAMPLE_ROW_COUNT = 5
+
+
+@pytest.mark.skip_targets(["clickhouse"])


💡 Verification agent

🧩 Analysis chain

Verify test compatibility with skipped database adapter.

The tests are marked to skip on ClickHouse. Ensure this is necessary and document why this adapter is excluded.

Let me check if there are adapter-specific issues or if this is a broader pattern:

🏁 Script executed:

#!/bin/bash # Description: Check for ClickHouse-related skip patterns and adapter compatibility echo "Searching for other ClickHouse skip patterns..." rg -A 2 -B 2 "skip.*clickhouse" integration_tests/ echo -e "\nChecking for adapter-specific test issues..." rg -A 5 -B 5 "clickhouse" integration_tests/ | head -20 echo -e "\nLooking for general adapter compatibility patterns..." rg "skip_targets" integration_tests/ | head -10

Length of output: 38846

Document reason for skipping PII sampling on ClickHouse
Please add a comment above the skip decorator to explain why column-level PII sampling tests aren’t supported by the ClickHouse adapter, mirroring the pattern in other tests.

• integration_tests/tests/test_column_pii_sampling.py (around line 26)

Example:

# Column-level PII sampling is not supported on ClickHouse @pytest.mark.skip_targets(["clickhouse"])

🤖 Prompt for AI Agents

In integration_tests/tests/test_column_pii_sampling.py at line 26, add a comment above the @pytest.mark.skip_targets(["clickhouse"]) decorator explaining that column-level PII sampling is not supported on ClickHouse. This comment should clearly state the reason for skipping the test on ClickHouse, following the existing pattern in other tests.

arbiv · 2025-07-30T08:15:49Z

created #833 instead

devin-ai-integration Bot had a problem deploying to elementary_test_env July 30, 2025 07:26 Failure

coderabbitai Bot reviewed Jul 30, 2025

View reviewed changes

devin-ai-integration Bot temporarily deployed to elementary_test_env July 30, 2025 08:13 Inactive

devin-ai-integration Bot had a problem deploying to elementary_test_env July 30, 2025 08:13 Failure

arbiv closed this Jul 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement column-level PII protection for sample collection#832

Implement column-level PII protection for sample collection#832
devin-ai-integration[bot] wants to merge 1 commit into
masterfrom
devin/ELE-4850-1753859795

devin-ai-integration Bot commented Jul 30, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

devin-ai-integration Bot commented Jul 30, 2025

Uh oh!

linear Bot commented Jul 30, 2025

Uh oh!

coderabbitai Bot commented Jul 30, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

github-actions Bot commented Jul 30, 2025

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 30, 2025

Uh oh!

arbiv commented Jul 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		TEST_SAMPLE_ROW_COUNT = 5


		@pytest.mark.skip_targets(["clickhouse"])

Conversation

devin-ai-integration Bot commented Jul 30, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Implement column-level PII protection for sample collection

Summary

Review & Testing Checklist for Human

Diagram

Notes

Summary by CodeRabbit

Uh oh!

devin-ai-integration Bot commented Jul 30, 2025

🤖 Devin AI Engineer

Uh oh!

linear Bot commented Jul 30, 2025

Uh oh!

coderabbitai Bot commented Jul 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Assessment against linked issues

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

github-actions Bot commented Jul 30, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 30, 2025

Choose a reason for hiding this comment

Uh oh!

arbiv commented Jul 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

devin-ai-integration Bot commented Jul 30, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 30, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)