Skip to content

Implement column-level PII protection for sample collection#832

Closed
devin-ai-integration[bot] wants to merge 1 commit into
masterfrom
devin/ELE-4850-1753859795
Closed

Implement column-level PII protection for sample collection#832
devin-ai-integration[bot] wants to merge 1 commit into
masterfrom
devin/ELE-4850-1753859795

Conversation

@devin-ai-integration
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot commented Jul 30, 2025

Implement column-level PII protection for sample collection

Summary

This PR extends the existing PII protection feature to work at the column level instead of just the table level. Previously, if a table was tagged as PII, no samples were collected from failing rows. Now users can tag specific columns as PII, and only those columns will be excluded from sample data while preserving non-sensitive columns for debugging.

Key Changes:

  • Added two new configuration variables: disable_samples_on_pii_columns (boolean, default: false) and pii_column_tags (list, default: ['pii', 'personal', 'sensitive'])
  • Created get_pii_columns_from_parent_model macro to detect PII columns by checking column tags against configured PII tags
  • Modified query_test_result_rows to dynamically exclude PII columns from SELECT statements instead of using SELECT *
  • Added comprehensive integration tests for the new functionality
  • Maintains full backward compatibility - feature is disabled by default

Review & Testing Checklist for Human

  • End-to-end PII detection testing - Create a dbt model with columns tagged as PII using config: {tags: ['pii']} and verify that test samples exclude those columns while including non-PII columns
  • Backward compatibility verification - Ensure existing sample collection behavior is unchanged when disable_samples_on_pii_columns: false (the default)
  • Edge case testing - Test scenarios where all columns are PII-tagged, no columns are PII-tagged, and mixed tag sources (config vs global vs meta tags)
  • SQL generation robustness - Verify the dynamic SELECT clause construction works correctly across different database adapters and doesn't break with special column names
  • Configuration validation - Test that pii_column_tags works with both string and list inputs, and that boolean config parsing works correctly

Recommended Test Plan:

  1. Set up a test model with mixed PII/non-PII columns
  2. Enable the feature with disable_samples_on_pii_columns: true
  3. Run a failing test and verify sample data excludes PII columns
  4. Test with feature disabled to ensure backward compatibility
  5. Run the new integration tests: cd integration_tests/tests && pytest test_column_pii_sampling.py -vvv

Diagram

%%{ init : { "theme" : "default" }}%%
graph TD
    A["get_config_var.sql<br/>(config variables)"]:::minor-edit --> B["query_test_result_rows<br/>(sample collection)"]:::major-edit
    C["is_pii_column.sql<br/>(PII detection)"]:::major-edit --> B
    D["handle_dbt_test<br/>(test materialization)"]:::minor-edit --> B
    E["test_column_pii_sampling.py<br/>(integration tests)"]:::major-edit
    F["dbt model columns<br/>(with PII tags)"]:::context --> C
    
    B --> G["test_result_rows table<br/>(filtered samples)"]:::context
    
    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit  
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF
Loading

Notes

  • The implementation follows existing Elementary patterns for configuration management, macro structure, and testing
  • When all columns are PII-tagged, the query returns a placeholder column _no_non_pii_columns to avoid SQL errors
  • Column tags are checked from all sources (config, global, meta) following dbt conventions
  • The feature is disabled by default to ensure zero impact on existing installations

Session Details:

Summary by CodeRabbit

  • New Features
    • Added support for filtering out columns tagged as Personally Identifiable Information (PII) from test result samples, based on configurable tags and settings.
    • Introduced new configuration options to control PII column sampling behavior.
  • Bug Fixes
    • Ensured that when all columns are tagged as PII, a special indicator is returned instead of exposing sensitive data.
  • Tests
    • Added integration tests to verify correct handling of PII columns in test result samples under various configurations.

- Add disable_samples_on_pii_columns and pii_column_tags config variables
- Create get_pii_columns_from_parent_model helper macro for column PII detection
- Modify query_test_result_rows to dynamically exclude PII columns from SELECT
- Add comprehensive integration tests for column-level PII protection
- Maintain backward compatibility with existing table-level PII detection
- Follow existing patterns for configuration, tag handling, and testing

Addresses Linear issue ELE-4850

Co-Authored-By: Yosef Arbiv <yosef.arbiv@gmail.com>
@devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@linear
Copy link
Copy Markdown

linear Bot commented Jul 30, 2025

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jul 30, 2025

Walkthrough

The changes introduce column-level PII protection by adding configuration variables, a macro to detect PII columns based on tags, and logic to exclude such columns from test result samples. Integration tests are added to verify correct behavior when PII sampling is enabled or disabled and when all columns are PII-tagged.

Changes

Cohort / File(s) Change Summary
Integration Tests for Column-Level PII Sampling
integration_tests/tests/test_column_pii_sampling.py
Adds three integration test functions to validate column-level PII sampling: verifies exclusion of PII-tagged columns when enabled, inclusion when disabled, and behavior when all columns are PII-tagged. Uses pytest and dbt project fixtures.
Sample Query and Macro Logic
macros/edr/materializations/test/test.sql
Updates the handle_dbt_test and query_test_result_rows macros to support column-level PII exclusion. Passes flattened_test to enable context-aware filtering. Modifies sample queries to exclude PII columns dynamically and adds logic for fallback when all columns are PII.
Configuration Variables
macros/edr/system/system_utils/get_config_var.sql
Adds disable_samples_on_pii_columns (default: false) and pii_column_tags (default: ['pii', 'personal', 'sensitive']) to the default configuration dictionary for controlling column-level PII sampling behavior.
PII Column Detection Macro
macros/edr/system/system_utils/is_pii_column.sql
Introduces get_pii_columns_from_parent_model macro to identify PII columns from a parent model by matching tags from multiple metadata sources against configured PII tags. Returns a list of PII columns or empty if disabled.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant dbt_test
    participant Macro_handle_dbt_test
    participant Macro_query_test_result_rows
    participant Macro_get_pii_columns_from_parent_model
    participant Database

    User->>dbt_test: Run test with config (PII sampling enabled/disabled)
    dbt_test->>Macro_handle_dbt_test: Execute test logic
    Macro_handle_dbt_test->>Macro_query_test_result_rows: Request sample rows (with flattened_test)
    Macro_query_test_result_rows->>Macro_get_pii_columns_from_parent_model: Identify PII columns
    Macro_get_pii_columns_from_parent_model-->>Macro_query_test_result_rows: Return list of PII columns
    Macro_query_test_result_rows->>Database: Query test results (excluding PII columns)
    Database-->>Macro_query_test_result_rows: Return sample rows
    Macro_query_test_result_rows-->>Macro_handle_dbt_test: Return filtered samples
    Macro_handle_dbt_test-->>dbt_test: Return test result with samples
    dbt_test-->>User: Show test outcome and samples
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Assessment against linked issues

Objective Addressed Explanation
Add new configuration variables for column-level PII protection (disable_samples_on_pii_columns, pii_column_tags) (ELE-4850)
Create helper macro to detect if a specific column has PII tags, checking metadata/tags (ELE-4850)
Modify sample collection logic to exclude PII columns, preserving non-PII columns (ELE-4850)
Add integration tests for column-level PII protection (ELE-4850)

Poem

In the warren where data flows free,
A clever new trick for PII!
Columns tagged "sensitive" now hide,
While safe ones show with pride.
With macros and tests, we hop along,
Protecting secrets where they belong.
🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/ELE-4850-1753859795

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions
Copy link
Copy Markdown
Contributor

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
macros/edr/system/system_utils/is_pii_column.sql (1)

25-38: Consider optimizing the nested loop for better performance.

The current implementation uses nested loops which could be optimized for better performance, especially with models having many columns and PII tags.

Consider using set operations for more efficient tag matching:

  {% for column_node in column_nodes.values() %}
    {% set config_dict = column_node.get('config', {}) %}
    {% set config_tags = config_dict.get('tags', []) %}
    {% set global_tags = column_node.get('tags', []) %}
    {% set meta_dict = column_node.get('meta', {}) %}
    {% set meta_tags = meta_dict.get('tags', []) %}
    {% set all_column_tags = config_tags + global_tags + meta_tags %}
    
-   {% for pii_tag in pii_column_tags %}
-     {% if pii_tag in all_column_tags %}
-       {% do pii_columns.append(column_node.get('name')) %}
-       {% break %}
-     {% endif %}
-   {% endfor %}
+   {% if all_column_tags | intersect(pii_column_tags) | length > 0 %}
+     {% do pii_columns.append(column_node.get('name')) %}
+   {% endif %}
  {% endfor %}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc8fba5 and 5d4dd69.

📒 Files selected for processing (4)
  • integration_tests/tests/test_column_pii_sampling.py (1 hunks)
  • macros/edr/materializations/test/test.sql (2 hunks)
  • macros/edr/system/system_utils/get_config_var.sql (1 hunks)
  • macros/edr/system/system_utils/is_pii_column.sql (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
macros/edr/materializations/test/test.sql (3)

Learnt from: haritamar
PR: #825
File: macros/utils/table_operations/create_table_as.sql:0-0
Timestamp: 2025-07-22T15:41:57.338Z
Learning: The dbt.create_table_as macro breaks in dbt-fusion when not run from a materialization context, requiring custom implementations for table creation functionality in dbt-fusion compatible packages.

Learnt from: haritamar
PR: #827
File: macros/utils/cross_db_utils/generate_elementary_profile_args.sql:30-31
Timestamp: 2025-07-27T11:33:35.918Z
Learning: In the Elementary dbt package, the Snowflake profile generation macro intentionally always includes private_key and private_key_passphrase parameters regardless of the authentication method. This is a strategic decision to deprecate password authentication and guide all new users toward keypair authentication, even if it means showing keypair parameters when they might not currently be needed.

Learnt from: haritamar
PR: #825
File: macros/utils/run_queries/agate_to_dict.sql:1-7
Timestamp: 2025-07-15T19:28:35.548Z
Learning: In dbt-data-reliability, the agate_to_dict macro in macros/utils/run_queries/agate_to_dict.sql intentionally returns different types based on fusion/non-fusion mode to handle breaking changes in dbt Fusion. In fusion mode, it returns the original agate_table, while in non-fusion mode it returns agate_table.to_dict(). This is because the content/structure of agate_table is different between the two modes.

🔇 Additional comments (7)
macros/edr/system/system_utils/get_config_var.sql (1)

58-59: LGTM! Configuration variables follow established patterns.

The new configuration variables are well-named, have sensible defaults, and maintain backward compatibility by being disabled by default. The default PII tags cover common naming conventions.

macros/edr/system/system_utils/is_pii_column.sql (1)

1-42: Well-structured macro with comprehensive PII detection logic.

The macro follows good defensive programming practices with early returns and null checks. The tag collection from multiple sources (config.tags, tags, meta.tags) ensures comprehensive coverage of dbt's column metadata structure.

macros/edr/materializations/test/test.sql (3)

54-55: LGTM! Parameter passing enables PII detection.

The modification correctly passes the flattened_test parameter to enable PII column detection in the query function.


107-137: Solid implementation of PII-aware column filtering.

The logic correctly:

  • Detects PII columns using the new macro
  • Dynamically determines available columns
  • Excludes PII columns from selection
  • Handles the edge case where all columns are PII with a meaningful placeholder

123-130: No changes needed for column detection

After searching the codebase:

  • The only use of LIMIT 0 is in macros/edr/materializations/test/test.sql (lines 123–130).
  • All column metadata comes through elementary.run_query, which wraps dbt.run_query and correctly returns column names across every supported adapter.
  • There are no adapter-specific overrides or alternate column-detection patterns in the repository.

You can safely rely on this generic approach.

integration_tests/tests/test_column_pii_sampling.py (2)

1-133: Excellent comprehensive test coverage for PII sampling functionality.

The integration tests provide thorough coverage of the new feature including:

  • Main functionality with PII filtering enabled
  • Backward compatibility with feature disabled
  • Edge case handling when all columns are PII

The test structure, data setup, and assertions are well-designed and should catch any regressions.


120-120: Ignore the all-PII sampling status concern
The not_null constraint is applied to SAFE_COLUMN, which contains only non-null integer values, so expecting "pass" is correct. The disable_samples_on_pii_columns flag only suppresses sample row materialization—it does not change the underlying null-check logic. No changes needed.

Likely an incorrect or invalid review comment.

TEST_SAMPLE_ROW_COUNT = 5


@pytest.mark.skip_targets(["clickhouse"])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify test compatibility with skipped database adapter.

The tests are marked to skip on ClickHouse. Ensure this is necessary and document why this adapter is excluded.

Let me check if there are adapter-specific issues or if this is a broader pattern:


🏁 Script executed:

#!/bin/bash
# Description: Check for ClickHouse-related skip patterns and adapter compatibility
echo "Searching for other ClickHouse skip patterns..."
rg -A 2 -B 2 "skip.*clickhouse" integration_tests/

echo -e "\nChecking for adapter-specific test issues..."
rg -A 5 -B 5 "clickhouse" integration_tests/ | head -20

echo -e "\nLooking for general adapter compatibility patterns..."
rg "skip_targets" integration_tests/ | head -10

Length of output: 38846


Document reason for skipping PII sampling on ClickHouse
Please add a comment above the skip decorator to explain why column-level PII sampling tests aren’t supported by the ClickHouse adapter, mirroring the pattern in other tests.

• integration_tests/tests/test_column_pii_sampling.py (around line 26)

Example:

# Column-level PII sampling is not supported on ClickHouse
@pytest.mark.skip_targets(["clickhouse"])
🤖 Prompt for AI Agents
In integration_tests/tests/test_column_pii_sampling.py at line 26, add a comment
above the @pytest.mark.skip_targets(["clickhouse"]) decorator explaining that
column-level PII sampling is not supported on ClickHouse. This comment should
clearly state the reason for skipping the test on ClickHouse, following the
existing pattern in other tests.

@arbiv
Copy link
Copy Markdown
Contributor

arbiv commented Jul 30, 2025

created #833 instead

@arbiv arbiv closed this Jul 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant