Skip to content

feat(ingestion): cross-connector filter visibility report#28355

Open
harshach wants to merge 6 commits into
mainfrom
harshach/connector-includes-excludes-report
Open

feat(ingestion): cross-connector filter visibility report#28355
harshach wants to merge 6 commits into
mainfrom
harshach/connector-includes-excludes-report

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented May 21, 2026

Describe your changes:

Generalizes #28336's per-database discovery + filter-reason logging into a reusable helper that every connector family — Database, Dashboard, Pipeline, Messaging, Storage, MLModel — now uses. At end of each source step it emits a single grep-friendly FILTER VISIBILITY REPORT block listing the discovered count, every filtered name + reason, and the kept count per entity type, so users can immediately tell whether a missing entity was removed by includes/excludes or never visible to the ingestion role at all. Storage profile is the diff only — discovered counts (int per type) plus the existing Status.filtered list now carrying richer reasons; no discovered-name or kept-name lists, both are derivable.

Type of change:

  • Improvement

High-level design:

  • New ingestion/src/metadata/utils/filter_visibility.py — three helpers (log_discovered, log_filtered, log_step_summary) own the message format centrally so every connector family produces identical output.
  • Status gained discovered_counts: Dict[str, int] (in-memory only, no spec/JSON schema change) so the report can compute kept = discovered − filtered.
  • Base-class chokepoints instrumented in database_service.py (_get_filtered_database_names, _get_filtered_schema_names), common_db_source.py (get_tables_name_and_type for tables + views), dashboard_service.py, pipeline_service.py, messaging_service.py, storage_service.py, mlmodel_service.py — each base class's close() emits the report.
  • Per-connector get_database_names discovery lifted into Snowflake (replacing fix(snowflake): log discovered databases and filter-out reasons #28336's inline logging with helper calls — same output, central format), Postgres, BigQuery, Redshift, MSSQL. MySQL inherits the no-op single-DB default and needs no override.
  • mlmodel concrete connectors sagemaker/metadata.py and mlflow/metadata.py switched their existing filter_by_mlmodel calls to use the helper so they participate in the report.
  • Subtle correctness note_get_filtered_database_names is reached from both main ingestion and the mark_databases_as_deleted maintenance pass. Discovery logging is therefore placed in each connector's get_database_names, not in *_raw, to avoid double-counting; a docstring on the base helper flags this for future contributors.

Example output (Snowflake, Cvent-like config):

======================================================================
 FILTER VISIBILITY REPORT: snowflake
======================================================================

Database (databaseFilterPattern):
  Visible to ingestion user: 10
  Filtered out (4):
    BACKUP_DB    → did not pass databaseFilterPattern, matched against 'svc.BACKUP_DB', useFqnForFiltering=True
    BACKUP_DB_2  → did not pass databaseFilterPattern, matched against 'svc.BACKUP_DB_2', useFqnForFiltering=True
    OLD_TEMP     → did not pass databaseFilterPattern, matched against 'svc.OLD_TEMP', useFqnForFiltering=True
    TEMP_DB      → did not pass databaseFilterPattern, matched against 'svc.TEMP_DB', useFqnForFiltering=True
  Will be published to OpenMetadata: 6

Schema (schemaFilterPattern):
  Visible to ingestion user: 4
  Filtered out (1):
    tmp  → did not pass schemaFilterPattern
  Will be published to OpenMetadata: 3
======================================================================

Tests:

Use cases covered

  • Customer hits a Snowflake/Postgres/BigQuery/Redshift/MSSQL account, configures databaseFilterPattern.excludes, and can immediately grep FILTER VISIBILITY REPORT to confirm which databases got removed and why
  • Same flow for dashboard (dashboardFilterPattern, projectFilterPattern), pipeline (pipelineFilterPattern), messaging (topicFilterPattern), storage (containerFilterPattern), mlmodel (mlModelFilterPattern)
  • Legacy status.filter(name, "Database Filtered Out") callers predating this helper still get bucketed into the right entity-type section of the report (back-compat)

Unit tests

  • Added ingestion/tests/unit/utils/test_filter_visibility.py — 12 tests covering count accumulation, generator inputs, optional kwargs, rich reason storage, consolidated report format, empty-state no-op, unrelated-reason skip, and legacy-reason-string back-compat
  • Verified: existing connector unit tests still pass (189/190 across test_snowflake, test_postgres, test_bigquery, test_redshift, test_mssql, test_mysql, test_common_db_source, test_tableau, test_looker, test_airflow, test_s3_storage, test_sagemaker — 1 skipped, 0 failed)

Backend integration tests

  • Not applicable (no backend API changes)

Ingestion integration tests

  • Not applicable (instrumentation is purely additive — same Status.filter is still called with the same "<EntityType> Filtered Out" prefix; persisted StepSummary.filtered count is unchanged)

Playwright (UI) tests

  • Not applicable (no UI changes)

Manual testing performed

  1. source env/bin/activate && cd ingestion && pip install -e . --no-deps
  2. python -m pytest ingestion/tests/unit/utils/test_filter_visibility.py -v → 12/12 pass
  3. python -m pytest <12 connector test files> → 189/190 pass (1 skipped, 0 failed)
  4. cd ingestion && make py_format_check → clean
  5. End-to-end demo script — verified the consolidated report block renders correctly with discovered counts, filtered names + reasons inline, and kept counts per entity type

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: N/A — no spec changes.
  • For UI changes: N/A — no UI changes.
  • I have added tests (unit) and listed them above.

🤖 Generated with Claude Code


Summary by Gitar

  • Refined filter visibility logic:
    • Improved get_filtered_count in status.py to correctly aggregate helper-driven and legacy filter() counts without double-counting.
  • Improved dashboard reporting:
    • Coerced project_name to string in dashboard_service.py for consistent formatting in visibility reports.
  • Enhanced test suite:
    • Added unit tests in test_filter_visibility.py to verify accuracy under mixed helper and legacy caller scenarios.

This will update automatically on new commits.

Generalize Snowflake's per-database discovery + filter-reason logging
(#28336) to a reusable helper used by every connector family —
Database, Dashboard, Pipeline, Messaging, Storage, MLModel. At end of
each source step, emit a single grep-friendly "FILTER VISIBILITY
REPORT" block listing the discovered count, every filtered name +
reason, and the kept count per entity type, so users can tell whether
a missing entity was removed by includes/excludes or never visible to
the ingestion role.

Storage profile is the diff only: discovered counts (int per type) +
the existing Status.filtered list (now carrying richer reasons). No
discovered-name or kept-name lists; both are derivable. No spec/JSON
schema change, no UI work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshach harshach requested a review from a team as a code owner May 21, 2026 20:40
Copilot AI review requested due to automatic review settings May 21, 2026 20:40
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels May 21, 2026
Comment thread ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py Outdated
Comment thread ingestion/src/metadata/utils/filter_visibility.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a reusable ingestion-side logging utility to standardize “discovered → filtered → kept” visibility across connector families, culminating in a single grep-friendly FILTER VISIBILITY REPORT emitted at step close to help users distinguish source-permission invisibility from filter-pattern exclusion.

Changes:

  • Introduces metadata.utils.filter_visibility helpers (log_discovered, log_filtered, log_step_summary) and wires them into multiple connector base classes and concrete sources.
  • Extends Status with an in-memory discovered_counts accumulator to compute “kept” counts without storing full discovered/kept name lists.
  • Adds unit tests for helper behavior and report formatting.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
ingestion/tests/unit/utils/test_filter_visibility.py New unit tests validating helper accumulation and report formatting.
ingestion/src/metadata/utils/filter_visibility.py New centralized helper module for discovery/filter logging and consolidated report emission.
ingestion/src/metadata/ingestion/source/storage/storage_service.py Uses helper for filtered container entries and emits step summary on close.
ingestion/src/metadata/ingestion/source/pipeline/pipeline_service.py Logs discovered pipelines, logs filtered pipelines with rich reasons, emits step summary on close.
ingestion/src/metadata/ingestion/source/mlmodel/sagemaker/metadata.py Logs discovered/filtered ML models using helper.
ingestion/src/metadata/ingestion/source/mlmodel/mlmodel_service.py Emits step summary on close for MLModel sources.
ingestion/src/metadata/ingestion/source/mlmodel/mlflow/metadata.py Logs discovered/filtered ML models using helper.
ingestion/src/metadata/ingestion/source/messaging/messaging_service.py Logs discovered/filtered topics and emits step summary on close.
ingestion/src/metadata/ingestion/source/database/snowflake/metadata.py Moves Snowflake database discovery/filter logging to shared helper.
ingestion/src/metadata/ingestion/source/database/redshift/metadata.py Adds discovery + rich filter logging for databases via helper.
ingestion/src/metadata/ingestion/source/database/postgres/metadata.py Adds discovery + rich filter logging for databases via helper.
ingestion/src/metadata/ingestion/source/database/mssql/metadata.py Adds discovery + rich filter logging for databases via helper.
ingestion/src/metadata/ingestion/source/database/database_service.py Instruments DB base-class filtering paths with helper-driven rich filter logging (and schema discovery logging).
ingestion/src/metadata/ingestion/source/database/common_db_source.py Logs discovered tables/views, logs filtered tables/views with rich reasons, emits step summary on close.
ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py Adds discovery + rich filter logging for databases/schemas via helper.
ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py Logs discovered dashboards, logs filtered dashboards/projects, emits step summary on close.
ingestion/src/metadata/ingestion/api/status.py Adds discovered_counts + record_discovered() to support report computations.

Comment thread ingestion/src/metadata/utils/filter_visibility.py Outdated
Comment thread ingestion/src/metadata/utils/filter_visibility.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py Outdated
…efenses

Observability code in the ingestion hot path must not fail connectors.
Adds three layers of defense:

  1. Helper-internal try/except in log_discovered, log_filtered, and
     log_step_summary. Any failure is logged once at WARN and swallowed.
  2. Call-site try/except around log_step_summary in every base class
     close() (database, dashboard, pipeline, messaging, storage, mlmodel)
     so a summary failure can't mask the real cleanup work.
  3. Per-entity-type cap on Status.filtered (MAX_FILTERED_ENTRIES_PER_TYPE
     = 50_000). Past the cap, log_filtered only bumps a new
     Status.filtered_counts dict so the true count stays accurate without
     unbounded memory growth. The report annotates the truncation.

Test coverage doubled (12 -> 23): adds bounded-growth tests for the cap,
resilience tests for broken loggers / bad inputs / corrupted Status, and
an integration-style multi-database lifecycle test that verifies counts
are correct after the full streaming-log + summary cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- log_discovered: skip list() materialization when DEBUG logging is off.
  Use len() directly for Sized inputs (zero alloc) and a streaming
  sum() for generators. Cuts the per-discovery cost from O(n) memory
  to O(1) on large catalogs when --debug isn't active.

- _entity_type_from_reason: case-insensitive marker match so legacy
  reasons like "Database Filtered out" (lowercase 'out' from older
  BigQuery code paths) get bucketed into the right report section
  instead of being silently undercounted.

- dashboard_service.get_dashboard: null-safe via `or []` (matches the
  Optional[List[Any]] contract Pipeline / Messaging sources already
  use); materialize names once into a list so log_discovered can take
  the zero-allocation Sized path instead of re-listing a generator
  over the heavy dashboard objects.

- common_db_source.get_tables_name_and_type / dashboard_service: skip
  the O(n) shallow copy when query_*_names_and_types() / get_dashboards_list()
  already returned a list. Only materialize when a subclass returned
  a generator.

- Report: rename "Will be published to OpenMetadata" → "Passed filter
  patterns" and prepend a note explaining the count is filter-decision
  only (does not subtract source-side extraction errors or secondary
  filters like projectFilterPattern). Avoids confusion when reconciling
  report counts with the final ingested set.

Tests: 23 → 28 (added DEBUG-gated materialization, Sized len() path,
case-insensitive marker, and clarifying-note assertions).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 22:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Comment thread ingestion/src/metadata/utils/filter_visibility.py
Comment thread ingestion/src/metadata/utils/filter_visibility.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/database_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/messaging/messaging_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/pipeline/pipeline_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/storage/storage_service.py
- Status.get_filtered_count(): true count of filter rejections that
  includes overflow past the per-type cap. Summary.from_step (step.py)
  now uses it so persisted StepSummary.filtered stays accurate even
  when names were dropped to bound memory. Falls back to len(filtered)
  for legacy callers that never populated filtered_counts.

- log_step_summary docstring: aligned with the actual "Passed filter
  patterns" label and the kept-semantic note printed inside the report.

- _get_filtered_database_names: restored streaming iteration. The
  earlier list() was needed for log_discovered, but discovery logging
  was moved out of this maintenance-pass helper, so materialization
  is now pure overhead on large catalogs during mark-deleted.

- messaging_service.get_topic / pipeline_service.get_pipeline: same
  isinstance optimization as dashboard_service / common_db_source —
  skip the O(n) shallow copy when the source already returned a list.
  Pre-materialize names once into a list so log_discovered takes the
  zero-allocation Sized path.

- storage_service _filter_entries: preserve bucket context when logging
  a containerFilterPattern rejection. Stored name is now `bucket/path`
  and the raw `path` is exposed via matched_against — operators can
  identify which bucket an entry came from (dataPath alone is not
  unique across buckets).

Tests: 28 -> 31 (added get_filtered_count cap-overflow + legacy +
empty cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread ingestion/src/metadata/ingestion/api/status.py Outdated
- dashboard_service.get_dashboard: coerce project_name to a string
  before passing to log_filtered. project_name is reassigned to a
  list (project_names) when get_project_names returns a non-empty
  list — passing the list verbatim produced a stringified list like
  "['a', 'b']" as the stored name, which made the reason unparseable
  by _entity_type_from_reason and the report ugly. Now joined with
  ", " when it's a list, kept as-is when it's already a string.

- Status.get_filtered_count: precise per-entity-type formula so a
  mixed helper + legacy + cap-overflow scenario stays correct. The
  prior max() formula undercounted when one type hit the cap (helper)
  AND another type had legacy direct status.filter() entries —
  exactly the worst case the reviewer called out. Now: helper-tracked
  true count + legacy entries whose entity-type prefix isn't in
  filtered_counts.

Tests: 31 -> 33 (added mixed helper+legacy case and the worst-case
mixed helper-cap-overflow + legacy case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 21, 2026 23:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

Comment thread ingestion/src/metadata/ingestion/source/database/database_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/pipeline/pipeline_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/messaging/messaging_service.py Outdated
Comment thread ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

🟡 Playwright Results — all passed (10 flaky)

✅ 4150 passed · ❌ 0 failed · 🟡 10 flaky · ⏭️ 90 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 299 0 0 4
🟡 Shard 2 781 0 1 11
🟡 Shard 3 776 0 3 8
🟡 Shard 4 787 0 1 12
🟡 Shard 5 771 0 2 47
🟡 Shard 6 736 0 3 8
🟡 10 flaky test(s) (passed on retry)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/NestedColumnsExpandCollapse.spec.ts › should not duplicate rows when expanding and collapsing nested columns with same names (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/TestSuitePipelineRedeploy.spec.ts › Re-deploy all test-suite ingestion pipelines (shard 3, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for File (shard 4, 1 retry)
  • Pages/Entity.spec.ts › User as Owner with unsorted list (shard 5, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should verify deleted user not visible in owner selection for pipeline (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Login.spec.ts › Refresh should work (shard 6, 1 retry)
  • Pages/TasksUIFlow.spec.ts › Create tag task for table column via UI (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

- mark_databases_as_deleted: pass add_to_status=False to
  _get_filtered_database_names. Without this, every filtered database
  was being recorded twice (once during main ingestion via the
  per-connector get_database_names, once during the maintenance pass),
  inflating Status.filtered + filtered_counts and skewing the report.

- _get_filtered_schema_names: only materialize the raw schema names
  when add_to_status=True (where log_discovered needs the count up
  front). The mark-deleted maintenance paths iterate once with
  add_to_status=False — streaming saves O(n) memory on large schemas.
  Same fix applied to BigQuery's override.

- dashboard/pipeline/messaging get_X(): compute names once into
  (entity, name) tuples instead of calling the connector-specific
  get_*_name() twice per entity (once to build the discovery list,
  once inside the filter loop). Real savings when name extraction is
  non-trivial (e.g., Tableau workbook lookups).

- DB connector log_filtered calls: use the raw entity name (database_name,
  schema_name, table_name, view_name, new_database, project_id) as
  `name` and the FQN as `matched_against`. The previous convention of
  `name=<fqn>` produced confusing report output when useFqnForFiltering
  was False ("filtered name = FQN, matched against = raw name") and
  was inconsistent with non-SQL connectors (Dashboard/Pipeline/Topic/
  MLModel) which already stored raw names. Report column is now
  readable across all connector families.

Tests: existing 33 still pass (no test changes needed — the storage-key
change to raw names is verified by the existing dashboard/pipeline/topic
assertions which always asserted raw names).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 22, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Generalizes filter visibility reporting across all connector families by introducing a centralized helper, improving log consistency and debugging capabilities. The implementation addresses memory concerns, count inaccuracies, and formatting issues identified during the review process.

✅ 4 resolved
Performance: Materializing full dashboard list may cause memory pressure

📄 ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py:575-581
The change at line 575 converts self.get_dashboards_list() from a lazy iterator/generator into a fully materialized list(...). Many concrete dashboard connectors (e.g., Tableau) implement this as a generator with pagination (yield from). For large environments with tens of thousands of dashboards, this unnecessarily holds the entire set of dashboard objects in memory simultaneously — both for the list itself and for the generator expression passed to log_discovered on line 580 which iterates names.

Additionally, the generator expression (self.get_dashboard_name(d) for d in dashboards) on line 580 is consumed by log_discovered which calls list(names) internally, creating a second full list of all dashboard names.

Suggested approach: compute only the count without storing names, or iterate once and collect just the names (strings are much cheaper than full dashboard objects).

Edge Case: 'kept' count may be misleading when dashboards fail detail extraction

📄 ingestion/src/metadata/utils/filter_visibility.py:147-149 📄 ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py:607-621
In _format_entity_section (filter_visibility.py:148), the 'kept' count is discovered - filtered_count. However in dashboard_service.py, dashboards that pass the filter pattern but then fail in get_dashboard_details() (line 620-623) or get rejected by projectFilterPattern (logged as entity_type 'Project', not 'Dashboard') are not counted as 'Dashboard Filtered Out'. This means the report may show 'Will be published to OpenMetadata: N' where N is higher than the actual number ingested.

This is inherent to the design (the report shows filter-pattern rejections specifically, not all skip reasons), but it could confuse users trying to reconcile counts. Consider adding a brief note in the report header or documentation clarifying that 'kept' means 'passed filter patterns' not 'successfully ingested'.

Edge Case: log_filtered called with list instead of string for project_name

📄 ingestion/src/metadata/ingestion/source/dashboard/dashboard_service.py:604-617
At dashboard_service.py line 612-617, log_filtered is called with project_name as the name argument. However, on line 604-606, project_name is reassigned to project_names (a list) when the list is non-empty. Passing a list to log_filtered where a str is expected will result in the filter reason storing a stringified list (e.g., "['proj1', 'proj2']") in status.filtered, which produces ugly output in the report and may not match expected parsing in _entity_type_from_reason.

Edge Case: get_filtered_count undercounts in mixed helper+legacy usage

📄 ingestion/src/metadata/ingestion/api/status.py:109-116
When both the log_filtered helper (which increments filtered_counts) and legacy direct status.filter() calls (which only appends to filtered) are used on the same Status instance AND the helper path hits the per-type cap, max(len(self.filtered), sum(self.filtered_counts.values())) will undercount.

Example: helper filters 60,000 tables (cap stores 50k names), legacy filters 5 schemas directly → len(filtered)=50005, sum(filtered_counts)=60000, max()=60000 but true total is 60,005.

In practice this requires >50k filtered items of a single type plus legacy callers on the same status, so impact is minimal. A precise formula would count legacy entries (those without a matching filtered_counts key) separately.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants