feat(data-quality): Add impact ranking for data quality checks #26658 #27346
AshutoshIIT1234 wants to merge 3 commits into open-metadata:main from
Conversation
Implement Data Quality Checks Impact feature for issue open-metadata#26658

- Add scoring model that calculates impact based on:
  - Downstream usage (40% weight)
  - Consumer count (30% weight)
  - Recent incidents (30% weight)
- Add REST API endpoint for impact ranking
- Add frontend components to display impact scores
- Add schema for dataQualityCheckImpact

This helps prioritize which data quality checks are most critical to fix by ranking them based on their business impact.
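The weighted model described above can be sketched in a few lines. The class name, method signature, and the normalization caps below are illustrative assumptions, not the PR's actual implementation:

```java
// Sketch of the weighted impact score described above (weights from the PR
// description: downstream usage 40%, consumer count 30%, recent incidents 30%).
// The normalization caps (100 / 50 / 10) are illustrative assumptions.
public final class ImpactScoreSketch {

  static double compute(int downstreamUsage, int consumerCount, int recentIncidents) {
    // Normalize each raw count into [0, 1] against an assumed cap.
    double d = Math.min(downstreamUsage, 100) / 100.0;
    double c = Math.min(consumerCount, 50) / 50.0;
    double i = Math.min(recentIncidents, 10) / 10.0;
    // Weighted sum scaled to a 0-100 score.
    return 100.0 * (0.4 * d + 0.3 * c + 0.3 * i);
  }
}
```

With a model like this, a check with no downstream usage and no consumers but many recent incidents can still score at most 30% of the maximum, which is what makes the hardcoded-zero metrics flagged later in this review so impactful.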
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help!
}

public DataQualityReport getDataQualityReport(
💡 Quality: Indentation damage on pre-existing lines (accidental whitespace changes)
Several pre-existing lines had their leading indentation removed (e.g., getDataQualityReport at line 362, searchDataQualityLineage at line 2787, SubjectContext at line 545, getDataQualityTab at line 74). These appear to be accidental whitespace-only changes that break the consistent indentation style of the codebase. They should be reverted to avoid noisy diffs.
Pull request overview
Adds an “Impact Ranking” feature for Data Quality checks, spanning backend search/query logic, a new REST endpoint, and a new Data Quality UI tab to display ranked checks.
Changes:
- UI: Introduces a new Data Quality "Impact" tab and an ImpactRanking table/summary view.
- Service: Adds a GET /v1/dataQuality/testSuites/dataQualityCheckImpact endpoint and supporting repository/search methods.
- Spec: Adds a new JSON schema for dataQualityCheckImpact.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 22 comments.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-ui/src/main/resources/ui/src/rest/testAPI.ts | Adds a client helper and TS types for fetching impact ranking data |
| openmetadata-ui/src/main/resources/ui/src/pages/DataQuality/DataQualityPage.interface.ts | Adds IMPACT tab enum value |
| openmetadata-ui/src/main/resources/ui/src/pages/DataQuality/DataQualityClassBase.ts | Registers the new Impact tab and component |
| openmetadata-ui/src/main/resources/ui/src/components/DataQuality/ImpactRanking/ImpactRanking.component.tsx | New UI component to render impact summaries + ranked table |
| openmetadata-spec/src/main/resources/json/schema/tests/dataQualityCheckImpact.json | New schema intended to define the impact ranking payload |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Adds a helper to search test cases for impact ranking |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/dqtests/TestSuiteResource.java | Adds the REST endpoint to fetch impact ranking |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TestSuiteRepository.java | Implements ranking/scoring logic and builds the search query |
doc.put("downstreamUsage", 0);
doc.put("consumerCount", 0);
results.add(doc);
downstreamUsage and consumerCount are hardcoded to 0 (and recentIncidents isn’t populated at all), which makes the impact scoring endpoint always produce 0 scores. These metrics need to be computed/fetched before ranking, otherwise the feature is non-functional.
for (Map<String, Object> testCase : testCases) {
  int downstreamUsage =
      ((Number) testCase.getOrDefault("downstreamUsage", 0)).intValue();
  int consumerCount = ((Number) testCase.getOrDefault("consumerCount", 0)).intValue();
  int recentIncidents =
The scoring loop assumes downstreamUsage, consumerCount, and recentIncidents are present in the input docs, but the current retrieval code defaults them to 0. That makes impactScore always 0 and defeats the ranking.
"description": "Fully qualified name of the test suite.",
"$ref": "../type/basic.json#/definitions/fullyQualifiedEntityName"
},
"entityFullyQualifiedName": {
This schema uses entityFullyQualifiedName, while the UI/API currently refer to entityFQN. Align the property names (and required fields) with the actual API payload so generated models and UI types stay consistent.
Suggested change:
- "entityFullyQualifiedName": {
+ "entityFQN": {
dataIndex: 'testCaseStatus',
key: 'testCaseStatus',
render: (status: string) => (
  <Tag color={status === 'Failed' ? 'error' : 'success'}>{status}</Tag>
),
testCaseStatus can be Aborted or Queued as well as Failed/Success, but this render treats everything except Failed as success. Consider mapping each status to an appropriate tag color (e.g., warning/processing for Queued, default for Aborted) to avoid misleading UI.
}

public getDataQualityTab() {
This method declaration lost its indentation (public getDataQualityTab() is no longer aligned with the rest of the class). This is likely to fail Prettier/lint formatting checks; please reformat to match surrounding style.
Suggested change:
- public getDataQualityTab() {
+   public getDataQualityTab() {
.withIndex(Entity.TEST_CASE)
.withQuery(query)
.withFrom(from)
.withSize(size)
.withSortFieldParam("timestamp")
This request sorts by timestamp on the testCase index, but the test_case mapping doesn’t have a root timestamp field (it’s under testCaseResult.timestamp). This sort (and any range filter on timestamp) won’t behave as intended unless you use the correct index/field path.
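If the mapping is as this comment describes, the sort and filters would need to target the nested result fields. A rough sketch of such a query body follows; the field paths are taken from this review's comments, not verified against the live index mapping, and the epoch value is a placeholder:

```json
{
  "query": {
    "bool": {
      "must": [
        { "term":  { "testCaseResult.testCaseStatus": "failed" } },
        { "range": { "testCaseResult.timestamp": { "gte": 1700000000 } } }
      ]
    }
  },
  "sort": [{ "testCaseResult.timestamp": { "order": "desc" } }]
}
```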
import { getTestCaseDetailPath } from '../../utils/RouterUtils';
import {
  DataQualityCheckImpact,
  getDataQualityCheckImpact,
} from '../../rest/testAPI';
The RouterUtils/testAPI imports are using an incorrect relative path for this file location, and getTestCaseDetailPath isn’t exported (the codebase has getTestCaseDetailPagePath). This will fail TypeScript compilation until the import paths/symbols are corrected.
Suggested change:
- import { getTestCaseDetailPath } from '../../utils/RouterUtils';
- import {
-   DataQualityCheckImpact,
-   getDataQualityCheckImpact,
- } from '../../rest/testAPI';
+ import { getTestCaseDetailPagePath as getTestCaseDetailPath } from '../../../utils/RouterUtils';
+ import {
+   DataQualityCheckImpact,
+   getDataQualityCheckImpact,
+ } from '../../../rest/testAPI';
const handleTestCaseClick = (testCaseFQN: string) => {
  history.push(getTestCaseDetailPagePath(testCaseFQN));
};
handleTestCaseClick calls getTestCaseDetailPagePath(...), but that symbol isn’t imported in this file, so this will be a compile-time error. Import the correct RouterUtils helper (and remove the unused one).
StringBuilder query = new StringBuilder();
query.append("{\"query\": {\"bool\": {\"must\": [");

if (testCaseStatus != null && !testCaseStatus.isEmpty()) {
  query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase()));
}

query.append(
    String.format(
        "{\"range\": {\"timestamp\": {\"gte\": %d}}}}}", thirtyDaysAgo / 1000));
return query.toString();
buildTestCaseImpactQuery starts a JSON query string but never closes the bool.must array/object properly, so it will generate invalid JSON. Build this query with a JSON builder (or fix the missing closing brackets/braces).
Suggested change:
- StringBuilder query = new StringBuilder();
- query.append("{\"query\": {\"bool\": {\"must\": [");
- if (testCaseStatus != null && !testCaseStatus.isEmpty()) {
-   query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase()));
- }
- query.append(
-     String.format(
-         "{\"range\": {\"timestamp\": {\"gte\": %d}}}}}", thirtyDaysAgo / 1000));
- return query.toString();
+ JsonArrayBuilder mustClauses = Json.createArrayBuilder();
+ if (testCaseStatus != null && !testCaseStatus.isEmpty()) {
+   mustClauses.add(
+       Json.createObjectBuilder()
+           .add(
+               "term",
+               Json.createObjectBuilder()
+                   .add("testCaseStatus", testCaseStatus.toLowerCase())));
+ }
+ mustClauses.add(
+     Json.createObjectBuilder()
+         .add(
+             "range",
+             Json.createObjectBuilder()
+                 .add(
+                     "timestamp",
+                     Json.createObjectBuilder().add("gte", thirtyDaysAgo / 1000))));
+ JsonObjectBuilder query =
+     Json.createObjectBuilder()
+         .add(
+             "query",
+             Json.createObjectBuilder()
+                 .add(
+                     "bool",
+                     Json.createObjectBuilder().add("must", mustClauses)));
+ return query.build().toString();
    "testCaseFullyQualifiedName",
    JsonUtils.extractValue(sourceNode.toString(), FULLY_QUALIFIED_NAME));
doc.put("entityFQN", JsonUtils.extractValue(sourceNode.toString(), "entityFQN"));
doc.put("testCaseStatus", JsonUtils.extractValue(sourceNode.toString(), "testCaseStatus"));
doc.put("timestamp", JsonUtils.extractValue(sourceNode.toString(), "timestamp"));
The fields being extracted here (entityFQN, testCaseStatus, timestamp) are not present at the root of the test_case index documents (they’re nested under testCaseResult.*). As a result, these values will be null unless the query/index is adjusted.
- Fix malformed JSON query in buildTestCaseImpactQuery (missing ] bracket)
- Fix wrong function name import (getTestCaseDetailPagePath)
- Fix wrong import path for RouterUtils
- Compute recentIncidents from grouped test case results instead of hardcoded 0
- Fix property name mismatch (entityFQN -> entityFullyQualifiedName)

Fixes open-metadata#26658
Force-pushed from 2741b5d to 1ddfe68 (Compare)
        schema = @Schema(implementation = List.class)))
})
public List<?> getDataQualityCheckImpact(
    @Context UriInfo uriInfo,
The endpoint returns List<?> and documents the response as List.class, even though a dedicated DataQualityCheckImpact schema/model was added. Please return a typed List<DataQualityCheckImpact> (or DTO) and update the OpenAPI response annotation to the correct item type so clients can generate proper bindings.
<Tooltip title={t('message.impact-score-tooltip')}>
  <Tag color={getImpactColor(score)}>{getImpactLabel(score)}</Tag>
</Tooltip>
Several i18n keys referenced in this component (e.g. label.impact-ranking, label.impact-score, message.impact-score-tooltip, label.average-impact, label.data-quality-checks-by-impact, message.no-data-quality-checks-found) are not present in src/locale/languages/en-us.json, so the UI will render raw keys. Please add the missing keys (and run the i18n sync).
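For reference, the missing entries would be added to en-us.json roughly as below. The key names come from the comment above; the English values are placeholders, not the project's agreed copy:

```json
{
  "label": {
    "average-impact": "Average Impact",
    "data-quality-checks-by-impact": "Data Quality Checks by Impact",
    "impact-ranking": "Impact Ranking",
    "impact-score": "Impact Score"
  },
  "message": {
    "impact-score-tooltip": "Score combining downstream usage, consumer count, and recent incidents.",
    "no-data-quality-checks-found": "No Data Quality Checks found."
  }
}
```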
testCaseFullyQualifiedName: string;
entityFullyQualifiedName: string;
testCaseStatus: TestCaseStatus;
timestamp: number;
timestamp is declared as a number, but the backend currently extracts it as a string from the search response and returns it in the response map as a JSON string. Either return a numeric timestamp from the API (preferred) or update this type to match the actual response shape.
Suggested change:
- timestamp: number;
+ timestamp: string;
query.append("{\"query\": {\"bool\": {\"must\": [");

if (testCaseStatus != null && !testCaseStatus.isEmpty()) {
  query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase()));
User input (testCaseStatus) is interpolated into the JSON query without escaping. A value containing quotes will break the JSON (and can be used for query injection); escape it (e.g., escapeDoubleQuotes) or build the query via a JSON builder.
Suggested change:
- query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase()));
+ query.append(
+     String.format(
+         "{\"term\": {\"testCaseStatus\": \"%s\"}},",
+         escapeDoubleQuotes(testCaseStatus.toLowerCase())));
query.append("{\"query\": {\"bool\": {\"must\": [");

if (testCaseStatus != null && !testCaseStatus.isEmpty()) {
  query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase()));
This term filter targets testCaseStatus, but in the TestCase index mapping the field is testCaseResult.testCaseStatus. Using the wrong field path means the status filter will never match any documents.
Suggested change:
- query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase()));
+ query.append(
+     String.format(
+         "{\"term\": {\"testCaseResult.testCaseStatus\": \"%s\"}},",
+         testCaseStatus.toLowerCase()));
query.append(
    String.format(
        "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000));
This range filter targets top-level timestamp, but in the TestCase index mapping the field is testCaseResult.timestamp. Using the wrong field path means the last-30-days filter won’t apply (and may return no results).
Suggested change:
- "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000));
+ "{\"range\": {\"testCaseResult.timestamp\": {\"gte\": %d}}}]}}",
+     thirtyDaysAgo / 1000));
query.append(
    String.format(
        "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000));
The JSON built here is malformed: it opens { "query": { "bool": { "must": [ but the final append only closes with }]}}, missing a final }. This will fail query_filter parsing and the endpoint will return empty results.
Suggested change:
- "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000));
+ "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}}", thirtyDaysAgo / 1000));
JsonNode sourceNode = JsonUtils.extractValue(jsonNode.toString(), SEARCH_SOURCE);
if (sourceNode != null) {
  String testCaseFQN =
      JsonUtils.extractValue(sourceNode.toString(), FULLY_QUALIFIED_NAME);
  String testCaseId = JsonUtils.extractValue(sourceNode.toString(), ID);
  String entityFQN = JsonUtils.extractValue(sourceNode.toString(), "entityFQN");
  String testCaseStatus = JsonUtils.extractValue(sourceNode.toString(), "testCaseStatus");
  String timestamp = JsonUtils.extractValue(sourceNode.toString(), "timestamp");
This loop repeatedly converts JsonNode to string and re-parses it via JsonUtils.extractValue(...). There are existing overloads that operate directly on JsonNode (used elsewhere in this class), which avoids extra parsing and is less error-prone.
Suggested change:
- JsonNode sourceNode = JsonUtils.extractValue(jsonNode.toString(), SEARCH_SOURCE);
- if (sourceNode != null) {
-   String testCaseFQN =
-       JsonUtils.extractValue(sourceNode.toString(), FULLY_QUALIFIED_NAME);
-   String testCaseId = JsonUtils.extractValue(sourceNode.toString(), ID);
-   String entityFQN = JsonUtils.extractValue(sourceNode.toString(), "entityFQN");
-   String testCaseStatus = JsonUtils.extractValue(sourceNode.toString(), "testCaseStatus");
-   String timestamp = JsonUtils.extractValue(sourceNode.toString(), "timestamp");
+ JsonNode sourceNode = JsonUtils.extractValue(jsonNode, SEARCH_SOURCE);
+ if (sourceNode != null) {
+   String testCaseFQN = JsonUtils.extractValue(sourceNode, FULLY_QUALIFIED_NAME);
+   String testCaseId = JsonUtils.extractValue(sourceNode, ID);
+   String entityFQN = JsonUtils.extractValue(sourceNode, "entityFQN");
+   String testCaseStatus = JsonUtils.extractValue(sourceNode, "testCaseStatus");
+   String timestamp = JsonUtils.extractValue(sourceNode, "timestamp");
int downstreamUsage =
    ((Number) testCase.getOrDefault("downstreamUsage", 0)).intValue();
int consumerCount = ((Number) testCase.getOrDefault("consumerCount", 0)).intValue();
int recentIncidents =
    ((Number) testCase.getOrDefault("recentIncidents", 0)).intValue();
downstreamUsage and consumerCount are inputs to the scoring model here, but they’re currently always 0 in the search results (set in searchTestCasesForImpact). That means the 40%/30% weights never affect the score, which is misleading given the feature description; please populate these metrics or remove them from scoring until available.
query.append(
    String.format(
        "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000));
🚨 Bug: JSON query still malformed — missing one closing brace
The fix at line 431 changed }}}} to }]}}, which correctly adds the missing ] for the must array. However, it still produces invalid JSON because there are only 2 closing braces after ] when 3 are needed.
The structure opened is: {"query": {"bool": {"must": [ — that's 3 { and 1 [. After the range clause closes its own 3 braces, the remaining closers should be ]}}} (1 ] + 3 }), but the code produces ]}} (1 ] + 2 }).
This means the Elasticsearch query is still malformed and searchTestCasesForImpact will throw a parse error, causing the entire impact ranking feature to return empty results.
Suggested fix:
Change the format string to include the missing brace:
"{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}}"
Or better yet, use a JSON library to build queries instead of string concatenation to prevent these bracket-matching issues.
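The bracket arithmetic above can be verified mechanically. A throwaway balance counter (not part of the PR) shows that the buggy tail leaves one opener unclosed while the corrected tail balances:

```java
// Counts net unclosed '{' / '[' openers in a string; 0 means every opener
// has a matching closer. Used here only to sanity-check the query fragments
// discussed in this review.
public final class BraceBalance {

  static int balance(String s) {
    int open = 0;
    for (char c : s.toCharArray()) {
      if (c == '{' || c == '[') open++;
      if (c == '}' || c == ']') open--;
    }
    return open;
  }
}
```

Feeding it the prefix `{"query": {"bool": {"must": [` plus the `]}}` tail yields 1 (one brace short); with the `]}}}` tail it yields 0.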
doc.put("downstreamUsage", 0);
doc.put("consumerCount", 0);
⚠️ Bug: downstreamUsage and consumerCount always 0, crippling score
While recentIncidents is now correctly computed by counting failed runs (lines 2843-2846), downstreamUsage and consumerCount remain hardcoded to 0 (lines 2858-2859). Since the impact score formula is 0.4 * downstream + 0.3 * consumer + 0.3 * incidents, 70% of the score weight is permanently zeroed out. The ranking will only ever reflect recent incident counts, making the multi-factor scoring model misleading — users see weights advertised for downstream/consumer impact that have no effect.
This was flagged in the previous review and is only partially addressed by this commit.
Suggested fix:
Either:
1. Populate `downstreamUsage` and `consumerCount` with real data from lineage/usage APIs, or
2. Remove these factors from the score formula and UI until they are implemented, to avoid confusing users with always-zero metrics.
Map<String, Object> doc;
if (groupedTestCases.containsKey(testCaseFQN)) {
  doc = groupedTestCases.get(testCaseFQN);
  int runCount = ((Number) doc.getOrDefault("runCount", 0)).intValue();
💡 Edge Case: ClassCastException risk on getOrDefault with Integer 0
At lines 2840 and 2843, doc.getOrDefault("runCount", 0) and doc.getOrDefault("recentIncidents", 0) use an int literal as the default, which is auto-boxed to Integer. Because the map stores Integer values from doc.put("runCount", 1), the (Number) cast currently works, but the pattern is fragile: if anyone changes the stored value type (e.g., to Long), the cast chain could break. This is minor but worth noting for maintainability.
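One way to avoid the getOrDefault-then-cast pattern altogether (a sketch, not the PR's code) is to keep the counters in a typed map and use Map.merge for increments:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: type-safe counter updates, avoiding
// ((Number) map.getOrDefault(key, 0)).intValue() followed by a put.
public final class CounterSketch {

  static void increment(Map<String, Integer> counters, String key) {
    // merge stores 1 on the first sight of the key,
    // otherwise applies Integer::sum to add 1 to the stored value.
    counters.merge(key, 1, Integer::sum);
  }
}
```

This keeps the stored value type in the map's signature, so a later change to Long counters becomes a compile-time error rather than a runtime ClassCastException.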
As suggested, I think the JsonArrayBuilder or JsonUtils should really help to handle dynamic JSON.
Code Review 🚫 Blocked — 4 resolved / 8 findings

Impact ranking for data quality checks is currently blocked by a malformed JSON query and hardcoded zero values for downstream usage metrics. Additionally, accidental indentation changes and a potential ClassCastException require attention.

🚨 Bug: JSON query still malformed — missing one closing brace
📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/TestSuiteRepository.java:431

The fix at line 431 changed }}}} to }]}}, which adds the missing ] for the must array but still leaves the JSON one closing brace short. The structure opened is {"query": {"bool": {"must": [, so the closers after the range clause should be ]}}} rather than ]}}. This means the Elasticsearch query is still malformed and searchTestCasesForImpact will throw a parse error.

Suggested fix: change the format string to include the missing brace, or build the query with a JSON library instead of string concatenation.
Implement Data Quality Checks Impact feature for issue #26658
This helps prioritize which data quality checks are most critical to fix by ranking them based on their business impact.
Describe your changes:
I worked on the Data Quality Checks Impact feature because the issue requested ranking data quality checks by impact to give a clear idea of data quality failure criticality.
Type of change:
Checklist: