-
Notifications
You must be signed in to change notification settings - Fork 2.1k
feat(data-quality): Add impact ranking for data quality checks #26658 #27346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
5c1e40c
1ddfe68
a01d9ca
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -359,7 +359,7 @@ public DataQualityReport getDataQualityReport( | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| return searchRepository.genericAggregation(q, index, searchAggregation, subjectContext); | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| public DataQualityReport getDataQualityReport( | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| public DataQualityReport getDataQualityReport( | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| public DataQualityReport getDataQualityReport( | |
| public DataQualityReport getDataQualityReport( |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scoring loop assumes downstreamUsage, consumerCount, and recentIncidents are present in the input docs, but the current retrieval code defaults them to 0. That makes impactScore always 0 and defeats the ranking.
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
downstreamUsage and consumerCount are inputs to the scoring model here, but they’re currently always 0 in the search results (set in searchTestCasesForImpact). That means the 40%/30% weights never affect the score, which is misleading given the feature description; please populate these metrics or remove them from scoring until available.
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User input (testCaseStatus) is interpolated into the JSON query without escaping. A value containing quotes will break the JSON (and can be used for query injection); escape it (e.g., escapeDoubleQuotes) or build the query via a JSON builder.
| query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase())); | |
| query.append( | |
| String.format( | |
| "{\"term\": {\"testCaseStatus\": \"%s\"}},", | |
| escapeDoubleQuotes(testCaseStatus.toLowerCase()))); |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This term filter targets testCaseStatus, but in the TestCase index mapping the field is testCaseResult.testCaseStatus. Using the wrong field path means the status filter will never match any documents.
| query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase())); | |
| query.append( | |
| String.format( | |
| "{\"term\": {\"testCaseResult.testCaseStatus\": \"%s\"}},", | |
| testCaseStatus.toLowerCase())); |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This range filter targets top-level timestamp, but in the TestCase index mapping the field is testCaseResult.timestamp. Using the wrong field path means the last-30-days filter won’t apply (and may return no results).
| "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000)); | |
| "{\"range\": {\"testCaseResult.timestamp\": {\"gte\": %d}}}]}}", | |
| thirtyDaysAgo / 1000)); |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JSON built here is malformed: it opens { "query": { "bool": { "must": [ but the final append only closes with }]}}, missing a final }. This will fail query_filter parsing and the endpoint will return empty results.
| "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}", thirtyDaysAgo / 1000)); | |
| "{\"range\": {\"timestamp\": {\"gte\": %d}}}]}}}", thirtyDaysAgo / 1000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚨 Bug: JSON query still malformed — missing one closing brace
The fix at line 431 changed }}}} to }]}}, which correctly adds the missing ] for the must array. However, it still produces invalid JSON because there are only 2 closing braces after ] when 3 are needed.
The structure opened is: {"query": {"bool": {"must": [ — that's 3 { and 1 [. After the range clause closes its own 3 braces, the remaining closers should be ]}}} (1 ] + 3 }), but the code produces ]}} (1 ] + 2 }).
This means the Elasticsearch query is still malformed and searchTestCasesForImpact will throw a parse error, causing the entire impact ranking feature to return empty results.
Suggested fix:
Change the format string to include the missing brace:
"{"range": {"timestamp": {"gte": %d}}}]}}}")
Or better yet, use a JSON library to build queries instead of string concatenation to prevent these bracket-matching issues.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
buildTestCaseImpactQuery starts a JSON query string but never closes the bool.must array/object properly, so it will generate invalid JSON. Build this query with a JSON builder (or fix the missing closing brackets/braces).
| StringBuilder query = new StringBuilder(); | |
| query.append("{\"query\": {\"bool\": {\"must\": ["); | |
| if (testCaseStatus != null && !testCaseStatus.isEmpty()) { | |
| query.append(String.format("{\"term\": {\"testCaseStatus\": \"%s\"}},", testCaseStatus.toLowerCase())); | |
| } | |
| query.append( | |
| String.format( | |
| "{\"range\": {\"timestamp\": {\"gte\": %d}}}}}", thirtyDaysAgo / 1000)); | |
| return query.toString(); | |
| JsonArrayBuilder mustClauses = Json.createArrayBuilder(); | |
| if (testCaseStatus != null && !testCaseStatus.isEmpty()) { | |
| mustClauses.add( | |
| Json.createObjectBuilder() | |
| .add( | |
| "term", | |
| Json.createObjectBuilder() | |
| .add("testCaseStatus", testCaseStatus.toLowerCase()))); | |
| } | |
| mustClauses.add( | |
| Json.createObjectBuilder() | |
| .add( | |
| "range", | |
| Json.createObjectBuilder() | |
| .add( | |
| "timestamp", | |
| Json.createObjectBuilder().add("gte", thirtyDaysAgo / 1000)))); | |
| JsonObjectBuilder query = | |
| Json.createObjectBuilder() | |
| .add( | |
| "query", | |
| Json.createObjectBuilder() | |
| .add( | |
| "bool", | |
| Json.createObjectBuilder().add("must", mustClauses))); | |
| return query.build().toString(); |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -542,10 +542,53 @@ public DataQualityReport getDataQualityReport( | |||||
| if (nullOrEmpty(aggregationQuery) || nullOrEmpty(index)) { | ||||||
| throw new IllegalArgumentException("aggregationQuery and index are required parameters"); | ||||||
| } | ||||||
| SubjectContext subjectContext = getSubjectContext(securityContext); | ||||||
| SubjectContext subjectContext = getSubjectContext(securityContext); | ||||||
|
||||||
| SubjectContext subjectContext = getSubjectContext(securityContext); | |
| SubjectContext subjectContext = getSubjectContext(securityContext); |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The endpoint returns List<?> and documents the response as List.class, even though a dedicated DataQualityCheckImpact schema/model was added. Please return a typed List<DataQualityCheckImpact> (or DTO) and update the OpenAPI response annotation to the correct item type so clients can generate proper bindings.
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The endpoint currently returns List<?>, which loses type safety and makes the OpenAPI schema unhelpful. Prefer returning a typed List<DataQualityCheckImpact> (schema was added) to provide a stable API contract.
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testCaseStatus is accepted as a free-form String and then interpolated into the search JSON. To avoid invalid values (and potential query injection), model it as the existing TestCaseStatus enum and reject unknown values with a 400.
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -2785,11 +2785,92 @@ public Response searchEntityRelationship( | |||||||||||||||||||||||||||||||
| fqn, upstreamDepth, downstreamDepth, queryFilter, deleted); | ||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||
| public Response searchDataQualityLineage( | ||||||||||||||||||||||||||||||||
| public Response searchDataQualityLineage( | ||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||
| public Response searchDataQualityLineage( | |
| public Response searchDataQualityLineage( |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SearchRequest.withQuery(...) is treated as a query-string in the search layer; passing JSON DSL here won’t be parsed as DSL. If you want to pass ES query DSL, it needs to go in queryFilter (or the dedicated raw-DSL parameter), not in query.
| .withQuery(query) | |
| .withQuery("*") | |
| .withQueryFilter(query) |
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This request sorts by timestamp on the testCase index, but the test_case mapping doesn’t have a root timestamp field (it’s under testCaseResult.timestamp). This sort (and any range filter on timestamp) won’t behave as intended unless you use the correct index/field path.
Copilot
AI
Apr 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop repeatedly converts JsonNode to string and re-parses it via JsonUtils.extractValue(...). There are existing overloads that operate directly on JsonNode (used elsewhere in this class), which avoids extra parsing and is less error-prone.
| JsonNode sourceNode = JsonUtils.extractValue(jsonNode.toString(), SEARCH_SOURCE); | |
| if (sourceNode != null) { | |
| String testCaseFQN = | |
| JsonUtils.extractValue(sourceNode.toString(), FULLY_QUALIFIED_NAME); | |
| String testCaseId = JsonUtils.extractValue(sourceNode.toString(), ID); | |
| String entityFQN = JsonUtils.extractValue(sourceNode.toString(), "entityFQN"); | |
| String testCaseStatus = JsonUtils.extractValue(sourceNode.toString(), "testCaseStatus"); | |
| String timestamp = JsonUtils.extractValue(sourceNode.toString(), "timestamp"); | |
| JsonNode sourceNode = JsonUtils.extractValue(jsonNode, SEARCH_SOURCE); | |
| if (sourceNode != null) { | |
| String testCaseFQN = JsonUtils.extractValue(sourceNode, FULLY_QUALIFIED_NAME); | |
| String testCaseId = JsonUtils.extractValue(sourceNode, ID); | |
| String entityFQN = JsonUtils.extractValue(sourceNode, "entityFQN"); | |
| String testCaseStatus = JsonUtils.extractValue(sourceNode, "testCaseStatus"); | |
| String timestamp = JsonUtils.extractValue(sourceNode, "timestamp"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Edge Case: ClassCastException risk on getOrDefault with Integer 0
At lines 2840 and 2843, doc.getOrDefault("runCount", 0) and doc.getOrDefault("recentIncidents", 0) use an int literal as the default. Since the map stores Integer values from doc.put("runCount", 1), the cast (Number) works. However, the default 0 is auto-boxed to Integer which is fine, but mixing with the pattern is fragile — if anyone changes the stored value type (e.g., to Long), the cast chain could break. This is minor but worth noting for maintainability.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚠️ Bug: downstreamUsage and consumerCount always 0, crippling score
While recentIncidents is now correctly computed by counting failed runs (lines 2843-2846), downstreamUsage and consumerCount remain hardcoded to 0 (lines 2858-2859). Since the impact score formula is 0.4 * downstream + 0.3 * consumer + 0.3 * incidents, 70% of the score weight is permanently zeroed out. The ranking will only ever reflect recent incident counts, making the multi-factor scoring model misleading — users see weights advertised for downstream/consumer impact that have no effect.
This was flagged in the previous review and is only partially addressed by this commit.
Suggested fix:
Either:
1. Populate `downstreamUsage` and `consumerCount` with real data from lineage/usage APIs, or
2. Remove these factors from the score formula and UI until they are implemented, to avoid confusing users with always-zero metrics.
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,74 @@ | ||||||
| { | ||||||
| "$id": "https://open-metadata.org/schema/tests/dataQualityCheckImpact.json", | ||||||
| "$schema": "http://json-schema.org/draft-07/schema#", | ||||||
| "title": "DataQualityCheckImpact", | ||||||
| "description": "Data Quality Check Impact model for ranking checks by business criticality.", | ||||||
| "type": "object", | ||||||
| "javaType": "org.openmetadata.schema.tests.DataQualityCheckImpact", | ||||||
| "properties": { | ||||||
| "testCaseId": { | ||||||
| "description": "Unique identifier of the test case.", | ||||||
| "$ref": "../type/basic.json#/definitions/uuid" | ||||||
| }, | ||||||
| "testCaseFullyQualifiedName": { | ||||||
| "description": "Fully qualified name of the test case.", | ||||||
| "$ref": "../type/basic.json#/definitions/fullyQualifiedEntityName" | ||||||
| }, | ||||||
| "testSuiteId": { | ||||||
| "description": "Unique identifier of the test suite.", | ||||||
| "$ref": "../type/basic.json#/definitions/uuid" | ||||||
| }, | ||||||
| "testSuiteFullyQualifiedName": { | ||||||
| "description": "Fully qualified name of the test suite.", | ||||||
| "$ref": "../type/basic.json#/definitions/fullyQualifiedEntityName" | ||||||
| }, | ||||||
| "entityFullyQualifiedName": { | ||||||
|
||||||
| "entityFullyQualifiedName": { | |
| "entityFQN": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Quality: Indentation damage on pre-existing lines (accidental whitespace changes)
Several pre-existing lines had their leading indentation removed (e.g.,
getDataQualityReportat line 362,searchDataQualityLineageat line 2787,SubjectContextat line 545,getDataQualityTabat line 74). These appear to be accidental whitespace-only changes that break the consistent indentation style of the codebase. They should be reverted to avoid noisy diffs.Was this helpful? React with 👍 / 👎 | Reply
gitar fixto apply this suggestion