
Commit 8363070

Merge pull request #3643 from sosiouxme/20260605-retroactive-symptoms
TRT-2695: retroactively re-evaluate symptoms (API)
2 parents 24c082b + f36ecd5 commit 8363070

10 files changed

Lines changed: 1776 additions & 18 deletions


docs/features/job-analysis-symptoms.md

Lines changed: 23 additions & 3 deletions
@@ -112,6 +112,24 @@ Labels are relevant in several Sippy pages:
112112
They are also displayed in Spyglass (the Deck display of a Prow job) - an HTML summary added to the
113113
job's bucket entry is included by the html lens.
114114

115+
### 6. Retroactive Re-evaluation
116+
117+
The `POST /api/jobs/runs/reevaluate` endpoint re-runs symptom detection for specified job runs.
118+
Unlike the cloud function (which processes files as they arrive), the re-evaluator scans all
119+
artifacts at once for completed job runs.
120+
121+
Flow:
122+
123+
1. Load all active symptom definitions from PostgreSQL (excluding unimplemented matcher types).
124+
2. For each job run, run one `JobArtifactQuery` per symptom against GCS artifacts.
125+
3. Delete existing symptom-originated labels (BQ rows with non-empty `symptom_id`, GCS label files).
126+
4. Write new results to BQ (`job_labels` table), GCS (label JSON files + HTML summary), and
127+
PostgreSQL (`prow_job_runs.labels` and `release_job_runs.labels`).
128+
129+
The delete-then-insert strategy makes re-evaluation idempotent: if a symptom is modified, added, or
130+
removed, re-evaluating produces the correct result. Manually-applied labels (those with empty
131+
`symptom_id`) are preserved through re-evaluation.
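
The delete-then-insert behavior described above can be illustrated with a minimal in-memory sketch. The `Label` type and `replaceSymptomLabels` function are hypothetical names invented for this example and are not taken from the Sippy code; the real service writes to BQ, GCS, and PostgreSQL rather than a slice.

```go
package main

import "fmt"

// Label is a hypothetical, simplified stand-in for a job run label row.
// An empty SymptomID marks a manually applied label.
type Label struct {
	Name      string
	SymptomID string
}

// replaceSymptomLabels drops every symptom-originated label and appends the
// freshly detected ones, leaving manually applied labels untouched.
func replaceSymptomLabels(existing, detected []Label) []Label {
	kept := make([]Label, 0, len(existing)+len(detected))
	for _, l := range existing {
		if l.SymptomID == "" {
			kept = append(kept, l) // manual label: preserved
		}
	}
	return append(kept, detected...)
}

func main() {
	existing := []Label{
		{Name: "ManualNote"},
		{Name: "InfraFailure", SymptomID: "OldSymptom"},
	}
	detected := []Label{
		{Name: "NodeProblem", SymptomID: "CreatePodSandboxForPodFailedInJournal"},
	}

	// Running the replacement twice yields the same label set as running it once.
	once := replaceSymptomLabels(existing, detected)
	twice := replaceSymptomLabels(once, detected)
	fmt.Println(once)
	fmt.Println(twice)
}
```

Because stale symptom labels are removed before new ones are written, the outcome depends only on the current symptom definitions and the manual labels, not on how many times re-evaluation has run.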
132+
115133
## Key Code Locations
116134

117135
### Sippy (`openshift/sippy`)
@@ -122,12 +140,13 @@ job's bucket entry is included by the html lens.
122140
| `pkg/db/models/job_labels.go` | `JobRunLabel` - BigQuery row schema for the `job_labels` table. |
123141
| `pkg/db/models/prow.go` | `ProwJobRun.Labels` - the label array stored in Postgres. |
124142
| `pkg/db/models/triage.go` | `TriageSymptom` - junction table linking symptoms to triage records. |
125-
| `pkg/api/jobrunscan/` | API handlers for symptom/label CRUD, with validation logic. |
143+
| `pkg/api/jobrunscan/` | API handlers for symptom/label CRUD and re-evaluation, with validation logic. |
144+
| `pkg/api/jobrunscan/reevaluate.go` | Re-evaluation service: symptom scanning, BQ/GCS/PostgreSQL write logic. |
126145
| `pkg/sippyserver/job_run_scan.go` | HTTP route handlers delegating to the jobrunscan API package. |
127146
| `pkg/sippyclient/jobrunscan/` | Go client library for symptom/label APIs (used by cloud function). |
128147
| `pkg/componentreadiness/jobrunannotator/jobrunannotator.go` | `JobRunAnnotator` - the `annotate-job-runs` tool which can add labels but doesn't (yet) know about symptoms. |
129148
| `pkg/componentreadiness/jobrunannotator/prow_bucket.go` | `JobRunBucketLabel`, `WriteHTMLSummaryToBucket` - writes label files and HTML summaries to GCS. Shared with cloud function. |
130-
| `pkg/api/jobartifacts/` | `JobArtifactQuery`, `ContentMatcher` - the artifact querying and matching engine used by JAQ and symptom evaluation. |
149+
| `pkg/api/jobartifacts/` | `JobArtifactQuery`, `ContentMatcher` - the artifact querying and matching engine used by JAQ and symptom evaluation. Results (matched lines per file) are cached by `(jobRunID, pathGlob, matcherKey)` to avoid re-scanning the same job run for the same query; this does **not** cache raw GCS file contents. |
131150
| `pkg/dataloader/prowloader/prow.go` | `GatherLabelsFromBQ` - reads labels from BQ during fetchdata. |
132151
| `pkg/api/componentreadiness/regressiontracker.go` | `SyncTriageSymptoms` - links symptoms to triage records. |
133152
| `cmd/sippy/seed_data.go` | Bootstrap definitions of symptoms and labels for use in manual testing. |
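
The result caching mentioned in the `pkg/api/jobartifacts/` row can be pictured roughly as follows. This is only a sketch: the `cacheKey` and `matchCache` types are hypothetical, the example glob and matcher strings are made up, and the actual package may structure its cache differently (for instance with locking or expiry, both omitted here).

```go
package main

import "fmt"

// cacheKey mirrors the (jobRunID, pathGlob, matcherKey) tuple from the docs.
// The struct itself is an assumption made for this sketch.
type cacheKey struct {
	JobRunID   string
	PathGlob   string
	MatcherKey string
}

// matchCache stores matched lines per file, never raw file contents.
// It is not safe for concurrent use.
type matchCache struct {
	entries map[cacheKey]map[string][]string
	scans   int // number of times the underlying scan actually ran
}

func newMatchCache() *matchCache {
	return &matchCache{entries: make(map[cacheKey]map[string][]string)}
}

// lookup returns the cached result for key, running scan only on a miss.
func (c *matchCache) lookup(key cacheKey, scan func() map[string][]string) map[string][]string {
	if hit, ok := c.entries[key]; ok {
		return hit
	}
	c.scans++
	result := scan()
	c.entries[key] = result
	return result
}

func main() {
	cache := newMatchCache()
	key := cacheKey{JobRunID: "1234567890", PathGlob: "**/journal*.log", MatcherKey: "example-matcher"}
	scan := func() map[string][]string {
		return map[string][]string{"journal.log": {"CreatePodSandbox failed"}}
	}

	cache.lookup(key, scan)
	cache.lookup(key, scan) // second call is served from the cache
	fmt.Println("scans performed:", cache.scans)
}
```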
@@ -148,6 +167,7 @@ All endpoints are under `/api/jobs/` and support standard CRUD:
148167
- `GET/PUT/DELETE /api/jobs/labels/{id}` - read / update / delete
149168
- `GET/POST /api/jobs/symptoms` - list / create symptoms
150169
- `GET/PUT/DELETE /api/jobs/symptoms/{id}` - read / update / delete
170+
- `POST /api/jobs/runs/reevaluate` - re-evaluate symptoms for specified job runs
151171

152172
See `pkg/api/jobrunscan/` for validation rules and `pkg/api/README.md` for broader API
153173
documentation.
@@ -171,7 +191,7 @@ The symptoms pipeline (definition → detection → labeling → display) is ful
171191
Active/planned work includes:
172192

173193
- **Retroactive re-evaluation** ([TRT-2695](https://redhat.atlassian.net/browse/TRT-2695))
174-
- API and UI to re-run symptom matching for past job runs.
194+
- Backend API complete (`POST /api/jobs/runs/reevaluate`). Frontend UI is pending.
175195
- **Compound symptoms** ([TRT-2466](https://redhat.atlassian.net/browse/TRT-2466))
176196
- richer CEL-based label composition.
177197
- **Full management UI** ([TRT-2479](https://redhat.atlassian.net/browse/TRT-2479))

docs/plans/trt-2695-retroactive-symptoms-backend-plan.md

Lines changed: 475 additions & 0 deletions
Large diffs are not rendered by default.

pkg/api/README.md

Lines changed: 56 additions & 0 deletions
@@ -396,6 +396,62 @@ A summary of runs for job(s). Results contains of the following values for each
396396
| job | String | Return only jobs containing this value in their name | N/A |
397397
| limit | Integer | The maximum number of results to return | N/A |
398398

399+
## Re-evaluate Job Run Symptoms
400+
401+
Endpoint: `POST /api/jobs/runs/reevaluate`
402+
403+
Re-runs all symptom definitions against the artifacts for specified job runs and updates
404+
BigQuery, GCS, and PostgreSQL with the results. Requires `--enable-write-endpoints`.
405+
406+
### Request
407+
408+
```json
409+
{
410+
"prow_job_build_ids": ["1234567890", "0987654321"],
411+
"dry_run": false
412+
}
413+
```
414+
415+
Maximum 50 job run IDs per request. IDs must be numeric strings.
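
A caller could check these constraints before sending a request. The sketch below assumes only the JSON field names shown above; the `reevaluateRequest` type and `validate` function are hypothetical, and rejecting an empty ID list is an assumption of this example rather than documented server behavior.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// maxJobRunIDs is the documented per-request limit.
const maxJobRunIDs = 50

// reevaluateRequest mirrors the documented request body (hypothetical type name).
type reevaluateRequest struct {
	ProwJobBuildIDs []string `json:"prow_job_build_ids"`
	DryRun          bool     `json:"dry_run"`
}

// isNumeric reports whether s is a non-empty string of ASCII digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// validate applies the documented limits; the empty-list check is an assumption.
func validate(req reevaluateRequest) error {
	if len(req.ProwJobBuildIDs) == 0 {
		return errors.New("no job run IDs supplied")
	}
	if len(req.ProwJobBuildIDs) > maxJobRunIDs {
		return fmt.Errorf("too many job run IDs: %d (maximum %d)", len(req.ProwJobBuildIDs), maxJobRunIDs)
	}
	for _, id := range req.ProwJobBuildIDs {
		if !isNumeric(id) {
			return fmt.Errorf("job run ID %q is not a numeric string", id)
		}
	}
	return nil
}

func main() {
	body := []byte(`{"prow_job_build_ids": ["1234567890", "0987654321"], "dry_run": true}`)
	var req reevaluateRequest
	if err := json.Unmarshal(body, &req); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("validation error:", validate(req))
}
```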
416+
417+
### Response (200 OK)
418+
419+
```json
420+
{
421+
"results": [
422+
{
423+
"prow_job_build_id": "1234567890",
424+
"status": "success",
425+
"symptoms_evaluated": 42,
426+
"symptoms_matched": ["CreatePodSandboxForPodFailedInJournal"],
427+
"labels_applied": ["InfraFailure", "NodeProblem"],
428+
"bq_entries_written": 2,
429+
"gcs_artifacts_written": 2,
430+
"postgres_updated": true,
431+
"links": {
432+
"job_run": "https://prow.ci.openshift.org/view/gs/test-platform-results/logs/.../1234567890",
433+
"symptom:CreatePodSandboxForPodFailedInJournal": "http://localhost:8080/api/jobs/symptoms/CreatePodSandboxForPodFailedInJournal"
434+
}
435+
},
436+
{
437+
"prow_job_build_id": "0987654321",
438+
"status": "missing_error",
439+
"error": "job run 0987654321 not found in database"
440+
}
441+
],
442+
"links": {
443+
"self": "http://localhost:8080/api/jobs/runs/reevaluate"
444+
}
445+
}
446+
```
447+
448+
### Status Values
449+
450+
- `success` - re-evaluation completed and all backends updated.
451+
- `missing_error` - the job run ID was not found in the database.
452+
- `eval_error` - artifact scanning failed (timeout, GCS error, database error).
453+
- `rewrite_error` - scanning succeeded but writing to BQ/GCS/PostgreSQL failed.
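
Since a single response can mix successes and failures, a client will usually want to inspect each result's `status`. The following sketch decodes only the fields it needs from the documented response shape; the type names and the `failedRuns` helper are hypothetical and not part of the Sippy client library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runResult holds the subset of documented per-run fields used here.
type runResult struct {
	ProwJobBuildID string `json:"prow_job_build_id"`
	Status         string `json:"status"`
	Error          string `json:"error,omitempty"`
}

// reevaluateResponse mirrors the top-level "results" array.
type reevaluateResponse struct {
	Results []runResult `json:"results"`
}

// failedRuns maps each non-success build ID to "<status>: <error>".
func failedRuns(body []byte) (map[string]string, error) {
	var resp reevaluateResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	failed := make(map[string]string)
	for _, r := range resp.Results {
		if r.Status != "success" {
			failed[r.ProwJobBuildID] = r.Status + ": " + r.Error
		}
	}
	return failed, nil
}

func main() {
	body := []byte(`{"results": [
		{"prow_job_build_id": "1234567890", "status": "success"},
		{"prow_job_build_id": "0987654321", "status": "missing_error",
		 "error": "job run 0987654321 not found in database"}
	]}`)

	failed, err := failedRuns(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for id, msg := range failed {
		fmt.Printf("%s -> %s\n", id, msg)
	}
}
```

Treating anything other than `success` as a failure keeps the sketch robust if further status values are added later; a real client might instead retry `eval_error` and `rewrite_error` while reporting `missing_error` immediately.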
454+
399455
## Tests
400456

401457
Endpoint: `/api/tests`
