[reliability] Daily Reliability Review - 2026-05-20 #33648

### Executive Summary

Over the last 24h Sentry recorded 5,926 spans (5,867 `gen_ai`, 59 `default`) from `github/gh-aw`. Most workflows completed: 2,901 spans carry `gh-aw.run.status:success` versus 32 `failure` spans across 6 workflows. No `errors` or `logs` events were ingested in the same window, which is an explicit observability gap, not a sign of health. The latency tail is heavy: 7 individual `gen_ai` spans exceeded 15 min (max ~32 min), all on long-running scheduled jobs and all marked `success`.

Core reliability fields are missing or null on the bulk of spans: `span.status` is `null` on 100% of spans, `gen_ai.response.finish_reasons` and `gh-aw.run.conclusion` are absent from the queryable index, and `release` / `service.version` are not populated. This means runtime outcome cannot be inferred from traces alone — most "failure" signal is currently locked behind a single attribute (`gh-aw.run.status`) emitted only on the conclusion span.

A representative cross-check also surfaced a within-run inconsistency: trace `2fa055d5fa45b83d898cc0908a369e65` (PR Sous Chef, run 26166375489) carries both `success` and `failure` values of `gh-aw.run.status` on different spans, yet `gh run view` reports the run's overall conclusion as `success`. This needs investigating before `gh-aw.run.status` can be relied on as the canonical failure signal.
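
For scale, the span counts above can be turned into two ratios: how many spans carry `gh-aw.run.status` at all, and what share of those are `failure`. This is a minimal sketch using only the figures quoted in this summary. It assumes `success` and `failure` are the only two values of the attribute in the window, and it counts spans, not runs, so it should not be read as a run-level failure rate.

```python
# Figures copied from the Executive Summary; nothing here is queried live.
TOTAL_SPANS = 5_926
SUCCESS_SPANS = 2_901
FAILURE_SPANS = 32


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumption: success and failure are the only run.status values present.
tagged = SUCCESS_SPANS + FAILURE_SPANS
coverage_pct = share(tagged, TOTAL_SPANS)
failure_pct = share(FAILURE_SPANS, tagged)

print(f"spans carrying gh-aw.run.status: {tagged} ({coverage_pct}%)")
print(f"failure share among tagged spans: {failure_pct}%")
```

Under those assumptions roughly half of all spans carry the attribute, and about 1% of the tagged spans are marked `failure`.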

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | All gh-aw workflows | `gen_ai.response.finish_reasons` and `gh-aw.run.conclusion` not queryable; `span.status` null on 100% of spans (5,926) | spans dataset, 24h, `has:gen_ai.response.finish_reasons` → 0 results; `has:gh-aw.run.conclusion` → 0 results; `span.status` aggregate → `null` for both ops | Verify conclusion-span emission path in `actions/setup/js/send_otlp_span.cjs` (lines 1745, 1798–1799): confirm conclusion spans are being exported and that Sentry is indexing these attribute keys |
| P1 | Errors / Logs datasets | No events ingested in 24h window | `dataset:errors` and `dataset:logs` both return zero rows | Confirm Sentry SDK / OTLP exporter is configured to forward error + log signals (current setup appears to emit spans only) |
| P2 | Contribution Check | 8 failure spans in 24h (avg 94s, max 5.3 min); confirmed real failure on run [26179902753](https://github.com/github/gh-aw/actions/runs/26179902753) (`conclusion=failure`) | spans dataset, `gh-aw.run.status:failure` grouped by workflow | Open targeted investigation of Contribution Check job; current cadence implies repeated failure pattern (8 failure spans across at least 2 runs in 24h) |
| P2 | PR Sous Chef | 8 failure-marked spans, but cross-check of run [26166375489](https://github.com/github/gh-aw/actions/runs/26166375489) shows GH conclusion=`success`; both `success` and `failure` appear in the same trace | trace `2fa055d5fa45b83d898cc0908a369e65` shows spans with both run.status values for the same run.id | Audit `gh-aw.run.status` emission — confirm it's not being written from an intermediate step (e.g. retried agent attempt) that doesn't reflect the run's final conclusion |
| P2 | Safe Output Health Monitor | 4 failure spans, but max span duration 12.7 min — suggests slow failure path | `gh-aw.run.status:failure` grouped by workflow, max(span.duration)=766180ms | Inspect the long failure span; if this is a timeout, finish_reasons would distinguish it from a logic error — but that attribute is currently not present in the index |
| P3 | Daily Security Observability Report, Copilot Session Insights, Typist - Go Type Analysis, GitHub API Consumption Report, Copilot Agent Prompt Clustering, Copilot PR Conversation NLP Analysis, Daily AW Cross-Repo Compile Check | 7 single-span outliers > 15 min (max 32 min) | spans dataset, `span.duration:>900000`, all `gh-aw.run.status:success` | Confirm whether these are expected wall-clock budgets for these workflows. If unintended, add per-workflow latency SLO before treating as regressions |
| P3 | All gh-aw spans | `release` (Sentry) and `service.version` (OTLP resource) not present in index | `has:release` → all 5,926 rows have `release=null`; `has:service.version` → 0 rows | Map OTLP resource attribute `service.version` to Sentry `release` in the project's ingest config — required for regression-by-version analysis |
| P3 | Test Quality Sentinel, Matt Pocock Skills Reviewer, PR Code Quality Reviewer | Highest input-token consumers (135M, 108M, 86M input tokens / 24h) | spans dataset, `sum(gen_ai.usage.input_tokens)` grouped by workflow | Inconclusive whether this is truncation-driven without `gen_ai.response.finish_reasons`; once that attribute is queryable, recheck for `length` finishes |

### Representative Traces

<details>
<summary>View representative traces</summary>

- **Longest gen_ai span (32 min, success)** — Daily Security Observability Report, span `c1b79721b772812e`, trace [`00b0c204449765a74224a36a160e77c2`](https://github.sentry.io/explore/traces/trace/00b0c204449765a74224a36a160e77c2). Single very-long gen_ai span at 2026-05-20T16:48:44Z; no failure markers; example of latency outlier with no observable cause attribute.
- **Run.status divergence** — PR Sous Chef run [26166375489](https://github.com/github/gh-aw/actions/runs/26166375489), trace [`2fa055d5fa45b83d898cc0908a369e65`](https://github.sentry.io/explore/traces/trace/2fa055d5fa45b83d898cc0908a369e65). Spans `669eff6fe7906642` (6m 31s) and `2ec351c8bf044ecd` (62s) both carry `gh-aw.run.status:failure`, while interleaved spans carry `success`. GitHub Actions reports the overall run as `success`.
- **Confirmed real failure** — Contribution Check run [26179902753](https://github.com/github/gh-aw/actions/runs/26179902753), trace [`291367aad0386117e8f212775a33bf37`](https://github.sentry.io/explore/traces/trace/291367aad0386117e8f212775a33bf37). 2 failure-marked spans (5m 6s and 55s); `gh run view` confirms conclusion=`failure`.
- **Slow failure path** — Safe Output Health Monitor failure span, max duration 12.7 min. Investigate why a monitor workflow takes >12 min on the failure path.

</details>
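
The run.status divergence listed above was found by inspecting one trace by hand. A sketch of how the same check could be run over every trace in the window follows. The row shape is an assumption: each row is taken to be a dict with a `trace` key and an optional `gh-aw.run.status` key, as one might build from `list_events` results. The function name and the placeholder span ids are hypothetical, and the sample rows are illustrative rather than an export of real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def traces_with_mixed_status(rows: Iterable[Mapping[str, str]]) -> Dict[str, Set[str]]:
    """Map trace id -> run.status values, keeping only traces with more than one value."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        status = row.get("gh-aw.run.status")
        if status:  # skip spans that do not carry the attribute
            seen[row["trace"]].add(status)
    return {trace: values for trace, values in seen.items() if len(values) > 1}


# Illustrative rows only. Trace ids and the first two span ids come from the
# list above; ids starting with "placeholder" are made up for the example.
SAMPLE_ROWS = [
    {"trace": "2fa055d5fa45b83d898cc0908a369e65", "span_id": "669eff6fe7906642", "gh-aw.run.status": "failure"},
    {"trace": "2fa055d5fa45b83d898cc0908a369e65", "span_id": "2ec351c8bf044ecd", "gh-aw.run.status": "failure"},
    {"trace": "2fa055d5fa45b83d898cc0908a369e65", "span_id": "placeholder-1", "gh-aw.run.status": "success"},
    {"trace": "291367aad0386117e8f212775a33bf37", "span_id": "placeholder-2", "gh-aw.run.status": "failure"},
    {"trace": "00b0c204449765a74224a36a160e77c2", "span_id": "c1b79721b772812e", "gh-aw.run.status": "success"},
]

mixed = traces_with_mixed_status(SAMPLE_ROWS)
for trace_id, statuses in mixed.items():
    print(trace_id, sorted(statuses))
```

A trace that appears in the result is a candidate for the same per-step versus per-run ambiguity seen on PR Sous Chef; it still needs a `gh run view` cross-check to establish the run's real conclusion.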

### Recommendations

1. **Restore conclusion attributes in the span index.** `send_otlp_span.cjs:1798-1799` indicates that `gen_ai.response.finish_reasons` is emitted on the conclusion span, but a `has:` query for it returns 0 results in the spans dataset over 24h. There are three candidate causes: the conclusion span is not reaching Sentry, the attribute is being dropped or renamed during ingest, or Sentry's span index is not capturing it. As a first step, either validate locally against `/tmp/gh-aw/otel.jsonl`, or open a conclusion span in the Sentry UI to confirm whether the attribute is present at the event level but excluded from the queryable index.
2. **Map `service.version` to Sentry `release`.** The resource attribute is emitted at `send_otlp_span.cjs:322` but does not appear in the index, most likely because of a missing ingest-side mapping. Without it, regression-by-version triage is impossible.
3. **Audit `gh-aw.run.status` semantics.** A single run with both `success` and `failure` spans is a signal that this attribute is set per-step/attempt rather than per-run. Either (a) restrict emission to the final conclusion span only, or (b) rename mid-run status to `gh-aw.step.status` and reserve `gh-aw.run.status` for the terminal value.
4. **Forward errors and logs to Sentry.** Both datasets returned zero rows in 24h, which is almost certainly under-instrumentation rather than zero errors. Confirm exporter scope before the next reliability review so the report can include error-class evidence.
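
The local validation suggested in recommendation 1 could be sketched as below. It counts how many spans carry each attribute key of interest. The file format is an assumption: the sketch treats `/tmp/gh-aw/otel.jsonl` as one OTLP/JSON export payload per line (`resourceSpans` → `scopeSpans` → `spans`, with attributes as a list of `key`/`value` objects). If the file is written in a different shape, `iter_spans` needs adjusting. The function names and the sample payload are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

KEYS_TO_CHECK = (
    "gen_ai.response.finish_reasons",
    "gh-aw.run.conclusion",
    "gh-aw.run.status",
)


def iter_spans(lines: Iterable[str]) -> Iterator[dict]:
    """Yield span objects from OTLP/JSON payloads, assuming one payload per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        payload = json.loads(line)
        for resource_spans in payload.get("resourceSpans", []):
            for scope_spans in resource_spans.get("scopeSpans", []):
                yield from scope_spans.get("spans", [])


def count_attribute_keys(lines: Iterable[str], keys=KEYS_TO_CHECK) -> Counter:
    """Count, per key, the spans whose attribute list contains that key."""
    counts = Counter({key: 0 for key in keys})
    for span in iter_spans(lines):
        present = {attr.get("key") for attr in span.get("attributes", [])}
        for key in keys:
            if key in present:
                counts[key] += 1
    return counts


# Hypothetical one-line payload standing in for a line of the real file.
SAMPLE_LINE = json.dumps({
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [{
                "name": "conclusion",
                "attributes": [
                    {"key": "gh-aw.run.status", "value": {"stringValue": "success"}},
                ],
            }],
        }],
    }],
})

for key, n in count_attribute_keys([SAMPLE_LINE]).items():
    print(f"{key}: {n} span(s)")
```

To run it against the real file, pass an open file handle for `/tmp/gh-aw/otel.jsonl` in place of `[SAMPLE_LINE]`. A non-zero local count for `gen_ai.response.finish_reasons` would point at ingest or indexing; a zero count would point at the emit path.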

### Notes

<details>
<summary>View notes</summary>

- **Inconclusive runtime outcome for the latency outliers.** All 7 spans > 15 min are marked `gh-aw.run.status:success`, but without `gen_ai.response.finish_reasons` or OTLP `status.code`, we cannot distinguish a long successful run from a runaway one. Treat the latency table as observation-only until conclusion attributes are queryable.
- **`span.status` is `null` in 100% of sampled spans.** `span.status` is Sentry's mapping of the OTLP `status.code` field. The emit side sets it at `send_otlp_span.cjs:295`, but it does not appear in the index, which suggests the same ingest gap as `release` / `service.version`.
- **`gh-aw.run.conclusion` is not queryable.** Considered for cross-checking with `gh-aw.run.status`, but it returns 0 rows on `has:` queries.
- **`gen_ai.response.finish_reasons:length` returned 0 rows.** This is consistent with the attribute being absent entirely, not with a real absence of truncated responses. Token-heavy workflows (Test Quality Sentinel: 135M input tokens / 24h) cannot be evaluated for truncation until this is fixed.
- **`errors` and `logs` datasets are empty for the project in the 24h window.** This is reported as an instrumentation/forwarding gap, not as "no failures occurred."
- **The Sentry MCP build available here exposes `list_events` only; `search_events` and `get_trace_details` were not available.** Trace continuity was verified by `list_events` filtered on `trace:<id>`. All trace links above are direct UI links.

</details>

**References:**
- [§26195453006](https://github.com/github/gh-aw/actions/runs/26195453006) — this reliability review run
- [§26179902753](https://github.com/github/gh-aw/actions/runs/26179902753) — Contribution Check confirmed failure
- [§26166375489](https://github.com/github/gh-aw/actions/runs/26166375489) — PR Sous Chef run.status divergence

> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/26195453006) · ● 10.4M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on May 22, 2026, 11:22 PM UTC
