Executive Summary
Over the last 24h Sentry recorded 5,926 spans (5,867 gen_ai, 59 default) from github/gh-aw. Most workflows completed: 2,901 spans carry gh-aw.run.status:success vs 32 spans failure across 6 workflows. No errors or logs events were ingested in the same window — an explicit observability gap, not a sign of health. Latency tail is heavy: 7 individual gen_ai spans exceeded 15 min (max ~32 min), all on long-running scheduled jobs and all marked success.
Core reliability fields are missing or null on the bulk of spans: span.status is null on 100% of spans, gen_ai.response.finish_reasons and gh-aw.run.conclusion are absent from the queryable index, and release / service.version are not populated. This means runtime outcome cannot be inferred from traces alone — most "failure" signal is currently locked behind a single attribute (gh-aw.run.status) emitted only on the conclusion span.
A representative cross-check also surfaced a within-run inconsistency: trace 2fa055d5fa45b83d898cc0908a369e65 (PR Sous Chef, run 26166375489) carries both success and failure values of gh-aw.run.status on different spans, yet gh run view reports the run's overall conclusion as success. Worth investigating before relying on that attribute as the canonical failure signal.
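The within-run inconsistency described above can be checked mechanically once span rows are exported. The following is a minimal sketch, assuming each row is a flat dict and that the run identifier is stored under the key `gh-aw.run.id` (the key name and the sample rows are hypothetical; only `gh-aw.run.status` is taken from this report).

```python
from collections import defaultdict

def runs_with_mixed_status(spans):
    """Return {run_id: sorted statuses} for runs carrying more than one status value."""
    statuses = defaultdict(set)
    for span in spans:
        run_id = span.get("gh-aw.run.id")      # assumed key name
        status = span.get("gh-aw.run.status")  # attribute named in the report
        if run_id is not None and status is not None:
            statuses[run_id].add(status)
    return {rid: sorted(vals) for rid, vals in statuses.items() if len(vals) > 1}

# Hypothetical rows, for illustration only.
spans = [
    {"gh-aw.run.id": "run-a", "gh-aw.run.status": "success"},
    {"gh-aw.run.id": "run-a", "gh-aw.run.status": "failure"},
    {"gh-aw.run.id": "run-b", "gh-aw.run.status": "failure"},
    {"gh-aw.run.id": "run-b", "gh-aw.run.status": "failure"},
]
mixed = runs_with_mixed_status(spans)
print(mixed)
```

Any run returned by this check is one where the attribute cannot be read as a single per-run outcome without knowing which span is the terminal one.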
Top Reliability Findings
| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | All gh-aw workflows | gen_ai.response.finish_reasons and gh-aw.run.conclusion not queryable; span.status null on 100% of spans (5,923) | spans dataset, 24h, has:gen_ai.response.finish_reasons → 0 results; has:gh-aw.run.conclusion → 0 results; span.status aggregate → null for both ops | Verify conclusion-span emission path in actions/setup/js/send_otlp_span.cjs (lines 1745, 1798–1799): confirm conclusion spans are being exported and that Sentry is indexing these attribute keys |
| P1 | Errors / Logs datasets | No events ingested in 24h window | dataset:errors and dataset:logs both return zero rows | Confirm Sentry SDK / OTLP exporter is configured to forward error + log signals (current setup appears to emit spans only) |
| P2 | Contribution Check | 8 failure spans in 24h (avg 94s, max 5.3 min); confirmed real failure on run 26179902753 (conclusion=failure) | spans dataset, gh-aw.run.status:failure grouped by workflow | Open targeted investigation of Contribution Check job; current cadence implies repeated failure pattern (8 failure spans across at least 2 runs in 24h) |
| P2 | PR Sous Chef | 8 failure-marked spans, but cross-check of run 26166375489 shows GH conclusion=success; both success and failure appear in the same trace | trace 2fa055d5fa45b83d898cc0908a369e65 shows spans with both run.status values for the same run.id | Audit gh-aw.run.status emission: confirm it's not being written from an intermediate step (e.g. retried agent attempt) that doesn't reflect the run's final conclusion |
| P2 | Safe Output Health Monitor | 4 failure spans, but max span duration 12.7 min, which suggests a slow failure path | gh-aw.run.status:failure grouped by workflow, max(span.duration)=766180ms | Inspect the long failure span; if this is a timeout, finish_reasons would distinguish it from a logic error, but that attribute is currently not present in the index |
| P3 | Daily Security Observability Report, Copilot Session Insights, Typist - Go Type Analysis, GitHub API Consumption Report, Copilot Agent Prompt Clustering, Copilot PR Conversation NLP Analysis, Daily AW Cross-Repo Compile Check | 7 single-span outliers > 15 min (max 32 min) | spans dataset, span.duration:>900000, all gh-aw.run.status:success | Confirm whether these are expected wall-clock budgets for these workflows. If unintended, add per-workflow latency SLO before treating as regressions |
| P3 | All gh-aw spans | release (Sentry) and service.version (OTLP resource) not present in index | has:release → all 5,926 rows have release=null; has:service.version → 0 rows | Map OTLP resource attribute service.version to Sentry release in the project's ingest config; this is required for regression-by-version analysis |
| P3 | Test Quality Sentinel, Matt Pocock Skills Reviewer, PR Code Quality Reviewer | Highest input-token consumers (135M, 108M, 86M input tokens / 24h) | spans dataset, sum(gen_ai.usage.input_tokens) grouped by workflow | Inconclusive whether this is truncation-driven without gen_ai.response.finish_reasons; once that attribute is queryable, recheck for length finishes |
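The per-workflow latency SLO suggested for the > 15 min outliers could take the shape of a budget lookup with a default threshold. This is only a sketch: the 900,000 ms default mirrors the span.duration:>900000 query above, while the workflow names, budget values, and row fields (`workflow`, `duration_ms`) are hypothetical.

```python
DEFAULT_BUDGET_MS = 900_000  # 15 min, same threshold as the span.duration:>900000 query

def spans_over_budget(spans, budgets_ms=None, default_ms=DEFAULT_BUDGET_MS):
    """Return (workflow, duration_ms, limit_ms) for spans exceeding their workflow's budget."""
    budgets_ms = budgets_ms or {}
    flagged = []
    for span in spans:
        # Fall back to the default when a workflow has no explicit budget.
        limit = budgets_ms.get(span["workflow"], default_ms)
        if span["duration_ms"] > limit:
            flagged.append((span["workflow"], span["duration_ms"], limit))
    return flagged

# Hypothetical budgets and rows, for illustration only.
budgets = {"nightly-report": 2_400_000}  # a scheduled job allowed 40 min
spans = [
    {"workflow": "nightly-report", "duration_ms": 1_920_000},  # 32 min, within budget
    {"workflow": "pr-check", "duration_ms": 1_000_000},        # over the 15 min default
    {"workflow": "pr-check", "duration_ms": 60_000},
]
flagged = spans_over_budget(spans, budgets)
print(flagged)
```

With explicit budgets in place, a 32 min span on a long-running scheduled job stops registering as an outlier, while the same duration on a short PR-triggered workflow would still be flagged.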
Representative Traces
- Longest gen_ai span (32 min, success): Daily Security Observability Report, span c1b79721b772812e, trace 00b0c204449765a74224a36a160e77c2. Single very-long gen_ai span at 2026-05-20T16:48:44Z; no failure markers; example of latency outlier with no observable cause attribute.
- Run.status divergence: PR Sous Chef run 26166375489, trace 2fa055d5fa45b83d898cc0908a369e65. Spans 669eff6fe7906642 (6m 31s) and 2ec351c8bf044ecd (62s) both carry gh-aw.run.status:failure, while interleaved spans carry success. GitHub Actions reports the overall run as success.
- Confirmed real failure: Contribution Check run 26179902753, trace 291367aad0386117e8f212775a33bf37. 2 failure-marked spans (5m 6s and 55s); gh run view confirms conclusion=failure.
- Slow failure path: Safe Output Health Monitor failure span, max duration 12.7 min. Investigate why a monitor workflow takes >12 min on the failure path.
Recommendations
- Restore conclusion attributes in the span index. send_otlp_span.cjs:1798-1799 claims gen_ai.response.finish_reasons is emitted on the conclusion span, but it returns 0 results in the spans dataset over 24h. Either the conclusion span is not reaching Sentry, the attribute is being dropped/renamed during ingest, or Sentry's span index is not capturing it. Pick one: validate locally with /tmp/gh-aw/otel.jsonl, or open the conclusion span in Sentry UI to confirm whether the attribute is present at the event level but excluded from the queryable index.
- Map service.version to Sentry release. Resource attribute is emitted at send_otlp_span.cjs:322 but does not appear in the index (likely an ingest-side mapping). Without it, regression-by-version triage is impossible.
- Audit gh-aw.run.status semantics. A single run with both success and failure spans is a signal that this attribute is set per-step/attempt rather than per-run. Either (a) restrict emission to the final conclusion span only, or (b) rename mid-run status to gh-aw.step.status and reserve gh-aw.run.status for the terminal value.
- Forward errors and logs to Sentry. Both datasets returned zero rows in 24h, which is almost certainly under-instrumentation rather than zero errors. Confirm exporter scope before the next reliability review so the report can include error-class evidence.
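For the local validation option in the first recommendation, a small script can count how often the missing attribute keys appear on the emit side. The sketch below assumes that each line of /tmp/gh-aw/otel.jsonl is one OTLP/JSON trace export (`resourceSpans` → `scopeSpans` → `spans`, with attributes as `{"key": ..., "value": ...}` pairs); the actual file layout is not confirmed by this report, and the sample payload is hypothetical.

```python
import json

WANTED = ("gen_ai.response.finish_reasons", "gh-aw.run.conclusion", "service.version")

def iter_attribute_keys(export):
    """Yield every resource-level and span-level attribute key in one OTLP/JSON export."""
    for rs in export.get("resourceSpans", []):
        for attr in rs.get("resource", {}).get("attributes", []):
            yield attr["key"]
        for ss in rs.get("scopeSpans", []):
            for span in ss.get("spans", []):
                for attr in span.get("attributes", []):
                    yield attr["key"]

def count_keys(lines, wanted=WANTED):
    """Count occurrences of each wanted key across JSONL lines (one export per line)."""
    counts = {key: 0 for key in wanted}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for key in iter_attribute_keys(json.loads(line)):
            if key in counts:
                counts[key] += 1
    return counts

# Hypothetical export line, for illustration only.
sample = [json.dumps({
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.version", "value": {"stringValue": "0.0.0"}},
        ]},
        "scopeSpans": [{"spans": [{
            "name": "example",
            "attributes": [
                {"key": "gh-aw.run.status", "value": {"stringValue": "success"}},
            ],
        }]}],
    }]
})]
counts = count_keys(sample)
print(counts)

# Against the real file (layout assumed, see above):
# with open("/tmp/gh-aw/otel.jsonl") as f:
#     print(count_keys(f))
```

A non-zero count locally combined with zero results in Sentry would point at ingest or indexing; a zero count locally would point at the emit path instead.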
Notes
- Inconclusive runtime outcome for the latency outliers. All 7 spans > 15 min are marked gh-aw.run.status:success, but without gen_ai.response.finish_reasons or OTLP status.code, we cannot distinguish a long successful run from a runaway one. Treat the latency table as observation-only until conclusion attributes are queryable.
- span.status is null in 100% of sampled spans. This is the OTLP status.code mapping in Sentry; emit-side sets it at send_otlp_span.cjs:295, but it is not appearing in the index. Suggests the same ingest gap as release / service.version.
- gh-aw.run.conclusion is not queryable. Considered for cross-checking with gh-aw.run.status, but it returns 0 rows on has: queries.
- gen_ai.response.finish_reasons:length returned 0 rows. This is consistent with the attribute being absent entirely, not with a real absence of truncated responses. Token-heavy workflows (Test Quality Sentinel: 134M input tokens / 24h) cannot be evaluated for truncation until this is fixed.
- errors and logs datasets are empty for the project in the 24h window. This is reported as an instrumentation/forwarding gap, not as "no failures occurred."
- The Sentry MCP build available here exposes list_events only; search_events and get_trace_details were not available. Trace continuity was verified by list_events filtered on trace:<id>. All trace links above are direct UI links.
Generated by 🚨 Daily Reliability Review · ● 10.4M · ◷