roachtest: increase observability into test selection

### Background

The roachtest selective-test logic in `pkg/cmd/roachtest/main.go` is currently
opaque to anyone investigating a nightly run. The only signals emitted today are
two `fmt.Printf` lines, e.g. from Azure nightly build #6871 (master,
2026-04-22):
```
[05:56:21]  112 selected out of 318 successful tests.
[05:56:21]  587 out of 777 tests selected for the run!
```

These end up in the TeamCity build log but **not** in any file artifact under
`/artifacts/_runner-logs/`, so they are not picked up by the Datadog log
uploader. Concrete consequences:

1. To explain why 587/777 (~76%) tests were selected, I had to read
   [`pkg/cmd/roachtest/main.go`](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/main.go)
   (`updateSpecForSelectiveTests`) and
   [`pkg/cmd/roachtest/testselector/snowflake_query.sql`](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/testselector/snowflake_query.sql)
   line by line. There is no per-criterion breakdown for the 5 OR'd
   `selected=yes` rules in the Snowflake query (`failure_count > 0`,
   `first_run > now-N`, `last_run < now-M`, last failure was a preemption,
   `last_status='UNKNOWN'`), so we can't tell which rule dominates.
2. PR #168462 added a panic-recovery + fallback path so that a slow/unreachable
   Snowflake no longer crashes roachtest. The fallback emits one
   `error selecting tests:` line to stdout, but with no Datadog-indexed log
   we can't alert when the fallback fires. This was explicitly called out in
   [#168462 (comment)](https://github.com/cockroachdb/cockroach/pull/168462#issuecomment-4262533762)
   and deferred to a follow-up.

### Goal

1. Have `updateSpecForSelectiveTests` (and the surrounding selection code in
   `main.go`) write structured records to a file under
   `/artifacts/_runner-logs/` so that the Datadog log uploader picks them up.
   At a minimum the file should record:
   - total specs in the suite, size of the `successful` pool, and count selected
   - per-criterion attribution counts (how many tests matched each of the 5
     Snowflake `selected=yes` rules)
   - whether the Snowflake fallback path was triggered, and the underlying
     error if so
   - the resolved `--cloud`, `--suite`, `--successful-test-select-pct`,
     `TC_BUILD_BRANCH`
2. (Nice to have) A per-test or sampled-per-test breakdown so we can answer
   "why was test X selected" without re-running.
3. Once (1) is in place, add a Datadog monitor that fires when the fallback
   path is exercised, so we know when test selection is operating without
   Snowflake input.

### Alternatives considered

- Parsing the TeamCity build log out of band: fragile and not searchable.
- Plumbing a `*logger.Logger` from `main.go` only for the current two summary
  lines: this punts on the per-criterion attribution, which is the more useful
  signal.

### Additional context

- Investigation thread: <https://cockroachlabs.slack.com/archives/C026MSSL926/p1776895389245229>
- Related PR: #168462
- Triggering build (Azure nightly #6871, 587/777 selected): TeamCity build
  21322322.

_This issue was drafted by Claude (Claude Code) during a triage session with @williamchoe3._

Epic: none

Jira issue: CRDB-63176

roachtest: increase observability into test selection #168933