
roachtest: increase observability into test selection #168933

@williamchoe3

Background

The roachtest selective-test logic in pkg/cmd/roachtest/main.go is currently
opaque to anyone investigating a nightly run. The only signals emitted today are
two fmt.Printf lines, e.g. from Azure nightly build #6871 (master,
2026-04-22):

[05:56:21]  112 selected out of 318 successful tests.
[05:56:21]  587 out of 777 tests selected for the run!

These end up in the TeamCity build log but not in any file artifact under
/artifacts/_runner-logs/, so they are not picked up by the Datadog log
uploader. Concrete consequences:

  1. To explain why 587/777 (~75%) tests were selected, I had to read
    pkg/cmd/roachtest/main.go
    (updateSpecForSelectiveTests) and
    pkg/cmd/roachtest/testselector/snowflake_query.sql
    line-by-line. There is no per-criterion breakdown for the 5 OR'd
    selected=yes rules in the Snowflake query (failure_count > 0,
    first_run > now-N, last_run < now-M, last failure was preempt,
    last_status='UNKNOWN') — so we can't tell which rule is dominating.
  2. PR #168462 ("roachtest: recover panic in test selection and fall back to
    executing all tests") added a panic-recovery and fallback path so that a
    slow or unreachable Snowflake no longer crashes roachtest. The fallback
    emits one "error selecting tests:" line to stdout, but with no
    Datadog-indexed log we can't alert when the fallback fires. This was
    explicitly called out in a comment on #168462 and deferred to a follow-up.

Goal

  1. Have updateSpecForSelectiveTests (and the surrounding selection code in
    main.go) write structured records to a file under
    /artifacts/_runner-logs/ so that the Datadog log uploader picks them up.
    At a minimum the file should record:
    • total specs in suite, count of successful pool, count selected
    • per-criterion attribution counts (how many tests fired each of the 5
      Snowflake selected=yes rules)
    • whether the Snowflake fallback path was triggered, and the underlying
      error if so
    • the resolved --cloud, --suite, --successful-test-select-pct,
      TC_BUILD_BRANCH
  2. (Nice to have) A per-test or sampled-per-test breakdown so we can answer
    "why was test X selected" without re-running.
  3. Once (1) is in place, add a Datadog monitor that fires when the fallback
    path is exercised, so we know when test selection is operating without
    Snowflake input.

Describe alternatives you've considered

  • Parsing the TeamCity build log out of band — fragile and not searchable.
  • Plumbing a *logger.Logger from main.go only for the current two summary
    lines — punts on the per-criterion attribution which is the more useful
    signal.

Additional context

This issue was drafted by Claude (Claude Code) during a triage session with @williamchoe3.

Epic: none

Jira issue: CRDB-63176

Metadata

Assignees

No one assigned

Labels

  • A-testeng-infra
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • O-agent: Filed by an AI agent; usually the result of a human/agent investigation session
  • O-roachtest
  • T-testeng: TestEng Team
