[DX-3769] CLI and AI Skill to Discover and Diagnose Flaky Tests (and easier test runs in general) by kalverra · Pull Request #22125 · smartcontractkit/chainlink

kalverra · 2026-04-22T01:41:22Z

Adds a new tool, tools/test, that gives devs a simpler way to run /chainlink unit tests and includes features to help hunt down flakes and races.

Simpler Test Runs

Running tests in /chainlink requires chaining together 2-3 make commands, and you can miss steps. Not anymore!

# Help menu for all available commands
go tool test -h

# Spin up a Postgres container, run go tests, and tear it down all with one command!
go tool test run -v -count=1 -p 4 ./core/...
# Or use make
make new_test ARGS="-v -p 4 ./core/..."

# Use gotestsum if you prefer!
go tool test gotestsum --format=dots -- -count=1 ./core/...
# Even with make!
make new_gotestsum ARGS="--format=dots -- -count=1 ./core/..."

Why not just `go test ./...`?

I'd love to, but the problem is overhead.

go test doesn't have any sort of "before all test runs" control. So, to make go test ./... work, we'd need to launch a new PostgreSQL test container for every package that needs a real DB. This can cause serious memory, processing, and runtime overhead.

It is possible this isn't a big deal, and we could reduce/eliminate this concern with some refactoring, but that's outside the scope of this approach.
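To make the overhead argument concrete, here is a minimal sketch of the "before all" shape the tool is aiming for: start the database once, run every package against it, and tear it down once. The `harness` type, its fields, and `demo` are hypothetical names for illustration and are not taken from tools/test; the real tool launches Postgres and shells out to go test, which this sketch replaces with injected fakes so it runs anywhere.

```go
package main

import (
	"errors"
	"fmt"
)

// harness is a hypothetical stand-in for the "before all test runs" hook
// that plain `go test` does not offer. Both fields are injected so the
// sketch runs without Docker or a nested toolchain invocation.
type harness struct {
	startDB func() (stop func(), err error) // e.g. launch one Postgres container
	runPkg  func(pkg string) error          // e.g. exec `go test <pkg>`
}

// runAll starts the database once, runs every package against it, and
// tears the database down exactly once, even when some packages fail.
func (h harness) runAll(pkgs []string) error {
	stop, err := h.startDB()
	if err != nil {
		return fmt.Errorf("start db: %w", err)
	}
	defer stop()

	var errs []error
	for _, p := range pkgs {
		if err := h.runPkg(p); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", p, err))
		}
	}
	return errors.Join(errs...)
}

// demo wires the harness to counting fakes and reports how many times the
// database was started and stopped for the given packages.
func demo(pkgs []string) (starts, stops int, err error) {
	h := harness{
		startDB: func() (func(), error) {
			starts++
			return func() { stops++ }, nil
		},
		runPkg: func(pkg string) error {
			if pkg == "./core/broken" {
				return errors.New("tests failed")
			}
			return nil
		},
	}
	err = h.runAll(pkgs)
	return starts, stops, err
}

func main() {
	starts, stops, err := demo([]string{"./core/a", "./core/b", "./core/broken"})
	fmt.Println("db starts:", starts, "db stops:", stops, "err:", err)
}
```

With a per-package hook such as TestMain, the start and stop counts would instead grow with the number of packages that need a real DB, which is the overhead described above.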

`diagnose`

You can now easily diagnose test issues with a single command, re-running tests, packages, and suites, and generating summarized results. With one command, you can exhaustively search for:

Flakes
Races
Slow tests
Deadlocks

# Run a package 25 times to look for slow tests (30s or more)
go tool test diagnose --iterations 25 --slow-threshold 30s -- ./core/services/ocr2/plugins/ocr2keeper/...
# Run all of core 10 times with race (will take a while!)
go tool test diagnose --iterations 10 -- --timeout=10m -race ./core/...
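As a rough illustration of what repeated iterations make possible, the sketch below classifies one test from its results across runs. The `run` and `verdict` types and the exact rules are assumptions for illustration; the classification in tools/test may differ.

```go
package main

import (
	"fmt"
	"time"
)

// run is one observed execution of a test in one iteration (hypothetical shape).
type run struct {
	passed  bool
	elapsed time.Duration
}

// verdict summarizes a single test across all iterations.
type verdict struct {
	flaky   bool // passed in some iterations and failed in others
	failing bool // never passed
	slow    bool // at least one run met or exceeded the threshold
}

// classify applies the assumed rules above. A zero or negative threshold
// disables the slow check.
func classify(runs []run, slowThreshold time.Duration) verdict {
	var passes, fails int
	var v verdict
	for _, r := range runs {
		if r.passed {
			passes++
		} else {
			fails++
		}
		if slowThreshold > 0 && r.elapsed >= slowThreshold {
			v.slow = true
		}
	}
	v.flaky = passes > 0 && fails > 0
	v.failing = passes == 0 && fails > 0
	return v
}

func main() {
	runs := []run{
		{passed: true, elapsed: 2 * time.Second},
		{passed: false, elapsed: 31 * time.Second},
		{passed: true, elapsed: 3 * time.Second},
	}
	fmt.Printf("%+v\n", classify(runs, 30*time.Second))
}
```

A single run cannot distinguish a flaky test from a failing one, which is why the iteration count matters.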

`/diagnose-tests` AI Skill

The skill contains steps for the agent to run various survey and test commands, review logs and code, and provide fixes for issues you encounter.

Note: Current sandbox and security restrictions mean that most agents will have to ask you to run the commands for them. I'm working with the security team on a workaround.

Note: Most agents will discover this skill automatically, but Claude Code won't; you'll need to point it to the file.

Future Plans/Improvements

Refine /diagnose-tests as I use it more, include more advice and restrictions
Maybe pull most of the test run logic into a separate repo and just go install it here?

github-actions · 2026-04-22T01:42:35Z

✅ No conflicts with other open PRs targeting develop

Copilot

Pull request overview

Risk Rating: MEDIUM

Adds a new nested Go module (tools/test) that provides a developer-friendly harness for running Chainlink unit tests (with optional ephemeral Postgres) plus a diagnose mode for surfacing flakes/timeouts/slow tests, and ships a corresponding agent skill/playbook.

Changes:

Introduce tools/test CLI with test, gotestsum, and multi-iteration diagnose workflows (including progress UI + report/log generation).
Add Postgres lifecycle management via testcontainers-go, plus runner/analyzer logic and unit tests.
Wire repository docs + make targets + agent skill docs to make the new workflows discoverable.

Scrupulous human review recommended (high-impact areas):

tools/test/internal/db/db.go: container lifecycle, Ryuk disablement, and failure/interrupt cleanup paths.
tools/test/internal/runner/runner.go: correctness of stdout/stderr capture and JSONL integrity under concurrent writes/cancellation.
tools/test/go.mod: toolchain version selection and compatibility with the root module.

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 7 comments.

File	Description
tools/test/main.go	Entry point for the nested `tools/test` module CLI.
tools/test/internal/termstyle/termstyle.go	Shared lipgloss palette for CLI progress/summary output.
tools/test/internal/runner/runner.go	Implements `go test`, `gotestsum`, and the `diagnose` loop orchestration.
tools/test/internal/runner/runner_test.go	Tests diagnose arg building, results dir naming, canceled-ctx behavior, etc.
tools/test/internal/runner/diagnose_results_dir.go	Generates bounded-length results directory basenames.
tools/test/internal/runner/diagnose_progress.go	Parses `go test -json` to track progress and render status lines.
tools/test/internal/runner/diagnose_progress_test.go	Tests progress parsing/rendering helpers.
tools/test/internal/runner/analyze.go	Parses iteration logs into report.json/csv + per-test log files + summary rendering.
tools/test/internal/runner/analyze_test.go	Unit tests for flake/failure/timeout/slow classification and log capture.
tools/test/internal/runner/analyze_files_test.go	Unit tests for writing log files and CSV output.
tools/test/internal/repo/repo.go	Locates the chainlink repo root by walking up to the root `go.mod`.
tools/test/internal/repo/repo_test.go	Tests repo root discovery behavior.
tools/test/internal/db/db.go	Manages ephemeral Postgres via testcontainers + reset/dump/cleanup helpers.
tools/test/internal/db/db_test.go	Tests dump-state nil/no-container behavior.
tools/test/internal/config/config.go	Viper/cobra config binding and defaults for the harness.
tools/test/internal/config/config_test.go	Tests persistent + local flag binding into config.
tools/test/internal/cmd/root.go	Cobra root command, signal-aware execution, and DB setup pre-run.
tools/test/internal/cmd/root_test.go	Tests displayed command path formatting.
tools/test/internal/cmd/test.go	`test` subcommand (passthrough to `go test`).
tools/test/internal/cmd/gotestsum.go	`gotestsum` subcommand wiring.
tools/test/internal/cmd/diagnose.go	`diagnose` subcommand wiring and flags.
tools/test/go.mod	Declares the nested module and dependencies.
tools/test/go.sum	Adds dependency checksums for the nested module.
tools/test/README.md	Basic usage docs for the new harness.
tools/test/AGENTS.md	Agent-facing constraints/goals/commands for working on this tool.
tools/test/fixing-flaky-tests.md	General playbook for diagnosing/fixing flakes.
tools/README.md	Adds a pointer to the new `tools/test` harness.
GNUmakefile	Adds `new_test`, `new_gotestsum`, and `new_test_diagnose` targets.
.gitignore	Ignores `diagnose-*` output dirs while explicitly keeping the agent skill directory.
.agents/skills/diagnose-tests/SKILL.md	Adds an AI skill for running `diagnose` and using its report/logs to drive fixes.

trunk-io · 2026-04-22T01:58:20Z

Failed Test	Failure Summary	Logs
`TestNewHTTPClient_PortRanges`	The test 'TestNewHTTPClient_PortRanges' failed during execution.	Logs ↗︎
`TestConfigDocs`	The test failed because the actual documentation content did not match the expected documentation string.	Logs ↗︎
`TestConfigDocs`	The test failed because the actual documentation content did not match the expected documentation string.	Logs ↗︎
`TestIntegration_KeeperPluginLogUpkeep_ErrHandler`	The test failed without a specific error message, indicating an unspecified failure during execution.	Logs ↗︎

... and 1 more

jmank88 · 2026-04-22T02:22:31Z

+# Note: do not use "make target -p 4 ..." — -p is a make flag; use ARGS= instead.
+.PHONY: new_test
+new_test: ## tools/test: passthrough go test. Usage: make new_test ARGS="-v -p 4 ./core/..."
+	go -C ./tools/test run . test $(ARGS)


Is it necessary to use -C, rather than e.g.:

Suggested change

- go -C ./tools/test run . test $(ARGS)
+ go run ./tools/test test $(ARGS)

tools/test is a nested module with its own go.mod, so running it that way from the /chainlink root confuses go.

❯ go run ./tools/test test ./core/bridges/...
main module (github.com/smartcontractkit/chainlink/v2) does not contain package github.com/smartcontractkit/chainlink/v2/tools/test

ah so I bet we could use go tool

This, in combo with a relative replace should work:

chainlink/go.mod

Line 440 in 1756b5e

tool github.com/smartcontractkit/chainlink-common/pkg/loop/cmd/loopinstall

Then you can do:

Suggested change

- go -C ./tools/test run . test $(ARGS)
+ go tool test test $(ARGS)

And regardless, maybe consider differentiating the command and sub-command names?

Good idea! I've now implemented this approach, so you can call it with:

# Use vanilla go test commands
go tool test run -v -count=1 -p 4 ./core/...
# Use gotestsum as the runner
go tool test gotestsum --format=dots -- -count=1 ./core/...
# Run the full core test suite 10 times and collect statistics, debug logs, and more
go tool test diagnose --iterations 10 -- --timeout=15m ./core/...

Pros

Smoother approach

Cons

This does mean we pull in dependencies from tools/test/go.mod whenever we run go mod tidy from the root /chainlink repo.

sebawo

Thanks for putting this together. I found two issues that look worth fixing before merge: one can leave orphaned Postgres containers when gotestsum is missing, and one can hide package-level failures from diagnose reports.

Tofel · 2026-04-27T14:00:22Z

+		}()
+
+		if conf.Iterations < 1 {
+			return errors.New("--iterations must be >= 1")


nit: or default to 1?

That's already set. Do you mean if someone passes in --iterations -2 we should just default to 1?
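For reference, the two behaviours being weighed here can be sketched side by side. `validateIterations` mirrors the check in the diff above; `clampIterations` is a hypothetical name for the "default to 1" alternative and is not part of the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// validateIterations mirrors the check in the diff: anything below 1 is
// rejected with an error so the caller notices the bad flag value.
func validateIterations(n int) (int, error) {
	if n < 1 {
		return 0, errors.New("--iterations must be >= 1")
	}
	return n, nil
}

// clampIterations is the alternative raised in review (hypothetical name):
// silently fall back to a single iteration instead of failing.
func clampIterations(n int) int {
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	_, err := validateIterations(-2)
	fmt.Println("strict:", err)
	fmt.Println("clamped:", clampIterations(-2))
}
```

Rejecting the value surfaces typos such as --iterations -2 immediately, while clamping keeps the run going at the cost of hiding the mistake.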

…testRunner

Tofel · 2026-04-27T15:34:54Z

+// Gotestsum runs `gotestsum` with the given args (repo root as working directory).
+func Gotestsum(ctx context.Context, conf *config.App, args []string) error {
+	if _, err := exec.LookPath("gotestsum"); err != nil {
+		return fmt.Errorf("gotestsum not on PATH: install with go install gotest.tools/gotestsum@latest: %w", err)


Could we install it instead of failing? If so, should we support asdf and run reshim?
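One possible shape for that fallback, as a sketch only: look the tool up, try an install if it is missing, then look it up again. `ensureTool`, `goInstall`, and `fakeLookPath` are hypothetical names; the lookup and install steps are injected so the example runs without touching the real PATH. Real use would pass exec.LookPath and goInstall("gotest.tools/gotestsum@latest"), the same install command the current error message suggests. asdf reshim handling is left out.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// ensureTool returns the path to name, calling install once if the first
// lookup fails and then looking the tool up again.
func ensureTool(name string, lookPath func(string) (string, error), install func() error) (string, error) {
	if p, err := lookPath(name); err == nil {
		return p, nil
	}
	if err := install(); err != nil {
		return "", fmt.Errorf("%s not on PATH and install failed: %w", name, err)
	}
	p, err := lookPath(name)
	if err != nil {
		// A version manager may need a reshim before the binary is visible.
		return "", fmt.Errorf("%s installed but still not on PATH: %w", name, err)
	}
	return p, nil
}

// goInstall returns an installer that shells out to `go install <pkg>`.
// It is defined for completeness and is not called by main below.
func goInstall(pkg string) func() error {
	return func() error { return exec.Command("go", "install", pkg).Run() }
}

// fakeLookPath simulates a PATH lookup that succeeds only once *installed is true.
func fakeLookPath(installed *bool) func(string) (string, error) {
	return func(name string) (string, error) {
		if *installed {
			return "/fake/bin/" + name, nil
		}
		return "", errors.New("executable file not found")
	}
}

func main() {
	installed := false
	path, err := ensureTool("gotestsum", fakeLookPath(&installed), func() error {
		installed = true
		return nil
	})
	fmt.Println(path, err)
}
```

Running this check before the Postgres container starts would also avoid leaving a container behind when the tool is missing.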

Tofel · 2026-04-27T15:37:09Z

+	}
+
+	interrupted := ctx.Err() != nil
+	if interrupted && !conf.AIOutput {


Would it be worth having a separate logging package that knows how to handle AIOutput, so we wouldn't need these ifs all over?

I was considering it.
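A minimal sketch of what such a helper could look like, assuming AIOutput means that decorative, human-oriented lines are suppressed (as the `!conf.AIOutput` check in the diff suggests). The `printer` type, its methods, and the messages are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// printer is a hypothetical helper that owns the AIOutput decision so call
// sites do not need their own `if !conf.AIOutput` checks.
type printer struct {
	w        io.Writer
	aiOutput bool
}

// Human writes human-oriented status output; it is a no-op in AI mode.
func (p printer) Human(format string, args ...any) {
	if p.aiOutput {
		return
	}
	fmt.Fprintf(p.w, format+"\n", args...)
}

// Always writes output that both humans and agents need.
func (p printer) Always(format string, args ...any) {
	fmt.Fprintf(p.w, format+"\n", args...)
}

// render shows the same call sequence in both modes and returns what was written.
func render(aiOutput bool) string {
	var sb strings.Builder
	p := printer{w: &sb, aiOutput: aiOutput}
	p.Human("interrupted, cleaning up")
	p.Always("report written")
	return sb.String()
}

func main() {
	fmt.Print("human mode:\n", render(false))
	fmt.Print("ai mode:\n", render(true))
}
```

The call sites stay identical in both modes; only the printer's configuration changes.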

cl-sonarqube-production · 2026-04-27T16:14:22Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

kalverra added 11 commits April 15, 2026 22:43

Absolute basics

c068b15

Fix config

ed71dc9

better survey execution

bc3f391

add signal processing

73398dd

add fail fast

1c16c2e

Make it pretty

0a8345c

More analysis

60b0378

iteration summaries

4f1065f

moar connections

e9dc18f

diagnose

b985c08

Better args input

f18157c

Copilot AI review requested due to automatic review settings April 22, 2026 01:41

kalverra requested review from a team as code owners April 22, 2026 01:41

product-security-plaid-production Bot requested a review from pavel-raykov April 22, 2026 01:41

Copilot started reviewing on behalf of kalverra April 22, 2026 01:42

Copilot AI reviewed Apr 22, 2026

kalverra added 2 commits April 21, 2026 21:46

lint

3efe67d

Comments

974cf66

kalverra added 2 commits April 21, 2026 22:06

-count flags

2c56ba2

errors

1d06813

jmank88 reviewed Apr 22, 2026

kalverra mentioned this pull request Apr 22, 2026

Adding fix-flaky-go-test skill #22010

Closed

Linting

a4d4644

kalverra requested review from Copilot and jmank88 April 22, 2026 14:36

Copilot started reviewing on behalf of kalverra April 22, 2026 14:38

kalverra added 6 commits April 22, 2026 13:50

Generate and tidy

fc3a2a5

testcontainers-go v0.41.0

2ad6b28

testcontainers-go v0.41.0

9e36ba8

testcontainers-go v0.37.0

fd021df

testcontainers-go v0.42.0

de9e70c

Pin moby

bc8c95d

kalverra requested a review from Fletch153 April 23, 2026 14:43

kalverra added 8 commits April 24, 2026 09:56

Pin buildx version

dbd1a04

Consolidate

1bfff70

Merge

5349a98

Merge and clean

7eae135

Merge

1dc5ed9

More pins

a30de44

Revert

bfdd0e1

Remove pins?

e5222dd

sebawo requested changes Apr 27, 2026

Comment thread tools/test/internal/runner/analyze.go

Comment thread tools/test/internal/cmd/gotestsum.go Outdated