[DX-3769] CLI and AI Skill to Discover and Diagnose Flaky Tests (and easier test runs in general)#22125
[DX-3769] CLI and AI Skill to Discover and Diagnose Flaky Tests (and easier test runs in general)#22125
Conversation
|
👋 kalverra, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
✅ No conflicts with other open PRs targeting |
There was a problem hiding this comment.
Pull request overview
Risk Rating: MEDIUM
Adds a new nested Go module (tools/test) that provides a developer-friendly harness for running Chainlink unit tests (with optional ephemeral Postgres) plus a diagnose mode for surfacing flakes/timeouts/slow tests, and ships a corresponding agent skill/playbook.
Changes:
- Introduce
tools/testCLI withtest,gotestsum, and multi-iterationdiagnoseworkflows (including progress UI + report/log generation). - Add Postgres lifecycle management via
testcontainers-go, plus runner/analyzer logic and unit tests. - Wire repository docs + make targets + agent skill docs to make the new workflows discoverable.
Scrupulous human review recommended (high-impact areas):
tools/test/internal/db/db.go: container lifecycle, Ryuk disablement, and failure/interrupt cleanup paths.tools/test/internal/runner/runner.go: correctness of stdout/stderr capture and JSONL integrity under concurrent writes/cancellation.tools/test/go.mod: toolchain version selection and compatibility with the root module.
Reviewed changes
Copilot reviewed 28 out of 31 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/test/main.go | Entry point for the nested tools/test module CLI. |
| tools/test/internal/termstyle/termstyle.go | Shared lipgloss palette for CLI progress/summary output. |
| tools/test/internal/runner/runner.go | Implements go test, gotestsum, and the diagnose loop orchestration. |
| tools/test/internal/runner/runner_test.go | Tests diagnose arg building, results dir naming, canceled-ctx behavior, etc. |
| tools/test/internal/runner/diagnose_results_dir.go | Generates bounded-length results directory basenames. |
| tools/test/internal/runner/diagnose_progress.go | Parses go test -json to track progress and render status lines. |
| tools/test/internal/runner/diagnose_progress_test.go | Tests progress parsing/rendering helpers. |
| tools/test/internal/runner/analyze.go | Parses iteration logs into report.json/csv + per-test log files + summary rendering. |
| tools/test/internal/runner/analyze_test.go | Unit tests for flake/failure/timeout/slow classification and log capture. |
| tools/test/internal/runner/analyze_files_test.go | Unit tests for writing log files and CSV output. |
| tools/test/internal/repo/repo.go | Locates the chainlink repo root by walking up to the root go.mod. |
| tools/test/internal/repo/repo_test.go | Tests repo root discovery behavior. |
| tools/test/internal/db/db.go | Manages ephemeral Postgres via testcontainers + reset/dump/cleanup helpers. |
| tools/test/internal/db/db_test.go | Tests dump-state nil/no-container behavior. |
| tools/test/internal/config/config.go | Viper/cobra config binding and defaults for the harness. |
| tools/test/internal/config/config_test.go | Tests persistent + local flag binding into config. |
| tools/test/internal/cmd/root.go | Cobra root command, signal-aware execution, and DB setup pre-run. |
| tools/test/internal/cmd/root_test.go | Tests displayed command path formatting. |
| tools/test/internal/cmd/test.go | test subcommand (passthrough to go test). |
| tools/test/internal/cmd/gotestsum.go | gotestsum subcommand wiring. |
| tools/test/internal/cmd/diagnose.go | diagnose subcommand wiring and flags. |
| tools/test/go.mod | Declares the nested module and dependencies. |
| tools/test/go.sum | Adds dependency checksums for the nested module. |
| tools/test/README.md | Basic usage docs for the new harness. |
| tools/test/AGENTS.md | Agent-facing constraints/goals/commands for working on this tool. |
| tools/test/fixing-flaky-tests.md | General playbook for diagnosing/fixing flakes. |
| tools/README.md | Adds a pointer to the new tools/test harness. |
| GNUmakefile | Adds new_test, new_gotestsum, and new_test_diagnose targets. |
| .gitignore | Ignores diagnose-* output dirs while explicitly keeping the agent skill directory. |
| .agents/skills/diagnose-tests/SKILL.md | Adds an AI skill for running diagnose and using its report/logs to drive fixes. |
|
| # Note: do not use "make target -p 4 ..." — -p is a make flag; use ARGS= instead. | ||
| .PHONY: new_test | ||
| new_test: ## tools/test: passthrough go test. Usage: make new_test ARGS="-v -p 4 ./core/..." | ||
| go -C ./tools/test run . test $(ARGS) |
There was a problem hiding this comment.
Is it necessary to use -C, rather than e.g.:
| go -C ./tools/test run . test $(ARGS) | |
| go run ./tools/test test $(ARGS) |
There was a problem hiding this comment.
The tools/test is a submodule, so trying to run it from /chainlink root that way makes go confused.
❯ go run ./tools/test test ./core/bridges/...
main module (github.com/smartcontractkit/chainlink/v2) does not contain package github.com/smartcontractkit/chainlink/v2/tools/testThere was a problem hiding this comment.
ah so I bet we could use go tool
There was a problem hiding this comment.
This, in combo with a relative replace should work:
Line 440 in 1756b5e
Then you can do:
| go -C ./tools/test run . test $(ARGS) | |
| go tool test test $(ARGS) |
And regardless, maybe consider differentiating the command and sub-command names?
There was a problem hiding this comment.
Good idea! I've now implemented this approach, so you can call with
# Use vanilla go test commands
go tool test run -v -count=1 -p 4 ./core/...
# Use gotestsum as the runner
go tool test gotestsum --format=dots -- -count=1 ./core/...
# Run the full core test suite 10 times and collect statistics, debug logs, and more
go tool test diagnose --iterations 10 -- --timeout=15m ./core/... Pros
- Smoother approach
Cons
- This does mean we pull in dependencies from
tools/test/go.modwhenever we rungo mod tidyfrom the root/chainlinkrepo.
sebawo
left a comment
There was a problem hiding this comment.
Thanks for putting this together. I found two issues that look worth fixing before merge: one can leave orphaned Postgres containers when gotestsum is missing, and one can hide package-level failures from diagnose reports.
| }() | ||
|
|
||
| if conf.Iterations < 1 { | ||
| return errors.New("--iterations must be >= 1") |
There was a problem hiding this comment.
That's already set. Do you mean if someone passes in --iterations -2 we should just default to 1?
| // Gotestsum runs `gotestsum` with the given args (repo root as working directory). | ||
| func Gotestsum(ctx context.Context, conf *config.App, args []string) error { | ||
| if _, err := exec.LookPath("gotestsum"); err != nil { | ||
| return fmt.Errorf("gotestsum not on PATH: install with go install gotest.tools/gotestsum@latest: %w", err) |
There was a problem hiding this comment.
could we install it instead of failing? if so, should we support asdf and run reshim?
| } | ||
|
|
||
| interrupted := ctx.Err() != nil | ||
| if interrupted && !conf.AIOutput { |
There was a problem hiding this comment.
probably not worth to have a separate package for logging that would know how to handle AIOutput and we wouldn't have to have these ifs all over?
There was a problem hiding this comment.
I was considering it.
|





Includes a new tool,
tools/testthat provides a simpler way for devs to run /chainlink unit tests, and includes features to help hunt down flakes and races.Simpler Test Runs
Running tests in /chainlink requires chaining together 2-3 make commands, and you can miss steps. Not anymore!
Why not just
go test ./...?I'd love to, but the problem is overhead.
go testdoesn't have any sort of "before all test runs" control. So, to makego test ./...work, we'd need to launch a new PostgreSQL test container for every package that needs a real DB. This can cause serious memory, processing, and runtime overhead.It is possible this isn't a big deal, and we could reduce/eliminate this concern with some refactoring, but that's outside the scope of this approach.
diagnoseYou can now easily diagnose test issues with a single command, re-running tests, packages, and suites, and generating summarized results. With one command, you can exhaustively search for:
/diagnose-testsAI SkillIt contains steps for the agent to run various survey and test commands, review logs and code, and provide fixes for issues you encounter.
Note: Current sandbox and security restrictions mean that most agents will have to ask you to run the commands for them. Working with the security team on a workaround.
Note: Most agents will discover this skill automatically. But Claude Code won't, you'll need to point it to the file.
Future Plans/Improvements
/diagnose-testsas I use it more, include more advice and restrictionsgo installit here?