perf(snippets): batch slow validators to cut snippet-validation CI wall time by kinyoklion · Pull Request #526 · launchdarkly/sdk-meta

kinyoklion · 2026-06-29T23:16:02Z

Problem

Snippet-validation CI wall time was dominated by jobs validating ~100 syntax-only sdk-docs fragments serially, one harness invocation per snippet. The per-snippet work is tiny, but each invocation re-paid full environment setup. The whole run's wall time is the slowest job:

job	was
ios-client-sdk (sdk-docs)	~2h2m
cpp-server-sdk	~1h29m
cpp-client-sdk	~1h23m
android-client-sdk (sdk-docs)	~1h6m
flutter-client-sdk	~44m
rust-server-sdk	~42m
java-server-sdk	~22m
dotnet-server-sdk	~21m
haskell-server-sdk	~18m

Approach

Add a batch mode (batch: true in runner.yaml). The Go runner resolves every matching snippet up front, groups them by (runtime, build-affecting env), builds each image once, and partitions across up to --jobs concurrent harness invocations (default NumCPU). Each invocation loops a manifest of staged snippets in one warm workspace. Native (macOS) groups run single-shard (shared host — brew/simulator/SPM caches can't be driven concurrently). Non-batch validators are untouched. A shared run_batch helper drives the manifest loop.
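The grouping and sharding step described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the `Unit` type, its fields, and the `groupUnits` and `partition` helpers are hypothetical names, and the round-robin split is an assumption about how units might be spread across shards.

```go
package main

import (
	"fmt"
	"runtime"
)

// Unit is one (snippet, check) pair to validate. All field names here are
// hypothetical; the real runner's types are not shown in this PR text.
type Unit struct {
	Snippet  string
	Check    string
	Runtime  string // e.g. "cpp-server"
	BuildEnv string // canonicalized build-affecting env, e.g. "REDIS=1"
}

// groupKey identifies units that can share one image and one warm workspace.
type groupKey struct {
	Runtime  string
	BuildEnv string
}

// groupUnits buckets units by (runtime, build-affecting env) so that
// version-pinned or redis variants never share a workspace.
func groupUnits(units []Unit) map[groupKey][]Unit {
	groups := make(map[groupKey][]Unit)
	for _, u := range units {
		k := groupKey{Runtime: u.Runtime, BuildEnv: u.BuildEnv}
		groups[k] = append(groups[k], u)
	}
	return groups
}

// partition splits one group into at most `jobs` shards, round-robin.
// Native groups are forced to a single shard because they share host state.
func partition(units []Unit, jobs int, native bool) [][]Unit {
	if native || jobs < 1 {
		jobs = 1
	}
	if jobs > len(units) {
		jobs = len(units)
	}
	shards := make([][]Unit, jobs)
	for i, u := range units {
		shards[i%jobs] = append(shards[i%jobs], u)
	}
	return shards
}

func main() {
	units := []Unit{
		{Snippet: "init", Check: "runtime", Runtime: "cpp-server"},
		{Snippet: "flags", Check: "parse", Runtime: "cpp-server"},
		{Snippet: "redis", Check: "parse", Runtime: "cpp-server", BuildEnv: "REDIS=1"},
	}
	for key, group := range groupUnits(units) {
		shards := partition(group, runtime.NumCPU(), false)
		fmt.Printf("%s [%s]: %d unit(s) in %d shard(s)\n",
			key.Runtime, key.BuildEnv, len(group), len(shards))
	}
}
```

Each shard would then become one harness invocation with its own manifest. Capping the shard count at the group size avoids launching empty invocations.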

Per-validator warm-workspace work:

rust — pre-bake the compiled dependency tree; recompile only the binary crate.
cpp (server/client + v2 variants) — pre-bake a configured CMake project so per-snippet skips the configure over the whole cpp-sdks tree.
flutter (current + v2/v3) — flutter build linux --debug (real front-end compile, no dart2js/browser) instead of flutter build web --release + headless Chromium. Snippets import only cross-platform APIs, so the linux target compiles identical code.
android — reset to the baseline scaffold between snippets and keep the gradle daemon warm.
ios — resolve the Swift Package once into shared DerivedData; xcodebuild build (compile, no simulator) for syntax-only, xcodebuild test for init.
java — pre-bake a warm maven project (deps + plugins in ~/.m2); run snippets via offline mvn -o compile + exec:java instead of mvn clean compile assembly:single from scratch.
dotnet — pre-bake a project with the package superset restored; per-snippet swaps Program.cs and rebuilds incrementally.
haskell — loop the existing warm cabal project in the batch harness.

Results — verified green on CI (run 28459913855: all 35 jobs success)

Full-run wall time: ~2h+ → ~18 min, all green.

validator	before	after (CI job time)
ios-client-sdk (sdk-docs)	~2h	~7m
cpp-server / client	~1.5h	~5m
android sdk-docs	~1h6m	~6m
flutter (x3)	~44m	~7m
rust-server	~42m	<5m
java-server	~22m	~6m
dotnet-server	~21m	~6m
haskell-server	~18m	~17m *

Full snippet coverage preserved (unit-selection logic unchanged — only dispatch differs). Also fixes a latent race in await_success_line (a process that printed the success line and exited before the next poll was read as a failure).

* haskell is now the pole, and it's image-build-bound, not per-snippet-bound — the cabal image compiles the Haskell SDK from source (now twice, for the v3 dep tree), which batching can't shorten. The remaining 5–8 min jobs are likewise dominated by cold image builds with no registry cache.

Follow-up

Cross-run Docker-layer caching (GHA cache / GHCR) is the next lever — it would cut the haskell pole and the cold-build component of every docker job, taking the wall well under 10 min. Left as a separate PR.

The validator ran one harness invocation per snippet, serially. For the syntax-only sdk-docs groups that dominate CI wall time (cpp, flutter, android, rust), the per-snippet work is tiny, but each invocation paid the full environment setup again: re-resolving SDK packages, re-configuring CMake over the whole cpp-sdks tree, cold-starting a JVM, recompiling the dependency graph from scratch.

Batch mode opts a validator in via `batch: true` in runner.yaml. The Go runner then resolves every matching (snippet, check) unit up front, groups them by (runtime, build-affecting env) so version-pinned / redis variants don't share a workspace, builds each image once, and partitions the group across up to --jobs concurrent harness invocations (default NumCPU). Each invocation gets a manifest of staged snippets and loops over them inside a single warm workspace. Non-batch validators keep the exact one-invocation-per-snippet path.

The shared `run_batch` helper in lib.sh drives the manifest loop, tallies pass/fail, and continues past failures so one bad fragment doesn't hide the rest.

Also fixes a latent race in await_success_line: a process that printed the success line and exited before the next poll was read as a failure because the loop broke on the dead pid without a final grep; syntax-only hellos that print-and-exit immediately hit this often under batch mode.

Opt the heaviest validators into batch mode and rework each harness to do its expensive setup once per job, then loop the staged snippets in a warm workspace.

Measured locally (full per-SDK run):

  rust-server       ~42m   -> ~31s
  cpp-server        ~1h29m -> ~20s
  cpp-client        ~1h23m -> ~26s
  flutter (x3)      ~44m   -> ~2m
  android sdk-docs  ~1h6m  -> ~6m

- rust: Dockerfile pre-bakes the SDK + tokio + transport dependency tree compiled once; per-snippet only recompiles the binary crate.
- cpp (server/client + v2-c/v2-cpp variants): pre-bake a CONFIGURED CMake project (default + redis) so per-snippet skips the configure over the whole cpp-sdks tree and only runs an incremental `cmake --build`. The parse-only v2 stub validators just loop gcc/g++ in one container.
- flutter (current + v2/v3): validate with `flutter build linux --debug` instead of `flutter build web --release` + headless Chromium. Both run the same Dart front-end (catching every syntax/type error), but the linux debug build stops before dart2js/AOT and needs no browser, so it finishes in ~5-7s warm vs ~27s. The snippets import only flutter/material and the LD SDK, which is cross-platform, so the linux target compiles the identical code with no divergence.
- android: reset the package dir to the baseline scaffold between snippets and keep the gradle daemon warm across the loop, so only the first snippet pays JVM + gradle startup.

iOS was the largest pole (~2h): the native harness ran, per snippet, xcodegen + `-resolvePackageDependencies` (which builds the LD SDK Swift Package) + `xcodebuild test` (which boots a simulator), even for the syntax-only sdk-docs fragments whose body never runs.

Batch the ios-client harness: set up the project and resolve the Swift Package ONCE into a shared DerivedData, then loop the staged snippets. Dispatch on SNIPPET_CHECK:

- parse (sdk-docs / experimentation): `xcodebuild build` against the iphonesimulator SDK, a compile/type-check with no simulator boot. The wrappee body lives in a never-instantiated function, so a clean compile is the signal; emit the canonical line. The swift-syntax-only scaffold now carries `env: SNIPPET_CHECK: parse` (mirroring the android syntax-only scaffolds) to select this path.
- runtime (init): `xcodebuild test` as before, booting the simulator and grepping the captured log.

Cannot be exercised locally (needs macOS); verifying on CI.

…s line

First CI run surfaced two iOS-only failures (the docker validators and android all passed):

- Concurrent shards each ran `brew install xcodegen`, colliding on Homebrew's download lock. Native validators run directly on the macOS host with no container isolation, so concurrent shards contend on shared state (the brew lock, the Simulator runtime, the SwiftPM/DerivedData caches). Run native groups single-shard; one warm workspace that resolves the Swift Package once is also the optimal shape there. Docker groups keep the worker pool, since each shard is an isolated container.
- The iOS runtime (init) path grepped the xcodebuild log for the success line but never re-emitted it to stdout, so the verify-hello-app wrapper's grep of the command output found nothing and failed the cell even though the snippet passed (`batch: 1/1 passed`). Re-emit the matched line, as the parse path already does.

These became the wall-time poles once the bigger offenders were batched (~18-22m each). Same docker + syntax-only shape, so the same treatment, and the batch worker pool gives them ~NumCPU-way concurrency on top:

- java: the old harness ran `mvn clean compile assembly:single` from scratch per snippet (re-resolving plugins/deps, building a fat jar). Pre-bake a warm maven project (deps + plugins in ~/.m2, one compile done) and run each snippet via offline `mvn -o compile` + `exec:java`. Copy the whole staged source tree so multi-file snippets (the init runner + its Main companion) compile.
- dotnet: the old harness synthesized a csproj and ran `dotnet add package` + `dotnet restore` + `dotnet run` from scratch per snippet. Pre-bake a warm project with the package superset every snippet's requirements ask for (ServerSdk + Observability + Ai + Redis + Telemetry + Consul + DynamoDB) restored and built; per-snippet just swaps Program.cs and rebuilds incrementally. The ASP.NET Core init stages its own package-less Web .csproj, so for that one the harness adds its requirements + restores.
- haskell: the warm cabal project already existed; wrap the per-snippet build/run in the batch loop. haskell-server-v3's Dockerfile COPYs the shared haskell-server harness (now batch-aware), so it gets `batch: true` too; otherwise its snippets hit the batch harness in non-batch mode and fail on the missing SNIPPET_BATCH.

Local full-SDK runs: java ~22m -> ~2m, dotnet ~21m -> ~90s, haskell ~18m -> fast.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 524b2cb.

- iOS: reset Sources to the scaffold baseline before staging each snippet. The project compiles every file under Sources/, so a differently-named .swift file from an earlier fragment could otherwise linger; mirrors the baseline reset the android and dotnet harnesses already do.
- rust: surface a failed warm build on the version-pinned path instead of swallowing it with `|| true`, so a broken re-pin is reported up front rather than as a confusing per-snippet `cargo run` error.
- haskell: delete the dead languages/haskell-server-v3/harness/run.sh. The v3 Dockerfile COPYs the shared (batch-aware) haskell-server harness, so the v3-local file was never used; it only invited the misreading that `batch: true` on v3 would hit a SNIPPET_ENTRYPOINT-only harness. Add a Dockerfile comment making the shared-harness intent explicit.

kinyoklion added 2 commits June 29, 2026 16:15

kinyoklion requested a review from a team as a code owner June 29, 2026 23:16

cursor Bot reviewed Jun 29, 2026


Comment thread snippets/validators/languages/ios-client/harness/run.sh

Comment thread snippets/validators/languages/rust/harness/run.sh Outdated

joker23 approved these changes Jun 30, 2026


cursor Bot reviewed Jun 30, 2026


Comment thread snippets/validators/languages/haskell-server-v3/runner.yaml

kinyoklion merged commit 1e0cafd into main Jun 30, 2026
42 checks passed

kinyoklion deleted the rlamb/snippets-validate-batch branch June 30, 2026 16:49

github-actions Bot mentioned this pull request Jun 30, 2026

chore: release main #529

Open

kinyoklion mentioned this pull request Jun 30, 2026

perf(snippets): cross-run Docker layer caching for validators #530

Closed

kinyoklion merged 6 commits into main from rlamb/snippets-validate-batch

2 participants
