loader: memoize includes to avoid re-expanding diamond include graphs #886
davireis wants to merge 1 commit
ApplyInclude re-parses and recursively re-expands an included file once per include path that reaches it. When the same file is reached through more than one path (a "diamond" in the include graph) this is quadratic-to-exponential: a 24-level doubling graph loads the leaf 2^24 times. Monorepos that aggregate per-target / per-project compose fragments hit this in practice (an ~80-service federation took ~55s in `docker compose config`).

Memoize each loaded include model for the duration of a single load, keyed on every input that determines it (resolved paths, working dir, project dir, and effective environment), and hand out a deep copy on each hit. The merge into the parent (importResources) still runs for every occurrence, so a same-file `extends` in the including file still resolves and the result is identical to loading each time; only the parse + recursive expansion is shared.

Keying on the working dir matters: the same file reached through two parents can have a different relative base, yielding models with different relative paths; reusing across bases would let the caller rebase an already-resolved path.

Cycle-safe: an include cycle is intrinsic to a node's subtree, so it is detected on the node's first load, before it can be cached.

Adds a deep-diamond regression test (times out without the cache) and a benchmark.

Signed-off-by: Davi de Castro Reis <davi@davi.eng.br>
08775c8 to e8f62dd
glours left a comment
Thanks for the contribution. Three inline points to address.
- A cache-key separator collision that's fixable with length-prefixed hashing.
- More importantly, the `extends` listener events emitted inside `loadYamlModel` are silently dropped on cache hits, which changes the public `Listener` contract (cf. `TestLoadExtendsListener` asserting an exact count). Worth treating as a blocker.
- A time-bound guard on the diamond test, so a future cache regression fails fast with a clear message instead of hanging CI for the default 10-minute timeout.
The overall approach (per-load cache, env+paths+workingDir+projectDir as key, deep-copy on get) is the right shape: cycle safety, deep-copy completeness, and the unexported context key all check out.
    for _, p := range paths {
        _, _ = h.Write([]byte(p))
        _, _ = h.Write([]byte{0})
    }
    _, _ = h.Write([]byte{1})
    _, _ = h.Write([]byte(workingDir))
    _, _ = h.Write([]byte{1})
    _, _ = h.Write([]byte(projectDir))
    _, _ = h.Write([]byte{1})
    keys := make([]string, 0, len(env))
    for k := range env {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        _, _ = h.Write([]byte(k))
        _, _ = h.Write([]byte{0})
        _, _ = h.Write([]byte(env[k]))
        _, _ = h.Write([]byte{0})
    }
The single-byte separators written here (`[]byte{0}` on lines 99, 113, 115 and `[]byte{1}` on lines 101, 103, 105) are ambiguous when env keys or values contain those bytes. `types.Mapping` is a `map[string]string` populated from .env files and the process env, where embedded NULs aren't forbidden. Two distinct (paths, env) tuples can then produce the same byte stream and hash to the same key: the wrong cached model is served and no error is surfaced.
Toy example: `env = {"A\x00B": "X"}` and `env = {"A": "B\x00X"}` both serialize to `…A\x00B\x00X\x00` through the loop on lines 111-116.
Length-prefixing each field removes the ambiguity:
    write := func(s string) {
        fmt.Fprintf(h, "%d:", len(s))
        _, _ = h.Write([]byte(s))
    }
    for _, p := range paths {
        write(p)
    }
    write(workingDir)
    write(projectDir)
    for _, k := range keys {
        write(k)
        write(env[k])
    }
While you're here, add a short comment explaining why Substitute / TypeCastMapping are intentionally excluded from the key (they are invariant across includes within a single load). That would keep a future contributor who adds a per-subtree option from silently introducing collisions. Something like:
    // Note: Substitute and TypeCastMapping are intentionally excluded from the key.
    // They are invariant across includes within a single Load (cloned unchanged from
    // the top-level options at the call site above). If a future option allows them
    // to vary per include, they must be folded into the key.

    imported, ok := cache.get(key)
    if !ok {
        imported, err = loadYamlModel(ctx, config, loadOptions, &cycleTracker{}, included)
        if err != nil {
            return err
        }
        cache.put(key, imported)
There is a silent change to the public `Listener` contract here.
When `cache.get(key)` returns ok at line 274, the `if !ok` block (lines 275-281) is skipped, so `loadYamlModel(ctx, config, loadOptions, &cycleTracker{}, included)` on line 276 does not run on cache hits. The `extends` events emitted from inside that call path (see `opts.ProcessEvent("extends", v)` at loader/extends.go:76 and `opts.ProcessEvent("extends", map[string]any{"service": ref})` at extends.go:79) are therefore not re-emitted for any subsequent diamond traversal of the same included file. Compare with the include listener at include.go:175, which fires per occurrence (it's emitted in the outer loop, before the cache lookup): two events documented in the same pipeline now behave asymmetrically.
That isn't a cosmetic observability change:
- `Listener` is part of the public API (loader/loader.go:109), and downstream tools wire into it via cli/options.go:83.
- The existing test `TestLoadExtendsListener` at loader/extends_test.go:440 asserts `assert.Equal(t, extendsCount, 3)`, which establishes the contract: one event per resolved `extends`. Diamond include topologies break that invariant after this PR, and silently, because no test in this repo combines diamond includes with extends listeners.
- Any consumer counting extends events for telemetry, dependency tracking, audit, or progress reporting will see different counts depending on the include topology, exactly the kind of regression that surfaces months later as a confused user report.
I'd suggest preserving the contract by recording events on first load and replaying them on cache hit.
    func TestIncludeDiamondDedup(t *testing.T) {
        dir := t.TempDir()
        const depth = 24 // 2^24 ~= 16.7M leaf loads without dedup
        for i := 0; i < depth; i++ {
            content := fmt.Sprintf("include:\n  - path: ./level%d.yaml\n  - path: ./level%d.yaml\n", i+1, i+1)
            assert.NilError(t, os.WriteFile(filepath.Join(dir, fmt.Sprintf("level%d.yaml", i)), []byte(content), 0o600))
        }
        leaf := "services:\n  leaf:\n    image: busybox\n"
        assert.NilError(t, os.WriteFile(filepath.Join(dir, fmt.Sprintf("level%d.yaml", depth)), []byte(leaf), 0o600))

        p, err := LoadWithContext(context.TODO(), types.ConfigDetails{
            WorkingDir:  dir,
            ConfigFiles: []types.ConfigFile{{Filename: filepath.Join(dir, "level0.yaml")}},
        }, withProjectName("diamond", true))
        assert.NilError(t, err)
        _, err = p.GetService("leaf")
        assert.NilError(t, err)
    }
Nice regression test, but the only assertions are `assert.NilError(t, err)` on line 267 and the `GetService("leaf")` check on lines 268-269. If the cache regresses, the `LoadWithContext` call on line 263 will spin on 2^24 leaf loads and the test goroutine stays parked on that line. loader/ doesn't check `ctx.Done()` anywhere, so a `context.WithTimeout` wouldn't help. The test then hangs until the default go test timeout (10 minutes) and fails with a generic timeout message: slow CI feedback and an unclear failure mode.
A fast, descriptive failure requires running the load in a goroutine and selecting on a timeout. The leaked goroutine on timeout is fine: `t.Fatal` ends the test and the process exits soon after.
    type result struct {
        p   *types.Project
        err error
    }
    // LoadWithContext doesn't properly support context today, so the timeout below is consumed by the
    // select, not by LoadWithContext. Two separate contexts keep that intent clear.
    timeout, cancel := context.WithTimeout(t.Context(), 5*time.Second)
    defer cancel()
    done := make(chan result, 1)
    go func() {
        p, err := LoadWithContext(t.Context(), types.ConfigDetails{
            WorkingDir:  dir,
            ConfigFiles: []types.ConfigFile{{Filename: filepath.Join(dir, "level0.yaml")}},
        }, withProjectName("diamond", true))
        done <- result{p, err}
    }()
    select {
    case r := <-done:
        assert.NilError(t, r.err)
        _, err := r.p.GetService("leaf")
        assert.NilError(t, err)
    case <-timeout.Done():
        t.Fatal("diamond include did not complete within 5s; cache likely not working")
    }
5s is generous; the current run completes in ~0.18s on my machine.
Problem
`ApplyInclude` re-parses and recursively re-expands an included file once per include path that reaches it. When the same file is reachable through more than one path (a "diamond" in the include graph), this is quadratic-to-exponential: a 24-level doubling graph loads the leaf 2²⁴ ≈ 16.7M times.

This shows up in real monorepos that aggregate per-target / per-project compose fragments via `include:`. A federation of ~80 services took ~55s in `docker compose config`; the cost is re-expansion, not the graph size.

Fix
Memoize each loaded include model for the duration of a single load (carried in `ctx`, so it never leaks across `Load` calls). The cache key is every input that determines the model (resolved paths, working dir, project dir, and effective environment), and a deep copy is handed out on each hit.

The merge into the parent (`importResources`) still runs for every occurrence, so:
- a same-file `extends` in the including file still resolves (the included content is present in each including scope), and
- the result is identical to loading each time; only the parse + recursive expansion is shared.

Correctness details
- `env_file` / `project_directory`: folded into the key, so the same path included with a different environment or project dir does not share a cache entry.
- The same file reached through two parents can have a different relative base (`a/b` vs `b`), which yields models with different relative paths. Keying on the working dir prevents reusing a model whose paths the caller would then rebase incorrectly.

Tests
- `TestIncludeDiamondDedup`: a depth-24 diamond that times out without the cache and completes in ~0.15s with it (both a correctness and a non-flaky perf-regression test).
- `BenchmarkIncludeDiamond`.
- `go test ./...` passes; `gofmt` / `go vet` clean.