loader: memoize includes to avoid re-expanding diamond include graphs by davireis · Pull Request #886 · compose-spec/compose-go

davireis · 2026-06-19T14:01:15Z

Problem

ApplyInclude re-parses and recursively re-expands an included file once per include path that reaches it. When the same file is reachable through more than one path (a "diamond" in the include graph), this is quadratic-to-exponential: a 24-level doubling graph loads the leaf 2²⁴ ≈ 16.7M times.

This shows up in real monorepos that aggregate per-target / per-project compose fragments via include:. A federation of ~80 services took ~55s in docker compose config; the cost is re-expansion, not the graph size.
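The growth described above can be reproduced with a small simulation. This is only a sketch: `countLoads` and the integer "levels" are hypothetical stand-ins for files, not the loader's code. Each level includes the next one twice, and the last level is a leaf.

```go
package main

import "fmt"

// countLoads simulates loading level i of a doubling include graph: every
// level below depth includes the next level twice, and level == depth is a
// leaf. It returns the total number of file loads performed.
// When seen is non-nil, a level that was already loaded is skipped,
// which models the memoized behaviour.
func countLoads(i, depth int, seen map[int]bool) int {
	if seen != nil {
		if seen[i] {
			return 0 // cache hit: no parse, no recursive expansion
		}
		seen[i] = true
	}
	loads := 1
	if i < depth {
		loads += countLoads(i+1, depth, seen)
		loads += countLoads(i+1, depth, seen)
	}
	return loads
}

func main() {
	const depth = 10
	fmt.Println(countLoads(0, depth, nil))            // 2047 (2^11 - 1)
	fmt.Println(countLoads(0, depth, map[int]bool{})) // 11 (one load per level)
}
```

With depth 24 the first figure grows past 33 million loads in total, 2^24 of them for the leaf alone, while the memoized count stays at 25.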

Fix

Memoize each loaded include model for the duration of a single load (carried in ctx, so it never leaks across Load calls). The cache key is every input that determines the model — resolved paths, working dir, project dir, and effective environment — and a deep copy is handed out on each hit.

The merge into the parent (importResources) still runs for every occurrence, so:

a same-file extends in the including file still resolves (the included content is present in each including scope), and
the result is byte-identical to loading the file each time; only the parse + recursive expansion is shared.

Correctness details

Per-include env_file / project_directory: folded into the key, so the same path included with a different environment or project dir does not share a cache entry.
Relative-path rebasing: the same file reached through two parents can have a different relative base (e.g. a/b vs b), which yields models with different relative paths. Keying on the working dir prevents reusing a model whose paths the caller would then rebase incorrectly.
Cycle-safe: an include cycle is intrinsic to a node's subtree (the back-edge is in the fixed set of files the node includes), so it is detected on the node's first load — before it can be cached. A cyclic node is therefore never served from cache.
Listeners still fire per include entry; only the load is memoized.

Tests

TestIncludeDiamondDedup: a depth-24 diamond that times out without the cache and completes in ~0.15s with it (both a correctness and a non-flaky perf-regression test).
BenchmarkIncludeDiamond.
Full go test ./... passes; gofmt/go vet clean.

ApplyInclude re-parses and recursively re-expands an included file once per include path that reaches it. When the same file is reached through more than one path (a "diamond" in the include graph) this is quadratic-to-exponential: a 24-level doubling graph loads the leaf 2^24 times. Monorepos that aggregate per-target / per-project compose fragments hit this in practice (an ~80-service federation took ~55s in `docker compose config`).

Memoize each loaded include model for the duration of a single load, keyed on every input that determines it (resolved paths, working dir, project dir, and effective environment), and hand out a deep copy on each hit. The merge into the parent (importResources) still runs for every occurrence, so a same-file `extends` in the including file still resolves and the result is identical to loading each time; only the parse + recursive expansion is shared.

Keying on the working dir matters: the same file reached through two parents can have a different relative base, yielding models with different relative paths; reusing across bases would let the caller rebase an already-resolved path.

Cycle-safe: an include cycle is intrinsic to a node's subtree, so it is detected on the node's first load, before it can be cached.

Adds a deep-diamond regression test (times out without the cache) and a benchmark.

Signed-off-by: Davi de Castro Reis <davi@davi.eng.br>

glours

Thanks for the contribution. Three inline points to address.

A cache-key separator collision that's fixable with length-prefixed hashing.
More importantly, the extends listener events emitted inside loadYamlModel are silently dropped on cache hits, which changes the public Listener contract (cf. TestLoadExtendsListener asserting an exact count). Worth treating as a blocker.
A time-bound guard on the diamond test, so a future cache regression fails fast with a clear message instead of hanging CI for the default 10-minute timeout.

The overall approach (per-load cache, env+paths+workingDir+projectDir as key, deep-copy on get) is the right shape — cycle safety, deep-copy completeness, and the unexported context key all check out.

glours · 2026-06-29T13:34:09Z

+	for _, p := range paths {
+		_, _ = h.Write([]byte(p))
+		_, _ = h.Write([]byte{0})
+	}
+	_, _ = h.Write([]byte{1})
+	_, _ = h.Write([]byte(workingDir))
+	_, _ = h.Write([]byte{1})
+	_, _ = h.Write([]byte(projectDir))
+	_, _ = h.Write([]byte{1})
+	keys := make([]string, 0, len(env))
+	for k := range env {
+		keys = append(keys, k)
+	}
+	sort.Strings(keys)
+	for _, k := range keys {
+		_, _ = h.Write([]byte(k))
+		_, _ = h.Write([]byte{0})
+		_, _ = h.Write([]byte(env[k]))
+		_, _ = h.Write([]byte{0})
+	}


The single-byte separators written here ([]byte{0} on lines 99, 113, 115 and []byte{1} on lines 101, 103, 105) are ambiguous when env keys or values contain those bytes. types.Mapping is map[string]string populated from .env files and the process env, where embedded NULs aren't forbidden. Two distinct (paths, env) tuples can then produce the same byte stream and hash to the same key: the wrong cached model is served and no error is surfaced.

Toy example: env = {"A\x00B": "X"} and env = {"A": "B\x00X"} both serialize to …A\x00B\x00X\x00 through the loop on lines 111-116.

Length-prefixing each field removes the ambiguity:

	write := func(s string) {
		fmt.Fprintf(h, "%d:", len(s))
		_, _ = h.Write([]byte(s))
	}
	for _, p := range paths {
		write(p)
	}
	write(workingDir)
	write(projectDir)
	for _, k := range keys {
		write(k)
		write(env[k])
	}

While you're here, a short comment explaining why Substitute / TypeCastMapping are intentionally excluded from the key (they are invariant across includes within a single load) would help a future contributor who adds a per-subtree option avoid silently introducing collisions. For example:

	// Note: Substitute and TypeCastMapping are intentionally excluded from the key.
	// They are invariant across includes within a single Load (cloned unchanged from
	// the top-level options at the call site above). If a future option allows them
	// to vary per include, they must be folded into the key.
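The separator collision from the toy example can be reproduced in isolation. The sketch below is self-contained and makes two assumptions: SHA-256 stands in for whatever hash `h` is in the PR, and only the env part of the key is hashed. `separatorKey`, `lengthPrefixedKey`, and `sortedKeys` are hypothetical helper names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// sortedKeys returns the env keys in a deterministic order.
func sortedKeys(env map[string]string) []string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// separatorKey hashes env with a NUL byte after every key and value,
// mirroring the env loop in the diff.
func separatorKey(env map[string]string) string {
	h := sha256.New()
	for _, k := range sortedKeys(env) {
		_, _ = h.Write([]byte(k))
		_, _ = h.Write([]byte{0})
		_, _ = h.Write([]byte(env[k]))
		_, _ = h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// lengthPrefixedKey hashes env with every field preceded by "<len>:".
func lengthPrefixedKey(env map[string]string) string {
	h := sha256.New()
	write := func(s string) {
		fmt.Fprintf(h, "%d:", len(s))
		_, _ = h.Write([]byte(s))
	}
	for _, k := range sortedKeys(env) {
		write(k)
		write(env[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string]string{"A\x00B": "X"}
	b := map[string]string{"A": "B\x00X"}
	fmt.Println(separatorKey(a) == separatorKey(b))           // true
	fmt.Println(lengthPrefixedKey(a) == lengthPrefixedKey(b)) // false
}
```

One further hardening that may be worth considering: since the number of paths varies, three paths with an empty env and one path with a single env entry can still line up as the same sequence of five length-prefixed fields. Writing the path count (and env entry count) before each group would rule that out.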

glours · 2026-06-29T13:38:40Z

+		imported, ok := cache.get(key)
+		if !ok {
+			imported, err = loadYamlModel(ctx, config, loadOptions, &cycleTracker{}, included)
+			if err != nil {
+				return err
+			}
+			cache.put(key, imported)


There is a silent change to the public Listener contract here.

When cache.get(key) returns ok at line 274, the if !ok block (lines 275-281) is skipped, so loadYamlModel(ctx, config, loadOptions, &cycleTracker{}, included) on line 276 does not run on cache hits. The extends events emitted from inside that call path (see opts.ProcessEvent("extends", v) at loader/extends.go:76 and opts.ProcessEvent("extends", map[string]any{"service": ref}) at extends.go:79) are therefore not re-emitted for any subsequent diamond traversal of the same included file. Compare with the include listener at include.go:175, which fires per occurrence (it's emitted in the outer loop, before the cache lookup): two events documented in the same pipeline now behave asymmetrically.

That isn't a cosmetic observability change:

Listener is part of the public API (loader/loader.go:109), and downstream tools wire into it via cli/options.go:83.

The existing test TestLoadExtendsListener at loader/extends_test.go:440 asserts assert.Equal(t, extendsCount, 3), which establishes the contract: one event per resolved extends. Diamond include topologies break that invariant after this PR, and silently, because no test in this repo combines diamond includes with extends listeners.

Any consumer counting extends events for telemetry, dependency tracking, audit, or progress reporting will see different counts depending on the include topology, which is exactly the kind of regression that surfaces months later as a confused user report.

I'd suggest preserving the contract by recording events on first load and replaying them on cache hit.
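A rough sketch of the record-and-replay idea, under stated assumptions: `Listener` here is a local stand-in for the loader's callback type, `cacheEntry`, `recordedEvent`, and `loadMemoized` are hypothetical names, and the deep copy of the model is left out for brevity.

```go
package main

import "fmt"

// Listener is a stand-in for the loader's listener callback type.
type Listener func(event string, metadata map[string]any)

// recordedEvent is one event captured during the first load of an include.
type recordedEvent struct {
	name     string
	metadata map[string]any
}

// cacheEntry keeps the loaded model together with the events it emitted.
type cacheEntry struct {
	model  map[string]any
	events []recordedEvent
}

// loadMemoized runs load once per key. On the first load it records every
// event while still forwarding it to listener; on a cache hit it replays the
// recorded events so the listener sees the same sequence per occurrence.
func loadMemoized(cache map[string]*cacheEntry, key string, listener Listener,
	load func(Listener) map[string]any) map[string]any {
	if entry, ok := cache[key]; ok {
		for _, e := range entry.events {
			listener(e.name, e.metadata)
		}
		return entry.model
	}
	entry := &cacheEntry{}
	recording := func(event string, metadata map[string]any) {
		entry.events = append(entry.events, recordedEvent{event, metadata})
		listener(event, metadata)
	}
	entry.model = load(recording)
	cache[key] = entry
	return entry.model
}

func main() {
	extendsCount := 0
	listener := func(event string, _ map[string]any) {
		if event == "extends" {
			extendsCount++
		}
	}
	loads := 0
	load := func(l Listener) map[string]any {
		loads++
		l("extends", map[string]any{"service": "base"})
		return map[string]any{}
	}

	cache := map[string]*cacheEntry{}
	loadMemoized(cache, "k", listener, load)
	loadMemoized(cache, "k", listener, load) // cache hit: event is replayed
	fmt.Println(loads, extendsCount)         // 1 2
}
```

The expensive load runs once, while the listener still observes one extends event per include occurrence, matching the include listener's per-occurrence behaviour.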

glours · 2026-06-29T13:57:44Z

+func TestIncludeDiamondDedup(t *testing.T) {
+	dir := t.TempDir()
+	const depth = 24 // 2^24 ~= 16.7M leaf loads without dedup
+	for i := 0; i < depth; i++ {
+		content := fmt.Sprintf("include:\n  - path: ./level%d.yaml\n  - path: ./level%d.yaml\n", i+1, i+1)
+		assert.NilError(t, os.WriteFile(filepath.Join(dir, fmt.Sprintf("level%d.yaml", i)), []byte(content), 0o600))
+	}
+	leaf := "services:\n  leaf:\n    image: busybox\n"
+	assert.NilError(t, os.WriteFile(filepath.Join(dir, fmt.Sprintf("level%d.yaml", depth)), []byte(leaf), 0o600))
+
+	p, err := LoadWithContext(context.TODO(), types.ConfigDetails{
+		WorkingDir:  dir,
+		ConfigFiles: []types.ConfigFile{{Filename: filepath.Join(dir, "level0.yaml")}},
+	}, withProjectName("diamond", true))
+	assert.NilError(t, err)
+	_, err = p.GetService("leaf")
+	assert.NilError(t, err)
+}


Nice regression test, but the only assertions are assert.NilError(t, err) on line 267 and the GetService("leaf") check on lines 268-269. If the cache regresses, the LoadWithContext call on line 263 will spin on 2^24 leaf loads and the test goroutine stays parked on that line. loader/ doesn't check ctx.Done() anywhere, so a context.WithTimeout wouldn't help. The test then hangs until the default go test timeout (10 minutes) and fails with a generic timeout message: slow CI feedback and an unclear failure mode.

A fast, descriptive failure requires running the load in a goroutine and selecting on a timeout. The leaked goroutine on timeout is fine: t.Fatal ends the test and the process exits soon after.

	type result struct {
		p   *types.Project
		err error
	}

	// LoadWithContext doesn't properly support context cancellation today, so the
	// timeout below is consumed by the select, not by LoadWithContext. Two separate
	// contexts keep that intent clear.
	timeout, cancel := context.WithTimeout(t.Context(), 5*time.Second)
	defer cancel()

	done := make(chan result, 1)
	go func() {
		p, err := LoadWithContext(t.Context(), types.ConfigDetails{
			WorkingDir:  dir,
			ConfigFiles: []types.ConfigFile{{Filename: filepath.Join(dir, "level0.yaml")}},
		}, withProjectName("diamond", true))
		done <- result{p, err}
	}()

	select {
	case r := <-done:
		assert.NilError(t, r.err)
		_, err := r.p.GetService("leaf")
		assert.NilError(t, err)
	case <-timeout.Done():
		t.Fatal("diamond include did not complete within 5s; cache likely not working")
	}

5s is generous; the current run completes in ~0.18s on my machine.

davireis requested a review from ndeloof as a code owner June 19, 2026 14:01

davireis force-pushed the loader-memoize-includes branch from 08775c8 to e8f62dd Compare June 19, 2026 16:49

glours requested changes Jun 29, 2026

davireis wants to merge 1 commit into compose-spec:main from davireis:loader-memoize-includes
