loader: memoize includes to avoid re-expanding diamond include graphs #886

Open
davireis wants to merge 1 commit into compose-spec:main from davireis:loader-memoize-includes

@davireis

Problem

ApplyInclude re-parses and recursively re-expands an included file once per include path that reaches it. When the same file is reachable through more than one path (a "diamond" in the include graph), this is quadratic-to-exponential: a 24-level doubling graph loads the leaf 2²⁴ ≈ 16.7M times.
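
The blow-up can be reproduced without the loader at all. The sketch below is a toy model, not compose-go code: every level includes the next level twice, and the hypothetical helpers `naiveLoads` and `memoLoads` count how often the leaf would be loaded without and with per-node memoization.

```go
package main

import "fmt"

// naiveLoads counts how many times the leaf is loaded when every level
// includes the next level twice and nothing is shared between paths.
func naiveLoads(level, depth int) int {
	if level == depth {
		return 1 // loading the leaf itself
	}
	// Two include entries, each re-expanding the same child.
	return naiveLoads(level+1, depth) + naiveLoads(level+1, depth)
}

// memoLoads counts leaf loads when each level is expanded at most once.
// seen records the levels that were already expanded.
func memoLoads(level, depth int, seen map[int]bool) int {
	if seen[level] {
		return 0 // served from the cache, no new load
	}
	seen[level] = true
	if level == depth {
		return 1
	}
	return memoLoads(level+1, depth, seen) + memoLoads(level+1, depth, seen)
}

func main() {
	const depth = 16 // kept small so the naive count finishes quickly
	fmt.Println("naive leaf loads:   ", naiveLoads(0, depth))
	fmt.Println("memoized leaf loads:", memoLoads(0, depth, map[int]bool{}))
}
```

The naive count doubles with every level, while the memoized count stays at one regardless of depth.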

This shows up in real monorepos that aggregate per-target / per-project compose fragments via include:. A federation of ~80 services took ~55s in docker compose config; the cost is re-expansion, not the graph size.

Fix

Memoize each loaded include model for the duration of a single load (carried in ctx, so it never leaks across Load calls). The cache key is every input that determines the model — resolved paths, working dir, project dir, and effective environment — and a deep copy is handed out on each hit.

The merge into the parent (importResources) still runs for every occurrence, so:

  • a same-file extends in the including file still resolves (the included content is present in each including scope), and
  • the result is byte-identical to loading the file each time; only the parse + recursive expansion is shared.

Correctness details

  • Per-include env_file / project_directory: folded into the key, so the same path included with a different environment or project dir does not share a cache entry.
  • Relative-path rebasing: the same file reached through two parents can have a different relative base (e.g. a/b vs b), which yields models with different relative paths. Keying on the working dir prevents reusing a model whose paths the caller would then rebase incorrectly.
  • Cycle-safe: an include cycle is intrinsic to a node's subtree (the back-edge is in the fixed set of files the node includes), so it is detected on the node's first load — before it can be cached. A cyclic node is therefore never served from cache.
  • Listeners still fire per include entry; only the load is memoized.
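
The cycle argument can be checked on a toy graph. In the sketch below, which is an assumption-laden model rather than the loader's cycleTracker, `expand` marks a node as cached only after its whole subtree has loaded without error, so a node that sits on a cycle never reaches the cache.

```go
package main

import "fmt"

// expand loads name and everything it includes.
// inProgress holds the nodes on the current include path; done is the cache.
func expand(graph map[string][]string, name string, inProgress, done map[string]bool) error {
	if inProgress[name] {
		return fmt.Errorf("include cycle detected at %s", name)
	}
	if done[name] {
		return nil // cache hit: the subtree is known to be cycle-free
	}
	inProgress[name] = true
	defer delete(inProgress, name)

	for _, child := range graph[name] {
		if err := expand(graph, child, inProgress, done); err != nil {
			return err
		}
	}
	// Reached only when the whole subtree loaded, so a cyclic node is never cached.
	done[name] = true
	return nil
}

func main() {
	diamond := map[string][]string{"a": {"b", "c"}, "b": {"d"}, "c": {"d"}}
	fmt.Println("diamond:", expand(diamond, "a", map[string]bool{}, map[string]bool{}))

	cyclic := map[string][]string{"a": {"b"}, "b": {"a"}}
	done := map[string]bool{}
	fmt.Println("cycle:", expand(cyclic, "a", map[string]bool{}, done))
	fmt.Println("cached nodes after cycle:", len(done))
}
```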

Tests

  • TestIncludeDiamondDedup: a depth-24 diamond that times out without the cache and completes in ~0.15s with it (both a correctness and a non-flaky perf-regression test).
  • BenchmarkIncludeDiamond.
  • Full go test ./... passes; gofmt/go vet clean.

@davireis davireis requested a review from ndeloof as a code owner June 19, 2026 14:01
ApplyInclude re-parses and recursively re-expands an included file once per
include path that reaches it. When the same file is reached through more than
one path (a "diamond" in the include graph) this is quadratic-to-exponential:
a 24-level doubling graph loads the leaf 2^24 times. Monorepos that aggregate
per-target / per-project compose fragments hit this in practice (an ~80-service
federation took ~55s in `docker compose config`).

Memoize each loaded include model for the duration of a single load, keyed on
every input that determines it — resolved paths, working dir, project dir, and
effective environment — and hand out a deep copy on each hit. The merge into
the parent (importResources) still runs for every occurrence, so a same-file
`extends` in the including file still resolves and the result is identical to
loading each time; only the parse + recursive expansion is shared.

Keying on the working dir matters: the same file reached through two parents
can have a different relative base, yielding models with different relative
paths; reusing across bases would let the caller rebase an already-resolved
path. Cycle-safe: an include cycle is intrinsic to a node's subtree, so it is
detected on the node's first load, before it can be cached.

Adds a deep-diamond regression test (times out without the cache) and a
benchmark.

Signed-off-by: Davi de Castro Reis <davi@davi.eng.br>
@davireis davireis force-pushed the loader-memoize-includes branch from 08775c8 to e8f62dd Compare June 19, 2026 16:49

@glours glours left a comment

Collaborator

Thanks for the contribution. Three inline points to address.

  • A cache-key separator collision that's fixable with length-prefixed hashing.
  • More importantly, the extends listener events emitted inside loadYamlModel are silently dropped on cache hits, which changes the public Listener contract (cf. TestLoadExtendsListener asserting an exact count). Worth treating as a blocker.
  • A time-bound guard on the diamond test, so a future cache regression fails fast with a clear message instead of hanging CI for the default 10-minute timeout.

The overall approach (per-load cache, env+paths+workingDir+projectDir as key, deep-copy on get) is the right shape — cycle safety, deep-copy completeness, and the unexported context key all check out.

Comment thread loader/include.go
Comment on lines +97 to +116
for _, p := range paths {
	_, _ = h.Write([]byte(p))
	_, _ = h.Write([]byte{0})
}
_, _ = h.Write([]byte{1})
_, _ = h.Write([]byte(workingDir))
_, _ = h.Write([]byte{1})
_, _ = h.Write([]byte(projectDir))
_, _ = h.Write([]byte{1})
keys := make([]string, 0, len(env))
for k := range env {
	keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
	_, _ = h.Write([]byte(k))
	_, _ = h.Write([]byte{0})
	_, _ = h.Write([]byte(env[k]))
	_, _ = h.Write([]byte{0})
}

Collaborator

The single-byte separators written here ([]byte{0} on lines 99, 113, 115 and []byte{1} on lines 101, 103, 105) are ambiguous when env values contain those bytes. types.Mapping is map[string]string populated from .env files / process env, where embedded NULs aren't forbidden. Two distinct (paths, env) tuples can then produce the same byte stream and hash to the same key: the wrong cached model is served and no error is surfaced.

Toy example: env = {"A\x00B": "X"} and env = {"A": "B\x00X"} both serialize to …A\x00B\x00X\x00 through the loop on lines 111-116.

Length-prefixing each field removes the ambiguity:

write := func(s string) {
    fmt.Fprintf(h, "%d:", len(s))
    _, _ = h.Write([]byte(s))
}
for _, p := range paths { write(p) }
write(workingDir)
write(projectDir)
for _, k := range keys { write(k); write(env[k]) }
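
The toy example can be checked end to end. The sketch below is a standalone illustration: the hash function (SHA-256) and the helper names `sepKey`, `prefixedKey`, `writeField` and `sortedKeys` are assumptions, since the hash used by the PR is not shown here. The entry count written before the env fields is an extra precaution beyond what the suggestion above asks for; it keeps two field lists of different lengths from lining up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
	"sort"
)

// sortedKeys returns the map keys in a stable order.
func sortedKeys(env map[string]string) []string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// sepKey mimics the separator-based scheme: key, 0x00, value, 0x00.
func sepKey(env map[string]string) string {
	h := sha256.New()
	for _, k := range sortedKeys(env) {
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(env[k]))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// writeField writes a decimal length prefix followed by the raw bytes,
// so field boundaries no longer depend on the field contents.
func writeField(h hash.Hash, s string) {
	fmt.Fprintf(h, "%d:", len(s))
	h.Write([]byte(s))
}

// prefixedKey hashes the env with length-prefixed fields.
func prefixedKey(env map[string]string) string {
	h := sha256.New()
	// Assumption beyond the review: also record how many entries follow.
	fmt.Fprintf(h, "%d:", len(env))
	for _, k := range sortedKeys(env) {
		writeField(h, k)
		writeField(h, env[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string]string{"A\x00B": "X"}
	b := map[string]string{"A": "B\x00X"}
	fmt.Println("separator keys collide:", sepKey(a) == sepKey(b))       // true
	fmt.Println("prefixed keys collide: ", prefixedKey(a) == prefixedKey(b)) // false
}
```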

While you're here, a short comment explaining why Substitute / TypeCastMapping are intentionally excluded from the key (they are invariant across includes within a single load) would help a future contributor who adds a per-subtree option avoid silently introducing collisions. For example:

// Note: Substitute and TypeCastMapping are intentionally excluded from the key.
// They are invariant across includes within a single Load (cloned unchanged from
// the top-level options at the call site above). If a future option allows them
// to vary per include, they must be folded into the key.

Comment thread loader/include.go
Comment on lines +274 to +280
imported, ok := cache.get(key)
if !ok {
	imported, err = loadYamlModel(ctx, config, loadOptions, &cycleTracker{}, included)
	if err != nil {
		return err
	}
	cache.put(key, imported)

Collaborator

There is a silent change to the public Listener contract here.

When cache.get(key) returns ok at line 274, the if !ok block (lines 275-281) is skipped, so loadYamlModel(ctx, config, loadOptions, &cycleTracker{}, included) on line 276 does not run on cache hits. The extends events emitted from inside that call path (see opts.ProcessEvent("extends", v) at loader/extends.go:76 and opts.ProcessEvent("extends", map[string]any{"service": ref}) at extends.go:79) are therefore not re-emitted for any subsequent diamond traversal of the same included file. Compare with the include listener at include.go:175, which fires per occurrence (it's emitted in the outer loop, before the cache lookup): two events documented in the same pipeline now behave asymmetrically.

That isn't a cosmetic observability change:

  • Listener is part of the public API (loader/loader.go:109), and downstream tools wire into it via cli/options.go:83.
  • The existing test TestLoadExtendsListener at loader/extends_test.go:440 asserts assert.Equal(t, extendsCount, 3), which establishes the contract: one event per resolved extends. Diamond include topologies break that invariant after this PR, and do so silently, because no test in this repo combines diamond includes with extends listeners.
  • Any consumer counting extends events for telemetry, dependency tracking, audit, or progress reporting will see different counts depending on the include topology. That is exactly the kind of regression that surfaces months later as a confused user report.

I'd suggest preserving the contract by recording events on first load and replaying them on cache hit.
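
One possible shape for record-and-replay, as a minimal sketch: the cache entry stores the events emitted during the first load next to the model, and a hit replays them to the caller's listener. Every name here (`replayCache`, `cachedLoad`, `event`, `listener`) is hypothetical, the listener signature is assumed, and the deep copy of the model is left out for brevity.

```go
package main

import "fmt"

// event is one recorded listener call.
type event struct {
	name    string
	payload map[string]any
}

// listener is the assumed callback shape: an event name plus metadata.
type listener func(name string, payload map[string]any)

// cachedLoad pairs a loaded model with the events emitted while loading it.
type cachedLoad struct {
	model  map[string]any
	events []event
}

type replayCache struct {
	entries map[string]cachedLoad
}

// load returns the model for key. On a miss it runs loadFn with a recording
// listener; on a hit it replays the recorded events to notify instead.
func (c *replayCache) load(key string, notify listener, loadFn func(listener) map[string]any) map[string]any {
	if hit, ok := c.entries[key]; ok {
		for _, e := range hit.events {
			notify(e.name, e.payload)
		}
		return hit.model
	}
	var recorded []event
	recorder := func(name string, payload map[string]any) {
		recorded = append(recorded, event{name: name, payload: payload})
		notify(name, payload) // the first load still delivers events live
	}
	model := loadFn(recorder)
	c.entries[key] = cachedLoad{model: model, events: recorded}
	return model
}

func main() {
	cache := &replayCache{entries: map[string]cachedLoad{}}
	extendsCount, loads := 0, 0

	notify := func(name string, _ map[string]any) {
		if name == "extends" {
			extendsCount++
		}
	}
	loadFn := func(l listener) map[string]any {
		loads++
		l("extends", map[string]any{"service": "base"})
		return map[string]any{"services": map[string]any{}}
	}

	// The same included file reached through two include paths.
	cache.load("shared.yaml", notify, loadFn)
	cache.load("shared.yaml", notify, loadFn)
	fmt.Println("loads:", loads, "extends events:", extendsCount)
}
```

With this shape the listener sees one extends event per occurrence, as it did before memoization, while the parse itself runs once.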

Comment thread loader/include_test.go
Comment on lines +253 to +270
func TestIncludeDiamondDedup(t *testing.T) {
	dir := t.TempDir()
	const depth = 24 // 2^24 ~= 16.7M leaf loads without dedup
	for i := 0; i < depth; i++ {
		content := fmt.Sprintf("include:\n  - path: ./level%d.yaml\n  - path: ./level%d.yaml\n", i+1, i+1)
		assert.NilError(t, os.WriteFile(filepath.Join(dir, fmt.Sprintf("level%d.yaml", i)), []byte(content), 0o600))
	}
	leaf := "services:\n  leaf:\n    image: busybox\n"
	assert.NilError(t, os.WriteFile(filepath.Join(dir, fmt.Sprintf("level%d.yaml", depth)), []byte(leaf), 0o600))

	p, err := LoadWithContext(context.TODO(), types.ConfigDetails{
		WorkingDir:  dir,
		ConfigFiles: []types.ConfigFile{{Filename: filepath.Join(dir, "level0.yaml")}},
	}, withProjectName("diamond", true))
	assert.NilError(t, err)
	_, err = p.GetService("leaf")
	assert.NilError(t, err)
}

Collaborator

Nice regression test, but the only assertions are assert.NilError(t, err) on line 267 and the GetService("leaf") check on lines 268-269. If the cache regresses, the LoadWithContext call on line 263 will spin on 2^24 leaf loads and the test goroutine stays parked on that line. loader/ doesn't check ctx.Done() anywhere, so a context.WithTimeout wouldn't help. The test then hangs until the default go test timeout (10 minutes) and fails with a generic timeout message, which means slow CI feedback and an unclear failure mode.

A fast, descriptive failure requires running the load in a goroutine and selecting on a timeout. The leaked goroutine on timeout is fine: t.Fatal ends the test and the process exits soon after.

type result struct {
    p   *types.Project
    err error
}

// LoadWithContext doesn't properly support context today, so the timeout below is consumed by the
// select, not by LoadWithContext. Two separate contexts keep that intent clear.
timeout, cancel := context.WithTimeout(t.Context(), 5*time.Second)
defer cancel()

done := make(chan result, 1)
go func() {
    p, err := LoadWithContext(t.Context(), types.ConfigDetails{
        WorkingDir:  dir,
        ConfigFiles: []types.ConfigFile{{Filename: filepath.Join(dir, "level0.yaml")}},
    }, withProjectName("diamond", true))
    done <- result{p, err}
}()

select {
case r := <-done:
    assert.NilError(t, r.err)
    _, err := r.p.GetService("leaf")
    assert.NilError(t, err)
case <-timeout.Done():
    t.Fatal("diamond include did not complete within 5s; cache likely not working")
}

5s is generous; the current run completes in ~0.18s on my machine.
