fix(clone): stop COW layer-walk recursing through pointer fields (#1669) by ooples · Pull Request #1676 · ooples/AiDotNet

ooples · 2026-06-23T16:50:37Z

Summary

Fixes #1669 — the StackOverflow host crash that aborted the NN/Generated model-family CI shards (ModelFamily - NeuralNetworks A-L / O-R / S / T-Z and Generated Layers A-M). Stack overflow is uncatchable, so it killed the whole test host with no [FAIL] recorded.

Root cause

The reflective trainable-layer collector in CopyOnWriteCloneHelper.CollectInto (the COW Clone() lever, #1624/#1633) walks every instance field of every reachable object. When it reaches an unmanaged pointer field (void*, byte*, T*), FieldInfo.GetValue boxes it into a fresh System.Reflection.Pointer object on every call, and that wrapper's own _ptr field is itself a pointer — so following it spirals into unbounded recursion (a brand-new Pointer per level, never deduped by the visited set):

CopyOnWriteCloneHelper.CollectInto → CollectInto → CollectInto → …   (Pointer._ptr : Pointer, forever)

This was the single shared recursion behind every crashing shard: any model whose object graph transitively reaches a native pointer (the foundation-scale weight-streaming / mmap path on SmolVLM, SGPT, SPLADE, SimCSE, …) hit it; name-ranges with no such model (M-N) were spared — exactly the observed crash pattern.

Fix (root cause, minimal — `src/Helpers/CopyOnWriteCloneHelper.cs`)

Skip unmanaged pointer fields (field.FieldType.IsPointer) before GetValue, so the System.Reflection.Pointer wrapper is never created. A pointer can never reference an ITrainableLayer, so it is always a leaf.
Defensively treat a runtime System.Reflection.Pointer value as a leaf too (covers a pointer surfaced through an object/interface-typed field).
Fail-safe depth backstop (1024, far above any real model graph): on the now-unreachable event of another self-regenerating field type, bail to an empty result so TryShareTrainableParameters falls back to the eager flat copy (always correct) instead of ever stack-overflowing the host again.

The COW share path is unchanged for every real model (no model holds a layer behind a pointer), so collection results — and clone fidelity — are identical.

Validation

NeuralNetworks S shard (definitively crashed the host before): now runs to completion (22m, 0 crash / abort / stack-overflow markers). 202 passed, 12 failures — all pure test timeouts (9×120s, 3×180s) on foundation-scale models (SmolVLM/SPLADE/SimCSE/SGPT). Zero assertion/clone-drift/NaN failures. These pre-existing foundation-scale-model perf timeouts are the "remaining genuine per-model failures tracked separately" the issue scopes out.
COW clone correctness: dedicated CopyOnWriteCloneTests green; Clone_ShouldProduceIdenticalOutput passed for SGPT / SpikingNeuralNetwork / etc. in the shard.
Builds net10.0 + net471.

Acceptance

✅ The crashing shards no longer stack-overflow; they run to completion. Remaining per-model timeouts are tracked separately (foundation-scale model speed).

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes

Prevented stack overflows/infinite recursion during cloning by adding a maximum traversal depth; when exceeded, traversal safely aborts and falls back to the eager copy approach.
Hardened reflection-based traversal to better handle pointer-like values by skipping pointer-typed fields and treating pointer wrapper values as leaf nodes.
Added diagnostic logging when traversal depth limits are hit.

The reflective trainable-layer collector in copyonwriteclonehelper walked every instance field of every reachable object. When it reached an unmanaged pointer field (void*, byte*, T*), FieldInfo.GetValue boxes it into a FRESH System.Reflection.Pointer on every call, and that wrapper's own _ptr field is itself a pointer — so following it spiraled into unbounded recursion (a new Pointer object per level, never deduped by the visited set), stack-overflowing the test host. Stack overflow is uncatchable, so it aborted whole CI shards with no [FAIL] recorded. This was the single shared recursion behind the NN/Generated model-family host crashes (NeuralNetworks A-L/O-R/S/T-Z + Generated Layers A-M): every model whose object graph transitively reached a native pointer (the foundation-scale weight-streaming / mmap path on SmolVLM, SGPT, SPLADE, SimCSE, ...) hit it, while ranges with no such model (M-N) were spared. Fix (root cause, leaves the share path unchanged for every real model): - Skip unmanaged pointer fields (FieldType.IsPointer) before GetValue, so the System.Reflection.Pointer wrapper is never created. A pointer can never reference an ITrainableLayer, so it is always a leaf. - Defensively treat a runtime System.Reflection.Pointer value as a leaf too, for the case a pointer surfaces through an object/interface-typed field. - Add a fail-safe recursion-depth backstop (1024, far above any real model): on the unreachable event of another self-regenerating field type, bail to an empty result so TryShareTrainableParameters falls back to the eager flat copy (always correct) instead of ever stack-overflowing the host again. Builds net10.0 + net471. Existing CopyOnWriteCloneTests contract unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel · 2026-06-23T16:50:43Z

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
aidotnet_website	Ignored	Preview	Jun 23, 2026 5:30pm
aidotnet-playground-api	Ignored	Preview	Jun 23, 2026 5:30pm

coderabbitai · 2026-06-23T16:54:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7bb29855-40b9-4f6f-a229-4e2eb45d38c3

📥 Commits

Reviewing files that changed from the base of the PR and between 42ced7b and ae6ad65.

📒 Files selected for processing (1)

src/Helpers/CopyOnWriteCloneHelper.cs

Walkthrough

CopyOnWriteCloneHelper.CollectTrainableLayers and CollectInto are hardened against stack overflows: a MaxWalkDepth constant and a private CollectDepthExceededException impose a recursion ceiling, pointer-typed fields are skipped entirely, System.Reflection.Pointer values are treated as leaf nodes, and a try/catch fallback clears the layer list to force the eager flat-copy path.

Changes

Stack Overflow Hardening in CopyOnWriteCloneHelper

Layer / File(s)	Summary
Depth limit, fallback, and pointer-field skipping `src/Helpers/CopyOnWriteCloneHelper.cs`	`CollectInto` gains a `depth` parameter and throws `CollectDepthExceededException` when `MaxWalkDepth` is exceeded. `CollectTrainableLayers` catches that exception, clears the layer list, and returns empty to trigger the flat-copy fallback. Pointer-typed fields (`field.FieldType.IsPointer`) are skipped, `System.Reflection.Pointer` runtime values are treated as leaf nodes, and all recursive call sites pass `depth + 1`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A walker once wandered too deep in the graph,
And crashed the whole host — not a reason to laugh.
Now depth has a ceiling, and pointers stand still,
Exceptions catch overflow before it can kill.
The stack breathes again, calm and flat as can be 🛡️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(clone): stop COW layer-walk recursing through pointer fields (`#1669`)' directly and specifically describes the main change—preventing recursive traversal through pointer fields in the COW Clone mechanism—and references the root-cause issue being fixed.
Linked Issues check	✅ Passed	The PR fully addresses issue `#1669`: it eliminates the StackOverflow crash by skipping pointer fields before GetValue, treating System.Reflection.Pointer as leaves, and adding a recursion-depth backstop with diagnostic logging as specified.
Out of Scope Changes check	✅ Passed	All changes are confined to src/Helpers/CopyOnWriteCloneHelper.cs and directly target the pointer-recursion bug; no unrelated modifications or scope creep detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/1669-nn-stackoverflow-recursion

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Helpers/CopyOnWriteCloneHelper.cs`:
- Around line 100-107: The CollectDepthExceededException catch block in the
CopyOnWriteCloneHelper.cs file silently clears the layers list without any
diagnostic signal, which means if the supposedly unreachable depth limit is ever
exceeded, the fallback to eager copy happens invisibly without observability.
Add a diagnostic output (such as System.Diagnostics.Debug.WriteLine or your
project's standard logging mechanism) inside the catch block to surface when
this exception is triggered, ensuring regressions in object graph depth are
detectable rather than silent and invisible to monitoring.
- Around line 117-124: The MaxWalkDepth constant is currently set to 1024, which
is a very high ceiling that could allow the recursive CollectInto method to
approach dangerous stack depths before throwing CollectDepthExceededException.
Reduce the MaxWalkDepth value from 1024 to a more conservative ceiling such as
256 or 128 to provide additional stack safety margin and reduce the risk of
stack overflow, while still maintaining a safe threshold that is well above the
expected nesting depth of legitimate model object graphs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 284bf534-2930-443b-a117-4e75adf9d971

📥 Commits

Reviewing files that changed from the base of the PR and between 01a4ddb and 42ced7b.

📒 Files selected for processing (1)

src/Helpers/CopyOnWriteCloneHelper.cs

…#1669) Address CodeRabbit review on PR #1676: - The recursion-depth backstop previously cleared the layer list and fell back to the eager copy SILENTLY. That fallback is correct, but it is the exact path COW exists to avoid on foundation-scale models — a silent trip would quietly reintroduce large-Clone() OOM pressure (#1624) with zero observability, and a swallow-without-log violates the project's exception-logging guidance. Emit a System.Diagnostics.Trace.TraceWarning (the codebase's existing diagnostic idiom, observable in Release on both TFMs) naming the offending model type so a future regression is detectable instead of invisible. - Lower MaxWalkDepth 1024 -> 256: still a conservative ~5-10x margin over the few- dozen levels a real model graph nests, but caps frames well short of a host stack overflow if the pointer/leaf skips ever miss a case. Builds net10.0 + net471. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot deployed to Preview – aidotnet_website June 23, 2026 16:50 View deployment

vercel Bot deployed to Preview – aidotnet-playground-api June 23, 2026 16:51 View deployment

coderabbitai Bot requested changes Jun 23, 2026

View reviewed changes

Comment thread src/Helpers/CopyOnWriteCloneHelper.cs

Comment thread src/Helpers/CopyOnWriteCloneHelper.cs Outdated

coderabbitai Bot approved these changes Jun 23, 2026

View reviewed changes

ooples merged commit 264e75c into master Jun 23, 2026
49 of 79 checks passed

ooples deleted the fix/1669-nn-stackoverflow-recursion branch June 23, 2026 18:58

ooples mentioned this pull request Jun 24, 2026

Playbook: clear remaining #1677 OOM/timeout CI shards — memory-architecture fixes (caching arenas, fused optimizer-in-backward, COW clone, weight streaming) + float as the LAST lever #1684

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(clone): stop COW layer-walk recursing through pointer fields (#1669)#1676

fix(clone): stop COW layer-walk recursing through pointer fields (#1669)#1676
ooples merged 2 commits into
masterfrom
fix/1669-nn-stackoverflow-recursion

ooples commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

vercel Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ooples commented Jun 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix (root cause, minimal — src/Helpers/CopyOnWriteCloneHelper.cs)

Validation

Acceptance

Summary by CodeRabbit

Bug Fixes

Uh oh!

vercel Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ooples commented Jun 23, 2026 •

edited by coderabbitai Bot

Loading

Fix (root cause, minimal — `src/Helpers/CopyOnWriteCloneHelper.cs`)

vercel Bot commented Jun 23, 2026 •

edited

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading