Skip to content

fix(clone): stop COW layer-walk recursing through pointer fields (#1669)#1676

Merged
ooples merged 2 commits into
masterfrom
fix/1669-nn-stackoverflow-recursion
Jun 23, 2026
Merged

fix(clone): stop COW layer-walk recursing through pointer fields (#1669)#1676
ooples merged 2 commits into
masterfrom
fix/1669-nn-stackoverflow-recursion

Conversation

@ooples

@ooples ooples commented Jun 23, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes #1669 — the StackOverflow host crash that aborted the NN/Generated model-family CI shards (ModelFamily - NeuralNetworks A-L / O-R / S / T-Z and Generated Layers A-M). Stack overflow is uncatchable, so it killed the whole test host with no [FAIL] recorded.

Root cause

The reflective trainable-layer collector in CopyOnWriteCloneHelper.CollectInto (the COW Clone() lever, #1624/#1633) walks every instance field of every reachable object. When it reaches an unmanaged pointer field (void*, byte*, T*), FieldInfo.GetValue boxes it into a fresh System.Reflection.Pointer object on every call, and that wrapper's own _ptr field is itself a pointer — so following it spirals into unbounded recursion (a brand-new Pointer per level, never deduped by the visited set):

CopyOnWriteCloneHelper.CollectInto → CollectInto → CollectInto → …   (Pointer._ptr : Pointer, forever)

This was the single shared recursion behind every crashing shard: any model whose object graph transitively reaches a native pointer (the foundation-scale weight-streaming / mmap path on SmolVLM, SGPT, SPLADE, SimCSE, …) hit it; name-ranges with no such model (M-N) were spared — exactly the observed crash pattern.

Fix (root cause, minimal — src/Helpers/CopyOnWriteCloneHelper.cs)

  • Skip unmanaged pointer fields (field.FieldType.IsPointer) before GetValue, so the System.Reflection.Pointer wrapper is never created. A pointer can never reference an ITrainableLayer, so it is always a leaf.
  • Defensively treat a runtime System.Reflection.Pointer value as a leaf too (covers a pointer surfaced through an object/interface-typed field).
  • Fail-safe depth backstop (1024, far above any real model graph): on the now-unreachable event of another self-regenerating field type, bail to an empty result so TryShareTrainableParameters falls back to the eager flat copy (always correct) instead of ever stack-overflowing the host again.

The COW share path is unchanged for every real model (no model holds a layer behind a pointer), so collection results — and clone fidelity — are identical.

Validation

  • NeuralNetworks S shard (definitively crashed the host before): now runs to completion (22m, 0 crash / abort / stack-overflow markers). 202 passed, 12 failures — all pure test timeouts (9×120s, 3×180s) on foundation-scale models (SmolVLM/SPLADE/SimCSE/SGPT). Zero assertion/clone-drift/NaN failures. These pre-existing foundation-scale-model perf timeouts are the "remaining genuine per-model failures tracked separately" the issue scopes out.
  • COW clone correctness: dedicated CopyOnWriteCloneTests green; Clone_ShouldProduceIdenticalOutput passed for SGPT / SpikingNeuralNetwork / etc. in the shard.
  • Builds net10.0 + net471.

Acceptance

✅ The crashing shards no longer stack-overflow; they run to completion. Remaining per-model timeouts are tracked separately (foundation-scale model speed).

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes

  • Prevented stack overflows/infinite recursion during cloning by adding a maximum traversal depth; when exceeded, traversal safely aborts and falls back to the eager copy approach.
  • Hardened reflection-based traversal to better handle pointer-like values by skipping pointer-typed fields and treating pointer wrapper values as leaf nodes.
  • Added diagnostic logging when traversal depth limits are hit.

The reflective trainable-layer collector in copyonwriteclonehelper walked every
instance field of every reachable object. When it reached an unmanaged pointer
field (void*, byte*, T*), FieldInfo.GetValue boxes it into a FRESH
System.Reflection.Pointer on every call, and that wrapper's own _ptr field is
itself a pointer — so following it spiraled into unbounded recursion (a new
Pointer object per level, never deduped by the visited set), stack-overflowing
the test host. Stack overflow is uncatchable, so it aborted whole CI shards with
no [FAIL] recorded.

This was the single shared recursion behind the NN/Generated model-family host
crashes (NeuralNetworks A-L/O-R/S/T-Z + Generated Layers A-M): every model whose
object graph transitively reached a native pointer (the foundation-scale
weight-streaming / mmap path on SmolVLM, SGPT, SPLADE, SimCSE, ...) hit it, while
ranges with no such model (M-N) were spared.

Fix (root cause, leaves the share path unchanged for every real model):
- Skip unmanaged pointer fields (FieldType.IsPointer) before GetValue, so the
  System.Reflection.Pointer wrapper is never created. A pointer can never
  reference an ITrainableLayer, so it is always a leaf.
- Defensively treat a runtime System.Reflection.Pointer value as a leaf too, for
  the case a pointer surfaces through an object/interface-typed field.
- Add a fail-safe recursion-depth backstop (1024, far above any real model): on
  the unreachable event of another self-regenerating field type, bail to an empty
  result so TryShareTrainableParameters falls back to the eager flat copy (always
  correct) instead of ever stack-overflowing the host again.

Builds net10.0 + net471. Existing CopyOnWriteCloneTests contract unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 23, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Actions Updated (UTC)
aidotnet_website Ignored Ignored Preview Jun 23, 2026 5:30pm
aidotnet-playground-api Ignored Ignored Preview Jun 23, 2026 5:30pm

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7bb29855-40b9-4f6f-a229-4e2eb45d38c3

📥 Commits

Reviewing files that changed from the base of the PR and between 42ced7b and ae6ad65.

📒 Files selected for processing (1)
  • src/Helpers/CopyOnWriteCloneHelper.cs

Walkthrough

CopyOnWriteCloneHelper.CollectTrainableLayers and CollectInto are hardened against stack overflows: a MaxWalkDepth constant and a private CollectDepthExceededException impose a recursion ceiling, pointer-typed fields are skipped entirely, System.Reflection.Pointer values are treated as leaf nodes, and a try/catch fallback clears the layer list to force the eager flat-copy path.

Changes

Stack Overflow Hardening in CopyOnWriteCloneHelper

Layer / File(s) Summary
Depth limit, fallback, and pointer-field skipping
src/Helpers/CopyOnWriteCloneHelper.cs
CollectInto gains a depth parameter and throws CollectDepthExceededException when MaxWalkDepth is exceeded. CollectTrainableLayers catches that exception, clears the layer list, and returns empty to trigger the flat-copy fallback. Pointer-typed fields (field.FieldType.IsPointer) are skipped, System.Reflection.Pointer runtime values are treated as leaf nodes, and all recursive call sites pass depth + 1.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A walker once wandered too deep in the graph,
And crashed the whole host — not a reason to laugh.
Now depth has a ceiling, and pointers stand still,
Exceptions catch overflow before it can kill.
The stack breathes again, calm and flat as can be 🛡️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(clone): stop COW layer-walk recursing through pointer fields (#1669)' directly and specifically describes the main change—preventing recursive traversal through pointer fields in the COW Clone mechanism—and references the root-cause issue being fixed.
Linked Issues check ✅ Passed The PR fully addresses issue #1669: it eliminates the StackOverflow crash by skipping pointer fields before GetValue, treating System.Reflection.Pointer as leaves, and adding a recursion-depth backstop with diagnostic logging as specified.
Out of Scope Changes check ✅ Passed All changes are confined to src/Helpers/CopyOnWriteCloneHelper.cs and directly target the pointer-recursion bug; no unrelated modifications or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/1669-nn-stackoverflow-recursion

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Helpers/CopyOnWriteCloneHelper.cs`:
- Around line 100-107: The CollectDepthExceededException catch block in the
CopyOnWriteCloneHelper.cs file silently clears the layers list without any
diagnostic signal, which means if the supposedly unreachable depth limit is ever
exceeded, the fallback to eager copy happens invisibly without observability.
Add a diagnostic output (such as System.Diagnostics.Debug.WriteLine or your
project's standard logging mechanism) inside the catch block to surface when
this exception is triggered, ensuring regressions in object graph depth are
detectable rather than silent and invisible to monitoring.
- Around line 117-124: The MaxWalkDepth constant is currently set to 1024, which
is a very high ceiling that could allow the recursive CollectInto method to
approach dangerous stack depths before throwing CollectDepthExceededException.
Reduce the MaxWalkDepth value from 1024 to a more conservative ceiling such as
256 or 128 to provide additional stack safety margin and reduce the risk of
stack overflow, while still maintaining a safe threshold that is well above the
expected nesting depth of legitimate model object graphs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 284bf534-2930-443b-a117-4e75adf9d971

📥 Commits

Reviewing files that changed from the base of the PR and between 01a4ddb and 42ced7b.

📒 Files selected for processing (1)
  • src/Helpers/CopyOnWriteCloneHelper.cs

Comment thread src/Helpers/CopyOnWriteCloneHelper.cs
Comment thread src/Helpers/CopyOnWriteCloneHelper.cs Outdated
…#1669)

Address CodeRabbit review on PR #1676:

- The recursion-depth backstop previously cleared the layer list and fell back to
  the eager copy SILENTLY. That fallback is correct, but it is the exact path COW
  exists to avoid on foundation-scale models — a silent trip would quietly
  reintroduce large-Clone() OOM pressure (#1624) with zero observability, and a
  swallow-without-log violates the project's exception-logging guidance. Emit a
  System.Diagnostics.Trace.TraceWarning (the codebase's existing diagnostic idiom,
  observable in Release on both TFMs) naming the offending model type so a future
  regression is detectable instead of invisible.
- Lower MaxWalkDepth 1024 -> 256: still a conservative ~5-10x margin over the few-
  dozen levels a real model graph nests, but caps frames well short of a host
  stack overflow if the pointer/leaf skips ever miss a case.

Builds net10.0 + net471.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ooples ooples merged commit 264e75c into master Jun 23, 2026
49 of 79 checks passed
@ooples ooples deleted the fix/1669-nn-stackoverflow-recursion branch June 23, 2026 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: StackOverflow host crash in NN/Generated model-family shards (likely one shared recursion)

1 participant