fix(clone): stop COW layer-walk recursing through pointer fields (#1669)#1676
Conversation
The reflective trainable-layer collector in copyonwriteclonehelper walked every instance field of every reachable object. When it reached an unmanaged pointer field (void*, byte*, T*), FieldInfo.GetValue boxes it into a FRESH System.Reflection.Pointer on every call, and that wrapper's own _ptr field is itself a pointer — so following it spiraled into unbounded recursion (a new Pointer object per level, never deduped by the visited set), stack-overflowing the test host. Stack overflow is uncatchable, so it aborted whole CI shards with no [FAIL] recorded. This was the single shared recursion behind the NN/Generated model-family host crashes (NeuralNetworks A-L/O-R/S/T-Z + Generated Layers A-M): every model whose object graph transitively reached a native pointer (the foundation-scale weight-streaming / mmap path on SmolVLM, SGPT, SPLADE, SimCSE, ...) hit it, while ranges with no such model (M-N) were spared. Fix (root cause, leaves the share path unchanged for every real model): - Skip unmanaged pointer fields (FieldType.IsPointer) before GetValue, so the System.Reflection.Pointer wrapper is never created. A pointer can never reference an ITrainableLayer, so it is always a leaf. - Defensively treat a runtime System.Reflection.Pointer value as a leaf too, for the case a pointer surfaces through an object/interface-typed field. - Add a fail-safe recursion-depth backstop (1024, far above any real model): on the unreachable event of another self-regenerating field type, bail to an empty result so TryShareTrainableParameters falls back to the eager flat copy (always correct) instead of ever stack-overflowing the host again. Builds net10.0 + net471. Existing CopyOnWriteCloneTests contract unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
Walkthrough
ChangesStack Overflow Hardening in CopyOnWriteCloneHelper
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/Helpers/CopyOnWriteCloneHelper.cs`:
- Around line 100-107: The CollectDepthExceededException catch block in the
CopyOnWriteCloneHelper.cs file silently clears the layers list without any
diagnostic signal, which means if the supposedly unreachable depth limit is ever
exceeded, the fallback to eager copy happens invisibly without observability.
Add a diagnostic output (such as System.Diagnostics.Debug.WriteLine or your
project's standard logging mechanism) inside the catch block to surface when
this exception is triggered, ensuring regressions in object graph depth are
detectable rather than silent and invisible to monitoring.
- Around line 117-124: The MaxWalkDepth constant is currently set to 1024, which
is a very high ceiling that could allow the recursive CollectInto method to
approach dangerous stack depths before throwing CollectDepthExceededException.
Reduce the MaxWalkDepth value from 1024 to a more conservative ceiling such as
256 or 128 to provide additional stack safety margin and reduce the risk of
stack overflow, while still maintaining a safe threshold that is well above the
expected nesting depth of legitimate model object graphs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 284bf534-2930-443b-a117-4e75adf9d971
📒 Files selected for processing (1)
src/Helpers/CopyOnWriteCloneHelper.cs
…#1669) Address CodeRabbit review on PR #1676: - The recursion-depth backstop previously cleared the layer list and fell back to the eager copy SILENTLY. That fallback is correct, but it is the exact path COW exists to avoid on foundation-scale models — a silent trip would quietly reintroduce large-Clone() OOM pressure (#1624) with zero observability, and a swallow-without-log violates the project's exception-logging guidance. Emit a System.Diagnostics.Trace.TraceWarning (the codebase's existing diagnostic idiom, observable in Release on both TFMs) naming the offending model type so a future regression is detectable instead of invisible. - Lower MaxWalkDepth 1024 -> 256: still a conservative ~5-10x margin over the few- dozen levels a real model graph nests, but caps frames well short of a host stack overflow if the pointer/leaf skips ever miss a case. Builds net10.0 + net471. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes #1669 — the StackOverflow host crash that aborted the NN/Generated model-family CI shards (
ModelFamily - NeuralNetworks A-L / O-R / S / T-ZandGenerated Layers A-M). Stack overflow is uncatchable, so it killed the whole test host with no[FAIL]recorded.Root cause
The reflective trainable-layer collector in
CopyOnWriteCloneHelper.CollectInto(the COWClone()lever, #1624/#1633) walks every instance field of every reachable object. When it reaches an unmanaged pointer field (void*,byte*,T*),FieldInfo.GetValueboxes it into a freshSystem.Reflection.Pointerobject on every call, and that wrapper's own_ptrfield is itself a pointer — so following it spirals into unbounded recursion (a brand-newPointerper level, never deduped by the visited set):This was the single shared recursion behind every crashing shard: any model whose object graph transitively reaches a native pointer (the foundation-scale weight-streaming / mmap path on SmolVLM, SGPT, SPLADE, SimCSE, …) hit it; name-ranges with no such model (M-N) were spared — exactly the observed crash pattern.
Fix (root cause, minimal —
src/Helpers/CopyOnWriteCloneHelper.cs)field.FieldType.IsPointer) beforeGetValue, so theSystem.Reflection.Pointerwrapper is never created. A pointer can never reference anITrainableLayer, so it is always a leaf.System.Reflection.Pointervalue as a leaf too (covers a pointer surfaced through anobject/interface-typed field).TryShareTrainableParametersfalls back to the eager flat copy (always correct) instead of ever stack-overflowing the host again.The COW share path is unchanged for every real model (no model holds a layer behind a pointer), so collection results — and clone fidelity — are identical.
Validation
NeuralNetworks Sshard (definitively crashed the host before): now runs to completion (22m, 0 crash / abort / stack-overflow markers). 202 passed, 12 failures — all pure test timeouts (9×120s, 3×180s) on foundation-scale models (SmolVLM/SPLADE/SimCSE/SGPT). Zero assertion/clone-drift/NaN failures. These pre-existing foundation-scale-model perf timeouts are the "remaining genuine per-model failures tracked separately" the issue scopes out.CopyOnWriteCloneTestsgreen;Clone_ShouldProduceIdenticalOutputpassed for SGPT / SpikingNeuralNetwork / etc. in the shard.Acceptance
✅ The crashing shards no longer stack-overflow; they run to completion. Remaining per-model timeouts are tracked separately (foundation-scale model speed).
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes