EmitClassTableJob: surface silently dropped class buckets#646
Open
sj-i wants to merge 6 commits into
Open
Conversation
The class_table walk wraps every bucket in a per-bucket
`try { … } catch (\Throwable) {}` so a single unreadable class
doesn't kill the whole walk. That's the right design, but it
also means a class that fails to emit silently disappears from
the class_table tree — `RmemModel::findClassDef()` then returns
null for that class, the `⇒ class` pseudo-edge that wires
ObjectContext rows to their ClassDefinitionContext on the
explore TUI doesn't fire, and the user sees an apparently
arbitrary asymmetry: two ObjectContext children of the same
parent, one resolves its class def via the pseudo-edge and the
other doesn't.
Make the failure visible without changing the resilience:
- Reserve the class_table root id up front via `assignNodeId()`,
defer the actual emit to the end of the walk so attributes
pick up counters populated during the loop.
- DefinedClassesContext gains `recordDroppedException()` /
`recordSkippedDedup()` setters, and exposes the totals as
`#dropped_exceptions` / `#skipped_dedup` attributes (alongside
the existing `#count`). Both are visible from rmem:explore's
detail pane on the class_table root, so users can tell at a
glance whether their snapshot is missing classes.
- The catch arm also samples the first 8 dropped buckets (class
name when readable + exception class + message) and prints a
STDERR warning at the end of analyze. That's enough to point
at which class is failing and what's going wrong without
threading a logger through the collector.
This is a diagnostic patch, not a fix — the `try/catch`
swallowing is still resilient as before. The intent is to
narrow down whether the asymmetry is genuinely "bucket throws
on read" versus "bucket isn't where we think it is" before
proposing a real fix.
https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
Followup to the previous diagnostic: a user reported that the class_table root in their freshly-analyzed rmem shows up as type=ClassDefinitionContext (with #is_internal=1, a class-level attribute) instead of DefinedClassesContext. Isolated repros of the assignNodeId-then-late-emit pattern produce the right type; something in the real path is over-writing node #N's type after late-emit. Add a STDERR line that logs both the id reserved up front via `assignNodeId()` and the id memoised on `$defined_classes_context` after the late `emitNode()` call. If the two match, the late-emit reached the analyzer + sink — the bug is then upstream of us (another emit on the same id). If they don't match, the late emit short-circuited via the "fully emitted, just emit a reference edge" branch and the root genuinely never landed. https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
… late-emit
The previous diagnostic confirmed that the reserved id and the
final memoised id match, but the user's rmem still has node
#(reserved+1) showing as `class_table → ClassDefinitionContext`
in the explore Roots view, which is impossible if late-emit
actually wrote the tree edge from the sentinel to #reserved.
The remaining failure modes both flow through the same branch in
analyzeContext:
if ($existing_node_id !== null && $existing_node_id >= 0) {
$sink->emitReference(...); // reference edge, NOT tree edge
return;
}
i.e. if `defined_classes_context`'s memo gets flipped from
`-(reserved+1)` to a positive number sometime between the
`assignNodeId()` reservation and the late `emitNode()` call,
late-emit short-circuits to a reference edge and the tree edge
from `-1 → reserved` is never written. The substrate then
doesn't see #reserved as a root, which would explain why the
user's first class_def (at #reserved+1) appears with the
`class_table` link instead.
Snapshot the memo right before the late `emitNode()` so we can
tell which case we're in:
- memo == `-(reserved+1)` (negative sentinel): late-emit took
the normal path; the bug is elsewhere.
- memo == positive: someone flipped it during the loop, late-
emit took the `emitReference` branch, and we need to find
who's mutating the context's memo.
https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
BinaryMemoryOutput.buildCsrSections writes per-node arrays (treeParents, tree CSR rowptr/colidx/linknames, etc.) keyed by node_id directly — relying on the implicit assumption that csr_idx == node_id at load time, with the synthetic -1 sentinel at csr_idx == nodeCount. FfiCsrGraphSubstrate.loadFromBinary built indexToNodeFfi by iterating $all_node_ids in INSERTION order, which only happened to satisfy that contract when nodes were emitted strictly in id order. EmitClassTableJob's diagnostic patch (147b451) reserves a node_id for the class_table root via assignNodeId and defers the actual emit until after the bucket walk, so dropped/skipped counters land in the root's attributes. That writes rows out of order: row N+1 (first ClassDefinitionContext) lands before row N (the late-emit DefinedClassesContext). all_node_ids consequently mapped node_id N to the LAST csr slot, shifting every per-node array index by one from row N onward — manifesting as an explore TUI where the class_table root appeared as a leaf parented under some ClassEntryContext.constants while a downstream ClassDefinitionContext mis-claimed the class_table root slot. Sort all_node_ids before assigning csr indices and re-append -1 afterwards so the sentinel keeps the last slot. Also drop the late-emit STDERR trace from EmitClassTableJob now that the csr-vs-node-id mismatch is understood and fixed; keep the silently-dropped-bucket warning, which is the user-actionable part.
The previous commit (d0f1077) fixes csr_idx assignment in FfiCsrGraphSubstrate.loadFromBinary. Caches written before that fix contain srcloc_refs computed against the broken tree topology — held_by node_ids point at whatever the shifted parent walk landed on (e.g. an unrelated ClassEntryContext for the class_table root), so reli:explore keeps surfacing nonsense source locations even after the substrate is fixed because the sidecar still validates by mtime/size/fingerprint (the rmem itself is unchanged). Bumping VERSION 2 → 3 forces recomputation on first open after upgrade.
When a class bucket walk throws (typically MemoryAddressNotInDumpException on opcache-cached vendor classes), the existing diagnostic prints the class name + exception kind but not the address that failed. Without the hex address it is hard to tell whether the missing memory is in opcache SHM (0x55…/0x56… /dev/zero VMA), in a library .data segment (0x7f… file-backed mmap), or somewhere else entirely — which decides whether --include-binary, the pagemap residency filter, or a different dump strategy is the lever to pull. Capture ce->address right after deref'ing the bucket so the catch can print it as `<class> @ 0x… : <ExceptionClass>: <message>`. Raise the sample cap from 8 → 32 so a real-world target with dozens of drops gets a useful spread instead of just the first eight.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Diagnostic-only PR to chase down a reported asymmetry: two
ObjectContext rows hanging off the same parent in
rmem:explore,one shows the
⇒ class → ClassDefinitionContextpseudo-edge thatlets users jump to the class definition, the other doesn't, even
though both clearly have a class.
RmemModel::findClassDef()is fed byclassDefIndex, which isbuilt from the children of the
class_tableroot in thesubstrate. If a particular class doesn't end up under
class_tablein the dump, the index misses, the pseudo-edgeisn't added, and the user sees the asymmetry.
EmitClassTableJobwraps every bucket in a per-buckettry { … } catch (\Throwable) {}for resilience — a singleunreadable class shouldn't abort the whole walk. The unfortunate
side-effect is that whatever class fails to emit disappears
silently. Make the failure visible without changing the
resilience:
class_tableroot id up front viaassignNodeId()and defer the actual emit to the end of thewalk, so the root's attributes pick up counters populated
during bucket iteration.
DefinedClassesContextgainsrecordDroppedException()andrecordSkippedDedup()and exposes the totals as#dropped_exceptions/#skipped_dedupattributes alongsidethe existing
#count. Both surface in thermem:exploredetailpane — users can tell at a glance whether their snapshot is
missing classes.
(class name when readable + exception class + message) and
prints a STDERR warning at the end of analyze. Enough to
point at which class is failing and what is throwing,
without threading a logger through the collector.
How to use
After re-running
inspector:memory:analyzeagainst the sametarget with this branch:
WARNING: EmitClassTableJob silently dropped class bucketsline shows up, the listed classes are the ones that won't appear underclass_tablein the explore TUI..rmeminrmem:explore, navigate to theclass_tableroot, and check the detail pane:#dropped_exceptions/#skipped_dedupattribute values give a count even when stderr was missed.⇒ classpseudo-edge — if they line up, this PR explains the bug; if they don't, the asymmetry has a different cause and we have a narrower hypothesis to chase.Not a fix
The catch is still silent at the resilience layer (a single
unreadable class still doesn't crash analyze) — this PR only
makes the failure observable. Once the user reports back what
the warnings say, a follow-up PR can target the actual root
cause (e.g. fixing the FFI read that's throwing, or making
findClassDefresilient to gaps).Test plan
phpcs(CI-equivalent:--standard=./phpcs.xml -n -q ./src ./tests) — exit 0psalm.phar --no-progress --show-info=false— no errorsphpunit tests/Rmem tests/Lib/PhpProcessReader --exclude-group target-version— 192 tests, 847 assertions, only the pre-existing 2 env-dependent failures (mod_php, FrankenPHP)https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
Generated by Claude Code