Skip to content

EmitClassTableJob: surface silently dropped class buckets#646

Open
sj-i wants to merge 6 commits into
0.12.xfrom
claude/class-table-drop-diag-pT8jW
Open

EmitClassTableJob: surface silently dropped class buckets#646
sj-i wants to merge 6 commits into
0.12.xfrom
claude/class-table-drop-diag-pT8jW

Conversation

@sj-i
Copy link
Copy Markdown
Member

@sj-i sj-i commented Apr 25, 2026

Summary

Diagnostic-only PR to chase down a reported asymmetry: two
ObjectContext rows hanging off the same parent in rmem:explore,
one shows the ⇒ class → ClassDefinitionContext pseudo-edge that
lets users jump to the class definition, the other doesn't, even
though both clearly have a class.

RmemModel::findClassDef() is fed by classDefIndex, which is
built from the children of the class_table root in the
substrate. If a particular class doesn't end up under
class_table in the dump, the index misses, the pseudo-edge
isn't added, and the user sees the asymmetry.

EmitClassTableJob wraps every bucket in a per-bucket
try { … } catch (\Throwable) {} for resilience — a single
unreadable class shouldn't abort the whole walk. The unfortunate
side-effect is that whatever class fails to emit disappears
silently. Make the failure visible without changing the
resilience:

  • Reserve the class_table root id up front via
    assignNodeId() and defer the actual emit to the end of the
    walk, so the root's attributes pick up counters populated
    during bucket iteration.
  • DefinedClassesContext gains recordDroppedException() and
    recordSkippedDedup() and exposes the totals as
    #dropped_exceptions / #skipped_dedup attributes alongside
    the existing #count. Both surface in the rmem:explore detail
    pane — users can tell at a glance whether their snapshot is
    missing classes.
  • The catch arm also samples the first 8 dropped buckets
    (class name when readable + exception class + message) and
    prints a STDERR warning at the end of analyze. Enough to
    point at which class is failing and what is throwing,
    without threading a logger through the collector.

How to use

After re-running inspector:memory:analyze against the same
target with this branch:

  1. Watch stderr — if the WARNING: EmitClassTableJob silently dropped class buckets line shows up, the listed classes are the ones that won't appear under class_table in the explore TUI.
  2. Open the resulting .rmem in rmem:explore, navigate to the class_table root, and check the detail pane: #dropped_exceptions / #skipped_dedup attribute values give a count even when stderr was missed.
  3. Compare the dropped class names with the ObjectContext rows that are missing the ⇒ class pseudo-edge — if they line up, this PR explains the bug; if they don't, the asymmetry has a different cause and we have a narrower hypothesis to chase.

Not a fix

The catch is still silent at the resilience layer (a single
unreadable class still doesn't crash analyze) — this PR only
makes the failure observable. Once the user reports back what
the warnings say, a follow-up PR can target the actual root
cause (e.g. fixing the FFI read that's throwing, or making
findClassDef resilient to gaps).

Test plan

  • phpcs (CI-equivalent: --standard=./phpcs.xml -n -q ./src ./tests) — exit 0
  • psalm.phar --no-progress --show-info=false — no errors
  • phpunit tests/Rmem tests/Lib/PhpProcessReader --exclude-group target-version — 192 tests, 847 assertions, only the pre-existing 2 env-dependent failures (mod_php, FrankenPHP)
  • Manual: run analyze against a real target with classes likely to fail (e.g. heavy classes / closures / classes with unusual ZE state) and confirm stderr surfaces them.

https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw


Generated by Claude Code

claude added 3 commits April 25, 2026 13:36
The class_table walk wraps every bucket in a per-bucket
`try { … } catch (\Throwable) {}` so a single unreadable class
doesn't kill the whole walk. That's the right design, but it
also means a class that fails to emit silently disappears from
the class_table tree — `RmemModel::findClassDef()` then returns
null for that class, the `⇒ class` pseudo-edge that wires
ObjectContext rows to their ClassDefinitionContext on the
explore TUI doesn't fire, and the user sees an apparently
arbitrary asymmetry: two ObjectContext children of the same
parent, one resolves its class def via the pseudo-edge and the
other doesn't.

Make the failure visible without changing the resilience:

- Reserve the class_table root id up front via `assignNodeId()`,
  defer the actual emit to the end of the walk so attributes
  pick up counters populated during the loop.
- DefinedClassesContext gains `recordDroppedException()` /
  `recordSkippedDedup()` setters, and exposes the totals as
  `#dropped_exceptions` / `#skipped_dedup` attributes (alongside
  the existing `#count`). Both are visible from rmem:explore's
  detail pane on the class_table root, so users can tell at a
  glance whether their snapshot is missing classes.
- The catch arm also samples the first 8 dropped buckets (class
  name when readable + exception class + message) and prints a
  STDERR warning at the end of analyze. That's enough to point
  at which class is failing and what's going wrong without
  threading a logger through the collector.

This is a diagnostic patch, not a fix — the `try/catch`
swallowing is still resilient as before. The intent is to
narrow down whether the asymmetry is genuinely "bucket throws
on read" versus "bucket isn't where we think it is" before
proposing a real fix.

https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
Followup to the previous diagnostic: a user reported that the
class_table root in their freshly-analyzed rmem shows up as
type=ClassDefinitionContext (with #is_internal=1, a class-level
attribute) instead of DefinedClassesContext. Isolated repros of
the assignNodeId-then-late-emit pattern produce the right type;
something in the real path is over-writing node #N's type after
late-emit.

Add a STDERR line that logs both the id reserved up front via
`assignNodeId()` and the id memoised on `$defined_classes_context`
after the late `emitNode()` call. If the two match, the late-emit
reached the analyzer + sink — the bug is then upstream of us
(another emit on the same id). If they don't match, the late
emit short-circuited via the "fully emitted, just emit a
reference edge" branch and the root genuinely never landed.

https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
… late-emit

The previous diagnostic confirmed that the reserved id and the
final memoised id match, but the user's rmem still has node
#(reserved+1) showing as `class_table → ClassDefinitionContext`
in the explore Roots view, which is impossible if late-emit
actually wrote the tree edge from the sentinel to #reserved.

The remaining failure modes both flow through the same branch in
analyzeContext:

  if ($existing_node_id !== null && $existing_node_id >= 0) {
      $sink->emitReference(...);  // reference edge, NOT tree edge
      return;
  }

i.e. if `defined_classes_context`'s memo gets flipped from
`-(reserved+1)` to a positive number sometime between the
`assignNodeId()` reservation and the late `emitNode()` call,
late-emit short-circuits to a reference edge and the tree edge
from `-1 → reserved` is never written. The substrate then
doesn't see #reserved as a root, which would explain why the
user's first class_def (at #reserved+1) appears with the
`class_table` link instead.

Snapshot the memo right before the late `emitNode()` so we can
tell which case we're in:

  - memo == `-(reserved+1)` (negative sentinel): late-emit took
    the normal path; the bug is elsewhere.
  - memo == positive: someone flipped it during the loop, late-
    emit took the `emitReference` branch, and we need to find
    who's mutating the context's memo.

https://claude.ai/code/session_01XXo4QACibyKMdC8CvbBijw
@coveralls
Copy link
Copy Markdown

Coverage Report for CI Build 24940141295

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-23.2%) to 59.071%

Details

  • Coverage decreased (-23.2%) from the base build.
  • Patch coverage: 40 uncovered changes across 2 files (0 of 40 lines covered, 0.0%).
  • 9158 coverage regressions across 282 files.

Uncovered Changes

File Changed Covered %
src/Lib/PhpProcessReader/PhpMemoryReader/Collector/Job/EmitClassTableJob.php 30 0 0.0%
src/Lib/PhpProcessReader/PhpMemoryReader/ReferenceContext/DefinedClassesContext.php 10 0 0.0%

Coverage Regressions

9158 previously-covered lines in 282 files lost coverage.

Top 10 Files by Coverage Loss Lines Losing Coverage Coverage
src/Inspector/MemoryDump/MemoryDumper.php 393 0.44%
src/Inspector/Output/MemoryOutput/Report/Substrate/FfiCsrGraphSubstrate.php 393 49.41%
src/Inspector/Watch/VariableReader.php 389 9.41%
src/Lib/PhpProcessReader/PhpMemoryReader/MetadataPeekWalker.php 322 0.0%
src/Lib/PhpProcessReader/PhpMemoryReader/Collector/Job/PdoHelper.php 245 0.0%
src/Lib/PhpInternals/Types/Zend/ZendExecuteData.php 236 0.0%
src/Lib/PhpProcessReader/PhpMemoryReader/MemoryLocationsCollector.php 221 0.77%
src/Lib/PhpProcessReader/PhpMemoryReader/Collector/Job/EmitObjectJob.php 218 0.0%
src/Lib/PhpProcessReader/PhpMemoryReader/Collector/Job/EmitClassTableJob.php 213 0.0%
src/Inspector/Output/MemoryOutput/Report/BinaryReportDataProvider.php 192 45.29%

Coverage Stats

Coverage Status
Relevant Lines: 37787
Covered Lines: 22321
Line Coverage: 59.07%
Coverage Strength: 18.36 hits per line

💛 - Coveralls

claude added 3 commits April 25, 2026 21:12
BinaryMemoryOutput.buildCsrSections writes per-node arrays
(treeParents, tree CSR rowptr/colidx/linknames, etc.) keyed by
node_id directly — relying on the implicit assumption that
csr_idx == node_id at load time, with the synthetic -1 sentinel at
csr_idx == nodeCount. FfiCsrGraphSubstrate.loadFromBinary built
indexToNodeFfi by iterating $all_node_ids in INSERTION order, which
only happened to satisfy that contract when nodes were emitted
strictly in id order.

EmitClassTableJob's diagnostic patch (147b451) reserves a node_id
for the class_table root via assignNodeId and defers the actual
emit until after the bucket walk, so dropped/skipped counters land
in the root's attributes. That writes rows out of order: row N+1
(first ClassDefinitionContext) lands before row N (the late-emit
DefinedClassesContext). all_node_ids consequently mapped node_id N
to the LAST csr slot, shifting every per-node array index by one
from row N onward — manifesting as an explore TUI where the
class_table root appeared as a leaf parented under some
ClassEntryContext.constants while a downstream ClassDefinitionContext
mis-claimed the class_table root slot.

Sort all_node_ids before assigning csr indices and re-append -1
afterwards so the sentinel keeps the last slot. Also drop the
late-emit STDERR trace from EmitClassTableJob now that the
csr-vs-node-id mismatch is understood and fixed; keep the
silently-dropped-bucket warning, which is the user-actionable part.
The previous commit (d0f1077) fixes csr_idx assignment in
FfiCsrGraphSubstrate.loadFromBinary. Caches written before that fix
contain srcloc_refs computed against the broken tree topology — held_by
node_ids point at whatever the shifted parent walk landed on (e.g. an
unrelated ClassEntryContext for the class_table root), so reli:explore
keeps surfacing nonsense source locations even after the substrate is
fixed because the sidecar still validates by mtime/size/fingerprint
(the rmem itself is unchanged).

Bumping VERSION 2 → 3 forces recomputation on first open after upgrade.
When a class bucket walk throws (typically MemoryAddressNotInDumpException
on opcache-cached vendor classes), the existing diagnostic prints the
class name + exception kind but not the address that failed. Without
the hex address it is hard to tell whether the missing memory is in
opcache SHM (0x55…/0x56… /dev/zero VMA), in a library .data segment
(0x7f… file-backed mmap), or somewhere else entirely — which decides
whether --include-binary, the pagemap residency filter, or a different
dump strategy is the lever to pull.

Capture ce->address right after deref'ing the bucket so the catch can
print it as `<class> @ 0x… :  <ExceptionClass>: <message>`. Raise the
sample cap from 8 → 32 so a real-world target with dozens of drops
gets a useful spread instead of just the first eight.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants