Problem Statement
main currently has three separate DFHeapSize implementations for Arc-backed values:
impl<T: DFHeapSize> DFHeapSize for Arc<T>
impl DFHeapSize for Arc<str>
impl DFHeapSize for Arc<dyn DFHeapSize>
Each implementation repeats the same shared-allocation deduplication control flow:
- Extract an allocation identity pointer.
- Insert pointer into
DFHeapSizeCtx.seen.
- Return
0 if already seen.
- Compute type-specific payload size.
This repetition duplicates the same core invariant in multiple places and increases risk that one implementation diverges from the others over time.
Why This Matters
DFHeapSize depends on a critical invariant: shared heap allocations must be counted once per traversal context. If Arc implementations drift, memory accounting becomes inconsistent and hard to reason about.
Consolidating the one-time-accounting logic improves:
- Correctness durability (single source of truth for Arc dedup behavior)
- Maintainability (less repeated logic)
- Reviewability (future changes touch one helper)
Current Code Locations
- datafusion/common/src/heap_size.rs
- Arc impl around line 284
- Arc impl around line 297
- Arc impl around line 310
Proposed Change
Introduce small internal helpers for Arc allocation identity and deduplication.
Deduplication helper:
- Input: allocation identity (
usize pointer), mutable DFHeapSizeCtx
- Output: whether this allocation should be counted now
Pseudo-shape:
fn count_allocation_once(ptr: usize, ctx: &mut DFHeapSizeCtx) -> bool
- returns
true on first sighting
- returns
false if already accounted
Pointer identity helpers:
- Keep pointer extraction explicit for sized, unsized, and trait-object Arc values.
- Reduce repetitive pointer-casting style with narrowly scoped helpers, for example:
fn arc_ptr<T>(arc: &Arc<T>) -> usize
fn arc_unsized_ptr<T: ?Sized>(arc: &Arc<T>) -> usize
Then update all Arc implementations to:
- Compute allocation identity via the appropriate helper.
- Call
count_allocation_once and early-return 0 when already seen.
- Keep each impl's payload-size math local and explicit.
Important Constraints
- Do not force one shared payload formula across Arc variants.
- Keep unsized and trait-object cases explicit (
Arc<str>, Arc<dyn DFHeapSize>).
- Preserve behavior exactly; this is a refactor, not a semantic change.
Non-Goals
- Changing ownership semantics or documented
DFHeapSize contract.
- Reworking non-Arc implementations.
- Broad API redesign.
Acceptance Criteria
- Arc implementations no longer duplicate
ctx.seen.insert control flow.
- Repetitive Arc pointer-casting style is reduced while preserving explicit handling for sized, unsized, and trait-object cases.
- Existing Arc-related tests continue to pass without behavior changes.
- Readability is improved: dedup invariant and allocation identity extraction are obvious and centralized.
- No regression in docs or test expectations for heap-size semantics.
Validation Plan
Run targeted tests:
cargo test -p datafusion-common heap_size --lib
cargo test -p datafusion-common --doc heap_size
If practical, ensure clone/dedup Arc tests still cover:
Related PR
#22358
Problem Statement
main currently has three separate
DFHeapSizeimplementations for Arc-backed values:impl<T: DFHeapSize> DFHeapSize for Arc<T>impl DFHeapSize for Arc<str>impl DFHeapSize for Arc<dyn DFHeapSize>Each implementation repeats the same shared-allocation deduplication control flow:
DFHeapSizeCtx.seen.0if already seen.This repetition duplicates the same core invariant in multiple places and increases risk that one implementation diverges from the others over time.
Why This Matters
DFHeapSizedepends on a critical invariant: shared heap allocations must be counted once per traversal context. If Arc implementations drift, memory accounting becomes inconsistent and hard to reason about.Consolidating the one-time-accounting logic improves:
Current Code Locations
Proposed Change
Introduce small internal helpers for Arc allocation identity and deduplication.
Deduplication helper:
usizepointer), mutableDFHeapSizeCtxPseudo-shape:
fn count_allocation_once(ptr: usize, ctx: &mut DFHeapSizeCtx) -> booltrueon first sightingfalseif already accountedPointer identity helpers:
fn arc_ptr<T>(arc: &Arc<T>) -> usizefn arc_unsized_ptr<T: ?Sized>(arc: &Arc<T>) -> usizeThen update all Arc implementations to:
count_allocation_onceand early-return0when already seen.Important Constraints
Arc<str>,Arc<dyn DFHeapSize>).Non-Goals
DFHeapSizecontract.Acceptance Criteria
ctx.seen.insertcontrol flow.Validation Plan
Run targeted tests:
cargo test -p datafusion-common heap_size --libcargo test -p datafusion-common --doc heap_sizeIf practical, ensure clone/dedup Arc tests still cover:
Related PR
#22358