Summary
When tp_spill_finalize publishes a freshly-built L0 segment from the
in-flight memtable chain, the old chain pages are not WAL-touched
(they are simply unlinked from the metapage's memtable_head_blkno).
On disk they remain as unreferenced dead pages (free for queries, but
still occupying space) until something reclaims them. There is
currently no such reclaim path.
This is a storage leak, not a correctness issue: queries don't see
the orphans (the chain head moves), and stock PG WAL replay reaches
the same end-state on standbys. But under workloads with many
spills the relation file grows monotonically.
The path to reclamation is the same as for btree's deleted pages:
stamp each unlinked chain page with a merged_at_xid horizon on spill,
then have a future amvacuumcleanup pass FSM-recycle pages whose
horizon is older than the oldest xmin any snapshot can still hold
(RecentGlobalXmin in PostgreSQL 13 and earlier, the GlobalVisState
checks since PostgreSQL 14). The horizon needs to
be standby-safe, so the same mechanism is required by #(see standby
horizon issue) for displaced segment pages.
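A minimal sketch of the stamp-and-check step described above, written as
standalone C rather than against the real page layout. ChainPageOpaque,
chain_page_mark_unlinked, chain_page_is_recyclable, and the Xid typedef
are hypothetical names for illustration only. The caller is assumed to
supply an oldest_xmin that already covers snapshots on the primary and on
replicas. The comparison uses the same modulo-2^32 idea as PostgreSQL's
TransactionIdPrecedes but ignores the special permanent XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)

/* Hypothetical special-space contents for a memtable chain page. */
typedef struct ChainPageOpaque
{
    Xid merged_at_xid;              /* INVALID_XID while the page is linked */
} ChainPageOpaque;

/* Wraparound-aware "a is older than b" on 32-bit XIDs. */
bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/* Called on spill, for each page unlinked from the chain. */
void
chain_page_mark_unlinked(ChainPageOpaque *opaque, Xid spill_xid)
{
    assert(spill_xid != INVALID_XID);
    opaque->merged_at_xid = spill_xid;
}

/*
 * Called from the vacuum cleanup pass: a page may go back to the FSM
 * only if it was stamped and its stamp is strictly older than the
 * oldest xmin that any snapshot could still hold.
 */
bool
chain_page_is_recyclable(const ChainPageOpaque *opaque, Xid oldest_xmin)
{
    if (opaque->merged_at_xid == INVALID_XID)
        return false;               /* still linked, never recycle */
    return xid_precedes(opaque->merged_at_xid, oldest_xmin);
}
```

A page stamped at XID 90 is recyclable once oldest_xmin has advanced to
100, and not while it is still at 90 or earlier.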
Acceptance criteria
- After N successful spills with K chain pages each, on-disk relation
size grows by O(active chain pages), not O(N×K).
- amvacuumcleanup reclaims orphaned chain pages once their horizon
is past every active snapshot on primary and replicas.
- Reclaimed pages are returned to the FSM so subsequent
ExtendBufferedRel calls reuse them instead of growing the relation.
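The first and third criteria can be illustrated with a toy page-count
model. This is a sketch under strong simplifying assumptions, not a
description of the real allocator: RelModel, model_alloc_page, and
model_run are hypothetical names, segment pages are ignored, and a
cleanup pass is assumed to run (with the horizon already passed) between
every pair of spills.

```c
#include <stddef.h>

/* Toy relation: a page count plus a count of pages sitting in the FSM. */
typedef struct RelModel
{
    size_t nblocks;                 /* relation size in pages */
    size_t free_pages;              /* pages recorded as free in the FSM */
} RelModel;

/* Allocate one chain page: reuse a free page if any, else extend. */
void
model_alloc_page(RelModel *rel)
{
    if (rel->free_pages > 0)
        rel->free_pages--;
    else
        rel->nblocks++;
}

/*
 * Run n_spills spills of k chain pages each and return the final
 * relation size in pages. With recycle == 0 the unlinked pages are
 * leaked, which models the current behavior.
 */
size_t
model_run(size_t n_spills, size_t k, int recycle)
{
    RelModel rel = {0, 0};

    for (size_t i = 0; i < n_spills; i++)
    {
        /* Build the in-flight chain. */
        for (size_t j = 0; j < k; j++)
            model_alloc_page(&rel);

        /* Spill unlinks the chain; cleanup returns its pages to the FSM. */
        if (recycle)
            rel.free_pages += k;
    }
    return rel.nblocks;
}
```

With N = 100 and K = 8, the model ends at 800 pages without recycling and
at 8 pages with it. With a lagging horizon the bound would be a few
chains' worth of pages rather than exactly K, but still independent of N.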
Notes
- Documented in tp_spill_finalize's docstring.
- See also the matching issue for displaced segment pages during
merge — same horizon mechanism.