
fix(mental-models): cap history array length to prevent jsonb overflow #1593

Open
cdbartholomew wants to merge 2 commits into main from fix/mental-model-history-cap

Conversation

cdbartholomew (Contributor) commented May 12, 2026

Problem

Each content-changing update to a mental model appends a full snapshot (previous_content + previous_reflect_response + changed_at) to the mental_models.history jsonb array. Two compounding issues:

1. Unbounded growth → jsonb 256 MB overflow. Without a cap, the array grows forever. PostgreSQL has a hard 256 MB limit on total jsonb array element size; once a row crosses it, every subsequent UPDATE fails:

ERROR: total size of jsonb array elements exceeds the maximum of 268435455 bytes
SQLSTATE: 54000

The mental model is permanently un-writable until the history column is manually trimmed at the DB level. Reachable in normal use: with reflect responses on the order of hundreds of KB and a workload that refreshes a small set of mental models repeatedly, the limit is hit in a few hundred refreshes. Once one tenant's row lands in this state, repeated UPDATE attempts materialize the 256 MB+ array on every retry — significant memory pressure on the primary, degraded availability for unrelated tenants on the same instance.

2. Per-update bloat → no HOT updates, persistent dead-tuple churn. Even with a cap in place, storing the full reflect_response payload in each snapshot pushes per-row size to ~22 MB at cap=50. That exceeds heap-page fit, so every UPDATE writes a full TOAST row and skips HOT. Every refresh leaves a ~22 MB dead tuple that has to be vacuumed; under sustained refresh load the dead-tuple backlog outruns autovacuum and the table balloons (observed: 570 MB / 43% dead tuples on a single tenant's mental_models table).
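A quick back-of-the-envelope check on the ~22 MB per-row figure (the 450 KB mid-range is an assumption drawn from the hundreds-of-KB reflect payloads described above):

```python
entry_kb = 450  # assumed mid-range size of one reflect_response snapshot
cap = 50        # default HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES

# Per-row history size at the cap, in MB
row_mb = entry_kb * cap / 1024
print(f"{row_mb:.1f} MB")  # ~22.0 MB, matching the figure above
```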

Fix

(a) Cap history length at write time. Trim to the most recent N entries via a single subquery on COALESCE(history, '[]'::jsonb) || $new::jsonb:

history = (
  SELECT COALESCE(jsonb_agg(elem ORDER BY idx), '[]'::jsonb)
  FROM jsonb_array_elements(
    COALESCE(history, '[]'::jsonb) || $new::jsonb
  ) WITH ORDINALITY a(elem, idx)
  WHERE idx > GREATEST(
    jsonb_array_length(COALESCE(history, '[]'::jsonb)) + 1 - N, 0
  )
)

New env var HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES controls N. Default 50 — well under the 256 MB ceiling even with hundreds-of-KB reflect responses, while preserving enough recent history for audit / rollback.
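A minimal sketch of how the cap might be read from the environment; the helper name and fallback handling are assumptions (the real config layer in hindsight-api may differ), but the variable name and default of 50 are as documented:

```python
import os

def history_max_entries(default: int = 50) -> int:
    """Read the history cap, falling back to the documented default of 50."""
    raw = os.environ.get("HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES")
    if raw is None:
        return default
    return int(raw)

# Override, as exercised in the test plan:
os.environ["HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES"] = "7"
print(history_max_entries())  # prints 7
```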

(b) Slim each history entry to {based_on: ...} only. The control-plane history view (mental-model-detail-modal.tsx) only reads previous_reflect_response.based_on; every other field in the reflect payload is unused. Store just that slice — per-entry size drops ~100x, rows fit on a heap page, HOT updates re-enable, dead tuples self-clean.
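A sketch of the slimming step (the function name is hypothetical; the semantics follow the test plan: payloads with no based_on field store None, and based_on: {} stores {based_on: {}}):

```python
def slim_snapshot(reflect_response):
    # Keep only the based_on slice of the reflect payload; everything else
    # is unused by the control-plane history view.
    if reflect_response is None or "based_on" not in reflect_response:
        return None
    return {"based_on": reflect_response["based_on"]}

slim_snapshot({"based_on": {}, "bulk": "x" * 500_000})  # -> {'based_on': {}}
slim_snapshot({"no_based_on_here": 1})                  # -> None
```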

Existing bulky rows rotate out naturally via the cap=50 ring buffer; no migration needed.

What this PR does NOT do

Rows already over the 256 MB ceiling pre-fix need a one-shot manual trim of their history column at the DB level. The SQL-side append in this PR cannot heal a row whose existing history is already too large to materialize in the jsonb engine — evaluating history || $new itself raises 54000. After the manual trim, this fix prevents recurrence.

A standalone migration that trims existing rows is intentionally NOT included here: the right cap depends on per-deployment tolerance and the migration would block on ACCESS EXCLUSIVE while trimming potentially-huge rows. Operators should do this as a targeted one-time DML if they've observed the issue.

Test plan

  • test_history_capped_to_max_entries: with max_entries=3, six content updates produce a 3-element history (most recent first: v5, v4, v3; v1 and v2 dropped).
  • test_history_snapshots_previous_reflect_response: assertions updated to expect the slim {based_on: ...} shape.
  • test_history_snapshots_omit_reflect_response_when_based_on_missing: covers reflect payloads with no based_on field (stored as None) and with based_on: {} (stored as {based_on: {}}).
  • Existing history tests cover unchanged ordering, snapshot, and gating behaviors — semantics untouched for short histories.
  • Lint clean on modified files.
  • Env-var override verified: HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES=7 reads as 7.
  • CI green
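The cap behavior from test_history_capped_to_max_entries can be modeled in a few lines (a Python sketch of the ring-buffer semantics, not the production SQL path; the newest-first ordering follows the test's expectations):

```python
def append_capped(history, snapshot, max_entries):
    # Newest-first ring buffer: prepend the new snapshot, then trim to the cap.
    return ([snapshot] + history)[:max_entries]

history = []
for i in range(6):  # six content updates snapshot v0..v5
    history = append_capped(history, {"previous_content": f"v{i}"}, 3)

print([e["previous_content"] for e in history])  # ['v5', 'v4', 'v3']
```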

Commit messages

Each content-changing update to a mental model appends a full snapshot
(previous_content + previous_reflect_response + changed_at) to the
`mental_models.history` jsonb array. Without a cap the array grows
unboundedly. Postgres has a hard 256MB limit on the total size of jsonb
array elements; once a row crosses it, every subsequent UPDATE to that
row fails with SQLSTATE 54000 ("total size of jsonb array elements
exceeds the maximum of 268435455 bytes") — the mental model becomes
permanently un-writable until the history is manually trimmed at the DB
level.

This is reachable in normal use: with reflect responses on the order of
hundreds of KB (common when the bank has many memories) and a workload
that refreshes a small set of mental models repeatedly, the limit is
hit in a few hundred refreshes.

Fix
---
Trim history to the most recent N entries at write time. The append
becomes a single subquery that takes the last N elements of
`COALESCE(history, '[]'::jsonb) || $new::jsonb` ordered by their array
index. New env var `HINDSIGHT_API_MENTAL_MODEL_HISTORY_MAX_ENTRIES`
controls N; default 50 (well under the 256MB ceiling even with large
reflect responses, while preserving enough recent history for audit /
rollback).

Rows already over the limit pre-fix need a one-shot manual trim of
their `history` column — the SQL-side append in this PR cannot heal a
row whose existing `history` is already too large to materialize in
the jsonb engine, because evaluating `history || $new` itself raises
54000. After the manual trim, this fix prevents recurrence.

Tests
-----
New `test_history_capped_to_max_entries`: with max_entries=3, six
content updates produce a 3-element history (most recent first: v5,
v4, v3 — v1 and v2 dropped). Existing history tests cover the unchanged
ordering, snapshot, and gating behaviors.

Docs
----
New row in `configuration.md`.

Each history entry previously stored the full reflect_response payload
(~400-500 KB), pushing per-row size to ~22 MB at the cap. That exceeds
heap-page fit, so every UPDATE writes a full TOAST row and skips HOT,
leaving a dead tuple that must be vacuumed.

The control-plane history view only reads previous_reflect_response.based_on;
everything else in the payload is unused. Store just that slice — per-entry
size drops ~100x, rows fit on a heap page, HOT updates re-enable, dead
tuples self-clean.

Existing bulky rows rotate out naturally via the cap=50 ring buffer.
