fix(migrations): repair mental_models.subtype at current head (#1553)#1627

Merged
benfrank241 merged 1 commit into main from fix/mental-models-subtype-backfill-1553 on May 14, 2026

benfrank241 (Contributor) commented on May 14, 2026

Summary

Closes #1553.

Three production deployments — @mvessair-hive (OP), @4Lienau, @khanhduyvt0101 — report column "subtype" of relation "mental_models" does not exist on create_mental_model even though alembic_version is at m3rg3h3ad5f6 and the container logs Database migrations completed successfully. mental_models is stuck in a v3-shaped schema missing six v4 columns: subtype, description, entity_id, observations, links, last_updated.

Root cause (confirmed from git history)

The chain was retroactively edited to insert a new migration "behind" the v0.5.6 head. Databases that completed migrations on v0.5.6 advanced to a head that, at that time, did not have the column-add as an ancestor. When they later upgraded to v0.6.0/v0.6.1, the new chain claimed that migration was already applied — but it never actually ran.

The mechanism, step by step:

  1. v0.5.6 (2026-04-28) — head of the migration chain was i4j5k6l7m8n9, with down_revision = "8c6fa6f7230b". d5y6z7a8b9c0_backfill_mental_models_subtype did not exist in this release (it was on a feature branch, not yet merged into a tagged release).

  2. Any user who deployed v0.5.6 and let migrations complete ended up with alembic_version = i4j5k6l7m8n9, and mental_models in v3 shape — without subtype, because:

    • t5o6p7q8r9s0_rename_mental_models_to_observations (a prior ancestor) renamed reflections → mental_models, leaving the table without subtype
    • h3c4d5e6f7g8_mental_models_v4 used CREATE TABLE IF NOT EXISTS — a no-op on the already-existing renamed table — so the subtype column was never added
    • The intended remediation (d5y6z7a8b9c0) didn't exist yet
  3. On 2026-04-29, PR #1312 (76bcd931, fix(oracle): restore PG query semantics and clean up migration chain) rewrote i4j5k6l7m8n9's parent:

    - down_revision: str | Sequence[str] | None = "8c6fa6f7230b"
    + down_revision: str | Sequence[str] | None = "d5y6z7a8b9c0"

    The intent was to insert d5y6z7a8b9c0 into the chain so that any future migration run would apply it. But for databases already at or past i4j5k6l7m8n9, the insertion is invisible: alembic treats every ancestor of the current revision as "applied," and in the rewritten graph d5y6z7a8b9c0 is now an ancestor of i4j5k6l7m8n9.

  4. v0.6.0 (2026-05-05) and v0.6.1 (2026-05-08) shipped with the rewritten chain. When a stuck v0.5.6 user upgrades, alembic walks forward from i4j5k6l7m8n9 and applies migrations downstream — never going back to apply d5y6z7a8b9c0. The DB advances to m3rg3h3ad5f6 (the new head added in v0.6.0) but subtype is never added.

  5. End state: alembic_version = m3rg3h3ad5f6, mental_models missing six v4 columns, create_mental_model 500s.
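Step 2 hinges on CREATE TABLE IF NOT EXISTS being a no-op when a table of that name already exists, even if the existing table has fewer columns. The sketch below reproduces that behaviour with Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made for portability; the IF NOT EXISTS semantics are the same in this respect). The id and name columns are hypothetical placeholders, not the project's actual v3 schema.

```python
import sqlite3


def columns_after_v4_create() -> list[str]:
    """Rename reflections -> mental_models, then run a v4-style CREATE TABLE IF NOT EXISTS."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical v3-shaped table: no subtype column.
    conn.execute("CREATE TABLE reflections (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("ALTER TABLE reflections RENAME TO mental_models")
    # v4-style create: skipped entirely because the table name already exists.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mental_models "
        "(id INTEGER PRIMARY KEY, name TEXT, subtype TEXT)"
    )
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    cols = [row[1] for row in conn.execute("PRAGMA table_info(mental_models)")]
    conn.close()
    return cols


if __name__ == "__main__":
    print(columns_after_v4_create())
```

The renamed table keeps its original shape, so any later statement that references subtype fails, which matches the symptom reported in #1553.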

Three confirmed reports are consistent with this: v0.5.6 was the current release for one week (2026-04-28 to 2026-05-05), and any deployment that completed migrations during that window hits the trap on upgrade to 0.6.x.

Lesson: rewriting a migration's down_revision to insert a new migration retroactively does not retroactively apply it to databases past that point. Net-new migrations should be added as a new head, not spliced into the middle of an existing chain.
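The lesson can be illustrated with a toy model of how pending revisions are chosen. This is not Alembic's implementation, only a simplified sketch: each revision has a single down_revision, the intermediate revisions between i4j5k6l7m8n9 and m3rg3h3ad5f6 are collapsed, and the parent of d5y6z7a8b9c0 is assumed to be 8c6fa6f7230b for illustration.

```python
def ancestors(graph: dict, rev) -> set:
    """Return rev plus every revision reachable through down_revision links."""
    seen = set()
    stack = [rev]
    while stack:
        cur = stack.pop()
        if cur is None or cur in seen:
            continue
        seen.add(cur)
        stack.append(graph[cur])
    return seen


def pending(graph: dict, current, head) -> set:
    """Revisions an upgrade would run: ancestors of head not already behind current."""
    return ancestors(graph, head) - ancestors(graph, current)


# Rewritten chain shipped in v0.6.x (simplified, see assumptions above).
AFTER = {
    "8c6fa6f7230b": None,
    "d5y6z7a8b9c0": "8c6fa6f7230b",
    "i4j5k6l7m8n9": "d5y6z7a8b9c0",
    "m3rg3h3ad5f6": "i4j5k6l7m8n9",
}
# Same chain with the repair migration appended as a new head.
FIXED = dict(AFTER, **{"86f7a033d372": "m3rg3h3ad5f6"})

if __name__ == "__main__":
    # A v0.5.6 database sits at i4j5k6l7m8n9: the spliced revision is skipped.
    print(pending(AFTER, "i4j5k6l7m8n9", "m3rg3h3ad5f6"))
    # A stuck database at m3rg3h3ad5f6 does pick up the new head.
    print(pending(FIXED, "m3rg3h3ad5f6", "86f7a033d372"))
```

In this model a fresh database (current of None) still runs d5y6z7a8b9c0, which is why the splice looked correct on new installs while leaving existing ones untouched.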

Fix

Add a new migration at the current head — 86f7a033d372, down_revision = "m3rg3h3ad5f6" — so every affected deployment picks it up on next container start. It mirrors the column-add block from d5y6z7a8b9c0 and is fully idempotent:

  • Each ALTER TABLE uses ADD COLUMN IF NOT EXISTS
  • The whole block is wrapped in DO $$ ... IF EXISTS (information_schema.tables ...) ... END $$; so it skips databases that predate the mental_models table
  • UPDATE ... SET subtype = 'structural' WHERE subtype = 'directive' normalizes any leftover 'directive' rows from the o0j1k2l3m4n5 directive-only branch so the CHECK constraint add succeeds
  • The CHECK constraint is recreated with the canonical v4 allowlist ('structural', 'emergent', 'pinned', 'learned') — same value set as d5y6z7a8b9c0 and h3c4d5e6f7g8

Databases that already received the columns from d5y6z7a8b9c0 see this as a no-op. Databases stuck at m3rg3h3ad5f6 without the columns get them on next migration run.
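A rough sketch of what such a head-anchored repair migration could look like is shown below. The revision identifiers and the CHECK allowlist come from this description; the column types, defaults, and the constraint name ck_mental_models_subtype are assumptions made for illustration, and the index recreation and the project's dialect dispatcher are omitted. It should not be read as the merged file.

```python
"""Sketch of an idempotent repair migration added at the current head (PostgreSQL only)."""

revision = "86f7a033d372"
down_revision = "m3rg3h3ad5f6"

# ASSUMPTION: column types, defaults, and the constraint name are illustrative
# guesses. The PR text names the six columns but not their definitions.
REPAIR_SQL = """
DO $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM information_schema.tables
        WHERE table_schema = current_schema() AND table_name = 'mental_models'
    ) THEN
        ALTER TABLE mental_models ADD COLUMN IF NOT EXISTS subtype TEXT NOT NULL DEFAULT 'structural';
        ALTER TABLE mental_models ADD COLUMN IF NOT EXISTS description TEXT;
        ALTER TABLE mental_models ADD COLUMN IF NOT EXISTS entity_id TEXT;
        ALTER TABLE mental_models ADD COLUMN IF NOT EXISTS observations JSONB;
        ALTER TABLE mental_models ADD COLUMN IF NOT EXISTS links JSONB;
        ALTER TABLE mental_models ADD COLUMN IF NOT EXISTS last_updated TIMESTAMPTZ;

        -- Normalize leftover rows so the CHECK constraint can be added.
        UPDATE mental_models SET subtype = 'structural' WHERE subtype = 'directive';

        ALTER TABLE mental_models DROP CONSTRAINT IF EXISTS ck_mental_models_subtype;
        ALTER TABLE mental_models ADD CONSTRAINT ck_mental_models_subtype
            CHECK (subtype IN ('structural', 'emergent', 'pinned', 'learned'));
    END IF;
END $$;
"""


def upgrade() -> None:
    # Imported lazily so this sketch can be inspected without Alembic installed;
    # a real migration file would normally import op at module level.
    from alembic import op

    op.execute(REPAIR_SQL)


def downgrade() -> None:
    # Deliberately a no-op in this sketch: dropping the columns would destroy data.
    pass
```

Because every statement is guarded (IF EXISTS on the table, IF NOT EXISTS on each column, DROP CONSTRAINT IF EXISTS before the add), running the block twice leaves the schema unchanged the second time.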

Scope

PG-only. Oracle's baseline (o1a2b3c4d5e6_oracle_baseline.py) creates mental_models with its own topology and chk_mm_subtype CHECK (subtype IN ('directive', 'pinned')), so this PG-shaped repair does not apply there. The run_for_dialect(pg=_pg_upgrade) dispatcher omits the Oracle slot intentionally per the Dialect-only migrations guidance in CLAUDE.md.
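The idea of a dialect-only migration can be sketched with a small dispatcher. The function below is a hypothetical stand-in named dispatch_for_dialect, not the project's run_for_dialect helper, whose signature is not shown here beyond the pg= keyword. In a real migration the dialect name would typically come from the bound connection (for example op.get_bind().dialect.name in Alembic).

```python
from typing import Callable, Optional


def dispatch_for_dialect(
    dialect_name: str,
    *,
    pg: Optional[Callable[[], None]] = None,
    oracle: Optional[Callable[[], None]] = None,
) -> bool:
    """Run the handler registered for dialect_name; do nothing if the slot is empty.

    Returns True when a handler ran, False when the migration was a no-op.
    """
    handlers = {"postgresql": pg, "oracle": oracle}
    handler = handlers.get(dialect_name)
    if handler is None:
        return False
    handler()
    return True


if __name__ == "__main__":
    ran = []
    # Only the PostgreSQL slot is filled, mirroring a PG-only repair.
    dispatch_for_dialect("postgresql", pg=lambda: ran.append("pg"))
    dispatch_for_dialect("oracle", pg=lambda: ran.append("pg"))
    print(ran)
```

Leaving the Oracle slot empty makes the revision advance on Oracle without touching its schema, which keeps a single revision history across both backends.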

Test plan

  • uv run pytest tests/test_migration_shape.py — 64/64 passed including the new file's parametrized case
  • uv run ruff check on the new migration — clean
  • Empirical reproduction of the stuck state and verification of the fix: spun up a fresh pgvector container, applied the full canonical chain, dropped the six v4 columns and the CHECK constraint, rewound alembic_version to m3rg3h3ad5f6, and confirmed that INSERT INTO mental_models (..., subtype, ...) reproduces the exact 500 error from #1553. Then ran alembic upgrade heads. Result: DB advances to 86f7a033d372, all six columns present, CHECK constraint recreated, index recreated, INSERT succeeds, subtype='structural' accepted, subtype='INVALID' rejected.
  • Idempotency: re-running alembic upgrade heads on the already-fixed DB is a no-op and doesn't disturb existing rows.
  • Directive-row rewrite: pre-seeded mental_models with rows where subtype='directive' (simulating databases that came through the o0j1k2l3m4n5 branch). Migration normalizes them to 'structural' so the CHECK constraint applies cleanly. No data loss.
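The column check from the reproduction above can be expressed as a small helper. This is a minimal sketch: the six column names and the revision id come from this description, while the query text and the sample v3 column names (id, name) are assumptions for illustration. The helper works on plain values, so it can be fed from any database client.

```python
V4_COLUMNS = ("subtype", "description", "entity_id", "observations", "links", "last_updated")

# Query a client could run to list the table's columns (PostgreSQL).
COLUMNS_QUERY = (
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_name = 'mental_models'"
)


def missing_v4_columns(existing_columns) -> list[str]:
    """Return the v4 columns absent from the table, in canonical order."""
    present = set(existing_columns)
    return [col for col in V4_COLUMNS if col not in present]


def is_stuck(alembic_version: str, existing_columns) -> bool:
    """True when the DB reports the old head but still lacks v4 columns."""
    return alembic_version == "m3rg3h3ad5f6" and bool(missing_v4_columns(existing_columns))


if __name__ == "__main__":
    v3_shape = ["id", "name"]  # hypothetical v3 column names
    print(missing_v4_columns(v3_shape))
    print(is_stuck("m3rg3h3ad5f6", v3_shape))
```

After the repair migration runs, the same check should report no missing columns and a revision of 86f7a033d372.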

CI

Two unrelated failures on the run:

  • test-python-client-oracle: ORA-01659: unable to allocate MINEXTENTS beyond 1 in tablespace HINDSIGHT_TS. Infrastructure issue in the Oracle CI container's tablespace, unrelated to this PR. The migration is run_for_dialect(pg=_pg_upgrade) so it's a no-op on Oracle.
  • LLM acceptance (gemini/gemini-2.5-flash-lite): transient Gemini API flake (the 1m58s runtime suggests an early API failure, not a real test failure). A similar flake was seen on PR #1582 (feat(claude-agent-sdk): add Claude Agent SDK integration).

test-api (the Postgres pytest job that actually runs the migration) passed in 17m40s. All Docker builds, all four Python version matrix builds, all client SDK builds, and all doc-example test matrices passed.

Note for affected users

After the next release containing this PR, the migration runs automatically on container start and adds the missing columns + CHECK constraint + index. No manual SQL needed. Existing user-curated mental models are preserved (each row gets subtype = 'structural' by default).

Three production deployments (issue #1553, plus confirmations from
@4Lienau and @khanhduyvt0101) report `column "subtype" of relation
"mental_models" does not exist` on `create_mental_model`, despite their
alembic_version showing the current head `m3rg3h3ad5f6`.

Both h3c4d5e6f7g8_mental_models_v4 (which uses `CREATE TABLE IF NOT EXISTS`
and is a no-op on databases that came through the reflections rename) and
d5y6z7a8b9c0_backfill_mental_models_subtype were meant to ensure the
column exists, but on these specific deployments neither fired
successfully — likely a casualty of the divergent-heads reorganization
that put d5y6z7a8b9c0 on a branch the affected DBs bypassed.

Add a new migration at the current head so every stuck deployment picks
it up on next container start. Idempotent (`ADD COLUMN IF NOT EXISTS`),
guarded by an existence check on the table, and matches the canonical v4
column set and CHECK allowlist from d5y6z7a8b9c0.

PG-only: Oracle's baseline creates mental_models with a different
topology and constraint shape, so this repair does not apply there.
benfrank241 merged commit debbd91 into main on May 14, 2026
137 of 140 checks passed
Development

Successfully merging this pull request may close these issues.

Migration chain stuck on m3rg3h3ad5f6 — alembic reports success but doesn't move head; create_mental_model 500s on missing subtype column
