fix(opensearch): repair bare orphan index on bootstrap without discarding populated indices (#36237)#36353

Merged
fabrizzio-dotCMS merged 4 commits into
mainfrom
issue-36237-os-orphan-bootstrap-mapping-repair
Jun 30, 2026

@fabrizzio-dotCMS fabrizzio-dotCMS commented Jun 29, 2026

Problem

Re-fix for #36237 (QA failed the prior PRs #36238/#36240 on TC-003).

The idempotent-bootstrap reuse path introduced in #36238 re-asserted the custom mapping via putMapping against an orphaned cluster index (one that exists in the cluster but is missing from the dotCMS index store). On a bare orphan this failed:

INFO  Bootstrap: OS index already exists, reusing and re-asserting mapping: ...working_….os
ERROR MappingOperationsOS - putMapping failed for index ...working_….os — HTTP 400   (×8)

Root cause: the content mapping references the custom analyzer my_analyzer, defined in os-content-settings.json. Analyzers are static index settings: they can be applied only at index creation (or while the index is closed), never through putMapping. A putMapping-only re-assert against a bare index is therefore rejected (analyzer [my_analyzer] not found), and the index is left half-mapped, with the dotCMS dynamic templates missing.
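
To make the rejection concrete, here is a toy model of the check, assuming the server refuses any mapping whose analyzer is not defined in the index settings. The class, the method, and the field name title are hypothetical; only my_analyzer comes from the description above, and none of this is OpenSearch or dotCMS code.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the validation; not OpenSearch or dotCMS code. */
final class AnalyzerReferenceCheck {

    /**
     * Returns the analyzers that the mapping references but the index settings
     * do not define. The model knows nothing about built-in analyzers.
     */
    static Set<String> missingAnalyzers(Set<String> definedInSettings,
                                        Map<String, String> analyzerByField) {
        Set<String> missing = new TreeSet<>(analyzerByField.values());
        missing.removeAll(definedInSettings);
        return missing;
    }

    public static void main(String[] args) {
        // "title" is a hypothetical field name; my_analyzer is named in the PR.
        Map<String, String> mapping = Map.of("title", "my_analyzer");

        // Bare orphan: created without the custom settings, so nothing is defined.
        System.out.println(missingAnalyzers(Set.of(), mapping));

        // Index created with the full settings: the reference resolves.
        System.out.println(missingAnalyzers(Set.of("my_analyzer"), mapping));
    }
}
```

A non-empty result corresponds to the HTTP 400 case: the mapping cannot be accepted until the settings that define the analyzer exist on the index.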

How the orphan arises

In the migration catch-up path the OS index name is derived deterministically by mirroring the ES name (working_T0cluster_X.working_T0.os), not generated with a fresh timestamp. If a prior bootstrap created that physical index but crashed before committing its VersionedIndices store pointer, the next restart re-derives the same name, finds it already in the cluster, and the create fails with resource_already_exists.

Fix

Decide the orphan's fate by document count, so a populated index is never discarded:

  • Empty orphan (0 docs) → delete and recreate from scratch, restoring full settings + base mapping + custom mapping. An empty index has no data and no reindex progress, so recreating it costs nothing operationally — and it's the only case putMapping-reuse could not repair.
  • Populated orphan (>0 docs), or unknown count → reuse in place, untouched (not deleted, not recreated, not remapped). A dotCMS-created index already carries the full mapping; deleting it would force a full reindex (hours, degraded/inconsistent search) — not justified to clean up an orphan. On any uncertainty (the count probe fails) we err toward reuse.

The delete only ever fires against a demonstrably empty orphan, and only in the bootstrap path for a store slot that is not registered — never against the active production index.
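
The decision rule above can be sketched as a small pure function. This is a minimal illustration, not the code in ContentletIndexAPIImpl: the class, enum, and method names are hypothetical, and an empty OptionalLong is assumed to stand for a failed count probe.

```java
import java.util.OptionalLong;

/** Hypothetical sketch of the orphan decision; names are illustrative only. */
final class OrphanDecision {

    enum Action { DELETE_AND_RECREATE, REUSE_IN_PLACE }

    /**
     * @param docCount document count of the orphan, or empty when the count
     *                 probe failed (assumed convention for this sketch)
     */
    static Action decide(OptionalLong docCount) {
        // Only a demonstrably empty orphan is safe to discard.
        if (docCount.isPresent() && docCount.getAsLong() == 0L) {
            return Action.DELETE_AND_RECREATE;
        }
        // Populated, or count unknown: err toward reuse, leave the index untouched.
        return Action.REUSE_IN_PLACE;
    }

    public static void main(String[] args) {
        System.out.println(decide(OptionalLong.of(0)));
        System.out.println(decide(OptionalLong.of(1_250_000)));
        System.out.println(decide(OptionalLong.empty()));
    }
}
```

Keeping the unknown case on the reuse side means a flaky count probe can never trigger a delete.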

Tests

  • Unit (ContentletIndexAPIImplBootstrapTest, 8/8): empty→recreate, populated→reuse-untouched, doc-count-probe-fails→reuse, empty-orphan-delete-fails→still-creates, missing→create, create-fails→no-mapping, existence-probe-fails→create, OS-tag propagation.
  • Integration (ContentletIndexAPIImplMigrationIntegrationTest): new regression IT creates a bare orphan against a real OpenSearch cluster, runs the bootstrap seam, and asserts the recreated index carries the dotCMS dynamic templates (template_1) and the my_analyzer setting. Verified green against OpenSearch:
    Tests run: 23, Failures: 0, Errors: 0, Skipped: 6  → BUILD SUCCESS
    ✅ bootstrap bare-orphan Phase 1 — recreated with full settings + mapping
    

Closes #36237

🤖 Generated with Claude Code

This PR fixes: #36237

…ding populated indices (#36237)

The idempotent-bootstrap reuse path from #36238 re-asserted the custom
mapping via putMapping against an orphaned cluster index. On a bare orphan
(no settings) this failed with HTTP 400 because the content mapping
references the custom analyzer `my_analyzer`, which is a static index
setting that can only be applied at creation time — leaving the index
half-mapped (QA TC-003).

Decide the orphan's fate by document count so a populated index is never
discarded:
- empty orphan (0 docs): delete and recreate from scratch, restoring full
  settings + base mapping + custom mapping. An empty index has no data and
  no reindex progress, so this costs nothing operationally.
- populated orphan (>0 docs) or unknown count: reuse in place, untouched.
  A dotCMS-created index already carries the full mapping; deleting it would
  force a full reindex (hours, degraded search). On any uncertainty (the
  count probe fails) we err toward reuse.

Covered by unit tests (empty->recreate, populated->reuse-untouched,
probe-fails->reuse, delete-fails->still-creates, OS tag propagation) and a
regression IT that recreates a bare orphan against a real OpenSearch cluster
and asserts the recreated index carries the dynamic templates and
my_analyzer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude Bot commented Jun 29, 2026

Claude finished @fabrizzio-dotCMS's task in 2m 22s


Rollback Safety Analysis

  • Read rollback-unsafe categories reference
  • Get full PR diff
  • Analyze diff against all unsafe categories
  • Apply appropriate label

Result: ✅ Safe To Rollback

This PR changes 3 files, all scoped to OpenSearch bootstrap logic and tests:

| File | Type |
| --- | --- |
| ContentletIndexAPIImpl.java | Bootstrap orphan-handling logic |
| ContentletIndexAPIImplBootstrapTest.java | Unit tests |
| ContentletIndexAPIImplMigrationIntegrationTest.java | Integration tests |

Analysis Against Every Unsafe Category

| Category | Verdict |
| --- | --- |
| C-1 Structural data model change | ✅ None: no DB schema changes, no runonce tasks |
| C-2 Elasticsearch/OpenSearch mapping change | ✅ None: the fix is about not calling putMapping on a bare orphan during bootstrap. No ES/OS index mapping fields, types, or analyzers are modified in the live index; the code changes only affect the bootstrap path for an un-registered (orphaned) cluster index. |
| C-3 Content JSON model version bump | ✅ None: no CURRENT_MODEL_VERSION or contentlet_as_json changes |
| C-4 DROP TABLE / DROP COLUMN | ✅ None |
| H-1 One-way data migration | ✅ None: the deletion path only targets a demonstrably empty orphan not registered in the store; no business data is transformed or lost |
| H-2 RENAME TABLE / COLUMN | ✅ None |
| H-3 PK restructuring | ✅ None |
| H-4 New content type field type | ✅ None |
| H-5 Binary storage provider change | ✅ None |
| H-6 DROP PROCEDURE / FUNCTION | ✅ None |
| H-7 NOT NULL column without default | ✅ None |
| H-8 VTL viewtool contract change | ✅ None |
| M-1 Non-broadening column type change | ✅ None |
| M-2 Push publishing bundle format | ✅ None |
| M-3 REST/GraphQL API contract change | ✅ None |
| M-4 OSGi public interface change | ✅ None |

Reasoning

The entire diff is contained within the bootstrap seam of ContentletIndexAPIImpl.createContentIndex() — a path that only runs during dotCMS startup when an OS index name collision is detected. No database schema is touched, no index mappings are altered on any existing live index, no data is transformed, and no API contracts are changed. Rolling back to N-1 simply reverts to the prior bootstrap behavior (reuse-and-putMapping), which is the very state that existed before these commits. No manual cluster intervention would be needed after rollback.


github-actions Bot commented Jun 29, 2026

🤖 dotBot Review (Bedrock)

Reviewed 3 file(s); 2 candidate(s) → 0 confirmed, 0 uncertain (unverified, kept for review).

✅ No issues found after verification.


us.deepseek.r1-v1:0 · Run: #28390053907 · tokens: in: 28972 · out: 11674 · total: 40646 · calls: 7 · est. ~$0.102

…cknowledged (#36237)

Addresses PR review: if the empty-orphan delete is not acknowledged the
index may still exist, so the follow-up create would throw
resource_already_exists — which bootstrapAndPointOS does not catch (it only
handles IOException), aborting bootstrap (the original #36237 failure mode).

Now, when the delete is not acknowledged, the empty orphan is reused in
place instead of attempting a doomed create. The index is empty so nothing
is lost, and a later clean restart recreates it properly once the cluster is
healthy. Unit test updated accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabrizzio-dotCMS

Addressed the 🟡 Medium finding (failed orphan deletion → create failure) in 47b6899.

The empty-orphan branch no longer falls through to a create after a non-acknowledged delete. If the delete isn't confirmed (so the index may still exist), the orphan is now reused in place instead of attempting a create that would throw resource_already_exists — which bootstrapAndPointOS doesn't catch (it only handles IOException) and would abort bootstrap, i.e. the exact original #36237 failure. The orphan is empty so nothing is lost, and a later clean restart recreates it properly once the cluster is healthy.

Unit test test_emptyOrphanDeleteFails_reusedInPlace_notRecreated updated to assert no create / no remap on a non-acknowledged delete. All 8 unit tests green.
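
The catch-scope problem described above can be reproduced in isolation. The sketch below assumes resource_already_exists surfaces as an unchecked exception; ResourceAlreadyExists, IndexCreator, and bootstrap are hypothetical stand-ins, not the real bootstrapAndPointOS or any client class.

```java
import java.io.IOException;

/** Hypothetical reproduction of a caller that handles only IOException. */
final class CatchScopeDemo {

    /** Stand-in for the unchecked exception raised on resource_already_exists (assumed). */
    static final class ResourceAlreadyExists extends RuntimeException {
        ResourceAlreadyExists(String index) {
            super("resource_already_exists: " + index);
        }
    }

    /** Minimal create operation that may fail with a checked I/O error. */
    interface IndexCreator {
        void create(String index) throws IOException;
    }

    /** Returns true on success and false when an IOException was handled. */
    static boolean bootstrap(IndexCreator creator, String index) {
        try {
            creator.create(index);
            return true;
        } catch (IOException e) {
            // Handled: transport-level failure.
            return false;
        }
        // An unchecked exception is not caught above and propagates to the caller.
    }
}
```

With this shape, a creator that throws IOException yields false, while one that throws ResourceAlreadyExists escapes bootstrap entirely, which is how a doomed create can abort startup.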

…n when delete is stuck (#36237)

Addresses the 🔴 Critical review finding: the previous revision returned
true (reused the bare orphan) when its delete was not acknowledged, which
would register an un-repairable half-mapped index in the store — bare orphans
have no custom analyzer and their mapping cannot be repaired in place.

Now, when the empty-orphan delete is not acknowledged, re-probe existence:
- index gone (delete took effect without an ack) -> recreate cleanly;
- index still present -> throw, failing bootstrap loudly with a clear message
  rather than recreating (would throw resource_already_exists) or registering
  a half-mapped index. This is an abnormal cluster state, not the orphan-name
  collision the method otherwise resolves.

Unit tests: split the unacked-delete case into index-gone->recreate and
index-still-present->fail-loud. 9/9 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fabrizzio-dotCMS

Addressed the 🔴 Critical finding (failed orphan deletion leaves broken index state) in 8b8da22.

You're right — the previous revision's return true on an unacknowledged delete would register a bare, un-repairable orphan (no custom analyzer → mapping cannot be fixed in place), leaving search permanently broken.

New behavior when the empty-orphan delete is not acknowledged — re-probe existence:

  • index is gone (delete took effect without an ack) → recreate cleanly;
  • index still present → throw and fail bootstrap loudly with a clear message, rather than recreating (would throw resource_already_exists) or registering a half-mapped index. This is an abnormal cluster state (not the orphan-name collision the method resolves), so failing loud, consistent with #36240 (fix(opensearch): fail loudly on migration init failure + surface ES shard failures), is the correct outcome.

Unit tests split into test_emptyOrphanDeleteUnacked_butIndexGone_recreates and test_emptyOrphanDeleteFails_indexStillExists_failsLoud. All 9 unit tests green.
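
The re-probe behavior described in this comment can be sketched as follows. The Cluster interface and every method name here are hypothetical simplifications for illustration; the actual change lives in ContentletIndexAPIImpl and is not reproduced here.

```java
/** Hypothetical sketch of handling an unacknowledged empty-orphan delete. */
final class UnackedDeleteHandling {

    /** Minimal, assumed view of the cluster operations the sketch needs. */
    interface Cluster {
        boolean delete(String index);               // true when the delete was acknowledged
        boolean exists(String index);               // existence probe
        void createWithFullSettings(String index);  // settings + base mapping + custom mapping
    }

    static void recreateEmptyOrphan(Cluster cluster, String index) {
        boolean acknowledged = cluster.delete(index);

        // Unacknowledged delete: re-probe before deciding anything.
        if (!acknowledged && cluster.exists(index)) {
            // Neither recreate (it would collide) nor register a half-mapped index.
            throw new IllegalStateException(
                "Empty orphan index '" + index + "' could not be deleted and still exists");
        }

        // Either the delete was acknowledged or the index is gone anyway.
        cluster.createWithFullSettings(index);
    }
}
```

The two unit-test cases named above map onto the two branches: index gone leads to a clean create, index still present leads to the exception.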

@fabrizzio-dotCMS fabrizzio-dotCMS added this pull request to the merge queue Jun 30, 2026
Merged via the queue into main with commit 9b62a88 Jun 30, 2026
101 of 132 checks passed
@fabrizzio-dotCMS fabrizzio-dotCMS deleted the issue-36237-os-orphan-bootstrap-mapping-repair branch June 30, 2026 05:41

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Development

Successfully merging this pull request may close these issues.

OpenSearch: index bootstrap not idempotent — orphaned cluster index aborts startup