Commit 9b62a88

fix(opensearch): repair bare orphan index on bootstrap without discarding populated indices (#36237) (#36353)
## Problem

Re-fix for **#36237** (QA failed the prior PRs #36238/#36240 on **TC-003**).

The idempotent-bootstrap reuse path introduced in #36238 re-asserted the custom mapping via `putMapping` against an orphaned cluster index (one that exists in the cluster but is missing from the dotCMS index store). On a **bare** orphan this failed:

```
INFO  Bootstrap: OS index already exists, reusing and re-asserting mapping: ...working_….os
ERROR MappingOperationsOS - putMapping failed for index ...working_….os — HTTP 400 (×8)
```

Root cause: the content mapping references the custom analyzer `my_analyzer`, defined in `os-content-settings.json`. Analyzers are **static** index settings that can only be applied at index creation, so a `putMapping`-only re-assert against a bare index is rejected (`analyzer [my_analyzer] not found`) and the index is left half-mapped, with the dotCMS dynamic templates missing.

## How the orphan arises

In the migration catch-up path the OS index name is **derived deterministically** by mirroring the ES name (`working_T0` → `cluster_X.working_T0.os`), not generated with a fresh timestamp. If a prior bootstrap created that physical index but crashed before committing its `VersionedIndices` store pointer, the next restart re-derives the **same** name, finds it already in the cluster, and the create fails with `resource_already_exists`.

## Fix

Decide the orphan's fate by **document count**, so a populated index is never discarded:

- **Empty orphan (0 docs)** → delete and recreate from scratch, restoring full **settings + base mapping + custom mapping**. An empty index has no data and no reindex progress, so recreating it costs nothing operationally, and it's the only case `putMapping`-reuse could not repair.
- **Populated orphan (>0 docs), or unknown count** → reuse in place, **untouched** (not deleted, not recreated, not remapped). A dotCMS-created index already carries the full mapping; deleting it would force a full reindex (hours, degraded/inconsistent search), which is not justified to clean up an orphan. On any uncertainty (the count probe fails) we err toward reuse.

The delete only ever fires against a **demonstrably empty** orphan, and only in the bootstrap path for a store slot that is not registered, never against the active production index.

## Tests

- **Unit** (`ContentletIndexAPIImplBootstrapTest`, 8/8): empty→recreate, populated→reuse-untouched, doc-count-probe-fails→reuse, empty-orphan-delete-fails→still-creates, missing→create, create-fails→no-mapping, existence-probe-fails→create, OS-tag propagation.
- **Integration** (`ContentletIndexAPIImplMigrationIntegrationTest`): new regression IT creates a **bare** orphan against a real OpenSearch cluster, runs the bootstrap seam, and asserts the recreated index carries the dotCMS dynamic templates (`template_1`) **and** the `my_analyzer` setting.

Verified green against OpenSearch:

```
Tests run: 23, Failures: 0, Errors: 0, Skipped: 6 → BUILD SUCCESS
✅ bootstrap bare-orphan Phase 1 — recreated with full settings + mapping
```

Closes #36237

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
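The document-count rule from the Fix section can be sketched in isolation. This is a minimal illustration, not the code from the commit: `OrphanPolicy`, `Action`, `probeCount`, and `decide` are hypothetical names, and the count probe is modelled as a plain `LongSupplier` rather than the real `ContentletIndexOperations` call. The `-1` sentinel mirrors the fallback value used in the diff.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper illustrating the "decide by document count" rule. */
final class OrphanPolicy {

    /** Illustrative outcome names; the real method returns a boolean or falls through to create. */
    enum Action { DELETE_AND_RECREATE, REUSE_IN_PLACE }

    /** Sentinel for "count probe failed" (assumed to match the -1L fallback in the diff). */
    static final long UNKNOWN = -1L;

    private OrphanPolicy() { }

    /** Runs the probe and maps any runtime failure to UNKNOWN, so uncertainty never leads to a delete. */
    static long probeCount(final LongSupplier probe) {
        try {
            return probe.getAsLong();
        } catch (final RuntimeException e) {
            return UNKNOWN;
        }
    }

    /** Only a count of exactly zero allows the destructive path; anything else is reused untouched. */
    static Action decide(final long docCount) {
        return docCount == 0L ? Action.DELETE_AND_RECREATE : Action.REUSE_IN_PLACE;
    }
}
```

Calling `OrphanPolicy.decide(OrphanPolicy.probeCount(probe))` with a probe that throws yields `REUSE_IN_PLACE`, which is the "err toward reuse" behaviour described above. Testing for `!= 0` rather than `> 0` is what lets the failure sentinel share the safe branch with populated indices.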
1 parent e3aa226 commit 9b62a88

3 files changed

Lines changed: 294 additions & 26 deletions

dotCMS/src/main/java/com/dotcms/content/elasticsearch/business/ContentletIndexAPIImpl.java

Lines changed: 81 additions & 12 deletions
@@ -655,15 +655,35 @@ public synchronized boolean createContentIndex(final String indexName, final int
      * Create an index exclusively in one of the SE Providers.
      *
      * <p><b>Idempotent bootstrap.</b> If the physical index already exists in the target
-     * cluster it is reused instead of issuing a create. This guards against an orphaned
-     * cluster index — present in the cluster but missing from the index store — left behind
-     * when a previous bootstrap created the index but never committed its store pointer
-     * (e.g. the OS {@code VersionedIndices} row, or after a partial/interrupted startup).
-     * Without this guard the restart re-derives the same logical name, the create fails with
-     * {@code resource_already_exists}, and {@code checkAndInitializeIndex()} aborts — leaving
-     * the instance half-initialised. The custom mapping is (re)applied either way
-     * ({@code putMapping} is additive/idempotent), so a previously unmapped orphan is repaired,
-     * and the caller's {@code point()} re-registers the index in the store.</p>
+     * cluster it is an <em>orphan</em> — present in the cluster but missing from the index
+     * store — left behind when a previous bootstrap created the index but never committed its
+     * store pointer (e.g. the OS {@code VersionedIndices} row, or after a partial/interrupted
+     * startup). Without handling this, the restart re-derives the same logical name, the create
+     * fails with {@code resource_already_exists}, and {@code checkAndInitializeIndex()} aborts —
+     * leaving the instance half-initialised.</p>
+     *
+     * <p>The orphan is handled by document count, so a populated index is never discarded:</p>
+     * <ul>
+     * <li><b>Empty orphan (0 docs)</b> — deleted and recreated from scratch. In-place reuse
+     * cannot fully repair a bare orphan: the content mapping references a custom analyzer
+     * ({@code my_analyzer}) defined in the provider settings file, and analyzers are
+     * <em>static</em> index settings that can only be applied at creation time — so a
+     * {@code putMapping}-only re-assert against a bare orphan fails with {@code HTTP 400}
+     * (analyzer not found) and leaves the index half-mapped (issue #36237, QA TC-003). An
+     * empty index has no data and no reindex progress, so recreating it costs nothing
+     * operationally and yields a clean index with full settings + base mapping. If the
+     * delete cannot be confirmed and the index is still present, bootstrap fails loudly
+     * rather than register a half-mapped index.</li>
+     * <li><b>Populated orphan (&gt; 0 docs), or count unknown</b> — reused in place, untouched.
+     * A populated orphan was created by dotCMS itself, so it already carries the full
+     * settings + base mapping + custom mapping; nothing needs to be (re)applied. The index is
+     * never deleted here: discarding it would throw away its contents (including partial
+     * reindex progress) and force a full reindex, which can run for hours and degrade search
+     * consistency — not justified to clean up an orphan. On any uncertainty (the count probe
+     * fails) we err toward reuse for the same reason.</li>
+     * </ul>
+     *
+     * <p>The caller's {@code point()} then registers the index in the store.</p>
      *
      * @param indexName logical index name (no cluster prefix, no vendor tag)
      * @param shards number of shards to create with (ignored when the index already exists)
@@ -714,11 +734,60 @@ boolean createContentIndex(final String indexName, final int shards, final Index
                             + e.getMessage(), e))
                 .getOrElse(false);
         if (alreadyExists) {
+            // Orphan: exists in cluster, missing from store (see method javadoc). Decide by doc
+            // count so a populated index — including partial reindex progress — is never discarded.
+            // The count probe is best-effort: any failure is treated as "has data" (-1) so we err
+            // toward reuse and never delete on uncertainty.
+            final long docCount = Try.of(() -> ops.getIndexDocumentCount(physicalName))
+                    .onFailure(e -> Logger.warn(this,
+                            "Orphan doc-count probe failed for " + physicalName
+                                    + " — treating as populated and reusing in place: "
+                                    + e.getMessage(), e))
+                    .getOrElse(-1L);
+
+            if (docCount != 0L) {
+                // Populated (or unknown): reuse in place, untouched. A dotCMS-created index already
+                // carries the full settings + base mapping + custom mapping, so nothing needs to be
+                // (re)applied. Deleting it would force a full reindex (hours, degraded search) —
+                // not justified to clean up an orphan.
+                Logger.info(this, String.format(
+                        "Bootstrap: orphaned %s index found with %s document(s); reusing in place"
+                                + " (not deleting, not remapping): %s",
+                        tag, docCount < 0 ? "an unknown number of" : docCount, physicalName));
+                return true;
+            }
+
+            // Empty orphan: delete so the create below rebuilds a clean index with full settings +
+            // base mapping. An empty index has no data and no reindex progress, so this is safe and
+            // costs nothing operationally (issue #36237 — repairs a bare orphan that reuse cannot).
             Logger.info(this, String.format(
-                    "Bootstrap: %s index already exists, reusing and re-asserting mapping: %s",
+                    "Bootstrap: empty orphaned %s index found (in cluster, missing from store);"
+                            + " deleting and recreating with full settings + mapping: %s",
                     tag, physicalName));
-            helper.addCustomMapping(List.of(indexName), tag);
-            return true;
+            final boolean deleted = Try.of(() -> providerApi.delete(physicalName))
+                    .onFailure(e -> Logger.warn(this,
+                            "Failed to delete empty orphaned index " + physicalName
+                                    + ": " + e.getMessage(), e))
+                    .getOrElse(false);
+            if (!deleted) {
+                // Delete not acknowledged. Re-probe: it may have taken effect without an ack, in
+                // which case we can still recreate cleanly. If the index is genuinely still there
+                // we must NOT proceed — recreating would throw resource_already_exists, and reusing
+                // it would register a bare orphan whose mapping cannot be repaired (the custom
+                // analyzer is a create-time-only setting). Fail loud instead of leaving a
+                // half-mapped index in the store. This is an abnormal cluster state, not the
+                // orphan-name collision this method otherwise resolves.
+                final boolean stillExists = Try.of(() -> providerApi.indexExists(physicalName))
+                        .getOrElse(true);
+                if (stillExists) {
+                    throw new IOException("Empty orphaned " + tag + " index " + physicalName
+                            + " could not be deleted and still exists; aborting bootstrap to avoid"
+                            + " registering a half-mapped index. Check the search cluster health"
+                            + " and restart.");
+                }
+                Logger.warn(this, "Empty orphaned index " + physicalName + " delete was not"
+                        + " acknowledged, but the index is gone; proceeding to recreate.");
+            }
         }

         final boolean contentIndex = ops.createContentIndex(physicalName, shards);
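
The unacknowledged-delete handling in the hunk above can be reduced to a small standalone sketch. The names `EmptyOrphanCleanup`, `deleteOrFail`, and `attempt` are hypothetical, and the provider calls are modelled as `BooleanSupplier`s instead of the real `IndexAPI`; only the control flow is meant to match the diff.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating the delete / re-probe / fail-loud sequence for an empty orphan. */
final class EmptyOrphanCleanup {

    private EmptyOrphanCleanup() { }

    /**
     * Returns normally when it is safe to issue the create. Throws when the delete was not
     * acknowledged and the index is (or must be assumed to be) still present.
     */
    static void deleteOrFail(final String physicalName,
                             final BooleanSupplier delete,
                             final BooleanSupplier exists) throws IOException {
        // A thrown delete is treated the same as an unacknowledged one.
        if (attempt(delete, false)) {
            return;
        }
        // Re-probe. If the probe itself fails, assume the index still exists (the safe default).
        if (attempt(exists, true)) {
            throw new IOException("Empty orphaned index " + physicalName
                    + " could not be deleted and still exists");
        }
        // Delete took effect without an acknowledgement: the caller may recreate.
    }

    /** Runs the call and substitutes the given fallback on any runtime failure. */
    private static boolean attempt(final BooleanSupplier call, final boolean onFailure) {
        try {
            return call.getAsBoolean();
        } catch (final RuntimeException e) {
            return onFailure;
        }
    }
}
```

The two fallbacks point in opposite directions on purpose: a failed delete defaults to "not deleted" and a failed existence probe defaults to "still there", so every ambiguous outcome ends in the exception rather than in a create that would collide with the existing index.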

dotCMS/src/test/java/com/dotcms/content/elasticsearch/business/ContentletIndexAPIImplBootstrapTest.java

Lines changed: 141 additions & 13 deletions
@@ -2,6 +2,7 @@

 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
 import static org.mockito.ArgumentMatchers.anyInt;
 import static org.mockito.ArgumentMatchers.anyString;
 import static org.mockito.Mockito.mock;
@@ -30,12 +31,20 @@
  *
  * <p>The behaviour under test:</p>
  * <ul>
- * <li>an index already present in the cluster is <b>reused</b> (no create) and its mapping is
- * re-asserted — the orphaned-index repair path;</li>
+ * <li>an <b>empty</b> orphan (0 docs) is <b>deleted and recreated</b> from scratch (full settings
+ * + base mapping + custom mapping) — reusing in place cannot repair a bare orphan whose
+ * static custom analyzer can only be set at creation time (#36237);</li>
+ * <li>a <b>populated</b> orphan (&gt; 0 docs) is <b>reused untouched</b> — never deleted,
+ * recreated, or remapped, so its data (and any reindex progress) is preserved;</li>
+ * <li>a failing doc-count probe is treated as "populated" — the orphan is reused, never deleted;</li>
  * <li>a missing index is created and, on success, mapped;</li>
  * <li>a failed create does not apply a mapping;</li>
  * <li>a failing existence probe is treated as "does not exist", so bootstrap falls through to
- * the create path instead of aborting.</li>
+ * the create path instead of aborting;</li>
+ * <li>an empty orphan whose delete is unacknowledged but is actually gone recreates cleanly;</li>
+ * <li>an empty orphan whose delete fails while the index is still present fails loudly, rather
+ * than recreating (would throw {@code resource_already_exists}) or registering a bare,
+ * un-repairable index.</li>
  * </ul>
  */
 public class ContentletIndexAPIImplBootstrapTest {
@@ -60,29 +69,142 @@ private static ContentletIndexAPIImpl newApi() {
     }

     /**
-     * Given : the physical index already exists in the target cluster (an orphaned cluster index
-     * left behind by a previous bootstrap that never committed its store pointer).
+     * Given : an EMPTY orphaned index (0 docs) exists in the cluster but is missing from the store
+     * (left by a previous bootstrap that never committed its store pointer).
      * When : createContentIndex() runs during bootstrap.
-     * Then : the index is reused (no create is issued), the custom mapping is re-asserted to
-     * repair a possibly-unmapped orphan, and the method returns true.
+     * Then : the empty orphan is deleted and recreated from scratch (full settings + base mapping),
+     * the custom mapping is applied to the clean index, and the method returns true.
      */
     @Test
-    public void test_orphanIndexExists_reusesAndReassertsMapping_skipsCreate() throws IOException {
+    public void test_emptyOrphan_deletedAndRecreated_withFullMapping() throws IOException {
         final ContentletIndexOperations ops = mock(ContentletIndexOperations.class);
         final IndexAPI providerApi = mock(IndexAPI.class);
         final MappingHelper helper = mock(MappingHelper.class);

         when(ops.toPhysicalName(LOGICAL_NAME)).thenReturn(PHYSICAL_NAME);
         when(providerApi.indexExists(PHYSICAL_NAME)).thenReturn(true);
+        when(ops.getIndexDocumentCount(PHYSICAL_NAME)).thenReturn(0L);
+        when(providerApi.delete(PHYSICAL_NAME)).thenReturn(true);
+        when(ops.createContentIndex(PHYSICAL_NAME, SHARDS)).thenReturn(true);

         final boolean result = newApi()
                 .createContentIndex(LOGICAL_NAME, SHARDS, IndexTag.ES, ops, providerApi, helper);

-        assertTrue("Existing (orphaned) index must be reused and reported as available", result);
+        assertTrue("Empty orphan must be recreated and reported as available", result);
+        verify(providerApi).delete(PHYSICAL_NAME);
+        verify(ops).createContentIndex(PHYSICAL_NAME, SHARDS);
+        verify(helper).addCustomMapping(List.of(LOGICAL_NAME), IndexTag.ES);
+    }
+
+    /**
+     * Given : a POPULATED orphaned index (&gt; 0 docs) exists in the cluster but is missing from
+     * the store.
+     * When : createContentIndex() runs during bootstrap.
+     * Then : the orphan is reused in place, untouched — it is NOT deleted, NOT recreated, and its
+     * mapping is NOT re-applied (a dotCMS-created index already carries the full mapping).
+     * Discarding it would force a costly full reindex. The method returns true.
+     */
+    @Test
+    public void test_populatedOrphan_reusedInPlace_notDeletedNotRemapped() throws IOException {
+        final ContentletIndexOperations ops = mock(ContentletIndexOperations.class);
+        final IndexAPI providerApi = mock(IndexAPI.class);
+        final MappingHelper helper = mock(MappingHelper.class);
+
+        when(ops.toPhysicalName(LOGICAL_NAME)).thenReturn(PHYSICAL_NAME);
+        when(providerApi.indexExists(PHYSICAL_NAME)).thenReturn(true);
+        when(ops.getIndexDocumentCount(PHYSICAL_NAME)).thenReturn(42L);
+
+        final boolean result = newApi()
+                .createContentIndex(LOGICAL_NAME, SHARDS, IndexTag.ES, ops, providerApi, helper);
+
+        assertTrue("Populated orphan must be reused and reported as available", result);
+        verify(providerApi, never()).delete(PHYSICAL_NAME);
         verify(ops, never()).createContentIndex(anyString(), anyInt());
+        verify(helper, never()).addCustomMapping(List.of(LOGICAL_NAME), IndexTag.ES);
+    }
+
+    /**
+     * Given : an orphan exists but the document-count probe fails (e.g. transient cluster error).
+     * When : createContentIndex() runs during bootstrap.
+     * Then : the uncertainty is treated as "has data" — the orphan is reused in place and never
+     * deleted, so a possibly-populated index is never discarded on a flaky probe.
+     */
+    @Test
+    public void test_orphanDocCountProbeFails_treatedAsPopulated_reused() throws IOException {
+        final ContentletIndexOperations ops = mock(ContentletIndexOperations.class);
+        final IndexAPI providerApi = mock(IndexAPI.class);
+        final MappingHelper helper = mock(MappingHelper.class);
+
+        when(ops.toPhysicalName(LOGICAL_NAME)).thenReturn(PHYSICAL_NAME);
+        when(providerApi.indexExists(PHYSICAL_NAME)).thenReturn(true);
+        when(ops.getIndexDocumentCount(PHYSICAL_NAME))
+                .thenThrow(new RuntimeException("count unavailable"));
+
+        final boolean result = newApi()
+                .createContentIndex(LOGICAL_NAME, SHARDS, IndexTag.ES, ops, providerApi, helper);
+
+        assertTrue("Unknown doc count must be reused (never deleted)", result);
+        verify(providerApi, never()).delete(PHYSICAL_NAME);
+        verify(ops, never()).createContentIndex(anyString(), anyInt());
+    }
+
+    /**
+     * Given : an EMPTY orphan whose delete is not acknowledged, but a re-probe shows the index is
+     * actually gone (the delete took effect without an ack).
+     * When : createContentIndex() runs during bootstrap.
+     * Then : it recreates cleanly — the create is issued and the mapping applied.
+     */
+    @Test
+    public void test_emptyOrphanDeleteUnacked_butIndexGone_recreates() throws IOException {
+        final ContentletIndexOperations ops = mock(ContentletIndexOperations.class);
+        final IndexAPI providerApi = mock(IndexAPI.class);
+        final MappingHelper helper = mock(MappingHelper.class);
+
+        when(ops.toPhysicalName(LOGICAL_NAME)).thenReturn(PHYSICAL_NAME);
+        // exists at first (orphan probe) → gone after the unacked delete (re-probe)
+        when(providerApi.indexExists(PHYSICAL_NAME)).thenReturn(true, false);
+        when(ops.getIndexDocumentCount(PHYSICAL_NAME)).thenReturn(0L);
+        when(providerApi.delete(PHYSICAL_NAME)).thenReturn(false);
+        when(ops.createContentIndex(PHYSICAL_NAME, SHARDS)).thenReturn(true);
+
+        final boolean result = newApi()
+                .createContentIndex(LOGICAL_NAME, SHARDS, IndexTag.ES, ops, providerApi, helper);
+
+        assertTrue("Unacked delete with the index gone must recreate cleanly", result);
+        verify(ops).createContentIndex(PHYSICAL_NAME, SHARDS);
         verify(helper).addCustomMapping(List.of(LOGICAL_NAME), IndexTag.ES);
     }

+    /**
+     * Given : an EMPTY orphan whose delete fails AND a re-probe shows the index is still present.
+     * When : createContentIndex() runs during bootstrap.
+     * Then : it fails loudly (throws) rather than recreating (which would throw
+     * {@code resource_already_exists}) or reusing a bare, un-repairable index. No create is
+     * issued and no mapping is applied.
+     */
+    @Test
+    public void test_emptyOrphanDeleteFails_indexStillExists_failsLoud() throws IOException {
+        final ContentletIndexOperations ops = mock(ContentletIndexOperations.class);
+        final IndexAPI providerApi = mock(IndexAPI.class);
+        final MappingHelper helper = mock(MappingHelper.class);
+
+        when(ops.toPhysicalName(LOGICAL_NAME)).thenReturn(PHYSICAL_NAME);
+        when(providerApi.indexExists(PHYSICAL_NAME)).thenReturn(true); // still there on re-probe
+        when(ops.getIndexDocumentCount(PHYSICAL_NAME)).thenReturn(0L);
+        when(providerApi.delete(PHYSICAL_NAME))
+                .thenThrow(new RuntimeException("delete not acknowledged"));
+
+        try {
+            newApi().createContentIndex(LOGICAL_NAME, SHARDS, IndexTag.ES, ops, providerApi, helper);
+            fail("A stuck empty orphan (delete fails, index remains) must fail loudly");
+        } catch (final IOException expected) {
+            // expected — bootstrap must not silently register a half-mapped index
+        }
+
+        verify(ops, never()).createContentIndex(anyString(), anyInt());
+        verify(helper, never()).addCustomMapping(List.of(LOGICAL_NAME), IndexTag.ES);
+    }
+
     /**
      * Given : the physical index does not exist in the target cluster.
      * When : createContentIndex() runs and the create succeeds.
@@ -156,25 +278,31 @@ public void test_existenceProbeThrows_treatedAsMissing_proceedsToCreate() throws
     }

     /**
-     * Given : an OS-tagged bootstrap of an already-existing index.
+     * Given : an OS-tagged bootstrap of an already-existing EMPTY (orphaned) index.
      * When : createContentIndex() runs with {@link IndexTag#OS}.
-     * Then : the mapping is re-asserted against the OS provider — the tag is propagated unchanged
-     * to the mapping helper so the correct vendor is targeted.
+     * Then : the empty orphan is deleted and recreated against the OS provider — the fully-tagged
+     * physical name is used for the delete, create and doc-count probe, and the OS tag is
+     * propagated unchanged to the mapping helper so the correct vendor is targeted.
      */
     @Test
-    public void test_osTag_isPropagatedToMappingHelper() throws IOException {
+    public void test_osTag_emptyOrphanDeletedRecreated_andTagPropagated() throws IOException {
         final ContentletIndexOperations ops = mock(ContentletIndexOperations.class);
         final IndexAPI providerApi = mock(IndexAPI.class);
         final MappingHelper helper = mock(MappingHelper.class);

         final String osPhysical = PHYSICAL_NAME + ".os";
         when(ops.toPhysicalName(LOGICAL_NAME)).thenReturn(osPhysical);
         when(providerApi.indexExists(osPhysical)).thenReturn(true);
+        when(ops.getIndexDocumentCount(osPhysical)).thenReturn(0L);
+        when(providerApi.delete(osPhysical)).thenReturn(true);
+        when(ops.createContentIndex(osPhysical, SHARDS)).thenReturn(true);

         final boolean result = newApi()
                 .createContentIndex(LOGICAL_NAME, SHARDS, IndexTag.OS, ops, providerApi, helper);

         assertTrue(result);
+        verify(providerApi).delete(osPhysical);
+        verify(ops).createContentIndex(osPhysical, SHARDS);
         verify(helper).addCustomMapping(List.of(LOGICAL_NAME), IndexTag.OS);
     }
 }
