Skip to content

Fix DFA recovery failures: file-handle leak and reset-path crash#21759

Open
ask-kamal-nayan wants to merge 4 commits into
opensearch-project:mainfrom
ask-kamal-nayan:dataFusionbugfix
Open

Fix DFA recovery failures: file-handle leak and reset-path crash#21759
ask-kamal-nayan wants to merge 4 commits into
opensearch-project:mainfrom
ask-kamal-nayan:dataFusionbugfix

Conversation

@ask-kamal-nayan
Copy link
Copy Markdown
Contributor

Description

Fixes two recovery-path bugs in the DataFormat-aware engine, both reproducible with seed 4DB4295327DC8DC0.

1. File-handle leak in LuceneReaderManager (CompositeLocalRecoveryIT)

The initial DirectoryReader opened via DirectoryReader.open(writer) was overwritten by afterRefresh() without decRef, leaking its FileChannel on _0.cfs. LeakFS flagged it at suite teardown. Hold a separate initialReader reference and release it
idempotently from close() and onDeleted(...) to balance the original open().

2. Reset-path crash on DFA shards (DataFormatAwareReplicationPromotionWithLuceneIT.testUncleanFailoverWithContinuousIndexing)

During resetEngineToGlobalCheckpoint, the wrapper's getSegmentInfosSnapshot() override called applyOnEngine(newEngineReference, ...) — a BWC shim that throws for DataFormatAwareEngine. Hit by ReplicationCheckpointUpdater during translog-recovery flush;
failed the shard, RED cluster. Add Engine.acquireSnapshot() returning the format-agnostic CatalogSnapshot (mirrors existing acquireSafeCatalogSnapshot pattern, default impl bridges via getSegmentInfosSnapshot for legacy engines). Route
EngineBackedIndexer.acquireSnapshot() through it. Replace the broken override on the reset wrapper with one that delegates via the Indexer interface — which both legacy and DFA implement.

Files

  • sandbox/plugins/analytics-backend-lucene/.../LuceneReaderManager.java
  • server/src/main/java/org/opensearch/index/engine/Engine.java
  • server/src/main/java/org/opensearch/index/engine/EngineBackedIndexer.java
  • server/src/main/java/org/opensearch/index/shard/IndexShard.java

Verification

Both failing tests now pass deterministically (3/3) with seed 4DB4295327DC8DC0. Existing UTs unaffected: ReadOnlyEngineTests (7), IndexShardTests reset suite (5) — all pass. No public API signatures changed; Engine.acquireSnapshot default impl preserves legacy
behavior.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

PR Reviewer Guide 🔍

(Review updated until commit 84b9458)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Thread-safety issue

The releaseInitialReader() method checks initialReader != null and then performs decRef() and sets it to null without synchronization. If two threads call onDeleted() or close() concurrently, both could pass the null-check before either sets initialReader = null, causing a double-decRef on the same reader. This can corrupt the reference count and lead to premature closure or use-after-free of the DirectoryReader.

private void releaseInitialReader() throws IOException {
    if (initialReader != null) {
        initialReader.decRef();
        initialReader = null;
    }
}
Missing synchronization

The new acquireSnapshot() override reads newEngineReference.get() outside the engineMutex lock, unlike acquireLastCommittedSnapshot() which synchronizes on engineMutex before accessing newEngineReference. If the engine is closed or replaced concurrently, the unsynchronized read can return a stale or null reference, leading to AlreadyClosedException being thrown after the null-check or operating on a closed engine.

public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
    final Indexer ref = newEngineReference.get();
    if (ref == null) {
        throw new AlreadyClosedException("engine was closed");
    }
    return ref.acquireSnapshot();
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

PR Code Suggestions ✨

Latest suggestions up to 84b9458

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Synchronize to prevent race condition

The releaseInitialReader() method has a race condition. Since initialReader is
volatile, multiple threads could pass the null-check simultaneously before the field
is nulled out, leading to double-decRef. Use synchronization or atomic operations to
ensure thread-safe idempotency.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneReaderManager.java [233-238]

-private void releaseInitialReader() throws IOException {
+private synchronized void releaseInitialReader() throws IOException {
     if (initialReader != null) {
         initialReader.decRef();
         initialReader = null;
     }
 }
Suggestion importance[1-10]: 8

__

Why: Valid concern about a race condition in releaseInitialReader(). Multiple threads could pass the null-check simultaneously before initialReader is nulled out, potentially causing double-decRef. Synchronization ensures thread-safe idempotency, which is critical for resource management.

Medium
Add synchronization for thread safety

The method accesses newEngineReference without holding engineMutex, unlike
acquireLastCommittedSnapshot. This creates a race where the engine could be closed
between the null-check and the acquireSnapshot() call. Wrap the logic in
synchronized (engineMutex) for consistency.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [5584-5590]

 @Override
 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
-    final Indexer ref = newEngineReference.get();
-    if (ref == null) {
-        throw new AlreadyClosedException("engine was closed");
+    synchronized (engineMutex) {
+        final Indexer ref = newEngineReference.get();
+        if (ref == null) {
+            throw new AlreadyClosedException("engine was closed");
+        }
+        return ref.acquireSnapshot();
     }
-    return ref.acquireSnapshot();
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies an inconsistency with acquireLastCommittedSnapshot, which uses synchronized (engineMutex). Without synchronization, there's a race where the engine could be closed between the null-check and acquireSnapshot() call. Adding synchronization improves thread safety and consistency.

Medium

Previous suggestions

Suggestions up to commit 8fab2b3
CategorySuggestion                                                                                                                                    Impact
Possible issue
Synchronize access to engine reference

The method lacks synchronization on engineMutex unlike acquireLastCommittedSnapshot.
This creates a race condition where newEngineReference could be cleared between the
null-check and the acquireSnapshot() call, potentially causing NPE or inconsistent
state.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [5584-5590]

 @Override
 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
-    final Indexer ref = newEngineReference.get();
-    if (ref == null) {
-        throw new AlreadyClosedException("engine was closed");
+    synchronized (engineMutex) {
+        final Indexer ref = newEngineReference.get();
+        if (ref == null) {
+            throw new AlreadyClosedException("engine was closed");
+        }
+        return ref.acquireSnapshot();
     }
-    return ref.acquireSnapshot();
 }
Suggestion importance[1-10]: 8

__

Why: The method lacks synchronization on engineMutex unlike acquireLastCommittedSnapshot (lines 5572-5581), creating a race condition. The newEngineReference could be cleared between the null-check and acquireSnapshot() call, potentially causing NPE or inconsistent state during engine transitions.

Medium
Add synchronization for thread-safe release

The releaseInitialReader() method is not thread-safe despite initialReader being
volatile. Multiple threads could pass the null-check simultaneously, leading to
double-decRef. Use synchronization or atomic operations to ensure idempotency.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneReaderManager.java [233-238]

-private void releaseInitialReader() throws IOException {
+private synchronized void releaseInitialReader() throws IOException {
     if (initialReader != null) {
         initialReader.decRef();
         initialReader = null;
     }
 }
Suggestion importance[1-10]: 7

__

Why: While initialReader is volatile, the check-then-act pattern is not atomic. Multiple threads could pass the null-check simultaneously, leading to double-decRef which could cause resource management issues. Synchronization ensures idempotency as documented in the method's javadoc.

Medium
Suggestions up to commit 5053382
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add synchronization for thread safety

The method modifies the volatile field initialReader without synchronization,
creating a race condition. Multiple threads could pass the null-check
simultaneously, leading to double-decRef and potential crashes. Synchronize access
or use atomic operations to ensure thread-safe idempotency.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneReaderManager.java [233-238]

-private void releaseInitialReader() throws IOException {
+private synchronized void releaseInitialReader() throws IOException {
     if (initialReader != null) {
         initialReader.decRef();
         initialReader = null;
     }
 }
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies a critical race condition where multiple threads could simultaneously pass the null-check and call decRef() on the same initialReader, leading to double-free bugs. The volatile keyword alone doesn't prevent this race. Adding synchronized ensures thread-safe idempotency as documented in the method's javadoc.

High
Synchronize access to prevent race condition

The method accesses newEngineReference outside the engineMutex lock, creating a race
condition where the engine could be closed between the null-check and the
acquireSnapshot() call. Wrap the entire operation in synchronized (engineMutex) to
match the pattern used in acquireLastCommittedSnapshot.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [5584-5590]

 @Override
 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
-    final Indexer ref = newEngineReference.get();
-    if (ref == null) {
-        throw new AlreadyClosedException("engine was closed");
+    synchronized (engineMutex) {
+        final Indexer ref = newEngineReference.get();
+        if (ref == null) {
+            throw new AlreadyClosedException("engine was closed");
+        }
+        return ref.acquireSnapshot();
     }
-    return ref.acquireSnapshot();
 }
Suggestion importance[1-10]: 8

__

Why: Valid race condition identified where newEngineReference could become null between the check and the acquireSnapshot() call. The suggestion correctly points out the inconsistency with acquireLastCommittedSnapshot which uses synchronized (engineMutex). Adding synchronization would prevent potential NullPointerException or accessing a closed engine.

Medium
Suggestions up to commit 10e3972
CategorySuggestion                                                                                                                                    Impact
Possible issue
Synchronize access to engine reference

The method lacks synchronization on engineMutex, unlike
acquireLastCommittedSnapshot. This creates a race condition where newEngineReference
could be cleared between the null-check and the acquireSnapshot() call, potentially
causing unexpected behavior.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [5584-5590]

 @Override
 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
-    final Indexer ref = newEngineReference.get();
-    if (ref == null) {
-        throw new AlreadyClosedException("engine was closed");
+    synchronized (engineMutex) {
+        final Indexer ref = newEngineReference.get();
+        if (ref == null) {
+            throw new AlreadyClosedException("engine was closed");
+        }
+        return ref.acquireSnapshot();
     }
-    return ref.acquireSnapshot();
 }
Suggestion importance[1-10]: 8

__

Why: Important consistency issue. The method acquireLastCommittedSnapshot uses synchronized (engineMutex) while acquireSnapshot does not, creating an inconsistency. This could lead to race conditions where newEngineReference is cleared between the null-check and method call, potentially causing issues during engine transitions.

Medium
Add synchronization for thread-safe release

The releaseInitialReader() method is not thread-safe despite initialReader being
volatile. Multiple threads could pass the null-check simultaneously, leading to
double-decRef. Use a synchronized block or atomic reference to ensure idempotency.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneReaderManager.java [233-238]

-private void releaseInitialReader() throws IOException {
+private synchronized void releaseInitialReader() throws IOException {
     if (initialReader != null) {
         initialReader.decRef();
         initialReader = null;
     }
 }
Suggestion importance[1-10]: 7

__

Why: Valid concern about thread-safety. While volatile ensures visibility, it doesn't prevent race conditions in the check-then-act pattern. However, the impact depends on actual concurrent usage patterns in the codebase, and the method is called from close() and onDeleted() which may already have external synchronization.

Medium
Suggestions up to commit be5c46b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add synchronization for thread-safe release

The releaseInitialReader() method is not thread-safe despite initialReader being
volatile. Multiple threads could pass the null check simultaneously, leading to
double-decRef. Use synchronization or atomic operations to ensure idempotency.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneReaderManager.java [233-238]

-private void releaseInitialReader() throws IOException {
+private synchronized void releaseInitialReader() throws IOException {
     if (initialReader != null) {
         initialReader.decRef();
         initialReader = null;
     }
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential race condition where multiple threads could pass the null check simultaneously and call decRef() multiple times on initialReader. Adding synchronized ensures thread-safe idempotent behavior, which is critical for proper resource management.

Medium
Synchronize access to engine reference

The method accesses newEngineReference without synchronization, which could lead to
race conditions during engine transitions. Wrap the entire operation in synchronized
(engineMutex) to match the pattern used in acquireLastCommittedSnapshot().

server/src/main/java/org/opensearch/index/shard/IndexShard.java [5584-5590]

 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
-    final Indexer ref = newEngineReference.get();
-    if (ref == null) {
-        throw new AlreadyClosedException("engine was closed");
+    synchronized (engineMutex) {
+        final Indexer ref = newEngineReference.get();
+        if (ref == null) {
+            throw new AlreadyClosedException("engine was closed");
+        }
+        return ref.acquireSnapshot();
     }
-    return ref.acquireSnapshot();
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion identifies an inconsistency in synchronization patterns. While acquireLastCommittedSnapshot() uses synchronized (engineMutex), the new acquireSnapshot() method does not. Adding synchronization would prevent potential race conditions during engine transitions and maintain consistency with the existing pattern.

Medium
Suggestions up to commit be5c46b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add synchronization for thread-safe release

The releaseInitialReader() method is not thread-safe despite initialReader being
volatile. Multiple threads could pass the null check simultaneously, leading to
double-decRef. Use synchronization or atomic operations to ensure idempotency.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneReaderManager.java [233-238]

-private void releaseInitialReader() throws IOException {
+private synchronized void releaseInitialReader() throws IOException {
     if (initialReader != null) {
         initialReader.decRef();
         initialReader = null;
     }
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a potential thread-safety issue. While initialReader is volatile, the check-then-act pattern (null check followed by decRef and assignment) is not atomic, which could lead to race conditions and double-decRef calls. Adding synchronization ensures idempotency and thread safety.

Medium
Synchronize to prevent race condition

The method accesses newEngineReference without synchronization, creating a race
condition where the engine could be closed between the null check and the
acquireSnapshot() call. Wrap the entire operation in synchronized (engineMutex) for
consistency with other methods.

server/src/main/java/org/opensearch/index/shard/IndexShard.java [5584-5590]

 @Override
 public GatedCloseable<CatalogSnapshot> acquireSnapshot() {
-    final Indexer ref = newEngineReference.get();
-    if (ref == null) {
-        throw new AlreadyClosedException("engine was closed");
+    synchronized (engineMutex) {
+        final Indexer ref = newEngineReference.get();
+        if (ref == null) {
+            throw new AlreadyClosedException("engine was closed");
+        }
+        return ref.acquireSnapshot();
     }
-    return ref.acquireSnapshot();
 }
Suggestion importance[1-10]: 8

__

Why: Valid concern about thread safety. The method accesses newEngineReference without synchronization, while acquireLastCommittedSnapshot() above uses synchronized (engineMutex). This inconsistency creates a race condition where the engine could be closed between the null check and the method call. Synchronization ensures consistency with other methods.

Medium

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for c85674d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 60ef45c

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit be5c46b

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for be5c46b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit be5c46b

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for be5c46b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Kamal Nayan <askkamal@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 10e3972

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 10e3972: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.41%. Comparing base (ca22376) to head (84b9458).

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21759      +/-   ##
============================================
- Coverage     73.44%   73.41%   -0.04%     
+ Complexity    75084    75072      -12     
============================================
  Files          6016     6016              
  Lines        341072   341074       +2     
  Branches      49091    49091              
============================================
- Hits         250513   250409     -104     
- Misses        70641    70701      +60     
- Partials      19918    19964      +46     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 5053382

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 5053382: SUCCESS

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 8fab2b3

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 8fab2b3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 8fab2b3: SUCCESS

@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 84b9458

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 84b9458: SUCCESS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants