Skip to content

[RW Separation] Provide clear information when a query exception occurs due to the number of search only shards being 0#21100

Open
guojialiang92 wants to merge 2 commits into
opensearch-project:mainfrom
guojialiang92:dev/provide_clear_exception_msg
Open

[RW Separation] Provide clear information when a query exception occurs due to the number of search only shards being 0#21100
guojialiang92 wants to merge 2 commits into
opensearch-project:mainfrom
guojialiang92:dev/provide_clear_exception_msg

Conversation

@guojialiang92
Copy link
Copy Markdown
Contributor

Description

I have provided clear error messages for the following two scenarios.

  • The query used preference "_search_replica" && the number of search only shards is 0
  • The query did not use preference "_search_replica" && the number of search only shards is 0 && INDEX_BLOCKS_SEARCH_ONLY_SETTING is true

Related Issues

Resolves #[21099]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…rovide clear exception information.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@guojialiang92 guojialiang92 requested a review from a team as a code owner April 3, 2026 03:53
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

PR Reviewer Guide 🔍

(Review updated until commit 798e29e)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add clear exception messages in OperationRouting for zero search-only shards

Relevant files:

  • server/src/main/java/org/opensearch/cluster/routing/OperationRouting.java
  • server/src/test/java/org/opensearch/cluster/routing/OperationRoutingTests.java

Sub-PR theme: Update ClusterStateCreationUtils and integration tests for search-only shard scenarios

Relevant files:

  • test/framework/src/main/java/org/opensearch/action/support/replication/ClusterStateCreationUtils.java
  • server/src/internalClusterTest/java/org/opensearch/action/admin/indices/scale/searchonly/ScaleIndexIT.java

⚡ Recommended focus areas for review

Loop Exception

The two new checks for numberOfSearchOnlyReplicas == 0 are placed inside a loop over shards. If there are multiple shards or indices, the exception will be thrown on the first shard encountered, potentially before all shards are evaluated. This may cause misleading error messages when only some shards/indices have 0 search replicas.

if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0 && (Preference.SEARCH_REPLICA.type().equals(preference))) {
    throw new NoShardAvailableActionException(
        null,
        String.format(
            Locale.ROOT,
            "Strictly require querying search only shards, but the number of search only replicas for index %s is 0",
            indexMetadataForShard.getIndex().getName()
        )
    );
}

if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0
    && indexMetadataForShard.getSettings().getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false)) {
    throw new NoShardAvailableActionException(
        null,
        String.format(
            Locale.ROOT,
            "The index %s is in the scale down state, and the number of search only shards is 0",
            indexMetadataForShard.getIndex().getName()
        )
    );
}
Missing catch scope

In the assertBusy block, the try block catches NoShardAvailableActionException and asserts the message, but if the search succeeds (no exception thrown), the fail("search replica should have failed") is never reached because it's placed before the catch. Actually looking more carefully, fail is inside the try block before the catch — this is correct. However, if a different exception type is thrown (not NoShardAvailableActionException), it will propagate uncaught and may cause confusing test failures rather than a clear assertion failure.

try {
    client().prepareSearch(TEST_INDEX).setPreference(Preference.SEARCH_REPLICA.type()).setSize(0).get();
    fail("search replica should have failed");
} catch (NoShardAvailableActionException e) {
    assertEquals(
        String.format(
            Locale.ROOT,
            "Strictly require querying search only shards, but the number of search only replicas for index %s is 0",
            TEST_INDEX
        ),
        e.getMessage()
    );
}
Redundant Setting Lookup

The check indexMetadataForShard.getSettings().getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false) reads the setting from raw settings string key. It would be safer and more consistent to use IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.get(indexMetadataForShard.getSettings()) to leverage the typed setting accessor and avoid potential key mismatch issues.

if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0
    && indexMetadataForShard.getSettings().getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false)) {
    throw new NoShardAvailableActionException(
        null,
        String.format(
            Locale.ROOT,
            "The index %s is in the scale down state, and the number of search only shards is 0",
            indexMetadataForShard.getIndex().getName()
        )
    );
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

PR Code Suggestions ✨

Latest suggestions up to 798e29e
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure test exercises correct code path

The test passes null as the preference for the scale-down state scenario, but the
production code in OperationRouting.searchShards only throws the scale-down
exception when preference is null (or empty) AND isStrictSearchOnlyShardRouting is
true. If isStrictSearchOnlyShardRouting defaults to false, the preference won't be
set to SEARCH_REPLICA and the scale-down check may not be reached. Verify that the
test correctly exercises the code path, or explicitly set the strict routing flag to
ensure the condition is triggered.

server/src/test/java/org/opensearch/cluster/routing/OperationRoutingTests.java [707-717]

+opRouting.setStrictSearchOnlyShardRouting(true);
+
 ClusterState newState = ClusterStateCreationUtils.stateWithAssignedPrimariesAndReplicas(
     indexNames,
     1,
     1,
     0,
     Settings.builder().put(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), true).build()
 );
 
 throwable = assertThrows(
     NoShardAvailableActionException.class,
     () -> opRouting.searchShards(newState, indexNames, null, null, collector, outstandingRequests, null)
 );
Suggestion importance[1-10]: 6

__

Why: This is a valid concern - looking at the production code, the second NoShardAvailableActionException (scale-down state) is thrown regardless of isStrictSearchOnlyShardRouting since it's a separate condition block. The check at line 297-307 doesn't depend on the preference being set to SEARCH_REPLICA. However, the suggestion correctly identifies that the test should verify the code path is properly exercised, making it a useful improvement for test reliability.

Low
General
Reorder exception checks for correct priority

The second condition is a subset of the first: if SEARCH_REPLICA preference is set
and search replicas are 0, the first exception is thrown. However, when the index is
in scale-down state with 0 search replicas and no explicit preference, only the
second check fires. But if both conditions are true (scale-down state AND
SEARCH_REPLICA preference), only the first exception is thrown, potentially hiding
the scale-down context. Consider combining or reordering the checks so the
scale-down state message takes priority, or merging both conditions into a single,
more informative message.

server/src/main/java/org/opensearch/cluster/routing/OperationRouting.java [286-307]

-if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0 && (Preference.SEARCH_REPLICA.type().equals(preference))) {
-    throw new NoShardAvailableActionException(
-        null,
-        String.format(
-            Locale.ROOT,
-            "Strictly require querying search only shards, but the number of search only replicas for index %s is 0",
-            indexMetadataForShard.getIndex().getName()
-        )
-    );
+if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0) {
+    boolean isScaleDown = indexMetadataForShard.getSettings()
+        .getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false);
+    if (isScaleDown) {
+        throw new NoShardAvailableActionException(
+            null,
+            String.format(
+                Locale.ROOT,
+                "The index %s is in the scale down state, and the number of search only shards is 0",
+                indexMetadataForShard.getIndex().getName()
+            )
+        );
+    }
+    if (Preference.SEARCH_REPLICA.type().equals(preference)) {
+        throw new NoShardAvailableActionException(
+            null,
+            String.format(
+                Locale.ROOT,
+                "Strictly require querying search only shards, but the number of search only replicas for index %s is 0",
+                indexMetadataForShard.getIndex().getName()
+            )
+        );
+    }
 }
 
-if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0
-    && indexMetadataForShard.getSettings().getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false)) {
-    throw new NoShardAvailableActionException(
-        null,
-        String.format(
-            Locale.ROOT,
-            "The index %s is in the scale down state, and the number of search only shards is 0",
-            indexMetadataForShard.getIndex().getName()
-        )
-    );
-}
-
Suggestion importance[1-10]: 4

__

Why: The suggestion raises a valid point about condition ordering when both scale-down state and SEARCH_REPLICA preference are true simultaneously. However, the current behavior (first check wins) is arguably acceptable since the user explicitly requested search replicas, and the improved code changes observable behavior in edge cases. The impact is minor as these conditions are unlikely to overlap in practice.

Low

Previous suggestions

Suggestions up to commit 9cf6116
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard scale-down exception to strict routing context only

The second check for the scale-down state will never be reached when preference is
SEARCH_REPLICA, because the first check already throws an exception in that case.
However, when preference is null and the index is in scale-down state with 0 search
replicas, the second check should trigger. The logic appears correct for that path,
but the condition should also verify that isStrictSearchOnlyShardRouting is true or
that the preference is not already set to something else, to avoid throwing for
non-search-replica queries on a scaled-down index that still has regular shards
available.

server/src/main/java/org/opensearch/cluster/routing/OperationRouting.java [297-307]

 if (indexMetadataForShard.getNumberOfSearchOnlyReplicas() == 0
-    && indexMetadataForShard.getSettings().getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false)) {
+    && indexMetadataForShard.getSettings().getAsBoolean(IndexMetadata.INDEX_BLOCKS_SEARCH_ONLY_SETTING.getKey(), false)
+    && (isStrictSearchOnlyShardRouting || Preference.SEARCH_REPLICA.type().equals(preference))) {
     throw new NoShardAvailableActionException(
         null,
         String.format(
             Locale.ROOT,
             "The index %s is in the scale down state, and the number of search only shards is 0",
             indexMetadataForShard.getIndex().getName()
         )
     );
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion raises a valid concern: the second check throws NoShardAvailableActionException for any query on a scaled-down index with 0 search replicas, even when preference is null and regular shards are available. Adding the isStrictSearchOnlyShardRouting guard would prevent false failures for non-search-replica queries. However, the scale-down state (INDEX_BLOCKS_SEARCH_ONLY_SETTING) may intentionally block all queries, so the behavior depends on the intended semantics.

Low
Ensure test reliably triggers scale-down exception

The second assertThrows call passes null as the preference, but the second check in
OperationRouting only throws when preference is null AND
isStrictSearchOnlyShardRouting is true (or the index is in scale-down state). If
isStrictSearchOnlyShardRouting defaults to false, this test may not trigger the
expected exception, making it an unreliable test. The test should either set strict
routing to true or pass Preference.SEARCH_REPLICA.type() as the preference to ensure
the exception is thrown.

server/src/test/java/org/opensearch/cluster/routing/OperationRoutingTests.java [715-718]

+opRouting.setStrictSearchOnlyShardRouting(true);
 throwable = assertThrows(
     NoShardAvailableActionException.class,
     () -> opRouting.searchShards(newState, indexNames, null, null, collector, outstandingRequests, null)
 );
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that the test passes null preference without enabling strict routing, which may not reliably trigger the scale-down exception if isStrictSearchOnlyShardRouting defaults to false. Setting opRouting.setStrictSearchOnlyShardRouting(true) would make the test more reliable, though the behavior depends on the actual default value of the flag.

Low
General
Use explicit preference for scale-down state test

This block tests that a plain search (without SEARCH_REPLICA preference) throws
NoShardAvailableActionException when the index is in scale-down state with 0 search
replicas. However, a plain search without preference should still route to
regular/primary shards if available. If the intent is to test the scale-down block
behavior, the test should explicitly set the preference or ensure strict routing is
enabled, otherwise the search may succeed unexpectedly.

server/src/internalClusterTest/java/org/opensearch/action/admin/indices/scale/searchonly/ScaleIndexIT.java [138-152]

 if (searchOnlyReplica == 0) {
     try {
-        client().prepareSearch(TEST_INDEX).setSize(0).get();
+        client().prepareSearch(TEST_INDEX).setPreference(Preference.SEARCH_REPLICA.type()).setSize(0).get();
         fail("search replica should have failed");
     } catch (NoShardAvailableActionException e) {
         assertEquals(
             String.format(
                 Locale.ROOT,
                 "The index %s is in the scale down state, and the number of search only shards is 0",
                 TEST_INDEX
             ),
             e.getMessage()
         );
     }
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion points out that a plain search without preference might route to regular shards and not throw the expected exception. However, the improved_code changes the error message expectation to the scale-down message while using SEARCH_REPLICA preference, which would actually trigger the first check (not the scale-down check), making the expected message incorrect. The suggestion's improved_code is inconsistent with the intended test behavior.

Low

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 3, 2026

✅ Gradle check result for 9cf6116: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.27%. Comparing base (0f8cc13) to head (798e29e).
⚠️ Report is 189 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21100      +/-   ##
============================================
+ Coverage     73.20%   73.27%   +0.07%     
- Complexity    72751    73106     +355     
============================================
  Files          5871     5921      +50     
  Lines        332688   333653     +965     
  Branches      48017    48111      +94     
============================================
+ Hits         243543   244489     +946     
+ Misses        69625    69614      -11     
- Partials      19520    19550      +30     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

Persistent review updated to latest commit 798e29e

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 4, 2026

✅ Gradle check result for 798e29e: SUCCESS

@guojialiang92
Copy link
Copy Markdown
Contributor Author

guojialiang92 commented Apr 7, 2026

Hi, @prudhvigodithi
Could you please help review the code? :)
issue: #21099.

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot Bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants