Skip to content

fix: [2.6] guard facade APIs from exceptions#1685

Open
foxspy wants to merge 1 commit into
zilliztech:2.6from
foxspy:backport-pr-1675-to-2.6
Open

fix: [2.6] guard facade APIs from exceptions#1685
foxspy wants to merge 1 commit into
zilliztech:2.6from
foxspy:backport-pr-1675-to-2.6

Conversation

@foxspy

@foxspy foxspy commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

issue: #1674

Summary

  • Add guarded facade calls so C++ exceptions are converted to error-code based results.
  • Mark public facade APIs as noexcept and preserve existing returned status values.
  • Add status-category helpers and unit coverage for error-code behavior.

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
(cherry picked from commit 5d2a2fb)
@foxspy foxspy added the kind/bug This PR is a bug fix label Jun 18, 2026
@sre-ci-robot

Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: foxspy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

return std::invoke(std::forward<Func>(func), std::forward<Args>(args)...);
}
} catch (const std::bad_alloc& e) {
return detail::GuardedCallFailure<Result>(Status::malloc_error,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure-construction path here can itself allocate, so under memory pressure that allocation throws and the exception escapes the guard, reaching terminate — a hole in the exact no-throw guarantee this PR exists to provide. On OOM, a guarded call that should have returned a clean error status instead crashes the whole process. Make the failure path allocation-free / noexcept so it cannot throw.

Comment thread src/index/index.cc
// when index is mutable, it could happen that data count larger than bitset size, see
// https://github.com/zilliztech/knowhere/issues/70
// so something must be wrong at caller side when passed bitset size larger than data count
if (bitset_.size() > (size_t)this->Count()) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapping Count() in the guard collapses an internal failure to 0, which is then surfaced as invalid_args/input_error (same pattern in AnnIterator and RangeSearch). This inverts the error category — an inner/server-side failure is misreported as a user-input error, so downstream Milvus makes the wrong retry-vs-reject decision instead of seeing the real error. Propagate the underlying error and category rather than degrading the result to 0.

Comment thread src/index/index.cc
auto cfg = this->node->CreateConfig();
RETURN_IF_ERROR(LoadConfig(cfg.get(), json, knowhere::TRAIN, "Build"));
Index<T>::BuildAsync(const DataSetPtr dataset, const Json& json, const std::chrono::seconds timeout) noexcept {
return GuardedCall([&]() -> std::shared_ptr<Interrupt> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BuildAsync now returns GuardedCall([&]() -> std::shared_ptr<Interrupt> {...}) (Cardinal at index.cc:43, non-Cardinal at index.cc:70). Because the return type is a bare shared_ptr with no error channel, a throw from make_shared<Interrupt> (OOM) or pool->push (e.g. a shut-down pool) takes the default-constructible branch of GuardedFailure and yields a null pointer; pre-PR these propagated as exceptions. The usual caller pattern BuildAsync(...)->Get() then dereferences null and crashes, converting a catchable exception into a segfault. Return an error-carrying type, or confirm the Milvus caller null-checks the returned Interrupt and document that it can be null.

case knowhere::Status::type_conflict_in_json:
case knowhere::Status::invalid_metric_type:
case knowhere::Status::empty_index:
case knowhere::Status::not_implemented:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_not_trained/index_already_trained map to input_error (:87-89) and timeout maps to inner_error (:106), and that category directly drives the retry-vs-reject decision. These are state/deadline conditions rather than obvious user-input, so confirm the mapping matches intended Milvus semantics — otherwise a not-trained or timeout outcome could be retried or rejected incorrectly. This is a verify-before-merge sign-off, not a confirmed bug.


template <typename T>
class expected {
class KNOWHERE_NODISCARD expected {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking expected<T> [[nodiscard]] (via KNOWHERE_NODISCARD) is a real compile-surface change, and the in-repo test_diskann.cc fix shows discard sites exist. But knowhere's own build will not break: host C++ -Werror is commented out (compile_flags.cmake:19) so a discard is only a warning, and the GPU build suppresses host warnings via -Xcompiler=-w / --diag-suppress (compile_flags.cmake:31-33), so the cuVS -Werror=all-warnings path is unaffected. The only real risk is downstream Milvus if it compiles with -Werror — sweep it for discarded expected<T> returns before merge.

Comment thread src/index/index_static.cc
LOG_KNOWHERE_WARNING_ << "unhandled create config for indexType: " << indexType;
return std::make_unique<BaseConfig>();
IndexStaticFaced<DataType>::CreateConfig(const IndexType& indexType, const IndexVersion& version) noexcept {
return GuardedCall([&]() -> std::unique_ptr<BaseConfig> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexStaticFaced::CreateConfig is now noexcept and wraps its body in GuardedCall, so on any exception it returns a null unique_ptr<BaseConfig> (the default-constructible branch). In-repo callers are safe — they route the result through LoadStaticConfig, which this PR teaches to null-check — but the public static facade now hands nullptr to external (Milvus) callers that previously received an exception; any CreateConfig(...)->... dereference becomes a null-deref crash. This is the same null-sentinel hazard flagged for BuildAsync; return an error-carrying type or guarantee/document a non-null default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants