fix: [2.6] guard facade APIs from exceptions#1685
Conversation
Signed-off-by: xianliang.li <xianliang.li@zilliz.com> (cherry picked from commit 5d2a2fb)
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: foxspy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| return std::invoke(std::forward<Func>(func), std::forward<Args>(args)...); | ||
| } | ||
| } catch (const std::bad_alloc& e) { | ||
| return detail::GuardedCallFailure<Result>(Status::malloc_error, |
There was a problem hiding this comment.
The failure-construction path here can itself allocate, so under memory pressure that allocation throws and the exception escapes the guard, reaching terminate — a hole in the exact no-throw guarantee this PR exists to provide. On OOM, a guarded call that should have returned a clean error status instead crashes the whole process. Make the failure path allocation-free / noexcept so it cannot throw.
| // when index is mutable, it could happen that data count larger than bitset size, see | ||
| // https://github.com/zilliztech/knowhere/issues/70 | ||
| // so something must be wrong at caller side when passed bitset size larger than data count | ||
| if (bitset_.size() > (size_t)this->Count()) { |
There was a problem hiding this comment.
Wrapping Count() in the guard collapses an internal failure to 0, which is then surfaced as invalid_args/input_error (same pattern in AnnIterator and RangeSearch). This inverts the error category — an inner/server-side failure is misreported as a user-input error, so downstream Milvus makes the wrong retry-vs-reject decision instead of seeing the real error. Propagate the underlying error and category rather than degrading the result to 0.
| auto cfg = this->node->CreateConfig(); | ||
| RETURN_IF_ERROR(LoadConfig(cfg.get(), json, knowhere::TRAIN, "Build")); | ||
| Index<T>::BuildAsync(const DataSetPtr dataset, const Json& json, const std::chrono::seconds timeout) noexcept { | ||
| return GuardedCall([&]() -> std::shared_ptr<Interrupt> { |
There was a problem hiding this comment.
BuildAsync now returns GuardedCall([&]() -> std::shared_ptr<Interrupt> {...}) (Cardinal at index.cc:43, non-Cardinal at index.cc:70). Because the return type is a bare shared_ptr with no error channel, a throw from make_shared<Interrupt> (OOM) or pool->push (e.g. a shut-down pool) takes the default-constructible branch of GuardedFailure and yields a null pointer; pre-PR these propagated as exceptions. The usual caller pattern BuildAsync(...)->Get() then dereferences null and crashes, converting a catchable exception into a segfault. Return an error-carrying type, or confirm the Milvus caller null-checks the returned Interrupt and document that it can be null.
| case knowhere::Status::type_conflict_in_json: | ||
| case knowhere::Status::invalid_metric_type: | ||
| case knowhere::Status::empty_index: | ||
| case knowhere::Status::not_implemented: |
There was a problem hiding this comment.
index_not_trained/index_already_trained map to input_error (:87-89) and timeout maps to inner_error (:106), and that category directly drives the retry-vs-reject decision. These are state/deadline conditions rather than obvious user-input, so confirm the mapping matches intended Milvus semantics — otherwise a not-trained or timeout outcome could be retried or rejected incorrectly. This is a verify-before-merge sign-off, not a confirmed bug.
|
|
||
| template <typename T> | ||
| class expected { | ||
| class KNOWHERE_NODISCARD expected { |
There was a problem hiding this comment.
Marking expected<T> [[nodiscard]] (via KNOWHERE_NODISCARD) is a real compile-surface change, and the in-repo test_diskann.cc fix shows discard sites exist. But knowhere's own build will not break: host C++ -Werror is commented out (compile_flags.cmake:19) so a discard is only a warning, and the GPU build suppresses host warnings via -Xcompiler=-w / --diag-suppress (compile_flags.cmake:31-33), so the cuVS -Werror=all-warnings path is unaffected. The only real risk is downstream Milvus if it compiles with -Werror — sweep it for discarded expected<T> returns before merge.
| LOG_KNOWHERE_WARNING_ << "unhandled create config for indexType: " << indexType; | ||
| return std::make_unique<BaseConfig>(); | ||
| IndexStaticFaced<DataType>::CreateConfig(const IndexType& indexType, const IndexVersion& version) noexcept { | ||
| return GuardedCall([&]() -> std::unique_ptr<BaseConfig> { |
There was a problem hiding this comment.
IndexStaticFaced::CreateConfig is now noexcept and wraps its body in GuardedCall, so on any exception it returns a null unique_ptr<BaseConfig> (the default-constructible branch). In-repo callers are safe — they route the result through LoadStaticConfig, which this PR teaches to null-check — but the public static facade now hands nullptr to external (Milvus) callers that previously received an exception; any CreateConfig(...)->... dereference becomes a null-deref crash. This is the same null-sentinel hazard flagged for BuildAsync; return an error-carrying type or guarantee/document a non-null default.
issue: #1674
Summary
noexceptand preserve existing returned status values.