Skip to content

enhance: knowhere support metric mhjaccard and index minhash lsh for minhash vector#1203

Merged
foxspy merged 1 commit into
zilliztech:mainfrom
cqy123456:minhash-final
May 28, 2025
Merged

enhance: knowhere support metric mhjaccard and index minhash lsh for minhash vector#1203
foxspy merged 1 commit into
zilliztech:mainfrom
cqy123456:minhash-final

Conversation

@cqy123456
Copy link
Copy Markdown
Collaborator

No description provided.

@mergify
Copy link
Copy Markdown

mergify Bot commented May 22, 2025

@cqy123456 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!.

Comment thread include/knowhere/comp/bloomfilter.h Outdated
private:
static constexpr size_t multiplier = 31;
std::vector<bool> bits;
size_t m;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please initialize these m, k, p, n values with zero here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread include/knowhere/utils.h Outdated
uint64_t total_size = dim * npts / 8;
data = std::make_unique<char[]>(total_size);
file.read(reinterpret_cast<char*>(data.get()), total_size);
return;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return is not needed

Comment thread src/index/minhash/minhash_util.cc Outdated
for (size_t i = 0; i < element_length; i++) {
const char* x_b = x + element_size * i;
const char* y_b = y + element_size * i;
res += (std::memcmp(x_b, y_b, element_size) == 0);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will trigger compiler warnings about implicit bool-to-int conversions. I'd use

    int res = 0;
    for (size_t i = 0; i < element_length; i++) {
        const char* x_b = x + element_size * i;
        const char* y_b = y + element_size * i;
        res += (std::memcmp(x_b, y_b, element_size) == 0) ? 1 : 0;
    }
    return (float)res / float(element_length);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread src/index/minhash/minhash_util.cc Outdated
const char* y_b = y + element_size * i;
res += (std::memcmp(x_b, y_b, element_size) == 0);
}
return res / float(element_length);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if element_length==0?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

element_length == 0 will check in minhashconfigcheck

Comment thread src/index/minhash/minhash_util.cc Outdated
const char* x_b = x + r * i;
const char* y_b = y + r * i;
if (std::memcmp(x_b, y_b, r) == 0) {
return 1.0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.0f


using idx_t = faiss::idx_t;
struct MinHashLSHResultHandler {
idx_t* ids_list_;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please initialize these variables with default values

Comment thread src/index/minhash/minhash_util.h Outdated
counter_ = 0;
for (size_t i = 0; i < topk_; i++) {
ids_list_[i] = -1;
dis_list_[i] = 0.0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.0f :)

mask >>= 1;
offset++;
}
result = mid + offset;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that something like __builtin_ctz() is also applicable here

Comment thread src/simd/distances_avx512.cc Outdated
}
uint32_t sum = _mm512_reduce_add_epi32(equal_sum);
while (count > 0) {
sum += (*u32_x) == (*u32_y);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum += ((*u32_x) == (*u32_y)) ? 1 : 0;

Comment thread src/simd/distances_avx512.cc Outdated
sum += (*u32_x) == (*u32_y);
count--;
u32_x++;
u32_y++;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively, you could use masked operation, such as _mm512_maskz_loadu_si512 and _mm512_mask_add_epi32

@alexanderguzhva
Copy link
Copy Markdown
Collaborator

@cqy123456 I'll re-read the code again and write more comments a bit later

Comment thread conanfile.py
self.requires("folly/2023.10.30.09@milvus/dev")
self.requires("libcurl/8.2.1")
self.requires("simde/0.8.2")
self.requires("xxhash/0.8.2")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the most recent version is 0.8.3

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update later after pushing xxhash/0.8.3 to milvus JFrog cloud

Comment thread include/knowhere/comp/bloomfilter.h Outdated

for (bool bit : bits) {
char byte = bit ? 1 : 0;
writeBinaryPOD(writer, byte);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I get it correct that this code writes 1 byte per 1 bit?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread include/knowhere/comp/index_param.h Outdated
constexpr const char* INDEX_HNSW_PRQ = "HNSW_PRQ";

constexpr const char* INDEX_DISKANN = "DISKANN";
constexpr const char* INDEX_MINHASH_INDEX = "MinHash_LSH";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as far as I understand, this name should be made of uppercase letters, i.e. MINHASH_LSH

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread include/knowhere/config.h
KNOWHERE_CONFIG_DECLARE_FIELD(band).description("param of MinHashLSH").set_default(1).for_train().for_search();
KNOWHERE_CONFIG_DECLARE_FIELD(mh_element_bit_width)
.description("sizeof(hash code), the hash element should be aligned on 8 bits")
.set_default(8)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please set the acceptable range as well

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread include/knowhere/utils.h
file.read(reinterpret_cast<char*>(&d), sizeof(uint32_t));
npts = n;
dim = d;
uint64_t total_size = dim * npts / 8;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(dim + 7) / 8 * npts, maybe?
Please confirm

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dimension of binary vector should be divisible by 8, I have added the relevant check.

Comment thread src/index/minhash/minhash_lsh.h Outdated
private:
std::unique_ptr<MinHashBandIndex[]> band_index_;
bool is_loaded_ = false;
size_t block_size_;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

futures.emplace_back(build_pool->push([&, idx = i]() {
KeyType* blk_i = reinterpret_cast<KeyType*>(data_ + block_size_ * idx);
for (auto j = 0; j < num_in_a_blk_[idx]; j++) {
bloom_filter.add(blk_i[j]);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is bloom_filter thread safe?

Comment thread src/index/minhash/minhash_lsh.h Outdated
// is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
// or implied. See the License for the specific language governing permissions and limitations under the License.

#ifndef MINHASH_TREE_H
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it really needed to have a header-only code here? Many of the classes from this files (both functions and methods) may be safely moved to minhash_lsh.cpp file

Comment thread src/index/minhash/minhash_lsh.h Outdated
Load(FileReader& reader, size_t rows, char* mmap_data, BloomFilter<KeyType>& bloom_filter);

void
Search(KeyType key, MinHashLSHResultHandler* res, faiss::IDSelector* id_selector);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typically, this method should be a const one, so

    void
    Search(KeyType key, MinHashLSHResultHandler* res, faiss::IDSelector* id_selector) const;

Please consider making it const

Comment thread src/index/minhash/minhash_lsh.h Outdated
Search(KeyType key, MinHashLSHResultHandler* res, faiss::IDSelector* id_selector);

Status
WarmUp() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and other methods as well :)

…minhash vector

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
constexpr const char* SEARCH_WITH_JACCARD = "search_with_jaccard";
constexpr const char* MH_ELEMENT_BIT_WIDTH = "mh_element_bit_width";
constexpr const char* MH_LSH_REFINE_K = "refine_k";
constexpr const char* BATCH_SEARCH = "batch_search";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the mh_ prefix to the minhash-specific parameters to distinguish them.


static expected<Resource>
StaticEstimateLoadResource(const float file_size, const knowhere::BaseConfig& config, const IndexVersion& version) {
return Resource{file_size * 0.25f, file_size};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please confirm and add comments for the load resource estimata

Comment thread src/common/utils.cc
return !index_type.compare(IndexEnum::INDEX_DISKANN);
} else {
return !index_type.compare(IndexEnum::INDEX_DISKANN) || !index_type.compare(IndexEnum::INDEX_HNSW);
return !index_type.compare(IndexEnum::INDEX_DISKANN) || !index_type.compare(IndexEnum::INDEX_HNSW) ||
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's clearer.


bool
HasRawData(const std::string& metric_type) const override {
return false;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be dynamic, if with_raw_data

@foxspy
Copy link
Copy Markdown
Collaborator

foxspy commented May 28, 2025

/kind improvement

Copy link
Copy Markdown
Collaborator

@foxspy foxspy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@sre-ci-robot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cqy123456, foxspy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@foxspy foxspy merged commit fd0871f into zilliztech:main May 28, 2025
9 of 11 checks passed
@mergify mergify Bot removed the ci-passed label May 28, 2025
foxspy added a commit to foxspy/knowhere that referenced this pull request May 28, 2025
foxspy added a commit to foxspy/knowhere that referenced this pull request May 28, 2025
…lsh for minhash vector (zilliztech#1203)"

This reverts commit fd0871f.

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
sre-ci-robot pushed a commit that referenced this pull request May 28, 2025
* update cardinal version

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>

* Revert "enhance: knowhere support metric mhjaccard and index minhash lsh for minhash vector (#1203)"

This reverts commit fd0871f.

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>

---------

Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
cqy123456 added a commit to cqy123456/zilliztech-knowhere that referenced this pull request May 29, 2025
…minhash vector (zilliztech#1203)

Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants