enhance: knowhere support metric mhjaccard and index minhash lsh for minhash vector#1203
Conversation
|
@cqy123456 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!. |
db09946 to
f10de66
Compare
| private: | ||
| static constexpr size_t multiplier = 31; | ||
| std::vector<bool> bits; | ||
| size_t m; |
There was a problem hiding this comment.
please initialize these m, k, p, n values with zero here
| uint64_t total_size = dim * npts / 8; | ||
| data = std::make_unique<char[]>(total_size); | ||
| file.read(reinterpret_cast<char*>(data.get()), total_size); | ||
| return; |
There was a problem hiding this comment.
return is not needed
| for (size_t i = 0; i < element_length; i++) { | ||
| const char* x_b = x + element_size * i; | ||
| const char* y_b = y + element_size * i; | ||
| res += (std::memcmp(x_b, y_b, element_size) == 0); |
There was a problem hiding this comment.
this will trigger compiler warnings about implicit bool-to-int conversions. I'd use
int res = 0;
for (size_t i = 0; i < element_length; i++) {
const char* x_b = x + element_size * i;
const char* y_b = y + element_size * i;
res += (std::memcmp(x_b, y_b, element_size) == 0) ? 1 : 0;
}
return (float)res / float(element_length);| const char* y_b = y + element_size * i; | ||
| res += (std::memcmp(x_b, y_b, element_size) == 0); | ||
| } | ||
| return res / float(element_length); |
There was a problem hiding this comment.
what if element_length==0?
There was a problem hiding this comment.
element_length == 0 will check in minhashconfigcheck
| const char* x_b = x + r * i; | ||
| const char* y_b = y + r * i; | ||
| if (std::memcmp(x_b, y_b, r) == 0) { | ||
| return 1.0; |
|
|
||
| using idx_t = faiss::idx_t; | ||
| struct MinHashLSHResultHandler { | ||
| idx_t* ids_list_; |
There was a problem hiding this comment.
please initialize these variables with default values
| counter_ = 0; | ||
| for (size_t i = 0; i < topk_; i++) { | ||
| ids_list_[i] = -1; | ||
| dis_list_[i] = 0.0; |
| mask >>= 1; | ||
| offset++; | ||
| } | ||
| result = mid + offset; |
There was a problem hiding this comment.
it seems that something like __builtin_ctz() is also applicable here
| } | ||
| uint32_t sum = _mm512_reduce_add_epi32(equal_sum); | ||
| while (count > 0) { | ||
| sum += (*u32_x) == (*u32_y); |
There was a problem hiding this comment.
sum += ((*u32_x) == (*u32_y)) ? 1 : 0;| sum += (*u32_x) == (*u32_y); | ||
| count--; | ||
| u32_x++; | ||
| u32_y++; |
There was a problem hiding this comment.
alternatively, you could use masked operation, such as _mm512_maskz_loadu_si512 and _mm512_mask_add_epi32
|
@cqy123456 I'll re-read the code again and write more comments a bit later |
| self.requires("folly/2023.10.30.09@milvus/dev") | ||
| self.requires("libcurl/8.2.1") | ||
| self.requires("simde/0.8.2") | ||
| self.requires("xxhash/0.8.2") |
There was a problem hiding this comment.
the most recent version is 0.8.3
There was a problem hiding this comment.
update later after pushing xxhash/0.8.3 to milvus JFrog cloud
|
|
||
| for (bool bit : bits) { | ||
| char byte = bit ? 1 : 0; | ||
| writeBinaryPOD(writer, byte); |
There was a problem hiding this comment.
Do I get it correct that this code writes 1 byte per 1 bit?
| constexpr const char* INDEX_HNSW_PRQ = "HNSW_PRQ"; | ||
|
|
||
| constexpr const char* INDEX_DISKANN = "DISKANN"; | ||
| constexpr const char* INDEX_MINHASH_INDEX = "MinHash_LSH"; |
There was a problem hiding this comment.
as far as I understand, this name should be made of uppercase letters, i.e. MINHASH_LSH
| KNOWHERE_CONFIG_DECLARE_FIELD(band).description("param of MinHashLSH").set_default(1).for_train().for_search(); | ||
| KNOWHERE_CONFIG_DECLARE_FIELD(mh_element_bit_width) | ||
| .description("sizeof(hash code), the hash element should be aligned on 8 bits") | ||
| .set_default(8) |
There was a problem hiding this comment.
please set the acceptable range as well
| file.read(reinterpret_cast<char*>(&d), sizeof(uint32_t)); | ||
| npts = n; | ||
| dim = d; | ||
| uint64_t total_size = dim * npts / 8; |
There was a problem hiding this comment.
(dim + 7) / 8 * npts, maybe?
Please confirm
There was a problem hiding this comment.
dimension of binary vector should be divisible by 8, I have added the relevant check.
| private: | ||
| std::unique_ptr<MinHashBandIndex[]> band_index_; | ||
| bool is_loaded_ = false; | ||
| size_t block_size_; |
| futures.emplace_back(build_pool->push([&, idx = i]() { | ||
| KeyType* blk_i = reinterpret_cast<KeyType*>(data_ + block_size_ * idx); | ||
| for (auto j = 0; j < num_in_a_blk_[idx]; j++) { | ||
| bloom_filter.add(blk_i[j]); |
There was a problem hiding this comment.
is bloom_filter thread safe?
| // is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express | ||
| // or implied. See the License for the specific language governing permissions and limitations under the License. | ||
|
|
||
| #ifndef MINHASH_TREE_H |
There was a problem hiding this comment.
is it really needed to have a header-only code here? Many of the classes from this files (both functions and methods) may be safely moved to minhash_lsh.cpp file
| Load(FileReader& reader, size_t rows, char* mmap_data, BloomFilter<KeyType>& bloom_filter); | ||
|
|
||
| void | ||
| Search(KeyType key, MinHashLSHResultHandler* res, faiss::IDSelector* id_selector); |
There was a problem hiding this comment.
typically, this method should be a const one, so
void
Search(KeyType key, MinHashLSHResultHandler* res, faiss::IDSelector* id_selector) const;Please consider making it const
| Search(KeyType key, MinHashLSHResultHandler* res, faiss::IDSelector* id_selector); | ||
|
|
||
| Status | ||
| WarmUp() { |
There was a problem hiding this comment.
and other methods as well :)
…minhash vector Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
| constexpr const char* SEARCH_WITH_JACCARD = "search_with_jaccard"; | ||
| constexpr const char* MH_ELEMENT_BIT_WIDTH = "mh_element_bit_width"; | ||
| constexpr const char* MH_LSH_REFINE_K = "refine_k"; | ||
| constexpr const char* BATCH_SEARCH = "batch_search"; |
There was a problem hiding this comment.
Please add the mh_ prefix to the minhash-specific parameters to distinguish them.
|
|
||
| static expected<Resource> | ||
| StaticEstimateLoadResource(const float file_size, const knowhere::BaseConfig& config, const IndexVersion& version) { | ||
| return Resource{file_size * 0.25f, file_size}; |
There was a problem hiding this comment.
please confirm and add comments for the load resource estimata
| return !index_type.compare(IndexEnum::INDEX_DISKANN); | ||
| } else { | ||
| return !index_type.compare(IndexEnum::INDEX_DISKANN) || !index_type.compare(IndexEnum::INDEX_HNSW); | ||
| return !index_type.compare(IndexEnum::INDEX_DISKANN) || !index_type.compare(IndexEnum::INDEX_HNSW) || |
|
|
||
| bool | ||
| HasRawData(const std::string& metric_type) const override { | ||
| return false; |
There was a problem hiding this comment.
This should be dynamic, if with_raw_data
|
/kind improvement |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cqy123456, foxspy The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…lsh for minhash vector (zilliztech#1203)" This reverts commit fd0871f.
…lsh for minhash vector (zilliztech#1203)" This reverts commit fd0871f. Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
* update cardinal version Signed-off-by: xianliang.li <xianliang.li@zilliz.com> * Revert "enhance: knowhere support metric mhjaccard and index minhash lsh for minhash vector (#1203)" This reverts commit fd0871f. Signed-off-by: xianliang.li <xianliang.li@zilliz.com> --------- Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
…minhash vector (zilliztech#1203) Signed-off-by: cqy123456 <qianya.cheng@zilliz.com>
No description provided.