perf: Optimize word segmentation retrieval #2767
Conversation
Adding the `do-not-merge/release-note-label-needed` label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing in a comment.
```python
jieba.del_word(word)
extract_tags = jieba.lcut(text)
result = " ".join(extract_tags)
return result
```
Code Review and Suggestions

- Comments: The comments in the code are mostly clear but could benefit from more specific detail on what each function does.
- Imports: Import `analyse` directly instead of using `jieba.analyse`. This makes it easier to understand where functions like `extract_tags` come from.
- Functionality: In `to_ts_vector`, replace the steps related to keyword extraction and filtering with a direct call to `jieba.lcut`. In `to_query`, similar changes can be made: call `jieba.lcut` directly without unnecessary processing.
- Optimization: Remove the `replace_word` function call inside both `to_ts_vector` and `to_query`. If this function is necessary for specific purposes, consider refactoring its implementation, or remove it entirely if it is no longer needed.
- Character Handling: Ensure that `remove_chars` is properly defined and referenced if it is used elsewhere in the code.
- Consistency: Make sure the different components handle text preprocessing consistently (e.g., punctuation stripping).
Here's an updated version of the cleaned-up code:

```python
import jieba

# Constants and caching
jieba_word_list_cache = [chr(item) for item in range(38, 84)]


def get_key_by_word_dict(key, word_dict):
    raise NotImplementedError("Not implemented yet")


def to_ts_vector(text: str):
    # Clean up text and split into words
    words = jieba.lcut(text)
    return " ".join(words)


def to_query(text: str):
    # Split text into words
    words = jieba.lcut(text)
    return " ".join(words)
```

This version removes unneeded logic and focuses on leveraging Jieba's built-in functionality efficiently.
```python
          index in range(0, len(texts))]
if not is_the_task_interrupted():
    QuerySet(Embedding).bulk_create(embedding_list) if len(embedding_list) > 0 else None
```
There are no obvious irregularities in the provided code snippet. However, there are a few suggestions for improvement:

- Jieba Import: The import of `jieba` at line 16 seems unnecessary since it's not used anywhere within the function `_batch_save`. You can remove this line to simplify the code.
- Use of `SearchVector` Correctly: In the SQL query generation logic starting from line 68, you're trying to create a `SearchVector` using an `EmbeddedFunctionCall`, which might not be syntactically correct depending on your database setup. Consider whether you should use Django's built-in support for searching instead (e.g., PostgreSQL's full-text search capabilities).
- Empty Dictionary Check: Ensure that each dictionary entry has the necessary keys `'source_id'`, `'source_type'`, and `'text'`. Using a conditional like `index < len(text_list)` instead of assuming `len(text_list) == len(texts)` is more robust, especially if other parameters change in future versions.
- Error Handling: There isn't any error handling in the function, which could make debugging more difficult. Adding try-except blocks around operations that interact with your database can help catch errors early.
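The error-handling point can be sketched as a small wrapper around whatever callable performs the save. `safe_batch_save` and `save_fn` are hypothetical names used for illustration; they are not part of this project's API:

```python
def safe_batch_save(save_fn, batch, logger=print):
    # Hypothetical wrapper: catch failures from the save callable so one bad
    # batch is reported instead of aborting the whole indexing task.
    # In a Django project you would catch django.db.DatabaseError here.
    try:
        save_fn(batch)
        return True
    except Exception as exc:
        logger(f"batch save failed: {exc!r}")
        return False


print(safe_batch_save(lambda batch: None, [1, 2, 3]))   # True
print(safe_batch_save(lambda batch: 1 / 0, [1, 2, 3]))  # logs the error, then False
```

Returning a success flag (rather than swallowing the exception silently) lets the caller decide whether to retry or skip the batch.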
Here is the revised version based on these points (the original suggestion contained syntax errors; this sketch keeps its intent — validate entries, then bulk-create when the task is not interrupted):

```python
from typing import Dict, List

from django.db import DatabaseError
from django.db.models import QuerySet


def _batch_save(self, texts: List[Dict], embeddings: List[List[float]], is_the_task_interrupted):
    # Keep only entries that carry every required, non-empty field
    valid_pairs = [
        (text, vector) for text, vector in zip(texts, embeddings)
        if all(text.get(key) for key in ('source_id', 'source_type', 'text'))
    ]
    if not is_the_task_interrupted():
        embedding_list = [
            Embedding(source_id=text['source_id'],
                      source_type=text['source_type'],
                      text=text['text'],
                      embedding=vector)
            for text, vector in valid_pairs
        ]
        try:
            if embedding_list:
                QuerySet(Embedding).bulk_create(embedding_list)
        except DatabaseError:
            # Re-raise after the point of failure is known; callers can retry the batch
            raise


# Example usage: filter out entries with empty values
texts_with_invalid_entries = [{'id': 1, 'source_id': '', 'text': 'A good example'}]
embeddings_with_valid_entries = ['this is a test', 'another good one']
pairs = zip(texts_with_invalid_entries, embeddings_with_valid_entries)
filtered_pairs = [(t, e) for t, e in pairs if all(v not in (None, '') for v in t.values())]
print(filtered_pairs)
```

This should help ensure the reliability and maintainability of your code.
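The "required keys" check from the suggestions above can be isolated into a small predicate, which keeps the validation testable on its own. `REQUIRED_KEYS` and `valid_entry` are illustrative names, not identifiers taken from the PR:

```python
REQUIRED_KEYS = ('source_id', 'source_type', 'text')


def valid_entry(entry: dict) -> bool:
    # An entry is usable only if every required key is present and non-empty
    return all(entry.get(key) not in (None, '') for key in REQUIRED_KEYS)


entries = [
    {'source_id': 'a1', 'source_type': 'document', 'text': 'hello'},
    {'source_id': '', 'source_type': 'document', 'text': 'missing source id'},
    {'source_id': 'a2', 'text': 'missing source type'},
]
print([e['text'] for e in entries if valid_entry(e)])  # ['hello']
```

Filtering with an explicit predicate also makes it easy to log exactly which entries were dropped and why.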