Merged
4 changes: 2 additions & 2 deletions apps/common/util/ts_vecto_util.py
@@ -78,11 +78,11 @@ def get_key_by_word_dict(key, word_dict):

 def to_ts_vector(text: str):
     # 分词 (word segmentation)
-    result = jieba.lcut(text)
+    result = jieba.lcut(text, cut_all=True)
     return " ".join(result)


 def to_query(text: str):
-    extract_tags = jieba.lcut(text)
+    extract_tags = jieba.lcut(text, cut_all=True)
     result = " ".join(extract_tags)
     return result
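For context, `cut_all=True` switches jieba into full mode, which emits every dictionary word it finds in the text, overlapping segments included, whereas the default precise mode produces a single non-overlapping segmentation. A toy stdlib-only sketch of the full-mode idea (not jieba's actual algorithm; the dictionary and scan below are made-up stand-ins):

```python
# Toy illustration of jieba's "full mode" (cut_all=True): scan every
# substring and emit each one found in the dictionary, overlaps included.
# DICT and the brute-force scan are illustrative, not jieba internals.
DICT = {"中文", "分词", "中文分词"}

def full_mode(text: str) -> list:
    hits = []
    for i in range(len(text)):
        for j in range(i + 2, len(text) + 1):  # dictionary entries here are 2+ chars
            if text[i:j] in DICT:
                hits.append(text[i:j])
    return hits

print(full_mode("中文分词"))  # ['中文', '中文分词', '分词']
```

Full mode trades precision for recall: every indexable sub-word becomes a token, which helps full-text matching at the cost of noisier, overlapping output.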
Contributor Author

The provided code looks generally correct, but a few points could be tightened:

  1. Function Naming:

    • to_ts_vector and to_query now share identical logic, so one of them is redundant.
    • Consider consolidating them into a single helper, or having one call the other.
  2. Segmentation Behavior:

    • In both to_ts_vector and to_query, jieba.lcut segments the text into words without filtering punctuation, digits, or other symbols, so those tokens are included in the output as well.

    If finer-grained matching is the goal, consider jieba.lcut_for_search(), which performs a precise cut and then additionally emits shorter dictionary sub-words of long terms, the granularity search engines typically index:

    from jieba import lcut_for_search
    
    def to_query(text: str):
        extract_tags = lcut_for_search(text)
        result = " ".join(extract_tags)
        return result
  3. Code Readability:

    • It's generally better practice to use descriptive variable names instead of generic ones like result.

Here’s an updated version with some of these suggestions applied:

import jieba


def get_key_by_word_dict(key, word_dict):
    # Your existing implementation here
    ...


def process_text_for_segmentation(text: str):
    # Search-mode segmentation: precise cut plus sub-words of long terms.
    return " ".join(jieba.lcut_for_search(text))


def to_query(text: str):
    """Convert text into a space-separated token query."""
    return process_text_for_segmentation(text)

These changes reduce duplication and make the segmentation behavior explicit, keeping the code readable and maintainable.
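Given the file name ts_vecto_util, the space-joined output is presumably fed to PostgreSQL's to_tsvector/to_tsquery with a whitespace-based configuration such as 'simple'. A rough pure-Python approximation of what to_tsvector('simple', ...) then does with the pre-segmented string (an illustration of the pipeline, not PostgreSQL's implementation):

```python
# Approximates to_tsvector('simple', text): lowercase, split on whitespace,
# and record the 1-based position(s) of each lexeme. Because jieba's output
# is pre-joined with spaces, each Chinese word becomes its own lexeme.
def simple_tsvector(text: str) -> dict:
    lexemes = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        lexemes.setdefault(token, []).append(pos)
    return lexemes

print(simple_tsvector("中文 分词 中文"))  # {'中文': [1, 3], '分词': [2]}
```

This is why the Python side must do the segmentation: a whitespace-based text-search configuration cannot split Chinese text on its own, so the quality of the tsvector depends entirely on the jieba cut chosen above.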