
fix: Every time a vectorized document is generated, the entire vectorized data of the document is deleted#2721

Merged
shaohuzhang1 merged 1 commit into main from
pr@main@fix_document_embedding
Mar 28, 2025

@shaohuzhang1
Contributor

fix: Every time a vectorized document is generated, the entire vectorized data of the document is deleted
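
The title describes the failure mode only briefly. The following is a minimal in-memory sketch of it, assuming each embedding run only processes the paragraphs that still need vectors. `InMemoryVectorStore`, `embed_pending`, and `demo` are hypothetical names used for illustration and are not taken from the project; only `delete_by_document_id` appears in this PR.

```python
from collections import defaultdict


class InMemoryVectorStore:
    """Toy stand-in for a vector store: document_id -> {paragraph_id: vector}."""

    def __init__(self):
        self._data = defaultdict(dict)

    def save(self, document_id, paragraph_id, vector):
        self._data[document_id][paragraph_id] = vector

    def delete_by_document_id(self, document_id):
        # Drops every vector that belongs to the document.
        self._data.pop(document_id, None)

    def paragraph_ids(self, document_id):
        return sorted(self._data.get(document_id, {}))


def embed_pending(store, document_id, pending_ids, wipe_first):
    """Embed only the paragraphs in pending_ids; optionally wipe the document first."""
    if wipe_first:
        store.delete_by_document_id(document_id)
    for paragraph_id in pending_ids:
        # Placeholder vector; a real run would call an embedding model here.
        store.save(document_id, paragraph_id, [float(paragraph_id)])


def demo(wipe_first):
    store = InMemoryVectorStore()
    # Paragraphs 1 and 2 were embedded in an earlier run.
    store.save("doc", 1, [1.0])
    store.save("doc", 2, [2.0])
    # A later run only needs to embed paragraph 3.
    embed_pending(store, "doc", [3], wipe_first=wipe_first)
    return store.paragraph_ids("doc")
```

With `wipe_first=True` the vectors for paragraphs 1 and 2 are lost even though they were not re-embedded, which is the behavior the title reports. With `wipe_first=False` all three paragraphs keep their vectors.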

@f2c-ci-robot

f2c-ci-robot bot commented Mar 28, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@f2c-ci-robot

f2c-ci-robot bot commented Mar 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

VectorStore.get_embedding_vector().delete_by_document_id(document_id)

# Vectorize paragraph by paragraph
page_desc(QuerySet(Paragraph)
Contributor Author

The provided code snippet does not contain any obvious irregularities or potential issues. However, there are some general optimizations that can be made:

Potential Improvements

  1. Use of Generators for Data Handling: For large datasets, consider consuming the QuerySet lazily (for example with Django's QuerySet.iterator()) so that rows are streamed rather than loaded into memory all at once.

  2. Exception Handling: Consider adding try-except blocks around database operations and document management tasks to ensure robustness against exceptions during execution.

  3. Code Clarity: Ensure that variable names and function structures are clear and concise. This makes the code easier to understand and maintain.

  4. Performance Optimization: If VectorStore's delete_by_document_id method is slow (for example because of index maintenance), consider batching deletions where applicable.

Here's a slightly optimized version of the relevant part of the code (assuming document_id is valid):

def is_the_task_interrupted():
    try:
        # Update listener status
        ListenerManagement.update_status(
            QuerySet(Document).filter(id=document_id),
            TaskType.EMBEDDING,
            State.STARTED
        )
        
        # Delete document vector data
        VectorStore.get_embedding_vector().delete_by_document_id(document_id)
        
        # Vectorize paragraph by paragraph
        page_desc(QuerySet(Paragraph))
    
    except Exception as e:
        print(f"An error occurred: {e}")

By implementing these suggestions, the code will be more robust, efficient, and readable. Adjustments may vary based on specific requirements and constraints within your project.
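
Points 1 and 2 of the review can be sketched independently of Django. The following is a minimal illustration, assuming the work can be split into fixed-size pages; `iter_pages` and `process_in_pages` are hypothetical helpers, not part of the project, and `handle_page` stands in for whatever per-page work (such as embedding or deletion) the caller supplies.

```python
import logging
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def iter_pages(items: Iterable[T], page_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most page_size items, one page at a time."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    page: List[T] = []
    for item in items:
        page.append(item)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        # Emit the final, possibly shorter, page.
        yield page


def process_in_pages(
    items: Iterable[T],
    page_size: int,
    handle_page: Callable[[List[T]], object],
) -> int:
    """Run handle_page on each page; log and re-raise the first failure."""
    processed = 0
    for page in iter_pages(items, page_size):
        try:
            handle_page(page)
        except Exception:
            # logger.exception records the traceback; re-raising keeps the
            # failure visible to the caller instead of swallowing it.
            logger.exception("Failed on page starting with %r", page[0])
            raise
        processed += len(page)
    return processed
```

Logging and re-raising differs from the `print` in the snippet above, which would hide the failure from the caller and let the task appear to succeed.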

@shaohuzhang1 shaohuzhang1 merged commit dcc80a4 into main Mar 28, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@fix_document_embedding branch March 28, 2025 06:15