
Library.run_ocr_on_images(add_to_library=True) populates 'text_search' only, leaving 'text_block' empty, whereas Query.query (semantic type) retrieves data from 'text_block' #1123

@wissamharoun

Description


environment
llmware v0.3.8
macOS 15
active db: sqlite
vector db: chromadb
To illustrate the issue, I used the example file slicing_and_dicing_office_docs.py with the Microsoft Investor Relations data. However, the issue was first discovered on our private data, which is very OCR heavy.

issue:
Run lib.add_files() to ingest documents from which the C parser extracts images, pending downstream OCR with lib.run_ocr_on_images(add_to_library=True).

Next, perform the OCR on the images extracted to the image directory, using llmware's "convenience" method:
lib.run_ocr_on_images(add_to_library=True, other_params)
The result is a new collection written to the db, one entry per image, each referencing the originating doc by 'doc_ID' (and so forth), with block_IDs starting at 100,000 and incrementing. The text chunks extracted by Tesseract OCR populate only 'text_search'.
Then create a new embedding with llmware's
lib.install_new_embedding(params)
Chunks/sentences for embedding are retrieved and collated into batches from 'text_search'.
So far so good.
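
For anyone who wants to check whether their own library is affected, here is a minimal sketch of a check over block records. It assumes the blocks have already been read out of the db as plain dicts keyed by the field names mentioned above ('doc_ID', 'block_ID', 'text_block', 'text_search'). The helper name and the sample records are hypothetical and not part of llmware.

```python
from typing import Any

# Per the report, blocks added by OCR get block_IDs starting at 100,000.
OCR_BLOCK_ID_START = 100_000


def find_ocr_blocks_missing_text(blocks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return OCR blocks that have 'text_search' populated but an empty 'text_block'."""
    affected = []
    for block in blocks:
        is_ocr = block.get("block_ID", 0) >= OCR_BLOCK_ID_START
        has_search = bool((block.get("text_search") or "").strip())
        has_block = bool((block.get("text_block") or "").strip())
        if is_ocr and has_search and not has_block:
            affected.append(block)
    return affected


# Hypothetical records: one block from the parser, one added by OCR.
sample_blocks = [
    {"doc_ID": 1, "block_ID": 0, "text_block": "Parsed text", "text_search": "Parsed text"},
    {"doc_ID": 1, "block_ID": 100000, "text_block": "", "text_search": "Text from OCR"},
]

affected = find_ocr_blocks_missing_text(sample_blocks)
print([b["block_ID"] for b in affected])
```

Any block this returns would be embedded correctly but come back with an empty 'text' at query time.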

At query time:
Query.query(query="a query highly pertaining to the corpus", query_type="semantic", other_params)

returns results where 'text' is empty! A little digging reveals that while the query text is indeed compared against bona fide embedded chunks, the 'text' of the returned results is retrieved from 'text_block', which remains empty after OCR.
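
As a possible stopgap until this is fixed in the library, 'text_block' could be backfilled from 'text_search' directly in the sqlite db. The sketch below is untested against a real llmware database. The table name `blocks` and the four-column schema are assumptions made for illustration, and the actual llmware sqlite schema may differ, so try this on a copy of the db first.

```python
import sqlite3


def backfill_text_block(conn: sqlite3.Connection, table: str = "blocks") -> int:
    """Copy 'text_search' into 'text_block' for rows where 'text_block' is empty.

    The table name is a hypothetical placeholder. It is interpolated into the
    SQL string, so it must come from trusted code, never from user input.
    Returns the number of rows updated.
    """
    cur = conn.execute(
        f"UPDATE {table} SET text_block = text_search "
        "WHERE (text_block IS NULL OR text_block = '') "
        "AND text_search IS NOT NULL AND text_search != ''"
    )
    conn.commit()
    return cur.rowcount


# Demonstration on an in-memory db with an assumed, simplified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blocks (doc_ID INTEGER, block_ID INTEGER, text_block TEXT, text_search TEXT)"
)
conn.executemany(
    "INSERT INTO blocks VALUES (?, ?, ?, ?)",
    [
        (1, 0, "Parsed text", "Parsed text"),   # block from the parser
        (1, 100000, "", "Text from OCR"),       # block added by OCR
    ],
)

updated = backfill_text_block(conn)
rows = conn.execute("SELECT block_ID, text_block FROM blocks ORDER BY block_ID").fetchall()
print(updated, rows)
```

Rows that already have a non-empty 'text_block' are left untouched, so running the backfill more than once should be harmless.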

the following images show this clearly...

Screenshot 2024-12-01 at 19 02 43

Screenshot 2024-12-02 at 17 25 10

Screenshot 2024-12-01 at 19 01 45
