
Issue concerning formula detection/processing. #297

@mkd-enrique


I apologise in advance if this has been filed in the wrong repository; I was using both langchain-opendataloader-pdf and opendataloader-pdf[hybrid]. If it belongs elsewhere, kindly let me know and I will re-file it in the right repository.

Bug

With hybrid_mode="full" set on the client side, as recommended when the backend is started with --enrich-formula, all pages are expected to be routed to the hybrid backend. However, the extracted output is identical to the output produced without hybrid_mode="full", which suggests that pages are still being processed locally and that --enrich-formula has no effect.
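One way to make the "identical output" observation checkable is to compare the two runs page by page. The following is a minimal sketch, assuming the text of each page has already been collected into a list of strings per run (for example from the `page_content` of the documents the loader returns). The function name `differing_pages` is hypothetical and not part of either package.

```python
from typing import List


def differing_pages(baseline: List[str], hybrid: List[str]) -> List[int]:
    """Return 0-based indices of pages whose text differs between two runs."""
    longest = max(len(baseline), len(hybrid))
    diffs = []
    for i in range(longest):
        # A page that exists in only one run is treated as differing.
        a = baseline[i] if i < len(baseline) else None
        b = hybrid[i] if i < len(hybrid) else None
        if a != b:
            diffs.append(i)
    return diffs
```

An empty result corresponds to the behaviour reported here: no page changed when hybrid_mode="full" was enabled.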

This is especially visible on pages containing mathematical formulas, where the output still contains fragmented SymbolMT-encoded characters instead of properly reconstructed formula content carrying the appropriate "formula" label. A screenshot of the formula is attached at the end of this issue for further context.
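Whether any element received the "formula" label can be checked mechanically on the JSON output. The sketch below walks parsed JSON of any shape and counts string values equal to "formula"; it deliberately does not rely on a particular schema. The keys `kids` and `type` in the sample are assumptions made up for illustration, and `count_label` is a hypothetical helper.

```python
import json
from typing import Any


def count_label(node: Any, label: str = "formula") -> int:
    """Count string values equal to `label` anywhere in parsed JSON."""
    if isinstance(node, dict):
        return sum(count_label(v, label) for v in node.values())
    if isinstance(node, list):
        return sum(count_label(v, label) for v in node)
    return 1 if node == label else 0


# Illustrative input only; the real output schema may use different keys.
sample = json.loads('{"kids": [{"type": "paragraph"}, {"type": "formula"}]}')
print(count_label(sample))
```

A count of zero on a page known to contain formulas would be consistent with the enrichment not having been applied.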

The following command was used to start the hybrid backend:
opendataloader-pdf-hybrid --enrich-formula --port 5002

And the loader had the following parameters:

loader = OpenDataLoaderPDFLoader(
    file_path=[str(sample_pdf)],
    format="json",
    quiet=True,
    split_pages=True,
    use_struct_tree=True,
    table_method="cluster",
    include_header_footer=False,
    sanitize=False,

    # Image handling
    image_output="external",
    image_dir="/home/<redacted>/opendata_test/imagestore",
    image_format="png",

    # Hybrid extraction
    hybrid="docling-fast",
    hybrid_mode="full",
    hybrid_url="http://localhost:5002",
    hybrid_timeout="10000",
    hybrid_fallback=True,
)
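Because hybrid_fallback=True is set, it may be worth ruling out that the backend was simply unreachable from the client. Whether a fallback would silently produce locally processed output is an assumption here, not something this report confirms. A minimal sketch using only the standard library, with the host and port taken from the hybrid_url above and a hypothetical helper name `backend_reachable`:

```python
import socket


def backend_reachable(host: str = "localhost", port: int = 5002,
                      timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

This only shows that something is listening on the port; it does not verify that the listener is the hybrid backend or that --enrich-formula is active.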

...

Version

Python 3.13.9
langchain-opendataloader-pdf 2.0.0
opendataloader-pdf 2.0.1
langchain-text-splitters 1.1.1

Packages were installed using only the following commands:

uv pip install "opendataloader-pdf[hybrid]"
uv pip install -U langchain-opendataloader-pdf
uv pip install -U langchain-text-splitters

...

Java version

openjdk 17.0.16 2025-07-15
OpenJDK Runtime Environment (build 17.0.16+8-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 17.0.16+8-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)

...

Image

[Image: screenshot of the formula]


Labels: bug
