I apologise in advance if this is filed in the wrong repository; I was using both langchain-opendataloader-pdf and opendataloader-pdf[hybrid]. If it belongs elsewhere, please let me know and I will re-file it in the correct repository.
Bug
When hybrid_mode="full" is set on the client side, as recommended when the backend is started with --enrich-formula, all pages are expected to be routed to the hybrid backend. However, the extracted output is identical to the output produced without hybrid_mode="full", which suggests that pages are still being processed locally and that --enrich-formula has no effect.
This is especially visible on pages containing mathematical formulas, where the output still contains fragmented SymbolMT-encoded characters instead of properly reconstructed formula content with the appropriate "formula" label. A screenshot of the formula is attached at the end of this issue for further context.
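To make the "no formula label" observation easier to reproduce, here is a minimal sketch that tallies element labels in an extracted JSON tree. It assumes each element is a JSON object whose label sits under a "type" key; that key name and the sample document below are my assumptions for illustration, not something confirmed by the library documentation. The helper walks every nested value, so it does not depend on how children are nested.

```python
from collections import Counter


def count_element_types(node, type_key="type"):
    """Recursively tally the string values stored under `type_key`.

    `type_key` is an assumed field name; adjust it to match the real output.
    """
    counts = Counter()
    if isinstance(node, dict):
        label = node.get(type_key)
        if isinstance(label, str):
            counts[label] += 1
        # Descend into every value so any nesting layout is covered.
        for value in node.values():
            counts.update(count_element_types(value, type_key))
    elif isinstance(node, list):
        for item in node:
            counts.update(count_element_types(item, type_key))
    return counts


if __name__ == "__main__":
    # Hypothetical sample standing in for one page of extracted JSON.
    sample = {
        "type": "document",
        "kids": [
            {"type": "paragraph", "content": "Some text"},
            {"type": "paragraph", "content": "x = y + z"},
        ],
    }
    counts = count_element_types(sample)
    print("formula elements found:", counts["formula"])
```

Running this against the output produced with and without hybrid_mode="full" should show whether any element labelled "formula" appears in either case.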
The following command was used to start the hybrid backend:
opendataloader-pdf-hybrid --enrich-formula --port 5002
The loader was configured with the following parameters:
loader = OpenDataLoaderPDFLoader(
    file_path=[str(sample_pdf)],
    format="json",
    quiet=True,
    split_pages=True,
    use_struct_tree=True,
    table_method="cluster",
    include_header_footer=False,
    sanitize=False,
    # Image Handling
    image_output="external",
    image_dir="/home/<redacted>/opendata_test/imagestore",
    image_format="png",
    # Hybrid Extractions
    hybrid="docling-fast",
    hybrid_mode="full",
    hybrid_url="http://localhost:5002",
    hybrid_timeout="10000",
    hybrid_fallback=True,
)
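Since hybrid_fallback=True is set, one possibility worth ruling out is that the client could not reach the backend and quietly processed the pages locally. That is an assumption on my part about what the fallback does, not confirmed behaviour. The sketch below only checks that something accepts TCP connections at the host and port in hybrid_url, using the Python standard library; the helper name backend_reachable is hypothetical, and a successful connection does not prove that the listener is the hybrid server.

```python
import socket
from urllib.parse import urlparse


def backend_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return False
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Same URL as the hybrid_url passed to the loader above.
    print("backend reachable:", backend_reachable("http://localhost:5002"))
```

If this reports that the backend is reachable, the unchanged output is less likely to be caused by a connection problem.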
...
Version
Python 3.13.9
langchain-opendataloader-pdf 2.0.0
opendataloader-pdf 2.0.1
langchain-text-splitters 1.1.1
Only the following commands were used to install packages:
uv pip install "opendataloader-pdf[hybrid]"
uv pip install -U langchain-opendataloader-pdf
uv pip install -U langchain-text-splitters
...
Java version
openjdk 17.0.16 2025-07-15
OpenJDK Runtime Environment (build 17.0.16+8-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 17.0.16+8-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
...
Image
