feat: integrate Docling for high-fidelity PDF ingestion (#80) #146

Open
DhruvGarg111 wants to merge 3 commits into Bessouat40:main from DhruvGarg111:feature/docling-pdf-ingestion

@DhruvGarg111

This PR implements high-fidelity PDF ingestion using IBM's Docling, as discussed in issue #80.

Key Changes:

  • DoclingPDFProcessor: New processor supporting table structure, LaTeX formulas, code snippets, and charts.
  • DocumentProcessorFactory: Updated to use Docling as the default PDF processor with a fallback to PDFProcessor if docling is not installed.
  • Dependencies: Added docling>=2.15.1 to pyproject.toml and updated uv.lock.
  • Documentation: Updated README to reflect the new capabilities.
  • Formatting: Applied black formatting to pass CI checks.
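
The factory default described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the two classes are empty stand-ins, `default_pdf_processor` is a hypothetical helper name, and detecting the package with `importlib.util.find_spec` is an assumption about how a flag like `HAS_DOCLING` might be computed.

```python
import importlib.util


class PDFProcessor:
    """Hypothetical stand-in for the existing basic PDF processor."""


class DoclingPDFProcessor:
    """Hypothetical stand-in for the Docling-backed processor."""


# Assumption: availability is detected by looking the package up
# without importing it. The PR may compute HAS_DOCLING differently.
HAS_DOCLING = importlib.util.find_spec("docling") is not None


def default_pdf_processor(has_docling: bool = HAS_DOCLING):
    """Return the Docling processor when available, else the basic one."""
    return DoclingPDFProcessor() if has_docling else PDFProcessor()
```

Passing the flag as a parameter keeps both branches easy to exercise in tests, whether or not docling is installed.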

Implements DoclingPDFProcessor with support for tables, LaTeX formulas, code snippets, and charts. Updates DocumentProcessorFactory with fallback support.
Adds Docling features to README and applies black formatting to maintain CI compliance.
@Bessouat40
Owner

Hello, thanks for the contribution! It looks great. I'll test it soon and integrate it if it works fine.

@DhruvGarg111
Author

Sure, let me know if any changes are needed.

@Bessouat40
Owner

Maybe you can pin a version of docling? I had some errors using the minimum version, 2.15.1:

File "/Users/labess40/dev/RAGLight/src/raglight/document_processing/document_processor_factory.py", line 29, in __init__
    "pdf": DoclingPDFProcessor() if HAS_DOCLING else PDFProcessor(),
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/labess40/dev/RAGLight/src/raglight/document_processing/docling_pdf_processor.py", line 22, in __init__
    pipeline_options.do_formula_enrichment = True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.12/site-packages/pydantic/main.py", line 997, in __setattr__
    elif (setattr_handler := self._setattr_handler(name, value)) is not None:
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/lib/python3.12/site-packages/pydantic/main.py", line 1044, in _setattr_handler
    raise ValueError(f'"{cls.__name__}" object has no field "{name}"')
ValueError: "PdfPipelineOptions" object has no field "do_formula_enrichment"

But it works with the latest PyPI version.

Same for the huggingface version: it works only if I install a newer version, but then there are some conflicts with other libraries.

@Bessouat40
Owner

As for the code, it looks good to me 👍

@DhruvGarg111
Author

Regarding the error:

I was on the latest version, so I did not notice it. I can think of two solutions.

  • Bump the minimum docling version: we can raise the minimum version of docling in pyproject.toml to one that supports those attributes. However, bumping it might cause conflicts with other libraries.

  • Or, we can use a try-except block:

from docling.datamodel.pipeline_options import PdfPipelineOptions

def __init__(self):
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_table_structure = True

    try:
        # pydantic raises ValueError when the installed docling lacks a field
        pipeline_options.do_formula_enrichment = True
        pipeline_options.do_code_enrichment = True
        pipeline_options.do_chart_extraction = True
    except ValueError:
        pass  # Older version detected, safely skip these options

Let me know if you want me to make either of these changes. If you have a better solution, I welcome it.
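
One caveat with a single try block is that the first rejected option skips the remaining ones, even if the installed version supports them. A possible variant sets each option on its own. The sketch below is illustrative only: `StrictOptions` is a hypothetical stand-in that mimics the `ValueError` shown in the traceback, and `enable_if_supported` is an assumed helper name, not part of docling or this PR.

```python
class StrictOptions:
    """Hypothetical stand-in for an older PdfPipelineOptions.

    Like the pydantic model in the traceback, it rejects unknown fields.
    """

    _fields = {"do_table_structure"}

    def __setattr__(self, name, value):
        if name not in self._fields:
            raise ValueError(
                f'"{type(self).__name__}" object has no field "{name}"'
            )
        object.__setattr__(self, name, value)


def enable_if_supported(options, names):
    """Set each named flag to True; return the names that were rejected."""
    skipped = []
    for name in names:
        try:
            setattr(options, name, True)
        except (ValueError, AttributeError):
            # This option does not exist on the installed version.
            skipped.append(name)
    return skipped


opts = StrictOptions()
skipped = enable_if_supported(
    opts,
    [
        "do_table_structure",
        "do_formula_enrichment",
        "do_code_enrichment",
        "do_chart_extraction",
    ],
)
```

Returning the skipped names lets the caller log which features are unavailable instead of dropping them silently.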

@Bessouat40
Owner

I think it's better to pin the docling version, and not only bump the minimum, to avoid errors.
If we just bump the minimum and they introduce a breaking change, users can run into errors.
Pinning the version ensures our code works.

@DhruvGarg111
Author

Implemented the compatibility fix requested in review.

This update makes Docling PDF processing robust across Docling versions by safely enabling advanced pipeline options
only when they exist, and falling back to PDFProcessor if Docling initialization fails. I also added targeted tests
for factory behavior to verify:

  • Docling is used when it is available and healthy
  • the factory falls back to PDFProcessor when Docling initialization raises
  • the factory falls back to PDFProcessor when Docling is unavailable

I have tested it in a venv and it works. Let me know if any changes are needed.
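
The three factory cases listed above could be exercised with a sketch like the following. All names here are hypothetical stand-ins (`build_pdf_processor`, `HealthyDoclingProcessor`, `BrokenDoclingProcessor`) and the actual tests in the PR may be structured differently. Passing `None` models the case where docling is not installed.

```python
import logging

logger = logging.getLogger(__name__)


class PDFProcessor:
    """Hypothetical stand-in for the basic fallback processor."""


class HealthyDoclingProcessor:
    """Hypothetical stand-in for a Docling processor that initializes fine."""


class BrokenDoclingProcessor:
    """Hypothetical stand-in whose initialization fails."""

    def __init__(self):
        raise ValueError("simulated Docling init failure")


def build_pdf_processor(docling_cls=None, fallback_cls=PDFProcessor):
    """Prefer the Docling processor; fall back if it is missing or fails."""
    if docling_cls is None:
        # Docling is not installed.
        return fallback_cls()
    try:
        return docling_cls()
    except Exception as exc:
        # Initialization failed, e.g. an unsupported pipeline option.
        logger.warning("Docling init failed, using fallback: %s", exc)
        return fallback_cls()
```

Logging the failure before falling back keeps the degraded mode visible to users, who would otherwise get lower-fidelity PDF parsing without knowing why.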

@DhruvGarg111
Author

Hey @Bessouat40, is there anything else required for this?
