Commit c7a809f

mrowdy and claude committed
chore: merge upstream/main and resolve docstring conflict
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 parents 8d61008 + fef01f8 commit c7a809f

204 files changed

Lines changed: 27779 additions & 6418 deletions

.github/workflows/checks.yml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ env:
     tests/test_asr_pipeline.py
     tests/test_threaded_pipeline.py
   PYTEST_TO_SKIP: |-
-  EXAMPLES_TO_SKIP: '^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|suryaocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model|granitedocling_repetition_stopping|mlx_whisper_example|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm|post_process_ocr_with_vlm)\.py$'
+  EXAMPLES_TO_SKIP: '^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|suryaocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model|granitedocling_repetition_stopping|mlx_whisper_example|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm|post_process_ocr_with_vlm)\.py$|xbrl_conversion\.ipynb$'

 jobs:
   lint:
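
The new alternative in `EXAMPLES_TO_SKIP` sits outside the `^( ... )` group, so it is anchored only at the end of the name. The sketch below illustrates that behaviour with Python's `re` module. It is an assumption that the workflow applies the pattern with search semantics; the shortened pattern and the `should_skip` helper are hypothetical and not part of the commit.

```python
import re

# Shortened, hypothetical stand-in for EXAMPLES_TO_SKIP; the real value lists
# many more example names inside the first group.
EXAMPLES_TO_SKIP = r"^(batch_convert|minimal|offline_convert)\.py$|xbrl_conversion\.ipynb$"


def should_skip(name: str) -> bool:
    """Return True when the name matches the skip pattern (search semantics)."""
    return re.search(EXAMPLES_TO_SKIP, name) is not None


for candidate in (
    "minimal.py",
    "docs/examples/minimal.py",
    "xbrl_conversion.ipynb",
    "docs/examples/xbrl_conversion.ipynb",
):
    print(candidate, should_skip(candidate))
```

Under these assumptions the `.py` names match only as bare file names, while the notebook matches with or without a leading directory.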

CHANGELOG.md

Lines changed: 61 additions & 0 deletions
@@ -1,3 +1,64 @@
+## [v2.78.0](https://github.com/docling-project/docling/releases/tag/v2.78.0) - 2026-03-10
+
+### Feature
+
+* Add support for TableFormer v2 ([#3013](https://github.com/docling-project/docling/issues/3013)) ([`4ccd1d4`](https://github.com/docling-project/docling/commit/4ccd1d465deb8d521c09e2da61b537a9236d6560))
+* Add gRPC transport for KServe v2 API engine ([#3074](https://github.com/docling-project/docling/issues/3074)) ([`3d90778`](https://github.com/docling-project/docling/commit/3d90778e3e5762b16758e1c121f42890e32f0560))
+
+### Fix
+
+* **html:** Fix broken document tree and quadratic complexity in rich table cells ([#3025](https://github.com/docling-project/docling/issues/3025)) ([`80f75b8`](https://github.com/docling-project/docling/commit/80f75b8896a6b15c5422c56e9a423e4d2e6673cd))
+* Loosen dependency for pandas3 ([#3095](https://github.com/docling-project/docling/issues/3095)) ([`5188180`](https://github.com/docling-project/docling/commit/5188180ea31dd90567140affc564ce2729b6e4a1))
+* Add parse timeout to legacy LaTeX documents ([#3019](https://github.com/docling-project/docling/issues/3019)) ([`1192714`](https://github.com/docling-project/docling/commit/1192714b536ebb8117785b06ed85e7d203e0996d))
+* **msword:** Skip GroupItem targets without comments attribute ([#3080](https://github.com/docling-project/docling/issues/3080)) ([`ee16285`](https://github.com/docling-project/docling/commit/ee16285651e5c2f963e051b1ee32b50a043191e2))
+
+### Documentation
+
+* Fix code in rag langchain chunker tokenizer ([#2993](https://github.com/docling-project/docling/issues/2993)) ([`d113e61`](https://github.com/docling-project/docling/commit/d113e611c445db6793fd94b3fee9c4109513d04a))
+* Update code snippet to use modern pipeline options syntax ([#3087](https://github.com/docling-project/docling/issues/3087)) ([`95b759e`](https://github.com/docling-project/docling/commit/95b759e5199f1142fb66dc2088c0c36177c5c284))
+* Set HuggingFaceEndpoint task for Mixtral examples ([#2945](https://github.com/docling-project/docling/issues/2945)) ([`5d3ac38`](https://github.com/docling-project/docling/commit/5d3ac38a65000cd39766f87557c685668224ad7f))
+
+## [v2.77.0](https://github.com/docling-project/docling/releases/tag/v2.77.0) - 2026-03-06
+
+### Feature
+
+* Track vlm_inference time for mlx_model pipeline ([#3060](https://github.com/docling-project/docling/issues/3060)) ([`38c4bb2`](https://github.com/docling-project/docling/commit/38c4bb26e8e3a7797d1caec3f690a7c8d5d9a735))
+* Add configurable graph_optimization_level for ONNX Runtime engines ([#3071](https://github.com/docling-project/docling/issues/3071)) ([`cfc6636`](https://github.com/docling-project/docling/commit/cfc6636a2a0e6b149dd51714d20e9b93f3f6463b))
+
+### Fix
+
+* **docx:** Preserve URL fragments and query params in hyperlinks ([#3050](https://github.com/docling-project/docling/issues/3050)) ([`cd9dd10`](https://github.com/docling-project/docling/commit/cd9dd10ccfe2a112af10ad135f8293d3bf845e1a))
+* Detect Office Open XML formats from ZIP contents when filename has no extension ([#3073](https://github.com/docling-project/docling/issues/3073)) ([`56f06fe`](https://github.com/docling-project/docling/commit/56f06fe372e3bfda29c14d66de0a066afb4c79c0))
+* **readingorder:** Assign FURNITURE content_layer to footer/header in container groups ([#3044](https://github.com/docling-project/docling/issues/3044)) ([`f7cb304`](https://github.com/docling-project/docling/commit/f7cb304daa7b7bfe49ba23b81d53fb16da4024af))
+* **docx:** Handle list items immediately after numbered headings ([#3070](https://github.com/docling-project/docling/issues/3070)) ([`56eb127`](https://github.com/docling-project/docling/commit/56eb12782c804b7ec36145bf52c1e005839c816b))
+* **rapidocr:** ORT thread configuration for RapidOCR backend ([#3062](https://github.com/docling-project/docling/issues/3062)) ([`68336c2`](https://github.com/docling-project/docling/commit/68336c2bda2b79f10759ad1587626c47500f4fb4))
+
+### Documentation
+
+* Add examples and fix docstring bug in DocumentConverter ([#3064](https://github.com/docling-project/docling/issues/3064)) ([`653940e`](https://github.com/docling-project/docling/commit/653940e0251e1bc5f311aded31690c64f42d9819))
+* Add docstrings to PipelineOptions classes ([#3065](https://github.com/docling-project/docling/issues/3065)) ([`8b99085`](https://github.com/docling-project/docling/commit/8b990856cd48fec12c68d940e665d8187d349753))
+
+## [v2.76.0](https://github.com/docling-project/docling/releases/tag/v2.76.0) - 2026-03-02
+
+### Feature
+
+* Export to WebVTT format ([#3036](https://github.com/docling-project/docling/issues/3036)) ([`d276e60`](https://github.com/docling-project/docling/commit/d276e6056106b6aa04fee65def96d3e10557d632))
+
+### Fix
+
+* **xlsx:** Handle OneCellAnchor images in Excel backend ([#3045](https://github.com/docling-project/docling/issues/3045)) ([`859c302`](https://github.com/docling-project/docling/commit/859c302310289c5bab45a6e160e7cc3b9c538343))
+* Normalize Unicode ligatures in PDF text extraction ([#3057](https://github.com/docling-project/docling/issues/3057)) ([`6198e69`](https://github.com/docling-project/docling/commit/6198e69dec33d9c14b3be279b19924d73e5eb3fb))
+* **ocr:** Update RapidOCR torch GPU config key ([#3049](https://github.com/docling-project/docling/issues/3049)) ([`477359b`](https://github.com/docling-project/docling/commit/477359b772039b9c9c0d31c9dabcd755abdeb560))
+* Convert PIL images to RGB before picture description ([#3014](https://github.com/docling-project/docling/issues/3014)) ([`90ce93d`](https://github.com/docling-project/docling/commit/90ce93d8a095ea17040bd6a91ded0b463998bea9))
+* **msword:** Use outlineLvl for heading levels and clamp to minimum 1 ([#2916](https://github.com/docling-project/docling/issues/2916)) ([`a3d2b4b`](https://github.com/docling-project/docling/commit/a3d2b4bcc07fc00fff3039ae2046ee69b7587ab2))
+
+### Documentation
+
+* Add metaxy integration ([#3058](https://github.com/docling-project/docling/issues/3058)) ([`7aacc6c`](https://github.com/docling-project/docling/commit/7aacc6c18da3e856babb0f06afd7c985774f118e))
+* Removes merge conflict artifacts ([#3055](https://github.com/docling-project/docling/issues/3055)) ([`672125c`](https://github.com/docling-project/docling/commit/672125cd1bb5e22bb7a677f48157a55ca93f9ff6))
+* Add audio & video processing guide ([#3038](https://github.com/docling-project/docling/issues/3038)) ([`1321b39`](https://github.com/docling-project/docling/commit/1321b39cd8203d5e1cd60191cc9e979c5b939f98))
+* Add XBRL conversion example notebook and update feature listings ([#3039](https://github.com/docling-project/docling/issues/3039)) ([`1eb5c21`](https://github.com/docling-project/docling/commit/1eb5c21dabfed02bfe71cb7fc502d124562f1ba8))
+
 ## [v2.75.0](https://github.com/docling-project/docling/releases/tag/v2.75.0) - 2026-02-24

 ### Feature

README.md

Lines changed: 4 additions & 2 deletions
@@ -33,7 +33,8 @@ Docling simplifies document processing, parsing diverse formats — including ad
 * 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, WebVTT, images (PNG, TIFF, JPEG, ...), LaTeX, and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, WebVTT, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
+* 📜 Support of several application-specific XML schemas incl. [USPTO](https://www.uspto.gov/patents) patents, [JATS](https://jats.nlm.nih.gov/) articles, and [XBRL](https://www.xbrl.org/) financial reports.
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
@@ -46,7 +47,8 @@ Docling simplifies document processing, parsing diverse formats — including ad
 * 📤 Structured [information extraction][extraction] \[🧪 beta\]
 * 📑 New layout model (**Heron**) by default, for faster PDF parsing
 * 🔌 [MCP server](https://docling-project.github.io/docling/usage/mcp/) for agentic applications
-* 💬 Parsing of Web Video Text Tracks (WebVTT) files
+* 💼 Parsing of XBRL (eXtensible Business Reporting Language) documents for financial reports
+* 💬 Parsing of WebVTT (Web Video Text Tracks) files and export to WebVTT format
 * 💬 Parsing of LaTeX files

 ### Coming soon

docling/backend/html_backend.py

Lines changed: 29 additions & 18 deletions
@@ -7,7 +7,7 @@
 from copy import deepcopy
 from io import BytesIO
 from pathlib import Path
-from typing import Final, Optional, Union, cast
+from typing import Final, Iterator, Optional, Union, cast
 from urllib.parse import urljoin, urlparse

 import requests
@@ -656,7 +656,7 @@ def _flush_buffer() -> None:
                 return

             for annotated_text_list in parts:
-                with self._use_inline_group(annotated_text_list, doc):
+                with self._use_inline_group(annotated_text_list, doc) as inline_ref:
                     for annotated_text in annotated_text_list:
                         if annotated_text.text.strip():
                             seg_clean = HTMLDocumentBackend._clean_unicode(
@@ -670,7 +670,8 @@ def _flush_buffer() -> None:
                                     formatting=annotated_text.formatting,
                                     hyperlink=annotated_text.hyperlink,
                                 )
-                                added_refs.append(docling_code2.get_ref())
+                                if inline_ref is None:
+                                    added_refs.append(docling_code2.get_ref())
                             else:
                                 docling_text2 = doc.add_text(
                                     parent=self.parents[self.level],
@@ -680,7 +681,10 @@ def _flush_buffer() -> None:
                                     formatting=annotated_text.formatting,
                                     hyperlink=annotated_text.hyperlink,
                                 )
-                                added_refs.append(docling_text2.get_ref())
+                                if inline_ref is None:
+                                    added_refs.append(docling_text2.get_ref())
+                    if inline_ref is not None:
+                        added_refs.append(inline_ref)

         for node in element.contents:
             if isinstance(node, Tag):
@@ -866,7 +870,7 @@ def _use_format(self, tags: list[str]):
     @contextmanager
     def _use_inline_group(
         self, annotated_text_list: AnnotatedTextList, doc: DoclingDocument
-    ):
+    ) -> Iterator[RefItem | None]:
         """Create an inline group for annotated texts.

         Checks if annotated_text_list has more than one item and if so creates an inline
@@ -876,6 +880,10 @@ def _use_inline_group(
         Args:
             annotated_text_list (AnnotatedTextList): Annotated text
             doc (DoclingDocument): Currently used document
+
+        Yields:
+            The RefItem of the created InlineGroup, or None when the list has only one
+            element and no group is created.
         """
         if len(annotated_text_list) > 1:
             inline_fmt = doc.add_group(
@@ -886,7 +894,7 @@ def _use_inline_group(
             self.parents[self.level + 1] = inline_fmt
             self.level += 1
             try:
-                yield None
+                yield inline_fmt.get_ref()
             finally:
                 self.parents[self.level] = None
                 self.level -= 1
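
The hunks above change `_use_inline_group` to yield the group's reference, and the callers then record that single reference instead of one reference per child. A minimal, self-contained sketch of this pattern follows. `FakeDoc`, `use_inline_group`, and `add_parts` are hypothetical stand-ins written for illustration; they do not use the docling API, and where the final `if` sits relative to the `with` block is an assumption.

```python
from contextlib import contextmanager
from typing import Iterator, Optional


class FakeDoc:
    """Hypothetical stand-in for a document: it only hands out string refs."""

    def __init__(self) -> None:
        self.parent_of: dict[str, Optional[str]] = {}

    def add(self, kind: str, parent: Optional[str] = None) -> str:
        ref = f"#/{kind}/{len(self.parent_of)}"
        self.parent_of[ref] = parent
        return ref


@contextmanager
def use_inline_group(parts, doc, parents) -> Iterator[Optional[str]]:
    """Yield the group's ref when a group is created, otherwise None."""
    if len(parts) > 1:
        group_ref = doc.add("groups", parent=parents[-1])
        parents.append(group_ref)
        try:
            yield group_ref
        finally:
            parents.pop()  # restore the previous parent even on error
    else:
        yield None


def add_parts(parts, doc):
    """Return only top-level refs: the group if one exists, else the lone child."""
    added_refs = []
    parents = [None]
    with use_inline_group(parts, doc, parents) as inline_ref:
        for _text in parts:
            child_ref = doc.add("texts", parent=parents[-1])
            if inline_ref is None:
                added_refs.append(child_ref)
        if inline_ref is not None:
            added_refs.append(inline_ref)
    return added_refs
```

With this shape, children of a group are never reported alongside the group itself, so a caller that re-parents the returned references cannot detach children from their group.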
@@ -1205,7 +1213,7 @@ def _handle_block(self, tag: Tag, doc: DoclingDocument) -> list[RefItem]:
             )
             annotated_texts: AnnotatedTextList = text_list.simplify_text_elements()
             for part in annotated_texts.split_by_newline():
-                with self._use_inline_group(part, doc):
+                with self._use_inline_group(part, doc) as inline_ref:
                     for annotated_text in part:
                         if seg := annotated_text.text.strip():
                             seg_clean = HTMLDocumentBackend._clean_unicode(seg)
@@ -1217,7 +1225,8 @@ def _handle_block(self, tag: Tag, doc: DoclingDocument) -> list[RefItem]:
                                     formatting=annotated_text.formatting,
                                     hyperlink=annotated_text.hyperlink,
                                 )
-                                added_refs.append(docling_code.get_ref())
+                                if inline_ref is None:
+                                    added_refs.append(docling_code.get_ref())
                             else:
                                 docling_text = doc.add_text(
                                     parent=self.parents[self.level],
@@ -1227,7 +1236,10 @@ def _handle_block(self, tag: Tag, doc: DoclingDocument) -> list[RefItem]:
                                     formatting=annotated_text.formatting,
                                     hyperlink=annotated_text.hyperlink,
                                 )
-                                added_refs.append(docling_text.get_ref())
+                                if inline_ref is None:
+                                    added_refs.append(docling_text.get_ref())
+                    if inline_ref is not None:
+                        added_refs.append(inline_ref)

             for img_tag in tag("img"):
                 if isinstance(img_tag, Tag):
@@ -1244,19 +1256,13 @@ def _handle_block(self, tag: Tag, doc: DoclingDocument) -> list[RefItem]:
             added_refs.append(docling_table.get_ref())
             self.parse_table_data(tag, doc, docling_table, num_rows, num_cols)

-            for img_tag in tag("img"):
-                if isinstance(img_tag, Tag):
-                    im_ref2 = self._emit_image(tag, doc)
-                    if im_ref2 is not None:
-                        added_refs.append(im_ref2)
-
         elif tag_name in {"pre"}:
             # handle monospace code snippets (pre).
             text_list = self._extract_text_and_hyperlink_recursively(
                 tag, find_parent_annotation=True, keep_newlines=True
             )
             annotated_texts = text_list.simplify_text_elements()
-            with self._use_inline_group(annotated_texts, doc):
+            with self._use_inline_group(annotated_texts, doc) as inline_ref:
                 for annotated_text in annotated_texts:
                     text_clean = HTMLDocumentBackend._clean_unicode(
                         annotated_text.text.strip()
@@ -1268,7 +1274,10 @@ def _handle_block(self, tag: Tag, doc: DoclingDocument) -> list[RefItem]:
                         formatting=annotated_text.formatting,
                         hyperlink=annotated_text.hyperlink,
                     )
-                    added_refs.append(docling_code2.get_ref())
+                    if inline_ref is None:
+                        added_refs.append(docling_code2.get_ref())
+                if inline_ref is not None:
+                    added_refs.append(inline_ref)

         elif tag_name == "footer":
             with self._use_footer(tag, doc):
@@ -1416,7 +1425,9 @@ def _extract_text_recursively(item: PageElement) -> list[str]:
                 for child in tag:
                     parts.extend(_extract_text_recursively(child))
                 result.append(
-                    "".join(parts) + " " if tag.name in {"p", "li"} else "".join(parts)
+                    "".join(parts) + " "
+                    if tag.name in {"p", "li", "th", "td"}
+                    else "".join(parts)
                 )

             return result
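
The last hunk adds `th` and `td` to the tags whose joined text gets a trailing space, which keeps adjacent table cells from running together. The sketch below reproduces that conditional on a tiny hypothetical tree. `Node`, `SPACED_TAGS`, and `extract_text` are illustrative names only, and the handling of plain strings is an assumption since the diff does not show it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal element: a tag name plus child nodes or strings."""

    name: str
    children: list = field(default_factory=list)


# Tags after which a separating space is appended, as in the updated hunk.
SPACED_TAGS = {"p", "li", "th", "td"}


def extract_text(item) -> list[str]:
    """Collect text depth-first, appending a space after the spaced tags."""
    if isinstance(item, str):
        return [item]
    parts: list[str] = []
    for child in item.children:
        parts.extend(extract_text(child))
    joined = "".join(parts)
    return [joined + " " if item.name in SPACED_TAGS else joined]


row = Node("tr", [Node("td", ["A"]), Node("td", ["B"])])
print(repr("".join(extract_text(row))))
```

If the set held only `p` and `li`, the same row would come out as `AB`, with the two cell values fused.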
