MsWordDocumentBackend does not correctly extract tables when bullet lists in different cells use the same numId #3289

@olivierantonelli

Bug

When bullet lists in different cells of a DOCX table share the same numId, MsWordDocumentBackend does not extract the table correctly.

Steps to reproduce

Using python-docx, create a DOCX file with a two-cell table (2 rows, 1 column) whose cells contain bullet lists that share the same numId:

from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn


def bullets(cell, items, numId="1"):
    """Add bullet paragraphs with numPr injected directly into each paragraph."""
    for i, text in enumerate(items):
        p = cell.paragraphs[0] if i == 0 else cell.add_paragraph()
        p.text = text
        pPr = p._p.get_or_add_pPr()
        numPr = OxmlElement("w:numPr")
        for tag, val in [("w:ilvl", "0"), ("w:numId", numId)]:
            el = OxmlElement(tag)
            el.set(qn("w:val"), val)
            numPr.append(el)
        pPr.insert(0, numPr)


doc = Document()
table = doc.add_table(rows=2, cols=1)
table.style = "Table Grid"

bullets(table.cell(0, 0), ["First row first bulletpoint", "First row second bulletpoint"])
bullets(table.cell(1, 0), ["Second row first bulletpoint", "Second row second bulletpoint"])

doc.save("issue-table.docx")
print("Saved issue-table.docx")

Convert the file with Docling and extract the table:

from docling.document_converter import DocumentConverter
import re

docling_doc = DocumentConverter().convert("issue-table.docx").document
print(re.search(r"<table>.*?</table>", docling_doc.export_to_html(), re.DOTALL).group(0))

Actual result

<table><tbody><tr><th><ul>
<li>First row first bulletpoint</li>
<li>First row second bulletpoint</li>
<li>Second row first bulletpoint</li>
<li>Second row second bulletpoint</li>
</ul></th></tr><tr><td></td></tr></tbody></table>

The two bullet lists are merged into a single list in the first table cell, and the second cell is left empty.

Docling item extraction

[i for i in docling_doc.iterate_items(with_groups=True)]

The extracted items show that all four list items were attached to a single ListGroup (#/groups/0), and that both table cells reference the same rich cell group (#/groups/1):

[(GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/tables/0')], content_layer=<ContentLayer.BODY: 'body'>, meta=None, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>),
  0),
 (TableItem(self_ref='#/tables/0', parent=RefItem(cref='#/body'), children=[RefItem(cref='#/groups/1')], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.TABLE: 'table'>, prov=[], source=[], comments=[], captions=[], references=[], footnotes=[], image=None, data=TableData(table_cells=[RichTableCell(bbox=None, row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=0, end_col_offset_idx=1, text='First row first bulletpoint\nFirst row second bulletpoint', column_header=True, row_header=False, row_section=False, fillable=False, ref=RefItem(cref='#/groups/1')), RichTableCell(bbox=None, row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=0, end_col_offset_idx=1, text='Second row first bulletpoint\nSecond row second bulletpoint', column_header=False, row_header=False, row_section=False, fillable=False, ref=RefItem(cref='#/groups/1'))], num_rows=2, num_cols=1, grid=[[RichTableCell(bbox=None, row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=0, end_col_offset_idx=1, text='First row first bulletpoint\nFirst row second bulletpoint', column_header=True, row_header=False, row_section=False, fillable=False, ref=RefItem(cref='#/groups/1'))], [RichTableCell(bbox=None, row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=0, end_col_offset_idx=1, text='Second row first bulletpoint\nSecond row second bulletpoint', column_header=False, row_header=False, row_section=False, fillable=False, ref=RefItem(cref='#/groups/1'))]]), annotations=[]),
  1),
 (GroupItem(self_ref='#/groups/1', parent=RefItem(cref='#/tables/0'), children=[RefItem(cref='#/groups/0')], content_layer=<ContentLayer.BODY: 'body'>, meta=None, name='rich_cell_group_1_0_0', label=<GroupLabel.UNSPECIFIED: 'unspecified'>),
  2),
 (ListGroup(self_ref='#/groups/0', parent=RefItem(cref='#/groups/1'), children=[RefItem(cref='#/texts/0'), RefItem(cref='#/texts/1'), RefItem(cref='#/texts/2'), RefItem(cref='#/texts/3')], content_layer=<ContentLayer.BODY: 'body'>, meta=None, name='list', label=<GroupLabel.LIST: 'list'>),
  3),
 (ListItem(self_ref='#/texts/0', parent=RefItem(cref='#/groups/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.LIST_ITEM: 'list_item'>, prov=[], source=[], comments=[], orig='First row first bulletpoint', text='First row first bulletpoint', formatting=Formatting(bold=False, italic=False, underline=False, strikethrough=False, script=<Script.BASELINE: 'baseline'>), hyperlink=None, enumerated=False, marker=''),
  4),
 (ListItem(self_ref='#/texts/1', parent=RefItem(cref='#/groups/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.LIST_ITEM: 'list_item'>, prov=[], source=[], comments=[], orig='First row second bulletpoint', text='First row second bulletpoint', formatting=Formatting(bold=False, italic=False, underline=False, strikethrough=False, script=<Script.BASELINE: 'baseline'>), hyperlink=None, enumerated=False, marker=''),
  4),
 (ListItem(self_ref='#/texts/2', parent=RefItem(cref='#/groups/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.LIST_ITEM: 'list_item'>, prov=[], source=[], comments=[], orig='Second row first bulletpoint', text='Second row first bulletpoint', formatting=Formatting(bold=False, italic=False, underline=False, strikethrough=False, script=<Script.BASELINE: 'baseline'>), hyperlink=None, enumerated=False, marker=''),
  4),
 (ListItem(self_ref='#/texts/3', parent=RefItem(cref='#/groups/0'), children=[], content_layer=<ContentLayer.BODY: 'body'>, meta=None, label=<DocItemLabel.LIST_ITEM: 'list_item'>, prov=[], source=[], comments=[], orig='Second row second bulletpoint', text='Second row second bulletpoint', formatting=Formatting(bold=False, italic=False, underline=False, strikethrough=False, script=<Script.BASELINE: 'baseline'>), hyperlink=None, enumerated=False, marker=''),
  4)]

Expected behavior

Each table cell should preserve its own bullet list.

For example:

<table><tbody><tr><th><ul>
<li>First row first bulletpoint</li>
<li>First row second bulletpoint</li>
</ul></th></tr><tr><td><ul>
<li>Second row first bulletpoint</li>
<li>Second row second bulletpoint</li>
</ul></td></tr></tbody></table>
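
To make the expected grouping concrete, here is a toy model in plain Python. It is not Docling's implementation, and it does not claim to identify the cause inside MsWordDocumentBackend; the `paragraphs` data and the `group_lists` helper are hypothetical names used only for illustration. It shows that grouping consecutive list paragraphs by numId alone gives one merged list, while grouping by cell and numId gives one list per cell.

```python
from itertools import groupby

# Hypothetical flattened view of the repro document: (cell, numId, text).
paragraphs = [
    ((0, 0), "1", "First row first bulletpoint"),
    ((0, 0), "1", "First row second bulletpoint"),
    ((1, 0), "1", "Second row first bulletpoint"),
    ((1, 0), "1", "Second row second bulletpoint"),
]


def group_lists(paragraphs, key):
    """Group consecutive paragraphs that share the same key into one list."""
    return [
        [text for _, _, text in run]
        for _, run in groupby(paragraphs, key=key)
    ]


# Keyed by numId only: the cell boundary is ignored and the lists merge.
by_num_id = group_lists(paragraphs, key=lambda p: p[1])

# Keyed by (cell, numId): each cell keeps its own list.
by_cell_and_num_id = group_lists(paragraphs, key=lambda p: (p[0], p[1]))

print(len(by_num_id), len(by_cell_and_num_id))
```

The second grouping corresponds to the expected HTML above: two separate lists, one per table cell.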

Workaround

This issue can be avoided by using different numId values for the two lists.

Adding a carriage return (an empty paragraph) after the bullet list in the first cell also resolves the issue.
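
To apply the first workaround to existing files, it helps to know which documents are affected. The following is a minimal sketch using only the standard library, assuming the usual DOCX layout (a zip archive with the body in `word/document.xml`). The function name `shared_num_ids` is hypothetical, and the check only looks at paragraphs that are direct children of a table cell.

```python
import xml.etree.ElementTree as ET
import zipfile
from collections import defaultdict

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def shared_num_ids(docx_path):
    """Return {numId: cell_count} for numIds used in more than one table cell."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    cells_per_num_id = defaultdict(set)
    for cell_index, tc in enumerate(root.iter(f"{{{W}}}tc")):
        # Only paragraphs directly inside this cell; nested tables are
        # visited separately as their own w:tc elements.
        for num_id in tc.iterfind("./w:p/w:pPr/w:numPr/w:numId", NS):
            cells_per_num_id[num_id.get(f"{{{W}}}val")].add(cell_index)

    return {n: len(c) for n, c in cells_per_num_id.items() if len(c) > 1}
```

Calling `shared_num_ids("issue-table.docx")` on the file produced by the reproduction script should report the numId that both cells share; an empty result means no numId spans more than one cell.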

Docling version

Docling version: 2.87.0
Docling Core version: 2.73.0
Docling IBM Models version: 3.13.0
Docling Parse version: 5.8.0
Python: cpython-313 (3.13.5)
Platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39

Python version

Python 3.13.5

Labels

bug, docx
