Inconsistent structural parsing (list vs paragraph) for similar financial tables #463

@Bhargav129

Title

Bug: Inconsistent parsing of financial statements – list items vs paragraphs for similar structures

Description

I am observing inconsistent parsing behavior for financial statement tables across different PDFs.

Example

For FY 2022–23:

  • Data is parsed as structured list and list item elements
  • Hierarchy (I, II, III, a, b, c) is preserved

For FY 2024–25:

  • Similar data is parsed as plain paragraph elements
  • No structure or hierarchy is detected

Expected Behavior

Similar structured financial statements should produce consistent structured output (list/table format).

Actual Behavior

  • One file → structured list output
  • Another similar file → flat paragraph output
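
To quantify the difference between the two files, the element types in each parser output can be tallied. The following is a minimal sketch, assuming the output is a list of dicts shaped like the samples in this issue (with nested content under 'list items' and 'kids'); `count_types` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def count_types(elements):
    """Recursively count the 'type' value of every element in parser output."""
    counts = Counter()
    for el in elements:
        counts[el.get("type", "unknown")] += 1
        # Nested content appears under 'list items' and 'kids' in the samples.
        counts.update(count_types(el.get("list items", [])))
        counts.update(count_types(el.get("kids", [])))
    return counts

# Tiny stand-ins for the two outputs (structure only, fields trimmed).
structured = [{"type": "list", "list items": [
    {"type": "list item", "kids": []},
    {"type": "list item", "kids": [{"type": "paragraph"}]},
]}]
flat = [{"type": "paragraph"}, {"type": "paragraph"}]

print(count_types(structured))
print(count_types(flat))
```

A file that yields only 'paragraph' entries for a page that visibly contains a numbered statement is a quick signal that the structure was lost.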

Sample Output

2022-2023 Profit and loss

{'type': 'paragraph', 'id': 8175, 'page number': 134, 'bounding box': [56.693, 719.761, 210.733, 730.165], 'font': 'HelveticaLTStd-Roman', 'font size': 9.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'for the year ended on 31st March, 2023'}
{'type': 'paragraph', 'id': 8177, 'page number': 134, 'bounding box': [59.525, 684.966, 545.669, 713.31], 'font': 'HelveticaLTStd-Roman', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '(` In Lakhs) Particulars Notes'}
{'type': 'heading', 'id': 8178, 'level': 'Subtitle', 'page number': 134, 'bounding box': [395.037, 679.941, 466.299, 699.462], 'heading level': 35, 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[1.0, 1.0, 1.0]', 'content': 'For the year ended 31st March, 2023'}
{'type': 'paragraph', 'id': 8179, 'page number': 134, 'bounding box': [478.475, 689.965, 545.667, 699.213], 'font': 'HelveticaLTStd-Roman', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'For the year ended'}
{'type': 'paragraph', 'id': 8180, 'page number': 134, 'bounding box': [59.525, 666.261, 545.669, 689.213], 'font': 'HelveticaLTStd-Roman', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '31st March, 2022 Revenue'}
{'type': 'list', 'id': 8181, 'level': '9', 'page number': 134, 'bounding box': [59.469, 258.365, 545.669, 662.205], 'numbering style': 'roman numbers', 'number of list items': 11,
'list items': [{'type': 'list item', 'page number': 134, 'bounding box': [59.525, 652.373, 545.669, 662.205], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'I. Revenue from Operations 27 59,780.35 51,712.50', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.525, 638.677, 545.669, 648.509], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'II. Other income 28 661.40 582.53', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.525, 625.173, 545.669, 634.693], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'III. Total Income (I+II) 60,441.75 52,295.03', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.525, 611.493, 111.053, 621.013], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'IV. Expenses',
'kids': [{'type': 'list', 'id': 2037, 'level': '10', 'page number': 134, 'bounding box': [73.699, 505.621, 545.669, 607.437], 'numbering style': 'english letters', 'number of list items': 6, 'list items': [
{'type': 'list item', 'page number': 134, 'bounding box': [73.701, 597.605, 545.669, 607.437], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'a) Cost of materials consumed 29 31,058.53 26,617.63', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [73.701, 573.909, 302.629, 593.741], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'b) Changes in inventories of finished goods, stock in trade and work-in-progress', 'kids': [{'type': 'paragraph', 'id': 2036, 'page number': 134, 'bounding box': [368.309, 583.909, 545.669, 593.741], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '30 17.03 92.73'}]},
{'type': 'list item', 'page number': 134, 'bounding box': [73.701, 560.213, 545.669, 570.045], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'c) Employee Benefits Expenses 31 4,774.76 3,944.08', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [73.701, 546.517, 545.669, 556.349], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'd) Finance Costs 32 1,499.73 1,800.13', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [73.699, 532.821, 545.669, 542.653], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'e) Depreciation,\tAmortisation\tand\tImpairment\texpense 33 1,163.19 1,180.92', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [73.701, 505.621, 545.669, 528.957], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'f) Other Expenses 34 17,064.66 14,452.45 Total Expenses (IV) 55,577.90 48,087.94', 'kids': []}]}]},
{'type': 'list item', 'page number': 134, 'bounding box': [59.525, 491.749, 545.661, 501.581], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'V. Profit\tBefore\tExceptional\tItems\tand\tTax\t(III-IV) 4,863.85 4,207.09', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.517, 478.053, 545.653, 487.885], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'VI. Exceptional\tItems - -', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.509, 464.549, 545.653, 474.069], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'VII. Profit Before Tax (V-VI) 4,863.85 4,207.09', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.509, 409.573, 545.653, 460.493], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'VIII. Tax expense: 36 Current Tax 1,285.13 1,041.65 Deferred\tTax (36.51) 78.49 Total Tax Expenses 1248.62 -', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.501, 396.069, 545.645, 405.589], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'IX. Profit for the period (VII-VIII) 3,615.23 3,086.95', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [59.501, 382.181, 377.181, 392.013], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'X. Other Comprehensive Income 37', 'kids': [{'type': 'list', 'id': 2038, 'level': '10', 'page number': 134, 'bounding box': [73.669, 327.397, 548.301, 378.317], 'numbering style': 'english letters', 'number of list items': 2, 'list items': [
{'type': 'list item', 'page number': 134, 'bounding box': [73.677, 354.789, 548.301, 378.317], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'A. (i) Items that will not be reclassified to profit or loss (11.19) 463.57 (ii) Income tax related to items that will not be reclassified to profit or loss 2.95 (108.22)', 'kids': []},
{'type': 'list item', 'page number': 134, 'bounding box': [73.669, 327.397, 545.637, 350.925], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'B. (i) Items that will be reclassified to profit or loss - (ii) Income tax related to items that will be reclassified to profit or loss - -', 'kids': []}]},
{'type': 'paragraph', 'id': 2039, 'page number': 134, 'bounding box': [59.485, 313.893, 545.621, 323.413], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'Total Other Comprehensive Income (X) (8.24) 355.35'}]},
{'type': 'list item', 'page number': 134, 'bounding box': [59.469, 258.365, 545.621, 309.733], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'XI. Total Comprehensive Income for the period (IX+X) 3,606.99 3,442.30 Earnings\tper\tequity\tshare\tof\tFace\tValue\tof C 5 each 38 Basic 10.25 8.75 Diluted 10.25 8.75', 'kids': []}]}

2024-2025 Profit and loss

{'type': 'paragraph', 'page number': 170, 'bounding box': [56.693, 719.761, 196.333, 730.165], 'font': 'HelveticaLTStd-Roman', 'font size': 9.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'for the Year Ended 31st March 2025'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [506.114, 700.309, 545.294, 709.527], 'font': 'HelveticaLTStd-Light', 'font size': 7.5, 'text color': '[0.0, 0.0, 0.0]', 'content': '(In Lakhs)'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [59.528, 686.648, 468.152, 696.168], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[1.0, 1.0, 1.0]', 'content': 'Particulars Note No. For the Year ended '}
{'type': 'paragraph', 'page number': 170, 'bounding box': [478.444, 686.672, 547.516, 695.92], 'font': 'HelveticaLTStd-Roman', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'For the Year ended '}
{'type': 'paragraph', 'page number': 170, 'bounding box': [407.328, 676.648, 465.924, 686.168], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[1.0, 1.0, 1.0]', 'content': '31st March 2025'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [59.526, 662.968, 545.294, 685.92], 'font': 'HelveticaLTStd-Roman', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '31st March 2024 Revenue'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [59.526, 608.184, 545.294, 658.912], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'I. Revenue from Operations 29 79,491.98 67,245.00 II. Other Income 30 917.07 853.71 III. Total Income (I+II) 80,409.05 68,098.71 IV. Expenses'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [73.702, 580.6, 545.294, 604.128], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'a) Cost of Materials Consumed 31 42,410.77 35,684.48 b) Changes in Inventories of Finished Goods, Stock in Trade and Work In '}
{'type': 'paragraph', 'page number': 170, 'bounding box': [367.934, 580.6, 545.294, 590.432], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '32 (443.49) (72.35)'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [87.878, 570.6, 119.59, 580.432], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'Progress'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [73.702, 515.816, 545.294, 566.736], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'c) Employee Benefits Expenses 33 6,382.87 5,410.07 d) Finance Costs 34 1,572.66 1,292.05 e) Depreciation, Amortisation and Impairment Expense 35 1,506.76 1,158.88 f) Other Expenses 36 21,406.31 17,651.18'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [59.526, 447.336, 545.294, 511.832], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'Total Expenses (IV) 72,835.88 61,124.32 V. Profit Before Exceptional Items and Tax(III-IV) 7,573.17 6,974.41 VI. Exceptional Items 52 203.50 155.56 VII. Profit Before Tax (V-VI) 7,369.67 6,818.84 VIII. Tax Expense: 38'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [73.702, 406.44, 545.294, 443.472], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'Current Tax 1,828.09 1,750.26 Deferred Tax (94.41) 53.30 Total Tax Expense 1,733.68 1,803.56'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [59.526, 378.856, 545.294, 402.264], 'font': 'HelveticaLTStd-Bold', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'IX. Profit for the year(VII-VIII) 5,635.98 5,015.28 X. Other Comprehensive Income 39'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [73.702, 365.16, 545.294, 374.992], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': 'A. (i) Items that will not be reclassified to profit or loss 532.36 804.63'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [73.702, 337.768, 545.294, 361.296], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '(ii) Income tax related to items that will not be reclassified to profit or loss 58.12 (187.11) B. (i) Items that will be reclassified to profit or loss - -'}
{'type': 'paragraph', 'page number': 170, 'bounding box': [59.526, 240.952, 545.294, 333.904], 'font': 'HelveticaLTStd-Light', 'font size': 8.0, 'text color': '[0.0, 0.0, 0.0]', 'content': '(ii) Income tax related to items that will be reclassified to profit or loss - Total Other Comprehensive Income (X) 590.48 617.52 XI. Total Comprehensive Income for the year(IX+X) 6,226.46 5,632.81 Earnings per equity share of Face Value of5 each 40 Basic 15.97 14.21 Diluted 15.97 14.21 See accompanying notes to the financial statements 1 to 53'}
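
As a stopgap on the consumer side, a flattened paragraph can be split back into rows by looking for the Roman-numeral markers in its 'content' string. This is only a heuristic sketch of a workaround, not a fix for the parser; `split_roman_items` and `extract_amounts` are hypothetical helper names, and the regular expressions assume the numbering and amount formats seen in the FY 2024-25 output above.

```python
import re

# Zero-width split point: a Roman numeral followed by ". " that starts a token.
# "(I+II)" and "(IV)" are skipped because they are preceded by "(".
ROMAN_MARKER = re.compile(r"(?<!\S)(?=[IVXLC]+\.\s)")

# Amounts such as 79,491.98 or (443.49); bare note numbers like 29 do not match.
AMOUNT = re.compile(r"\(?\d[\d,]*\.\d{2}\)?")

def split_roman_items(content):
    """Split a flattened paragraph into one string per Roman-numbered row."""
    return [part.strip() for part in ROMAN_MARKER.split(content) if part.strip()]

def extract_amounts(row):
    """Return the amount tokens of a row, in reading order."""
    return AMOUNT.findall(row)

content = ("I. Revenue from Operations 29 79,491.98 67,245.00 "
           "II. Other Income 30 917.07 853.71 "
           "III. Total Income (I+II) 80,409.05 68,098.71 IV. Expenses")

rows = split_roman_items(content)
for row in rows:
    print(row, extract_amounts(row))
```

The split relies on zero-width matches in `re.split`, which requires Python 3.7 or later. Sub-items such as a), b) and A., B. would need their own patterns, and a line starting with a capital letter that happens to be a Roman numeral (for example "C.") would be split as well.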

Observations

  • Borderless tables and alignment-based layouts are not consistently detected
  • Financial statements are especially affected
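
One way to spot rows that were merged into a single paragraph is to compare each paragraph's bounding-box height with the height of a single row. The sketch below assumes the bounding box is [left, bottom, right, top] with y increasing upwards, as the sample coordinates suggest, and takes its two constants from the FY 2022-23 list items in this issue; `estimate_rows` and `merged_paragraphs` are hypothetical helpers.

```python
# Measured from the FY 2022-23 sample (assumed to hold for this layout only).
LINE_HEIGHT = 9.832   # 662.205 - 652.373: height of a single-row list item
ROW_PITCH = 13.696    # 652.373 - 638.677: bottom-to-bottom distance of adjacent rows

def estimate_rows(bbox, line_height=LINE_HEIGHT, row_pitch=ROW_PITCH):
    """Estimate how many table rows fit in a bounding box of this height."""
    _, bottom, _, top = bbox
    height = top - bottom
    return max(1, round((height - line_height) / row_pitch) + 1)

def merged_paragraphs(elements):
    """Return paragraphs whose bounding box is tall enough for several rows."""
    return [
        el for el in elements
        if el.get("type") == "paragraph" and estimate_rows(el["bounding box"]) > 1
    ]

# The FY 2024-25 paragraph holding rows I to IV, and a single-row paragraph.
sample = [
    {"type": "paragraph", "bounding box": [59.526, 608.184, 545.294, 658.912]},
    {"type": "paragraph", "bounding box": [87.878, 570.6, 119.59, 580.432]},
]
for el in merged_paragraphs(sample):
    print(el["bounding box"], estimate_rows(el["bounding box"]))
```

Paragraphs flagged this way are candidates for row-level re-splitting or for reporting as mis-detected borderless tables.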

Impact

This inconsistency makes it difficult to build reliable downstream pipelines (e.g., financial data extraction, RAG systems).

Additional Context

Both PDFs are visually similar but produce very different structured outputs.
