Improve full markdown output around complex table regions #318

@StevenVincentOne

Problem

On some table-heavy PDFs, the default markdown output in full mode can scramble the reading order of content adjacent to complex tables.

With DeepSeek_R1.pdf as a repro, the markdown around Table 4 contains a mix of:

  • fake footnote bullets like - 1https://example.com
  • flattened benchmark header/value runs
  • a dangling narrative fragment before the next real subsection heading

So the issue is not limited to table fidelity: the default markdown output can also disrupt the prose and section structure surrounding the table.
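To make the first artifact easier to spot in a regression check, here is a minimal Python sketch that flags list items consisting only of a footnote number fused to a URL. It is not ODL code: the regex and the function name `find_fake_footnote_bullets` are assumptions made for illustration, and the pattern is only a heuristic based on the shape shown in this issue.

```python
import re

# Hypothetical heuristic: a bullet whose entire text is digits fused to a URL,
# e.g. "- 1https://example.com". Not part of ODL.
FAKE_FOOTNOTE_BULLET = re.compile(r"^\s*[-*]\s+\d+https?://\S+\s*$")


def find_fake_footnote_bullets(markdown: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) for each suspicious bullet."""
    return [
        (number, line)
        for number, line in enumerate(markdown.splitlines(), start=1)
        if FAKE_FOOTNOTE_BULLET.match(line)
    ]


sample = """##### 3.1. DeepSeek-R1 Evaluation

- 1https://example.com
- 2https://example.com
- A genuine list item with https://example.com inside
"""

hits = find_fake_footnote_bullets(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```

Genuine list items that merely contain a URL are left alone, because the pattern requires the bullet text to start with digits immediately followed by the URL scheme.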

Reproduction

Source PDF:

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • DeepSeek-AI et al.

Observed full-mode markdown shape around Table 4:

##### 3.1. DeepSeek-R1 Evaluation

- 1https://example.com
- 2https://example.com
- 3https://example.com

Claude-3.5- GPT-4o DeepSeek OpenAI OpenAI DeepSeek
...
Table 4 | Comparison between DeepSeek-R1 and other representative models.
...
performance of DeepSeek-R1 will improve in the next version...

##### 3.2. Distilled Model Evaluation

Expected behavior

In full mode, ODL should preserve surrounding reading order while still attempting to emit rich table markdown.

That means table-adjacent artifacts like the fake footnote bullets and dangling prose fragment should not leak between valid subsection boundaries.
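One possible way to check the dangling-fragment part of this expectation is sketched below in Python. It assumes a simple heuristic that is not taken from ODL: if the last paragraph before a heading starts with a lowercase letter, it is probably the tail of a sentence whose beginning was emitted elsewhere. The function name `find_dangling_fragments` is hypothetical.

```python
import re

# ATX heading: one to six '#' characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def find_dangling_fragments(markdown: str) -> list[tuple[str, str]]:
    """Return (paragraph start, following heading) pairs that look dangling.

    Heuristic (an assumption, not ODL behavior): a paragraph directly
    before a heading that begins with a lowercase letter is suspicious.
    """
    results = []
    paragraph_start = None  # first line of the most recent paragraph
    in_paragraph = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            in_paragraph = False  # a blank line ends the current paragraph
            continue
        if HEADING.match(stripped):
            if paragraph_start is not None and paragraph_start[0].islower():
                results.append((paragraph_start, stripped))
            paragraph_start = None
            in_paragraph = False
        elif not in_paragraph:
            paragraph_start = stripped
            in_paragraph = True
    return results


sample = """##### 3.1. DeepSeek-R1 Evaluation

Table 4 | Comparison between DeepSeek-R1 and other representative models.

performance of DeepSeek-R1 will improve in the next version...

##### 3.2. Distilled Model Evaluation

A complete paragraph.
"""

fragments = find_dangling_fragments(sample)
for fragment, heading in fragments:
    print(f"dangling before {heading!r}: {fragment!r}")
```

The heuristic will miss fragments that happen to start with a capital letter and may flag legitimate lowercase-initial paragraphs, so it is better suited to a warning in a test report than to a hard failure.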

Scope

This is intentionally separate from the new markdown-table-output PR that adds caption_only / off modes.

That PR provides a safe alternative for consumers who want to suppress tables.
This issue is about improving the default full markdown behavior itself.

Related

Labels: enhancement (New feature or request)
