Problem
On some table-heavy PDFs, the default markdown output in full mode can degrade nearby reading order around complex tables.
Using DeepSeek_R1.pdf as a repro, the markdown around Table 4 contains a mix of:
- fake footnote bullets like 1https://example.com
- flattened benchmark header/value runs
- a dangling narrative fragment before the next real subsection heading
This means the issue is not only table fidelity. The default markdown output can also disrupt surrounding prose/section structure in the table region.
Reproduction
Source PDF:
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- DeepSeek-AI et al.
Observed full-mode markdown shape around Table 4:
##### 3.1. DeepSeek-R1 Evaluation
- 1https://example.com
- 2https://example.com
- 3https://example.com
Claude-3.5- GPT-4o DeepSeek OpenAI OpenAI DeepSeek
...
Table 4 | Comparison between DeepSeek-R1 and other representative models.
...
performance of DeepSeek-R1 will improve in the next version...
##### 3.2. Distilled Model Evaluation
Expected behavior
In full mode, ODL should preserve surrounding reading order while still attempting to emit rich table markdown.
That means table-adjacent artifacts, such as the fake footnote bullets and the dangling prose fragment, should not leak into the text between valid subsection boundaries.
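One way to turn this expectation into a regression check is to scan the emitted markdown for the fake footnote bullet shape and report which subsection each hit falls under. The sketch below is only an illustration and is not part of ODL: the function name `find_fake_footnote_bullets` and both regular expressions are assumptions, chosen to match the `- 1https://example.com` shape from the repro. It does not attempt to detect the flattened header/value runs or the dangling prose fragment.

```python
import re

# Assumed shape of a fake footnote bullet: a list marker followed by a
# footnote number fused directly onto a URL, e.g. "- 1https://example.com".
FAKE_FOOTNOTE_BULLET = re.compile(r"^\s*[-*]\s+\d+https?://\S+\s*$")
# ATX-style markdown heading, capturing the heading text.
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def find_fake_footnote_bullets(markdown: str) -> list[tuple[str, int, str]]:
    """Return (enclosing heading, 1-based line number, line) per suspect bullet."""
    hits = []
    current_heading = ""  # empty until the first heading is seen
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        heading = HEADING.match(line)
        if heading:
            current_heading = heading.group(1)
            continue
        if FAKE_FOOTNOTE_BULLET.match(line):
            hits.append((current_heading, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Abbreviated copy of the observed output shape from the reproduction.
    observed = "\n".join([
        "##### 3.1. DeepSeek-R1 Evaluation",
        "- 1https://example.com",
        "- 2https://example.com",
        "- 3https://example.com",
        "Claude-3.5- GPT-4o DeepSeek OpenAI OpenAI DeepSeek",
        "##### 3.2. Distilled Model Evaluation",
    ])
    for heading, lineno, line in find_fake_footnote_bullets(observed):
        print(f"{heading!r} line {lineno}: {line}")
```

A check like this could assert that the returned list is empty for the section containing Table 4 once the default full-mode behavior is fixed.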
Scope
This is intentionally separate from the new markdown-table-output PR that adds caption_only / off modes.
That PR provides a safe alternative for consumers who want to suppress tables.
This issue is about improving the default full markdown behavior itself.
Related
- Add markdown table output modes
- Recover flattened benchmark tables emitted as paragraph runs