Problem
On some born-digital academic PDFs (for example DeepSeek_R1.pdf), benchmark tables are emitted as a run of SemanticParagraph blocks instead of TableBorder, so markdown output loses table structure.
Example symptom: the section 3.2 benchmark matrix is flattened into prose-like paragraph lines, and no markdown table is emitted.
Repro
- PDF: DeepSeek_R1.pdf
- Parser config: default markdown generation
- Observed: the metric header, row values, and caption (Table 2 | ...) appear as a sequence of paragraphs rather than as table syntax.
Proposal
Add a conservative recovery pass in markdown generation that:
- Detects a short contiguous paragraph window containing:
  - a metric header pattern (e.g., pass@1, cons@64, rating)
  - rows that pair a model-name token with a run of numeric values
  - a Table N | ... caption
- Reconstructs a markdown table with stable column names.
- Leaves the output unchanged unless all high-confidence conditions are met.
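
To make the proposed heuristic concrete, here is a minimal, language-agnostic sketch written in Python. It is not the code on my branch, and every name in it (`recover_table`, `split_row`, the regexes, the `Model` column label, the window limits) is a hypothetical placeholder. It assumes the window is passed as plain strings with the header first and the caption last, and the sample values in the usage lines are made up.

```python
import re

# Hypothetical patterns; the real pass may use different heuristics.
METRIC_RE = re.compile(r"\b(?:pass@\d+|cons@\d+|rating)\b", re.IGNORECASE)
CAPTION_RE = re.compile(r"^Table\s+\d+\s*\|")
NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?%?$")


def split_row(line, n_cells):
    """Split a line into (label, cells); return None if it is not a data row."""
    tokens = line.split()
    if len(tokens) <= n_cells:
        return None  # no room for a model label in front of the numbers
    label, cells = tokens[:-n_cells], tokens[-n_cells:]
    if not all(NUMBER_RE.match(cell) for cell in cells):
        return None
    return " ".join(label), cells


def recover_table(paragraphs, max_window=12, min_rows=2):
    """Return markdown table lines, or None to leave the output unchanged.

    Assumes paragraphs[0] is the metric header and paragraphs[-1] the caption.
    """
    if not (min_rows + 2 <= len(paragraphs) <= max_window):
        return None
    header, body, caption = paragraphs[0], paragraphs[1:-1], paragraphs[-1]
    metrics = METRIC_RE.findall(header)
    if not metrics or not CAPTION_RE.match(caption):
        return None
    rows = [split_row(line, len(metrics)) for line in body]
    if any(row is None for row in rows):
        return None  # a single non-conforming line cancels the recovery
    columns = ["Model"] + metrics
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for label, cells in rows:
        lines.append("| " + " | ".join([label] + cells) + " |")
    return lines


# Usage with made-up values (not taken from the PDF):
window = [
    "Model pass@1 cons@64",
    "ModelA 10.5 20.0",
    "Model B 30 40.25",
    "Table 2 | Example caption",
]
table = recover_table(window)
```

In this sketch the caption is not part of the returned table, so the caller would keep emitting it as a normal paragraph. Any window that fails a single check returns `None`, which is the fallback described above.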
I have a branch with a first implementation + tests and can open a PR linked to this issue.