
Recover flattened benchmark tables emitted as paragraph runs #259

@StevenVincentOne


Problem

On some born-digital academic PDFs (for example DeepSeek_R1.pdf), benchmark tables are emitted as a run of SemanticParagraph blocks instead of TableBorder, so markdown output loses table structure.

Example symptom: the Section 3.2 benchmark matrix is flattened into prose-like paragraph lines, and no markdown table is emitted.

Repro

  • PDF: DeepSeek_R1.pdf
  • Parser config: default markdown generation
  • Observed: the metric header, row values, and caption (Table 2 | ...) appear as a sequence of paragraphs rather than as table syntax.

Proposal

Add a conservative recovery pass in markdown generation that:

  1. Detects a short contiguous paragraph window containing:
    • a metric header pattern (e.g., pass@1, cons@64, rating)
    • numeric row runs with model tokens
    • a Table N | ... caption
  2. Reconstructs a markdown table with stable column names.
  3. Leaves the output unchanged unless high-confidence conditions are met.
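
To make the three steps concrete, here is a minimal Python sketch of such a pass. It is not the implementation on the branch: the function names (`recover_table`, `split_row`), the regular expressions, the `max_window` limit, and the "at least two rows" threshold are all assumptions chosen for illustration, and the input is modeled as plain paragraph strings rather than SemanticParagraph objects. The sample values in the usage example are invented.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical patterns: the issue names pass@1, cons@64 and rating only as examples.
METRIC_RE = re.compile(r"\b(?:pass@\d+|cons@\d+|rating)\b", re.IGNORECASE)
CAPTION_RE = re.compile(r"^Table\s+\d+\s*\|")
NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?%?$")


def split_row(text: str) -> Optional[Tuple[str, List[str]]]:
    """Split 'model tokens ... numeric values' into (model name, values)."""
    tokens = text.split()
    values: List[str] = []
    # Peel numeric tokens off the end; whatever remains is the model name.
    while tokens and NUMBER_RE.match(tokens[-1]):
        values.insert(0, tokens.pop())
    if not tokens or not values:
        return None
    return " ".join(tokens), values


def recover_table(paragraphs: List[str], max_window: int = 12) -> Optional[str]:
    """Return a markdown table for a paragraph window, or None to leave output unchanged."""
    if not 3 <= len(paragraphs) <= max_window:
        return None
    header: Optional[List[str]] = None
    caption: Optional[str] = None
    rows: List[Tuple[str, List[str]]] = []
    for raw in paragraphs:
        text = raw.strip()
        if CAPTION_RE.match(text):
            if caption is not None:
                return None  # two captions in one window: ambiguous, bail out
            caption = text
            continue
        metrics = METRIC_RE.findall(text)
        # A header line must consist of metric names only, and at least two of them.
        if header is None and len(metrics) >= 2 and len(metrics) == len(text.split()):
            header = metrics
            continue
        row = split_row(text)
        if row is None:
            return None  # any unexplained paragraph cancels the recovery
        rows.append(row)
    # High-confidence gate: header, caption, two or more rows, consistent column counts.
    if header is None or caption is None or len(rows) < 2:
        return None
    if any(len(values) != len(header) for _, values in rows):
        return None
    lines = ["| Model | " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join(["---"] * (len(header) + 1)) + " |")
    for name, values in rows:
        lines.append("| " + name + " | " + " | ".join(values) + " |")
    lines.extend(["", caption])
    return "\n".join(lines)


# Usage with invented values (not taken from any paper):
window = [
    "pass@1 cons@64 rating",
    "model-a 1.0 2.0 3",
    "model-b 4.0 5.0 6",
    "Table 2 | Example caption",
]
print(recover_table(window))
```

Returning `None` on any doubt is what keeps the pass conservative: the caller would emit the original paragraphs untouched whenever the window does not satisfy every condition.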

I have a branch with a first implementation + tests and can open a PR linked to this issue.


Labels: enhancement
