fix: recover flattened benchmark tables in markdown output #262
StevenVincentOne wants to merge 1 commit into opendataloader-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 397d908308
    List<String> metrics = new ArrayList<>();
    int j = i + 1;
    while (j < tokens.length && metrics.size() < metricCount && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
        metrics.add(tokens[j]);
        j += 1;
    }
Parse full model labels before consuming metric tokens
recoverBenchmarkRowsFromText treats a row label as exactly one token and then immediately consumes numeric tokens as metrics, so flattened tables with multi-token model names (for example DeepSeek V3 or Llama 3 70B) get misparsed: the label is truncated and numeric parts of the name are shifted into metric columns. Because the recovered table is then emitted as Markdown, this silently corrupts benchmark results instead of preserving the original text.
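One possible shape, reusing the names from the snippet above (tokens, i, metricCount, NUMERIC_TOKEN_RE); startsMetricRun is a hypothetical helper, not existing code:

```java
// Sketch only: grow the label until the metric run starts, instead of
// assuming the label is exactly one token.
StringBuilder label = new StringBuilder(tokens[i]);
int labelEnd = i + 1;
while (labelEnd < tokens.length && !startsMetricRun(tokens, labelEnd, metricCount)) {
    label.append(' ').append(tokens[labelEnd]);   // "DeepSeek V3", "Llama 3 70B", ...
    labelEnd += 1;
}

List<String> metrics = new ArrayList<>();
for (int t = labelEnd; t < tokens.length && metrics.size() < metricCount; t++) {
    if (!NUMERIC_TOKEN_RE.matcher(tokens[t]).matches()) {
        break;                                    // metric run ended early; caller can reject the row
    }
    metrics.add(tokens[t]);
}

// Hypothetical helper: the metric run starts at `start` only if the next
// metricCount tokens are all numeric.
private static boolean startsMetricRun(String[] tokens, int start, int metricCount) {
    if (start + metricCount > tokens.length) {
        return false;
    }
    for (int k = start; k < start + metricCount; k++) {
        if (!NUMERIC_TOKEN_RE.matcher(tokens[k]).matches()) {
            return false;
        }
    }
    return true;
}
```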
    if (consumed > 0) {
        i += consumed - 1;
        continue;
    }
Do not skip non-table objects when advancing consumed range
After a recovery succeeds, the page loop advances by consumed, but consumed is based on the absolute index of the last recovered text node while recovery scanning explicitly allows multiple intervening non-text nodes. In mixed-content pages, supported objects like images/formulas/lists that fall between recovered text paragraphs are skipped entirely and never written, causing content loss whenever flattened table text is interleaved with other objects.
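One hedged way to keep interleaved objects, assuming the recovery step can report exactly which indices it consumed; recovery, consumedIndices, lastConsumedIndex, objects, and writeObject are illustrative names rather than the project's real API:

```java
// Sketch only: write every object inside the recovered range that was NOT
// absorbed into the table, then move the loop index past the range.
Set<Integer> used = recovery.consumedIndices();      // indices of text nodes merged into the table
for (int k = i; k <= recovery.lastConsumedIndex(); k++) {
    if (!used.contains(k)) {
        writeObject(objects.get(k));                  // images/formulas/lists between rows still get emitted
    }
}
i = recovery.lastConsumedIndex();                     // the loop's own i++ then continues after the range
```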
bundolee left a comment
Code Review: PR #262 -- fix: recover flattened benchmark tables in markdown output
Verdict: REQUEST_CHANGES
Thanks for the contribution, @StevenVincentOne. The benchmark table recovery addresses a real need. Below is detailed feedback.
Code Quality and Best Practices
[Critical] SRP violation -- ~830 lines added to MarkdownGenerator. The recovery logic (15+ methods, 12 regex patterns, inner class) is a self-contained table parser embedded in a markdown writer. This should be extracted into a dedicated class (e.g., FlattenedTableRecovery) that returns structured results, keeping MarkdownGenerator focused on rendering; a rough shape is sketched after this list.
[Important] Hardcoded domain-specific tokens. BENCHMARK_HEADER_HINT_TOKEN_RE matches aime|gpqa|livecode|codeforces|diamond etc. tryRecoverWideComparisonTable skips "English", "Code", "Math", "Chinese". These are DeepSeek-specific and won't generalize. Consider externalizing or clearly documenting the intended scope.
[Important] Duplicated data-collection loop. The same dataText.append → recoverBenchmarkRowsFromText pattern repeats 4 times across methods. Extract into a shared helper.
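For the extraction point above, a rough, illustrative shape; apart from IObject and RecoveredTableRow (which this PR already defines), every name here is a placeholder, not a prescription:

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch: the recovery logic becomes its own class that hands
// structured rows back to MarkdownGenerator, which only renders them.
final class FlattenedTableRecovery {

    /** Rows rebuilt from flattened text plus how many input objects they consumed. */
    static final class Result {
        final List<RecoveredTableRow> rows;
        final int consumedObjects;

        Result(List<RecoveredTableRow> rows, int consumedObjects) {
            this.rows = rows;
            this.consumedObjects = consumedObjects;
        }
    }

    /** Attempts recovery starting at {@code start}; empty when the text does not look like a table. */
    Optional<Result> tryRecover(List<IObject> objects, int start) {
        // the regex patterns, pre-header heuristics, and row assembly would live here
        return Optional.empty();  // placeholder body in this sketch
    }
}
```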
Potential Bugs
[Critical] Partial write risk. tryWriteRecoveredFlattenedBenchmarkTable writes to markdownWriter before returning the consumed count. If a later check fails or throws, orphaned output remains. Buffer the output and write atomically after all validation passes; see the sketch after this list.
[Important] Dead code. metricPos is set to 0, then checked if (metricPos > MAX_TABLE_RECOVERY_LEAD_IN) — always false. Remove.
[Important] Overly broad isLikelyBenchmarkPreHeader. The final heuristic (tokens >= 4 && modelLikeCount >= 3) matches any line with 3+ tokens containing uppercase or digits. Ordinary prose like "The GPT-4 model achieved 95.2% on MMLU" would trigger false-positive table recovery.
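For the partial-write item above, a minimal sketch of the buffer-then-write shape, assuming markdownWriter accepts a full string; appendRecoveredTable and looksComplete are hypothetical helpers:

```java
// Sketch only: render into a local buffer and touch markdownWriter exactly
// once, after every check has passed, so a late failure leaves no orphaned output.
StringBuilder buffer = new StringBuilder();
appendRecoveredTable(buffer, rows);         // all markdown for the recovered table goes to the buffer

if (!looksComplete(rows)) {                 // last validation happens before any output
    return 0;                               // nothing written; caller falls back to plain paragraphs
}

markdownWriter.write(buffer.toString());    // single atomic write
return consumed;
```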
Performance
[Important] O(n²) re-parsing. Each loop iteration calls recoverBenchmarkRowsFromText on the full accumulated text. Consider incremental parsing or a single call after the loop.
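A hedged sketch of the single-parse shape; isCandidateText, textOf, and the exact recoverBenchmarkRowsFromText signature are assumptions:

```java
// Sketch only: accumulate candidate text while scanning, then parse once
// after the scan, instead of re-parsing the growing buffer every iteration.
StringBuilder dataText = new StringBuilder();
int last = i;
while (last < objects.size() && isCandidateText(objects.get(last))) {
    dataText.append(textOf(objects.get(last))).append('\n');
    last += 1;
}
List<RecoveredTableRow> rows = recoverBenchmarkRowsFromText(dataText.toString());  // one O(n) parse
```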
Security
No concerns. Regex patterns are anchored, input is extracted PDF text.
Test Coverage
[Critical] No integration tests. All 10 tests exercise isolated static helpers (buildMetricColumns, recoverBenchmarkRowsFromText). The orchestration logic — which variant to trigger, how paragraphs are consumed, what gets written — has zero coverage. Add at least one test per recovery variant using mock IObject lists; one possible shape is sketched after this list.
[Important] Tests use List<?> — can only verify row count, not content. Make RecoveredTableRow package-private so tests can assert on model names and metric values.
[Important] No tests for isLikelyProseLine or isLikelyBenchmarkPreHeader. These heuristics determine false-positive behavior and need dedicated coverage.
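As one concrete shape for the missing integration coverage, a JUnit 5 sketch; textParagraph, image, and renderPage are hypothetical fixture helpers assumed to be provided by the test harness, and the numbers are arbitrary fixture values:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.List;
import org.junit.jupiter.api.Test;

class FlattenedTableRecoveryIT {

    @Test
    void recoversFlattenedTableAndKeepsInterleavedObjects() {
        // Hypothetical fixtures: a flattened header plus rows, with a non-text object in between.
        List<IObject> page = List.of(
                textParagraph("Model AIME GPQA LiveCodeBench"),
                textParagraph("DeepSeek V3 39.2 59.1 36.2"),
                image("figure-1.png"),
                textParagraph("Llama 3 70B 23.3 50.7 25.3"));

        String markdown = renderPage(page);                 // hypothetical harness around MarkdownGenerator

        assertTrue(markdown.contains("| DeepSeek V3 |"));   // multi-token label kept intact
        assertTrue(markdown.contains("figure-1.png"));      // interleaved image not dropped
    }
}
```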
Process
- CLA: Currently pending — please sign before this PR can proceed.
- DCO: Commit is missing a Signed-off-by line. Please amend with `git commit --amend -s`.
Summary
| Priority | Item |
|---|---|
| Critical | Extract recovery logic into dedicated class (SRP) |
| Critical | Add integration tests for full recovery flow |
| Critical | Buffer output to prevent partial writes |
| Important | Remove dead code (metricPos > MAX_TABLE_RECOVERY_LEAD_IN) |
| Important | Tighten isLikelyBenchmarkPreHeader + add tests |
| Important | Assert on row content, not just count |
| Required | Sign CLA + add DCO sign-off |
The goal is valuable and the defensive fallback design is appreciated. With class extraction and integration tests, this will be solid. Thanks!
397d908 to 5bb7bd7 (Compare)
Thanks for the detailed review. I agree with the core concerns here. This PR grew too large and too document-shaped for a first upstream change.
I'm going to stop pushing this version as-is. I've rewritten the branch commits to use my GitHub-linked email, and I've opened a narrower follow-up PR instead.
That PR keeps For the remaining
And this existing issue is still the right place for the more ambitious recovery path:
Unless there is interest in reviving this branch with a much smaller scope, I plan to leave further work on this problem concentrated in #317 + #318 + #259.
Closing this PR in favor of the narrower follow-up work. This branch is too large and too DeepSeek-shaped to be a good upstream change as-is. I’m leaving the more appropriate follow-up paths in place:
Thanks for the review and guidance here.
Summary
Closes #259.