fix: recover flattened benchmark tables in markdown output #262

Closed
StevenVincentOne wants to merge 1 commit into opendataloader-project:main from StevenVincentOne:fix/deepseek-table-recovery

Conversation


@StevenVincentOne commented Mar 6, 2026

Summary

  • recover DeepSeek-style flattened benchmark tables that are emitted as paragraph runs instead of native table structures
  • handle caption-first, metric-first, and benchmark pre-header variants while preventing prose bleed-through and keeping fallback behavior conservative
  • add focused MarkdownGeneratorTest coverage for duplicate metric headers, partial row handling, percent values, and multi-row flattened recovery cases

Closes #259.

CLAassistant commented Mar 6, 2026

CLA assistant check
All committers have signed the CLA.

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 397d908308


Comment on lines +942 to +946
List<String> metrics = new ArrayList<>();
int j = i + 1;
while (j < tokens.length && metrics.size() < metricCount
        && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
    metrics.add(tokens[j]);
    j += 1;
}

P1: Parse full model labels before consuming metric tokens

recoverBenchmarkRowsFromText treats a row label as exactly one token and then immediately consumes numeric tokens as metrics, so flattened tables with multi-token model names (for example DeepSeek V3 or Llama 3 70B) get misparsed: the label is truncated and numeric parts of the name are shifted into metric columns. Because the recovered table is then emitted as Markdown, this silently corrupts benchmark results instead of preserving the original text.
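
One conservative fix, sketched below with a hypothetical helper (findMetricRunStart is not in the PR; NUMERIC_TOKEN_RE and metricCount are taken from the excerpt above): treat the metrics as the trailing numeric run of the row and everything before it as the label, refusing to split when the shape is ambiguous.

// Sketch only: split a flattened row by locating the trailing numeric run,
// so multi-token names like "DeepSeek V3" stay intact. findMetricRunStart is
// a hypothetical helper; NUMERIC_TOKEN_RE and metricCount come from the diff.
private static int findMetricRunStart(String[] tokens, int metricCount) {
    int start = tokens.length - metricCount;
    if (start < 1) {
        return -1; // no room left for even a one-token label
    }
    for (int j = start; j < tokens.length; j++) {
        if (!NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
            return -1; // trailing run is not fully numeric: refuse to split
        }
    }
    return start; // tokens[0..start) = label, tokens[start..) = metrics
}

A row whose label itself ends in a bare number (for example "Llama 3") can still be ambiguous if the expected metric count is wrong, which is why the conservative -1 fallback matters.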


Comment on lines +95 to +97
if (consumed > 0) {
    i += consumed - 1;
    continue;
}

P2: Do not skip non-table objects when advancing consumed range

After a recovery succeeds, the page loop advances by consumed, but consumed is based on the absolute index of the last recovered text node while recovery scanning explicitly allows multiple intervening non-text nodes. In mixed-content pages, supported objects like images/formulas/lists that fall between recovered text paragraphs are skipped entirely and never written, causing content loss whenever flattened table text is interleaved with other objects.
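
A minimal sketch of one way to close the gap, assuming hypothetical names (writeObject, isText, and the objects list are not the PR's actual API): re-emit the non-text objects inside the consumed span before skipping past it.

// Sketch only: before jumping past the consumed span, render any non-text
// objects that fell between the recovered paragraphs. writeObject, isText,
// and objects are assumed names, not the PR's actual API.
if (consumed > 0) {
    for (int k = i + 1; k < i + consumed; k++) {
        IObject between = objects.get(k);
        if (!isText(between)) {
            writeObject(between); // images/formulas/lists are not dropped
        }
    }
    i += consumed - 1;
    continue;
}

Emitting these objects after the recovered table changes their relative order but avoids silent loss; preserving order properly would require the recovery to report exactly which indices it consumed.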


@bundolee requested review from hyunhee-jo and removed the request for hhjojojo on March 11, 2026 00:40

@bundolee left a comment


Code Review: PR #262 -- fix: recover flattened benchmark tables in markdown output

Verdict: REQUEST_CHANGES

Thanks for the contribution, @StevenVincentOne. The benchmark table recovery addresses a real need. Below is detailed feedback.


Code Quality and Best Practices

[Critical] SRP violation -- ~830 lines added to MarkdownGenerator. The recovery logic (15+ methods, 12 regex patterns, inner class) is a self-contained table parser embedded in a markdown writer. This should be extracted into a dedicated class (e.g., FlattenedTableRecovery) that returns structured results, keeping MarkdownGenerator focused on rendering.
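
A possible shape for that extraction, as a sketch only (Recovery and the method signature are suggestions; IObject and RecoveredTableRow are the names used elsewhere in this review):

import java.util.List;
import java.util.Optional;

// Sketch of the suggested extraction: parsing returns structured rows and
// MarkdownGenerator stays a renderer. All names here are suggestions.
final class FlattenedTableRecovery {

    /** Rows recovered from flattened text, plus how many page objects were consumed. */
    record Recovery(List<RecoveredTableRow> rows, int consumedObjects) {}

    /** Returns empty when no variant fires, so callers fall back to plain prose. */
    Optional<Recovery> tryRecover(List<IObject> objects, int startIndex) {
        // caption-first, metric-first, and pre-header variants would live here
        return Optional.empty(); // placeholder body for the sketch
    }
}

MarkdownGenerator would then only render Recovery results, which also makes the orchestration independently testable.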

[Important] Hardcoded domain-specific tokens. BENCHMARK_HEADER_HINT_TOKEN_RE matches aime|gpqa|livecode|codeforces|diamond etc. tryRecoverWideComparisonTable skips "English", "Code", "Math", "Chinese". These are DeepSeek-specific and won't generalize. Consider externalizing or clearly documenting the intended scope.
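
If externalizing is the chosen route, a minimal sketch (the resource name and property key below are made up for illustration; the default value reuses the current hardcoded tokens):

import java.io.IOException;
import java.io.InputStream;
import java.util.Objects;
import java.util.Properties;
import java.util.regex.Pattern;

// Sketch only: load the benchmark hint tokens from a resource instead of
// hardcoding them. The file name and property key are hypothetical.
private static Pattern loadBenchmarkHintPattern() throws IOException {
    Properties props = new Properties();
    try (InputStream in = Objects.requireNonNull(
            MarkdownGenerator.class.getResourceAsStream("/table-recovery-hints.properties"),
            "missing table-recovery-hints.properties")) {
        props.load(in);
    }
    return Pattern.compile(
            props.getProperty("benchmark.header.hints", "aime|gpqa|livecode|codeforces|diamond"),
            Pattern.CASE_INSENSITIVE);
}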

[Important] Duplicated data-collection loop. The same dataText.append → recoverBenchmarkRowsFromText pattern repeats 4 times across methods. Extract into a shared helper.
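
Something like the following could serve as that helper (isText and textOf are stand-ins for whatever accessors the code actually uses); parsing once after the loop would also sidestep the O(n²) re-parsing flagged under Performance below:

// Sketch of the shared collect-then-parse helper. isText and textOf are
// assumed accessor names; recoverBenchmarkRowsFromText is the PR's parser.
private static List<RecoveredTableRow> collectAndRecover(
        List<IObject> objects, int start, int endExclusive) {
    StringBuilder dataText = new StringBuilder();
    for (int k = start; k < endExclusive; k++) {
        IObject obj = objects.get(k);
        if (isText(obj)) {
            dataText.append(textOf(obj)).append('\n');
        }
    }
    // single parse over the accumulated text instead of one per iteration
    return recoverBenchmarkRowsFromText(dataText.toString());
}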


Potential Bugs

[Critical] Partial write risk. tryWriteRecoveredFlattenedBenchmarkTable writes to markdownWriter before returning the consumed count. If a later check fails or throws, orphaned output remains. Buffer the output and write atomically after all validation passes.
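
A minimal sketch of the buffer-then-commit idea (renderRecoveredTable and passesFinalValidation are hypothetical; markdownWriter is the writer named above):

// Sketch only: render into a buffer first; nothing reaches markdownWriter
// until every validation has passed, so a failed attempt leaves no orphan.
StringBuilder buffered = new StringBuilder();
int consumed = renderRecoveredTable(rows, buffered); // buffer, not the writer
if (consumed <= 0 || !passesFinalValidation(rows)) {
    return 0; // abort: no partial table was ever written
}
markdownWriter.write(buffered.toString()); // single commit at the end
return consumed;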

[Important] Dead code. metricPos is set to 0, then checked if (metricPos > MAX_TABLE_RECOVERY_LEAD_IN) — always false. Remove.

[Important] Overly broad isLikelyBenchmarkPreHeader. The final heuristic (tokens >= 4 && modelLikeCount >= 3) matches any line with 3+ tokens containing uppercase or digits. Ordinary prose like "The GPT-4 model achieved 95.2% on MMLU" would trigger false-positive table recovery.
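
One way to tighten it, sketched with the existing BENCHMARK_HEADER_HINT_TOKEN_RE (the threshold and structure are illustrative, not a drop-in fix):

// Sketch only: demand a contiguous tail of header-hint tokens instead of
// "3+ model-like tokens anywhere", so ordinary sentences fail the check.
private static boolean isLikelyBenchmarkPreHeader(String[] tokens) {
    if (tokens.length < 4) {
        return false;
    }
    int trailingHints = 0;
    for (int j = tokens.length - 1; j >= 0; j--) {
        if (!BENCHMARK_HEADER_HINT_TOKEN_RE.matcher(tokens[j]).find()) {
            break; // a prose word ends the run
        }
        trailingHints++;
    }
    return trailingHints >= 3; // headers end in a run of benchmark names
}

Under this check, "The GPT-4 model achieved 95.2% on MMLU" fails immediately because its final token is not a hint token.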


Performance

[Important] O(n²) re-parsing. Each loop iteration calls recoverBenchmarkRowsFromText on the full accumulated text. Consider incremental parsing or a single call after the loop.


Security

No concerns. Regex patterns are anchored, and the input is extracted PDF text.


Test Coverage

[Critical] No integration tests. All 10 tests exercise isolated static helpers (buildMetricColumns, recoverBenchmarkRowsFromText). The orchestration logic — which variant to trigger, how paragraphs are consumed, what gets written — has zero coverage. Add at least one test per recovery variant using mock IObject lists.
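
For example, something in this direction (JUnit 5; textObject and renderPage are hypothetical fixtures, since constructing real IObject instances is project-specific, and the values are illustrative):

import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.List;
import org.junit.jupiter.api.Test;

// Sketch of an orchestration-level test. textObject and renderPage are
// assumed helpers, not existing test utilities; fixture data is invented.
@Test
void recoversCaptionFirstFlattenedTableEndToEnd() {
    List<IObject> page = List.of(
            textObject("Table 3: Benchmark results"), // caption-first variant
            textObject("Model AIME GPQA"),            // flattened header line
            textObject("DeepSeek V3 88.5 71.2"),      // flattened data rows
            textObject("Llama 3 70B 64.1 50.3"));
    String markdown = renderPage(page); // drives the real page loop
    assertTrue(markdown.contains("| DeepSeek V3 | 88.5 | 71.2 |"));
    assertTrue(markdown.contains("| Llama 3 70B | 64.1 | 50.3 |"));
}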

[Important] Tests use List<?> — can only verify row count, not content. Make RecoveredTableRow package-private so tests can assert on model names and metric values.

[Important] No tests for isLikelyProseLine or isLikelyBenchmarkPreHeader. These heuristics determine false-positive behavior and need dedicated coverage.


Process

  • CLA: Currently pending — please sign before this PR can proceed.
  • DCO: Commit is missing Signed-off-by. Please amend with git commit --amend -s.

Summary

Priority    Item
Critical    Extract recovery logic into dedicated class (SRP)
Critical    Add integration tests for full recovery flow
Critical    Buffer output to prevent partial writes
Important   Remove dead code (metricPos > MAX_TABLE_RECOVERY_LEAD_IN)
Important   Tighten isLikelyBenchmarkPreHeader + add tests
Important   Assert on row content, not just count
Required    Sign CLA + add DCO sign-off

The goal is valuable and the defensive fallback design is appreciated. With class extraction and integration tests, this will be solid. Thanks!

@StevenVincentOne force-pushed the fix/deepseek-table-recovery branch from 397d908 to 5bb7bd7 on March 19, 2026 22:20
@StevenVincentOne

Thanks for the detailed review. I agree with the core concerns here.

This PR grew too large and too document-shaped for a first upstream change. In particular:

  • too much recovery logic ended up inside MarkdownGenerator
  • the heuristics are too DeepSeek/benchmark-specific
  • the lack of end-to-end orchestration coverage is a real gap

I’m going to stop pushing this version as-is.

I’ve rewritten the branch commits to use my GitHub-linked email (stevenvincentone@icloud.com), so the CLA check should be able to re-evaluate correctly.

I’ve opened a narrower follow-up PR instead:

That PR keeps the full mode unchanged and gives downstream consumers a safe caption-only mode without requiring aggressive reconstruction heuristics in the renderer.

For the remaining full-mode table quality problem, I’ve also opened:

And this existing issue is still the right place for the more ambitious recovery path:

Unless there is interest in reviving this branch with a much smaller scope, I plan to leave further work on this problem concentrated in #317 + #318 + #259.

@StevenVincentOne

Closing this PR in favor of the narrower follow-up work.

This branch is too large and too DeepSeek-shaped to be a good upstream change as-is. I’m leaving the more appropriate follow-up paths in place:

Thanks for the review and guidance here.
