fix: recover flattened benchmark tables in markdown output #262
StevenVincentOne wants to merge 1 commit into opendataloader-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 397d908308
    List<String> metrics = new ArrayList<>();
    int j = i + 1;
    while (j < tokens.length && metrics.size() < metricCount && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
        metrics.add(tokens[j]);
        j += 1;
    }
Parse full model labels before consuming metric tokens
recoverBenchmarkRowsFromText treats a row label as exactly one token and then immediately consumes numeric tokens as metrics, so flattened tables with multi-token model names (for example DeepSeek V3 or Llama 3 70B) get misparsed: the label is truncated and numeric parts of the name are shifted into metric columns. Because the recovered table is then emitted as Markdown, this silently corrupts benchmark results instead of preserving the original text.
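One possible shape, reusing the names from the snippet above (tokens, i, metricCount, NUMERIC_TOKEN_RE); startsMetricRun is a hypothetical helper, not existing code:

```java
// Sketch only: grow the label until the metric run starts, instead of
// assuming the label is exactly one token.
StringBuilder label = new StringBuilder(tokens[i]);
int labelEnd = i + 1;
while (labelEnd < tokens.length && !startsMetricRun(tokens, labelEnd, metricCount)) {
    label.append(' ').append(tokens[labelEnd]);   // "DeepSeek V3", "Llama 3 70B", ...
    labelEnd += 1;
}

List<String> metrics = new ArrayList<>();
for (int t = labelEnd; t < tokens.length && metrics.size() < metricCount; t++) {
    if (!NUMERIC_TOKEN_RE.matcher(tokens[t]).matches()) {
        break;                                    // metric run ended early; caller can reject the row
    }
    metrics.add(tokens[t]);
}

// Hypothetical helper: the metric run starts at `start` only if the next
// metricCount tokens are all numeric.
private static boolean startsMetricRun(String[] tokens, int start, int metricCount) {
    if (start + metricCount > tokens.length) {
        return false;
    }
    for (int k = start; k < start + metricCount; k++) {
        if (!NUMERIC_TOKEN_RE.matcher(tokens[k]).matches()) {
            return false;
        }
    }
    return true;
}
```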
    if (consumed > 0) {
        i += consumed - 1;
        continue;
    }
Do not skip non-table objects when advancing consumed range
After a recovery succeeds, the page loop advances by consumed, but consumed is based on the absolute index of the last recovered text node while recovery scanning explicitly allows multiple intervening non-text nodes. In mixed-content pages, supported objects like images/formulas/lists that fall between recovered text paragraphs are skipped entirely and never written, causing content loss whenever flattened table text is interleaved with other objects.
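One hedged way to keep interleaved objects, assuming the recovery step can report exactly which indices it consumed; recovery, consumedIndices, lastConsumedIndex, objects, and writeObject are illustrative names rather than the project's real API:

```java
// Sketch only: write every object inside the recovered range that was NOT
// absorbed into the table, then move the loop index past the range.
Set<Integer> used = recovery.consumedIndices();      // indices of text nodes merged into the table
for (int k = i; k <= recovery.lastConsumedIndex(); k++) {
    if (!used.contains(k)) {
        writeObject(objects.get(k));                  // images/formulas/lists between rows still get emitted
    }
}
i = recovery.lastConsumedIndex();                     // the loop's own i++ then continues after the range
```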
bundolee left a comment
Code Review: PR #262 -- fix: recover flattened benchmark tables in markdown output
Verdict: REQUEST_CHANGES
Thanks for the contribution, @StevenVincentOne. The benchmark table recovery addresses a real need. Below is detailed feedback.
Code Quality and Best Practices
[Critical] SRP violation -- ~830 lines added to MarkdownGenerator. The recovery logic (15+ methods, 12 regex patterns, inner class) is a self-contained table parser embedded in a markdown writer. This should be extracted into a dedicated class (e.g., FlattenedTableRecovery) that returns structured results, keeping MarkdownGenerator focused on rendering; a rough shape is sketched after this list.
[Important] Hardcoded domain-specific tokens. BENCHMARK_HEADER_HINT_TOKEN_RE matches aime|gpqa|livecode|codeforces|diamond etc. tryRecoverWideComparisonTable skips "English", "Code", "Math", "Chinese". These are DeepSeek-specific and won't generalize. Consider externalizing or clearly documenting the intended scope.
[Important] Duplicated data-collection loop. The same dataText.append → recoverBenchmarkRowsFromText pattern repeats 4 times across methods. Extract into a shared helper.
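For the extraction point above, a rough, illustrative shape; apart from IObject and RecoveredTableRow (which this PR already defines), every name here is a placeholder, not a prescription:

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch: the recovery logic becomes its own class that hands
// structured rows back to MarkdownGenerator, which only renders them.
final class FlattenedTableRecovery {

    /** Rows rebuilt from flattened text plus how many input objects they consumed. */
    static final class Result {
        final List<RecoveredTableRow> rows;
        final int consumedObjects;

        Result(List<RecoveredTableRow> rows, int consumedObjects) {
            this.rows = rows;
            this.consumedObjects = consumedObjects;
        }
    }

    /** Attempts recovery starting at {@code start}; empty when the text does not look like a table. */
    Optional<Result> tryRecover(List<IObject> objects, int start) {
        // the regex patterns, pre-header heuristics, and row assembly would live here
        return Optional.empty();  // placeholder body in this sketch
    }
}
```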
Potential Bugs
[Critical] Partial write risk. tryWriteRecoveredFlattenedBenchmarkTable writes to markdownWriter before returning the consumed count. If a later check fails or throws, orphaned output remains. Buffer the output and write atomically after all validation passes; see the sketch after this list.
[Important] Dead code. metricPos is set to 0, then checked if (metricPos > MAX_TABLE_RECOVERY_LEAD_IN) — always false. Remove.
[Important] Overly broad isLikelyBenchmarkPreHeader. The final heuristic (tokens >= 4 && modelLikeCount >= 3) matches any line with 3+ tokens containing uppercase or digits. Ordinary prose like "The GPT-4 model achieved 95.2% on MMLU" would trigger false-positive table recovery.
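For the partial-write item above, a minimal sketch of the buffer-then-write shape, assuming markdownWriter accepts a full string; appendRecoveredTable and looksComplete are hypothetical helpers:

```java
// Sketch only: render into a local buffer and touch markdownWriter exactly
// once, after every check has passed, so a late failure leaves no orphaned output.
StringBuilder buffer = new StringBuilder();
appendRecoveredTable(buffer, rows);         // all markdown for the recovered table goes to the buffer

if (!looksComplete(rows)) {                 // last validation happens before any output
    return 0;                               // nothing written; caller falls back to plain paragraphs
}

markdownWriter.write(buffer.toString());    // single atomic write
return consumed;
```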
Performance
[Important] O(n²) re-parsing. Each loop iteration calls recoverBenchmarkRowsFromText on the full accumulated text. Consider incremental parsing or a single call after the loop.
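A hedged sketch of the single-parse shape; isCandidateText, textOf, and the exact recoverBenchmarkRowsFromText signature are assumptions:

```java
// Sketch only: accumulate candidate text while scanning, then parse once
// after the scan, instead of re-parsing the growing buffer every iteration.
StringBuilder dataText = new StringBuilder();
int last = i;
while (last < objects.size() && isCandidateText(objects.get(last))) {
    dataText.append(textOf(objects.get(last))).append('\n');
    last += 1;
}
List<RecoveredTableRow> rows = recoverBenchmarkRowsFromText(dataText.toString());  // one O(n) parse
```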
Security
No concerns. Regex patterns are anchored, input is extracted PDF text.
Test Coverage
[Critical] No integration tests. All 10 tests exercise isolated static helpers (buildMetricColumns, recoverBenchmarkRowsFromText). The orchestration logic — which variant to trigger, how paragraphs are consumed, what gets written — has zero coverage. Add at least one test per recovery variant using mock IObject lists; one possible shape is sketched after this list.
[Important] Tests use List<?> — can only verify row count, not content. Make RecoveredTableRow package-private so tests can assert on model names and metric values.
[Important] No tests for isLikelyProseLine or isLikelyBenchmarkPreHeader. These heuristics determine false-positive behavior and need dedicated coverage.
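As one concrete shape for the missing integration coverage, a JUnit 5 sketch; textParagraph, image, and renderPage are hypothetical fixture helpers assumed to be provided by the test harness, and the numbers are arbitrary fixture values:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.List;
import org.junit.jupiter.api.Test;

class FlattenedTableRecoveryIT {

    @Test
    void recoversFlattenedTableAndKeepsInterleavedObjects() {
        // Hypothetical fixtures: a flattened header plus rows, with a non-text object in between.
        List<IObject> page = List.of(
                textParagraph("Model AIME GPQA LiveCodeBench"),
                textParagraph("DeepSeek V3 39.2 59.1 36.2"),
                image("figure-1.png"),
                textParagraph("Llama 3 70B 23.3 50.7 25.3"));

        String markdown = renderPage(page);                 // hypothetical harness around MarkdownGenerator

        assertTrue(markdown.contains("| DeepSeek V3 |"));   // multi-token label kept intact
        assertTrue(markdown.contains("figure-1.png"));      // interleaved image not dropped
    }
}
```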
Process
- CLA: Currently pending — please sign before this PR can proceed.
- DCO: Commit is missing a Signed-off-by line. Please amend with `git commit --amend -s`.
Summary
| Priority | Item |
|---|---|
| Critical | Extract recovery logic into dedicated class (SRP) |
| Critical | Add integration tests for full recovery flow |
| Critical | Buffer output to prevent partial writes |
| Important | Remove dead code (metricPos > MAX_TABLE_RECOVERY_LEAD_IN) |
| Important | Tighten isLikelyBenchmarkPreHeader + add tests |
| Important | Assert on row content, not just count |
| Required | Sign CLA + add DCO sign-off |
The goal is valuable and the defensive fallback design is appreciated. With class extraction and integration tests, this will be solid. Thanks!
397d908 to 5bb7bd7 (Compare)
Thanks for the detailed review. I agree with the core concerns here. This PR grew too large and too document-shaped for a first upstream change.
I'm going to stop pushing this version as-is. I've rewritten the branch commits to use my GitHub-linked email, and I've opened a narrower follow-up PR instead.
That PR keeps For the remaining
And this existing issue is still the right place for the more ambitious recovery path:
Unless there is interest in reviving this branch with a much smaller scope, I plan to leave further work on this problem concentrated in #317 + #318 + #259.
Closing this PR in favor of the narrower follow-up work. This branch is too large and too DeepSeek-shaped to be a good upstream change as-is. I’m leaving the more appropriate follow-up paths in place:
Thanks for the review and guidance here.
Summary
Closes #259.