
docs: add Chinese translation for README #332

Closed
JasonYeYuhe wants to merge 1 commit into opendataloader-project:main from JasonYeYuhe:i18n-zh-readme

Conversation

@JasonYeYuhe

@JasonYeYuhe JasonYeYuhe commented Mar 22, 2026

docs: add Chinese translation for README

Description

This PR introduces a high-quality Chinese translation of the README (README.zh.md) and adds a language switcher to the main README.md.

Motivation

OpenDataLoader PDF is a top-tier tool for AI-ready data extraction and PDF accessibility. Given the booming interest in RAG (Retrieval-Augmented Generation) and document intelligence within the Chinese developer community, providing localized documentation will significantly lower the barrier for users to leverage this project's capabilities—especially its #1 benchmark-ranked extraction accuracy and its hybrid mode support for Chinese OCR (ch_sim, ch_tra).

This translation covers:

  • Core features (AI-ready data, accessibility automation).
  • Extraction benchmarks and capability matrix.
  • Quick-start guides for Python, Node.js, and Java.
  • Detailed Hybrid Mode explanation including OCR and formula extraction.
  • PDF accessibility and PDF/UA compliance workflow.

Changes

  • Created README.zh.md with localized technical terminology (e.g., RAG, Bounding Boxes, Auto-tagging).
  • Added English/Chinese language switcher at the top of the main README.

Translated and contributed by @JasonYeYuhe

Summary by CodeRabbit

Documentation

  • Added language navigation links between English and Chinese versions, making the project accessible to multilingual audiences
  • Introduced comprehensive Chinese translation with project overview, feature highlights, installation and quickstart guides, complete functionality matrix, performance benchmarks, hybrid mode configuration details, supported output formats, accessibility workflows, development roadmap, troubleshooting FAQ, and community contribution guidelines

Copilot AI review requested due to automatic review settings March 22, 2026 06:38
@CLAassistant

CLAassistant commented Mar 22, 2026

CLA assistant check
All committers have signed the CLA.

This comment was marked as off-topic.

@bundolee
Contributor

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented Mar 23, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 23, 2026

📝 Walkthrough

Documentation updates adding language localization support to the OpenDataLoader PDF project. A language navigation link is added to the English README, and a comprehensive new Chinese README is introduced with project overview, feature descriptions, quickstart instructions, usage workflows, output formats, accessibility pipeline details, and contribution guidelines.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation & Localization (README.md, README.zh.md) | Added language navigation to English README linking to Chinese translation; created comprehensive Chinese README with project positioning, features, quickstart guide, functionality matrix, benchmark results, mode selection guidance, output formats, accessibility pipeline, roadmap, and FAQ. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

📚✨ A rabbit hops through pages new,
Where English meets the Chinese brew,
README blooms in two-fold tongue,
Both docs sing out, their bells are rung!
Localization makes the project shine,
For readers far across each line. 🐰

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately describes the main change: adding a Chinese translation for the README. It directly matches the primary objective of the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
README.zh.md (1)

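16-19: Consider adding a reverse language switcher at the top of the Chinese README

The English README already has a language switcher; adding a matching line here, [English](README.md) | [中文版](README.zh.md), would be more consistent and improve discoverability.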

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.zh.md` around lines 16 - 19, Add a language-switch line at the top of
README.zh.md to match the English README; insert a single-line navigation like
"[English](README.md) | [中文版](README.zh.md)" immediately above or below the main
title/header so readers can quickly switch languages and improve
discoverability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: deea6042-4946-4dc8-9c9e-150c1416c42f

📥 Commits

Reviewing files that changed from the base of the PR and between 8e3f74a and d3e5bbe.

📒 Files selected for processing (2)
  • README.md
  • README.zh.md

Comment thread README.zh.md
- **准确度如何?** —— 基准测试排名第一:在 200 份包含多栏和科学论文的真实 PDF 中,综合准确度达 0.90,表格准确度达 0.93。本地确定性模式 + 复杂页面的 AI 混合模式 ([查看基准测试](#提取基准测试))。
- **支持扫描件和 OCR 吗?** —— 支持。在混合模式下内置 OCR(支持 80 多种语言)。适用于 300 DPI 以上的低质量扫描件 ([混合模式详情](#混合模式复杂-pdf-的准确度冠军))。
- **支持表格、公式、图像和图表吗?** —— 支持。通过混合模式处理复杂/无边框表格、LaTeX 公式以及 AI 生成的图片/图表描述 ([混合模式详情](#混合模式复杂-pdf-的准确度冠军))。
- **如何用于 RAG?** —— `pip install opendataloader-pdf`,3 行代码即可转换。输出用于分块(chunking)的结构化 Markdown,用于源码引用的含边界框 JSON,以及 HTML。支持 LangChain 集成。提供 Python、Node.js、Java SDK ([快速开始](#30-秒快速上手) | [LangChain 集成](#langchain-集成))。

⚠️ Potential issue | 🟡 Minor

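Fix broken internal anchor links in the README

The fragment link targets on Line 33 and Line 38 do not exist in the current document, so the jumps fail (`#langchain-集成` and `#自动标记预览-2026-q2-发布`).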

🔧 Suggested diff
-- **如何用于 RAG?** —— `pip install opendataloader-pdf`,3 行代码即可转换。输出用于分块(chunking)的结构化 Markdown,用于源码引用的含边界框 JSON,以及 HTML。支持 LangChain 集成。提供 Python、Node.js、Java SDK ([快速开始](#30-秒快速上手) | [LangChain 集成](#langchain-集成))。
+- **如何用于 RAG?** —— `pip install opendataloader-pdf`,3 行代码即可转换。输出用于分块(chunking)的结构化 Markdown,用于源码引用的含边界框 JSON,以及 HTML。支持 LangChain 集成。提供 Python、Node.js、Java SDK ([快速开始](#30-秒快速上手) | [LangChain 集成](https://opendataloader.org/docs/langchain-opendataloader-pdf))。

-- **哪些功能免费?** —— 布局分析 + 自动标记(2026 Q2,Apache 2.0 协议)。将未标记的 PDF 转为 Tagged PDF,无需依赖专有 SDK ([自动标记预览](#自动标记预览-2026-q2-发布))。
+- **哪些功能免费?** —— 布局分析 + 自动标记(2026 Q2,Apache 2.0 协议)。将未标记的 PDF 转为 Tagged PDF,无需依赖专有 SDK ([工作流](#无障碍流水线))。

Also applies to: 38-38

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 33-33: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.zh.md` at line 33, The README contains broken internal anchors
(`#langchain-集成` and `#自动标记预览-2026-q2-发布`) referenced in the sentence about RAG on
Line 33/38; locate the hyperlink targets (the section headings or TOC entries)
that should match these anchors and either rename those headings to exactly
"LangChain 集成" and "自动标记预览 2026 Q2 发布" (or the exact anchor strings) or update
the links in the Line 33/38 text to match the existing heading anchors; ensure
the anchor casing/spacing/encoding matches the generated markdown anchor for the
headings referenced by the symbols '#langchain-集成' and '#自动标记预览-2026-q2-发布'.

@bundolee
Contributor

Thanks for adding Chinese localization! Great community contribution. A few items to address before we can merge:

Must fix:

  1. AI-AGENT-SUMMARY block — This metadata block (HTML comment at the top) is machine-readable and must remain in English. Please revert it to an exact copy of the English original.
  2. Language switcher placement — Make sure [English](README.md) | [中文版](README.zh.md) is placed after the # OpenDataLoader PDF heading, not before the AI-AGENT-SUMMARY comment block.
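
The two must-fix items above could be checked mechanically before pushing. The following is a minimal sketch, assuming the AI-AGENT-SUMMARY block is the first HTML comment in each file and the switcher sits on a line of its own; the function names and the `SWITCHER` constant are hypothetical and not part of the project.

```python
import re
from typing import List, Optional

# Switcher line quoted in the review; assumed to occupy a line of its own.
SWITCHER = "[English](README.md) | [中文版](README.zh.md)"


def leading_comment(markdown: str) -> Optional[str]:
    """Return the HTML comment that opens the file, or None if the file does not start with one."""
    match = re.match(r"\s*<!--.*?-->", markdown, flags=re.DOTALL)
    return match.group(0).strip() if match else None


def check_translation_header(english: str, translated: str) -> List[str]:
    """List the must-fix problems found in a translated README (empty list means none)."""
    problems = []
    # Item 1: the metadata comment must be an exact copy of the English one.
    if leading_comment(english) != leading_comment(translated):
        problems.append("metadata comment differs from the English original")
    # Item 2: the switcher must come after the top-level heading.
    lines = translated.splitlines()
    h1 = next((i for i, line in enumerate(lines) if line.startswith("# ")), None)
    switcher = next((i for i, line in enumerate(lines) if line.strip() == SWITCHER), None)
    if h1 is None or switcher is None or switcher < h1:
        problems.append("language switcher must follow the top-level heading")
    return problems
```

Calling `check_translation_header(open("README.md", encoding="utf-8").read(), open("README.zh.md", encoding="utf-8").read())` would return an empty list when both conditions hold. A line inside the comment block that starts with `# ` would confuse the heading search, so treat the result as a hint rather than a verdict.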

Should fix:

  3. Incomplete translation — The Chinese README is ~203 lines vs ~585 lines in English (~65% content missing). Please either complete all sections or add a note at the top: > 本文为部分翻译。完整文档请参阅 [English README](README.md)。 ("This is a partial translation. For the full documentation, see the English README.")
  4. Internal anchor links — Chinese headings produce different GitHub anchors (e.g., #提取基准测试 not #extraction-benchmarks). Please verify all internal links work.
  5. File naming — Consider renaming to README.zh-CN.md to distinguish Simplified Chinese from Traditional Chinese, since the project already supports both ch_sim and ch_tra OCR languages.
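
For item 4, a small script can list fragment links that have no matching heading. This is a rough sketch that approximates GitHub's anchor rule (lowercase the heading, drop punctuation, turn spaces into hyphens); it is not GitHub's actual implementation, and the helper names are hypothetical. It ignores the numeric suffixes GitHub appends to duplicate headings and would miscount `#` lines inside fenced code blocks.

```python
import re
from typing import List


def github_anchor(heading: str) -> str:
    """Approximate the anchor GitHub generates for a heading (assumption, see lead-in)."""
    slug = heading.strip().lower()
    # Keep word characters (including CJK), hyphens and spaces; drop everything else.
    slug = re.sub(r"[^\w\- ]", "", slug)
    return slug.replace(" ", "-")


def broken_fragments(markdown: str) -> List[str]:
    """Return the in-page link fragments that match no heading anchor."""
    headings = re.findall(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    anchors = {github_anchor(h) for h in headings}
    fragments = re.findall(r"\]\(#([^)\s]+)\)", markdown)
    return [f for f in fragments if f not in anchors]


if __name__ == "__main__":
    with open("README.zh.md", encoding="utf-8") as fh:
        for fragment in broken_fragments(fh.read()):
            print(f"broken link: #{fragment}")
```

Under this approximation, a heading such as `混合模式:复杂 PDF 的准确度冠军` (a hypothetical heading text) maps to `混合模式复杂-pdf-的准确度冠军`, which is why a full-width colon or a parenthesis in a translated heading changes the anchor that links must use.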

Minor:

  • Technical terms like "Tagged PDF", "PDF/UA", "XY-Cut++", "RAG", "LangChain" should remain in English (they are proper nouns / standard names).
  • Please add the same language switcher line in README.zh.md so users can navigate back to English.

Happy to merge once the must-fix items are resolved!

@bundolee
Contributor

bundolee commented Mar 24, 2026

Thanks for the translation — the quality is solid.

We've decided not to maintain translated README files in the repository. The README changes frequently, and keeping translations in sync is a maintenance burden we can't commit to right now.

Appreciate the effort, and sorry we can't merge this one!

@bundolee bundolee closed this Mar 24, 2026
@JasonYeYuhe
Author

Thank you for your reply. Now I understand the situation. Thank you very much.


4 participants