docs: add Chinese translation for README #332
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,203 @@ | ||||||||||||||
| <!-- AI-AGENT-SUMMARY | ||||||||||||||
| name: opendataloader-pdf | ||||||||||||||
| category: PDF 数据提取, PDF 无障碍自动化 | ||||||||||||||
| license: Apache-2.0 | ||||||||||||||
| solves: [面向 RAG/LLM 流水的 PDF 转结构化数据, 自动化 PDF 无障碍合规性 — 布局分析 + 自动标记为 Tagged PDF (首个开源端到端工具)] | ||||||||||||||
| input: PDF 文件 (原生、扫描件、含标签) | ||||||||||||||
| output: Markdown, JSON (含边界框), HTML, Tagged PDF, PDF/UA (企业级) | ||||||||||||||
| sdk: Python, Node.js, Java | ||||||||||||||
| requirements: Java 11+ | ||||||||||||||
| pricing: 开源核心 (数据提取、布局分析、自动标记为 Tagged PDF), 企业级插件 (PDF/UA 导出、无障碍工作室) | ||||||||||||||
| extraction-benchmark: 混合模式下提取准确度排名第一 (0.90), 表格提取准确度 0.93, 本地模式每页 0.05 秒 | ||||||||||||||
| accessibility-validation: 与 PDF 协会合作, Well-Tagged PDF 规范, veraPDF 自动验证 | ||||||||||||||
| key-differentiators: [基准测试排名第一的 PDF 解析器, 确定性输出, 每个元素均含边界框, XY-Cut++ 阅读顺序, AI 安全过滤, AI 混合模式, 首个开源端到端 Tagged PDF 自动标记工具, 与 PDF 协会及 Dual Lab (veraPDF 开发商) 合作, 符合 Well-Tagged PDF 规范] | ||||||||||||||
| --> | ||||||||||||||
|
|
||||||||||||||
| # OpenDataLoader PDF | ||||||||||||||
|
|
||||||||||||||
| **为 AI 就绪数据而生的 PDF 解析器。PDF 无障碍自动化。开源。** | ||||||||||||||
|
|
||||||||||||||
| [](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE) | ||||||||||||||
| [](https://pypi.org/project/opendataloader-pdf/) | ||||||||||||||
| [](https://www.npmjs.com/package/@opendataloader/pdf) | ||||||||||||||
| [](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core) | ||||||||||||||
| [](https://github.com/opendataloader-project/opendataloader-pdf#java) | ||||||||||||||
|
|
||||||||||||||
| <a href="https://trendshift.io/repositories/21917" target="_blank"><img src="https://trendshift.io/api/badge/repositories/21917" alt="opendataloader-project%2Fopendataloader-pdf | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> | ||||||||||||||
|
|
||||||||||||||
| 🔍 **面向 AI 数据提取的 PDF 解析器** —— 从任何 PDF 中提取 Markdown、JSON(含边界框)和 HTML。基准测试排名第一(综合 0.90)。提供确定性的本地模式 + 针对复杂页面的 AI 混合模式。 | ||||||||||||||
|
|
||||||||||||||
| - **准确度如何?** —— 基准测试排名第一:在 200 份包含多栏和科学论文的真实 PDF 中,综合准确度达 0.90,表格准确度达 0.93。本地确定性模式 + 复杂页面的 AI 混合模式 ([查看基准测试](#提取基准测试))。 | ||||||||||||||
| - **支持扫描件和 OCR 吗?** —— 支持。在混合模式下内置 OCR(支持 80 多种语言)。适用于 300 DPI 以上的低质量扫描件 ([混合模式详情](#混合模式复杂-pdf-的准确度冠军))。 | ||||||||||||||
| - **支持表格、公式、图像和图表吗?** —— 支持。通过混合模式处理复杂/无边框表格、LaTeX 公式以及 AI 生成的图片/图表描述 ([混合模式详情](#混合模式复杂-pdf-的准确度冠军))。 | ||||||||||||||
| - **如何用于 RAG?** —— `pip install opendataloader-pdf`,3 行代码即可转换。输出用于分块(chunking)的结构化 Markdown,用于源码引用的含边界框 JSON,以及 HTML。支持 LangChain 集成。提供 Python、Node.js、Java SDK ([快速开始](#30-秒快速上手) | [LangChain 集成](#langchain-集成))。 | ||||||||||||||
|
Contributor
Fix broken in-page anchor links in the README

The fragment link targets on line 33 and line 38 do not exist in the current document, so those jumps will fail.

Suggested diff:

-- **如何用于 RAG?** —— `pip install opendataloader-pdf`,3 行代码即可转换。输出用于分块(chunking)的结构化 Markdown,用于源码引用的含边界框 JSON,以及 HTML。支持 LangChain 集成。提供 Python、Node.js、Java SDK ([快速开始](`#30-秒快速上手`) | [LangChain 集成](`#langchain-集成`))。
+- **如何用于 RAG?** —— `pip install opendataloader-pdf`,3 行代码即可转换。输出用于分块(chunking)的结构化 Markdown,用于源码引用的含边界框 JSON,以及 HTML。支持 LangChain 集成。提供 Python、Node.js、Java SDK ([快速开始](`#30-秒快速上手`) | [LangChain 集成](https://opendataloader.org/docs/langchain-opendataloader-pdf))。
-- **哪些功能免费?** —— 布局分析 + 自动标记(2026 Q2,Apache 2.0 协议)。将未标记的 PDF 转为 Tagged PDF,无需依赖专有 SDK ([自动标记预览](`#自动标记预览-2026-q2-发布`))。
+- **哪些功能免费?** —— 布局分析 + 自动标记(2026 Q2,Apache 2.0 协议)。将未标记的 PDF 转为 Tagged PDF,无需依赖专有 SDK ([工作流](`#无障碍流水线`))。

Also applies to: 38-38

Tools: markdownlint-cli2 (0.21.0) [warning] 33-33: Link fragments should be valid (MD051, link-fragments)
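The broken-fragment problem flagged above can be caught before review with a small script. The following is a minimal sketch, not the markdownlint implementation: `slugify` and `broken_fragments` are hypothetical helper names, and the slug rule (lowercase, drop punctuation, turn spaces into hyphens) only approximates how GitHub derives heading anchors. It does not handle duplicate headings or headings inside fenced code blocks.

```python
import re

# ATX headings such as "## 30 秒快速上手" (heading text is captured).
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)
# In-page links such as "[text](#some-fragment)" (fragment is captured).
FRAGMENT_RE = re.compile(r"\]\(#([^)\s]+)\)")


def slugify(heading: str) -> str:
    """Approximate GitHub's anchor rule (an assumption, not the exact algorithm)."""
    text = heading.strip().lower()
    # Keep word characters (including CJK), hyphens, and spaces; drop the rest.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


def broken_fragments(markdown: str) -> list[str]:
    """Return every in-page fragment that matches no heading slug."""
    slugs = {slugify(h) for h in HEADING_RE.findall(markdown)}
    return [frag for frag in FRAGMENT_RE.findall(markdown) if frag not in slugs]


if __name__ == "__main__":
    sample = (
        "# Title\n\n"
        "[ok](#30-秒快速上手) [bad](#langchain-集成)\n\n"
        "## 30 秒快速上手\n"
    )
    for fragment in broken_fragments(sample):
        print(f"broken link fragment: #{fragment}")
```

Running a check like this over the translated README would list the two fragments the reviewers called out, provided the corresponding headings are indeed absent from the file.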
| ## LangChain 集成 | |
| 使用本项目生成的结构化 Markdown 或 JSON,可以在 LangChain 中作为文档源进行分块、索引和检索。你可以在 LangChain 的 `DocumentLoader` 或自定义链中直接读取这些输出,将其接入你的 RAG / LLM 流水线。 |
Copilot
AI
Mar 22, 2026
This references an in-page anchor #自动标记预览-2026-q2-发布, but there is no matching section heading later in the document, so the link is broken. Add the missing "自动标记预览" section or update the link to an existing heading.
| - **哪些功能免费?** —— 布局分析 + 自动标记(2026 Q2,Apache 2.0 协议)。将未标记的 PDF 转为 Tagged PDF,无需依赖专有 SDK ([自动标记预览](#自动标记预览-2026-q2-发布))。 | |
| - **哪些功能免费?** —— 布局分析 + 自动标记(2026 Q2,Apache 2.0 协议)。将未标记的 PDF 转为 Tagged PDF,无需依赖专有 SDK ([自动标记预览](#无障碍流水线))。 |
Copilot
AI
Mar 22, 2026
PR description says the Chinese translation includes quick-start guides for Python, Node.js, and Java, but this README only includes a Python example and links out for Node.js/Java. If the intent is parity with README.md, add the Node.js and Java quick-start snippets (or update the PR description to match what’s actually included).
Copilot
AI
Mar 22, 2026
The commands in the "我该使用哪种模式?" table look inconsistent with actual CLI usage: opendataloader-pdf --hybrid requires a backend value (e.g., docling-fast), and some rows list only the server command (opendataloader-pdf-hybrid ...) without the client command that actually processes a PDF. Consider aligning this table with the English README by separating server/client commands (or otherwise showing both steps) so the copy/paste commands work.
| | 复杂或嵌套表格 | **混合 (Hybrid)** | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf --hybrid doc.pdf` | | |
| | 扫描件 / 图像 PDF | **混合 + OCR** | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --force-ocr` | | |
| | 包含数学公式 | **混合 + 公式** | `pip install "opendataloader-pdf[hybrid]"` | `opendataloader-pdf-hybrid --enrich-formula` | | |
| | 复杂或嵌套表格 | **混合 (Hybrid)** | `pip install "opendataloader-pdf[hybrid]"` | `终端 1: opendataloader-pdf-hybrid --backend docling-fast<br>终端 2: opendataloader-pdf --hybrid docling-fast doc.pdf` | | |
| | 扫描件 / 图像 PDF | **混合 + OCR** | `pip install "opendataloader-pdf[hybrid]"` | `终端 1: opendataloader-pdf-hybrid --backend docling-fast --force-ocr<br>终端 2: opendataloader-pdf --hybrid docling-fast doc.pdf` | | |
| | 包含数学公式 | **混合 + 公式** | `pip install "opendataloader-pdf[hybrid]"` | `终端 1: opendataloader-pdf-hybrid --backend docling-fast --enrich-formula<br>终端 2: opendataloader-pdf --hybrid docling-fast doc.pdf` | |
In the AI-AGENT-SUMMARY, "RAG/LLM 流水" is inconsistent with later wording (e.g., "RAG 流水线") and reads like a missing character. Consider changing this to "RAG/LLM 流水线" for consistency and correctness.