feat(web):Add web page to Markdown conversion tool by wondywang · Pull Request #1048 · jackwener/OpenCLI

wondywang · 2026-04-15T11:51:09Z

This file implements a CLI tool to convert web pages to Markdown format, utilizing the Readability library for content extraction and Turndown for HTML-to-Markdown conversion. It includes options for outputting to a file or stdout and handles various media URLs.

Description

This PR introduces a new, dedicated CLI plugin for converting web pages into high-fidelity Markdown. It is designed to address the limitations of the existing web read command, which currently lacks robust support for rich content like embedded media and complex tables.

This plugin ensures a more complete conversion by strictly preserving the original URLs of images and videos and accurately rendering HTML tables into Markdown format, making it suitable for archiving or processing content-heavy web pages.

Motivation

The primary motivation for this feature is to overcome the shortcomings of the current web read functionality. While web read is useful for extracting plain text, it often fails to capture the full context of a webpage, specifically:

Multimedia Content: Images and videos are frequently omitted or their links are lost.
Tabular Data: HTML tables are not correctly converted, leading to a loss of structured information.

This plugin fills that gap by providing a specialized tool for users who need to preserve the integrity of rich media and structured data when converting HTML to Markdown.

Web MD — Convert any web page to Markdown with enhanced quality.

Uses @mozilla/readability for content extraction and Turndown + GFM
for HTML-to-Markdown conversion. Preserves image/video URLs.

Usage:

# 1. Convert a webpage and save to a specific Markdown file
opencli web md --url "https://example.com/article" --output ./docs/article.md

# 2. Convert a webpage and print the result directly to the terminal (stdout)
opencli web md --url "https://example.com/article" --stdout

# 3. Basic conversion (default behavior may depend on core config)
opencli web md --url "https://example.com/article"

Related issue: None

Type of Change


jackwener · 2026-04-22T10:16:18Z

Thanks for submitting this PR, @wondywang! 🙏

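After review, we decided not to introduce a standalone web md command. Instead, the parts of this PR that add real value are merged directly into the existing web read / shared Markdown pipeline, landing in #1146: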

Turndown rules for <video> / <audio> / <iframe> (media URLs are preserved; iframes degrade to Markdown links)
The --stdout option (now web read --stdout)
Extended lazy-load attributes (data-src / data-original / data-lazy-src / data-srcset), plus a fix for a latent placeholder.gif bug along the way
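The lazy-load handling listed above can be illustrated with a minimal sketch. It is not the code from #1146: the helper names `promoteLazySrc` and `firstUrl` and the plain attribute-map input are assumptions made so the example stays self-contained. It shows the general idea of copying a lazy-load attribute onto `src` so a placeholder image does not end up in the Markdown output.

```typescript
// Hypothetical helper names; only the attribute list comes from the PR discussion.
const LAZY_ATTRS = ["data-src", "data-original", "data-lazy-src", "data-srcset"] as const;

type ImgAttrs = Record<string, string | undefined>;

// For srcset-style values ("a.jpg 1x, b.jpg 2x"), take the first candidate URL.
function firstUrl(value: string): string {
  const [first = ""] = value.trim().split(/[\s,]+/);
  return first;
}

// Return a copy of the attributes with the first non-empty lazy-load value
// promoted onto `src`; attributes without a lazy-load value are returned unchanged.
function promoteLazySrc(attrs: ImgAttrs): ImgAttrs {
  for (const name of LAZY_ATTRS) {
    const raw = attrs[name];
    if (raw && raw.trim() !== "") {
      const url = name === "data-srcset" ? firstUrl(raw) : raw.trim();
      if (url) return { ...attrs, src: url };
    }
  }
  return attrs;
}
```

In this sketch an image with `src="placeholder.gif"` and a real URL in `data-src` comes out with the real URL in `src`, which is the situation the placeholder.gif fix addresses. The priority order of the attributes is an assumption.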

Parts not kept (already in main or redundant):

Readability extraction: feat(download): harden HTML→Markdown pipeline #1143 already added src/browser/article-extract.ts
GFM / canonical strikethrough / base64 image drop / page chrome stripping: already done in feat(download): harden HTML→Markdown pipeline #1143
Custom mediaUrls collection: in this PR the collected URLs were never actually used in the final Markdown (dead code)
The new web md command itself: it overlaps with web read

So this PR is redundant in its current form, but your approach was exactly right, and the functionality has been merged via #1146. Closing as superseded.

* feat(web,download): absorb #1048 media + --stdout into web read

  Distill the useful pieces of the abandoned PR #1048 (`web md`) into the existing shared pipeline instead of introducing a parallel command:

  - Turndown rules for <video> / <audio> / <iframe>. Video and audio are emitted as inline HTML so renderers that support it keep playback, and iframes degrade to markdown links (title + src) so embedded content (YouTube, CodePen, …) stays reachable. `iframe` moves out of STRIPPED_TAGS since it's now handled explicitly.
  - `stdout` option on ArticleDownloadOptions: writes the full markdown to process.stdout, skips image download + mkdir + file write, and reports saved='-'. Remote image URLs stay intact so piped output is self-contained.
  - `web read --stdout` wires the above through.
  - Lazy-load src rewrite: the extractor now promotes data-src / data-original / data-lazy-src / data-srcset onto `src` before the HTML is frozen, so the markdown body and the image-download list reference the same URL (previously a page with placeholder.gif + data-src produced broken image links in the output).

  Nothing in #1048 that overlapped with the already-merged #1143 hardening was kept: no new Readability wiring, no duplicate Turndown config, no new command.

* fix(web): keep stdout streaming output clean

* fix(tests): update iframe e2e assertion and drop relative src import

  - article-extract e2e fixture test: iframe now converts to a markdown link instead of being stripped, so assert the YouTube embed link survives rather than asserting its absence.
  - clis/web/read.test.js: replace vi.importActual('../../src/registry.js') with a direct __test__.command export from read.js; the relative import into src/ tripped the package-exports adapter guardrail.
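As a rough illustration of the media rules described in the commit message, the sketch below defines two rule objects in the `{ filter, replacement }` shape that Turndown's `addRule` accepts. It is an assumption-laden sketch, not the rules merged in #1146: the `ElementLike` and `MediaRule` types, the `fakeNode` helper, the `controls` attribute, and the exact output spacing are all hypothetical, and `<source>` children of media elements are ignored for brevity.

```typescript
// Minimal stand-in for the DOM node a Turndown rule receives (assumption,
// used here so the example runs without a DOM or the turndown package).
interface ElementLike {
  nodeName: string;
  getAttribute(name: string): string | null;
}

interface MediaRule {
  filter: string[];
  replacement: (content: string, node: ElementLike) => string;
}

// Video and audio: emit inline HTML so renderers that support it keep playback.
const mediaRule: MediaRule = {
  filter: ["video", "audio"],
  replacement: (_content, node) => {
    const src = node.getAttribute("src");
    if (!src) return "";
    const tag = node.nodeName.toLowerCase();
    return `\n\n<${tag} src="${src}" controls></${tag}>\n\n`;
  },
};

// Iframe: degrade to a markdown link (title + src) so the embed stays reachable.
const iframeRule: MediaRule = {
  filter: ["iframe"],
  replacement: (_content, node) => {
    const src = node.getAttribute("src");
    if (!src) return "";
    const title = node.getAttribute("title")?.trim() || src;
    return `\n\n[${title}](${src})\n\n`;
  },
};

// Hypothetical test helper that fakes an element from a plain attribute map.
function fakeNode(nodeName: string, attrs: Record<string, string>): ElementLike {
  return { nodeName, getAttribute: (name) => attrs[name] ?? null };
}
```

With the real library, objects of this shape would presumably be registered through something like `turndownService.addRule("iframe", iframeRule)`. Falling back to the `src` as link text when no title is present is a design choice of this sketch, not something the commit message specifies.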

wondywang changed the title ~~Add web page to Markdown conversion tool~~ feat(web):Add web page to Markdown conversion tool Apr 15, 2026

jackwener mentioned this pull request Apr 22, 2026

feat(web,download): absorb #1048 — video/audio/iframe + --stdout #1146

Merged

5 tasks

jackwener closed this Apr 22, 2026

wondywang wants to merge 1 commit into jackwener:main from wondywang:html-to-markdown
