Problem
When processing CSDN (blog.csdn.net) pages, the Dripper VLM model occasionally assigns _item_id annotations to noise DOM nodes (sidebar recommendations, spam links), classifying them as "main" content. This causes main_html to include large amounts of non-article content.
We tested on a sample of 240 CSDN pages from production. 3 pages (1.2%) have real content defects where noise leaks into the extracted main_html.
Reproduction Cases
Case 1: Sidebar recommendation leak (2/240 pages)
URL: https://blog.csdn.net/goldfishsky/article/details/149716564/
Steps:
- Run MinerUHTML.process(html) on this page's HTML
- Inspect _item_id annotations in the output main_html
- Observe: _item_id 3, 12–37 correctly annotate the article body (<div class="content_views">)
- Bug: _item_id 38–81 annotate the CSDN recommendation sidebar (<div id="recommend">)
Impact: The article body is 2,869 chars (4.7% of main_html). The sidebar is 58,106 chars (95.3%). The extracted content is 95% noise.
HTML at the boundary:
<!-- Last article node (correct) -->
<p _item_id="37">如果输出 <code>2147483647</code>,则表示无限强度策略已生效。</p>
</div></div></div>
<!-- Sidebar starts here (should NOT have _item_id) -->
<div id="recommend">
<div class="recommend_list">
<dl class="container-fluid" data-url="https://download.csdn.net/...">
<!-- _item_id="38" assigned to first recommend item -->
A second page exhibits the same pattern: https://blog.csdn.net/Angellyouran/article/details/147597019/
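To check other pages for the same pattern, a small script along these lines may help. It is only a sketch using the Python standard library, not part of MinerU-HTML: the class name LeakFinder and the helper find_leaked_item_ids are hypothetical, and the depth tracking assumes reasonably well-formed HTML (unclosed <p> or <li> tags would throw it off).

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LeakFinder(HTMLParser):
    """Collect _item_id values found inside the element with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0        # nesting depth inside the container; 0 = outside
        self.leaked_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0 or attrs.get("id") == self.container_id:
            if "_item_id" in attrs:
                self.leaked_ids.append(attrs["_item_id"])
            if tag not in VOID_TAGS:
                self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def find_leaked_item_ids(main_html, container_id="recommend"):
    """Return the _item_id values annotated inside the noise container."""
    finder = LeakFinder(container_id)
    finder.feed(main_html)
    finder.close()
    return finder.leaked_ids


# Hypothetical miniature of the boundary shown above, not the real page.
sample = (
    '<div class="content_views"><p _item_id="37">body</p></div>'
    '<div id="recommend"><dl _item_id="38"><dt _item_id="39">rec</dt></dl></div>'
    '<p _item_id="40">after</p>'
)
leaked = find_leaked_item_ids(sample)
```

A non-empty result for container_id="recommend" flags a page where the sidebar was kept in main_html.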
Case 2: GitCode spam link leak (1/240 pages)
URL: https://blog.csdn.net/gitblog_00820/article/details/152394486
Steps:
- Run MinerUHTML.process(html) on this page's HTML
- Inspect the spam <a> tag with _item_id=13
Observation: The model assigned _item_id=13 to a CSDN-injected GitCode spam link:
<a _item_id="13" class="has-card" href="https://gitcode.com/GitHub_Trending/ra/raylib/...">
<span class="link-title">【免费下载链接】raylib</span>
</a>
This spam text (【免费下载链接】) appears in the final extracted content.
Context: 123/240 CSDN pages in our sample contain GitCode spam links. The model correctly excludes them in 120/123 cases (97.6% accuracy). This is one of 3 leak cases.
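A check like the following could be used to flag such links in a batch of outputs. This is a sketch under our own assumptions: the helper names are hypothetical, the host list contains only gitcode.com, and the sample markup is a shortened stand-in for the real link rather than the actual page content.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class SpamLinkFinder(HTMLParser):
    """Collect annotated <a> tags whose href points at a known spam host."""

    def __init__(self, spam_hosts):
        super().__init__()
        self.spam_hosts = set(spam_hosts)
        self.hits = []  # list of (_item_id, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        host = urlparse(href).hostname or ""
        # Only links that still carry an _item_id count as leaked.
        if "_item_id" in attrs and host in self.spam_hosts:
            self.hits.append((attrs["_item_id"], href))


def find_spam_links(main_html, spam_hosts=("gitcode.com",)):
    """Return (_item_id, href) for every annotated link to a spam host."""
    finder = SpamLinkFinder(spam_hosts)
    finder.feed(main_html)
    finder.close()
    return finder.hits


# Hypothetical sample; the repository path is a placeholder.
sample = (
    '<p _item_id="12">intro</p>'
    '<a _item_id="13" class="has-card" href="https://gitcode.com/example/repo">'
    '<span class="link-title">link</span></a>'
    '<a href="https://gitcode.com/example/other">not annotated</a>'
)
hits = find_spam_links(sample)
```

Matching on the link host rather than on the 【免费下载链接】 text keeps the check independent of the exact wording CSDN injects.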
Case 3: Cross-domain sidebar noise
The sidebar issue is not CSDN-specific. In our 2,000-page sample across multiple domains:
| Domain | Pages with <aside> / sidebar noise in main_html | Total pages |
| --- | --- | --- |
| gitmemories.com | 5 | 67 |
| go.dev | 5 | 6 |
| csdn.net | 2 | 240 |
| arcgis.com | 1 | 2 |
| Several others | 1 each | — |
Root Cause Analysis
simplify_html.py strips <nav>, <script>, <style>, etc. before LLM inference, but does not strip:
- <aside> (semantic HTML5 sidebar tag)
- Elements with sidebar-related IDs (id="recommend", id="sidebar")
- Elements with ARIA roles (role="complementary", role="navigation")
These survive preprocessing and get _item_id annotations. When the LLM misclassifies them as "main", the noise enters main_html.
The current ATTR_PATTERNS_TO_REMOVE in simplify_html.py only contains 'nav':
ATTR_PATTERNS_TO_REMOVE = {
    'nav',  # 'footer', 'header',  # standalone words
}
Suggestion
Would you consider extending simplify_html.py to also strip <aside> elements (similar to the existing <nav> handling in tags_to_remove)? The <aside> tag has clear semantic meaning in HTML5 — it represents content "tangentially related" to the main content, which aligns with what should be classified as "other".
This would be a conservative extension that follows the same pattern as the existing <nav> removal and reduces the chance of LLM misclassification on sidebar content.
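To make the idea concrete, here is a minimal standalone sketch of the kind of stripping we have in mind. It uses only the Python standard library and is independent of how simplify_html.py is actually implemented. The names NOISE_TAGS, NOISE_IDS, NOISE_ROLES, NoiseStripper and strip_noise are hypothetical. Only the <aside> rule corresponds to the suggestion above; the id and role rules are included for illustration and are riskier, since a page could use those values for real content.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
NOISE_TAGS = {"aside"}                          # the suggested extension
NOISE_IDS = {"recommend", "sidebar"}            # illustrative only
NOISE_ROLES = {"complementary", "navigation"}   # illustrative only


class NoiseStripper(HTMLParser):
    """Re-emit the HTML while dropping noise elements and their subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def _is_noise(self, tag, attrs):
        a = dict(attrs)
        return (tag in NOISE_TAGS or a.get("id") in NOISE_IDS
                or a.get("role") in NOISE_ROLES)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or self._is_noise(tag, attrs):
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a subtree, so the depth is unchanged.
        if not self.skip_depth and not self._is_noise(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.skip_depth:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_noise(html):
    """Return html with noise elements removed (assumes balanced tags)."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical miniature input, not taken from a real page.
sample = (
    '<div class="content_views"><p>body &amp; more</p></div>'
    '<div id="recommend"><dl><img src="x.png">rec</dl></div>'
    '<aside><p>side</p></aside><p>tail</p>'
)
cleaned = strip_noise(sample)
```

Running this before _item_id assignment would mean the model never sees the sidebar nodes, so it cannot misclassify them. The depth counter assumes balanced tags, which a real implementation on a DOM tree would not need.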
Environment
- Model: opendatalab/MinerU-HTML (v1.0, Qwen3-based)
- Code: commit 73cf2666
- We are planning to test v1.1 (MinerU-HTML-v1.1-hunyuan0.5B-compact) on these cases to see if the improved model accuracy resolves the issue.
Sample Data
We can share the full 240-page CSDN sample (raw HTML + Dripper output + extracted markdown) if helpful for reproducing and testing fixes.