
Dripper VLM mislabels sidebar and spam nodes as main content on CSDN pages #37

@Jack47


Problem

When processing CSDN (blog.csdn.net) pages, the Dripper model occasionally classifies noise DOM nodes (sidebar recommendations, spam links) as "main" content, so their _item_id annotations appear in the output. As a result, main_html includes large amounts of non-article content.

We tested on a sample of 240 CSDN pages from production. Three pages (1.25%) have real content defects where noise leaks into the extracted main_html.

Reproduction Cases

Case 1: Sidebar recommendation leak (2/240 pages)

URL: https://blog.csdn.net/goldfishsky/article/details/149716564/

Steps:

  1. Run MinerUHTML.process(html) on this page's HTML
  2. Inspect _item_id annotations in the output main_html
  3. Observe: _item_id 3, 12–37 correctly annotate the article body (<div class="content_views">)
  4. Bug: _item_id 38–81 annotate the CSDN recommendation sidebar (<div id="recommend">)

Impact: The article body is 2,869 chars (4.7% of main_html). The sidebar is 58,106 chars (95.3%). The extracted content is 95% noise.

HTML at the boundary:

<!-- Last article node (correct) -->
<p _item_id="37">如果输出 <code>2147483647</code>,则表示无限强度策略已生效。</p>
</div></div></div>

<!-- Sidebar starts here (should NOT have _item_id) -->
<div id="recommend">
    <div class="recommend_list">
        <dl class="container-fluid" data-url="https://download.csdn.net/...">
            <!-- _item_id="38" assigned to first recommend item -->

A second page exhibits the same pattern: https://blog.csdn.net/Angellyouran/article/details/147597019/
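
The inspection in steps 2–4 can be automated by splitting the _item_id values according to whether they sit inside the recommendation container. The following is a minimal sketch using only the Python standard library. ItemIdAuditor and audit_item_ids are hypothetical helper names, not part of MinerU-HTML, and the input is assumed to be the annotated HTML string with numeric _item_id values.

```python
from html.parser import HTMLParser


class ItemIdAuditor(HTMLParser):
    """Collect _item_id values, split by whether they sit inside a noise container."""

    def __init__(self, noise_ids):
        super().__init__()
        self.noise_ids = set(noise_ids)
        self.noise_tag = None  # tag name of the noise container we are inside
        self.depth = 0         # nesting depth of that tag name
        self.inside = []       # _item_id values found inside noise containers
        self.outside = []      # _item_id values found everywhere else

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.noise_tag is None and attrs.get("id") in self.noise_ids:
            # Entering a noise container such as <div id="recommend">.
            self.noise_tag, self.depth = tag, 1
        elif tag == self.noise_tag:
            # Same tag name nested inside the container: track depth so the
            # matching end tag is found correctly.
            self.depth += 1
        item_id = attrs.get("_item_id")
        if item_id is not None:
            bucket = self.inside if self.noise_tag else self.outside
            bucket.append(int(item_id))

    def handle_endtag(self, tag):
        if tag == self.noise_tag:
            self.depth -= 1
            if self.depth == 0:
                self.noise_tag = None


def audit_item_ids(annotated_html, noise_ids=("recommend",)):
    """Return (ids_outside_noise, ids_inside_noise) for an annotated HTML string."""
    auditor = ItemIdAuditor(noise_ids)
    auditor.feed(annotated_html)
    auditor.close()
    return auditor.outside, auditor.inside
```

On the page above, the second list would be expected to hold the IDs 38–81 if the sidebar annotations are present. The sketch assumes the container's end tags are balanced.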

Case 2: GitCode spam link leak (1/240 pages)

URL: https://blog.csdn.net/gitblog_00820/article/details/152394486

Steps:

  1. Run MinerUHTML.process(html) on this page's HTML
  2. Inspect the spam <a> tag with _item_id=13

Observation: The model assigned _item_id=13 to a CSDN-injected GitCode spam link:

<a _item_id="13" class="has-card" href="https://gitcode.com/GitHub_Trending/ra/raylib/...">
  <span class="link-title">【免费下载链接】raylib</span>
</a>

This spam text (【免费下载链接】) appears in the final extracted content.

Context: 123/240 CSDN pages in our sample contain GitCode spam links. The model correctly excludes them in 120/123 cases (97.6% accuracy). This is one of 3 leak cases.
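
To find such leaks across a sample, annotated GitCode links can be flagged automatically. This is a minimal sketch, assuming that a leaked link is an <a> element with an _item_id attribute, an href on gitcode.com, and the 【免费下载链接】 marker in its text, as in the example above. GitCodeSpamFinder and find_leaked_gitcode_links are hypothetical names, and the heuristic is drawn from this single example, so it may miss other variants.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

SPAM_MARKER = "【免费下载链接】"  # marker text taken from the example above


class GitCodeSpamFinder(HTMLParser):
    """Record the _item_id of every annotated gitcode.com link carrying the marker."""

    def __init__(self):
        super().__init__()
        self.in_anchor = False  # True while inside a gitcode.com <a>
        self.item_id = None     # _item_id of that <a>, or None if un-annotated
        self.text = []          # text fragments collected inside the <a>
        self.leaked = []        # _item_id values of annotated spam links

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.in_anchor:
            return
        attrs = dict(attrs)
        host = urlparse(attrs.get("href") or "").netloc.lower()
        if host == "gitcode.com" or host.endswith(".gitcode.com"):
            self.in_anchor = True
            self.item_id = attrs.get("_item_id")
            self.text = []

    def handle_data(self, data):
        if self.in_anchor:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_anchor:
            if self.item_id is not None and SPAM_MARKER in "".join(self.text):
                self.leaked.append(self.item_id)
            self.in_anchor = False


def find_leaked_gitcode_links(annotated_html):
    """Return the _item_id values (as strings) of annotated GitCode spam links."""
    finder = GitCodeSpamFinder()
    finder.feed(annotated_html)
    finder.close()
    return finder.leaked
```

An empty result means no annotated GitCode link was found. It does not show that the page contains no spam link at all.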

Case 3: Cross-domain sidebar noise

The sidebar issue is not CSDN-specific. In our 2,000-page sample across multiple domains:

| Domain | Pages with <aside> / sidebar noise in main_html | Total pages |
|---|---|---|
| gitmemories.com | 5 | 67 |
| go.dev | 5 | 6 |
| csdn.net | 2 | 240 |
| arcgis.com | 1 | 2 |
| Several others | 1 each | |

Root Cause Analysis

simplify_html.py strips <nav>, <script>, <style>, etc. before LLM inference, but does not strip:

  • <aside> (semantic HTML5 sidebar tag)
  • Elements with sidebar-related IDs (id="recommend", id="sidebar")
  • Elements with ARIA roles (role="complementary", role="navigation")

These survive preprocessing and get _item_id annotations. When the LLM misclassifies them as "main", the noise enters main_html.

The current ATTR_PATTERNS_TO_REMOVE in simplify_html.py only contains 'nav':

ATTR_PATTERNS_TO_REMOVE = {
    'nav',  # 'footer', 'header',  # standalone words
}
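
The "standalone words" comment suggests that patterns are compared against whole words in attribute values. The actual matching logic in simplify_html.py is not reproduced here, so the following is only a sketch under that assumption. It illustrates why id="recommend" would survive a pattern set containing only 'nav'. matches_standalone_word is a hypothetical helper name.

```python
import re


def matches_standalone_word(attr_value, patterns):
    """Return True if any word in attr_value equals one of the patterns.

    Words are runs of letters and digits, so "main-nav" and "nav_bar" both
    contain the word "nav", while "navigation" does not.
    """
    words = re.split(r"[^a-z0-9]+", attr_value.lower())
    return any(word in patterns for word in words)


current = {"nav"}
print(matches_standalone_word("main-nav", current))    # True
print(matches_standalone_word("recommend", current))   # False: survives preprocessing
print(matches_standalone_word("recommend", current | {"recommend"}))  # True
```

Adding words such as "recommend" to the pattern set is more aggressive than tag-based removal, because the same word can appear in the id or class of legitimate article markup.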

Suggestion

Would you consider extending simplify_html.py to also strip <aside> elements (similar to the existing <nav> handling in tags_to_remove)? The <aside> tag has clear semantic meaning in HTML5 — it represents content "tangentially related" to the main content, which aligns with what should be classified as "other".

This would be a conservative extension that follows the same pattern as the existing <nav> removal and reduces the chance of LLM misclassification on sidebar content.
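
To make the proposal concrete, here is a standalone sketch of the filtering step, written with the Python standard library rather than against the real simplify_html.py, whose internals are not shown in this issue. SidebarStripper, strip_sidebars, and the three NOISE_* sets are hypothetical names. The id and role rules go beyond the <aside>-only suggestion and are included only to show how the three cases listed in the root cause analysis could be handled.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"aside"}
NOISE_IDS = {"recommend", "sidebar"}
NOISE_ROLES = {"complementary", "navigation"}


class SidebarStripper(HTMLParser):
    """Re-emit HTML while dropping noise elements and everything inside them."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entities verbatim
        self.out = []
        self.skip_tag = None  # tag name of the element being dropped
        self.skip_depth = 0   # nesting depth of that tag name

    def _is_noise(self, tag, attrs):
        attrs = dict(attrs)
        return (tag in NOISE_TAGS
                or attrs.get("id") in NOISE_IDS
                or attrs.get("role") in NOISE_ROLES)

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
        elif self._is_noise(tag, attrs):
            # Assumes the noise element has an end tag (not a void element).
            self.skip_tag, self.skip_depth = tag, 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags open and close at once, so no depth tracking.
        if self.skip_tag is None and not self._is_noise(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag is None:
            self.out.append(f"</{tag}>")
        elif tag == self.skip_tag:
            self.skip_depth -= 1
            if self.skip_depth == 0:
                self.skip_tag = None

    def _emit(self, text):
        if self.skip_tag is None:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_sidebars(html):
    """Return html with <aside> and id/role-marked sidebar elements removed."""
    stripper = SidebarStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)
```

Because removal happens before _item_id annotation, the stripped nodes never reach the model and cannot be misclassified. The trade-off is that any article content a site wraps in these elements is lost as well.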

Environment

  • Model: opendatalab/MinerU-HTML (v1.0, Qwen3-based)
  • Code: commit 73cf2666
  • We are planning to test v1.1 (MinerU-HTML-v1.1-hunyuan0.5B-compact) on these cases to see if the improved model accuracy resolves the issue.

Sample Data

We can share the full 240-page CSDN sample (raw HTML + Dripper output + extracted markdown) if helpful for reproducing and testing fixes.
