fix(fetch): fall back when Readability strips hidden SSR content by Christian-Sidak · Pull Request #3922 · modelcontextprotocol/servers

Christian-Sidak · 2026-04-12T18:26:22Z

Summary

Adds a three-stage fallback to extract_content_from_html() so that pages using progressive SSR (hidden pre-hydration markup) are not silently reduced to a single line of loading-shell text
Stage 1: Readability (existing behavior, unchanged for normal sites)
Stage 2: readabilipy without Readability JS (less aggressive, does not filter by CSS visibility)
Stage 3: Raw markdownify conversion (last resort)
Fallback only activates when Readability output is shorter than 1% of the input HTML
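
The stage ordering above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the code from this PR: `extract_with_fallback`, `EMPTY_RESULT`, and the stub stages are hypothetical names standing in for the Readability, readabilipy-without-Readability, and raw markdownify calls. The 1% check mirrors the threshold described in the summary.

```python
from typing import Callable, Optional, Sequence

# Hypothetical placeholder for the server's <error> return value.
EMPTY_RESULT = "<error>no content extracted</error>"

Stage = Callable[[str], Optional[str]]


def extract_with_fallback(html: str, stages: Sequence[Stage]) -> str:
    """Run stages in order and return the first output that is long enough.

    A stage's output is accepted when it is at least 1% of the input length
    (minimum 1 character). The last stage is accepted if it produced any text.
    """
    min_expected = max(1, len(html) // 100)
    for index, stage in enumerate(stages):
        text = (stage(html) or "").strip()
        is_last = index == len(stages) - 1
        if len(text) >= min_expected or (is_last and text):
            return text
    return EMPTY_RESULT


# Stub stages: stage 1 only sees the loading shell, stage 2 recovers content.
html = "<html>" + "x" * 10_000 + "</html>"
stages = [
    lambda h: "Loading...",
    lambda h: "real content " * 20,
    lambda h: h,
]
result = extract_with_fallback(html, stages)
```

With these stubs the first stage is rejected for being too short and the second stage's output is returned, so the third stage never runs.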

Motivation

Sites using progressive server-side rendering (Next.js streaming, Remix deferred, custom Lambda SSR) deliver content in two phases: a small visible loading shell, then the real content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after client-side hydration. Mozilla Readability treats hidden elements as non-content and strips them entirely, causing mcp-server-fetch to return only the loading shell text with no indication that content was lost.

For example, fetching https://runtimeweb.com returns just "Unified Serverless Framework for Full-Stack TypeScript Applications" instead of the full page content.
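
To make the failure mode concrete, here is a toy sketch using only the standard library. It is not Readability's algorithm: `VisibleTextParser`, `extract_text`, and the sample page are hypothetical, and the visibility check is a plain substring match on the inline `style` attribute. It only shows how skipping hidden subtrees leaves nothing but the loading shell.

```python
from html.parser import HTMLParser

# Hypothetical page: a visible loading shell plus hidden pre-hydration content.
SSR_HTML = """
<body>
  <div id="shell">Loading...</div>
  <div id="content" style="visibility:hidden; position:absolute; top:-9999px">
    <h1>Docs</h1><p>The real article text lives here.</p>
  </div>
</body>
"""

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class VisibleTextParser(HTMLParser):
    """Collect text, optionally skipping subtrees styled visibility:hidden."""

    def __init__(self, skip_hidden: bool) -> None:
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "")
        # Track nesting depth so the matching end tags close the subtree.
        if self.hidden_depth or "visibility:hidden" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if data.strip() and not (self.skip_hidden and self.hidden_depth):
            self.chunks.append(data.strip())


def extract_text(html: str, skip_hidden: bool) -> str:
    parser = VisibleTextParser(skip_hidden)
    parser.feed(html)
    return " ".join(parser.chunks)


shell_only = extract_text(SSR_HTML, skip_hidden=True)
everything = extract_text(SSR_HTML, skip_hidden=False)
```

Here `shell_only` contains just the loading text, while `everything` also includes the heading and paragraph from the hidden container.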

Changes

src/fetch/src/mcp_server_fetch/server.py: Modified extract_content_from_html() to try three extraction stages, falling back only when the previous stage produces disproportionately little text
src/fetch/tests/test_server.py: Added 6 unit tests covering all fallback paths, threshold behavior, and no-regression for normal pages

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are completely unaffected.

Test plan

All 6 new fallback tests pass
Existing tests unaffected (they test Readability directly via Node.js, independent of this change)
No new dependencies added

Fixes #3878

Add a three-stage extraction pipeline to extract_content_from_html():

1. Readability (existing, best quality for standard pages)
2. readabilipy without Readability JS (less aggressive, no CSS visibility filtering)
3. Raw markdownify conversion (last resort)

Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the input HTML length, which indicates Readability stripped meaningful content. This commonly happens with progressive SSR sites that deliver content in hidden containers (visibility:hidden, position:absolute) awaiting client-side hydration.

No new dependencies. No behavior change for sites where Readability works correctly.

Fixes modelcontextprotocol#3878

olaservo

This is the strongest of the three Readability fallback PRs (#3879, #3894, #3922):

3-stage pipeline (Readability → readabilipy without Readability → raw markdownify) gives a good quality gradient
Proportional 1% threshold scales with page size, unlike a fixed constant
Preserves the <error> return for truly empty pages (no behavior change)
6 fully-mocked deterministic tests with good edge case coverage
Smallest diff to production code (only 4 lines removed)

We'll close the other two PRs with credit to @morozow for filing the original issue (#3878).

This review was assisted by Claude Code.

morozow · 2026-04-14T08:58:51Z

@Christian-Sidak @olaservo I analyzed both implementations (#3922 and #3879) in the context of the MCP fetch server and the expected contract of a transport-level content extractor.
The #3922 (current) variant introduces opinionated validation via length thresholds, .strip(), and a hard <error> state. This breaks neutrality: it can discard valid outputs (e.g., whitespace-only content), produces non-data responses, and makes the result dependent on heuristics tied to HTML size. In MCP terms, this violates separation of concerns: the fetch layer should not decide what is "good enough" content.

The #3879 variant behaves as a proper extraction primitive: it attempts Readability, falls back when needed, and always returns the extracted content without enforcing interpretation or artificial failure states. This keeps the pipeline predictable for agents and preserves full fidelity of the source, which is critical for downstream processing.

The #3879 variant is the correct approach for MCP fetch, as it maintains a clean transport contract and avoids embedding policy/validation logic in the extraction layer.

This PR needs either refactoring or a revert to the implementation from #3879, #3947 to be eligible for merge and to properly resolve the issue described in #3878.

morozow · 2026-04-14T12:31:23Z

@olaservo Updated tests for full edge-case coverage and reopened as #3947

Christian-Sidak · 2026-04-19T05:25:40Z

Good points, @morozow -- I agree the length threshold and <error> wrapping add opinionation that doesn't belong in the transport layer. Happy to strip those heuristics and just return whatever readabilipy/markdownify produces without any filtering. Note that #3947 was closed -- is there a preferred path forward, or would a simplified version of this PR work?

cliffhall · 2026-04-21T14:30:52Z

@claude review

claude · 2026-04-21T14:31:09Z

Claude finished @cliffhall's task in 1m 48s

Code Review

Gather context and understand request
Read changed files
Analyze implementation
Post review feedback

This PR addresses a genuine problem: Mozilla Readability strips visibility:hidden SSR containers, leaving only loading-shell text. The three-stage pipeline is a reasonable approach. Below are my observations.

`src/fetch/src/mcp_server_fetch/server.py`

Threshold heuristic (line 45)

min_expected_length = max(1, len(html) // 100)

This is the most debatable part, as morozow noted. Consider a minimalist landing page with 80 KB of inlined JavaScript but only a few hundred bytes of real visible text — Readability's small, clean output is correct, but this threshold would discard it and fall through to Stage 2 or Stage 3, potentially injecting raw JS noise. The 1% line blurs the boundary between extraction and quality validation. If the concern is specifically about SSR hidden containers, a targeted approach (detecting near-zero output for large HTML) might be more defensible, but any threshold is inherently opinionated.
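
The false-positive scenario can be worked through numerically. The sketch below reuses the threshold expression quoted above; the page itself (80,000 characters of inlined script, 300 characters of visible text) is a made-up example, and `min_expected_length` as a function name is an assumption for illustration.

```python
def min_expected_length(html: str) -> int:
    # Same expression as the line quoted in the review.
    return max(1, len(html) // 100)


# Hypothetical page: a large inlined script and a short visible paragraph.
script = "<script>" + "a" * 80_000 + "</script>"
visible_text = "word " * 60  # 300 characters before stripping
html = f"<html><body>{script}<p>{visible_text}</p></body></html>"

threshold = min_expected_length(html)
extracted_length = len(visible_text.strip())

# Readability's output is correct here, yet it is shorter than 1% of the
# input, so the proportional check would still trigger the fallback stages.
falls_back = extracted_length < threshold
```

Because the script dominates the input size, the threshold lands well above the length of the legitimate text and `falls_back` is true.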

Stage 3: raw markdownify on full HTML (lines 74–80)

content = markdownify.markdownify(html, heading_style=markdownify.ATX)

markdownify does not strip <script> or <style> blocks — it converts them verbatim to plain text. If Stage 3 activates, the returned Markdown may include large blobs of JavaScript or CSS that are noisy for downstream LLM processing. Consider filtering those tags before applying markdownify:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
content = markdownify.markdownify(str(soup), heading_style=markdownify.ATX)

BeautifulSoup is already an indirect dependency (via readabilipy), so this wouldn't add a new dependency.

Code duplication (lines 47–72)

Stages 1 and 2 share identical structure: call simple_json_from_html_string, extract content, markdownify, and check the threshold. A small helper function would reduce duplication and make it easier to add a Stage 4 later if needed.

ret.get("content", "") (lines 51, 65)

Changing from ret["content"] to ret.get("content", "") is a safe improvement — it avoids a KeyError if readabilipy ever returns a dict without content. Good.

`src/fetch/tests/test_server.py`

The 6 new tests cover the happy path (no fallback), each fallback stage, the None content case, and threshold math. That's solid coverage.

Minor: unusual call_count pattern (lines 386–392)

call_count = [0]
def mock_simple_json(h, use_readability=True):
    call_count[0] += 1
    if call_count[0] == 1:

Using a list as a mutable counter is a Python 2 workaround. In Python 3.10+ (the minimum for this server), nonlocal is cleaner:

call_count = 0  # local to the enclosing test function; nonlocal requires a function scope
def mock_simple_json(h, use_readability=True):
    nonlocal call_count
    call_count += 1
    ...

Or pass side_effect a list of return values: mock.side_effect = [{"content": None}, {"content": ...}].
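
A minimal sketch of the `side_effect` variant, using only `unittest.mock` from the standard library. `fake_simple_json` is a hypothetical stand-in for the patched `simple_json_from_html_string`, and the returned dictionaries are made-up values.

```python
from unittest import mock

# Each call consumes the next item in the list, so no counter is needed.
fake_simple_json = mock.Mock(
    side_effect=[
        {"content": None},                # first call: Readability finds nothing
        {"content": "<p>recovered</p>"},  # second call: fallback stage succeeds
    ]
)

first = fake_simple_json("<html></html>", use_readability=True)
second = fake_simple_json("<html></html>", use_readability=False)
```

The mock also records its calls, so a test can assert on `fake_simple_json.call_count` or on the `use_readability` argument of each call instead of tracking state by hand.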

test_threshold_is_one_percent_of_html (line 398)

The assertion len(result.strip()) > 50 is intentionally loose (just confirming fallback fired), but it would be more valuable to assert the exact content returned (e.g., that "x" * 9000 chars appear in the output), making regressions more visible.

Overall

The core fix is sound and the tests are well-structured. The two actionable issues worth addressing before merge are:

Stage 3 <script>/<style> noise — low effort, high value fix
The 1% threshold — the existing discussion around whether heuristic quality checks belong in the transport layer is valid; consider whether a simpler "always fall through if Readability returns empty/None" condition (without the length comparison) would satisfy the SSR use case without the false-positive risk

Christian-Sidak · 2026-04-22T05:23:44Z

Addressed the two actionable items from the Claude code review:

Stage 3 script/style stripping — added BeautifulSoup pre-processing to strip <script> and <style> tags before passing to markdownify. This avoids large blobs of raw JS/CSS appearing in the Markdown output when Stage 3 activates.
Left the 1% threshold as-is for now — the PR is already approved and the threshold drives the fallback behavior; open to adjusting if maintainers prefer a simpler empty-check approach.

Replace the proportional min_expected_length heuristic with a plain emptiness check (content.strip()). The threshold was too opinionated and could discard correct Readability output for legitimately sparse pages. Falls back through stages only when output is truly empty.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Christian-Sidak · 2026-04-24T02:00:30Z

Removed the 1% content-length threshold as discussed -- Stages 1 and 2 now fall through only when Readability returns empty content (content.strip()), not based on a proportional length heuristic. Updated the test accordingly.

This should address @morozow's concern about transport-layer neutrality and the Claude bot review's suggestion to simplify to an empty/None check.
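
The emptiness contract described above can be sketched as a single predicate. `needs_fallback` is a hypothetical name for illustration, not a function from the PR; it assumes a stage may return `None`, an empty string, or whitespace-only text.

```python
from typing import Optional


def needs_fallback(content: Optional[str]) -> bool:
    """Return True when a stage produced None, an empty string, or only whitespace."""
    return not (content and content.strip())


# Sparse but real output no longer triggers the next stage.
sparse_page = "Unified Serverless Framework for Full-Stack TypeScript Applications"
checks = {
    "none": needs_fallback(None),
    "empty": needs_fallback(""),
    "whitespace": needs_fallback("  \n\t"),
    "sparse": needs_fallback(sparse_page),
}
```

Unlike the proportional check, this predicate does not depend on the size of the input HTML, so a short but correct extraction is returned as-is.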

After removing the 1% length threshold, fallback is triggered by empty/None content rather than by proportion. Update the test to reflect the new empty-check contract.

Christian-Sidak · 2026-04-25T05:10:34Z

Fixed the failing fetch test — after removing the 1% threshold, the fallback now triggers on empty/None content rather than proportional size. Updated test_readability_strips_content_falls_back_to_no_readability to pass empty string from Stage 1 (which correctly exercises the empty-check path). All other tests still pass.

Christian-Sidak · 2026-05-03T05:52:49Z

Friendly bump -- let me know if anything needs changing.

Christian-Sidak · 2026-05-11T06:31:24Z

Friendly bump -- let me know if anything needs changing.

olaservo previously approved these changes Apr 12, 2026

This was referenced Apr 12, 2026

fix(fetch): add fallback extraction for readability-stripped content #3879

Closed

fix(fetch): fall back to raw HTML conversion when Readability returns empty content #3894

Closed

olaservo mentioned this pull request Apr 18, 2026

fix(fetch): add fallback extraction for readability-stripped content #3947

Closed

cliffhall added the bug and server-fetch labels Apr 20, 2026

fix(fetch): strip script/style tags before markdownify in Stage 3 fallback

a9fc3bb

Christian-Sidak dismissed olaservo’s stale review via a9fc3bb April 22, 2026 05:23

fix(fetch): update fallback test to use empty readability output

64facb4

Christian-Sidak wants to merge 4 commits into modelcontextprotocol:main from Christian-Sidak:fix/fetch-ssr-content-fallback


5 participants