fix(fetch): fall back when Readability strips hidden SSR content #3922

Open
Christian-Sidak wants to merge 1 commit into modelcontextprotocol:main from Christian-Sidak:fix/fetch-ssr-content-fallback

@Christian-Sidak

Summary

  • Adds a three-stage fallback to extract_content_from_html() so that pages using progressive SSR (hidden pre-hydration markup) are not silently reduced to a single line of loading-shell text
  • Stage 1: Readability (existing behavior, unchanged for normal sites)
  • Stage 2: readabilipy without Readability JS (less aggressive, does not filter by CSS visibility)
  • Stage 3: Raw markdownify conversion (last resort)
  • Fallback activates only when the Readability output is shorter than 1% of the input HTML length
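
The staged fallback summarized above could be sketched roughly as follows. This is a minimal illustration and not the code in the PR: `MIN_TEXT_RATIO`, `looks_truncated`, and `extract_with_fallback` are hypothetical names, the stages are passed in as plain callables rather than calling readabilipy or markdownify directly, and returning the last stage's output unconditionally is an assumption about what "last resort" means.

```python
from typing import Callable, List, Optional

# Hypothetical constant name; the PR states the 1% ratio but not how it is spelled in code.
MIN_TEXT_RATIO = 0.01

Stage = Callable[[str], Optional[str]]


def looks_truncated(text: Optional[str], html: str, ratio: float = MIN_TEXT_RATIO) -> bool:
    """Return True when the extracted text is empty or shorter than `ratio` of the HTML length."""
    if not text:
        return True
    return len(text) < len(html) * ratio


def extract_with_fallback(html: str, stages: List[Stage]) -> Optional[str]:
    """Run the stages in order and return the first result that is not disproportionately short.

    Assumption: if every stage looks truncated, the last stage's output is
    returned as-is, since there is nothing further to fall back to.
    """
    text: Optional[str] = None
    for stage in stages:
        text = stage(html)
        if not looks_truncated(text, html):
            return text
    return text
```

In this sketch a caller would pass three callables in order (Readability-based extraction, readabilipy without Readability JS, raw markdownify conversion); later stages are never invoked when an earlier one returns enough text.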

Motivation

Sites using progressive server-side rendering (Next.js streaming, Remix deferred, custom Lambda SSR) deliver content in two phases: a small visible loading shell, then the real content in a hidden container (visibility:hidden; position:absolute; top:-9999px) that becomes visible after client-side hydration. Mozilla Readability treats hidden elements as non-content and strips them entirely, causing mcp-server-fetch to return only the loading shell text with no indication that content was lost.

For example, fetching https://runtimeweb.com returns just "Unified Serverless Framework for Full-Stack TypeScript Applications" instead of the full page content.
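
The failure mode can be reproduced with a toy model. The sketch below is not Readability; it is a simplified stand-in built on Python's standard `html.parser` that only looks at inline `visibility:hidden` styles, and `TextCollector`, `visible_text`, and `SAMPLE_HTML` are made-up names for illustration. It shows how a visibility-aware extractor returns only the loading shell, while a visibility-agnostic pass recovers the SSR payload.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextCollector(HTMLParser):
    """Collect text nodes, optionally skipping subtrees styled visibility:hidden."""

    def __init__(self, skip_hidden: bool):
        super().__init__()
        self.skip_hidden = skip_hidden
        self.hidden_depth = 0  # > 0 while inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or (self.skip_hidden and "visibility:hidden" in style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str, skip_hidden: bool) -> str:
    parser = TextCollector(skip_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Illustrative markup: a visible shell plus a hidden pre-hydration container.
SAMPLE_HTML = (
    '<div id="shell"><h1>Loading shell</h1></div>'
    '<div style="visibility:hidden; position:absolute; top:-9999px">'
    "<p>Real article body delivered by SSR.</p>"
    "</div>"
)

shell_only = visible_text(SAMPLE_HTML, skip_hidden=True)
everything = visible_text(SAMPLE_HTML, skip_hidden=False)
```

Here `shell_only` contains just the heading text, while `everything` also includes the paragraph from the hidden container.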

Changes

  • src/fetch/src/mcp_server_fetch/server.py: Modified extract_content_from_html() to try three extraction stages, falling back only when the previous stage produces disproportionately little text
  • src/fetch/tests/test_server.py: Added 6 unit tests covering all fallback paths, threshold behavior, and no-regression for normal pages

Breaking Changes

None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are completely unaffected.

Test plan

  • All 6 new fallback tests pass
  • Existing tests unaffected (they test Readability directly via Node.js, independent of this change)
  • No new dependencies added

Fixes #3878

Add a three-stage extraction pipeline to extract_content_from_html():

1. Readability (existing, best quality for standard pages)
2. readabilipy without Readability JS (less aggressive, no CSS visibility filtering)
3. Raw markdownify conversion (last resort)

Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the
input HTML length, which indicates Readability stripped meaningful content. This
commonly happens with progressive SSR sites that deliver content in hidden
containers (visibility:hidden, position:absolute) awaiting client-side hydration.

No new dependencies. No behavior change for sites where Readability works correctly.

Fixes modelcontextprotocol#3878
Member

@olaservo left a comment

This is the strongest of the three Readability fallback PRs (#3879, #3894, #3922):

  • 3-stage pipeline (Readability → readabilipy without Readability → raw markdownify) gives a good quality gradient
  • Proportional 1% threshold scales with page size, unlike a fixed constant
  • Preserves the <error> return for truly empty pages (no behavior change)
  • 6 fully-mocked deterministic tests with good edge case coverage
  • Smallest diff to production code (only 4 lines removed)
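
To illustrate the point about the proportional threshold, here is a small comparison against a fixed cutoff. The 100-character constant and both function names are hypothetical, chosen only for the example; they are not taken from any of the three PRs.

```python
def below_fixed(text_len: int, min_chars: int = 100) -> bool:
    """Fixed cutoff: fall back whenever fewer than `min_chars` characters were extracted."""
    return text_len < min_chars


def below_proportional(text_len: int, html_len: int, ratio: float = 0.01) -> bool:
    """Proportional cutoff: fall back when the text is under `ratio` of the HTML length."""
    return text_len < html_len * ratio


# A small, legitimate page: 2,000 characters of HTML yielding 60 characters of text.
small_fixed = below_fixed(60)                       # would trigger a needless fallback
small_prop = below_proportional(60, 2_000)          # 60 >= 20, so no fallback

# A large SSR page: 300,000 characters of HTML yielding a 150-character loading shell.
large_fixed = below_fixed(150)                      # 150 >= 100, so the loss goes unnoticed
large_prop = below_proportional(150, 300_000)       # 150 < 3,000, so the fallback runs
```

With these made-up numbers the fixed cutoff misfires in both directions, while the proportional one tracks page size.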

We'll close the other two PRs with credit to @morozow for filing the original issue (#3878).


This review was assisted by Claude Code.

Development

Successfully merging this pull request may close these issues.

mcp-server-fetch drops SSR content from streaming/progressive rendering sites
