fix(fetch): fall back when Readability strips hidden SSR content #3922
Open
Christian-Sidak wants to merge 1 commit into modelcontextprotocol:main from
Add a three-stage extraction pipeline to extract_content_from_html():

1. Readability (existing, best quality for standard pages)
2. readabilipy without Readability JS (less aggressive, no CSS visibility filtering)
3. Raw markdownify conversion (last resort)

Stages 2 and 3 only activate when stage 1 produces text shorter than 1% of the input HTML length, which indicates Readability stripped meaningful content. This commonly happens with progressive SSR sites that deliver content in hidden containers (visibility:hidden, position:absolute) awaiting client-side hydration.

No new dependencies. No behavior change for sites where Readability works correctly.

Fixes modelcontextprotocol#3878
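The staged fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `extract_with_fallback`, the `Stage` type, `MIN_RATIO`, and `EMPTY_RESULT` are hypothetical names, the stages are passed in as plain callables so the sketch runs without readabilipy or markdownify, and accepting the last stage's output unconditionally is an assumption about how "last resort" behaves.

```python
from typing import Callable, Optional, Sequence

# Hypothetical stage signature: raw HTML in, extracted text (or None) out.
Stage = Callable[[str], Optional[str]]

MIN_RATIO = 0.01  # the 1% threshold from the description
EMPTY_RESULT = "<error>no content extracted</error>"  # stand-in, not the server's exact string


def extract_with_fallback(html: str, stages: Sequence[Stage]) -> str:
    """Run stages in order and return the first proportionate result."""
    for index, stage in enumerate(stages):
        text = (stage(html) or "").strip()
        is_last = index == len(stages) - 1
        # Earlier stages must clear the 1% bar; the last stage is
        # accepted whenever it yields any text (assumption).
        if text and (is_last or len(text) >= MIN_RATIO * len(html)):
            return text
    return EMPTY_RESULT


# Stand-in stages that mimic the three behaviours on a hidden-SSR page.
def strict_stage(html: str) -> str:
    return "Loading"  # only the visible shell survives


def lenient_stage(html: str) -> str:
    return "real content " * 200  # hidden container is kept


def raw_stage(html: str) -> str:
    return html  # unfiltered conversion


page = "<div style='visibility:hidden'>" + "real content " * 200 + "</div><p>Loading</p>"
result = extract_with_fallback(page, [strict_stage, lenient_stage, raw_stage])
```

Here the first stage returns 7 characters against a page of more than 2,600, so it falls under the 1% bar and the second stage's output is used instead. On a page where the first stage returns a proportionate amount of text, the later stages never run.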
olaservo
approved these changes
Apr 12, 2026
Member
olaservo
left a comment
This is the strongest of the three Readability fallback PRs (#3879, #3894, #3922):
- 3-stage pipeline (Readability → readabilipy without Readability → raw markdownify) gives a good quality gradient
- Proportional 1% threshold scales with page size, unlike a fixed constant
- Preserves the `<error>` return for truly empty pages (no behavior change)
- 6 fully mocked deterministic tests with good edge case coverage
- Smallest diff to production code (only 4 lines removed)
We'll close the other two PRs with credit to @morozow for filing the original issue (#3878).
This review was assisted by Claude Code.
This was referenced Apr 12, 2026
Summary
Add a three-stage extraction pipeline to `extract_content_from_html()` so that pages using progressive SSR (hidden pre-hydration markup) are not silently reduced to a single line of loading-shell text
Motivation
Sites using progressive server-side rendering (Next.js streaming, Remix deferred, custom Lambda SSR) deliver content in two phases: a small visible loading shell, then the real content in a hidden container (`visibility:hidden; position:absolute; top:-9999px`) that becomes visible after client-side hydration. Mozilla Readability treats hidden elements as non-content and strips them entirely, causing `mcp-server-fetch` to return only the loading shell text with no indication that content was lost.
For example, fetching `https://runtimeweb.com` returns just `"Unified Serverless Framework for Full-Stack TypeScript Applications"` instead of the full page content.
Changes
- `src/fetch/src/mcp_server_fetch/server.py`: Modified `extract_content_from_html()` to try three extraction stages, falling back only when the previous stage produces disproportionately little text
- `src/fetch/tests/test_server.py`: Added 6 unit tests covering all fallback paths, threshold behavior, and no-regression for normal pages
Breaking Changes
None. The fix only activates when Readability extracts less than 1% of the HTML size as text. Normal sites where Readability works correctly are completely unaffected.
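To make the 1% rule concrete, here is a small worked sketch. `looks_stripped` is a hypothetical helper, the page sizes are made-up, and measuring in characters of stripped text is an assumption; the actual comparison in `server.py` may differ in detail.

```python
def looks_stripped(extracted_text: str, html: str, ratio: float = 0.01) -> bool:
    """Return True when the extracted text is shorter than `ratio` of the HTML length."""
    return len(extracted_text.strip()) < ratio * len(html)


# The 67-character loading-shell text quoted in the Motivation section.
shell = "Unified Serverless Framework for Full-Stack TypeScript Applications"

small_page = "x" * 5_000    # hypothetical 5,000-char page: threshold is 50 chars
large_page = "x" * 200_000  # hypothetical 200,000-char page: threshold is 2,000 chars

small_flagged = looks_stripped(shell, small_page)  # 67 is not below 50
large_flagged = looks_stripped(shell, large_page)  # 67 is below 2,000
```

The same 67 characters are accepted on the small page and flagged on the large one, which is why a proportional threshold leaves short but complete pages alone while still catching a large page reduced to one line.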
Test plan
Fixes #3878