Promote JsonDetector to JsonParser with lazy pretty-print #579
Conversation
Renames the misleadingly-named JsonDetector (whose parse() returned None) to JsonParser and gives it a real parse() that returns the dict/list. The Scraper no longer stores the parsed structure; instead, formatted_content parses on demand and falls back to raw text on JSONDecodeError or RecursionError. When log_response is enabled, JSON responses are written to page_json.txt (pretty-printed) alongside the existing page_soup.txt path for HTML. Groundwork for issue #555 (native JSON API support) — value_template remains the canonical extraction path via HA's value_json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
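The shape of the change described above can be sketched roughly as follows. Class and method names follow the PR description; the `pretty_or_raw` helper is an illustrative stand-in for the lazy `formatted_content` behavior, not the component's actual code:

```python
import json


class JsonParser:
    """Sketch of the renamed parser: parse() now returns the decoded
    structure instead of None."""

    def parse(self, content: str):
        # Raises json.JSONDecodeError on invalid input; the caller
        # decides how to soft-fail.
        return json.loads(content)


def pretty_or_raw(raw: str) -> str:
    """Illustrative stand-in for the lazy formatted_content property:
    pretty-print on demand, fall back to the raw text on failure."""
    try:
        return json.dumps(json.loads(raw), indent=2)
    except (json.JSONDecodeError, RecursionError):
        return raw
```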
Caution: Review failed. The pull request is closed.
📝 Walkthrough

This PR replaces the JSON-detector pattern with a JsonParser that parses JSON, adds pretty-printing and async page_json file logging in the scraper, updates parser and scraper tests for JSON behavior and state, and refreshes README formatting and wording.

Changes: JSON Parser Implementation and Scraper Integration
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ParserFactory
    participant JsonParser
    participant Scraper
    participant FileLog
    Client->>Scraper: HTTP response (text)
    Scraper->>ParserFactory: get_parser(response_text)
    ParserFactory->>JsonParser: select JsonParser
    Scraper->>JsonParser: parse(content) via hass.async_add_executor_job
    JsonParser-->>Scraper: parsed dict|list
    Scraper->>Scraper: set _is_json = True
    Scraper->>Scraper: set _soup = None
    Scraper->>Scraper: formatted_content -> pretty JSON or fallback
    Scraper->>FileLog: async write page_json with formatted_content
```
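The scraper side of this flow can be sketched as below. The attribute names (`_data`, `_is_json`, `_soup`) follow the diagram; the first-character detection heuristic is an assumption for illustration, chosen so that pathological input cannot blow the stack during detection:

```python
import json


class ScraperSketch:
    """Minimal sketch of the diagrammed flow (illustrative, not the
    component's actual implementation)."""

    def __init__(self):
        self._data = ""
        self._is_json = False
        self._soup = None  # would hold the BeautifulSoup tree for HTML

    def set_content(self, content: str) -> None:
        self._data = content
        # Cheap syntactic check: JSON payloads start with '{' or '['.
        self._is_json = content.lstrip()[:1] in ("{", "[")
        if self._is_json:
            self._soup = None

    @property
    def formatted_content(self) -> str:
        if not self._is_json:
            return self._data
        try:
            return json.dumps(json.loads(self._data), indent=2)
        except (json.JSONDecodeError, RecursionError):
            return self._data  # soft-fail: keep the raw payload
```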
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 5 passed
Warning: Review ran into problems. 🔥 Git: Failed to clone repository.
Triggering a real RecursionError via 10000-deep nested arrays depends on sys.getrecursionlimit(), which is environment-dependent. On CI with a higher limit, json.loads succeeded and json.dumps(indent=2) produced a multi-megabyte string — the assertion diff then flooded the step log and hit the output cap. Monkeypatch json.loads instead so the soft-fail branch is exercised deterministically with a tiny input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
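The deterministic variant described in this commit message can be sketched like this. The `pretty_or_raw` helper stands in for the scraper's fallback logic, and the injected-loader pattern is illustrative; in the real suite the same effect would come from pytest's `monkeypatch` targeting the `json.loads` symbol the scraper module imports:

```python
import json


def pretty_or_raw(raw: str, loads=json.loads) -> str:
    """Stand-in for the scraper's fallback: pretty-print, else raw."""
    try:
        return json.dumps(loads(raw), indent=2)
    except (json.JSONDecodeError, RecursionError):
        return raw


def test_recursion_error_soft_fails():
    # Inject a loader that always raises, so the RecursionError branch
    # is exercised with a tiny input, independent of the interpreter's
    # sys.getrecursionlimit() state.
    def boom(_raw):
        raise RecursionError
    assert pretty_or_raw("[[]]", loads=boom) == "[[]]"
```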
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_scraper.py (1)
326-342: 💤 Low value. Note: RecursionError test is sensitive to the interpreter recursion limit.
The 10000-level nesting assumes CPython's default recursion limit (~1000). If any conftest or plugin raises `sys.setrecursionlimit` substantially (e.g., for HA core fixtures), `json.loads` may no longer raise `RecursionError` and the assertion would pass for the wrong reason (since the raw `_data` already equals `deeply_nested`). Consider pinning the recursion limit inside the test via a context manager, or asserting the fallback path by mocking `json.loads` to raise `RecursionError` directly — that decouples the test from interpreter state.

♻️ Optional: pin recursion limit for determinism
```diff
 async def test_scraper_handles_recursion_error_softly(
     hass: HomeAssistant, scraper_instance
 ):
     """Pathologically nested JSON (RecursionError) also soft-fails in formatted_content."""
-    # 10000 levels of nesting reliably triggers RecursionError on CPython 3.13
-    # while remaining a syntactically valid JSON-shaped string.
-    deeply_nested = "[" * 10000 + "]" * 10000
-
-    await scraper_instance.set_content(deeply_nested)
-
-    assert scraper_instance._is_json is True
-    assert scraper_instance._data == deeply_nested
-    # formatted_content falls back to raw on RecursionError
-    assert scraper_instance.formatted_content == deeply_nested
+    import sys
+    # Ensure interpreter state cannot accidentally avoid the RecursionError path.
+    original_limit = sys.getrecursionlimit()
+    sys.setrecursionlimit(1000)
+    try:
+        deeply_nested = "[" * 10000 + "]" * 10000
+        await scraper_instance.set_content(deeply_nested)
+
+        assert scraper_instance._is_json is True
+        assert scraper_instance._data == deeply_nested
+        # formatted_content falls back to raw on RecursionError
+        assert scraper_instance.formatted_content == deeply_nested
+    finally:
+        sys.setrecursionlimit(original_limit)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_scraper.py` around lines 326 - 342, The test test_scraper_handles_recursion_error_softly is brittle because it depends on the interpreter recursion limit; instead patch json.loads to reliably raise RecursionError so formatted_content falls back to raw. Modify the test to use monkeypatch (or unittest.mock.patch) to replace json.loads with a callable that raises RecursionError, call scraper_instance.set_content(deeply_nested), then assert scraper_instance._is_json is True and scraper_instance.formatted_content == deeply_nested; ensure the patch targets the json.loads symbol and is undone after the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@README.md`:
- Line 168: Update the README description for the log_response option so the
path reads correctly by inserting a space between "to" and "/config" (i.e.,
"written to /config/multiscrape/name_of_config"); locate the table row for the
log_response flag in README.md and edit the description string for the
log_response entry to "Log the HTTP responses and HTML parsed by BeautifulSoup
in files. (Will be written to /config/multiscrape/name_of_config)" so users
don't see the malformed path.
---
Nitpick comments:
In `@tests/test_scraper.py`:
- Around line 326-342: The test test_scraper_handles_recursion_error_softly is
brittle because it depends on the interpreter recursion limit; instead patch
json.loads to reliably raise RecursionError so formatted_content falls back to
raw. Modify the test to use monkeypatch (or unittest.mock.patch) to replace
json.loads with a callable that raises RecursionError, call
scraper_instance.set_content(deeply_nested), then assert
scraper_instance._is_json is True and scraper_instance.formatted_content ==
deeply_nested; ensure the patch targets the json.loads symbol and is undone
after the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a41feb65-05dd-4a87-9062-b38d2cebac02
📒 Files selected for processing (5)
- README.md
- custom_components/multiscrape/parsers.py
- custom_components/multiscrape/scraper.py
- tests/test_parsers.py
- tests/test_scraper.py
| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list | |
| button | See [Refresh button](#refresh-button) | False | | list | |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean | |
| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean | |
Fix file path typo in log_response description.
There’s a missing space in `written to/config/...`; this reads like an invalid path and can confuse users copying instructions.
✏️ Suggested docs fix

```diff
-| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
+| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False | False | boolean |
```
Summary
- Rename `JsonDetector` to `JsonParser`. The old class was misleadingly named — its `parse()` returned `None`. The new `JsonParser.parse()` actually returns the `dict | list`.
- JSON responses are written to `page_json.txt` when `log_response` is enabled, alongside the existing `page_soup.txt` path for HTML.
- `formatted_content` parses on demand and falls back to raw text on `JSONDecodeError`/`RecursionError`.
- `get_content` service when explicitly invoked).
- `value_template`/`value_json` remains the canonical extraction path via HA's `async_render_with_possible_json_value`.
- Groundwork toward #555 (native JSON API support). The JSONPath extractor portion of #555 is intentionally not in this PR.
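For context, a minimal multiscrape configuration exercising both the logging option and the canonical template path might look like this (the endpoint URL and sensor name are made up; the option names come from the README table and PR description):

```yaml
multiscrape:
  - resource: https://example.com/api/status.json
    log_response: true            # pretty-printed JSON lands in page_json.txt
    sensor:
      - name: Example status
        value_template: "{{ value_json.x }}"   # extraction via value_json
```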
Test plan

- `pytest tests/` — 465 tests pass
- Enable `log_response: true` on a JSON endpoint and verify `page_json.txt` contains pretty-printed JSON
- Existing `value_template: "{{ value_json.x }}"` configs continue to work unchanged
- `page_json.txt`

🤖 Generated with Claude Code