
Promote JsonDetector to JsonParser with lazy pretty-print #579

Merged
danieldotnl merged 2 commits into master from feature/555-json-parser
May 13, 2026

Conversation

@danieldotnl (Owner) commented May 13, 2026

Summary

  • Renames JsonDetector to JsonParser. The old class was misleadingly named (its parse() returned None); the new JsonParser.parse() actually returns the parsed dict | list.
  • Adds JSON pretty-printing to page_json.txt when log_response is enabled, alongside the existing page_soup.txt path for HTML. formatted_content parses on demand and falls back to raw text on JSONDecodeError / RecursionError.
  • No parsed structure is stored on the Scraper; the only part of the codebase that pays the parse cost is the file-logging path (or the get_content service when explicitly invoked). value_template / value_json remains the canonical extraction path via HA's async_render_with_possible_json_value.
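
A minimal sketch of the lazy pretty-print path described above, assuming the member names from this summary (`_data`, `_is_json`) and an indent width of 2; this is illustrative, not the actual implementation:

```python
import json


class Scraper:
    """Illustrative fragment: only the lazy formatted_content path is shown."""

    def __init__(self) -> None:
        self._data: str = ""         # raw response text, always kept verbatim
        self._is_json: bool = False  # set by set_content() when the JSON parser matched

    @property
    def formatted_content(self) -> str:
        """Pretty-print JSON on demand; fall back to the raw text if parsing fails."""
        if self._is_json:
            try:
                return json.dumps(json.loads(self._data), indent=2)
            except (json.JSONDecodeError, RecursionError):
                # Malformed or pathologically nested payloads flow through unchanged.
                return self._data
        return self._data
```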

Groundwork toward #555 (native JSON API support). The JSONPath extractor portion of #555 is intentionally not in this PR.

Test plan

  • pytest tests/ — 465 tests pass
  • Manual: enable log_response: true on a JSON endpoint and verify page_json.txt contains pretty-printed JSON
  • Manual: existing value_template: "{{ value_json.x }}" configs continue to work unchanged
  • Manual: malformed-JSON response (e.g. truncated) still flows through without raising; raw text ends up in page_json.txt

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Proper JSON parsing with pretty-printed output for clearer results.
    • JSON responses are written to dedicated log output for easier review.
    • Templates and value extraction now work with JSON responses.
  • Bug Fixes

    • Graceful fallback for malformed or deeply-nested JSON to preserve raw content.
    • JSON content is no longer treated like HTML, avoiding incorrect parsing.
  • Documentation

    • README content and formatting updated for clarity.
  • Tests

    • Test coverage expanded for JSON handling and edge cases.

Review Change Stack

Renames the misleadingly-named JsonDetector (whose parse() returned None) to
JsonParser and gives it a real parse() that returns the dict/list. The Scraper
no longer stores the parsed structure; instead, formatted_content parses on
demand and falls back to raw text on JSONDecodeError or RecursionError. When
log_response is enabled, JSON responses are written to page_json.txt
(pretty-printed) alongside the existing page_soup.txt path for HTML.

Groundwork for issue #555 (native JSON API support) — value_template remains
the canonical extraction path via HA's value_json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai Bot commented May 13, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1323b51c-eaae-4313-b6f7-61b376786316

📥 Commits

Reviewing files that changed from the base of the PR and between ca4ade6 and 9ff0c3f.

📒 Files selected for processing (1)
  • tests/test_scraper.py

📝 Walkthrough

This PR replaces the JSON-detector pattern with a JsonParser that parses JSON, adds pretty-printing and async page_json file logging in the scraper, updates parser and scraper tests for JSON behavior and state, and refreshes README formatting and wording.

Changes

JSON Parser Implementation and Scraper Integration

| Layer / File(s) | Summary |
| --- | --- |
| JsonParser implementation and factory wiring<br>`custom_components/multiscrape/parsers.py` | JsonDetector is replaced with a JsonParser class that uses json.loads via hass.async_add_executor_job to parse JSON into a dict or list, and ParserFactory registers the new parser instead of the detector. |
| Scraper JSON detection, formatting, and file logging<br>`custom_components/multiscrape/scraper.py` | Scraper imports updated to use JsonParser; formatted_content returns pretty-printed JSON (via json.dumps(json.loads(...))) when _is_json is set, with exception fallback to raw content; set_content detects JsonParser instances, sets _is_json, skips BeautifulSoup parsing for JSON, and asynchronously writes pretty-printed JSON to page_json when enabled. |
| JsonParser and ParserFactory unit tests<br>`tests/test_parsers.py` | TestJsonDetector is replaced with TestJsonParser covering async parse behavior for dict/list outcomes, malformed JSON error handling, escaped characters, and JSON detection; ParserFactory tests updated to expect JsonParser. |
| Scraper JSON integration and state tests<br>`tests/test_scraper.py` | JSON fixture imports expanded, the _is_json state flag asserted in JSON-detection and reset tests, the reset test verifies HTML and JSON state transitions, and a new JSON-specific test section validates pretty-printed output, soft failure for malformed/deeply nested JSON, value_template support, and file logging to page_json.txt. |
| README documentation updates<br>`README.md` | GitHub Markdown TIP callout added, support-section typo fixed, example configuration guidance italicized, advanced configuration section restructured into separate paragraphs, options table formatting updated, and a service-description punctuation/escaping change applied. |
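
To make the table concrete, here is a sketch of the parser shape it describes; the class name and the executor-offloaded parse() follow the walkthrough, while the is_json() detection heuristic is purely an assumption:

```python
import json

from homeassistant.core import HomeAssistant


class JsonParser:
    """Sketch of the JsonParser described above; the real class may differ in detail."""

    def __init__(self, hass: HomeAssistant) -> None:
        self._hass = hass

    def is_json(self, content: str) -> bool:
        # Hypothetical cheap structural check before committing to a full parse.
        return content.lstrip().startswith(("{", "["))

    async def parse(self, content: str) -> dict | list:
        # json.loads on large payloads is CPU-bound, so hand it to the executor
        # rather than blocking Home Assistant's event loop.
        return await self._hass.async_add_executor_job(json.loads, content)
```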

Sequence Diagram

```mermaid
sequenceDiagram
  participant Client
  participant ParserFactory
  participant JsonParser
  participant Scraper
  participant FileLog
  Client->>Scraper: HTTP response (text)
  Scraper->>ParserFactory: get_parser(response_text)
  ParserFactory->>JsonParser: select JsonParser
  Scraper->>JsonParser: parse(content) via hass.async_add_executor_job
  JsonParser-->>Scraper: parsed dict|list
  Scraper->>Scraper: set _is_json = True
  Scraper->>Scraper: set _soup = None
  Scraper->>Scraper: formatted_content -> pretty JSON or fallback
  Scraper->>FileLog: async write page_json with formatted_content
```
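
The diagram's final arrow (the async page_json write) could look roughly like this; the helper name and directory handling are hypothetical, and only the executor-offloaded write pattern is the point:

```python
from pathlib import Path

from homeassistant.core import HomeAssistant


async def async_log_page_json(hass: HomeAssistant, log_dir: Path, formatted: str) -> None:
    """Hypothetical helper: persist pretty-printed JSON without blocking the loop."""

    def _write() -> None:
        log_dir.mkdir(parents=True, exist_ok=True)
        (log_dir / "page_json.txt").write_text(formatted, encoding="utf-8")

    # File I/O is blocking, so hand it to Home Assistant's executor pool.
    await hass.async_add_executor_job(_write)
```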

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • danieldotnl/ha-multiscrape#568: The main PR’s JSON strategy changes (replacing JsonDetector with a real JsonParser and updating Scraper/ParserFactory and JSON handling accordingly) directly build on and modify the strategy-pattern parsing work introduced in #568.

Poem

🐰 I found some JSON in the night,
I parsed it true by morning light.
Pretty prints tucked into a log,
Tests hop by and clear the fog,
Multiscrape hums—everything's right! ✨

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately reflects the core changes: renaming JsonDetector to JsonParser and adding lazy pretty-printing functionality for JSON content. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.




Triggering a real RecursionError via 10000-deep nested arrays depends on
sys.getrecursionlimit(), which is environment-dependent. On CI with a higher
limit, json.loads succeeded and json.dumps(indent=2) produced a multi-megabyte
string — the assertion diff then flooded the step log and hit the output cap.
Monkeypatch json.loads instead so the soft-fail branch is exercised
deterministically with a tiny input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
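
A sketch of the deterministic test shape this commit message describes, assuming the scraper_instance fixture from the review thread and pytest's monkeypatch; patching after set_content keeps JSON detection running against the real json.loads:

```python
import json


async def test_recursion_error_soft_fail(monkeypatch, scraper_instance):
    """Force json.loads to raise so formatted_content's fallback branch runs."""
    raw = '{"ok": true}'
    await scraper_instance.set_content(raw)  # detection still uses the real json.loads

    def boom(*args, **kwargs):
        raise RecursionError("forced for the test")

    # Patch the shared json module so only the lazy formatted_content parse fails;
    # monkeypatch undoes this automatically after the test.
    monkeypatch.setattr(json, "loads", boom)

    assert scraper_instance._is_json is True
    # Soft-fail: the raw payload comes back unchanged instead of raising.
    assert scraper_instance.formatted_content == raw
```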
@coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_scraper.py (1)

326-342: 💤 Low value

Note: RecursionError test is sensitive to interpreter recursion limit.

The 10000-level nesting assumes CPython's default recursion limit (~1000). If any conftest or plugin raises the recursion limit substantially via sys.setrecursionlimit (e.g., for HA core fixtures), json.loads may no longer raise RecursionError and the assertion would pass for the wrong reason (since the raw _data already equals deeply_nested). Consider pinning the recursion limit inside the test via a context manager, or asserting the fallback path by mocking json.loads to raise RecursionError directly; that decouples the test from interpreter state.

♻️ Optional: pin recursion limit for determinism
```diff
 async def test_scraper_handles_recursion_error_softly(
     hass: HomeAssistant, scraper_instance
 ):
     """Pathologically nested JSON (RecursionError) also soft-fails in formatted_content."""
-    # 10000 levels of nesting reliably triggers RecursionError on CPython 3.13
-    # while remaining a syntactically valid JSON-shaped string.
-    deeply_nested = "[" * 10000 + "]" * 10000
-
-    await scraper_instance.set_content(deeply_nested)
-
-    assert scraper_instance._is_json is True
-    assert scraper_instance._data == deeply_nested
-    # formatted_content falls back to raw on RecursionError
-    assert scraper_instance.formatted_content == deeply_nested
+    import sys
+    # Ensure interpreter state cannot accidentally avoid the RecursionError path.
+    original_limit = sys.getrecursionlimit()
+    sys.setrecursionlimit(1000)
+    try:
+        deeply_nested = "[" * 10000 + "]" * 10000
+        await scraper_instance.set_content(deeply_nested)
+
+        assert scraper_instance._is_json is True
+        assert scraper_instance._data == deeply_nested
+        # formatted_content falls back to raw on RecursionError
+        assert scraper_instance.formatted_content == deeply_nested
+    finally:
+        sys.setrecursionlimit(original_limit)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_scraper.py` around lines 326 - 342, The test
test_scraper_handles_recursion_error_softly is brittle because it depends on the
interpreter recursion limit; instead patch json.loads to reliably raise
RecursionError so formatted_content falls back to raw. Modify the test to use
monkeypatch (or unittest.mock.patch) to replace json.loads with a callable that
raises RecursionError, call scraper_instance.set_content(deeply_nested), then
assert scraper_instance._is_json is True and scraper_instance.formatted_content
== deeply_nested; ensure the patch targets the json.loads symbol and is undone
after the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 168: Update the README description for the log_response option so the
path reads correctly by inserting a space between "to" and "/config" (i.e.,
"written to /config/multiscrape/name_of_config"); locate the table row for the
log_response flag in README.md and edit the description string for the
log_response entry to "Log the HTTP responses and HTML parsed by BeautifulSoup
in files. (Will be written to /config/multiscrape/name_of_config)" so users
don't see the malformed path.

---

Nitpick comments:
In `@tests/test_scraper.py`:
- Around line 326-342: The test test_scraper_handles_recursion_error_softly is
brittle because it depends on the interpreter recursion limit; instead patch
json.loads to reliably raise RecursionError so formatted_content falls back to
raw. Modify the test to use monkeypatch (or unittest.mock.patch) to replace
json.loads with a callable that raises RecursionError, call
scraper_instance.set_content(deeply_nested), then assert
scraper_instance._is_json is True and scraper_instance.formatted_content ==
deeply_nested; ensure the patch targets the json.loads symbol and is undone
after the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a41feb65-05dd-4a87-9062-b38d2cebac02

📥 Commits

Reviewing files that changed from the base of the PR and between 67a78f2 and ca4ade6.

📒 Files selected for processing (5)
  • README.md
  • custom_components/multiscrape/parsers.py
  • custom_components/multiscrape/scraper.py
  • tests/test_parsers.py
  • tests/test_scraper.py

Comment thread: README.md
| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list |
| button | See [Refresh button](#refresh-button) | False | | list |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix file path typo in log_response description.

There’s a missing space in `written to/config/...`; this reads like an invalid path and can confuse users copying instructions.

✏️ Suggested docs fix
```diff
-| log_response      | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config)  | False    | False   | boolean           |
+| log_response      | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False    | False   | boolean           |
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
+| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False | False | boolean |
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 168, Update the README description for the log_response
option so the path reads correctly by inserting a space between "to" and
"/config" (i.e., "written to /config/multiscrape/name_of_config"); locate the
table row for the log_response flag in README.md and edit the description string
for the log_response entry to "Log the HTTP responses and HTML parsed by
BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config)"
so users don't see the malformed path.

@danieldotnl merged commit 29edc60 into master May 13, 2026 (6 of 7 checks passed)
@danieldotnl deleted the feature/555-json-parser branch May 13, 2026 12:13
