
Promote JsonDetector to JsonParser with lazy pretty-print #579

Merged
danieldotnl merged 2 commits into master from feature/555-json-parser
May 13, 2026

Conversation

@danieldotnl (Owner) commented May 13, 2026

Summary

  • Renames JsonDetector to JsonParser. The old class was misleadingly named (its parse() returned None); the new JsonParser.parse() actually returns the parsed dict | list.
  • Adds JSON pretty-printing to page_json.txt when log_response is enabled, alongside the existing page_soup.txt path for HTML. formatted_content parses on demand and falls back to raw text on JSONDecodeError / RecursionError.
  • No parsed structure is stored on the Scraper; the only part of the codebase that pays the parse cost is the file-logging path (or the get_content service when explicitly invoked). value_template / value_json remains the canonical extraction path via HA's async_render_with_possible_json_value.
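
A minimal sketch of the lazy pretty-print path described above, assuming the member names from this summary (`_data`, `_is_json`) and an indent width of 2; this is illustrative, not the actual implementation:

```python
import json


class Scraper:
    """Illustrative fragment: only the lazy formatted_content path is shown."""

    def __init__(self) -> None:
        self._data: str = ""         # raw response text, always kept verbatim
        self._is_json: bool = False  # set by set_content() when the JSON parser matched

    @property
    def formatted_content(self) -> str:
        """Pretty-print JSON on demand; fall back to the raw text if parsing fails."""
        if self._is_json:
            try:
                return json.dumps(json.loads(self._data), indent=2)
            except (json.JSONDecodeError, RecursionError):
                # Malformed or pathologically nested payloads flow through unchanged.
                return self._data
        return self._data
```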

Groundwork toward #555 (native JSON API support). The JSONPath extractor portion of #555 is intentionally not in this PR.

Test plan

  • pytest tests/ — 465 tests pass
  • Manual: enable log_response: true on a JSON endpoint and verify page_json.txt contains pretty-printed JSON
  • Manual: existing value_template: "{{ value_json.x }}" configs continue to work unchanged
  • Manual: malformed-JSON response (e.g. truncated) still flows through without raising; raw text ends up in page_json.txt

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Proper JSON parsing with pretty-printed output for clearer results.
    • JSON responses are written to dedicated log output for easier review.
    • Templates and value extraction now work with JSON responses.
  • Bug Fixes

    • Graceful fallback for malformed or deeply-nested JSON to preserve raw content.
    • JSON content is no longer treated like HTML, avoiding incorrect parsing.
  • Documentation

    • README content and formatting updated for clarity.
  • Tests

    • Test coverage expanded for JSON handling and edge cases.

Review Change Stack

Renames the misleadingly-named JsonDetector (whose parse() returned None) to
JsonParser and gives it a real parse() that returns the dict/list. The Scraper
no longer stores the parsed structure; instead, formatted_content parses on
demand and falls back to raw text on JSONDecodeError or RecursionError. When
log_response is enabled, JSON responses are written to page_json.txt
(pretty-printed) alongside the existing page_soup.txt path for HTML.

Groundwork for issue #555 (native JSON API support) — value_template remains
the canonical extraction path via HA's value_json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai Bot commented May 13, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1323b51c-eaae-4313-b6f7-61b376786316

📥 Commits

Reviewing files that changed from the base of the PR and between ca4ade6 and 9ff0c3f.

📒 Files selected for processing (1)
  • tests/test_scraper.py

📝 Walkthrough

This PR replaces the JSON-detector pattern with a JsonParser that parses JSON, adds pretty-printing and async page_json file logging in the scraper, updates parser and scraper tests for JSON behavior and state, and refreshes README formatting and wording.

Changes

JSON Parser Implementation and Scraper Integration

| Layer / File(s) | Summary |
| --- | --- |
| JsonParser implementation and factory wiring<br>`custom_components/multiscrape/parsers.py` | JsonDetector is replaced with a JsonParser class that uses json.loads via hass.async_add_executor_job to parse JSON into a dict or list, and ParserFactory registers the new parser instead of the detector. |
| Scraper JSON detection, formatting, and file logging<br>`custom_components/multiscrape/scraper.py` | Scraper imports updated to use JsonParser; formatted_content returns pretty-printed JSON (via json.dumps(json.loads(...))) when _is_json is set, with exception fallback to raw content; set_content detects JsonParser instances, sets _is_json, skips BeautifulSoup parsing for JSON, and asynchronously writes pretty-printed JSON to page_json when enabled. |
| JsonParser and ParserFactory unit tests<br>`tests/test_parsers.py` | TestJsonDetector is replaced with TestJsonParser covering async parse behavior for dict/list outcomes, malformed JSON error handling, escaped characters, and JSON detection; ParserFactory tests updated to expect JsonParser. |
| Scraper JSON integration and state tests<br>`tests/test_scraper.py` | JSON fixture imports expanded, the _is_json state flag asserted in JSON-detection and reset tests, the reset test verifies HTML and JSON state transitions, and a new JSON-specific test section validates pretty-printed output, soft failure for malformed/deeply nested JSON, value_template support, and file logging to page_json.txt. |
| README documentation updates<br>`README.md` | GitHub Markdown TIP callout added, support-section typo fixed, example configuration guidance italicized, advanced configuration section restructured into separate paragraphs, options table formatting updated, and a service-description punctuation/escaping change applied. |
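
To make the table concrete, here is a sketch of the parser shape it describes; the class name and the executor-offloaded parse() follow the walkthrough, while the is_json() detection heuristic is purely an assumption:

```python
import json

from homeassistant.core import HomeAssistant


class JsonParser:
    """Sketch of the JsonParser described above; the real class may differ in detail."""

    def __init__(self, hass: HomeAssistant) -> None:
        self._hass = hass

    def is_json(self, content: str) -> bool:
        # Hypothetical cheap structural check before committing to a full parse.
        return content.lstrip().startswith(("{", "["))

    async def parse(self, content: str) -> dict | list:
        # json.loads on large payloads is CPU-bound, so hand it to the executor
        # rather than blocking Home Assistant's event loop.
        return await self._hass.async_add_executor_job(json.loads, content)
```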

Sequence Diagram

```mermaid
sequenceDiagram
  participant Client
  participant ParserFactory
  participant JsonParser
  participant Scraper
  participant FileLog
  Client->>Scraper: HTTP response (text)
  Scraper->>ParserFactory: get_parser(response_text)
  ParserFactory->>JsonParser: select JsonParser
  Scraper->>JsonParser: parse(content) via hass.async_add_executor_job
  JsonParser-->>Scraper: parsed dict|list
  Scraper->>Scraper: set _is_json = True
  Scraper->>Scraper: set _soup = None
  Scraper->>Scraper: formatted_content -> pretty JSON or fallback
  Scraper->>FileLog: async write page_json with formatted_content
```
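
The diagram's final arrow (the async page_json write) could look roughly like this; the helper name and directory handling are hypothetical, and only the executor-offloaded write pattern is the point:

```python
from pathlib import Path

from homeassistant.core import HomeAssistant


async def async_log_page_json(hass: HomeAssistant, log_dir: Path, formatted: str) -> None:
    """Hypothetical helper: persist pretty-printed JSON without blocking the loop."""

    def _write() -> None:
        log_dir.mkdir(parents=True, exist_ok=True)
        (log_dir / "page_json.txt").write_text(formatted, encoding="utf-8")

    # File I/O is blocking, so hand it to Home Assistant's executor pool.
    await hass.async_add_executor_job(_write)
```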

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • danieldotnl/ha-multiscrape#568: The main PR’s JSON strategy changes (replacing JsonDetector with a real JsonParser and updating Scraper/ParserFactory and JSON handling accordingly) directly build on and modify the strategy-pattern parsing work introduced in #568.

Poem

🐰 I found some JSON in the night,
I parsed it true by morning light.
Pretty prints tucked into a log,
Tests hop by and clear the fog,
Multiscrape hums—everything's right! ✨

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately reflects the core changes: renaming JsonDetector to JsonParser and adding lazy pretty-printing functionality for JSON content. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.




Triggering a real RecursionError via 10000-deep nested arrays depends on
sys.getrecursionlimit(), which is environment-dependent. On CI with a higher
limit, json.loads succeeded and json.dumps(indent=2) produced a multi-megabyte
string — the assertion diff then flooded the step log and hit the output cap.
Monkeypatch json.loads instead so the soft-fail branch is exercised
deterministically with a tiny input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
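
A sketch of the deterministic test shape this commit message describes, assuming the scraper_instance fixture from the review thread and pytest's monkeypatch; patching after set_content keeps JSON detection running against the real json.loads:

```python
import json


async def test_recursion_error_soft_fail(monkeypatch, scraper_instance):
    """Force json.loads to raise so formatted_content's fallback branch runs."""
    raw = '{"ok": true}'
    await scraper_instance.set_content(raw)  # detection still uses the real json.loads

    def boom(*args, **kwargs):
        raise RecursionError("forced for the test")

    # Patch the shared json module so only the lazy formatted_content parse fails;
    # monkeypatch undoes this automatically after the test.
    monkeypatch.setattr(json, "loads", boom)

    assert scraper_instance._is_json is True
    # Soft-fail: the raw payload comes back unchanged instead of raising.
    assert scraper_instance.formatted_content == raw
```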
@coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_scraper.py (1)

326-342: 💤 Low value

Note: RecursionError test is sensitive to interpreter recursion limit.

The 10000-level nesting assumes CPython's default recursion limit (~1000). If any conftest or plugin raises the recursion limit substantially via sys.setrecursionlimit (e.g., for HA core fixtures), json.loads may no longer raise RecursionError and the assertion would pass for the wrong reason (since the raw _data already equals deeply_nested). Consider pinning the recursion limit inside the test via a context manager, or asserting the fallback path by mocking json.loads to raise RecursionError directly; that decouples the test from interpreter state.

♻️ Optional: pin recursion limit for determinism
```diff
 async def test_scraper_handles_recursion_error_softly(
     hass: HomeAssistant, scraper_instance
 ):
     """Pathologically nested JSON (RecursionError) also soft-fails in formatted_content."""
-    # 10000 levels of nesting reliably triggers RecursionError on CPython 3.13
-    # while remaining a syntactically valid JSON-shaped string.
-    deeply_nested = "[" * 10000 + "]" * 10000
-
-    await scraper_instance.set_content(deeply_nested)
-
-    assert scraper_instance._is_json is True
-    assert scraper_instance._data == deeply_nested
-    # formatted_content falls back to raw on RecursionError
-    assert scraper_instance.formatted_content == deeply_nested
+    import sys
+    # Ensure interpreter state cannot accidentally avoid the RecursionError path.
+    original_limit = sys.getrecursionlimit()
+    sys.setrecursionlimit(1000)
+    try:
+        deeply_nested = "[" * 10000 + "]" * 10000
+        await scraper_instance.set_content(deeply_nested)
+
+        assert scraper_instance._is_json is True
+        assert scraper_instance._data == deeply_nested
+        # formatted_content falls back to raw on RecursionError
+        assert scraper_instance.formatted_content == deeply_nested
+    finally:
+        sys.setrecursionlimit(original_limit)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_scraper.py` around lines 326 - 342, The test
test_scraper_handles_recursion_error_softly is brittle because it depends on the
interpreter recursion limit; instead patch json.loads to reliably raise
RecursionError so formatted_content falls back to raw. Modify the test to use
monkeypatch (or unittest.mock.patch) to replace json.loads with a callable that
raises RecursionError, call scraper_instance.set_content(deeply_nested), then
assert scraper_instance._is_json is True and scraper_instance.formatted_content
== deeply_nested; ensure the patch targets the json.loads symbol and is undone
after the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 168: Update the README description for the log_response option so the
path reads correctly by inserting a space between "to" and "/config" (i.e.,
"written to /config/multiscrape/name_of_config"); locate the table row for the
log_response flag in README.md and edit the description string for the
log_response entry to "Log the HTTP responses and HTML parsed by BeautifulSoup
in files. (Will be written to /config/multiscrape/name_of_config)" so users
don't see the malformed path.

---

Nitpick comments:
In `@tests/test_scraper.py`:
- Around line 326-342: The test test_scraper_handles_recursion_error_softly is
brittle because it depends on the interpreter recursion limit; instead patch
json.loads to reliably raise RecursionError so formatted_content falls back to
raw. Modify the test to use monkeypatch (or unittest.mock.patch) to replace
json.loads with a callable that raises RecursionError, call
scraper_instance.set_content(deeply_nested), then assert
scraper_instance._is_json is True and scraper_instance.formatted_content ==
deeply_nested; ensure the patch targets the json.loads symbol and is undone
after the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a41feb65-05dd-4a87-9062-b38d2cebac02

📥 Commits

Reviewing files that changed from the base of the PR and between 67a78f2 and ca4ade6.

📒 Files selected for processing (5)
  • README.md
  • custom_components/multiscrape/parsers.py
  • custom_components/multiscrape/scraper.py
  • tests/test_parsers.py
  • tests/test_scraper.py

Comment thread: README.md
| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list |
| button | See [Refresh button](#refresh-button) | False | | list |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix file path typo in log_response description.

There’s a missing space in `written to/config/...`; this reads like an invalid path and can confuse users copying instructions.

✏️ Suggested docs fix
```diff
-| log_response      | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config)  | False    | False   | boolean           |
+| log_response      | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False    | False   | boolean           |
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
+| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False | False | boolean |
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 168, Update the README description for the log_response
option so the path reads correctly by inserting a space between "to" and
"/config" (i.e., "written to /config/multiscrape/name_of_config"); locate the
table row for the log_response flag in README.md and edit the description string
for the log_response entry to "Log the HTTP responses and HTML parsed by
BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config)"
so users don't see the malformed path.

@danieldotnl merged commit 29edc60 into master May 13, 2026 (6 of 7 checks passed)
@danieldotnl deleted the feature/555-json-parser branch May 13, 2026 12:13
