
Commit 29edc60

Merge pull request #579 from danieldotnl/feature/555-json-parser
Promote JsonDetector to JsonParser with lazy pretty-print
2 parents 67a78f2 + 9ff0c3f commit 29edc60

5 files changed: 246 additions & 74 deletions


README.md

Lines changed: 35 additions & 28 deletions
````diff
@@ -1,7 +1,9 @@
 # Multiscrape
 
 ---
+
 > [!TIP]
+>
 > ## 👋 A quick note
 >
 > I run **[Smart Home Newsletter](https://smarthomenewsletter.com/?utm_source=github&utm_medium=readme&utm_campaign=multiscrape)** — a weekly curated digest for smart home enthusiasts.
@@ -11,9 +13,8 @@
 > Since you're here, you're clearly into home automation — so you might genuinely enjoy the newsletter.
 >
 > 👉 [Subscribe at smarthomenewsletter.com](https://smarthomenewsletter.com/?utm_source=github&utm_medium=readme&utm_campaign=multiscrape)
->
----
 
+---
 
 [![GitHub Release][releases-shield]][releases]
 [![License][license-shield]](LICENSE)
@@ -33,11 +34,12 @@
 ## Need help with Multiscrape?
 
 ### Personal (paid) support option
+
 I very often get asked for help, for example with finding the right CSS selectors or with a login. Actually more often than I can handle, so I'm running an experiment with a paid support option!
 
 **Sponsor me [here](https://github.com/sponsors/danieldotnl/sponsorships?tier_id=432422), and I'll try to assist you with your `multiscrape` configuration within 1-2 days.** The support funds will go towards family time, making up for the hours I spend on Home Assistant ☺️.
 
-**Note:** Scraping isn't always possible. I'd love to offer a "no cure, no pay" service, but GitHub Sponsoring doesn't support that. If you're concerned about sponsoring without guarentee, please reach out by email before sponsoring!
+**Note:** Scraping isn't always possible. I'd love to offer a "no cure, no pay" service, but GitHub Sponsoring doesn't support that. If you're concerned about sponsoring without guarantee, please reach out by email before sponsoring!
 
 ### Other options
 
@@ -67,7 +69,8 @@ It is based on both the existing [Rest sensor](https://www.home-assistant.io/int
 Install via HACS (default store) or install manually by copying the files in a new 'custom_components/multiscrape' directory.
 
 ## Example configuration (YAML)
-*This code example is to be placed into /config/configuration.yaml*
+
+_This code example is to be placed into /config/configuration.yaml_
 
 ```yaml
 multiscrape:
@@ -100,17 +103,21 @@ multiscrape:
         select: ".release-date"
         attribute: href
 ```
+
 ### Advanced Example Configuration (YAML)
+
 For background on splitting the HA configuration, see the [HA Documentation](https://www.home-assistant.io/docs/configuration/splitting_configuration/).
 
-*Inside the configuration.yaml file*
+_Inside the configuration.yaml file_
+
 ```yaml
 multiscrape: !include multiscrape.yaml
 ```
 
 Make a new file named /config/multiscrape.yaml
 
-*Inside the multiscrape.yaml file. Syntax is the same but starting at the resource level*
+_Inside the multiscrape.yaml file. Syntax is the same but starting at the resource level_
+
 ```yaml
 - resource: https://www.home-assistant.io
   scan_interval: 3600
@@ -145,28 +152,28 @@ Make a new file named /config/multiscrape.yaml
 
 Based on latest (pre) release.
 
-| name | description | required | default | type |
-| ----------------- | ------------------------------------------------------------------------------------------------------------------------- | -------- | ------- | --------------- |
-| name | The name for the integration. | False | | string |
-| resource | The url for retrieving the site or a template that will output an url. Not required when `resource_template` is provided. | True | | string |
-| resource_template | A template that will output an url after being rendered. Only required when `resource` is not provided. | True | | template |
-| authentication | Configure HTTP authentication. `basic` or `digest`. Use this with username and password fields. | False | | string |
-| username | The username for accessing the url. | False | | string |
-| password | The password for accessing the url. | False | | string |
-| headers | The headers for the requests. | False | | template - list |
-| params | The query params for the requests. | False | | template - list |
-| method | The method for the request. Either `POST` or `GET`. | False | GET | string |
+| name | description | required | default | type |
+| ----------------- | ------------------------------------------------------------------------------------------------------------------------- | -------- | ------- | ----------------- |
+| name | The name for the integration. | False | | string |
+| resource | The url for retrieving the site or a template that will output an url. Not required when `resource_template` is provided. | True | | string |
+| resource_template | A template that will output an url after being rendered. Only required when `resource` is not provided. | True | | template |
+| authentication | Configure HTTP authentication. `basic` or `digest`. Use this with username and password fields. | False | | string |
+| username | The username for accessing the url. | False | | string |
+| password | The password for accessing the url. | False | | string |
+| headers | The headers for the requests. | False | | template - list |
+| params | The query params for the requests. | False | | template - list |
+| method | The method for the request. Either `POST` or `GET`. | False | GET | string |
 | payload | Optional payload to send with a POST request. | False | | template - string |
-| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
-| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
-| timeout | Defines max time to wait data from the endpoint. | False | 10 | int |
-| scan_interval | Determines how often the url will be requested. | False | 60 | int |
-| parser | Determines the parser to be used with beautifulsoup. `lxml-xml` for xml recommended and `lxml` for everything else. | False | lxml | string |
-| list_separator | Separator to be used in combination with `select_list` features. | False | , | string |
-| form_submit | See [Form-submit](#form-submit) | False | | |
-| sensor | See [Sensor](#sensorbinary-sensor) | False | | list |
-| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list |
-| button | See [Refresh button](#refresh-button) | False | | list |
+| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
+| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
+| timeout | Defines max time to wait data from the endpoint. | False | 10 | int |
+| scan_interval | Determines how often the url will be requested. | False | 60 | int |
+| parser | Determines the parser to be used with beautifulsoup. `lxml-xml` for xml recommended and `lxml` for everything else. | False | lxml | string |
+| list_separator | Separator to be used in combination with `select_list` features. | False | , | string |
+| form_submit | See [Form-submit](#form-submit) | False | | |
+| sensor | See [Sensor](#sensorbinary-sensor) | False | | list |
+| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list |
+| button | See [Refresh button](#refresh-button) | False | | list |
 
 ### Sensor/Binary Sensor
 
@@ -273,7 +280,7 @@ Configure what should happen in case of a scraping error (the css selector does
 For each multiscrape instance, a service will be created to trigger a scrape run through an automation. (For manual triggering, the button entity can now be configured.)
 The services are named `multiscrape.trigger_{name of integration}`.
 
-Multiscrape also offers a `get_content` and a `scrape` service. `get_content` retrieves the content of the website you want to scrape. It shows the same data for which you now need to enable `log_response` and open the page_soup.txt file.\
+Multiscrape also offers a `get_content` and a `scrape` service. `get_content` retrieves the content of the website you want to scrape. It shows the same data for which you now need to enable `log_response` and open the `page_soup.txt` file (or `page_json.txt` when the response is JSON).\
 `scrape` does what it says. It scrapes a website and provides the sensors and attributes.
 
 Both services accept the same configuration as what you would provide in your configuration yaml (what is described above), with a small but important caveat: if the service input contains templates, those are automatically parsed by home assistant when the service is being called. That is fine for templates like `resource` and `select`, but templates that need to be applied on the scraped data itself (like `value_template`), cannot be parsed when the service is called. Therefore you need to slightly alter the syntax and add a `!` in the middle. E.g. `{{` becomes `{!{` and `%}` becomes `%!}`. Multiscrape will then understand that this string needs to handled as a template after the service has been called.\
````
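The delimiter-escaping rule above can be sketched as a small helper. This is a hypothetical, illustrative function (not part of multiscrape); the README only documents the `{{` to `{!{` and `%}` to `%!}` rewrites, so the treatment of `}}` and `{%` below is an assumption by symmetry.

```python
def escape_for_service_call(template: str) -> str:
    """Insert '!' into Jinja delimiters so Home Assistant leaves the
    template unrendered when the service is called; multiscrape strips
    the '!' again before applying the template to the scraped data.

    Hypothetical helper: only {{ -> {!{ and %} -> %!} are documented,
    the other two rewrites are assumed by symmetry.
    """
    return (
        template.replace("{{", "{!{")
        .replace("}}", "}!}")
        .replace("{%", "{!%")
        .replace("%}", "%!}")
    )
```

For example, `escape_for_service_call("{{ value }}")` yields `{!{ value }!}`, which survives the service call intact.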

custom_components/multiscrape/parsers.py

Lines changed: 13 additions & 6 deletions
````diff
@@ -1,6 +1,8 @@
 """Content parsers for multiscrape using the Strategy pattern."""
+
 from __future__ import annotations
 
+import json
 import logging
 from abc import ABC, abstractmethod
 from typing import Any
@@ -55,8 +57,13 @@ async def parse(self, content: str, hass: Any) -> BeautifulSoup:
         )
 
 
-class JsonDetector(ContentParser):
-    """Detects JSON content. Does not parse it (JSON uses value_template only)."""
+class JsonParser(ContentParser):
+    """Parse JSON content into a Python structure.
+
+    Values are typically extracted via Jinja value_template (the canonical
+    Home Assistant pattern); the parsed structure is used for pretty-printing
+    and file logging.
+    """
 
     @property
     def name(self) -> str:
@@ -68,9 +75,9 @@ def can_parse(self, content: str) -> bool:
         content_stripped = content.lstrip() if content else ""
         return bool(content_stripped) and content_stripped[0] in ("{", "[")
 
-    async def parse(self, content: str, hass: Any) -> None:
-        """JSON is not parsed into a queryable structure."""
-        return None
+    async def parse(self, content: str, hass: Any) -> dict | list:
+        """Parse JSON content. Raises json.JSONDecodeError on malformed input."""
+        return await hass.async_add_executor_job(json.loads, content)
 
 
 class ParserFactory:
@@ -79,7 +86,7 @@ class ParserFactory:
     def __init__(self, parser_name: str):
        """Initialize with the HTML parser name."""
         self._parsers: list[ContentParser] = [
-            JsonDetector(),
+            JsonParser(),
             HtmlParser(parser_name),
         ]
 
````
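The Strategy pattern this file is built on can be sketched standalone. A minimal sketch under simplifying assumptions: `RawParser` and the module-level `get_parser` are hypothetical stand-ins for the real BeautifulSoup-backed `HtmlParser` and `ParserFactory.get_parser`, and the real `parse` methods are async and take `hass`.

```python
import json
from abc import ABC, abstractmethod


class ContentParser(ABC):
    """Strategy interface: each parser decides whether it can handle content."""

    @abstractmethod
    def can_parse(self, content: str) -> bool: ...

    @abstractmethod
    def parse(self, content: str): ...


class JsonParser(ContentParser):
    def can_parse(self, content: str) -> bool:
        # Cheap sniff: JSON documents start with '{' or '[' after whitespace.
        stripped = content.lstrip() if content else ""
        return bool(stripped) and stripped[0] in ("{", "[")

    def parse(self, content: str):
        return json.loads(content)


class RawParser(ContentParser):
    """Hypothetical catch-all fallback, standing in for the HTML parser."""

    def can_parse(self, content: str) -> bool:
        return True

    def parse(self, content: str):
        return content


def get_parser(parsers: list[ContentParser], content: str) -> ContentParser:
    # Mirrors the factory: the first strategy that claims the content wins,
    # so order matters (JSON sniffing must run before the catch-all).
    return next(p for p in parsers if p.can_parse(content))
```

Note the sniff accepts anything JSON-shaped, including malformed payloads; `json.loads` remains the real arbiter, which is why the scraper keeps a raw-content fallback.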

custom_components/multiscrape/scraper.py

Lines changed: 22 additions & 11 deletions
````diff
@@ -1,11 +1,13 @@
 """Support for multiscrape requests."""
+
+import json
 import logging
 
 from bs4 import BeautifulSoup
 
 from .const import CONF_PARSER, CONF_SEPARATOR
 from .extractors import ValueExtractor
-from .parsers import JsonDetector, ParserFactory
+from .parsers import JsonParser, ParserFactory
 from .scrape_context import ScrapeContext
 
 DEFAULT_TIMEOUT = 10
@@ -64,22 +66,32 @@ def reset(self):
 
     @property
     def formatted_content(self):
-        """Property for getting the content. HTML will be prettified."""
+        """Return the content for display: HTML prettified, JSON pretty-printed, or raw."""
         if self._soup:
             return self._soup.prettify()
+        if self._is_json and self._data:
+            try:
+                return json.dumps(
+                    json.loads(self._data), indent=2, ensure_ascii=False
+                )
+            except (json.JSONDecodeError, RecursionError):
+                # Detected as JSON-shaped but unparsable — fall back to raw.
+                return self._data
         return self._data
 
     async def set_content(self, content):
         """Set the content to be scraped."""
         self._data = content
         parser = self._parser_factory.get_parser(content)
 
-        if isinstance(parser, JsonDetector):
+        if isinstance(parser, JsonParser):
             _LOGGER.debug(
-                "%s # Response seems to be json. Skip parsing with BeautifulSoup.",
+                "%s # Response detected as JSON; skipping BeautifulSoup parsing.",
                 self._config_name,
             )
             self._is_json = True
+            if self._file_manager:
+                await self._async_file_log("page_json", self.formatted_content)
             return
 
         try:
@@ -101,7 +113,9 @@ async def set_content(self, content):
             )
             raise
 
-    def scrape(self, selector, sensor, attribute=None, context: ScrapeContext | None = None):
+    def scrape(
+        self, selector, sensor, attribute=None, context: ScrapeContext | None = None
+    ):
         """Scrape based on given selector the data."""
         if context is None:
             context = ScrapeContext.empty()
@@ -123,25 +137,22 @@ def scrape(self, selector, sensor, attribute=None, context: ScrapeContext | None
         value = self._extract_value(selector, log_prefix)
 
         if value is not None and selector.value_template is not None:
-            _LOGGER.debug(
-                "%s # Applying value_template on selector result", log_prefix)
+            _LOGGER.debug("%s # Applying value_template on selector result", log_prefix)
             render_ctx = context.with_current_value(value)
             value = selector.value_template.async_render(
                 variables=render_ctx.to_template_variables(), parse_result=True
             )
 
         _LOGGER.debug(
-            "%s # Final selector value: %s of type %s", log_prefix, value, type(
-                value)
+            "%s # Final selector value: %s of type %s", log_prefix, value, type(value)
         )
         return value
 
     def _extract_value(self, selector, log_prefix):
         """Delegate extraction to ValueExtractor."""
         if selector.is_list:
             tags = self._soup.select(selector.list)
-            _LOGGER.debug("%s # List selector selected tags: %s",
-                          log_prefix, tags)
+            _LOGGER.debug("%s # List selector selected tags: %s", log_prefix, tags)
             return self._extractor.extract_list(tags, selector)
         else:
             tag = self._soup.select_one(selector.element)
````
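The lazy pretty-print added to `formatted_content` reduces to a small pure function. A minimal sketch, mirroring the diff's pretty-print-or-raw behavior without the Home Assistant plumbing:

```python
import json


def format_json_content(raw: str) -> str:
    """Pretty-print JSON-shaped text, falling back to the raw string.

    Detection upstream is only a first-character sniff, so the payload
    may still be malformed; json.loads is the real arbiter here.
    """
    try:
        # ensure_ascii=False keeps non-ASCII characters readable in logs.
        return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)
    except (json.JSONDecodeError, RecursionError):
        # JSON-shaped but unparsable (e.g. '{oops'), or pathologically
        # deep nesting: return the content untouched.
        return raw
```

Because parsing happens only when the property is read, well-formedness is checked lazily, at display or file-logging time rather than on every response.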
