63 changes: 35 additions & 28 deletions README.md
@@ -1,7 +1,9 @@
# Multiscrape

---

> [!TIP]
>
> ## 👋 A quick note
>
> I run **[Smart Home Newsletter](https://smarthomenewsletter.com/?utm_source=github&utm_medium=readme&utm_campaign=multiscrape)** — a weekly curated digest for smart home enthusiasts.
@@ -11,9 +13,8 @@
> Since you're here, you're clearly into home automation — so you might genuinely enjoy the newsletter.
>
> 👉 [Subscribe at smarthomenewsletter.com](https://smarthomenewsletter.com/?utm_source=github&utm_medium=readme&utm_campaign=multiscrape)
>
---

---

[![GitHub Release][releases-shield]][releases]
[![License][license-shield]](LICENSE)
@@ -33,11 +34,12 @@
## Need help with Multiscrape?

### Personal (paid) support option

I very often get asked for help, for example with finding the right CSS selectors or with a login. In fact, more often than I can handle, so I'm running an experiment with a paid support option!

**Sponsor me [here](https://github.com/sponsors/danieldotnl/sponsorships?tier_id=432422), and I'll try to assist you with your `multiscrape` configuration within 1-2 days.** The support funds will go towards family time, making up for the hours I spend on Home Assistant ☺️.

**Note:** Scraping isn't always possible. I'd love to offer a "no cure, no pay" service, but GitHub Sponsoring doesn't support that. If you're concerned about sponsoring without guarentee, please reach out by email before sponsoring!
**Note:** Scraping isn't always possible. I'd love to offer a "no cure, no pay" service, but GitHub Sponsoring doesn't support that. If you're concerned about sponsoring without guarantee, please reach out by email before sponsoring!

### Other options

@@ -67,7 +69,8 @@ It is based on both the existing [Rest sensor](https://www.home-assistant.io/int
Install via HACS (default store) or install manually by copying the files in a new 'custom_components/multiscrape' directory.

## Example configuration (YAML)
*This code example is to be placed into /config/configuration.yaml*

_This code example is to be placed into /config/configuration.yaml_

```yaml
multiscrape:
@@ -100,17 +103,21 @@ multiscrape:
select: ".release-date"
attribute: href
```

### Advanced Example Configuration (YAML)

For background on splitting the HA configuration, see the [HA Documentation](https://www.home-assistant.io/docs/configuration/splitting_configuration/).

*Inside the configuration.yaml file*
_Inside the configuration.yaml file_

```yaml
multiscrape: !include multiscrape.yaml
```

Make a new file named /config/multiscrape.yaml

*Inside the multiscrape.yaml file. Syntax is the same but starting at the resource level*
_Inside the multiscrape.yaml file. Syntax is the same but starting at the resource level_

```yaml
- resource: https://www.home-assistant.io
scan_interval: 3600
@@ -145,28 +152,28 @@

Based on latest (pre) release.

| name | description | required | default | type |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------- | -------- | ------- | --------------- |
| name | The name for the integration. | False | | string |
| resource | The url for retrieving the site or a template that will output an url. Not required when `resource_template` is provided. | True | | string |
| resource_template | A template that will output an url after being rendered. Only required when `resource` is not provided. | True | | template |
| authentication | Configure HTTP authentication. `basic` or `digest`. Use this with username and password fields. | False | | string |
| username | The username for accessing the url. | False | | string |
| password | The password for accessing the url. | False | | string |
| headers | The headers for the requests. | False | | template - list |
| params | The query params for the requests. | False | | template - list |
| method | The method for the request. Either `POST` or `GET`. | False | GET | string |
| name | description | required | default | type |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------- | -------- | ------- | ----------------- |
| name | The name for the integration. | False | | string |
| resource | The url for retrieving the site or a template that will output an url. Not required when `resource_template` is provided. | True | | string |
| resource_template | A template that will output an url after being rendered. Only required when `resource` is not provided. | True | | template |
| authentication | Configure HTTP authentication. `basic` or `digest`. Use this with username and password fields. | False | | string |
| username | The username for accessing the url. | False | | string |
| password | The password for accessing the url. | False | | string |
| headers | The headers for the requests. | False | | template - list |
| params | The query params for the requests. | False | | template - list |
| method | The method for the request. Either `POST` or `GET`. | False | GET | string |
| payload | Optional payload to send with a POST request. | False | | template - string |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
| timeout | Defines max time to wait data from the endpoint. | False | 10 | int |
| scan_interval | Determines how often the url will be requested. | False | 60 | int |
| parser | Determines the parser to be used with beautifulsoup. `lxml-xml` for xml recommended and `lxml` for everything else. | False | lxml | string |
| list_separator | Separator to be used in combination with `select_list` features. | False | , | string |
| form_submit | See [Form-submit](#form-submit) | False | | |
| sensor | See [Sensor](#sensorbinary-sensor) | False | | list |
| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list |
| button | See [Refresh button](#refresh-button) | False | | list |
| verify_ssl | Verify the SSL certificate of the endpoint. | False | True | boolean |
| log_response | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config) | False | False | boolean |
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix file path typo in log_response description.

There’s a missing space in `written to/config/...`; this reads like an invalid path and can confuse users copying the instructions.

✏️ Suggested docs fix
-| log_response      | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to/config/multiscrape/name_of_config)  | False    | False   | boolean           |
+| log_response      | Log the HTTP responses and HTML parsed by BeautifulSoup in files. (Will be written to /config/multiscrape/name_of_config) | False    | False   | boolean           |

| timeout | Defines max time to wait data from the endpoint. | False | 10 | int |
| scan_interval | Determines how often the url will be requested. | False | 60 | int |
| parser | Determines the parser to be used with beautifulsoup. `lxml-xml` for xml recommended and `lxml` for everything else. | False | lxml | string |
| list_separator | Separator to be used in combination with `select_list` features. | False | , | string |
| form_submit | See [Form-submit](#form-submit) | False | | |
| sensor | See [Sensor](#sensorbinary-sensor) | False | | list |
| binary_sensor | See [Binary sensor](#sensorbinary-sensor) | False | | list |
| button | See [Refresh button](#refresh-button) | False | | list |

### Sensor/Binary Sensor

@@ -273,7 +280,7 @@ Configure what should happen in case of a scraping error (the css selector does
For each multiscrape instance, a service will be created to trigger a scrape run through an automation. (For manual triggering, the button entity can now be configured.)
The services are named `multiscrape.trigger_{name of integration}`.

Multiscrape also offers a `get_content` and a `scrape` service. `get_content` retrieves the content of the website you want to scrape. It shows the same data for which you now need to enable `log_response` and open the page_soup.txt file.\
Multiscrape also offers a `get_content` and a `scrape` service. `get_content` retrieves the content of the website you want to scrape. It shows the same data for which you now need to enable `log_response` and open the `page_soup.txt` file (or `page_json.txt` when the response is JSON).\
`scrape` does what it says. It scrapes a website and provides the sensors and attributes.

Both services accept the same configuration you would provide in your configuration YAML (as described above), with a small but important caveat: if the service input contains templates, Home Assistant automatically renders them when the service is called. That is fine for templates like `resource` and `select`, but templates that need to be applied to the scraped data itself (like `value_template`) cannot be rendered at call time. Therefore you need to slightly alter the syntax and add a `!` in the middle. E.g. `{{` becomes `{!{` and `%}` becomes `%!}`. Multiscrape will then understand that this string needs to be handled as a template after the service has been called.\
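For illustration, a `scrape` service call with an escaped `value_template` might look like this (the selector and template below are made up, and the closing braces are assumed to be escaped symmetrically, i.e. `}}` becomes `}!}`):

```yaml
service: multiscrape.scrape
data:
  resource: https://www.home-assistant.io
  sensor:
    - name: Current version
      select: ".current-version h1"
      # Escaped so Home Assistant does not render the template when the
      # service is called; multiscrape applies it to the scraped value.
      value_template: "{!{ value.split(':')[1] }!}"
```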
19 changes: 13 additions & 6 deletions custom_components/multiscrape/parsers.py
@@ -1,6 +1,8 @@
"""Content parsers for multiscrape using the Strategy pattern."""

from __future__ import annotations

import json
import logging
from abc import ABC, abstractmethod
from typing import Any
@@ -55,8 +57,13 @@ async def parse(self, content: str, hass: Any) -> BeautifulSoup:
)


class JsonDetector(ContentParser):
"""Detects JSON content. Does not parse it (JSON uses value_template only)."""
class JsonParser(ContentParser):
"""Parse JSON content into a Python structure.

Values are typically extracted via Jinja value_template (the canonical
Home Assistant pattern); the parsed structure is used for pretty-printing
and file logging.
"""

@property
def name(self) -> str:
@@ -68,9 +75,9 @@ def can_parse(self, content: str) -> bool:
content_stripped = content.lstrip() if content else ""
return bool(content_stripped) and content_stripped[0] in ("{", "[")

async def parse(self, content: str, hass: Any) -> None:
"""JSON is not parsed into a queryable structure."""
return None
async def parse(self, content: str, hass: Any) -> dict | list:
"""Parse JSON content. Raises json.JSONDecodeError on malformed input."""
return await hass.async_add_executor_job(json.loads, content)


class ParserFactory:
@@ -79,7 +86,7 @@ class ParserFactory:
def __init__(self, parser_name: str):
"""Initialize with the HTML parser name."""
self._parsers: list[ContentParser] = [
JsonDetector(),
JsonParser(),
HtmlParser(parser_name),
]

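The first-match strategy behind `ParserFactory` (JSON checked before HTML) can be sketched in isolation. This is a simplified stand-in for illustration, not the component's actual classes:

```python
import json


def looks_like_json(content: str) -> bool:
    # Same heuristic as JsonParser.can_parse: a JSON payload starts
    # with '{' or '[' once leading whitespace is stripped.
    stripped = content.lstrip() if content else ""
    return bool(stripped) and stripped[0] in ("{", "[")


def pick_parser(content: str) -> str:
    # First matching strategy wins; HTML is the fallback.
    return "json" if looks_like_json(content) else "html"


print(pick_parser('  {"release": "2024.6"}'))  # → json
print(pick_parser("<html><body></body></html>"))  # → html
```

Note the heuristic only detects JSON-shaped text; malformed content that starts with `{` still routes to the JSON parser, which is why the parse step raises `json.JSONDecodeError` on bad input.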
33 changes: 22 additions & 11 deletions custom_components/multiscrape/scraper.py
@@ -1,11 +1,13 @@
"""Support for multiscrape requests."""

import json
import logging

from bs4 import BeautifulSoup

from .const import CONF_PARSER, CONF_SEPARATOR
from .extractors import ValueExtractor
from .parsers import JsonDetector, ParserFactory
from .parsers import JsonParser, ParserFactory
from .scrape_context import ScrapeContext

DEFAULT_TIMEOUT = 10
@@ -64,22 +66,32 @@ def reset(self):

@property
def formatted_content(self):
"""Property for getting the content. HTML will be prettified."""
"""Return the content for display: HTML prettified, JSON pretty-printed, or raw."""
if self._soup:
return self._soup.prettify()
if self._is_json and self._data:
try:
return json.dumps(
json.loads(self._data), indent=2, ensure_ascii=False
)
except (json.JSONDecodeError, RecursionError):
# Detected as JSON-shaped but unparsable — fall back to raw.
return self._data
return self._data

async def set_content(self, content):
"""Set the content to be scraped."""
self._data = content
parser = self._parser_factory.get_parser(content)

if isinstance(parser, JsonDetector):
if isinstance(parser, JsonParser):
_LOGGER.debug(
"%s # Response seems to be json. Skip parsing with BeautifulSoup.",
"%s # Response detected as JSON; skipping BeautifulSoup parsing.",
self._config_name,
)
self._is_json = True
if self._file_manager:
await self._async_file_log("page_json", self.formatted_content)
return

try:
@@ -101,7 +113,9 @@ async def set_content(self, content):
)
raise

def scrape(self, selector, sensor, attribute=None, context: ScrapeContext | None = None):
def scrape(
self, selector, sensor, attribute=None, context: ScrapeContext | None = None
):
"""Scrape based on given selector the data."""
if context is None:
context = ScrapeContext.empty()
@@ -123,25 +137,22 @@ def scrape(self, selector, sensor, attribute=None, context: ScrapeContext | None
value = self._extract_value(selector, log_prefix)

if value is not None and selector.value_template is not None:
_LOGGER.debug(
"%s # Applying value_template on selector result", log_prefix)
_LOGGER.debug("%s # Applying value_template on selector result", log_prefix)
render_ctx = context.with_current_value(value)
value = selector.value_template.async_render(
variables=render_ctx.to_template_variables(), parse_result=True
)

_LOGGER.debug(
"%s # Final selector value: %s of type %s", log_prefix, value, type(
value)
"%s # Final selector value: %s of type %s", log_prefix, value, type(value)
)
return value

def _extract_value(self, selector, log_prefix):
"""Delegate extraction to ValueExtractor."""
if selector.is_list:
tags = self._soup.select(selector.list)
_LOGGER.debug("%s # List selector selected tags: %s",
log_prefix, tags)
_LOGGER.debug("%s # List selector selected tags: %s", log_prefix, tags)
return self._extractor.extract_list(tags, selector)
else:
tag = self._soup.select_one(selector.element)
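The pretty-print-with-fallback behavior added to `formatted_content` can be exercised on its own. A standalone sketch of the same logic, outside the class:

```python
import json


def pretty_json_or_raw(data: str) -> str:
    # Mirrors the formatted_content fallback: pretty-print when the
    # payload parses as JSON, return it untouched otherwise.
    try:
        return json.dumps(json.loads(data), indent=2, ensure_ascii=False)
    except (json.JSONDecodeError, RecursionError):
        # JSON-shaped but unparsable content falls back to the raw string.
        return data


print(pretty_json_or_raw('{"a": 1}'))
print(pretty_json_or_raw("{broken"))  # → {broken
```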