Commit a8a57e8

committed
updates
1 parent 1bf1099 commit a8a57e8

5 files changed

Lines changed: 346 additions & 1 deletion

Lines changed: 345 additions & 0 deletions
@@ -0,0 +1,345 @@
1+
---
2+
title: Self-Healing Tailscale Nodes
3+
date: 2026-01-13
4+
description: Stabilize Tailscale Exit Nodes running as Proxmox LXCs and integrate them into Home Assistant for active monitoring and automated recovery.
5+
tags:
6+
- Tailscale
7+
- Home Assistant
8+
- Proxmox
9+
- LXC
10+
- Automation
11+
---
12+
13+
# Self-Healing Tailscale Nodes
14+
15+
This guide documents how to stabilize Tailscale Exit Nodes running as Proxmox LXCs and integrate them into Home Assistant for active monitoring and automated recovery.
16+
17+
## Prerequisites
18+
19+
Before starting, ensure you have:
20+
21+
* **Proxmox VE**: Hosting your Tailscale LXC container.
22+
* **Home Assistant**: Running and connected to your network.
23+
* **Proxmox Integration**: Installed in Home Assistant (via HACS or Official) and configured with permissions to restart LXCs.
24+
* **Tailscale**: Installed on both the LXC and Home Assistant.
25+
* **SSH/Console Access**: To the Proxmox LXC to create scripts.
26+
27+
## 1. Proxmox LXC Resource Settings
28+
29+
To keep these nodes "lightweight" but stable as Exit Nodes, we used the following hardware allocation in Proxmox:
30+
31+
![Proxmox Resources](../self-healing-tailscale-nodes/proxmox-tailscale-resources.png)
32+
33+
* **Template**: `debian-12-standard` (or similar Ubuntu/Debian template).
34+
* **CPU**: 1 Core (Tailscale is lightweight; one core is usually plenty for an exit node, though peak throughput depends on the host CPU).
35+
* **Memory**: 512 MB (You can go as low as 256MB, but 512MB ensures the netcheck and ping scripts run smoothly without OOM issues).
36+
* **Disk**: 4 GB (Tailscale and its logs take up very little space).
37+
* **Unprivileged Container**: Yes (For security).
38+
* **Features**: Ensure **Nesting** is checked. (Required for Tailscale's internal networking).
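
Tailscale's documentation for LXC also notes that an unprivileged container needs access to `/dev/net/tun`. If `tailscaled` fails to start with an error about that device, the sketch below (run on the Proxmox host, not in the container) appends the two relevant entries to the container config. It assumes container ID 101, as used later in this guide, and a config file without snapshot sections; `allow_tun` is an illustrative helper name.

```bash
#!/bin/bash
# Sketch: grant an unprivileged LXC access to /dev/net/tun (run on the Proxmox host).
# allow_tun is a hypothetical helper name; adjust the container ID to your own.
allow_tun() {
    local conf="$1"
    local line
    for line in \
        "lxc.cgroup2.devices.allow: c 10:200 rwm" \
        "lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file"
    do
        # Append each entry only if it is not already present (safe to re-run).
        grep -qxF "$line" "$conf" || echo "$line" >> "$conf"
    done
}

# Usage (assumes container ID 101; restart the container afterwards):
# allow_tun /etc/pve/lxc/101.conf && pct reboot 101
```

Because each entry is only appended when missing, running the helper twice leaves the config unchanged.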
39+
40+
## 2. The Critical Networking Fix (DHCPv6 Loop)
41+
42+
Before running any commands, we had to fix the Proxmox network configuration to prevent the "Death Loop."
43+
44+
By default, many Linux LXC templates negotiate both IPv4 and IPv6 DHCP. If your router (e.g., UniFi) isn't providing a DHCPv6 lease, the networking service enters an infinite `XMT: Solicit` loop and crashes.
45+
46+
In the **Proxmox GUI**, go to **LXC > Network > Edit eth0**:
47+
48+
![Proxmox Network Config](../self-healing-tailscale-nodes/proxmox-network.png)
49+
50+
* **IPv4**: DHCP (Assigned via UniFi with a Static IP Reservation).
51+
* **IPv6**: Static, but with **all fields left completely blank**.
52+
53+
*Why? This stops the container's DHCP client from soliciting a DHCPv6 lease that never comes.*
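
To confirm that a container is actually affected before changing its network settings, one option is to count the solicit messages in its logs. This is a minimal sketch that assumes the DHCP client logs to the systemd journal; `count_solicits` is an illustrative name.

```bash
#!/bin/bash
# Sketch: count DHCPv6 solicit retries in log text read from stdin.
count_solicits() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
    grep -c 'XMT: Solicit' || true
}

# Usage inside the container (current boot only):
# journalctl -b --no-pager | count_solicits
```

A count that keeps growing between runs suggests the loop described above; a count of 0 suggests this fix is not the one you need.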
54+
55+
## 3. Installation & Initial Setup
56+
57+
Once the LXC was started, we ran these commands in the terminal:
58+
59+
**Update the OS:**
60+
```bash
apt update && apt upgrade -y
```
63+
64+
**Install Tailscale:**
65+
```bash
curl -fsSL https://tailscale.com/install.sh | sh
```
68+
69+
**Enable the Service (Persistence):**
70+
```bash
systemctl enable --now tailscaled
```
73+
74+
## 4. Authenticating as an Exit Node
75+
76+
To make these nodes useful for your travel router or remote access, we initialized them with specific flags:
77+
78+
```bash
tailscale up --advertise-exit-node --accept-routes=false
```
81+
82+
* `--advertise-exit-node`: Tells your Tailnet this node can act as an internet gateway.
83+
* `--accept-routes=false`: (**Crucial for LXCs**) Prevents the node from trying to route local subnet traffic through the tunnel, which often breaks SSH access to the container.
84+
85+
## 5. Enabling IP Forwarding (The "Engine")
86+
87+
For an Exit Node to actually pass traffic from other devices to the internet, Linux needs "IP Forwarding" enabled.
88+
89+
Run these to enable it immediately and permanently:
90+
91+
```bash
# Enable for IPv4
echo 'net.ipv4.ip_forward = 1' | tee -a /etc/sysctl.conf
# Enable for IPv6
echo 'net.ipv6.conf.all.forwarding = 1' | tee -a /etc/sysctl.conf
# Apply changes
sysctl -p
```
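
After applying the settings, it can be worth confirming that the kernel actually picked them up. The following is a minimal sketch that reads the two switches from `/proc/sys`; `forwarding_enabled` is an illustrative helper name, not part of Tailscale.

```bash
#!/bin/bash
# Sketch: report whether a forwarding switch under /proc/sys is set to 1.
forwarding_enabled() {
    local path="$1"
    [[ -r "$path" && "$(cat "$path")" == "1" ]]
}

for f in /proc/sys/net/ipv4/ip_forward /proc/sys/net/ipv6/conf/all/forwarding; do
    if forwarding_enabled "$f"; then
        echo "OK   $f"
    else
        echo "OFF  $f"
    fi
done
```

Both lines should read `OK` on a working exit node; an `OFF` after a reboot usually means the setting was applied at runtime but not persisted.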
99+
100+
101+
102+
## 6. The Heartbeat Script
103+
104+
This script is the brain of the self-healing mechanism. Unlike a simple "is the process running" check, it validates the actual data path and reports health status to Home Assistant.
105+
106+
We use a Bash script inside the LXC to perform a multi-stage check:
107+
1. **Daemon Check**: If `tailscaled` is stopped, it attempts to start it immediately.
108+
2. **Connectivity Check**: It pings Home Assistant to verify the tunnel is actually passing traffic.
109+
110+
### Create the script
111+
112+
Create the file at `/usr/local/bin/ts-heartbeat.sh`:
113+
114+
```bash
#!/bin/bash
# Requires curl and jq inside the LXC (apt install -y curl jq).
TAILSCALE="/usr/bin/tailscale"
CURL="/usr/bin/curl"
HA_WEBHOOK_URL="http://10.0.0.105:8123/api/webhook/tailscale_halo_heartbeat"
HA_TAILSCALE_IP="100.92.181.98"

# 1. RECOVERY: If Tailscale is stopped, force it up with your Exit Node flags
if [[ $($TAILSCALE status) == *"Tailscale is stopped."* ]]; then
    $TAILSCALE up --advertise-exit-node --accept-routes=false > /dev/null 2>&1
    sleep 5
fi

# 2. INTEL: Refresh only the Tailscale repo to see if a new version exists
# This keeps the 'Candidate' version in apt-cache accurate for Gemini
apt-get update -o Dir::Etc::sourcelist="sources.list.d/tailscale.list" \
    -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0" > /dev/null 2>&1

# 3. GATHER DATA
CURRENT_VER=$($TAILSCALE version | head -n 1)
LATEST_VER=$(apt-cache policy tailscale | grep Candidate | awk '{print $2}')
# Extract Health warnings (converts JSON array to a comma-separated string).
# '// []' treats a missing or null Health field as an empty list instead of a jq error.
HEALTH_DATA=$($TAILSCALE status --json | jq -r '(.Health // []) | join(", ")')
[[ -z "$HEALTH_DATA" || "$HEALTH_DATA" == "null" ]] && HEALTH_DATA="Healthy"

# 4. DEEP PULSE: Ping the HA Tailscale IP and send the JSON payload
if $TAILSCALE ping -c 1 -timeout 2s "$HA_TAILSCALE_IP" > /dev/null 2>&1; then
    $CURL -s -X POST "$HA_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{
            \"current\": \"$CURRENT_VER\",
            \"latest\": \"$LATEST_VER\",
            \"health\": \"$HEALTH_DATA\"
        }" > /dev/null 2>&1
fi
```
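
The payload in step 4 interpolates `HEALTH_DATA` directly into a JSON string, so a health message containing a double quote or backslash would produce invalid JSON. One way to guard against that, sketched here in pure Bash, is to escape the value first. `json_escape` is an illustrative helper, not something Tailscale or Home Assistant provides, and it covers only the most common characters.

```bash
#!/bin/bash
# Sketch: escape a string for embedding inside a JSON double-quoted value.
# Handles backslashes, double quotes, newlines, and tabs; other control characters are not covered.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}     # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}     # double quote
    s=${s//$'\n'/\\n}   # newline
    s=${s//$'\t'/\\t}   # tab
    printf '%s' "$s"
}

# Usage in the heartbeat script, just before the curl call:
# HEALTH_DATA=$(json_escape "$HEALTH_DATA")
```

Since `jq` is already a dependency of the script, building the whole payload with `jq -n --arg` would be an alternative with the same effect.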
150+
151+
### Apply Permissions & Enable Service
152+
153+
The script must be executable, and the Tailscale service must be enabled to start on boot (critical for reboots):
154+
155+
```bash
chmod +x /usr/local/bin/ts-heartbeat.sh
systemctl enable tailscaled
```
159+
160+
161+
## Methodology: Why ping Home Assistant?
162+
163+
We specifically check connectivity to **Home Assistant** rather than to the general Internet (`netcheck`) or to another peer (`Halo <-> Edge`):
164+
165+
1. **Verifies Data Plane**: `tailscale status` only checks if the daemon is running. `tailscale ping` confirms packets can actually flow through the tunnel.
166+
2. **ISP Outage Proof**: Tailscale can keep routing traffic between peers on the same LAN even if the Internet is down. By pinging a local peer (Home Assistant), we prevent the node from rebooting in a loop during a simple ISP outage.
167+
3. **Fail-Safe**: If Home Assistant goes down, the automation engine (Watchdog) is also down. This creates a natural fail-safe where the node won't be rebooted accidentally if the monitoring server itself crashes.
168+
169+
## Scheduling with Cron
170+
171+
To ensure the pulse is consistent, we schedule the script to run every minute using the root user's crontab.
172+
173+
1. Open crontab: `crontab -e`
174+
2. Add the following line at the bottom (make sure the line ends with a newline, or cron will ignore it):
175+
176+
```plaintext
* * * * * /usr/local/bin/ts-heartbeat.sh
```
179+
180+
## Home Assistant Integration (The Package)
181+
182+
Instead of scattered sensors, we bundle the logic into a single Home Assistant Package. This includes a template binary sensor, a maintenance toggle, and the self-healing automation.
183+
184+
**File:** `/config/packages/halo_tailscale.yaml`
185+
186+
```yaml
# 0. SSH COMMANDS
shell_command:
  update_tailscale_halo: >
    ssh -i /config/.ssh/id_rsa
    -o IdentitiesOnly=yes
    -o BatchMode=yes
    -o ConnectTimeout=10
    -o StrictHostKeyChecking=no
    -o UserKnownHostsFile=/dev/null
    root@10.0.0.89
    'pct exec 101 -- tailscale update --yes'

script:
  update_tailscale_halo_node:
    alias: "Action: Update Tailscale Halo"
    icon: mdi:cloud-upload
    sequence:
      - action: shell_command.update_tailscale_halo
      - action: script.notify_smart_master
        data:
          title: "Tailscale Halo"
          message: "Update command sent to Proxmox Host (10.0.0.89). LXC 101 will reboot shortly."
          category: "system"
          tag: "tailscale_halo"

# 1. MAINTENANCE TOGGLE
# Use this in your dashboard to stop reboots during manual work
input_boolean:
  halo_maintenance_mode:
    name: "Halo Maintenance Mode"
    icon: mdi:wrench

# 2. HEARTBEAT SENSOR
# Receives the pulse from the LXC and turns off after 2 minutes of silence
template:
  - trigger:
      - platform: webhook
        webhook_id: "tailscale_halo_heartbeat"
        local_only: true
    binary_sensor:
      - name: "Tailscale Halo Heartbeat"
        state: "on"
        auto_off: "00:02:00"
        attributes:
          current_version: "{{ trigger.json.current }}"
          latest_version: "{{ trigger.json.latest }}"
          update_available: "{{ trigger.json.current != trigger.json.latest }}"
          health_status: "{{ trigger.json.health | default('Healthy') }}"

# 3. SELF-HEALING AUTOMATION
# Reboots the LXC via Proxmox if heartbeat is lost for 5 minutes
automation:
  - alias: "Tailscale Halo Auto-Recover"
    id: "tailscale_halo_auto_recover"
    trigger:
      - platform: state
        entity_id: binary_sensor.tailscale_halo_heartbeat
        to: "off"
        for: "00:05:00"
    condition:
      - condition: and
        conditions:
          # Condition 1: Maintenance mode must be OFF
          - condition: state
            entity_id: input_boolean.halo_maintenance_mode
            state: "off"

          # Condition 2: Container must be running (Proxmox Status)
          - condition: state
            entity_id: binary_sensor.lxc_tailscale_halo_101_status
            state: "on"
    action:
      - action: button.press
        target:
          # Verify your specific entity ID in HA
          entity_id: button.lxc_tailscale_halo_101_reboot
      - action: script.notify_smart_master
        data:
          title: "⚠️ Tailscale Halo Down"
          message: "Connectivity lost for 5m. Proxmox reboot command sent."
          category: "system"
          critical: true
          tag: "tailscale_halo"

  # 4. GEMINI ADVISOR (Updates & Health Issues)
  - alias: "Tailscale Gemini Advisor"
    id: "ts_gemini_advisor"
    trigger:
      # Trigger A: New version detected
      - platform: state
        entity_id: binary_sensor.tailscale_halo_heartbeat
        attribute: update_available
        to: true
      # Trigger B: Health status is no longer "Healthy"
      - platform: template
        value_template: >
          {{ state_attr('binary_sensor.tailscale_halo_heartbeat', 'health_status') not in ['Healthy', 'OK', 'null', none] }}
    action:
      - action: conversation.process
        data:
          agent_id: conversation.gemini_web_advisor # Dedicated Agent with Google Search enabled
          text: >
            Analyzing Tailscale Halo Node status.
            Update available: {{ state_attr('binary_sensor.tailscale_halo_heartbeat', 'update_available') }}
            Current Health: "{{ state_attr('binary_sensor.tailscale_halo_heartbeat', 'health_status') }}"
            Running Version: {{ state_attr('binary_sensor.tailscale_halo_heartbeat', 'current_version') }}
            Latest Version: {{ state_attr('binary_sensor.tailscale_halo_heartbeat', 'latest_version') }}

            If there is a health error, explain potential fixes for a Proxmox LXC context.
            If there is an update, search the web for the Tailscale changelog. Summarize the major changes and check for any breaking changes related to Subnet Routers or Exit Nodes.

            IMPORTANT: Keep your response extremely concise (max 3 sentences) as it will be sent as a phone notification.
        response_variable: gemini_result
      - action: script.notify_smart_master
        data:
          title: "🚀 Gemini Tailscale Advice"
          message: "{{ gemini_result.response.speech.plain.speech }}"
          category: "system"
          tag: "tailscale_halo"
          sticky: true
          actions:
            - action: "UPDATE_TAILSCALE_HALO"
              title: "Update Now"
              activationMode: "background"

  # 5. HANDLE NOTIFICATION ACTION
  - alias: "Tailscale Halo: Action Handler"
    id: "tailscale_halo_action_handler"
    trigger:
      - platform: event
        event_type: mobile_app_notification_action
        event_data:
          action: "UPDATE_TAILSCALE_HALO"
    action:
      - action: script.update_tailscale_halo_node
```
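
The `update_available` attribute in the package is true whenever the two version strings differ, which includes the case where the installed version is newer than the apt candidate. If that distinction matters, the comparison could instead be made in the heartbeat script and sent as an extra field. The sketch below assumes GNU `sort -V` is available (it is on Debian); `version_lt` is an illustrative helper name.

```bash
#!/bin/bash
# Sketch: succeed only when $1 is strictly older than $2 (GNU sort -V ordering).
version_lt() {
    [[ "$1" != "$2" ]] &&
        [[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" == "$1" ]]
}

# Usage in the heartbeat script (the "update" payload field is hypothetical):
# if version_lt "$CURRENT_VER" "$LATEST_VER"; then UPDATE=true; else UPDATE=false; fi
```

Version-aware sorting matters here because a plain string comparison would order `1.9.0` after `1.10.0`.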
323+
324+
325+
## 7. AI-Driven Maintenance: The Gemini Advisor
326+
327+
Rather than blindly updating our infrastructure, we utilize the Google Gemini integration in Home Assistant to perform a "Pre-Flight Check" and diagnose health issues.
328+
329+
> [!NOTE]
330+
> **Requirement:** You must enable the **Google Search Tool** in your Google Gemini integration settings for the agent to look up real-time changelogs.
331+
332+
This automation triggers in two scenarios:
333+
1. **Version Mismatch**: Gemini searches the web for the latest Tailscale changelog, summarizes the major changes, and checks for breaking changes related to Subnet Routers or Exit Nodes.
334+
2. **Health Warnings**: If `tailscale status --json` reports a health warning (e.g., a specific sub-service failure), Gemini explains the warning and suggests a fix for a Proxmox LXC context.
335+
336+
This provides a human-readable recommendation directly to your mobile device before you ever touch the terminal.
337+
338+
## Verification & Testing
339+
340+
To verify the system is working:
341+
342+
343+
1. **Verify Pulse**: Check the "Last Updated" attribute of the `binary_sensor` in HA; it should refresh every 60 seconds.
344+
2. **Test Failure**: Edit the crontab (`crontab -e`) and comment out the script line to simulate a total failure.
345+
3. **Confirm Automation**: Within 2 minutes of the last heartbeat, the HA sensor should flip to `off`. Once it has stayed `off` for a further 5 minutes (about 7 minutes after the last heartbeat), you should receive a notification and see the LXC reboot in Proxmox.

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ render_macros: true
 <div class="feature-grid">
 <a href="workflow/" class="feature-card">
 <span>Documentation</span>
-<h4>Architecture & Workflow</h4>
+<h4>Documentation & Workflow</h4>
 <p>See how the Home Assistant documentation is managed via an Agentic Documentation Workflow.</p>
 </a>
 <a href="smart-home/dashboards/" class="feature-card">
