Commit 3d147f7

WAN stability detection and backoff to prevent SFE kernel errors (#517)
* Add WAN stability detection and backoff to prevent SFE kernel errors

  When WAN interfaces flap (reboots, GPON swaps, ISP outages), wansteer's health checker toggled state every ~6-10s, flushing SFE 3-6 times/min while the gateway's own failover engine manipulated the same conntrack entries. This race produced sfe_ipv4_remove_connection kernel errors.

  Changes:
  - Startup grace period: wait for WAN interfaces to be link-up and stable before applying rules (prevents stale assumptions post-reboot)
  - Instability detection: track state transition frequency per-WAN, mark as "unstable" when transitions exceed threshold in window
  - Backoff mode: when any WAN is unstable, suppress SFE/conntrack flushes from health transitions and reconciliation drift detection. On recovery, single forced flush + full rule reapply for clean state.
  - SFE flush coalescing: minimum 10s interval between flushes, redundant calls within cooldown are skipped

  New config options (all with sensible defaults):
  startup_grace_seconds (30), instability_threshold (3), instability_window_seconds (300), backoff_recovery_seconds (60), sfe_flush_cooldown_seconds (10)

* Fix reconciler false drift detection when WAN is down

  expectedRuleCount() compared against the full config, but reapplyRules() disables traffic classes targeting unhealthy WANs. This mismatch caused the reconciler to detect "drift" every 30s while a WAN was down, triggering unnecessary SFE flushes each time.

  Fix: pass unhealthy WANs to expectedRuleCount so it skips the same traffic classes that reapplyRules disables.

* Refactor: deduplicate onStateChange callback, remove closure capture

  - Extract onHealthChange to a named function used by both initial setup and SIGHUP reload (was duplicated inline)
  - Move stability check into health tick case (remove separate ticker)
  - Pass inBackoff explicitly through health checker instead of closing over a variable from the main loop scope

* Show warning when deployed wansteer binary version is outdated

  Compares the app's assembly version against the wansteer status JSON version field after each status poll. Shows an alert-warning banner in the status card when versions don't match. Suppressed when both are dev, or either has a non-release version string (alpha, pre-release, +metadata). Warns when one is dev and the other is a tagged release, or when both are different tagged releases.

* Stamp Go binary versions in macOS install script

  The Mac install script was building cfspeedtest, uwnspeedtest, and wansteer without passing -X main.version, so all binaries reported "dev". Now uses git describe --tags --always to stamp the version, matching what the MSI build and Dockerfile already do.

* Show wansteer version in daemon status metrics

* Fix status JSON parse failure from missing newline after cat

  The SSH command concatenated the ---VERSION--- delimiter onto the last line of the status JSON (cat doesn't guarantee a trailing newline). This caused JsonSerializer to fail parsing every poll, leaving _parsedStatus null and hiding uptime, rule count, WAN health, and version metrics.

  Fix: add 'echo' after cat to ensure a newline separator.

* Remove per-WAN health metrics from status card

* Fix uptime display timezone - use DateTimeOffset to preserve offset
1 parent 223697d commit 3d147f7

9 files changed

Lines changed: 851 additions & 64 deletions

scripts/install-macos-native.sh

Lines changed: 9 additions & 4 deletions
@@ -261,6 +261,11 @@ echo "[3b/9] Building Go binaries..."
 if command -v go &> /dev/null; then
     mkdir -p "$INSTALL_DIR/tools"

+    # Get version from git tags for Go binary version stamps
+    GO_VERSION=$(cd "$REPO_ROOT" && git describe --tags --always 2>/dev/null || echo "dev")
+    GO_VERSION="${GO_VERSION#v}"  # strip leading v
+    echo "Go binary version: $GO_VERSION"
+
     # Detect Go architecture for local binary
     GO_ARCH="amd64"
     if [ "$ARCH" = "arm64" ]; then
@@ -271,7 +276,7 @@ if command -v go &> /dev/null; then
     if [ -d "$CFSPEEDTEST_SRC" ]; then
         cd "$CFSPEEDTEST_SRC"
         CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -trimpath \
-            -ldflags "-s -w" \
+            -ldflags "-s -w -X main.version=$GO_VERSION" \
             -o "$INSTALL_DIR/tools/cfspeedtest-linux-arm64" .
         echo "Built cfspeedtest for linux/arm64"
     else
@@ -283,12 +288,12 @@ if command -v go &> /dev/null; then
         cd "$UWNSPEEDTEST_SRC"
         # Build local binary for server-side WAN speed tests
         CGO_ENABLED=0 GOOS=darwin GOARCH=$GO_ARCH go build -a -trimpath \
-            -ldflags "-s -w" \
+            -ldflags "-s -w -X main.version=$GO_VERSION" \
             -o "$INSTALL_DIR/tools/uwnspeedtest-darwin-$GO_ARCH" .
         echo "Built uwnspeedtest for darwin/$GO_ARCH (local)"
         # Build gateway binary for deployment via SSH to UniFi gateways
         CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -trimpath \
-            -ldflags "-s -w" \
+            -ldflags "-s -w -X main.version=$GO_VERSION" \
             -o "$INSTALL_DIR/tools/uwnspeedtest-linux-arm64" .
         echo "Built uwnspeedtest for linux/arm64 (gateway)"
     else
@@ -300,7 +305,7 @@ if command -v go &> /dev/null; then
         cd "$WANSTEER_SRC"
         # Build gateway binary for WAN steering (deployed via SSH to UniFi gateways)
         CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -trimpath \
-            -ldflags "-s -w" \
+            -ldflags "-s -w -X main.version=$GO_VERSION" \
             -o "$INSTALL_DIR/tools/wansteer-linux-arm64" .
         echo "Built wansteer for linux/arm64 (gateway)"
     else
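
For readers unfamiliar with the `-X main.version=...` linker flag used above, here is a minimal sketch of the Go side it targets. The flag overwrites a package-level string variable at link time; the variable name `version` in package `main` follows from the flag itself, while `normalize` is a hypothetical helper added only to mirror the script's `${GO_VERSION#v}` step.

```go
package main

import (
	"fmt"
	"strings"
)

// version is replaced at link time when the binary is built with
//   go build -ldflags "-X main.version=1.2.3"
// It has to be a package-level string variable (not a constant) for -X
// to take effect. Without the flag it keeps this default.
var version = "dev"

// normalize is a hypothetical helper mirroring the shell expansion
// "${GO_VERSION#v}": it drops a single leading "v" if one is present.
func normalize(v string) string {
	return strings.TrimPrefix(v, "v")
}

func main() {
	fmt.Println(normalize(version))
}
```

Built without the flag, the program reports "dev", which is the behavior the commit message describes for the macOS-built binaries before this change.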

src/NetworkOptimizer.Web/Components/Pages/WanSteering.razor

Lines changed: 57 additions & 14 deletions
@@ -12,6 +12,8 @@
 @using Microsoft.EntityFrameworkCore
 @using System.Text.Json
 @using NetworkOptimizer.Core.Helpers
+@using System.Reflection
+@using System.Text.RegularExpressions
 @inject IJSRuntime JS

 <PageTitle>WAN Steering - Network Optimizer</PageTitle>
@@ -82,6 +84,13 @@
                 @(_status.BinaryDeployed ? "Deployed" : "Not Deployed")
             </div>
         </div>
+        @if (_parsedStatus != null && !string.IsNullOrEmpty(_parsedStatus.Version))
+        {
+            <div class="metric">
+                <div class="metric-label">Version</div>
+                <div class="metric-value">@_parsedStatus.Version</div>
+            </div>
+        }
         <div class="metric">
             <div class="metric-label">Daemon</div>
             <div class="metric-value">
@@ -99,18 +108,15 @@
                 <div class="metric-label">Active Rules</div>
                 <div class="metric-value">@_parsedStatus.RuleCount</div>
             </div>
-            @foreach (var kvp in _parsedStatus.WanHealth)
-            {
-                <div class="metric">
-                    <div class="metric-label">@kvp.Key</div>
-                    <div class="metric-value">
-                        <span class="status-indicator @(kvp.Value.Healthy ? "status-active" : "status-inactive")"></span>
-                        @(kvp.Value.Healthy ? "Healthy" : "Down")
-                    </div>
-                </div>
-            }
         }
     </div>
+    @if (_showVersionWarning && _parsedStatus != null)
+    {
+        <div class="alert alert-warning" style="margin-top: 1rem;">
+            <strong>WAN Steer binary outdated.</strong>
+            Gateway is running @(_parsedStatus.Version) but the app is @_appVersion. Redeploy to update.
+        </div>
+    }
     }
     </div>
 </div>
@@ -496,6 +502,8 @@
     private List<WanSteerWanInfo> _wanInterfaces = new();
     private WanSteerStatus? _status;
     private ParsedDaemonStatus? _parsedStatus;
+    private bool _showVersionWarning;
+    private string _appVersion = "";

     // Rule editing
     private int? _editingRuleId; // null = not editing, 0 = adding new, >0 = editing existing
@@ -1066,10 +1074,10 @@
         return parts.Count > 0 ? string.Join(" | ", parts) : "Match all traffic";
     }

-    private static string FormatUptime(DateTime startedAt)
+    private static string FormatUptime(DateTimeOffset startedAt)
     {
         if (startedAt == default) return "-";
-        var span = DateTime.UtcNow - startedAt;
+        var span = DateTimeOffset.UtcNow - startedAt;
         if (span.TotalDays >= 1)
             return $"{(int)span.TotalDays}d {span.Hours}h";
         if (span.TotalHours >= 1)
@@ -1082,18 +1090,53 @@
     private void ParseStatusJson()
     {
         _parsedStatus = null;
+        _showVersionWarning = false;
         if (string.IsNullOrWhiteSpace(_status?.StatusJson)) return;

         try
         {
             _parsedStatus = JsonSerializer.Deserialize<ParsedDaemonStatus>(_status.StatusJson, _jsonOptions);
+            CheckVersionMismatch();
         }
         catch (Exception ex)
         {
             Logger.LogDebug(ex, "Failed to parse WAN Steering status JSON");
         }
     }

+    private void CheckVersionMismatch()
+    {
+        _showVersionWarning = false;
+        if (_parsedStatus == null) return;
+
+        _appVersion = Assembly.GetExecutingAssembly()
+            .GetCustomAttribute<AssemblyInformationalVersionAttribute>()
+            ?.InformationalVersion ?? "";
+
+        var binaryVersion = _parsedStatus.Version ?? "";
+
+        // Extract base version (X.Y.Z) from both, stripping v prefix, pre-release, and metadata.
+        // e.g., "v1.14.7" -> "1.14.7", "1.14.7-alpha.0.2+hash" -> "1.14.7", "dev" -> "dev"
+        var appBase = ExtractBaseVersion(_appVersion);
+        var binBase = ExtractBaseVersion(binaryVersion);
+
+        // Source builds produce 0.0.0 - suppress
+        if (appBase == "0.0.0") return;
+
+        // Binary is "dev" or a different base version: warn
+        if (binBase != appBase)
+        {
+            _showVersionWarning = true;
+        }
+    }
+
+    // Extracts X.Y.Z from version strings like "v1.14.7", "1.14.7-alpha.0.2+hash", "dev"
+    private static string ExtractBaseVersion(string version)
+    {
+        var match = Regex.Match(version, @"v?(\d+\.\d+\.\d+)");
+        return match.Success ? match.Groups[1].Value : version;
+    }
+
     private static readonly JsonSerializerOptions _jsonOptions = new()
     {
         PropertyNameCaseInsensitive = true,
@@ -1217,8 +1260,8 @@
     {
         public string Version { get; set; } = "";
         public bool Running { get; set; }
-        public DateTime StartedAt { get; set; }
-        public DateTime LastReconcile { get; set; }
+        public DateTimeOffset StartedAt { get; set; }
+        public DateTimeOffset LastReconcile { get; set; }
         public int RuleCount { get; set; }
         public int ReconcileCount { get; set; }
         public Dictionary<string, ParsedWanHealth> WanHealth { get; set; } = new();
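
To make the comparison in CheckVersionMismatch easier to follow, here is a small Go port of the same logic as it appears in the diff above. It is an illustrative translation, not code from the repository: `extractBaseVersion` and `shouldWarn` are hypothetical names, and the regular expression is copied from the C# version.

```go
package main

import (
	"fmt"
	"regexp"
)

// baseVersionRE is the same pattern the Razor page uses: an optional "v"
// followed by X.Y.Z, matched anywhere in the string.
var baseVersionRE = regexp.MustCompile(`v?(\d+\.\d+\.\d+)`)

// extractBaseVersion returns X.Y.Z when the input contains one and the
// input unchanged otherwise (so "dev" stays "dev").
func extractBaseVersion(v string) string {
	if m := baseVersionRE.FindStringSubmatch(v); m != nil {
		return m[1]
	}
	return v
}

// shouldWarn mirrors the decision in the diff: source builds of the app
// (base version 0.0.0) never warn; otherwise warn when the base versions differ.
func shouldWarn(appVersion, binaryVersion string) bool {
	appBase := extractBaseVersion(appVersion)
	if appBase == "0.0.0" {
		return false
	}
	return extractBaseVersion(binaryVersion) != appBase
}

func main() {
	fmt.Println(shouldWarn("1.14.7+abc123", "v1.14.7")) // false: same base version
	fmt.Println(shouldWarn("1.14.7", "1.14.6"))         // true: different releases
	fmt.Println(shouldWarn("1.14.7", "dev"))            // true: unstamped binary
	fmt.Println(shouldWarn("0.0.0-dev", "1.14.6"))      // false: app is a source build
}
```

Because pre-release and metadata suffixes are stripped before comparing, two builds of the same X.Y.Z with different suffixes are treated as matching.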

src/NetworkOptimizer.Web/Services/WanSteerDeploymentService.cs

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ public async Task<WanSteerStatus> GetStatusAsync()
         {
             var combinedCommand =
                 "echo '---PROCESS---'; pgrep -x wansteer > /dev/null 2>&1 && echo running || echo stopped; " +
-                "echo '---STATUS---'; cat /tmp/wan-steer-status.json 2>/dev/null || echo '{}'; " +
+                "echo '---STATUS---'; cat /tmp/wan-steer-status.json 2>/dev/null || echo '{}'; echo; " +
                 "echo '---VERSION---'; /data/wan-steer/wansteer -version 2>/dev/null || echo 'not installed'; " +
                 "echo '---BINARY---'; test -x /data/wan-steer/wansteer && echo 'exists' || echo 'missing'";


src/wansteer/config.go

Lines changed: 36 additions & 0 deletions
@@ -17,6 +17,27 @@ type Config struct {
 	HealthPassThreshold int            `json:"health_pass_threshold"`
 	TrafficClasses      []TrafficClass `json:"traffic_classes"`
 	StatusFile          string         `json:"status_file"`
+
+	// Stability detection: prevent SFE kernel errors during WAN flapping.
+
+	// StartupGraceSeconds is how long to wait after start for WAN interfaces
+	// to stabilize before applying rules or starting health checks.
+	StartupGraceSeconds int `json:"startup_grace_seconds"`
+
+	// InstabilityThreshold is the number of state transitions within
+	// InstabilityWindowSeconds that marks a WAN as "unstable".
+	InstabilityThreshold int `json:"instability_threshold"`
+
+	// InstabilityWindowSeconds is the sliding window for counting state transitions.
+	InstabilityWindowSeconds int `json:"instability_window_seconds"`
+
+	// BackoffRecoverySeconds is how long all WANs must be stable before
+	// exiting backoff mode (single flush + full reapply on exit).
+	BackoffRecoverySeconds int `json:"backoff_recovery_seconds"`
+
+	// SFEFlushCooldownSeconds is the minimum interval between SFE flushes.
+	// Calls within the cooldown window are skipped.
+	SFEFlushCooldownSeconds int `json:"sfe_flush_cooldown_seconds"`
 }

 // WANInterface describes a WAN link the daemon can steer traffic to.
@@ -90,6 +111,21 @@ func loadConfig(path string) (*Config, error) {
 	if cfg.StatusFile == "" {
 		cfg.StatusFile = "/tmp/wan-steer-status.json"
 	}
+	if cfg.StartupGraceSeconds <= 0 {
+		cfg.StartupGraceSeconds = 30
+	}
+	if cfg.InstabilityThreshold <= 0 {
+		cfg.InstabilityThreshold = 3
+	}
+	if cfg.InstabilityWindowSeconds <= 0 {
+		cfg.InstabilityWindowSeconds = 300
+	}
+	if cfg.BackoffRecoverySeconds <= 0 {
+		cfg.BackoffRecoverySeconds = 60
+	}
+	if cfg.SFEFlushCooldownSeconds <= 0 {
+		cfg.SFEFlushCooldownSeconds = 10
+	}

 	return &cfg, nil
 }
