Commit fcdfeed

cjagwani and claude authored
fix(onboard): detect messaging bot-token conflicts across sandboxes (#1953) (#1979)
## Summary

Closes #1953.

Telegram/Discord/Slack bot tokens only allow one consumer per token, globally. This is enforced by the platform, not by NemoClaw. Two NemoClaw sandboxes resolving the same token from `credentials.json` both start polling, each kicks the other off, and neither delivers messages. `nemoclaw status` still reports the bridge as running; the only evidence is a repeating 409 line inside `/tmp/gateway.log`.

This PR adds three layers of defense inside NemoClaw. No OpenClaw changes needed.

## What changed

- **Onboard-time prevention**: before creating a sandbox that enables a messaging channel, check the registry for any other sandbox already using that channel. Prompt `[y/N]` default No; hard-exit 1 in non-interactive mode.
- **Status-time overlap warning**: `nemoclaw status` lists every cross-sandbox messaging-channel overlap so users who upgrade or click through the onboard prompt still see the problem.
- **Status-time 409 detection**: `nemoclaw status` tails `/tmp/gateway.log` inside the default sandbox and flags Telegram's `getUpdates conflict` / `409 Conflict` pattern as `degraded`. Short timeout (3s) and silent on failure, so it never breaks status.
- **Registry backfill**: new `messagingChannels` field on `SandboxEntry`. Pre-existing sandboxes have no record of their channels, so the check would miss conflicts after an upgrade; a lazy backfill probes OpenShell for `${name}-{telegram,discord,slack}-bridge` providers and fills the field. Runs on first onboard/status after upgrade.
- **Troubleshooting docs**: new section explaining the 409 pattern, how to diagnose via `/tmp/gateway.log`, and how to recover.

The conflict-detection logic is factored into a pure `src/lib/messaging-conflict.ts` with dependency injection and unit tests. It has no `child_process` or `openshell` imports, so it's trivially testable.

## Scope notes

- **Option B from the issue ("only default sandbox polls")** is intentionally not implemented. Silently disabling a feature the user explicitly enabled is worse UX than a clear warning. The issue listed all three expected results as `OR` alternatives; this PR picks the detect-and-warn branch.
- **Discord/Slack log-pattern matching** is not included. The only verified log signature is Telegram's (from the issue reporter's actual output). Discord gateway and Slack Socket Mode have similar protocol-level conflicts but different log formats that belong to OpenClaw, not NemoClaw. The registry-based overlap warning covers Discord/Slack for prevention; log-based diagnosis for those platforms can be added when OpenClaw exposes a structured bridge-health signal.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Detects when the same messaging bot token/channel is enabled in multiple sandboxes; status shows warnings and onboarding will prompt (or abort in non-interactive mode) to prevent conflicts.
  * Sandboxes now persist selected messaging channels so status reflects configured channels.
* **Documentation**
  * Added a troubleshooting guide with symptoms, diagnosis steps, example log indicators, and remediation for messaging-bridge token conflicts.
* **Tests**
  * Added tests covering conflict detection, channel backfill, overlap reporting, and status warnings.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 56ae053 commit fcdfeed

9 files changed

Lines changed: 606 additions & 0 deletions
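The commit page does not show `src/lib/messaging-conflict.ts` itself, only its tests. The following is a minimal sketch of how the two registry checks described in the commit message could work, inferred from those tests. The interface names `RegistryReader` and `ChannelConflict` are assumptions made for illustration, and the real implementation may differ.

```typescript
// Hypothetical sketch inferred from the commit's tests; not the actual
// contents of src/lib/messaging-conflict.ts.
interface SandboxEntry {
  name: string;
  messagingChannels?: string[] | null;
}

// Assumed shape: only the read side of the registry is needed here.
interface RegistryReader {
  listSandboxes: () => { sandboxes: SandboxEntry[] };
}

interface ChannelConflict {
  channel: string;
  sandbox: string;
}

interface MessagingOverlap {
  channel: string;
  sandboxes: [string, string];
}

// Channels requested for `current` that another sandbox already records.
function findChannelConflicts(
  current: string,
  channels: string[],
  registry: RegistryReader,
): ChannelConflict[] {
  const conflicts: ChannelConflict[] = [];
  for (const entry of registry.listSandboxes().sandboxes) {
    if (entry.name === current) continue; // a sandbox never conflicts with itself
    // Entries without the field (pre-backfill) are skipped, not flagged.
    if (!Array.isArray(entry.messagingChannels)) continue;
    for (const channel of channels) {
      if (entry.messagingChannels.includes(channel)) {
        conflicts.push({ channel, sandbox: entry.name });
      }
    }
  }
  return conflicts;
}

// Every unordered pair of sandboxes that share a channel, reported once.
function findAllOverlaps(registry: RegistryReader): MessagingOverlap[] {
  const { sandboxes } = registry.listSandboxes();
  const overlaps: MessagingOverlap[] = [];
  sandboxes.forEach((a, i) => {
    for (const b of sandboxes.slice(i + 1)) {
      for (const channel of a.messagingChannels ?? []) {
        if (b.messagingChannels?.includes(channel)) {
          overlaps.push({ channel, sandboxes: [a.name, b.name] });
        }
      }
    }
  });
  return overlaps;
}
```

Because the registry is passed in as a plain object, both functions can be exercised with an in-memory list of sandboxes, which is the property the commit message highlights.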

.agents/skills/nemoclaw-user-reference/references/troubleshooting.md

Lines changed: 19 additions & 0 deletions
@@ -414,6 +414,25 @@ In that case:
 - inspect gateway logs and blocked requests with `openshell term`
 - treat the failure as a native Discord gateway problem, not as a bridge startup problem
 
+### Messaging bridge appears running but no messages arrive
+
+Bot tokens for Telegram (`getUpdates`), Discord (gateway), and Slack (Socket Mode) only allow one active consumer per token. If two NemoClaw sandboxes are configured with the same bot token, each one kicks the other off its polling connection and neither delivers messages. `nemoclaw status` still reports the bridge as running because the gateway process itself is alive.
+
+To diagnose, open a shell in the sandbox and inspect the gateway log:
+
+```console
+$ openshell term <sandbox-name>
+$ tail -f /tmp/gateway.log
+```
+
+A repeating line like the following confirms the conflict:
+
+```text
+[telegram] getUpdates conflict: 409: Conflict: terminated by other getUpdates request; retrying in 30s.
+```
+
+To fix, run `nemoclaw <other-sandbox> destroy` on whichever sandbox should stop polling, or rerun onboarding on it with the channel disabled. Current NemoClaw warns at `nemoclaw onboard` time when another sandbox already has the same channel enabled, but sandboxes created before that check was added may still be in a conflict loop.
+
 ### Landlock filesystem restrictions silently degraded
 
 After sandbox creation, NemoClaw checks whether the host kernel supports Landlock (Linux 5.13+).
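The status-time 409 detection described in the commit message can be pictured as a line count over the gateway log. This is a sketch only: `countTelegramConflicts`, `checkTelegramBridgeHealth`, and the injected `readLogTail` callback are hypothetical names, and the pattern is based solely on the sample log line in the troubleshooting section above.

```typescript
// Hypothetical helpers; the real check in NemoClaw may be structured differently.
interface MessagingBridgeHealth {
  channel: string;
  conflicts: number;
}

// Matches "getUpdates conflict", "409 Conflict", and "409: Conflict".
// No `g` flag, so RegExp.test() keeps no state between lines.
const TELEGRAM_CONFLICT = /getUpdates conflict|409:? Conflict/i;

// Count log lines that carry the Telegram conflict signature.
function countTelegramConflicts(logText: string): number {
  return logText
    .split(/\r?\n/)
    .filter((line) => TELEGRAM_CONFLICT.test(line)).length;
}

// `readLogTail` is an assumed injected dependency that returns the tail of
// /tmp/gateway.log. Any failure is swallowed so status output never breaks.
function checkTelegramBridgeHealth(
  readLogTail: () => string,
): MessagingBridgeHealth[] {
  let text: string;
  try {
    text = readLogTail();
  } catch {
    return [];
  }
  const conflicts = countTelegramConflicts(text);
  return conflicts > 0 ? [{ channel: "telegram", conflicts }] : [];
}
```

Returning an empty array on failure mirrors the "silent on failure" behavior stated in the commit message. The 3s timeout would live inside whatever implements `readLogTail`.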

docs/reference/troubleshooting.md

Lines changed: 19 additions & 0 deletions
@@ -444,6 +444,25 @@ In that case:
 - inspect gateway logs and blocked requests with `openshell term`
 - treat the failure as a native Discord gateway problem, not as a bridge startup problem
 
+### Messaging bridge appears running but no messages arrive
+
+Bot tokens for Telegram (`getUpdates`), Discord (gateway), and Slack (Socket Mode) only allow one active consumer per token. If two NemoClaw sandboxes are configured with the same bot token, each one kicks the other off its polling connection and neither delivers messages. `nemoclaw status` still reports the bridge as running because the gateway process itself is alive.
+
+To diagnose, open a shell in the sandbox and inspect the gateway log:
+
+```console
+$ openshell term <sandbox-name>
+$ tail -f /tmp/gateway.log
+```
+
+A repeating line like the following confirms the conflict:
+
+```text
+[telegram] getUpdates conflict: 409: Conflict: terminated by other getUpdates request; retrying in 30s.
+```
+
+To fix, run `nemoclaw <other-sandbox> destroy` on whichever sandbox should stop polling, or rerun onboarding on it with the channel disabled. Current NemoClaw warns at `nemoclaw onboard` time when another sandbox already has the same channel enabled, but sandboxes created before that check was added may still be in a conflict loop.
+
 ### Landlock filesystem restrictions silently degraded
 
 After sandbox creation, NemoClaw checks whether the host kernel supports Landlock (Linux 5.13+).

src/lib/inventory-commands.test.ts

Lines changed: 71 additions & 0 deletions
@@ -88,6 +88,77 @@ describe("inventory commands", () => {
     );
   });
 
+  it("flags messaging bridge as degraded when checkMessagingBridgeHealth reports conflicts", () => {
+    const lines: string[] = [];
+    const checkMessagingBridgeHealth = vi.fn().mockReturnValue([
+      { channel: "telegram", conflicts: 7 },
+    ]);
+    showStatusCommand({
+      listSandboxes: () => ({
+        sandboxes: [
+          {
+            name: "alpha",
+            model: "m",
+            messagingChannels: ["telegram"],
+          },
+        ],
+        defaultSandbox: "alpha",
+      }),
+      getLiveInference: () => null,
+      showServiceStatus: vi.fn(),
+      checkMessagingBridgeHealth,
+      log: (message = "") => lines.push(message),
+    });
+
+    expect(checkMessagingBridgeHealth).toHaveBeenCalledWith("alpha", ["telegram"]);
+    expect(lines).toContain(
+      " ⚠ telegram bridge: degraded (7 conflict errors in /tmp/gateway.log)",
+    );
+  });
+
+  it("skips messaging bridge check when the default sandbox has no channels", () => {
+    const lines: string[] = [];
+    const checkMessagingBridgeHealth = vi.fn().mockReturnValue([]);
+    showStatusCommand({
+      listSandboxes: () => ({
+        sandboxes: [{ name: "alpha", model: "m" }],
+        defaultSandbox: "alpha",
+      }),
+      getLiveInference: () => null,
+      showServiceStatus: vi.fn(),
+      checkMessagingBridgeHealth,
+      log: (message = "") => lines.push(message),
+    });
+
+    expect(checkMessagingBridgeHealth).not.toHaveBeenCalled();
+    expect(lines.some((l) => l.includes("degraded"))).toBe(false);
+  });
+
+  it("prints a cross-sandbox overlap warning when backfillAndFindOverlaps reports overlaps", () => {
+    const lines: string[] = [];
+    const backfillAndFindOverlaps = vi.fn().mockReturnValue([
+      { channel: "telegram", sandboxes: ["alice", "bob"] },
+    ]);
+    showStatusCommand({
+      listSandboxes: () => ({
+        sandboxes: [
+          { name: "alice", model: "m", messagingChannels: ["telegram"] },
+          { name: "bob", model: "m", messagingChannels: ["telegram"] },
+        ],
+        defaultSandbox: "alice",
+      }),
+      getLiveInference: () => null,
+      showServiceStatus: vi.fn(),
+      backfillAndFindOverlaps,
+      log: (message = "") => lines.push(message),
+    });
+
+    expect(backfillAndFindOverlaps).toHaveBeenCalled();
+    expect(
+      lines.some((l) => l.includes("telegram is enabled on both 'alice' and 'bob'")),
+    ).toBe(true);
+  });
+
   it("prints stored sandbox models in status and delegates service status", () => {
     const lines: string[] = [];
     const showServiceStatus = vi.fn();

src/lib/inventory-commands.ts

Lines changed: 54 additions & 0 deletions
@@ -9,6 +9,12 @@ export interface SandboxEntry {
   provider?: string | null;
   gpuEnabled?: boolean;
   policies?: string[] | null;
+  messagingChannels?: string[] | null;
+}
+
+export interface MessagingBridgeHealth {
+  channel: string;
+  conflicts: number;
 }
 
 export interface RecoveryResult {
@@ -25,10 +31,20 @@ export interface ListSandboxesCommandDeps {
   log?: (message?: string) => void;
 }
 
+export interface MessagingOverlap {
+  channel: string;
+  sandboxes: [string, string];
+}
+
 export interface ShowStatusCommandDeps {
   listSandboxes: () => { sandboxes: SandboxEntry[]; defaultSandbox?: string | null };
   getLiveInference: () => GatewayInference | null;
   showServiceStatus: (options: { sandboxName?: string }) => void;
+  checkMessagingBridgeHealth?: (
+    sandboxName: string,
+    channels: string[],
+  ) => MessagingBridgeHealth[];
+  backfillAndFindOverlaps?: () => MessagingOverlap[];
   log?: (message?: string) => void;
 }
 
@@ -99,4 +115,42 @@ export function showStatusCommand(deps: ShowStatusCommandDeps): void {
   }
 
   deps.showServiceStatus({ sandboxName: defaultSandbox || undefined });
+
+  if (deps.backfillAndFindOverlaps) {
+    const overlaps = deps.backfillAndFindOverlaps();
+    if (overlaps.length > 0) {
+      log("");
+      for (const { channel, sandboxes: pair } of overlaps) {
+        log(
+          ` ⚠ ${channel} is enabled on both '${pair[0]}' and '${pair[1]}'. Bot tokens only allow one sandbox to poll — both bridges will fail.`,
+        );
+      }
+      log(
+        " Run `nemoclaw <sandbox> destroy` on whichever sandbox should stop polling, or rerun onboarding with the channel disabled.",
+      );
+    }
+  }
+
+  if (deps.checkMessagingBridgeHealth && defaultSandbox) {
+    // Re-fetch: backfillAndFindOverlaps above may have populated
+    // messagingChannels for the default sandbox on first run after upgrade,
+    // and the original `sandboxes` snapshot is stale.
+    const refreshed = deps.listSandboxes().sandboxes;
+    const defaultEntry = refreshed.find((sb) => sb.name === defaultSandbox);
+    const channels = defaultEntry?.messagingChannels;
+    if (Array.isArray(channels) && channels.length > 0) {
+      const degraded = deps.checkMessagingBridgeHealth(defaultSandbox, channels);
+      if (degraded.length > 0) {
+        log("");
+        for (const { channel, conflicts } of degraded) {
+          log(
+            ` ⚠ ${channel} bridge: degraded (${conflicts} conflict errors in /tmp/gateway.log)`,
+          );
+        }
+        log(
+          " Another sandbox is likely polling with the same bot token. See docs/reference/troubleshooting.md.",
+        );
+      }
+    }
+  }
 }

src/lib/messaging-conflict.test.ts

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { describe, expect, it, vi } from "vitest";
+
+import type { SandboxEntry } from "./registry";
+import {
+  backfillMessagingChannels,
+  findAllOverlaps,
+  findChannelConflicts,
+} from "./messaging-conflict";
+
+function makeRegistry(sandboxes: SandboxEntry[]) {
+  const store = new Map(sandboxes.map((s) => [s.name, { ...s }]));
+  return {
+    listSandboxes: () => ({
+      sandboxes: Array.from(store.values()),
+      defaultSandbox: sandboxes[0]?.name ?? null,
+    }),
+    updateSandbox: vi.fn((name: string, updates: Partial<SandboxEntry>) => {
+      const entry = store.get(name);
+      if (!entry) return false;
+      Object.assign(entry, updates);
+      return true;
+    }),
+  };
+}
+
+describe("findChannelConflicts", () => {
+  it("returns conflicts when another sandbox already has the channel", () => {
+    const registry = makeRegistry([
+      { name: "alice", messagingChannels: ["telegram"] },
+      { name: "bob", messagingChannels: [] },
+    ]);
+    expect(findChannelConflicts("bob", ["telegram"], registry)).toEqual([
+      { channel: "telegram", sandbox: "alice" },
+    ]);
+  });
+
+  it("excludes the current sandbox from its own conflicts", () => {
+    const registry = makeRegistry([{ name: "alice", messagingChannels: ["telegram"] }]);
+    expect(findChannelConflicts("alice", ["telegram"], registry)).toEqual([]);
+  });
+
+  it("skips entries with no messagingChannels field (pre-backfill)", () => {
+    const registry = makeRegistry([{ name: "alice" }, { name: "bob", messagingChannels: [] }]);
+    expect(findChannelConflicts("bob", ["telegram"], registry)).toEqual([]);
+  });
+
+  it("returns empty when no channels are enabled", () => {
+    const registry = makeRegistry([{ name: "alice", messagingChannels: ["telegram"] }]);
+    expect(findChannelConflicts("bob", [], registry)).toEqual([]);
+  });
+});
+
+describe("findAllOverlaps", () => {
+  it("reports each overlapping pair once", () => {
+    const registry = makeRegistry([
+      { name: "alice", messagingChannels: ["telegram"] },
+      { name: "bob", messagingChannels: ["telegram"] },
+      { name: "carol", messagingChannels: ["discord"] },
+    ]);
+    expect(findAllOverlaps(registry)).toEqual([
+      { channel: "telegram", sandboxes: ["alice", "bob"] },
+    ]);
+  });
+
+  it("reports all pairs when three sandboxes share a channel", () => {
+    const registry = makeRegistry([
+      { name: "a", messagingChannels: ["telegram"] },
+      { name: "b", messagingChannels: ["telegram"] },
+      { name: "c", messagingChannels: ["telegram"] },
+    ]);
+    expect(findAllOverlaps(registry)).toEqual([
+      { channel: "telegram", sandboxes: ["a", "b"] },
+      { channel: "telegram", sandboxes: ["a", "c"] },
+      { channel: "telegram", sandboxes: ["b", "c"] },
+    ]);
+  });
+
+  it("returns empty when channels do not overlap", () => {
+    const registry = makeRegistry([
+      { name: "alice", messagingChannels: ["telegram"] },
+      { name: "bob", messagingChannels: ["discord"] },
+    ]);
+    expect(findAllOverlaps(registry)).toEqual([]);
+  });
+});
+
+describe("backfillMessagingChannels", () => {
+  it("fills in missing messagingChannels by probing OpenShell", () => {
+    const registry = makeRegistry([{ name: "alice" }]);
+    const probe = {
+      providerExists: vi.fn((name: string) =>
+        name === "alice-telegram-bridge" ? "present" : "absent",
+      ) as (name: string) => "present" | "absent" | "error",
+    };
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).toHaveBeenCalledWith("alice", {
+      messagingChannels: ["telegram"],
+    });
+    expect(probe.providerExists).toHaveBeenCalledWith("alice-telegram-bridge");
+    expect(probe.providerExists).toHaveBeenCalledWith("alice-discord-bridge");
+    expect(probe.providerExists).toHaveBeenCalledWith("alice-slack-bridge");
+  });
+
+  it("leaves entries with existing messagingChannels alone", () => {
+    const registry = makeRegistry([
+      { name: "alice", messagingChannels: ["telegram"] },
+    ]);
+    const probe = {
+      providerExists: vi.fn(() => "present") as (name: string) => "present" | "absent" | "error",
+    };
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).not.toHaveBeenCalled();
+    expect(probe.providerExists).not.toHaveBeenCalled();
+  });
+
+  it("writes an empty array when all probes return absent", () => {
+    const registry = makeRegistry([{ name: "alice" }]);
+    const probe = {
+      providerExists: vi.fn(() => "absent") as (name: string) => "present" | "absent" | "error",
+    };
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).toHaveBeenCalledWith("alice", { messagingChannels: [] });
+  });
+
+  it("does NOT persist when a probe returns error (retry on next call)", () => {
+    // "error" is distinct from "absent": a transient gateway failure must not
+    // be collapsed into "provider not attached" and persisted, because that
+    // would prevent all future backfill retries and hide real overlaps.
+    const registry = makeRegistry([{ name: "alice" }]);
+    const probe = {
+      providerExists: vi.fn((name: string) => {
+        if (name.endsWith("-telegram-bridge")) return "error";
+        return name.endsWith("-discord-bridge") ? "present" : "absent";
+      }) as (name: string) => "present" | "absent" | "error",
+    };
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).not.toHaveBeenCalled();
+  });
+
+  it("also treats a thrown probe as error (defensive; callers should return 'error' instead)", () => {
+    const registry = makeRegistry([{ name: "alice" }]);
+    const probe = {
+      providerExists: vi.fn(() => {
+        throw new Error("unexpected");
+      }) as (name: string) => "present" | "absent" | "error",
+    };
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).not.toHaveBeenCalled();
+  });
+
+  it("re-attempts backfill on a subsequent call after a prior error", () => {
+    const registry = makeRegistry([{ name: "alice" }]);
+    let firstPass = true;
+    const probe = {
+      providerExists: vi.fn((name: string) => {
+        if (name.endsWith("-telegram-bridge") && firstPass) {
+          firstPass = false;
+          return "error";
+        }
+        return name === "alice-telegram-bridge" ? "present" : "absent";
+      }) as (name: string) => "present" | "absent" | "error",
+    };
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).not.toHaveBeenCalled();
+    backfillMessagingChannels(registry, probe);
+    expect(registry.updateSandbox).toHaveBeenCalledWith("alice", {
+      messagingChannels: ["telegram"],
+    });
+  });
+});
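The backfill tests above pin down the observable behavior of `backfillMessagingChannels` without showing its body. One way to satisfy them is sketched below. The `Registry` and `ProviderProbe` interface names and the `CHANNELS` constant are assumptions for illustration, and whether the real code keeps probing after the first error is not determined by the tests.

```typescript
// Hypothetical sketch consistent with the tests above; not the actual
// implementation in src/lib/messaging-conflict.ts.
type ProbeResult = "present" | "absent" | "error";

interface SandboxEntry {
  name: string;
  messagingChannels?: string[] | null;
}

interface Registry {
  listSandboxes: () => { sandboxes: SandboxEntry[] };
  updateSandbox: (name: string, updates: Partial<SandboxEntry>) => boolean;
}

interface ProviderProbe {
  providerExists: (name: string) => ProbeResult;
}

// Channels named in the commit message's `${name}-{telegram,discord,slack}-bridge`.
const CHANNELS = ["telegram", "discord", "slack"] as const;

function backfillMessagingChannels(registry: Registry, probe: ProviderProbe): void {
  for (const entry of registry.listSandboxes().sandboxes) {
    // Entries that already record their channels are left alone.
    if (Array.isArray(entry.messagingChannels)) continue;

    const found: string[] = [];
    let failed = false;
    for (const channel of CHANNELS) {
      let result: ProbeResult;
      try {
        result = probe.providerExists(`${entry.name}-${channel}-bridge`);
      } catch {
        result = "error"; // defensive: a thrown probe counts as an error
      }
      if (result === "error") failed = true;
      else if (result === "present") found.push(channel);
    }

    // Persist only when every probe gave a definite answer, so a transient
    // failure is retried on the next call instead of being stored as "absent".
    if (!failed) registry.updateSandbox(entry.name, { messagingChannels: found });
  }
}
```

Keeping "error" separate from "absent" is the design point the tests emphasize: writing an empty array after a failed probe would mark the sandbox as backfilled and hide a real overlap.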
