fix(whatsapp): scope whatsmeow device per channel instance by kamushadenes · Pull Request #1065 · nextlevelbuilder/goclaw

kamushadenes · 2026-04-28T22:09:20Z

Summary

Resolve each WhatsApp channel's whatsmeow device by the JID stored in its channel_instances.config, so multiple WhatsApp instances no longer collide on a shared GetFirstDevice call and end up logged in as the same account.
Persist the paired JID back to channel_instances.config on events.PairSuccess (and on a guarded one-time adoption of a legacy single-device store) so subsequent boots skip QR.
Add an optional SetInstanceID(uuid.UUID) hook the InstanceLoader calls after construction; the WhatsApp channel uses it to scope its persistence to its own row.
Reauth() bypasses adoption and clears the cached configJID so a fresh QR pairing always rewrites the JID.
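
To illustrate the persistence step, here is a minimal sketch of merging a paired JID into an instance's JSON config without disturbing other keys. The helper name setConfigJID and the sample config key are assumptions made for illustration; the PR itself writes to channel_instances.config in the database, which this sketch does not attempt.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setConfigJID is a hypothetical helper (not taken from the PR). It returns a
// copy of the JSON config with the "jid" key set and all other keys preserved.
func setConfigJID(config []byte, jid string) ([]byte, error) {
	fields := map[string]json.RawMessage{}
	if len(config) > 0 {
		if err := json.Unmarshal(config, &fields); err != nil {
			return nil, fmt.Errorf("decode config: %w", err)
		}
	}
	if fields == nil {
		// A JSON null decodes to a nil map; start from an empty object instead.
		fields = map[string]json.RawMessage{}
	}
	encoded, err := json.Marshal(jid)
	if err != nil {
		return nil, fmt.Errorf("encode jid: %w", err)
	}
	fields["jid"] = encoded
	return json.Marshal(fields)
}

func main() {
	// "example_key" is a made-up config key used only for this demo.
	out, err := setConfigJID([]byte(`{"example_key":true}`), "551152861098:5@s.whatsapp.net")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping unknown keys as json.RawMessage means the helper does not need to know the rest of the config schema.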

Why this is needed

whatsapp.Channel.Start (and the lazy-init path in StartQRFlow/Reauth) used container.GetFirstDevice(ctx) on a sqlstore.Container that all WhatsApp instances share through the same Postgres DB. With more than one WhatsApp channel_instance, every channel reused the first device row in whatsmeow_device and connected as the same WhatsApp account. The QR flow then failed with "whatsapp QR: start flow failed: GetQRChannel must be called before connecting" because the second instance inherited an already-paired Store.ID.

Approach

resolveDevice(ctx) chooses between three paths:

1. config.jid set → container.GetDevice(ctx, jid) reuses that channel's prior pairing.
2. config.jid empty, exactly one device row, exactly one WhatsApp instance, and that instance is this channel → adopt the device (covers single-instance deploys upgrading to multi-instance without re-pairing the existing account). The adopted JID is persisted immediately so the path collapses to (1) on subsequent starts.
3. Otherwise → container.NewDevice(), which lets whatsmeow.Connect drive a fresh QR pairing. events.PairSuccess writes the new JID back to channel_instances.config.

The adoption guard intentionally refuses to act when the situation is ambiguous (multiple devices or multiple WhatsApp instances) — those configurations require an explicit re-pair.
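
The decision between the three paths can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's resolveDevice: the function name choosePath, its parameters, and the use of plain strings for JIDs and instance IDs are all hypothetical, and the sketch deliberately leaves out the actual whatsmeow container calls.

```go
package main

import "fmt"

// devicePath names the three outcomes described in the Approach section.
type devicePath int

const (
	reuseConfigured devicePath = iota + 1 // path (1): config.jid is set
	adoptLegacy                           // path (2): guarded one-time adoption
	pairNew                               // path (3): new device, fresh QR pairing
)

// choosePath is a hypothetical, dependency-free model of the decision.
// allowAdopt is false for a Reauth-style call, which skips adoption.
func choosePath(configJID string, deviceJIDs, whatsappInstanceIDs []string, selfID string, allowAdopt bool) devicePath {
	if configJID != "" {
		return reuseConfigured
	}
	unambiguous := len(deviceJIDs) == 1 &&
		len(whatsappInstanceIDs) == 1 &&
		whatsappInstanceIDs[0] == selfID
	if allowAdopt && unambiguous {
		return adoptLegacy
	}
	return pairNew
}

func main() {
	devices := []string{"legacy-jid"}
	fmt.Println(choosePath("legacy-jid", devices, []string{"a", "b"}, "a", true)) // reuse
	fmt.Println(choosePath("", devices, []string{"a"}, "a", true))                // adopt
	fmt.Println(choosePath("", devices, []string{"a", "b"}, "b", true))           // new pairing
}
```

Modeling the guard as a single boolean keeps the ambiguous cases (several devices or several WhatsApp instances) on the explicit re-pair path.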

Migration note

For existing deploys with two or more WhatsApp instances already in the DB but only one device row in whatsmeow_device, set the device JID on the legacy instance's config before deploying, e.g.:

UPDATE channel_instances
SET    config = jsonb_set(coalesce(config, '{}'::jsonb), '{jid}', to_jsonb(<jid>::text)),
       updated_at = now()
WHERE  id = <legacy_instance_id>;

After that, the legacy instance takes path (1) and the second instance takes path (3) on next start. (Single-instance deploys do nothing — they hit path (2) on first boot.)

Test plan

go build ./... (PG) and go build -tags sqliteonly ./... (Desktop)
go vet ./internal/channels/whatsapp ./internal/channels ./cmd
go test ./internal/channels/...
Deployed to a Postgres-backed gateway with two WhatsApp instances; legacy instance reused its device, new instance got a fresh QR session and bound a separate JID. Restart preserved both pairings.

WhatsApp LID-format chat IDs contain @ (e.g. 551152861098:5@s.whatsapp.net) which is invalid in Docker container names, causing sandbox creation to fail for any WhatsApp-triggered agent session. Add @ to the sanitizeKey replacer alongside the existing : / . and space characters. Adds a test case with a realistic WhatsApp LID key. Fixes nextlevelbuilder#1029 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
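
A minimal sketch of such a replacer, assuming each listed character is mapped to "-" (the commit message names the characters but not the replacement, so the "-" is an assumption, as is the exact shape of sanitizeKey):

```go
package main

import (
	"fmt"
	"strings"
)

// keyReplacer rewrites the characters named in the commit message.
// Mapping them to "-" is an assumption made for this sketch.
var keyReplacer = strings.NewReplacer(
	":", "-",
	"/", "-",
	".", "-",
	" ", "-",
	"@", "-", // the character added by this fix
)

// sanitizeKey returns a key with the listed characters replaced.
func sanitizeKey(key string) string {
	return keyReplacer.Replace(key)
}

func main() {
	fmt.Println(sanitizeKey("551152861098:5@s.whatsapp.net"))
}
```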

The stateless flag was supposed to prevent session history accumulation, but the reset logic was inverted: non-stateless jobs got reset while stateless jobs were skipped. Since the agent loop always persists messages to the session key regardless of the stateless flag, stateless cron jobs accumulated unbounded history across runs. Fix by unconditionally resetting the session before every cron execution. This is consistent with nextlevelbuilder#294 (which added the reset for non-stateless jobs) and ensures stateless jobs actually start fresh each run. Fixes nextlevelbuilder#1029-related session accumulation observed in production. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
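
The effect of the unconditional reset can be shown with a small in-memory model. The types sessionStore and cronJob and the function runCron are hypothetical stand-ins invented for this sketch; only the behavior (reset before every run, agent loop always persists) comes from the commit message.

```go
package main

import "fmt"

// sessionStore is a hypothetical in-memory stand-in for persisted history.
type sessionStore map[string][]string

// cronJob is a hypothetical job description for this sketch.
type cronJob struct {
	SessionKey string
	Stateless  bool
	Prompt     string
}

// runCron resets the session before every execution, regardless of the
// Stateless flag, then simulates the agent loop persisting one message.
func runCron(store sessionStore, job cronJob) {
	delete(store, job.SessionKey) // unconditional reset (the fix)
	// The agent loop persists to the session key whether or not the job is stateless.
	store[job.SessionKey] = append(store[job.SessionKey], job.Prompt)
}

func main() {
	store := sessionStore{}
	job := cronJob{SessionKey: "cron:daily", Stateless: true, Prompt: "summarize"}
	for i := 0; i < 3; i++ {
		runCron(store, job)
	}
	fmt.Println(len(store[job.SessionKey]))
}
```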

When an LLM wraps a credentialed CLI in a shell chain like "which gh && gh pr list", lookupCredentialedBinary only checks the first binary ("which") and misses "gh". The command falls through to regular exec, running without credential injection — the CLI reports "not authenticated" and the agent gives up. Add detectCredentialedBinaryInChain() which scans all segments of a shell-operator command for registered credentialed binaries. When found, returns an actionable error telling the LLM to call the CLI directly without shell operators, instead of silently falling through. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
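
A rough sketch of the chain scan follows. It borrows the function name from the commit message, but the signature, the registry type, and the splitting strategy are assumptions; in particular, splitting on operators with a regular expression ignores quoting and subshells, which a real implementation would need to handle.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// chainOperators matches &&, ||, ; and |. This is a naive split: operators
// inside quotes are not recognized as literal text.
var chainOperators = regexp.MustCompile(`&&|\|\||;|\|`)

// detectCredentialedBinaryInChain reports the first registered binary found
// at the start of any segment of a shell-operator chain. Commands without
// operators are not chains and are left to the regular lookup.
func detectCredentialedBinaryInChain(command string, registered map[string]bool) (string, bool) {
	segments := chainOperators.Split(command, -1)
	if len(segments) < 2 {
		return "", false
	}
	for _, segment := range segments {
		fields := strings.Fields(segment)
		if len(fields) > 0 && registered[fields[0]] {
			return fields[0], true
		}
	}
	return "", false
}

func main() {
	registered := map[string]bool{"gh": true}
	name, found := detectCredentialedBinaryInChain("which gh && gh pr list", registered)
	fmt.Println(name, found)
}
```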

…ction

When LLMs wrap credentialed CLIs in shell chains (e.g. "which gh && gh pr list"), the credentialed exec gate only checks the first binary and misses the CLI deeper in the chain. The command falls through to regular exec without credential injection.

Add a per-CLI `allow_chain_exec` boolean (default: false) that controls behavior when this is detected:
- **false (default)**: return an actionable error telling the LLM to call the CLI directly without shell operators (safe, no token leak)
- **true**: inject all matching credential env vars into the full command chain and execute via shell (convenient but tokens visible to all commands in the chain)

Changes:
- Migration 000057: add `allow_chain_exec` column to `secure_cli_binaries`
- Store: SecureCLIBinary struct + PG/SQLite CRUD (select, insert, update, scan)
- HTTP API: create/update request structs + allowlist
- Exec logic: handleCredentialedChain() with two-mode dispatch
- Credential context: per-CLI note when chain exec is enabled
- Web UI: toggle switch in CLI credential settings form
- i18n: English labels + hint text

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
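
The two-mode dispatch might look roughly like the sketch below. The struct secureCLI, the function chainEnv, and the rule that a single matched CLI with the flag disabled rejects the whole chain are assumptions made for illustration; the commit message does not say how mixed settings are resolved.

```go
package main

import (
	"fmt"
	"sort"
)

// secureCLI is a hypothetical, trimmed-down view of a registered CLI.
type secureCLI struct {
	Name           string
	AllowChainExec bool
	Env            map[string]string // credential env vars to inject
}

// chainEnv returns the env entries to inject into the whole chain, or an
// actionable error if any matched CLI has chain exec disabled (assumption).
func chainEnv(matched []secureCLI) ([]string, error) {
	for _, cli := range matched {
		if !cli.AllowChainExec {
			return nil, fmt.Errorf("%s needs credentials: call it directly without shell operators", cli.Name)
		}
	}
	var env []string
	for _, cli := range matched {
		for key, value := range cli.Env {
			env = append(env, key+"="+value)
		}
	}
	sort.Strings(env) // deterministic order for logging and tests
	return env, nil
}

func main() {
	gh := secureCLI{Name: "gh", AllowChainExec: true, Env: map[string]string{"GH_TOKEN": "example-token"}}
	env, err := chainEnv([]secureCLI{gh})
	fmt.Println(env, err)
}
```

Returning an error in the default mode keeps tokens out of the environment of unrelated commands in the chain, which is the trade-off the commit message describes.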

Add a rolling cache_control breakpoint on the last message in every Anthropic request body. Without this, conversation history was sent uncached every turn, dominating cost on long agent sessions: observed 36% effective cache hit on a 187-message Slack thread that should have been ~80%. System prompt and the last tool definition were already cached; messages were not. Anthropic allows up to 4 breakpoints, leaving 2 free for messages.

Handles all three content shapes used in `buildRequestBody`:
- Plain string (typical text-only user message): converted to a single text block so cache_control can be attached. Anthropic accepts both shapes.
- []map[string]any (multi-modal user, tool_result, assistant text+tool_use): cache_control attaches to the last block.
- []json.RawMessage (assistant raw blocks preserving thinking signatures): last block is re-marshaled with cache_control; decode failures are skipped silently to avoid corrupting the request body.

Refs nextlevelbuilder#1042.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a pre_tool_use script hook returns DecisionBlock with a non-empty `reason`, the dispatcher now copies it onto FireResult.Reason and the pipeline appends it to the synthetic tool message instead of the bare "Hook blocked: pre_tool_use" line. user_prompt_submit gets the same treatment via a wrapped error. This lets a hook self-document the block to the agent (e.g. retry hints like "use rtk prefix") without expanding the system prompt or relying on out-of-band documentation. Backward-compatible: handlers that do not set a reason — including all non-script handler types (HTTP, command, prompt) — yield FireResult.Reason == "" and the original generic message is used. Tests cover both paths (reason set / not set) and verify Reason stays empty for non-script handlers.
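
The backward-compatible message construction could be sketched as follows. The FireResult struct here is a cut-down stand-in, blockedToolMessage is a hypothetical helper, and the ": " separator between the generic line and the reason is an assumption, since the commit message does not give the exact format.

```go
package main

import "fmt"

// FireResult is a reduced stand-in for the dispatcher result; only the
// Reason field mentioned in the commit message is modeled.
type FireResult struct {
	Reason string
}

// blockedToolMessage builds the synthetic tool message for a blocked hook.
// With an empty reason it returns the original generic line unchanged.
func blockedToolMessage(event string, result FireResult) string {
	message := "Hook blocked: " + event
	if result.Reason != "" {
		message += ": " + result.Reason // separator is an assumption
	}
	return message
}

func main() {
	fmt.Println(blockedToolMessage("pre_tool_use", FireResult{}))
	fmt.Println(blockedToolMessage("pre_tool_use", FireResult{Reason: "use rtk prefix"}))
}
```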

…fault noexec

Each WhatsApp channel previously called container.GetFirstDevice on a shared sqlstore.Container, so multiple instances ended up bound to the same device row and connected as the same WhatsApp account. The QR flow logged "GetQRChannel must be called before connecting" because the second instance inherited an already-paired Store.ID and short-circuited. Channels now resolve their device via configJID stored in the channel_instances row. resolveDevice tries GetDevice(jid) first, falls back to a guarded one-time adoption (only when exactly one device + one whatsapp instance exist), otherwise allocates NewDevice for a fresh QR pairing. PairSuccess persists the new JID back to channel_instances.config so future boots skip QR. Reauth bypasses adoption to force a clean re-pair. InstanceLoader now invokes SetInstanceID on channels that implement it so the WhatsApp channel can scope its device persistence to its row id.
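
The optional SetInstanceID hook follows Go's usual optional-interface pattern, sketched below. All type names are hypothetical, and the sketch uses a local 16-byte array in place of uuid.UUID to stay free of external dependencies.

```go
package main

import "fmt"

// instanceID stands in for uuid.UUID so the sketch needs no external module.
type instanceID [16]byte

// channel is a hypothetical minimal channel interface.
type channel interface {
	Name() string
}

// instanceIDSetter is the optional hook: channels may or may not implement it.
type instanceIDSetter interface {
	SetInstanceID(id instanceID)
}

type whatsappChannel struct{ id instanceID }

func (c *whatsappChannel) Name() string                { return "whatsapp" }
func (c *whatsappChannel) SetInstanceID(id instanceID) { c.id = id }

type otherChannel struct{}

func (otherChannel) Name() string { return "other" }

// attachInstanceID calls the hook only when the channel implements it and
// reports whether it did.
func attachInstanceID(ch channel, id instanceID) bool {
	setter, ok := ch.(instanceIDSetter)
	if ok {
		setter.SetInstanceID(id)
	}
	return ok
}

func main() {
	id := instanceID{1}
	fmt.Println(attachInstanceID(&whatsappChannel{}, id), attachInstanceID(otherChannel{}, id))
}
```

A type assertion keeps the hook opt-in, so channels that do not need their row ID require no changes.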

kamushadenes and others added 20 commits April 24, 2026 17:52

fix: sanitize @ in sandbox names

ce11fb4

fix: always reset cron sessions

90948f5

feat: chain exec + allow_chain_exec

160a7ad

fix: bump RequiredSchemaVersion to 57 for allow_chain_exec migration

b21495e

fix: bump schema version to 57

e25a317

fix: add missing $16 placeholder in secure_cli INSERT

ef710d2

fix: missing $16 placeholder in secure_cli INSERT

0842150

fix: add missing 16th placeholder in SQLite secure_cli INSERT

73d8432

fix: SQLite INSERT placeholder count

9a2f847

feat(sandbox): mount data volume read-only for skills/config access

f89cb1e

feat: mount data volume ro in sandbox

ffc464c

feat(sandbox): add AllowTmpExec opt-out for tmpfs noexec

422e707

fix(sandbox): use explicit exec flag in tmpfs to override Docker's de…

8f4b581

…fault noexec

kamushadenes mentioned this pull request Apr 28, 2026

WhatsApp: multiple channel_instances share one whatsmeow_device row #1064

Open

kamushadenes wants to merge 20 commits into nextlevelbuilder:dev from kamushadenes:fix/whatsapp-per-channel-device
