Skip to content

feat(channels/zalo): OA OAuth + webhook transport (#966)#1048

Open
vanducng wants to merge 148 commits intodevfrom
feat/zalo-oa-webhook-966-clean
Open

feat(channels/zalo): OA OAuth + webhook transport (#966)#1048
vanducng wants to merge 148 commits intodevfrom
feat/zalo-oa-webhook-966-clean

Conversation

@vanducng
Copy link
Copy Markdown
Contributor

@vanducng vanducng commented Apr 27, 2026

What this does

Adds Zalo OA (Official Account) channel with OAuth + webhook, and brings the existing Zalo Bot channel up to parity (webhook mode, modular package layout, shared router).

Highlights

  • Zalo OA channel — OAuth consent flow, encrypted token storage, automatic refresh, polling fallback, post-restart catch-up sweep.
  • Webhook transport for Bot + OA — single shared router under /channels/zalo/webhook/<slug>, HMAC signature verify, replay-window check, per-instance dedup + rate limit.
  • Webhook URL UI — operators copy a paste-ready URL straight from the channel detail page; per-tenant host override stored locally.
  • Reactions on OA — debounced status reactions (thinking / done / error) tuned for B2C tone.
  • Channel renamezalo_oazalo_bot swap (old names were inverted vs. Zalo's own product taxonomy). Migration 000058 handles existing rows; SQLite has a parallel patch.
  • Tests — unit coverage for signature verify, webhook router, cursor/dedup, OAuth flow; integration tests for multi-instance routing, signature mismatch, webhook URL RPC.

Closes #966.

vanducng added 30 commits April 27, 2026 14:02
…ode flow

Phase 01 of plans/260419-2128-zalo-oa-oauth: introduces a third Zalo
transport (phone-number-tied Official Account via OAuth v4) alongside the
existing Bot OA and personal QR variants. New package
internal/channels/zalo/oauth/ implements the authorization-code exchange
against oauth.zaloapp.com; channel lifecycle is a stub until phase 02
wires lazy refresh and phase 03 wires Send.

Also fixes a pre-existing bug where WS-side isValidChannelType was missing
facebook + pancake (HTTP side already had them).

Refs: #966
…fety ticker

Phase 02 of plans/260419-2128-zalo-oa-oauth: adds the lazy-refresh token
source mirroring DBTokenSource (single mutex serializes both cache reads
and HTTP refresh — Zalo refresh tokens are single-use, so racing goroutines
must not be allowed to issue concurrent refreshes). Adds a 30-min safety
ticker so idle channels still keep their tokens alive between Sends.

Loader now injects channel_instances.id via duck-typed SetInstanceID,
reusing the same pattern as SetTenantID — needed by the refresh path to
persist rotated credentials back to the same row.

Refs: #966
Phase 03 of plans/260419-2128-zalo-oa-oauth: implements Channel.Send for
the OAuth-based Zalo OA channel. Text goes straight to /oa/message/cs;
image/file follow Zalo's 2-step upload-then-attach pattern with the
upload `token` referenced in the message attachment payload.

The send wrapper retries exactly once on auth-class API errors after
calling tokens.ForceRefresh, so a token rotated externally between
Send calls recovers transparently without an infinite loop.

Caption + Content alongside an attachment ride as a separate trailing
text message; if that trailing send fails after the attachment was
delivered, returns ErrPartialSend so callers can distinguish from a
full failure.

Refs: #966
Phase 04 of plans/260419-2128-zalo-oa-oauth: implements the inbound
path. A 15s ticker fans out to /v3.0/oa/listrecentchat then per-thread
/conversation, deduplicates against an in-memory pollCursor (LRU at
500 entries), and publishes new messages via BaseChannel.HandleMessage
with peerKind="direct" — Zalo OA has no groups.

Cursor is debounced-persisted (60s) into channel_instances.config
under a poll_cursor key, surviving channel restarts without replaying
old messages. Loader-side instance ID injection (added in phase 02)
is what makes the persistence path actually addressable.

Top-K fan-out cap (20 threads/cycle) keeps the API request budget
bounded under high-volume OAs. HTTP 429 trips a 30s backoff via
ticker.Reset and flips channel health to Degraded; the next clean
cycle restores Healthy and the original interval. Halt-on-reauth
guard skips the API entirely when health is Failed/Auth so we don't
hammer a dead token.

Refs: #966
…ero-byte)

Phase 05 of plans/260419-2128-zalo-oa-oauth: hardens the file outbound
path with sanitizeFilename (strips path, falls back on dot-only/empty,
caps length at 200, preserves unicode), zero-byte rejection, and an
admin-opt-in MIME deny list (FlexibleStringSlice on the channel config,
exact case-insensitive match — no glob).

All three rejections fire BEFORE the HTTP upload, so a denied call
never burns API quota or rotates the token. Sticker / quoted-reply /
request_user_info template sends remain deferred — they need a
channel-agnostic extension to bus.OutboundMessage that's tracked in
the phase doc as follow-up.

Refs: #966
Phase 06 of plans/260419-2128-zalo-oa-oauth: data-driven schema entries
for zalo_oauth credentials + config (rendered automatically by the
existing ChannelFields component) plus a new paste-code dialog that
drives the two-step OAuth flow:

  1. Calls channels.instances.zalo_oauth.consent_url to fetch the
     pre-built Zalo authorization URL (server-side keeps app_id
     out of the masked instance payload).
  2. User opens the URL, approves on Zalo, copies the redirect's
     `code` query param, pastes it back, and the dialog calls
     channels.instances.zalo_oauth.exchange_code which validates
     the CSRF state token and persists the rotated credentials.

Registered as a reauthDialog so the row-level Connect / Re-auth
button is wired by the existing channel-table machinery — no new
button code in the page itself.

Strings are English literals with TODO i18n markers; phase 07
wires translations.

Refs: #966
Phase 07 of plans/260419-2128-zalo-oa-oauth: front-end i18n roll-up
for the paste-code dialog (15 zaloOauth.* keys translated to en/vi/zh,
verified key-set parity via jq) and dialog rewired to use
useTranslation("channels") + t() instead of the placeholder English
literals it shipped with in phase 06.

Backend i18n parity already enforced by internal/i18n/i18n_test.go;
11 MsgZaloOAuth* constants present in all 3 catalogs as expected
(10 from plan + MsgZaloOAuthInvalidState added during the phase-01
audit fix).

Migration evaluation: no schema delta. Tokens live in the existing
channel_instances.credentials BLOB; the cursor lives in the existing
config TEXT; the new "zalo_oauth" channel_type is just data in the
existing channel_type column. RequiredSchemaVersion stays at 55,
SQLite SchemaVersion stays at 24.

Refs: #966
Phase 08 of plans/260419-2128-zalo-oa-oauth: most unit tests already
shipped via TDD across phases 01-06. Fills 2 gaps:

  - TestForceRefresh_ClearsCache: verifies tokenSource.ForceRefresh
    zero-time math triggers refresh on next Access.
  - TestPollOnce_AllowlistBlocksNonAllowedSender: proves the inbound
    path goes through BaseChannel.HandleMessage's allowlist check,
    not direct bus.Publish (phase-04 audit C1 verification).

Adds tests/integration/zalo_oauth_lifecycle_test.go covering the
full feature against real Postgres + a mocked Zalo API: row create
→ store-encrypts → Get round-trip → factory + Start → SendText →
ForceRefresh + send → invalid_grant → health flips Failed/Auth →
Stop bounded.

Two test-only methods (SetTestEndpointsForTest / ForceRefreshForTest)
are named with the *ForTest suffix per Go convention so accidental
production usage is grep-visible.

Also wires zalo_oauth into ui/web/src/constants/channels.ts so the
Add Channel dropdown actually shows the option — the dropdown is
hard-coded there, not derived from credentialsSchema keys. The
existing zalo_oa entry is relabeled "Zalo OA (Bot)" to disambiguate
the two channel types in the UI.

Refs: #966
Zalo enforces redirect_uri match against the dev-console-registered
callback (error_code=-14003 "Invalid redirect uri"). The hardcoded
placeholder shipped in phase 06 will never satisfy that check, so
any operator who didn't pre-register https://oa.local/zalo_oauth_callback
(i.e. everyone) was blocked at the consent screen.

Adds redirect_uri to ChannelCreds + the credentials wizard schema
as a required text field. The consent_url WS handler now reads it
from creds with the old hardcoded URL as a fallback (so existing
rows without the field don't break, even though they'll still hit
the -14003 error until the operator fills it in).

Refs: #966 (post-smoke fix)
Previously the zalo_oauth consent flow was only reachable by clicking
the auth icon on the row after Create. Users expect the same flow as
whatsapp / zalo_personal where Create transitions directly into the
authentication step of the wizard.

Splits the paste-code state machine into a shared hook
(use-zalo-oauth-connect) + a shared view (zalo-oauth-connect-body),
so both entry points (wizardAuthSteps + reauthDialogs) reuse the
same logic and can diverge only in their action-button layout.

wizardConfig.zalo_oauth.steps = ["auth"] wires the transition; the
reauth dialog is unchanged for operators who dismiss the wizard and
want to complete authentication later from the row.

Refs: #966 (post-smoke UX fix)
Two fixes surfaced during the first live Connect attempt:

  1. Zalo's live OAuth endpoint returns expires_in as a quoted string
     ("3600"), not a number. This broke the exchange with a JSON
     unmarshal error. Introduces a flexSeconds type whose custom
     UnmarshalJSON strips surrounding quotes before ParseInt, so
     both shapes decode correctly. TestTokenResponseShape gains a
     second assertion for the string form.

  2. Zalo redirects to the operator's callback with a long URL of
     the form https://example.com/zalo-callback?oa_id=...&code=iYP...
     &state=db8... — operators shouldn't have to extract code by
     hand. Adds an extractCode helper that accepts either a raw
     code or the full URL; when a URL is pasted, we pull out the
     code param and opportunistically compare the URL's state to
     the one we stashed from consent_url. Placeholder + help text
     updated in all 3 locales.

Refs: #966 (post-smoke fixes)
Zalo's POST /v4/oa/access_token returns tokens but not oa_id — that
rides back in the redirect URL query string alongside `code` and
`state`. Without it, creds.OAID stayed empty after a successful
exchange, so the post-reload Channel.Start() kept marking the
instance Degraded with "awaiting consent" / "no oa_id yet — paste
consent code to authorize" — even though the tokens WERE valid.

extractCode now returns oaID alongside code. The hook forwards it
as an optional oa_id param on the exchange_code WS call. The
handler persists it onto creds before Marshal → store.Update,
so the reloaded Channel sees a non-empty OAID and goes Healthy.

Refs: #966 (post-smoke fix)
Every poll cycle was 404-ing against Zalo with "You are accessing an
empty or invalid API". Two root causes, fixed together because they
rhyme:

  1. The phase-04 plan assumed listrecentchat/getconversation lived
     at /v3.0/oa/* per the ChickenAI SDK note that "v2.0 is
     discontinued". Live Zalo OA disagrees — those read endpoints
     remain on /v2.0/* (nh4ttruong/zalo-oa-api-wrapper confirms).
     Paths moved to /v2.0/oa/getlistrecentchat and /v2.0/oa/getconversation
     with the v2.0 convention of packing GET params into ?data={json}.

  2. access_token was riding as a URL query param; Zalo OA's OpenAPI
     expects it in an `access_token` header. Query-form token gets
     the 404 "empty or invalid API" response. Moved to the header
     everywhere (apiGet, apiPost, apiPostMultipart). Side benefit:
     the token no longer appears in any URL, eliminating the
     phase-03 L4 concern about *url.Error leaking the token through
     DNS/TLS errors.

Send (/v3.0/oa/message/cs) and upload (/v2.0/oa/upload/*) paths are
unchanged — those already matched the wrapper.

Refs: #966 (post-smoke fix — surfaced by repeated poll_failed 404s
in docker logs)
The phase-04 plan said `listrecentchat` / `conversation` live at `/v3.0/*`,
and my first post-smoke fix flipped to `/v2.0/*` AND added a `get` prefix
(as nh4ttruong/zalo-oa-api-wrapper uses for `user/getlist`). Both guesses
were wrong.

Empirically probed all four variants against live Zalo — unprefixed v2.0
paths return `error:-216 Access token is invalid` (endpoint exists, just
wants a valid token) while the other three return the generic `error:404`
endpoint-not-found response. So the correct paths are:

    GET /v2.0/oa/listrecentchat?data={json}
    GET /v2.0/oa/conversation?data={json}

No get prefix. Send (/v3.0/oa/message/cs) and upload (/v2.0/oa/upload/*)
paths remain unchanged — those match the wrapper.

Refs: #966 (post-smoke fix, verified against live API)
Live smoke found the poll path looping forever on -216
"Access token has expired (-155)" because the ticker just logged and
retried with the same stale token. Channel.post already does a retry-
once-on-auth via ForceRefresh; poll now mirrors the same pattern for
the listRecentChat call.

Zalo's documented token lifetime is 24h but operators have seen the
token rejected well inside that window (likely revoked app-side on
certain operations). Refresh-on-reject keeps the channel alive without
waiting for the 30min safety ticker to catch up.

If the second attempt also auth-fails, the existing downstream path
(ErrAuthExpired → markAuthFailedIfNeeded) flips health to Failed/Auth
so the operator sees a clear re-auth prompt instead of silent looping.

Refs: #966 (post-smoke fix)
Empirically verified via Zalo's API explorer that /v2.0/oa/listrecentchat
returns the last N messages across all users, with each data[] item
shaped as:

    {from_id, from_display_name, from_avatar,
     to_id,   to_display_name,   to_avatar,
     message_id, type, message, time}

Two bugs in the phase-04 design:

1. The `message` Go struct read the text field as "text" — Zalo's
   actual field name is "message". Every inbound was silently dropped
   at the `if m.Text == ""` guard in dispatchInbound, even when the
   HTTP call succeeded.

2. Modeled as "threads → per-thread getconversation fan-out", but
   listrecentchat already returns messages. getConversation + thread
   type + topKThreads cap are all dead. Simplified pollOnce to iterate
   messages directly, filter OA echoes (from_id == oa_id), dedup per-
   user by time.

Also:
- Added display-name passthrough in metadata.sender_display_name.
- DM chatID derives from from_id (Zalo OA is DM-only).
- Removed /v2.0/oa/conversation test routing and per-thread call tracking.
- Removed TestPollOnce_TopKThreadsCap (concept no longer applies).

Refs: #966 (post-smoke fix, verified via Zalo API explorer 2026-04-20)
…ll auth errors

Live smoke caught a silent drift: Zalo was rejecting the access_token
(-216 / -155 "Access token has expired") but the channel dashboard
stayed green because MarkHealthy was set at Start() and nothing ever
downgraded it. Operator saw "healthy" + grep-able WARN logs but no
visible re-auth prompt.

runPollLoop now calls markAuthFailedIfNeeded after every poll failure.
The helper was extended to treat *APIError.isAuth() as a Failed/Auth
trigger in addition to ErrAuthExpired:

  - ErrAuthExpired: refresh-token dead (existing behavior)
  - *APIError.isAuth() surviving retry-once-on-auth: access-token
    rejected even after ForceRefresh, which usually means the OA-app
    association is broken and needs operator re-consent

Either path now flips the row to red "Re-authenticate" in the UI, so
operators get the visible cue they were missing.

Refs: #966 (post-smoke fix, spotted by operator: "logs showed expired
but dashboard still green")
Operator's agent tried to reply with an image attachment and saw
"⚠️ Failed to deliver message." The image upload 404'd against
/v3.0/oa/upload/image; Channel.Send bubbled the error and upstream
sent the fallback text.

Empirical probe confirmed only /v2.0/oa/upload/image and
/v2.0/oa/upload/file exist; the v3.0 variants are 404. The message-
send endpoint /v3.0/oa/message/cs stays unchanged (it's the one
path that genuinely IS on v3.0).

Moved uploadImagePath + uploadFilePath constants to /v2.0/*. Test
fixtures updated accordingly.

Refs: #966 (post-smoke fix — bot "Failed to deliver" after agent
generated an image and tried to attach it)
…l cap

Zalo OA's upload endpoint hard-rejects files over 1MB with
error -210 "file is invalid. The file must be smaller than or equal 1MB".
The previous default of 10MB let uploads through to the HTTP call
before Zalo bounced them — wasted API budget and confusing error
for operators.

Changing the default to 1MB rejects oversized files BEFORE the upload
attempt, with a message that explicitly names Zalo's -210 cap so the
operator understands this isn't a bug in our code.

Operators who want to send bigger media have two options, both out
of scope for this commit:
  (a) compress images client-side before upload
  (b) use Zalo's URL-attachment template (image-by-URL instead of
      file upload) — requires a public HTTPS endpoint serving the
      workspace files

Refs: #966 (post-smoke fix — "⚠️ Failed to deliver" was a 1.36MB
AI-generated PNG rejected by Zalo's 1MB cap)
… routing

Zalo OA enforces distinct format + size caps per upload endpoint
(confirmed against the official docs the operator pointed us to):

  - /v2.0/oa/upload/image : JPG/PNG only, ≤1MB
  - /v2.0/oa/upload/gif   : GIF only, ≤5MB
  - /v2.0/oa/upload/file  : PDF/DOC/DOCX only, ≤5MB

AI-generated PNGs routinely exceed 1MB, so the outbound path now
auto-compresses images before the upload call. Pipeline:

  1. Decode input bytes (png / jpeg / webp).
  2. Resize progressively (longest side 1600 → 1200 → 900 → 600).
  3. At each size, JPEG-encode with quality ladder 85 → 35.
  4. Return the first encoding that fits under the cap.
  5. Fail loudly with original + compressed dims if even the lowest
     quality at the smallest size exceeds the cap.

Channel.Send routes by MIME before upload:

  - image/gif → SendGIF (new) → /upload/gif
  - image/*   → compressForZaloImage + SendImage → /upload/image
  - PDF/DOC/DOCX → SendFile → /upload/file (size pre-check at 5MB)
  - anything else → reject at the dispatcher

Also tightens the stat-first check to 50MB (absolute OOM guard) so
compression gets a chance to run on mid-sized files. Operators who
set cfg.MediaMaxMB still see it honored on the file path — images
always compress-to-fit.

Adds github.com/disintegration/imaging + golang.org/x/image/webp for
the compression pipeline (already pulled in via agent/media_sanitize).
Three new unit tests cover the compressor's passthrough, shrink-over-
cap, and decode-error paths.

Refs: #966 (post-smoke fix — bot "Failed to deliver" on 1.36MB PNG)
Live smoke caught periodic poll_failed with "Client.Timeout exceeded
while awaiting headers" — the 15s default wasn't enough for Zalo's
tail latency under load, and the default transport was letting
idle connections go stale between poll cycles.

Client.Timeout bumped to 30s. Custom http.Transport with:
  - bounded MaxIdleConns/MaxIdleConnsPerHost (10/4) so we don't
    keep a connection pool bigger than the poll cadence needs
  - IdleConnTimeout = 60s so stale connections get evicted before
    the next 15s poll can try to reuse a dead one
  - TLSHandshakeTimeout = 10s so a slow TLS dial aborts fast
  - ForceAttemptHTTP2 = true (matches Zalo's actual offering)
  - ProxyFromEnvironment so HTTPS_PROXY honored if set

Also bumped the runPollLoop per-cycle ctx to 45s so it always
outlives the 30s client timeout — otherwise the ctx would fire
first and mask the real "slow upstream" signal with a generic
"context deadline exceeded".

Refs: #966 (post-smoke fix — "Client.Timeout exceeded while awaiting
headers" recurring in poll cycles)
…aw on empty token

Live smoke showed compression working (1.26MB PNG → 110KB JPEG) but
the upload itself came back with HTTP 200 + empty data.token. Digging
in: my uploadImage passed multipart filename "image" (no extension),
and Zalo's /v2.0/oa/upload/image seems to use the filename extension
to validate payload type — strip the extension and the endpoint
accepts the bytes but returns no token.

Fix: SendImage now forwards the real MIME through to uploadImage,
which picks "image.jpg" for image/jpeg and "image.png" for image/png.

Also: parseUploadToken now embeds a 500-char prefix of the raw
response in its error when data.token is missing, so the next-time
triage sees what Zalo actually returned instead of the generic
"missing data.token".

Refs: #966 (post-smoke fix — "upload response missing data.token"
after successful compression)
Live smoke captured Zalo's actual upload response:
  {"data":{"attachment_id":"1I5sCR-..."}, "error":0, "message":"Success"}

Both the plan and our code used "token" (inherited from the ChickenAI
SDK documentation). The real wire name is `attachment_id`, which is
why uploads succeeded (HTTP 200 + error=0) but our parser returned
"missing data.token" and the send never happened.

Fix threads the correct name through:

  - parseUploadToken renamed to parseUploadAttachmentID. Primary
    field is `attachment_id`; keep `token` as a legacy fallback for
    defensive forward-compat if Zalo ever adds an alias.
  - SendImage, SendFile, SendGIF outbound payloads now carry
    `{"attachment_id": id}` instead of `{"token": id}` — matches
    what the upload endpoint returned and what Zalo's own send
    endpoint almost certainly expects.

Test fixtures updated in bulk (8 replacements).

Refs: #966 (post-smoke fix — raw response added by previous commit
made this immediately diagnosable)
Zalo returns -201 "Params is invalid" on the simple
{"attachment":{"type":"image","payload":{"attachment_id":"..."}}} shape.

Confirmed against the nh4ttruong/zalo-oa-api-wrapper Python reference
and the esms.vn integration docs: /v3.0/oa/message/cs wants the
template/media shape for uploaded media:

  {
    "recipient": {"user_id": "..."},
    "message": {
      "attachment": {
        "type": "template",
        "payload": {
          "template_type": "media",
          "elements": [{
            "media_type": "image",          // or "gif"
            "attachment_id": "<upload-id>"  // or "url" for URL-based
          }]
        }
      }
    }
  }

Extracted the common construction into buildMediaAttachmentBody so
image + gif reuse the shape. SendFile stays on its original simple
form {"type":"file","payload":{"attachment_id":"..."}} — that's what
the file endpoint wants (template/media is image-only).

Test assertions updated to walk the template→payload→elements→
media_type/attachment_id path.

Refs: #966 (post-smoke fix — -201 after attachment_id wire-up)

Sources:
  nh4ttruong/zalo-oa-api-wrapper/dependencies/messages.py
  developers.esms.vn zalo send-with-photo docs
…/GIF/File

TDD anchor for Phase 02 Workstream A consolidation. Locks exact JSON bytes
each Send* function produces against /v3.0/oa/message/cs so the upcoming
builder unification (A3) cannot silently drift.

JSON-canonicalizes captured request body + testdata expected-shape via
unmarshal + remarshal (Go sorts map keys deterministically) so field order
differences don't cause false failures but structural differences do.

Fixtures cover text, image (template/media shape), gif (template/media
shape, media_type=gif), file (plain type=file shape).
…nts.go

Consolidate scattered URL literals + path constants into one file.
Before: defaultAPIBase/defaultOAuthBase in api.go, sendMessagePath in
send.go, uploadImagePath/uploadFilePath/uploadGIFPath in upload.go,
and /v2.0/oa/listrecentchat + /oa/access_token inlined in poll.go/auth.go.
Now: all eight constants live in endpoints.go grouped by API family
(v3.0 send, v2.0 read+upload, v4 OAuth) with comments explaining why
each version prefix is load-bearing.

Pure consolidation — no endpoint values changed. A5 fixture test
continues to pass, confirming no wire-shape drift.
Introduce errors.go registering 5 observed code families with exact
Zalo semantics preserved: access-token-invalid (216/-216/401/-401
family), invalid_grant (-118), params-invalid (-201), file-size-exceeded
(-210), invalid-redirect-uri (-14003).

Current-use:
- api.go isAuth() switches from inline `case 216, -216, 401, -401:` to
  the named helper isAccessTokenInvalid(code). Same semantics, named.
- Other codes registered for future code-based routing; today several
  are detected via substring (classifyRefreshError) or appear only in
  error-message strings. Documented in each constant's comment.

NOT introducing -155 per audit — it is absent from the live codebase.
Extract the 4 body-construction shapes from SendText/SendImage/SendGIF/
SendFile into three named builders living together in send.go:
- buildTextBody — text-only shape
- buildMediaAttachmentBody — template/media shape (image + gif)
- buildFileAttachmentBody — plain type=file shape

Each Send* now reads as a one-liner: upload (when applicable) → build →
post. Drift between the 4 shapes is obvious on read. No byte change —
A5 wire-shape fixture test (send_fixture_test.go) stays green,
confirming every outbound request is still byte-identical to pre-refactor.

NOTE: caption+attachment single-request flow (merge the current
"trailing text as second message" round-trip) is out of scope — plan
defers to a follow-up. Builders intentionally don't accept text params
so no dead code is introduced.
Zalo's per-endpoint upload caps (image 1MB, file 5MB, gif 5MB) are
hard-enforced by Zalo itself with error -210. The config.MediaMaxMB
override was never read after the New() clamp — pure dead code.
FileDenyMIME was a belt-and-suspenders duplicate of the narrower
isZaloSupportedFileMIME allow-list; removing it collapses two layers
into the single authoritative source.

Removed:
- ZaloOAuthConfig.MediaMaxMB + FileDenyMIME fields
- oauth/channel.go defaultMediaMaxMB const + clamp
- oauth/send.go isMIMEDenied helper + SendFile denial branch
- SendFile's mime param (caller did the supported-type check already)
- MsgZaloOAuthFileDenied i18n key + 3 catalog entries
- TestNew_DefaultMediaMaxMB + TestSendFile_{RejectsDeniedMIME,PassesAllowedMIME}
- file_deny_mime + media_max_mb from the zalo_oauth frontend schema

Config backcompat: existing channel_instances JSON payloads carrying
stale file_deny_mime / media_max_mb keys load cleanly — Go json
silently ignores unknown fields.

A5 fixture test still green.
Single env-gated slog.Debug line in doRequest + postForm captures the
raw Zalo response body for every API call. Default off (zero runtime
cost — just a bool check). Set GOCLAW_ZALO_OA_TRACE=1 to turn on.

Rationale: Zalo has shipped field-name changes mid-release before
(token → attachment_id) and debugging required live smoke. With this
trace, 5 minutes of production logs give definitive triage data.

SECURITY: response bodies may contain PII (user IDs, display names,
phone numbers). Documented inline. Do NOT enable in production without
scrubbing review.

Cached at init — flipping the env live requires a restart, which keeps
the hot path allocation-free.
vanducng added 25 commits May 1, 2026 04:26
…ation

Create dialog flagged "Required: Webhook Path" even when Ingestion Mode
was Polling because the required filter ignored showWhen visibility.
Extracted isFieldVisible helper, applied to both creds and config submit
checks, and reused it in the renderer to drop duplicated showWhen logic.
Credentials check now also pulls config values into the visibility
context so cross-schema dependencies (e.g. webhook_secret on transport)
resolve correctly.
Polling works out of the box without a public endpoint and has been the
reliable path while Zalo Bot Platform's webhook delivery remains opaque.
Mark Polling (recommended) and reorder so it's selected by default in
the Create dialog.
Zalo Bot Platform rejects getUpdates with HTTP 400 while a webhook URL is
registered. Channels that switch from webhook→polling (or polling
channels created against a bot that previously had its webhook set out-
of-band) loop on the 400 forever. Call deleteWebhook on Start before
spinning up the polling loop. Best-effort: log and continue on error;
the polling loop's first call would surface the same condition anyway.
…ten lifecycle

- webhook_signature_mode default flips strict→disabled so a freshly
  created OA channel processes events before the operator pastes the
  Khoá bí mật OA. Operators opt into log_only (migration) or strict.
- Reaction tombstone replaced time.AfterFunc with a goroutine guarded
  by reactionWG and stopCh so Stop drains pending tombstones.
- Token source / webhook attachments / signature tests cleaned up
  alongside.
- i18n help text in en/vi/zh updated to match the new defaults and
  reflect Zalo's listrecentchat page-size cap (10).
- Replace string-grep error checking with errors.As pattern matching
- Add APIError struct with Code field for semantic error handling
- Update poll.go to use isAPIErrCode() instead of substring search
- Add NewSSRFSafeClient() with DialContext IP validation + redirect checks
- Prevents DNS rebind TOCTOU and 3xx-to-private-IP bypasses
- Use in zalo OA media download instead of plain http.Client
- poll_count: 50 → 10 (Zalo hard-caps at 10; error -210 if exceeded)
- poll_burndown_max_pages: 5 → 10 (allow longer burndown cycles)
- Update message ceiling math: 250 → 100 messages/cycle
- Privacy cleanup: dataplanelabs.com → example.com in test fixtures
- channel-advanced-dialog: filter unknown keys (e.g. removed webhook_url)
  out of existingConfig before MergeConfig so legacy DB rows stop being
  re-posted
- channels-status-utils + contacts-page: add zalo_bot to label map and
  channel-type filter list
- use-zalo-oa-connect: per-effect cancelled flag on consent fetch closes
  open→close→open race that aliveRef alone didn't cover
Replace hardcoded 169.254.169.254 with parameterized checks
in SSRF validation. CIDR-based link-local and .internal suffix
matching remain unchanged, preserving attack surface coverage.
This reverts commit 2efae27. 169.254.169.254 is the public,
universally-documented cloud-metadata service address (AWS, GCP,
Azure, etc.) — keeping it as a literal makes SSRF defense intent
explicit and matches OWASP convention.
One-shot scraper not wired into CI; docs/zalo-error-codes.md is checked in
and updated manually when needed.
…enant ctx

- store: loadExistingCreds surfaces decrypt/unmarshal errors and is
  tenant-scoped (would otherwise silently wipe other credential fields
  on a partial update merge)
- zalo bot: SSRF-check inbound photo URLs before downloadMedia
- zalo oa: one tombstone goroutine per reaction controller via sync.Once
- zalo webhook router: per-instance ctx carries tenant id
- zalo oa poll: warn instead of silently dropping no-dedup messages
- tools.CheckSSRF: reject non-http(s) schemes
- web ui: validate webhook host URL before persisting; clear stale
  exchange error on code retype
- docs: webhook_signature_mode defaults to disabled (operators must
  flip to strict for production)
The PG/SQLite channel-type rename (commits 09f2a60, 32375be) is being
held back; revert the SQLite v26 migration so SchemaVersion stays at 26
and operators don't get partway through the swap. Also drop the now-stale
"phase 05/06" comments in config_channels.go.
Nginx was returning 405 Not Allowed for POST /channels/zalo/webhook/<slug> requests because they matched the SPA fallback `try_files` rule. Added explicit proxy location before the fallback to route webhook traffic to the backend.

The /channels/zalo/webhook/ prefix covers both Zalo OA and Bot variants. Feishu and Pancake webhook channels have the same latent issue but are out of scope for GH-966.
Zalo Bot webhook handler parsed request body as flat {event_name, message}
struct, but per official spec (bot.zapps.me/docs/apis/webhook) platform
wraps every push in {ok, result: {event_name, message}}. Result: every
Bot inbound event silently dropped (zero-value zaloUpdate, nil Message),
dedup disabled because botMessageIDExtractor read wrong path
(message.message_id vs result.message.message_id).

Fix: HandleWebhookEvent unmarshals into zaloAPIResponse (existing type),
short-circuits on ok=false or empty result, then unmarshals result into
zaloUpdate. botMessageIDExtractor reads result.message.message_id.

Polling path unaffected — getUpdates already strips envelope inside
callAPIWith. Zalo OA unaffected (different signature/payload schema).

Tests: fixtures rewritten to wrapped shape; added docs-sample payload
test; added ok=false drop test; legacy unwrapped shape guard (prevents
silent schema drift).
Zalo posts webhook events directly ({event_name, message, ...}) — the
{ok, result} envelope only applies to polling getUpdates responses.
Previous unwrap-only path silently dropped real Zalo pushes (returned
200 but no inbound dispatch). Try the envelope first, fall back to
direct unmarshal so a future shape change won't black-hole traffic.
resolver 127.0.0.11 is Docker-only — in Kubernetes nginx couldn't
resolve the goclaw Service ('Connection refused while resolving') and
returned 502 on every /channels/zalo/webhook/<slug> POST. Other proxy
paths (/ws, /v1, /health) never went through nginx (Traefik routes
them direct to the gateway), so the latent bug surfaced only when the
new webhook proxy block was added in e6efaad.

Drop the resolver + variable form; use a static upstream block. nginx
resolves at config-load against the pod's /etc/resolv.conf — works in
both K8s (kube-dns) and docker-compose (Docker DNS).
…ation (GH-966)

- Rename Ingestion Mode → Connection Mode for both Zalo channels (i18n en/vi/zh)
- Zalo Bot defaults to webhook; option label "Webhook (recommended)"
- Webhook Secret now pasted from bot.zapps.me; Generate button removed
- Drop unused FieldDef.generatable, generate-secret util, and fieldConfig.generate i18n keys
SQLite v27 rename lockstep with PG migration 58 · bot SSRF-safe media
client · bot Stop drains polling WaitGroup · getUpdates batch+single
envelope decode · OA reaction stopCh gate + tombstone Once · reject
oa_id mismatch on paste · pg credentials read-modify-write in tx with
SELECT FOR UPDATE · migration 58 down trim oauth · UI form post-save
setValues + autoComplete · tenant-scoped webhook localStorage · nginx
upstream behavior doc note.
- oa/reactions: re-check stopCh under lock in SetStatus to close
  WaitGroup wg.Add-after-Wait panic window
- oa/webhook: explicit oa_send_* drop; prefer message.time over
  envelope timestamp for cursor advance to avoid skipping later
  messages on burst
- bot/typing: clarify race-window comment
- common/webhook_router: move per-instance rate limit after
  signature verify so a guessed slug can't burn the bucket
- gateway/methods: log security.cross_tenant_access_attempt on
  zalo_oa.consent_url, zalo_oa.exchange_code, zalo.webhook_url
- migrations 000058 down: document rollback-with-binary-revert caveat
- ui/web zalo-webhook-url-section: cancelled flag for stale setData
- ui/web use-webhook-host: gate localStorage read on tenantId to
  prevent cross-tenant flash
- channel-fields: autoComplete="off" on non-password text inputs so
  browsers don't suggest unrelated saved values into webhook URLs /
  app IDs
- oa/token_source: distinct slog.Info on pre-authorization refresh
  attempt with has_oa_id flag so ops can tell "never consented"
  apart from "consent dropped mid-flow"
@clark-cant
Copy link
Copy Markdown

🔍 Code Review — feat(channels/zalo): OA login flow + webhook transport (#966)

🎯 Tổng quan

Adds Zalo OA channel with login flow v4 + webhook transport, brings Zalo Bot to parity (shared router, modular layout), renames zalo_oa ↔ zalo_bot to match Zalo taxonomy.

Scope: ~55 files, ~7.5k LOC — large but focused on single feature


🤖 CI Status + Merge Conflicts

CI:All checks passed — go: pass (4m15s), web: pass (53s)
Merge conflicts: ✅ No conflicts (MERGEABLE, mergeStateStatus: CLEAN)


✅ Điểm Tốt

  1. Phased implementation với live smoke testing. 20+ post-smoke commits — author test thực tế trên Zalo API thật, không phải đoán mò.

  2. Security đúng chuẩn. HMAC signature verify, replay-window check, AES-256-GCM encrypted tokens, per-instance dedup + rate limit, MIME deny list.

  3. Shared webhook router. Bot + OA dùng chung router /channels/zalo/webhook/<slug> — DRY, dễ maintain.

  4. Token refresh single-flight. Single mutex serializes cache reads + HTTP refresh — Zalo refresh tokens là single-use, correct handling.

  5. Polling với bounded LRU cursor. 15s ticker, 20 threads/cycle cap, 429 backoff, stop-on-re — production-ready.

  6. Compress-before-upload. Progressive resize + quality ladder — giải quyết Zalo 1MB cap.

  7. Integration test đầy đủ. Real Postgres + mocked Zalo: create → encrypt → Start → Send → ForceRefresh → invalid_grant → health Failed → Stop.

  8. Self-documented audit trail. Mỗi commit refs Support Zalo OA for non-bot (phone-number) accounts #966 với root cause → fix → verification.


🔴 CRITICAL — Migration Conflict với PR #1057

PR này dùng migration 000058 (rename zalo_oa ↔ zalo_bot).
PR #1057 (Bitrix24) cũng dùng migration 000058 (bitrix_portals table).

Cả 2 PR không thể merge cùng lúc mà không renumber. Nếu merge PR này trước, #1057 phải renumber 000058 → 000059. Ngược lại PR này phải renumber.

Fix: Coordination giữa @vanducng@tech-synity — ai merge trước, người kia renumber migration + update RequiredSchemaVersion.


🟡 MEDIUM

  1. RequiredSchemaVersion audit. PR body nói stays at 55 — nhưng nếu rebase lên dev hiện tại (đã có 000056, 000057, 000058), cần update.

  2. Channel rename rollback. Verify 000058.down.sql swap lại đúng thứ tự.

  3. Post-smoke commits nên squash. 20+ fix commits → squash thành 1-2 cho cleaner history.


🟢 LOW

  1. i18n TODO markers — nên follow up complete translations.
  2. Sticker / quoted-reply / request_user_info deferred — acceptable.

📊 Summary

Severity Count Issues
🔴 CRITICAL 1 Migration 000058 conflict với PR #1057
🟡 MEDIUM 3 RequiredSchemaVersion audit, rename rollback, squash commits
🟢 LOW 2 i18n TODO, deferred features

💡 Recommendation

🟢 APPROVE (với 1 điều kiện)

Zalo OA integration chất lượng cao, security đúng pattern, testing kỹ, live smoke fixes thorough. Code clean, architecture solid.

Điều kiện: Resolve migration conflict với PR #1057 trước khi merge. Cần coordination giữa @vanducng@tech-synity để renumber một trong hai migrations.

Great work @vanducng! Zalo OA là channel quan trọng cho B2C use case. Implementation production-ready. 🚀

vanducng added 2 commits May 1, 2026 19:46
Off by default. Operators who want CS-style threaded replies can
opt in via quote_user_message=true. Existing instances with
explicit `true` keep current behavior.

- channels/zalo/oa/channel.go: QuoteInboundOnDM returns false when
  cfg.QuoteUserMessage is unset.
- channels/zalo/oa/send_quote_test.go: 'unset_defaults_off' case
  expects false.
- config/config_channels.go: field comment notes default false.
- ui/web/src/pages/channels/channel-schemas.ts: defaultValue=false
  + help text adjusted.
…H-966)

Prevents the heart/sad reaction from firing immediately when the agent reply
lands. Terminal reactions (done/error) now scheduled via time.AfterFunc with
jittered delay (reaction_terminal_delay_min_ms/max_ms, defaults 800–2000ms),
scaled by reply character count (+1ms/char, capped +1.5s). Reuses debounce
timer slot so Stop/ClearReaction still cancel cleanly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Zalo OA for non-bot (phone-number) accounts

2 participants