Discord image attachments are silently dropped on text-only model providers #1142

@benhoverter

Summary

When a Discord user sends an image attachment to an OpenFang agent — either with a caption or bare — the attachment is silently discarded before it reaches the model on text-only providers (e.g. the claude_code driver). The model then either:

  • (captioned case) receives only the caption text and confabulates an acknowledgement of "the image" it never saw, or
  • (bare-image case) receives nothing at all — the message is dropped before dispatch and the agent appears unresponsive.

Vision-capable providers are unaffected in principle, but the parser path also mishandled the bare-image shape, so multimodal dispatch was incomplete in practice.

Reproduction

  1. Configure an OpenFang agent on a Discord channel with a text-only provider (claude_code driver, or any provider without vision).
  2. Case A: DM the agent "look at this" + a PNG attachment.
  3. Case B: DM the agent a PNG attachment with no message body.

Observed

  • Case A: Agent responds as if to a text-only message and refers to "the image," whose contents it hallucinates from prior context or from the caption alone.
  • Case B: No response. Daemon log shows the inbound MESSAGE_CREATE payload but no dispatch downstream.

Expected

  • Case A: Model receives the caption and a coherent indication that an image was attached, with enough metadata (mime, size) to acknowledge it without confabulation.
  • Case B: Model receives a coherent indication that a bare image was sent.
  • Vision-capable providers receive the actual image bytes as a ContentBlock::Image for true multimodal dispatch.

Root cause

Two defects in crates/openfang-channels/src/discord.rs::parse_discord_message:

  1. Bare-image drop. An early if content.is_empty() { return None; } killed any message whose body was empty, regardless of attachment count. Bare-image posts never reached the bridge.
  2. Caption-wins drop. When both text and attachments were present, only the text was preserved as ChannelContent::Text; attachments were discarded silently.

There was also no representation in ChannelContent for "a caption plus one or more attachments as a coherent unit," so even a corrected parser would have had no destination type to emit into.
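
To make the two defects concrete, here is a minimal sketch of the pre-fix control flow as described above. It is not the actual code from discord.rs: the simplified ChannelContent and Attachment types, the function name parse_discord_message_old, and its signature are all assumptions made for illustration.

```rust
// Hypothetical, simplified stand-ins; the real types in openfang-channels may differ.
#[derive(Debug, PartialEq)]
enum ChannelContent {
    Text(String),
}

struct Attachment {
    filename: String,
}

/// Approximates the pre-fix behavior described in this issue (assumed signature).
fn parse_discord_message_old(
    content: &str,
    attachments: &[Attachment],
) -> Option<ChannelContent> {
    // Defect 1 (bare-image drop): an empty body returns early,
    // even when the message carries attachments.
    if content.is_empty() {
        return None;
    }
    // Defect 2 (caption-wins drop): attachments are never consulted,
    // so only the caption text survives.
    let _ = attachments;
    Some(ChannelContent::Text(content.to_string()))
}
```

With a single PNG attachment, an empty body yields None (Case B), and a captioned message yields only Text("look at this") (Case A). In neither case does the attachment reach the bridge.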

Proposed fix

End-to-end vertical slice across the channel + runtime layer:

  • ChannelContent::Multipart(Vec<ChannelContent>) — new variant for caption + attachment(s) as sibling blocks. Nesting forbidden by doc + debug_assert.
  • Discord parser — classify attachments by MIME (with a filename-extension fallback for bot-relayed payloads that omit content_type) under a 5 MB vision-size cap matching Anthropic's image-block limit. Vision-eligible images become Image; everything else becomes File. Emit Multipart whenever text + attachments coexist or multiple attachments are present.
  • Bridge — flat-map Multipart in both dispatch paths: into Vec<ContentBlock> for multimodal-capable providers, and into a newline-joined text descriptor for text-flatten providers.
  • Telegram channel — exhaustive-match parity for the new variant; defensive flatten on outbound.
  • claude_code driver — render Image blocks as [attachment: <mime> image, ~N KB — not viewable on this provider] instead of dropping them. The model still cannot see the image, but it can acknowledge it coherently rather than confabulate.
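
The parser and text-flatten bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed patch: the type shapes, the function names (classify, parse, flatten_to_text), the set of vision-eligible MIME types, and the File marker text are all hypothetical. Only the 5 MB cap, the Multipart emission rule, and the Image marker wording come from this issue.

```rust
// Illustrative sketch only: every type and function here is a hypothetical
// stand-in, not the actual OpenFang definition.

/// 5 MB vision-size cap named in the proposal.
const VISION_MAX_BYTES: u64 = 5 * 1024 * 1024;

#[derive(Debug, Clone, PartialEq)]
enum ChannelContent {
    Text(String),
    Image { mime: String, size_bytes: u64 },
    File { filename: String, size_bytes: u64 },
    /// Caption plus attachment(s) as sibling blocks; must not nest.
    Multipart(Vec<ChannelContent>),
}

struct Attachment {
    filename: String,
    content_type: Option<String>, // may be missing on bot-relayed payloads
    size_bytes: u64,
}

/// Assumed vision-eligible set; the real list may differ.
fn is_vision_mime(mime: &str) -> bool {
    matches!(mime, "image/png" | "image/jpeg" | "image/gif" | "image/webp")
}

/// Filename-extension fallback used when content_type is absent.
fn mime_from_extension(filename: &str) -> Option<&'static str> {
    let ext = filename.rsplit_once('.')?.1.to_ascii_lowercase();
    match ext.as_str() {
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        "gif" => Some("image/gif"),
        "webp" => Some("image/webp"),
        _ => None,
    }
}

/// Vision-eligible images under the cap become Image; everything else becomes File.
fn classify(a: &Attachment) -> ChannelContent {
    let mime = a
        .content_type
        .clone()
        .or_else(|| mime_from_extension(&a.filename).map(str::to_string));
    match mime {
        Some(m) if is_vision_mime(&m) && a.size_bytes <= VISION_MAX_BYTES => {
            ChannelContent::Image { mime: m, size_bytes: a.size_bytes }
        }
        _ => ChannelContent::File { filename: a.filename.clone(), size_bytes: a.size_bytes },
    }
}

/// Emits Multipart when text and attachments coexist or several attachments are present.
fn parse(content: &str, attachments: &[Attachment]) -> Option<ChannelContent> {
    let mut parts = Vec::new();
    if !content.is_empty() {
        parts.push(ChannelContent::Text(content.to_string()));
    }
    parts.extend(attachments.iter().map(classify));
    // Nesting is forbidden: no part may itself be a Multipart.
    debug_assert!(parts.iter().all(|p| !matches!(p, ChannelContent::Multipart(_))));
    if parts.len() > 1 {
        Some(ChannelContent::Multipart(parts))
    } else {
        parts.pop() // a single part is returned as-is; no parts yields None
    }
}

/// Newline-joined text descriptor for text-flatten providers.
fn flatten_to_text(content: &ChannelContent) -> String {
    match content {
        ChannelContent::Text(t) => t.clone(),
        ChannelContent::Image { mime, size_bytes } => format!(
            "[attachment: {} image, ~{} KB — not viewable on this provider]",
            mime,
            size_bytes / 1024
        ),
        // The File marker wording is an assumption; the issue only specifies the Image marker.
        ChannelContent::File { filename, size_bytes } => {
            format!("[attachment: file {}, ~{} KB]", filename, size_bytes / 1024)
        }
        ChannelContent::Multipart(parts) => {
            parts.iter().map(flatten_to_text).collect::<Vec<_>>().join("\n")
        }
    }
}
```

Under these assumptions, a captioned 200 KB PNG parses to a Multipart of Text and Image, and flatten_to_text renders it as the caption followed by the marker on its own line. A bare attachment with no content_type but a .PNG filename is still classified as Image via the extension fallback, while HEIC and oversize images fall through to File.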

Out of scope (follow-ups)

  • Vision-provider dispatch refinements beyond exposing the existing image bytes.
  • Non-Discord channel parity for inbound attachment classification (Telegram inbound is unchanged here; only the outbound Multipart arm was added).
  • Any handling of files larger than the 5 MB vision cap beyond classifying them as File and rendering the marker.

Test plan

  • 9 new unit tests in the discord parser covering all (text-empty, n-attachments) shapes plus MIME edge cases (HEIC, oversize, missing content_type).
  • 2 new unit tests in the claude_code driver covering captioned and bare-image marker rendering.
  • Manual smoke test, both shapes, end-to-end (Discord → daemon log → model prompt).
