
Discord image attachments are silently dropped on text-only model providers #1142

@benhoverter

Summary

When a Discord user sends an image attachment to an OpenFang agent backed by a text-only provider (e.g. the claude_code driver), with or without a caption, the attachment is silently discarded before it reaches the model. The model then either:

  • (captioned case) receives only the caption text and confabulates an acknowledgement of "the image" it never saw, or
  • (bare-image case) receives nothing at all — the message is dropped before dispatch and the agent appears unresponsive.

Vision-capable providers should be unaffected in principle, but they share the same parser path, which mishandled the bare-image shape, so multimodal dispatch was incomplete in practice.

Reproduction

  1. Configure an OpenFang agent on a Discord channel with a text-only provider (claude_code driver, or any provider without vision).
  2. Case A: DM the agent "look at this" + a PNG attachment.
  3. Case B: DM the agent a PNG attachment with no message body.

Observed

  • Case A: Agent responds as if to a text-only message and refers to "the image" with details hallucinated from prior context or from the caption alone.
  • Case B: No response. Daemon log shows the inbound MESSAGE_CREATE payload but no dispatch downstream.

Expected

  • Case A: Model receives the caption and a coherent indication that an image was attached, with enough metadata (mime, size) to acknowledge it without confabulation.
  • Case B: Model receives a coherent indication that a bare image was sent.
  • Vision-capable providers receive the actual image bytes as a ContentBlock::Image for true multimodal dispatch.

Root cause

Two defects in crates/openfang-channels/src/discord.rs::parse_discord_message:

  1. Bare-image drop. An early if content.is_empty() { return None; } killed any message whose body was empty, regardless of attachment count. Bare-image posts never reached the bridge.
  2. Caption-wins drop. When both text and attachments were present, only the text was preserved as ChannelContent::Text; attachments were discarded silently.

There was also no representation in ChannelContent for "a caption plus one or more attachments as a coherent unit," so even fixing the parser had no destination type to emit into.
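
The two defects can be reproduced in isolation with a minimal sketch of the pre-fix control flow. This is not the actual parse_discord_message body: InboundMessage, its fields, and the single-variant ChannelContent are hypothetical stand-ins chosen only to mirror the behaviour described above.

```rust
// Hypothetical, simplified stand-ins; the real types in openfang-channels differ.
#[derive(Debug, PartialEq)]
enum ChannelContent {
    Text(String),
}

struct InboundMessage {
    content: String,
    attachment_count: usize,
}

/// Pre-fix control flow as described in the root-cause list (assumed shape).
fn parse_before_fix(msg: &InboundMessage) -> Option<ChannelContent> {
    // Defect 1: an empty body returns early, even when attachments exist.
    if msg.content.is_empty() {
        return None;
    }
    // Defect 2: attachments are never consulted, so only the caption survives.
    let _ignored = msg.attachment_count;
    Some(ChannelContent::Text(msg.content.clone()))
}
```

With this shape, a bare image yields None (Case B) and a captioned image yields only Text (Case A), regardless of how many attachments arrived.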

Proposed fix

End-to-end vertical slice across the channel + runtime layer:

  • ChannelContent::Multipart(Vec<ChannelContent>) — new variant for caption + attachment(s) as sibling blocks. Nesting forbidden by doc + debug_assert.
  • Discord parser — classify attachments by MIME (with a filename-extension fallback for bot-relayed payloads that omit content_type) under a 5 MB vision-size cap matching Anthropic's image-block limit. Vision-eligible images become Image; everything else becomes File. Emit Multipart whenever text + attachments coexist or multiple attachments are present.
  • Bridge — flat-map Multipart in both dispatch paths: into Vec<ContentBlock> for multimodal-capable providers, and into a newline-joined text descriptor for text-flatten providers.
  • Telegram channel — exhaustive-match parity for the new variant; defensive flatten on outbound.
  • claude_code driver — render Image blocks as [attachment: <mime> image, ~N KB — not viewable on this provider] instead of dropping them. The model still cannot see the image, but it can acknowledge it coherently rather than confabulate.
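
A rough sketch of how the parser and text-flatten pieces above could fit together follows. It rests on several assumptions: the Attachment struct and its field names are hypothetical, the Image variant carries only metadata rather than bytes, "5 MB" is read as 5 * 1024 * 1024 bytes, the vision-eligible MIME list is a guess at what the cap is meant to cover, and the File marker text is invented for illustration. Only the Image marker wording is taken from this issue.

```rust
// Hypothetical, simplified stand-ins for the types described in this issue.
const VISION_MAX_BYTES: u64 = 5 * 1024 * 1024; // assumed reading of "5 MB"

#[derive(Debug, Clone, PartialEq)]
enum ChannelContent {
    Text(String),
    Image { mime: String, size: u64 },
    File { name: String, size: u64 },
    Multipart(Vec<ChannelContent>), // siblings only; nesting is forbidden
}

struct Attachment {
    filename: String,
    content_type: Option<String>,
    size: u64,
}

/// MIME from `content_type`, else guessed from the filename extension.
fn effective_mime(att: &Attachment) -> Option<String> {
    if let Some(ct) = &att.content_type {
        return Some(ct.to_ascii_lowercase());
    }
    let (_, ext) = att.filename.rsplit_once('.')?;
    let mime = match ext.to_ascii_lowercase().as_str() {
        "png" => "image/png",
        "jpg" | "jpeg" => "image/jpeg",
        "gif" => "image/gif",
        "webp" => "image/webp",
        _ => return None,
    };
    Some(mime.to_string())
}

/// Vision-eligible images under the cap become Image; everything else File.
fn classify(att: &Attachment) -> ChannelContent {
    // Assumed eligibility list, not confirmed by the issue.
    const VISION_MIMES: [&str; 4] = ["image/png", "image/jpeg", "image/gif", "image/webp"];
    match effective_mime(att) {
        Some(mime) if VISION_MIMES.contains(&mime.as_str()) && att.size <= VISION_MAX_BYTES => {
            ChannelContent::Image { mime, size: att.size }
        }
        _ => ChannelContent::File { name: att.filename.clone(), size: att.size },
    }
}

/// Emits Multipart when text and attachments coexist or several attachments arrive.
fn parse(content: &str, attachments: &[Attachment]) -> Option<ChannelContent> {
    let mut parts: Vec<ChannelContent> = Vec::new();
    if !content.is_empty() {
        parts.push(ChannelContent::Text(content.to_string()));
    }
    parts.extend(attachments.iter().map(classify));
    match parts.len() {
        0 => None,
        1 => parts.pop(),
        _ => {
            debug_assert!(!parts.iter().any(|p| matches!(p, ChannelContent::Multipart(_))));
            Some(ChannelContent::Multipart(parts))
        }
    }
}

/// Newline-joined text descriptor for providers that cannot take image blocks.
fn flatten_text(content: &ChannelContent) -> String {
    match content {
        ChannelContent::Text(t) => t.clone(),
        ChannelContent::Image { mime, size } => format!(
            "[attachment: {} image, ~{} KB — not viewable on this provider]",
            mime,
            size / 1024
        ),
        // Marker text for files is invented for this sketch.
        ChannelContent::File { name, size } => {
            format!("[attachment: file {}, ~{} KB]", name, size / 1024)
        }
        ChannelContent::Multipart(parts) => {
            parts.iter().map(flatten_text).collect::<Vec<_>>().join("\n")
        }
    }
}
```

In this sketch a single bare attachment is emitted as a plain Image or File rather than a one-element Multipart, which is one reading of "whenever text + attachments coexist or multiple attachments are present". A vision-capable path would flat-map the same parts into Vec<ContentBlock> instead of calling flatten_text.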

Out of scope (follow-ups)

  • Vision-provider dispatch refinements beyond exposing the existing image bytes.
  • Non-Discord channel parity for inbound attachment classification (Telegram inbound is unchanged here; only the outbound Multipart arm was added).
  • Any handling of files larger than the 5 MB vision cap beyond classifying them as File and rendering the marker.

Test plan

  • 9 new unit tests in the Discord parser covering all (text-empty, n-attachments) shapes plus MIME edge cases (HEIC, oversize, missing content_type).
  • 2 new unit tests in the claude_code driver covering captioned and bare-image marker rendering.
  • Manual smoke test, both shapes, end-to-end (Discord → daemon log → model prompt).
