Summary
When a Discord user sends an image attachment to an OpenFang agent — either with a caption or bare — the attachment is silently discarded before it reaches the model on text-only providers (e.g. the claude_code driver). The model then either:
- (captioned case) receives only the caption text and confabulates an acknowledgement of "the image" it never saw, or
- (bare-image case) receives nothing at all — the message is dropped before dispatch and the agent appears unresponsive.
Vision-capable providers are unaffected in principle, but the parser path also mishandled the bare-image shape, so multimodal dispatch was incomplete in practice.
Reproduction
- Configure an OpenFang agent on a Discord channel with a text-only provider (claude_code driver, or any provider without vision).
- Case A: DM the agent "look at this" + a PNG attachment.
- Case B: DM the agent a PNG attachment with no message body.
Observed
- Case A: Agent responds as if to a text-only message and refers to "the image", which it hallucinates from prior context or from the caption alone.
- Case B: No response. Daemon log shows the inbound MESSAGE_CREATE payload but no dispatch downstream.
Expected
- Case A: Model receives the caption and a coherent indication that an image was attached, with enough metadata (mime, size) to acknowledge it without confabulation.
- Case B: Model receives a coherent indication that a bare image was sent.
- Vision-capable providers receive the actual image bytes as a ContentBlock::Image for true multimodal dispatch.
Root cause
Two defects in crates/openfang-channels/src/discord.rs::parse_discord_message:
- Bare-image drop. An early if content.is_empty() { return None; } killed any message whose body was empty, regardless of attachment count. Bare-image posts never reached the bridge.
- Caption-wins drop. When both text and attachments were present, only the text was preserved as ChannelContent::Text; attachments were discarded silently.
There was also no representation in ChannelContent for "a caption plus one or more attachments as a coherent unit," so even fixing the parser had no destination type to emit into.
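The two defects can be modelled in a few lines. This is a minimal sketch of the pre-fix control flow as described above, not the actual body of parse_discord_message: the Attachment struct, the one-variant ChannelContent, and the function name parse_before_fix are simplified stand-ins introduced here for illustration.

```rust
// Simplified stand-in for illustration; the real OpenFang enum has more variants.
#[derive(Debug, Clone, PartialEq)]
pub enum ChannelContent {
    Text(String),
}

/// Hypothetical attachment record carrying only the fields needed here.
#[derive(Debug, Clone)]
pub struct Attachment {
    pub filename: String,
    pub size_bytes: u64,
}

/// Models the pre-fix behaviour described in the root-cause list.
pub fn parse_before_fix(content: &str, attachments: &[Attachment]) -> Option<ChannelContent> {
    // Defect 1 (bare-image drop): an empty body returns early,
    // before the attachment list is ever consulted.
    if content.is_empty() {
        return None;
    }
    // Defect 2 (caption-wins drop): the attachments are ignored and
    // only the caption survives.
    let _ = attachments;
    Some(ChannelContent::Text(content.to_string()))
}
```

With this model, a bare PNG yields None (Case B) and a captioned PNG yields only the caption text (Case A).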
Proposed fix
End-to-end vertical slice across the channel + runtime layer:
- ChannelContent::Multipart(Vec<ChannelContent>) — new variant for caption + attachment(s) as sibling blocks. Nesting forbidden by doc + debug_assert.
- Discord parser — classify attachments by MIME (with a filename-extension fallback for bot-relayed payloads that omit content_type) under a 5 MB vision-size cap matching Anthropic's image-block limit. Vision-eligible images become Image; everything else becomes File. Emit Multipart whenever text + attachments coexist or multiple attachments are present.
- Bridge — flat-map Multipart in both dispatch paths: into Vec<ContentBlock> for multimodal-capable providers, and into a newline-joined text descriptor for text-flatten providers.
- Telegram channel — exhaustive-match parity for the new variant; defensive flatten on outbound.
- claude_code driver — render Image blocks as [attachment: <mime> image, ~N KB — not viewable on this provider] instead of dropping them. The model still cannot see the image, but it can acknowledge it coherently rather than confabulate.
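The parser and text-flatten steps above might look roughly like the following sketch. Everything here is an assumption made for illustration: the Attachment struct, the field layout of Image and File (a real Image would also carry bytes or a URL), the set of vision-eligible MIME types (so HEIC falls through to File), the File marker text, and the function names classify, shape_message, and flatten_to_text. Only the 5 MB cap, the Multipart emission rule, and the Image marker format come from the description. The multimodal Vec<ContentBlock> path is omitted, and the bridge's text descriptor and the driver's marker rendering are merged into one function for brevity.

```rust
/// 5 MB vision-size cap named in the fix description.
const VISION_SIZE_CAP_BYTES: u64 = 5 * 1024 * 1024;

// Simplified stand-in for the real enum.
#[derive(Debug, Clone, PartialEq)]
pub enum ChannelContent {
    Text(String),
    Image { mime: String, size_bytes: u64 },
    File { filename: String, size_bytes: u64 },
    /// Flat list of sibling blocks; must not contain another Multipart.
    Multipart(Vec<ChannelContent>),
}

/// Hypothetical attachment record; content_type may be missing on relayed payloads.
pub struct Attachment {
    pub filename: String,
    pub content_type: Option<String>,
    pub size_bytes: u64,
}

/// Assumed vision-eligible set; the real list may differ.
fn is_vision_mime(mime: &str) -> bool {
    matches!(mime, "image/png" | "image/jpeg" | "image/gif" | "image/webp")
}

/// Filename-extension fallback for payloads that omit content_type.
fn mime_from_extension(filename: &str) -> Option<&'static str> {
    let (_, ext) = filename.rsplit_once('.')?;
    match ext.to_ascii_lowercase().as_str() {
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        "gif" => Some("image/gif"),
        "webp" => Some("image/webp"),
        _ => None,
    }
}

/// Vision-eligible images under the cap become Image; everything else becomes File.
pub fn classify(att: &Attachment) -> ChannelContent {
    let mime = att
        .content_type
        .as_deref()
        .map(|m| m.to_ascii_lowercase())
        .or_else(|| mime_from_extension(&att.filename).map(String::from));
    match mime {
        Some(m) if is_vision_mime(&m) && att.size_bytes <= VISION_SIZE_CAP_BYTES => {
            ChannelContent::Image { mime: m, size_bytes: att.size_bytes }
        }
        _ => ChannelContent::File {
            filename: att.filename.clone(),
            size_bytes: att.size_bytes,
        },
    }
}

/// Emits Multipart when text and attachments coexist or several attachments are present.
pub fn shape_message(text: &str, attachments: &[Attachment]) -> Option<ChannelContent> {
    let mut parts: Vec<ChannelContent> = Vec::new();
    if !text.is_empty() {
        parts.push(ChannelContent::Text(text.to_string()));
    }
    parts.extend(attachments.iter().map(classify));
    match parts.len() {
        0 => None,              // nothing to dispatch
        1 => parts.pop(),       // plain text, or a single bare attachment
        _ => Some(ChannelContent::Multipart(parts)),
    }
}

/// Newline-joined text descriptor for providers that cannot view images.
pub fn flatten_to_text(content: &ChannelContent) -> String {
    match content {
        ChannelContent::Text(t) => t.clone(),
        ChannelContent::Image { mime, size_bytes } => format!(
            "[attachment: {} image, ~{} KB — not viewable on this provider]",
            mime,
            (size_bytes + 512) / 1024 // rounded to the nearest KB
        ),
        // Hypothetical marker; the description does not specify one for files.
        ChannelContent::File { filename, size_bytes } => {
            format!("[attachment: file {}, ~{} KB]", filename, (size_bytes + 512) / 1024)
        }
        ChannelContent::Multipart(parts) => {
            debug_assert!(!parts.iter().any(|p| matches!(p, ChannelContent::Multipart(_))));
            parts.iter().map(flatten_to_text).collect::<Vec<_>>().join("\n")
        }
    }
}
```

Under these assumptions, a 200 KB PNG captioned "look at this" flattens to the caption on one line and the image marker on the next, while a bare PNG becomes a single Image rather than being dropped.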
Out of scope (follow-ups)
- Vision-provider dispatch refinements beyond exposing the existing image bytes.
- Non-Discord channel parity for inbound attachment classification (Telegram inbound is unchanged here; only the outbound Multipart arm was added).
- Any handling of files larger than the 5 MB vision cap beyond classifying them as File and rendering the marker.
Test plan
- 9 new unit tests in the Discord parser covering all (text-empty, n-attachments) shapes plus MIME edge cases (HEIC, oversize, missing content_type).
- 2 new unit tests in the claude_code driver covering captioned and bare-image marker rendering.
- Manual smoke test, both shapes, end-to-end (Discord → daemon log → model prompt).