Commit c9b3a51

docs: add web image input design spec
1 parent 58b1bd4 commit c9b3a51

1 file changed

Lines changed: 308 additions & 0 deletions

@@ -0,0 +1,308 @@
1+
# Web Image Input Design
2+
3+
## Summary
4+
5+
Add image input support to the Web chat composer with two user entry points:
6+
7+
- Paste images directly into the chat textarea with `Cmd+V` / `Ctrl+V`
8+
- Select one or more images through the existing `Paperclip` button
9+
10+
This is a full-stack change. The feature is only considered complete if Web users can:
11+
12+
1. attach images before sending,
13+
2. send text-only, image-only, or mixed text+image messages,
14+
3. see image previews in the composer and user message bubbles, and
15+
4. reload an existing session and still see previously sent images.
16+
17+
## Problem
18+
19+
The current Web client only sends plain string content.
20+
21+
- [`ChatInput.tsx`](../../../packages/cli/web/src/components/chat/ChatInput.tsx) has no attachment state, no paste-image handling, and the `Paperclip` button is inert.
22+
- [`sessionService.ts`](../../../packages/cli/web/src/services/sessionService.ts) posts `{ content, permissionMode }` only.
23+
- [`sessionSlice.ts`](../../../packages/cli/web/src/store/session/slices/sessionSlice.ts) assumes outgoing user messages are text-only.
24+
- [`ChatMessage.tsx`](../../../packages/cli/web/src/components/chat/ChatMessage.tsx) renders user messages as plain text.
25+
- The server route already accepts `attachments`, but ignores them when executing runs.
26+
- Shared session persistence currently stores user messages as a single `text` part, so image content is lost after reload.
27+
- When a Web session is not present in memory, [`session.ts`](../../../packages/cli/src/server/routes/session.ts) recreates it with an empty `messages` array instead of hydrating persisted history. That means history continuation in Web is already incomplete even before image support.
28+
29+
## Goals
30+
31+
- Support paste-to-attach for clipboard images in the Web chat input.
32+
- Support file-picker image attachment from the `Paperclip` button.
33+
- Allow multiple images per message.
34+
- Allow image-only messages.
35+
- Preserve attachment order as shown in the composer.
36+
- Render user message images in the current session and after session reload.
37+
- Rehydrate persisted Web sessions so follow-up prompts continue with prior multimodal history.
38+
39+
## Non-Goals
40+
41+
- No drag-and-drop upload in this iteration.
42+
- No generic file attachments in this iteration. Only browser-supported image files are in scope.
43+
- No client-side resize, compression, or annotation workflow.
44+
- No assistant-side image rendering requirements beyond preserving multimodal message history for the model and showing user-sent image previews in chat.
45+
46+
## User-Facing Behavior
47+
48+
### Composer
49+
50+
- Pasting one or more images into the textarea adds them as attachments instead of inserting text.
51+
- Clicking `Paperclip` opens a hidden file input restricted to `image/*`.
52+
- The composer shows attachment thumbnails above the textarea.
53+
- Each thumbnail has a remove action before send.
54+
- Sending clears both text input and pending attachments.
55+
- The send button is enabled when there is either:
  - non-empty text, or
  - at least one attached image.
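
The enable rule above can be sketched as a small predicate. This is only an illustration: `canSend` and `PendingAttachment` are hypothetical names, and treating whitespace-only text as empty is an assumption chosen to match the server-side `content.trim()` check described later in this spec.

```ts
// Hypothetical minimal attachment shape; only the presence of items matters here.
type PendingAttachment = { id: string }

// Returns true when the composer holds non-empty text or at least one image.
// Assumption: whitespace-only text counts as empty.
function canSend(text: string, attachments: readonly PendingAttachment[]): boolean {
  return text.trim().length > 0 || attachments.length > 0
}
```

A component could bind this to the send button's `disabled` state so that image-only messages are sendable.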
58+
59+
### Message layout
60+
61+
- User bubbles render text first, followed by attached images.
62+
- Pure image messages render an image grid without requiring fallback text.
63+
- Multiple images render in attachment order.
64+
- Existing text sanitization for `<system-reminder>` and `<file>` blocks still applies to the textual portion only.
65+
66+
### Persistence and resume
67+
68+
- Refreshing the page or reopening a saved session still shows user-sent images.
69+
- Sending a follow-up message in a persisted Web session must include earlier image messages in the in-memory conversation passed to the agent.
70+
71+
## Architecture Decision
72+
73+
Chosen approach: upgrade the shared message flow to handle multimodal user content end-to-end instead of adding a Web-only attachment side channel.
74+
75+
Why:
76+
77+
- The agent already supports `Message.content` as `string | ContentPart[]`.
78+
- The server request schema already includes `attachments`.
79+
- A Web-only workaround would still lose images on reload and would not preserve prior multimodal context.
80+
81+
## Data Model
82+
83+
### Web composer state
84+
85+
Add a Web-local attachment model:
86+
87+
```ts
type ComposerImageAttachment = {
  id: string
  name: string
  mimeType: string
  dataUrl: string
}
```
95+
96+
Rules:
97+
98+
- `id` is client-generated and stable for React rendering/removal.
99+
- `dataUrl` is used for preview and request payload.
100+
- Attachment order is append-only based on user action order.
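
As a rough sketch of these rules, the composer state could be updated with two pure helpers. `addAttachments` and `removeAttachment` are hypothetical names; how `id` values are generated is left to the caller and is not decided here.

```ts
type ComposerImageAttachment = {
  id: string
  name: string
  mimeType: string
  dataUrl: string
}

// Append-only: new attachments always land after existing ones,
// so the list order follows the order of user actions.
function addAttachments(
  current: readonly ComposerImageAttachment[],
  added: readonly ComposerImageAttachment[],
): ComposerImageAttachment[] {
  return [...current, ...added]
}

// Removal is keyed by the stable client-generated id, not by index.
function removeAttachment(
  current: readonly ComposerImageAttachment[],
  id: string,
): ComposerImageAttachment[] {
  return current.filter((attachment) => attachment.id !== id)
}
```

Both helpers return new arrays, which keeps them easy to use with React state setters.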
101+
102+
### Shared message content
103+
104+
Web store and service types must stop collapsing all messages to plain strings. User messages need to preserve multimodal content:
105+
106+
```ts
type WebMessageContent =
  | string
  | Array<
      | { type: 'text'; text: string }
      | { type: 'image_url'; image_url: { url: string } }
    >
```
114+
115+
Web rendering may normalize this into:
116+
117+
- `textContent: string`
118+
- `imageUrls: string[]`
119+
120+
but the stored message shape itself must preserve the original multimodal form.
121+
122+
### Persistent JSONL parts
123+
124+
Extend shared session persistence to support an image part:
125+
126+
```ts
type PartType =
  | 'text'
  | 'image'
  | 'tool_call'
  | 'tool_result'
  | 'diff'
  | 'patch'
  | 'summary'
  | 'subtask_ref'
```
137+
138+
Image payload:
139+
140+
```ts
{
  mimeType: string
  dataUrl: string
}
```
146+
147+
Decision:
148+
149+
- Persist the complete Data URL string.
150+
- Do not split binary storage from session JSONL in this iteration.
151+
- Preserve part ordering by emitting `part_created` entries in display order.
152+
153+
## Request/Response Flow
154+
155+
### Web client -> server
156+
157+
`sessionService.sendMessage()` will send:
158+
159+
```json
{
  "content": "optional text",
  "permissionMode": "default",
  "attachments": [
    {
      "type": "image",
      "content": "data:image/png;base64,..."
    }
  ]
}
```
171+
172+
Rules:
173+
174+
- `content` may be an empty string when attachments exist.
175+
- Attachment `content` stores the full Data URL.
176+
- Web temp user messages should use the same multimodal shape the server will later return, so optimistic UI and persisted UI match.
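
A minimal sketch of how the request body could be assembled from composer state follows. `buildSendMessageBody` and `SendMessageBody` are hypothetical names, and omitting the `attachments` key for text-only sends is an assumption made to keep the existing text-only payload unchanged.

```ts
type SendMessageBody = {
  content: string
  permissionMode: string
  attachments?: Array<{ type: 'image'; content: string }>
}

// Builds the POST body. Each attachment carries its full Data URL in `content`.
function buildSendMessageBody(
  text: string,
  permissionMode: string,
  attachments: readonly { dataUrl: string }[],
): SendMessageBody {
  const body: SendMessageBody = { content: text, permissionMode }
  if (attachments.length > 0) {
    // Assumption: the key is left out entirely when nothing is attached.
    body.attachments = attachments.map((attachment) => ({
      type: 'image' as const,
      content: attachment.dataUrl,
    }))
  }
  return body
}
```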
177+
178+
### Server -> agent
179+
180+
In [`session.ts`](../../../packages/cli/src/server/routes/session.ts):
181+
182+
- Parse `attachments` from the request body.
183+
- Convert the incoming request into `UserMessageContent`:
  - prepend one text part when `content.trim()` is non-empty,
  - append one image part per attachment in attachment order.
186+
- For text-only requests, continue using a plain string to minimize unrelated churn.
187+
- For mixed or image-only requests, call `agent.chat()` with `ContentPart[]`.
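
The conversion steps above might look like the sketch below. `toUserMessageContent` is a hypothetical helper name, and the `ContentPart` shape is assumed to match the `text` / `image_url` parts shown under "Shared message content"; the agent's real type may differ.

```ts
// Assumed part shape, mirroring the Web message content type in this spec.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }

type UserMessageContent = string | ContentPart[]

type ImageAttachment = { type: 'image'; content: string }

function toUserMessageContent(
  content: string,
  attachments: readonly ImageAttachment[] = [],
): UserMessageContent {
  // Text-only requests stay on the plain string path.
  if (attachments.length === 0) return content

  const parts: ContentPart[] = []
  // One leading text part, only when there is non-blank text.
  if (content.trim().length > 0) {
    parts.push({ type: 'text', text: content })
  }
  // One image part per attachment, in attachment order.
  for (const attachment of attachments) {
    parts.push({ type: 'image_url', image_url: { url: attachment.content } })
  }
  return parts
}
```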
188+
189+
### Session hydration
190+
191+
When `POST /sessions/:sessionId/message` receives a session id that is not currently in the in-memory `sessions` map:
192+
193+
- try to load persisted message history through `SessionService.loadSession(sessionId)`,
194+
- initialize `session.messages` with that history,
195+
- then append the new user message and assistant response.
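
One way to picture the hydration step is a get-or-hydrate helper with the history loader injected. Everything here is a sketch: `getOrHydrateSession`, `Session`, `Message`, and `LoadHistory` are hypothetical, the loader stands in for `SessionService.loadSession(sessionId)` (whose real signature is not shown in this spec), and falling back to an empty history when loading fails is an assumption.

```ts
type Message = { role: 'user' | 'assistant'; content: unknown }
type Session = { id: string; messages: Message[] }

// Stand-in for SessionService.loadSession; assumed to resolve to the message history.
type LoadHistory = (sessionId: string) => Promise<Message[]>

async function getOrHydrateSession(
  sessions: Map<string, Session>,
  sessionId: string,
  loadHistory: LoadHistory,
): Promise<Session> {
  const existing = sessions.get(sessionId)
  if (existing) return existing

  let history: Message[] = []
  try {
    history = await loadHistory(sessionId)
  } catch {
    // Assumption: an unreadable or missing history starts a fresh session.
    history = []
  }

  // Copy the history so later appends do not mutate the loader's array.
  const session: Session = { id: sessionId, messages: [...history] }
  sessions.set(sessionId, session)
  return session
}
```

The route handler would then append the new user message and the assistant response to `session.messages` as it does today.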
196+
197+
This fixes existing Web follow-up behavior for persisted sessions and is required for multimodal continuity.
198+
199+
## Persistence Flow
200+
201+
### Writing
202+
203+
Update shared persistence helpers so saving a multimodal user message writes:
204+
205+
- one `message_created` event,
206+
- one `part_created(text)` event for each text part,
207+
- one `part_created(image)` event for each image part,
208+
- in original part order.
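
The write path can be sketched as a function that expands one user message into ordered events. The event field names (`messageId`, `partType`, `payload`) and the helpers `toUserMessageEvents` and `mimeTypeFromDataUrl` are hypothetical; only the event kinds, the image payload fields, and the ordering rule come from this spec.

```ts
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }

// Hypothetical event shapes for illustration only.
type SessionEvent =
  | { type: 'message_created'; messageId: string; role: 'user' }
  | { type: 'part_created'; messageId: string; partType: 'text'; payload: { text: string } }
  | {
      type: 'part_created'
      messageId: string
      partType: 'image'
      payload: { mimeType: string; dataUrl: string }
    }

// Reads the media type from a Data URL prefix such as "data:image/png;base64,".
function mimeTypeFromDataUrl(dataUrl: string): string {
  const match = /^data:([^;,]+)[;,]/.exec(dataUrl)
  return match?.[1] ?? 'application/octet-stream'
}

function toUserMessageEvents(messageId: string, content: string | ContentPart[]): SessionEvent[] {
  const parts: ContentPart[] =
    typeof content === 'string' ? [{ type: 'text', text: content }] : content

  const events: SessionEvent[] = [{ type: 'message_created', messageId, role: 'user' }]
  // Emit parts in their original order so display order survives a reload.
  for (const part of parts) {
    if (part.type === 'text') {
      events.push({ type: 'part_created', messageId, partType: 'text', payload: { text: part.text } })
    } else {
      const dataUrl = part.image_url.url
      events.push({
        type: 'part_created',
        messageId,
        partType: 'image',
        payload: { mimeType: mimeTypeFromDataUrl(dataUrl), dataUrl },
      })
    }
  }
  return events
}
```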
209+
210+
Assistant and tool messages remain unchanged.
211+
212+
### Reading
213+
214+
Update shared session loading so consecutive parts for a single message reconstruct:
215+
216+
- `string` when the message only has one text part,
217+
- `ContentPart[]` when it has mixed text/image parts or multiple ordered parts.
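
A sketch of the read-side reconstruction follows. `partsToContent` and `StoredPart` are hypothetical names, and returning an empty string for a message with no text or image parts is an assumption rather than something this spec defines.

```ts
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } }

// Hypothetical in-memory view of the persisted text/image parts of one message.
type StoredPart =
  | { partType: 'text'; payload: { text: string } }
  | { partType: 'image'; payload: { mimeType: string; dataUrl: string } }

function partsToContent(parts: readonly StoredPart[]): string | ContentPart[] {
  // Assumption: a message with no parts loads as an empty string.
  if (parts.length === 0) return ''

  // A single text part keeps the legacy plain-string form,
  // so existing text-only sessions load exactly as before.
  const only = parts[0]
  if (parts.length === 1 && only && only.partType === 'text') {
    return only.payload.text
  }

  // Anything else becomes an ordered multimodal array.
  return parts.map((part): ContentPart =>
    part.partType === 'text'
      ? { type: 'text', text: part.payload.text }
      : { type: 'image_url', image_url: { url: part.payload.dataUrl } },
  )
}
```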
218+
219+
Web consumers must stop stringifying arrays during fetch normalization. Text extraction for existing views should happen in rendering helpers, not in the transport layer.
220+
221+
## Rendering Strategy
222+
223+
Add a small normalization helper in Web chat rendering:
224+
225+
- Extract text parts into a single display string joined with `\n`.
226+
- Extract image parts into an ordered image list.
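
The helper described above could be as small as the following sketch. `normalizeMessageContent` is a hypothetical name; the `textContent` and `imageUrls` fields and the `\n` join come from this spec.

```ts
type WebMessageContent =
  | string
  | Array<
      | { type: 'text'; text: string }
      | { type: 'image_url'; image_url: { url: string } }
    >

type NormalizedContent = { textContent: string; imageUrls: string[] }

function normalizeMessageContent(content: WebMessageContent): NormalizedContent {
  // Plain strings are legacy text-only messages.
  if (typeof content === 'string') {
    return { textContent: content, imageUrls: [] }
  }

  const texts: string[] = []
  const imageUrls: string[] = []
  for (const part of content) {
    if (part.type === 'text') texts.push(part.text)
    else imageUrls.push(part.image_url.url)
  }
  // Text parts are joined with a newline; images keep their original order.
  return { textContent: texts.join('\n'), imageUrls }
}
```

Keeping this in the rendering layer leaves the stored message shape untouched, which is what the data model section requires.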
227+
228+
Rendering rules:
229+
230+
- user text remains monospaced and wrapped as today,
231+
- images render below text with rounded corners and constrained max dimensions,
232+
- single image can use a larger width,
233+
- multiple images render in a responsive grid,
234+
- click-to-expand image preview is out of scope.
235+
236+
## Validation and Error Handling
237+
238+
- Ignore non-image pasted clipboard items and fall back to normal text paste behavior.
239+
- Reject non-image files selected from the file picker.
240+
- If file reading fails, keep existing composer state and surface a local error message near the composer.
241+
- Do not block sending because one attachment failed to preview; only successfully parsed images are attached.
242+
- Empty text plus zero valid attachments must still be rejected.
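
Two of these rules lend themselves to small pure helpers, sketched below under assumptions. `partitionImageFiles`, `collectReadResults`, `FileLike`, and `ReadResult` are hypothetical names; the MIME check relies on the `type` field that browser `File` objects expose, and the actual file reading (for example via `FileReader`) is left out.

```ts
// Minimal structural stand-in for a browser File.
type FileLike = { name: string; type: string }

// Splits picked files into images and everything else, preserving order.
function partitionImageFiles<T extends FileLike>(
  files: readonly T[],
): { accepted: T[]; rejected: T[] } {
  const accepted: T[] = []
  const rejected: T[] = []
  for (const file of files) {
    if (file.type.startsWith('image/')) accepted.push(file)
    else rejected.push(file)
  }
  return { accepted, rejected }
}

// Outcome of trying to read one file into an attachment.
type ReadResult<A> =
  | { ok: true; attachment: A }
  | { ok: false; name: string; error: string }

// Keeps only successfully read attachments and gathers messages for the rest,
// so one failed read does not block the others.
function collectReadResults<A>(
  results: readonly ReadResult<A>[],
): { attachments: A[]; errors: string[] } {
  const attachments: A[] = []
  const errors: string[] = []
  for (const result of results) {
    if (result.ok) attachments.push(result.attachment)
    else errors.push(`${result.name}: ${result.error}`)
  }
  return { attachments, errors }
}
```

The `errors` list is what the composer could surface as its local error message.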
243+
244+
## Compatibility Constraints
245+
246+
- Existing text-only Web chat behavior must remain unchanged.
247+
- Existing persisted text-only sessions must continue to load without migration.
248+
- Older messages without image parts must still deserialize exactly as before.
249+
- Any code path that assumes `message.content` is always a string must either:
  - stay on a string-only input path, or
  - be upgraded to handle `ContentPart[]`.
252+
253+
## Files Expected To Change
254+
255+
Web:
256+
257+
- `packages/cli/web/src/components/chat/ChatInput.tsx`
258+
- `packages/cli/web/src/components/chat/ChatView.tsx`
259+
- `packages/cli/web/src/components/chat/ChatMessage.tsx`
260+
- `packages/cli/web/src/services/sessionService.ts`
261+
- `packages/cli/web/src/store/session/types.ts`
262+
- `packages/cli/web/src/store/session/slices/sessionSlice.ts`
263+
- `packages/cli/web/tests/components/chat/ChatMessage.test.tsx`
264+
- new Web tests for `ChatInput` and/or session sending
265+
266+
Shared/server:
267+
268+
- `packages/cli/src/api/schemas.ts`
269+
- `packages/cli/src/server/routes/session.ts`
270+
- `packages/cli/src/context/types.ts`
271+
- `packages/cli/src/context/storage/PersistentStore.ts`
272+
- `packages/cli/src/services/SessionService.ts`
273+
- `packages/cli/src/agent/Agent.ts`
274+
275+
## Testing Requirements
276+
277+
- Web component test for paste-image attachment flow.
278+
- Web component test for `Paperclip` image selection flow.
279+
- Web store/service test verifying `sendMessage` includes image attachments.
280+
- Web rendering test verifying user messages render text and image previews from multimodal content.
281+
- Shared persistence test verifying a saved multimodal user message reloads with image parts intact.
282+
- Server route test verifying persisted session hydration before follow-up send.
283+
284+
## Open Decisions Resolved Here
285+
286+
- Multiple images: supported.
287+
- Pure image messages: supported.
288+
- Ordering: text first, then images in attachment order.
289+
- Storage format: Data URLs persisted inline in JSONL for this iteration.
290+
- Scope: Web input + shared transport/persistence needed to make Web history and follow-up prompts correct.
291+
292+
## Spec Self-Review
293+
294+
### Completeness
295+
296+
This spec covers input capture, optimistic UI, transport, server execution, persistence, history reload, and follow-up context hydration.
297+
298+
### Internal consistency
299+
300+
The selected architecture keeps one message model across Web UI, server execution, and persistence. There is no separate Web-only attachment representation after the request boundary.
301+
302+
### Scope check
303+
304+
This remains one implementation plan. The work spans multiple layers, but all changes serve one user-visible feature: multimodal image input in Web chat with persistence.
305+
306+
### Ambiguity check
307+
308+
The spec explicitly defines message ordering, image-only behavior, and the storage format, which were the main ambiguous areas.
