Research compiled 2026-05-03. Sources: GitHub repos, API documentation, arXiv papers, and project documentation for 50+ systems.
- Screenshot-to-Code Tools
- Vision-Based UI Testing
- Design-to-Code Pipelines
- OCR for Code Extraction
- Diagram Understanding
- Visual Debugging
- Whiteboard-to-Code
- Multi-Modal RAG
- Vision Models for UI Analysis
- Web Agent Vision
- Mobile App Screenshot Analysis
- PDF/Document Understanding
- Chart/Graph Data Extraction
- Terminal Screenshot Understanding
- Git Diff Visualization
- Code Visualization
- Automated Screenshot Capture
- Image Generation for Documentation
- SVG/Diagram Generation
- Accessibility Testing via Screenshot
- Hawk Current State and Gaps
- Recommended Implementation Plan
What it does. Converts screenshots, mockups, and Figma designs into clean, functional code. Users upload an image and receive working code in their chosen framework.
How well it works. The tool produces usable code for static UIs with good accuracy for common layouts. It supports HTML+Tailwind, React+Tailwind, Vue+Tailwind, Bootstrap, Ionic+Tailwind, and SVG output. Complex interactive UIs, custom animations, and deeply nested state management remain weak areas. Experimental video-to-code support exists but is unreliable.
Models. Claude Opus 4.5, GPT-5.3/5.2/4.1, Gemini 3 Flash/Pro. Uses DALL-E 3 or Flux Schnell for image generation within mockups.
Architecture. React/Vite frontend, FastAPI backend. The pipeline sends the screenshot to a vision model with a detailed system prompt describing the target framework, then iterates on the result.
What it does. Vercel's generative UI system that takes text descriptions or screenshots and produces React/Next.js components using shadcn/ui and Tailwind.
How well it works. Strong for component-level generation (buttons, cards, forms, dashboards). Weaker for full application architecture. The generated code uses modern React patterns and is production-quality for UI components. It integrates tightly with the Vercel deployment ecosystem.
Key insight. v0 shows that constraining output to a specific design system (shadcn/ui) dramatically improves quality versus open-ended code generation.
What it does. AI-powered web development agent running entirely in the browser via WebContainers. Can prompt, run, edit, and deploy full-stack JavaScript applications.
How well it works. Strong for JavaScript/Node.js ecosystems. The browser-based sandbox means full environment control (filesystem, npm, terminal) without local setup. Limited to frameworks compatible with StackBlitz WebContainers.
What it does. Wireframe-to-app generator. Takes hand-drawn sketches or screenshots and generates functional applications.
Models. Uses Kimi K2.5 on Together AI inference. Code runs in Sandpack sandbox.
What it does. CLI tool that generates code from prompts, then iterates via test-driven development until tests pass. Includes experimental visual matching -- provide a design screenshot alongside code, and it generates code to match the visual.
How well it works. Deliberately narrow scope (single file, won't install deps or modify multiple files). The visual matching requires an Anthropic API key for Claude vision feedback while using OpenAI for code generation. The focused approach avoids compounding errors common in broader agents.
A coding agent should offer a /screenshot-to-code or /mockup command that:
- Accepts an image path or clipboard screenshot
- Detects the project's framework from package.json/go.mod
- Sends the screenshot to a vision model with framework-specific prompts
- Generates component code in the correct framework
- Writes files and optionally runs the dev server to verify
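A minimal sketch of the framework-detection step of such a command, in Go. The helper name detectFramework and the framework labels are illustrative; the manifests it checks (go.mod, package.json) are the ones listed above.

```go
package mockup

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// detectFramework is a hypothetical helper for a /mockup command: it inspects
// the project manifest to decide which framework-specific prompt to use.
func detectFramework(projectDir string) string {
	// Go projects: a go.mod means we target Go templates / plain HTML output.
	if _, err := os.Stat(filepath.Join(projectDir, "go.mod")); err == nil {
		return "go"
	}

	// JavaScript projects: read package.json and look at dependencies.
	raw, err := os.ReadFile(filepath.Join(projectDir, "package.json"))
	if err != nil {
		return "html" // fall back to plain HTML+Tailwind
	}
	var pkg struct {
		Dependencies    map[string]string `json:"dependencies"`
		DevDependencies map[string]string `json:"devDependencies"`
	}
	if err := json.Unmarshal(raw, &pkg); err != nil {
		return "html"
	}
	has := func(name string) bool {
		_, a := pkg.Dependencies[name]
		_, b := pkg.DevDependencies[name]
		return a || b
	}
	switch {
	case has("next") || has("react"):
		return "react"
	case has("vue"):
		return "vue"
	case has("svelte"):
		return "svelte"
	default:
		return "html"
	}
}
```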
What it does. Visual AI testing platform that captures application screens and uses AI to detect visual regressions. Replicates human perception instead of pixel-by-pixel comparison.
How well it works. Customers report a 99.8% pass rate. It intelligently distinguishes dynamic content (ads, dates, personalized dashboards) from genuine defects and dramatically reduces false positives compared to pixel-diff tools. The AI learns what "looks right" for your application.
Key capabilities. Cross-browser testing, intelligent grouping of changes for batch review, one-click maintenance for expected changes.
What it does. Built-in toHaveScreenshot() assertion that captures screenshots and performs pixel-level comparison using the pixelmatch library.
How well it works. Reliable for deterministic UIs when run in consistent environments. Requires same OS/browser/resolution for baseline matching. Provides maxDiffPixels tolerance and stylePath for hiding volatile elements.
Limitations. Pixel-based comparison is brittle. Browser rendering varies by OS, hardware, and version. No semantic understanding of what changed.
What it does. Uses vision language models (VLMs) to understand UIs purely from screenshots. Enables natural language automation across web, mobile, desktop, and canvas surfaces without DOM parsing.
How well it works. Supports Qwen3-VL, Gemini-3-Pro, UI-TARS, and Doubao-1.6-vision models. Pure vision approach means it works even on canvas elements and non-DOM surfaces where traditional testing fails. Reduced token costs versus full DOM + screenshot approaches.
A coding agent should support a /visual-test command that:
- Takes a before/after screenshot (or captures one via headless browser)
- Sends both images to a vision model
- Gets a semantic diff: "The navigation bar shifted 2px left, the button text changed from 'Submit' to 'Send'"
- Reports whether changes are intentional or regressions
- Can generate Playwright snapshot assertions automatically
What it does. AI-powered design-to-code platform. Imports Figma designs, text prompts, or images and generates functional applications. Exports to React, HTML, and Vue.
Pipeline. Import design -> instant generation -> iterate via chat -> deploy with one click. Also supports website cloning via browser extension.
Key feature. API access for integrating with coding AI agents, making it composable in automated workflows.
What it does. Design-to-code tool focused on converting Figma/Adobe XD designs to production-ready frontend code. Emphasizes responsive layouts and component extraction.
What it does. The most effective current approach combines Figma's structured JSON export (frames, components, auto-layout properties) with vision model analysis of the rendered design. The structured data provides exact spacing/colors/typography while the vision model handles intent and interaction patterns.
A coding agent should implement a /design or /figma command that:
- Accepts a Figma URL, image file, or clipboard paste
- For Figma URLs: fetches the design JSON via Figma API for precise measurements
- For images: uses vision model to extract layout structure, colors, typography
- Maps design elements to the project's existing component library
- Generates code that reuses existing components rather than creating duplicates
- Highlights elements that don't match existing patterns for human review
What it does. Open-source OCR engine combining LSTM neural networks with a legacy pattern-recognition engine. Supports 100+ languages and outputs plain text, hOCR, PDF, TSV, ALTO, and PAGE formats.
How well it works for code. General-purpose OCR, not specialized for code. Struggles with monospace font ligatures, syntax highlighting colors on dark backgrounds, and low-resolution terminal screenshots. Accuracy depends heavily on image quality.
What it does. Modern vision models can read code from screenshots with near-perfect accuracy for clear images. They understand syntax highlighting, can infer indentation, and recognize programming languages.
How well it works. Far superior to traditional OCR for code. Claude and GPT-4V can read code from terminal screenshots, IDE screenshots, documentation screenshots, and even handwritten code with high accuracy. They also understand context -- they can identify the language, spot errors, and explain what the code does.
Key advantage. Vision models handle the full range of code presentation: dark/light themes, syntax highlighting, line numbers, diff markers, error underlines, and annotations.
A coding agent should:
- Accept image inputs in the conversation (paste, path, URL)
- Automatically detect if an image contains code
- Extract the code using the vision model (not traditional OCR)
- Offer to create a file with the extracted code
- Handle terminal screenshots by extracting both commands and output
What it does. Text-to-diagram generation from markdown-like syntax. Supports flowcharts, sequence diagrams, Gantt charts, class diagrams, state diagrams, pie charts, git graphs, user journey diagrams, and C4 architecture diagrams.
Programmatic parsing. Mermaid diagrams are text-based and can be parsed to extract architectural information: which components exist, how they connect, what the data flow looks like. The mermaid-cli (mmdc) provides a Node.js API for rendering.
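A minimal sketch of rendering generated Mermaid source by shelling out to the mermaid-cli binary (mmdc is assumed to be installed and on PATH):

```go
package diagram

import (
	"fmt"
	"os"
	"os/exec"
)

// RenderMermaid writes Mermaid source to a temp file and renders it to SVG
// via the mermaid-cli binary (mmdc). Assumes mmdc is installed and on PATH.
func RenderMermaid(source, outPath string) error {
	tmp, err := os.CreateTemp("", "diagram-*.mmd")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())

	if _, err := tmp.WriteString(source); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// mmdc -i input.mmd -o output.svg
	cmd := exec.Command("mmdc", "-i", tmp.Name(), "-o", outPath)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("mmdc failed: %v\n%s", err, out)
	}
	return nil
}
```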
What it does. Modern diagram scripting language that compiles to SVG/PNG/PDF. Uses plugin-based layout engines (ELK, TALA) for different diagram styles. Has official VSCode, Vim, and Obsidian plugins.
Advantage over Mermaid. Runs entirely server-side without browser dependencies, making it suitable for automated generation in CI/CD pipelines and coding agents.
What it does. Virtual hand-drawn style whiteboard. Exports to PNG, SVG, and an open .excalidraw JSON format. The JSON format contains structured shape/connection data that can be parsed programmatically.
What it does. Current vision models (Claude Opus 4.7, GPT-4V, Gemini) can interpret architecture diagrams, flowcharts, sequence diagrams, and ERD diagrams from images with moderate accuracy. They can describe the components, identify relationships, and translate diagrams into text descriptions or code structures.
Limitations. Complex diagrams with many overlapping connections, small text, or unconventional layouts degrade accuracy. Best when diagrams follow standard conventions.
A coding agent should:
- Parse Mermaid/D2/PlantUML in documentation to understand architecture
- Accept architecture diagram images and extract structure via vision
- Generate Mermaid/D2 diagrams from code analysis (reverse engineering)
- Offer a /diagram command that generates architecture diagrams from the current codebase
- Translate between diagram formats (Mermaid -> D2, image -> Mermaid)
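As a sketch of the code-analysis-to-diagram direction, the following Go function emits a Mermaid flowchart from a precomputed package dependency map; the map would come from something like repomap or Go AST analysis, and the function name is illustrative.

```go
package diagram

import (
	"fmt"
	"sort"
	"strings"
)

// MermaidFromDeps turns a package -> dependencies map into a Mermaid
// flowchart definition that GitHub/GitLab can render natively.
func MermaidFromDeps(deps map[string][]string) string {
	var b strings.Builder
	b.WriteString("flowchart TD\n")

	// Sort keys so the output is deterministic and diff-friendly.
	pkgs := make([]string, 0, len(deps))
	for pkg := range deps {
		pkgs = append(pkgs, pkg)
	}
	sort.Strings(pkgs)

	for _, pkg := range pkgs {
		for _, dep := range deps[pkg] {
			fmt.Fprintf(&b, "    %s --> %s\n", sanitize(pkg), sanitize(dep))
		}
	}
	return b.String()
}

// sanitize makes a package path usable as a Mermaid node ID.
func sanitize(s string) string {
	r := strings.NewReplacer("/", "_", ".", "_", "-", "_")
	return r.Replace(s)
}
```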
What it does. Vision models can analyze screenshots of error states: browser console errors, terminal stack traces, IDE error panels, and application error screens. They identify the error type, location, and likely cause.
How well it works. Very effective for common error patterns. Claude and GPT-4V can read stack traces from terminal screenshots, identify the failing line, correlate with code context, and suggest fixes. Browser DevTools screenshots (Network tab, Console, Elements) are well understood.
What it does. Tools like Better Stack, Sentry, and LogRocket capture and visualize errors with context. When integrated with a coding agent, the agent can see the same visual representation a developer would see.
Emerging pattern. Agents that can take a screenshot of the error state (browser, terminal, IDE) and automatically diagnose the issue without the developer needing to manually copy error text.
A coding agent should:
- Accept error screenshots via paste or path
- Extract error text, stack trace, and context from the image
- Cross-reference with project source files
- Suggest specific fixes with code diffs
- Optionally capture its own screenshots when running browser tests
What it does. Users draw UI sketches on the tldraw whiteboard, and the tool converts them into functional HTML/CSS/JS code using vision models.
How well it works. Impressive for simple UIs (landing pages, forms, dashboards). The hand-drawn aesthetic of tldraw actually helps the model understand intent vs. precise positioning. Works best when combined with text annotations on the whiteboard.
What it does. Similar concept using Excalidraw's structured JSON export. The combination of structured shape data + visual rendering gives better results than pure image analysis.
A coding agent should support a /sketch or /whiteboard command that:
- Accepts a hand-drawn sketch image (photo of whiteboard, tablet sketch, Excalidraw export)
- Uses vision model to identify UI components (buttons, inputs, lists, navigation)
- Maps to the project's design system/component library
- Generates a first-pass implementation
- Enters an iterative refinement loop: "Move the sidebar to the left", "Make the header sticky"
What it does. High-performance vector database supporting dense/sparse embeddings, multi-vector storage, and hybrid search. Supports text, image, and video modalities in a single collection.
Image search. Images are converted to vector embeddings (typically 768+ dimensions) and stored alongside metadata. Search uses approximate nearest neighbor (ANN) algorithms (HNSW, IVF).
What it does. RAG-based QA on document collections with explicit multi-modal support. Handles documents with figures and tables using Azure Document Intelligence, Adobe PDF Extract, or Docling for parsing.
Key feature. Hybrid retrieval combining full-text and vector search with re-ranking.
Architecture. The state-of-the-art approach for multi-modal RAG:
- Document ingestion: Extract text, tables, and images separately
- Embedding: Use CLIP or similar for image embeddings, text embedders for text
- Storage: Store all modalities in a vector database with metadata
- Retrieval: Query across modalities -- a text query can retrieve relevant images, a diagram query can retrieve related code
- Generation: Present retrieved multi-modal context to a vision model
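A rough Go sketch of the ingestion step. Embedder and VectorStore are hypothetical interfaces standing in for a text embedder, a CLIP-style image embedder, and whichever vector database is used; nothing here is a specific library's API.

```go
package mmrag

import "context"

// Embedder and VectorStore are hypothetical interfaces; concrete
// implementations would wrap a text embedder, a CLIP-style image
// embedder, and the vector database of choice.
type Embedder interface {
	EmbedText(ctx context.Context, text string) ([]float32, error)
	EmbedImage(ctx context.Context, png []byte) ([]float32, error)
}

type VectorStore interface {
	Upsert(ctx context.Context, id string, vec []float32, meta map[string]string) error
}

// Chunk is one extracted unit from a document: either text or an image.
type Chunk struct {
	ID    string
	Text  string            // empty for image chunks
	Image []byte            // nil for text chunks
	Meta  map[string]string // source path, page number, caption, etc.
}

// Ingest embeds each chunk with the matching encoder and stores it with its
// metadata, so later queries can retrieve text and images side by side.
func Ingest(ctx context.Context, emb Embedder, store VectorStore, chunks []Chunk) error {
	for _, c := range chunks {
		var (
			vec []float32
			err error
		)
		if c.Image != nil {
			vec, err = emb.EmbedImage(ctx, c.Image)
		} else {
			vec, err = emb.EmbedText(ctx, c.Text)
		}
		if err != nil {
			return err
		}
		if err := store.Upsert(ctx, c.ID, vec, c.Meta); err != nil {
			return err
		}
	}
	return nil
}
```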
What it does. Free API that converts URLs to LLM-friendly content. Automatically captions images using vision language models, formatted as alt tags.
A coding agent should:
- Index project documentation including embedded images and diagrams
- When answering questions, retrieve relevant images alongside text
- Include design mockups, architecture diagrams, and screenshots in RAG context
- Support queries like "show me the diagram from the design doc that describes the auth flow"
Capabilities. Supports JPEG, PNG, GIF, WebP. Up to 600 images per API request. Max 8000x8000px per image. Images are tokenized at roughly width * height / 750 tokens.
Claude Opus 4.7 high-res. First Claude model with high-resolution image support: 2576px on the long edge (up from 1568px), 4784 tokens per image (up from 1568). Automatic, no opt-in needed. Particularly strong for computer use, screenshot understanding, and document analysis.
Best practices. Place images before text in prompts. Use base64-encoded images, URL references, or the Files API. Downsample before sending to control costs. Lossy JPEG compression reduces latency but can degrade OCR accuracy.
Cost. ~$0.004 per 1000x1000px image on Sonnet, ~$0.007 on Opus 4.7.
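A back-of-the-envelope cost estimator using the width * height / 750 rule and per-token pricing; the $3-per-million-token Sonnet input price is an assumption consistent with the ~$0.004 per 1000x1000px figure above.

```go
package imgcost

// EstimateClaudeImageTokens applies the rough rule from Anthropic's docs:
// tokens ~= width * height / 750 (before any provider-side downscaling).
func EstimateClaudeImageTokens(width, height int) int {
	return width * height / 750
}

// EstimateCostUSD converts an image token count to dollars given a price in
// USD per million input tokens (e.g. ~3.0 for Sonnet, as assumed above).
func EstimateCostUSD(tokens int, pricePerMTok float64) float64 {
	return float64(tokens) / 1_000_000 * pricePerMTok
}

// Example: a 1000x1000 screenshot is ~1333 tokens, or roughly $0.004 at
// $3 per million input tokens.
```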
Capabilities. Strong visual understanding across screenshots, diagrams, charts, and documents. The newer GPT-5.x series improves on spatial reasoning and precise text extraction.
Capabilities. Natively multimodal -- built from the ground up to process images alongside text. Supports PNG, JPEG, WEBP, HEIC, HEIF. Images under 384px in both dimensions cost 258 tokens; larger images are split into 768x768 tiles at 258 tokens each.
Key advantage. Up to 3,600 images per request. media_resolution parameter controls detail level for cost/quality tradeoff.
Capabilities. Open-source vision-language model in 3B, 7B, and 72B sizes. 256K context window. Strong OCR supporting 32 languages. Robust in low light, blur, and tilt. Good at document layout, UI parsing, and object grounding.
What it does. Parses UI screenshots into structured elements using YOLO-based icon detection + Florence/BLIP2 captioning. Achieves 39.5% on ScreenSpot Pro benchmark.
Key contribution. Significantly enhances GPT-4V's ability to ground actions to specific UI regions.
What it does. Evaluates 220+ vision-language models across 80+ benchmarks. Covers visual QA, OCR, physics reasoning, spatial understanding, video comprehension, and medical imaging.
Key finding from benchmarks. GPT-4V and Claude Opus 4.7 lead on screenshot and UI understanding tasks. Open-source models like Qwen2.5-VL-72B are competitive for OCR and document tasks but lag on complex reasoning about UIs.
A coding agent should:
- Use vision model capabilities already available through the configured provider
- Detect when the current model supports vision (check Capabilities.Vision in the router)
- Automatically route image-containing messages to vision-capable models
- Apply appropriate image preprocessing: resize to optimal dimensions, compress for cost control
- Use Claude Opus 4.7's high-res mode for detailed screenshots, Sonnet/Haiku for quick analysis
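A sketch of that routing check in Go. Only the Capabilities.Vision flag mirrors something that already exists in hawk; the Message and Model types here are simplified stand-ins.

```go
package router

// Message and Model are simplified stand-ins for the engine's real types;
// only Capabilities.Vision mirrors a field that already exists in hawk.
type Message struct {
	HasImages bool
}

type Model struct {
	Name         string
	Capabilities struct {
		Vision bool
	}
}

// PickModel prefers the current model, but switches to a vision-capable
// fallback when the message carries image content blocks.
func PickModel(msg Message, current Model, fallbacks []Model) (Model, bool) {
	if !msg.HasImages || current.Capabilities.Vision {
		return current, false // no switch needed
	}
	for _, m := range fallbacks {
		if m.Capabilities.Vision {
			return m, true // switched to a vision-capable model
		}
	}
	// No vision model available: keep the current one and let the caller warn.
	return current, false
}
```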
What it does. Python library enabling AI agents to control browsers. Takes screenshots, identifies interactive elements, executes browser commands via LLM decisions.
Models. ChatBrowserUse (optimized), OpenAI, Gemini, Claude, Ollama, and open-source bu-30b-a3b.
How well it works. Strong for multi-step web tasks: form filling, navigation, data extraction. The vision-based approach handles dynamic content and JavaScript-heavy sites.
What it does. Two-component architecture: World Model (analyzes current page state) + Action Engine (converts instructions to Selenium/Playwright code). Uses GPT-4o by default, fully customizable.
Capabilities. Iframe handling, multi-tab navigation, Gradio interface for testing.
What it does. Uses vision models to interact with web pages through screenshots rather than HTML parsing. Supports Set-of-Mark (SoM) annotations that overlay visual identifiers on page elements for precise grounding.
Models. GPT-4V, Gemini, LLaVA (open-source).
What it does. Combines AI with code for flexible browser automation. act() for actions, extract() for structured data extraction with Zod schema validation. Self-healing via action caching.
Key pattern. Moves from AI exploration to cached/replayed workflows, reducing token costs over time.
A coding agent should support a /browse or /web command that:
- Launches a headless browser
- Navigates to a URL and takes a screenshot
- Uses the vision model to understand the page
- Extracts data or performs actions as instructed
- Captures before/after screenshots for verification
- Useful for: verifying deployed changes, scraping docs, testing integrations
What it does. Modern vision models understand mobile app screenshots well. They can identify UI components (tab bars, navigation, cards, lists), read text, understand layout hierarchy, and detect platform-specific patterns (iOS vs Android).
What it does. Microsoft's OmniParser works on mobile screenshots too: detects interactive regions, generates functional descriptions of UI elements.
What it does. Supports iOS and Android automation using vision-only localization. Works without accessibility tree access.
What it does. Multi-device orchestration framework. UFO3 ("Galaxy") supports cross-device workflows. Uses visual + accessibility API detection for robust element identification.
A coding agent working on mobile projects should:
- Accept mobile screenshots for UI implementation reference
- Detect platform (iOS/Android) from screenshot characteristics
- Generate platform-appropriate code (SwiftUI, Jetpack Compose, React Native)
- Compare implementation screenshots against design reference
- Identify platform-specific issues (safe areas, notch handling, gesture conflicts)
What it does. Converts complex documents (PDF, DOCX, PPTX, XLSX, HTML, images, LaTeX) into LLM-ready markdown/JSON. Handles page layout detection, reading order, table structure, code blocks, formulas, image classification, and chart understanding.
How well it works. VLM + OCR dual engine for improved accuracy. Supports 109 languages. Charts are converted to tables, code, or detailed descriptions.
What it does. Document parsing engine converting PDFs and Office docs into structured markdown/JSON. VLM + OCR dual engine. Handles text, tables (HTML format), images, formulas (LaTeX), and handwritten content.
How well it works. Strong on academic papers, technical documentation, and business documents. Supports 109 languages. Available as web app, desktop client, Docker, Python SDK, and REST API.
What it does. Converts PDFs and documents to markdown/JSON/HTML using a multi-model pipeline: OCR (Surya), layout detection, block cleaning, optional LLM enhancement.
How well it works. Benchmark score of 95.67 vs LlamaParse (84.2) and Mathpix (86.4). Processing at 2.84 seconds/page on H100 GPU. Hybrid mode with LLMs improves table accuracy to 0.907.
What it does. Python library for table extraction from text-based PDFs. Outputs to CSV, JSON, Excel, HTML, Markdown, SQLite. Reports 99% accuracy on clean documents. Does not work with scanned PDFs.
What it does. Claude can directly process PDF pages as images. Claude Opus 4.7's high-resolution mode (2576px) is particularly effective for dense documents with small text, tables, and diagrams.
Hawk already has readPDFFile() in tool/file_read_media.go with page range parsing, size limits, and magic byte validation. Currently shells out to pdftotext (poppler-utils) but returns a fallback message if unavailable.
A coding agent should:
- Accept PDF/document paths and extract structured content automatically
- For technical specs: extract requirements, API definitions, data models
- For design docs: extract wireframes, flow descriptions, component lists
- Integrate with the RAG system so document content is searchable
- Handle mixed content: text paragraphs, tables, code blocks, diagrams
- Upgrade from pdftotext to a vision-model approach for scanned/complex PDFs
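One low-dependency way to implement the vision-model upgrade is to rasterize pages with poppler's pdftoppm and send the resulting PNGs as image content blocks. A sketch, assuming pdftoppm is installed (it ships in the same poppler-utils package as pdftotext):

```go
package pdfvision

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
)

// RasterizePages renders a page range of a PDF to PNG files using poppler's
// pdftoppm, returning the generated file paths. The caller can then send
// each page to the vision model as an image block.
func RasterizePages(pdfPath string, first, last int) ([]string, error) {
	dir, err := os.MkdirTemp("", "pdf-pages-*")
	if err != nil {
		return nil, err
	}
	prefix := filepath.Join(dir, "page")

	// pdftoppm -png -r 150 -f <first> -l <last> input.pdf <output-prefix>
	cmd := exec.Command("pdftoppm", "-png", "-r", "150",
		"-f", strconv.Itoa(first), "-l", strconv.Itoa(last),
		pdfPath, prefix)
	if out, err := cmd.CombinedOutput(); err != nil {
		return nil, fmt.Errorf("pdftoppm failed: %v\n%s", err, out)
	}

	// Pages come back as page-1.png, page-2.png, ... (zero-padded as needed).
	return filepath.Glob(prefix + "-*.png")
}
```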
What it does. Vision models can extract data from charts: bar charts, line graphs, pie charts, scatter plots, and tables. They identify axes, labels, data points, trends, and relationships.
How well it works. Claude and GPT-4V are strong on standard chart types. They can extract approximate numerical values, identify trends ("revenue grew 30% YoY"), and describe relationships. Precise value extraction requires high-resolution images.
What it does. Docling specifically supports chart understanding: converts bar charts, pie charts, and line plots into tables, code, or detailed descriptions.
What it does. Model trained specifically for parsing screenshots into structured output. Understands charts, documents, and UIs. Pre-trained on web page screenshots, fine-tuned for specific tasks.
A coding agent should:
- Accept chart/graph images and extract data automatically
- Convert chart data to structured formats (CSV, JSON, table)
- Generate code to recreate the chart from extracted data
- Answer questions about chart data: "What was the peak value in Q3?"
- Detect data visualization in project documentation and index it
What it does. Vision models can read terminal screenshots including:
- Command output with ANSI colors
- Error messages with stack traces
- Build logs with warnings/errors highlighted
- htop/top system monitoring output
- Git log with branch visualization
How well it works. Very strong for standard terminal output. The models understand ANSI color semantics (red = error, green = success, yellow = warning). They can parse complex formatted output like tables, progress bars, and tree structures.
Limitations. Very dense terminal output (100+ lines of small text) may lose detail at standard resolutions. Opus 4.7's high-res mode helps here.
A coding agent should:
- Accept terminal screenshots when users paste error output
- Extract command, output, and error information
- Understand build system output (webpack, go build, cargo, etc.)
- Parse CI/CD log screenshots from GitHub Actions, Jenkins, etc.
- Automatically suggest fixes based on detected error patterns
Text diffs. Standard unified diff format is well understood by all LLMs. Hawk already supports diff coloring in the TUI.
Visual diff tools. GitHub's rich diff view, VS Code's inline/side-by-side diff, and tools like Delta (terminal diff viewer) provide visual context that screenshots can capture.
Vision model understanding. Vision models can interpret screenshot diffs from GitHub PRs, VS Code, and other tools. They understand added/removed lines, file changes, and structural modifications.
A coding agent should:
- Generate visual diffs for proposed changes (already partially in hawk's diffsandbox)
- Accept screenshots of diffs from GitHub/IDE for review context
- Support a /diff command that shows changes with syntax highlighting
- Understand PR screenshot context when users share GitHub screenshots
What it does. Graph description language and rendering engine for directed/undirected graphs. Used extensively for call graphs, dependency diagrams, class hierarchies, and control flow graphs.
What it does. Low-level JavaScript data visualization library using SVG/Canvas/HTML. Can create any custom visualization including call graphs, dependency trees, flame charts, and treemaps.
What it does. Mermaid's class diagrams, flowcharts, and sequence diagrams are widely used for documenting code architecture. Can be generated programmatically from code analysis.
What it does. Go's built-in AST package can be used to extract function calls, type relationships, and package dependencies, then render them as diagrams.
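A small sketch of that extraction: parse one file with the standard go/parser and go/ast packages and collect its imports and top-level function names, which can then feed diagram generation.

```go
package codemap

import (
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// FileSummary captures the pieces of a Go file that matter for diagrams:
// which packages it imports and which functions it declares.
type FileSummary struct {
	Package   string
	Imports   []string
	Functions []string
}

// Summarize parses a single Go source file and extracts its imports and
// function declarations using the standard library AST packages.
func Summarize(path string) (FileSummary, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, path, nil, parser.SkipObjectResolution)
	if err != nil {
		return FileSummary{}, err
	}

	sum := FileSummary{Package: file.Name.Name}
	for _, imp := range file.Imports {
		sum.Imports = append(sum.Imports, strings.Trim(imp.Path.Value, `"`))
	}
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			sum.Functions = append(sum.Functions, fn.Name.Name)
		}
	}
	return sum, nil
}
```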
Hawk has magicdocs with Go AST parsing for automatic markdown generation, and repomap for incremental code indexing. These could generate visualization input data.
A coding agent should:
- Offer an /architecture command that generates codebase diagrams
- Generate Mermaid/D2 from code analysis (call graphs, dependency trees)
- Render diagrams to SVG/PNG for embedding in documentation
- Accept existing code visualization screenshots and correlate with source
- Update diagrams when code structure changes
What it does. Headless browser screenshot capture with full control: full-page, element, viewport, clip regions. Supports PNG, JPEG with quality settings.
What it does. Chrome DevTools protocol for automated screenshot capture. Supports device emulation, network throttling, and pixel-perfect rendering.
Architecture. The most effective approach for coding agents:
- Agent makes code changes
- Agent starts/hot-reloads the dev server
- Agent takes a screenshot via headless browser
- Agent sends screenshot to vision model for verification
- Agent iterates if the result doesn't match expectations
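A sketch of the capture step using the chromedp library, a Go client for the Chrome DevTools protocol. chromedp is an assumption here, not an existing hawk dependency.

```go
package capture

import (
	"context"
	"os"
	"time"

	"github.com/chromedp/chromedp"
)

// Screenshot navigates a headless Chrome tab to url and writes a full-page
// screenshot to outPath, emulating the given viewport size.
func Screenshot(url, outPath string, width, height int64) error {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var buf []byte
	if err := chromedp.Run(ctx,
		chromedp.EmulateViewport(width, height),
		chromedp.Navigate(url),
		chromedp.FullScreenshot(&buf, 100), // quality 100 yields PNG output
	); err != nil {
		return err
	}
	return os.WriteFile(outPath, buf, 0o644)
}
```

Capturing at several viewport sizes (mobile, tablet, desktop) is just repeated calls with different width/height values, which covers the responsive-testing recommendation below.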
A coding agent should:
- Offer a /screenshot command that captures the current UI
- Integrate with the project's dev server (detect from package.json/go.mod)
- Auto-capture screenshots before and after UI changes
- Store screenshots for visual regression tracking
- Support responsive testing: capture at multiple viewport sizes
What it does. While vision models can't generate pixel images, they can generate:
- Mermaid diagrams
- D2 diagrams
- SVG markup
- ASCII art
- PlantUML
These text-based formats are then rendered to images.
What it does. Image generation models can create UI mockups, icons, and illustrations for documentation. screenshot-to-code uses DALL-E 3 or Flux Schnell for generating placeholder images within designs.
A coding agent should:
- Generate Mermaid/D2 diagrams for documentation
- Render diagrams to SVG/PNG using mermaid-cli or D2
- Generate ASCII diagrams for terminal-based documentation
- Not use image generation models for code-related documentation (too imprecise)
What it does. LLMs are strong at generating Mermaid syntax from descriptions or code analysis. The structured text format is easy to validate and iterate on.
What it does. D2's syntax is also LLM-friendly. Its server-side rendering makes it particularly suitable for automated pipelines.
What it does. LLMs can generate simple SVG for icons, logos, and simple diagrams. Complex SVGs with many elements are unreliable.
A coding agent should:
- Generate Mermaid by default (widest compatibility, GitHub/GitLab render natively)
- Support D2 for teams that use it
- Offer a /mermaid command that generates diagrams from descriptions
- Validate generated diagram syntax before writing to files
- Include diagrams in generated documentation
What it does. Accessibility testing engine that analyzes DOM structure against WCAG guidelines. Identifies violations, provides fix suggestions, and supports custom rules. Used by Playwright, Cypress, and other test frameworks.
What it does. Vision models can analyze screenshots for accessibility issues:
- Insufficient color contrast
- Missing alt text (visible image placeholders)
- Touch target sizes too small
- Text too small to read
- No visible focus indicators
- Layout issues at zoom levels
How well it works. Complementary to DOM-based tools. Vision analysis catches issues that DOM analysis misses: visual contrast problems, overlapping elements, misleading visual hierarchy. DOM analysis catches issues vision misses: missing ARIA attributes, tab order, screen reader announcements.
A coding agent should:
- Run axe-core on generated UI code
- Supplement with vision model analysis of screenshots
- Report both programmatic and visual accessibility issues
- Suggest fixes with specific code changes
- Offer an /a11y command that audits the current UI
- Image reading (tool/file_read_media.go): Reads PNG, JPEG, GIF, WebP images and converts to base64. Handles SVG as text. Resizes oversized images. Validates dimensions and file size.
- PDF reading (tool/file_read_media.go): Validates PDF magic bytes, parses page ranges, shells out to pdftotext. Falls back gracefully if pdftotext is unavailable.
- Vision capability flag (model/router.go): Capabilities.Vision field exists in the model router, tracking which models support vision.
- Code review bridge (sight/bridge.go): Integrates with the sight code-review library for AI-powered diff analysis.
- Repo mapping (repomap/): Incremental code indexing that could feed visualization.
- Magic docs (magicdocs/): Go AST parsing for automatic documentation.
- Diff sandbox (diffsandbox/): Virtual file overlay for proposed edits.
- No multi-modal message support. The engine passes messages as string content only. There is no ContentPart, ImageBlock, or structured content block type that would allow sending images alongside text to the LLM.
- No vision routing. While Capabilities.Vision exists, the router does not use it. When a user includes an image, the agent does not automatically select a vision-capable model.
- No clipboard/paste image support. Users cannot paste screenshots into the terminal input.
- No screenshot capture. No integration with headless browsers for capturing UI state.
- No diagram generation. No Mermaid/D2 integration for generating or rendering diagrams.
- No multi-modal RAG. The memory/yaad system indexes text only, not images.
- PDF extraction is minimal. Falls back to "pdftotext not available" in most cases. No vision-based PDF understanding.
- Image analysis is read-only. Images are converted to base64 text but not sent as actual image content to the vision API.
P1a. Multi-modal message protocol.
Add a content block type system to the engine message format. Messages should support an array of content blocks: text, image (with base64 data + media type), and tool_result (which can contain image blocks). This is the single blocker for all other multi-modal features.
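A sketch of what those content blocks could look like in Go. The names echo the gap analysis above (ContentBlock, image blocks, tool results) but are illustrative, not an existing hawk API.

```go
package engine

// BlockType discriminates the content blocks a single message can carry.
type BlockType string

const (
	BlockText       BlockType = "text"
	BlockImage      BlockType = "image"
	BlockToolResult BlockType = "tool_result"
)

// ContentBlock is one element of a message. Exactly one payload is set,
// depending on Type.
type ContentBlock struct {
	Type BlockType

	Text string // BlockText

	// BlockImage: raw bytes plus media type ("image/png", "image/jpeg", ...).
	// Providers then encode as base64 or upload via their files API.
	ImageData      []byte
	ImageMediaType string

	// BlockToolResult: tool output, which can itself contain image blocks
	// (e.g. a screenshot returned by a browser tool).
	ToolResult []ContentBlock
}

// Message replaces the current string-only content with a block list.
type Message struct {
	Role   string // "user", "assistant", "tool"
	Blocks []ContentBlock
}
```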
P1b. Vision-aware model routing. When a message contains image content blocks, the router should prefer vision-capable models. If the current model lacks vision, the agent should either auto-switch or warn the user.
P1c. Image input pipeline.
Connect file_read_media.go's image reading to the message protocol. When the Read tool encounters an image, it should return an image content block (not just base64 text). This enables "read this screenshot and explain what you see."
P2a. Screenshot-to-code command (/mockup or /screenshot-to-code).
Accept an image path, detect the project's framework, generate code. This is the single most requested multi-modal feature for solo developers who often start from a design screenshot.
P2b. Error screenshot diagnosis. Accept a screenshot of an error (terminal, browser, IDE) and automatically extract + diagnose the issue. Solo developers frequently screenshot errors to share in Slack/Discord -- the same workflow should work with their agent.
P2c. PDF/document understanding upgrade. Replace pdftotext fallback with vision-model PDF reading. Send PDF pages as images to the vision model. This handles scanned documents, complex layouts, and diagrams that pdftotext misses.
P3a. Diagram generation (/diagram, /architecture).
Generate Mermaid diagrams from code analysis. Leverage existing repomap and magicdocs infrastructure. Render via mermaid-cli if available, otherwise output raw Mermaid for GitHub/GitLab rendering.
P3b. Visual diff review. Enhance the diffsandbox to capture before/after screenshots when changes affect UI files. Send both screenshots to the vision model for verification.
P3c. Design reference comparison. Accept a design image alongside code changes. After generating UI code, capture a screenshot and compare it to the reference design, reporting discrepancies.
P4a. Multi-modal RAG. Extend yaad to store image embeddings alongside text. Enable retrieval of design mockups, architecture diagrams, and documentation screenshots.
P4b. Browser integration (/browse).
Integrate a headless browser for capturing UI state, running visual tests, and verifying deployed changes.
P4c. Accessibility auditing (/a11y).
Combine axe-core DOM analysis with vision model screenshot analysis for comprehensive accessibility testing.
P4d. Mobile screenshot analysis. Platform detection (iOS/Android) from screenshots, with framework-appropriate code generation.
- Use eyrie for multi-modal API calls. The image content block protocol should be implemented in eyrie so all providers handle it consistently. Claude, GPT-4V, and Gemini all support image inputs but with different API formats.
- Image preprocessing pipeline. Build a shared pipeline for: dimension checking, resize to optimal size for the target model, format conversion (HEIC->JPEG), base64 encoding, and token cost estimation (see the sketch after this list).
- Cost awareness. Image tokens are expensive. A 1920x1080 screenshot costs ~1568 tokens ($0.005 on Sonnet, $0.014 on Opus). The agent should inform users of image costs and downsample aggressively for exploratory queries.
- Progressive detail. Start with low-resolution analysis, then re-analyze at high resolution only if needed. Similar to how Claude's computer use demo recommends XGA resolution for interaction.
- Tool composition. Multi-modal tools should compose: "read this PDF" -> extract images -> analyze diagrams -> generate code. The engine's tool orchestration already supports this pattern.
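A sketch of the resize-and-re-encode portion of that preprocessing pipeline, using golang.org/x/image/draw for scaling and the width * height / 750 rule for the token estimate; the 1568px long-edge default mentioned earlier would be a typical maxEdge value.

```go
package imgprep

import (
	"bytes"
	"encoding/base64"
	"image"
	"image/jpeg"
	_ "image/png" // register PNG decoding

	"golang.org/x/image/draw"
)

// Prepare decodes an image, scales its long edge down to maxEdge if needed,
// re-encodes it as JPEG, and returns the base64 payload plus a rough token
// estimate (width * height / 750).
func Prepare(raw []byte, maxEdge int) (b64 string, tokens int, err error) {
	src, _, err := image.Decode(bytes.NewReader(raw))
	if err != nil {
		return "", 0, err
	}

	b := src.Bounds()
	w, h := b.Dx(), b.Dy()
	if w > maxEdge || h > maxEdge {
		scale := float64(maxEdge) / float64(max(w, h))
		nw, nh := int(float64(w)*scale), int(float64(h)*scale)
		dst := image.NewRGBA(image.Rect(0, 0, nw, nh))
		draw.CatmullRom.Scale(dst, dst.Bounds(), src, b, draw.Over, nil)
		src, w, h = dst, nw, nh
	}

	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, src, &jpeg.Options{Quality: 85}); err != nil {
		return "", 0, err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), w * h / 750, nil
}
```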
- Image generation models (DALL-E, Midjourney) for code documentation. Text-based diagram formats (Mermaid, D2) are superior: version-controllable, diffable, and precise.
- Custom OCR models. Vision models are better at code OCR than Tesseract. Don't add a separate OCR dependency.
- Full browser automation agent. This is a different product (browser-use, LaVague). Hawk should support screenshot capture and verification, not general web automation.
- Video understanding. While some models support it, the use cases for coding agents are too narrow to justify the complexity and cost.