Commit e9e55ea

Added multimodal (vision) support
1 parent c8f98ec commit e9e55ea

File tree

6 files changed: +653 -53 lines changed


API.md

Lines changed: 79 additions & 5 deletions
````diff
@@ -16,7 +16,7 @@ Returns the current version of the SQLite-AI extension.
 
 ```sql
 SELECT ai_version();
--- e.g., '0.9.0'
+-- e.g., '1.0.0'
 ```
 
 ---
````
````diff
@@ -608,17 +608,29 @@ SELECT llm_embed_generate('hello world', 'json_output=1');
 
 ---
 
-## `llm_text_generate(text TEXT, options TEXT)`
+## `llm_text_generate(text TEXT, [image1, image2, ...], options TEXT)`
 
 **Returns:** `TEXT`
 
 **Description:**
 Generates a full-text completion based on input, with optional configuration provided as a comma-separated list of key=value pairs.
 
-**Example:**
+When a vision model is loaded via `llm_vision_load()`, you can pass one or more images as additional arguments. Images can be file paths (TEXT) or raw image data (BLOB). Supported image formats: JPG, PNG, BMP, GIF.
+
+**Examples:**
 
 ```sql
+-- Text-only generation
 SELECT llm_text_generate('Once upon a time', 'n_predict=1024');
+
+-- Vision: describe an image
+SELECT llm_text_generate('Describe this image', './photos/cat.jpg');
+
+-- Vision: compare multiple images
+SELECT llm_text_generate('What is different between these images?', './img1.jpg', './img2.jpg');
+
+-- Vision: image from BLOB column
+SELECT llm_text_generate('What do you see?', image_data) FROM photos WHERE id = 1;
 ```
 
 ---
````
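The last example above reads image bytes from a BLOB column. A minimal sketch of how such bytes end up in that column, using only Python's built-in `sqlite3` module (the SQLite-AI extension is not loaded here, and the `photos` table plus the stand-in JPG bytes are illustrative):

```python
import sqlite3

# Hypothetical photos table mirroring the docs' example; the SQLite-AI
# extension itself is not loaded here -- this only shows how raw image
# bytes reach the BLOB column that llm_text_generate() would consume.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, image_data BLOB)")

image_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # stand-in for real JPG data
conn.execute("INSERT INTO photos (id, image_data) VALUES (1, ?)", (image_bytes,))

# With the extension loaded, the documented call would then read the BLOB:
#   SELECT llm_text_generate('What do you see?', image_data)
#   FROM photos WHERE id = 1;
(stored,) = conn.execute("SELECT image_data FROM photos WHERE id = 1").fetchone()
assert stored == image_bytes
```

Storing the raw bytes directly keeps the whole pipeline inside the database, so no file paths need to exist at query time.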
````diff
@@ -700,18 +712,27 @@ SELECT llm_chat_restore('b59e...');
 
 ---
 
-## `llm_chat_respond(text TEXT)`
+## `llm_chat_respond(text TEXT, [image1, image2, ...])`
 
 **Returns:** `TEXT`
 
 **Description:**
 Generates a context-aware reply using chat memory, returned as a single, complete response.
 For a streaming model reply, use the llm_chat virtual table.
 
-**Example:**
+When a vision model is loaded via `llm_vision_load()`, you can pass one or more images as additional arguments. Images can be file paths (TEXT) or raw image data (BLOB). Supported image formats: JPG, PNG, BMP, GIF.
+
+**Examples:**
 
 ```sql
+-- Text-only chat
 SELECT llm_chat_respond('What are the most visited cities in Italy?');
+
+-- Vision: ask about an image
+SELECT llm_chat_respond('What is in this photo?', './photos/landscape.jpg');
+
+-- Vision: multiple images
+SELECT llm_chat_respond('Compare these two charts', './chart1.png', './chart2.png');
 ```
 
 ---
````
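Because the image arguments are variadic, application code that binds parameters needs one placeholder per image. A small sketch of query construction on the caller side (the helper name is hypothetical, not part of the extension):

```python
def build_chat_respond_sql(n_images: int) -> str:
    """Hypothetical helper: build an llm_chat_respond() call with one
    bound parameter for the prompt plus one per image argument."""
    placeholders = ", ".join(["?"] * (n_images + 1))
    return f"SELECT llm_chat_respond({placeholders})"

# For a prompt plus two images:
sql = build_chat_respond_sql(2)
assert sql == "SELECT llm_chat_respond(?, ?, ?)"
```

The same pattern applies to `llm_text_generate()`, whose trailing options string would simply become one more bound parameter.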
````diff
@@ -735,6 +756,59 @@ SELECT llm_chat_system_prompt();
 
 ---
 
+## Vision Functions
+
+### `llm_vision_load(path TEXT, options TEXT)`
+
+**Returns:** `NULL`
+
+**Description:**
+Loads a multimodal projector (mmproj) model for vision capabilities. This requires a text model to already be loaded via `llm_model_load()`. The mmproj file is a separate GGUF file that contains the vision encoder and projector weights.
+
+Once loaded, vision capabilities are available through `llm_text_generate()` and `llm_chat_respond()` by passing image arguments.
+
+The following option keys are available:
+
+| Key                | Type                      | Default | Meaning                                                              |
+| ------------------ | ------------------------- | ------- | -------------------------------------------------------------------- |
+| `use_gpu`          | `1 or 0`                  | `1`     | Use GPU for vision encoding.                                         |
+| `n_threads`        | `number`                  | `4`     | Number of threads for vision processing.                             |
+| `warmup`           | `1 or 0`                  | `1`     | Run a warmup pass on load for faster first use.                      |
+| `flash_attn_type`  | `auto, disabled, enabled` | `auto`  | Controls Flash Attention for the vision encoder.                     |
+| `image_min_tokens` | `number`                  | `0`     | Minimum image tokens for dynamic resolution models (0 = from model). |
+| `image_max_tokens` | `number`                  | `0`     | Maximum image tokens for dynamic resolution models (0 = from model). |
+
+**Example:**
+
+```sql
+-- Load text model first
+SELECT llm_model_load('./models/Gemma-3-4B-IT-Q4_K_M.gguf', 'gpu_layers=99');
+SELECT llm_context_create_textgen();
+
+-- Load vision projector
+SELECT llm_vision_load('./models/mmproj-Gemma-3-4B-IT-f16.gguf');
+
+-- Now use vision with llm_text_generate or llm_chat_respond
+SELECT llm_text_generate('Describe this image', './photos/cat.jpg');
+```
+
+---
+
+### `llm_vision_free()`
+
+**Returns:** `NULL`
+
+**Description:**
+Unloads the current vision (mmproj) model and frees associated memory. The text model remains loaded.
+
+**Example:**
+
+```sql
+SELECT llm_vision_free();
+```
+
+---
+
 ## Audio Functions
 
 ### `audio_model_load(path TEXT, options TEXT)`
````
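As elsewhere in the API, the options argument to `llm_vision_load()` is a comma-separated list of key=value pairs. An illustrative sketch of what parsing that format against the documented keys could look like on the application side (the parser and its strictness are assumptions; the extension does its own parsing internally):

```python
# Defaults taken from the llm_vision_load() option table above.
VISION_DEFAULTS = {
    "use_gpu": "1",
    "n_threads": "4",
    "warmup": "1",
    "flash_attn_type": "auto",
    "image_min_tokens": "0",
    "image_max_tokens": "0",
}

def parse_options(options: str) -> dict:
    """Split 'k1=v1,k2=v2' into a dict, rejecting unknown keys."""
    parsed = dict(VISION_DEFAULTS)
    for pair in filter(None, options.split(",")):
        key, _, value = pair.partition("=")
        key = key.strip()
        if key not in parsed:
            raise ValueError(f"unknown option: {key}")
        parsed[key] = value.strip()
    return parsed

opts = parse_options("use_gpu=0,n_threads=8")
assert opts["use_gpu"] == "0" and opts["n_threads"] == "8"
assert opts["warmup"] == "1"  # untouched keys keep their defaults
```

Validating the string before calling into SQL can turn a silent misconfiguration into an early, explicit error.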

Makefile

Lines changed: 4 additions & 4 deletions
````diff
@@ -58,8 +58,8 @@ SKIP_UNITTEST ?= 0
 # Compiler and flags
 CC = gcc
 CXX = g++
-CFLAGS = -Wall -Wextra -Wno-unused-parameter -I$(SRC_DIR) -I$(BUILD_GGML)/include -I$(WHISPER_DIR)/include -I$(MINIAUDIO_DIR)
-LLAMA_OPTIONS = $(LLAMA) -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=OFF -DLLAMA_BUILD_SERVER=OFF -DGGML_RPC=OFF
+CFLAGS = -Wall -Wextra -Wno-unused-parameter -I$(SRC_DIR) -I$(BUILD_GGML)/include -I$(WHISPER_DIR)/include -I$(MINIAUDIO_DIR) -I$(LLAMA_DIR)/tools/mtmd
+LLAMA_OPTIONS = $(LLAMA) -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_EXAMPLES=OFF -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=OFF -DGGML_RPC=OFF
 WHISPER_OPTIONS = $(LLAMA) $(WHISPER) -DBUILD_SHARED_LIBS=OFF -DWHISPER_BUILD_EXAMPLES=OFF -DWHISPER_BUILD_TESTS=OFF -DWHISPER_BUILD_SERVER=OFF -DWHISPER_RPC=OFF -DWHISPER_USE_SYSTEM_GGML=ON
 MINIAUDIO_OPTIONS = $(MINIAUDIO) -DBUILD_SHARED_LIBS=OFF -DMINIAUDIO_BUILD_EXAMPLES=OFF -DMINIAUDIO_BUILD_TESTS=OFF
 # MinGW produces .a files without lib prefix, use -l:filename.a syntax
````
````diff
@@ -69,7 +69,7 @@ ifeq ($(PLATFORM),windows)
 else
 L = -l
 endif
-LLAMA_LDFLAGS = -L./$(BUILD_GGML)/lib -L./$(BUILD_LLAMA)/src -lllama $(L)ggml$(A) $(L)ggml-base$(A) $(L)ggml-cpu$(A)
+LLAMA_LDFLAGS = -L./$(BUILD_LLAMA)/tools/mtmd -L./$(BUILD_GGML)/lib -L./$(BUILD_LLAMA)/src -lmtmd -lllama $(L)ggml$(A) $(L)ggml-base$(A) $(L)ggml-cpu$(A)
 WHISPER_LDFLAGS = -L./$(BUILD_WHISPER)/src -lwhisper
 MINIAUDIO_LDFLAGS = -L./$(BUILD_MINIAUDIO) -lminiaudio -lminiaudio_channel_combiner_node -lminiaudio_channel_separator_node -lminiaudio_ltrim_node -lminiaudio_reverb_node -lminiaudio_vocoder_node
 LDFLAGS = $(LLAMA_LDFLAGS) $(WHISPER_LDFLAGS) $(MINIAUDIO_LDFLAGS)
````
````diff
@@ -85,7 +85,7 @@ SQLITE_TEST_SRC = tests/c/sqlite3.c
 # Files
 SRC_FILES = $(wildcard $(SRC_DIR)/*.c)
 OBJ_FILES = $(patsubst %.c, $(BUILD_DIR)/%.o, $(notdir $(SRC_FILES)))
-LLAMA_LIBS = $(BUILD_GGML)/libggml.a $(BUILD_GGML)/libggml-base.a $(BUILD_GGML)/libggml-cpu.a $(BUILD_LLAMA)/src/libllama.a
+LLAMA_LIBS = $(BUILD_LLAMA)/tools/mtmd/libmtmd.a $(BUILD_GGML)/libggml.a $(BUILD_GGML)/libggml-base.a $(BUILD_GGML)/libggml-cpu.a $(BUILD_LLAMA)/src/libllama.a
 WHISPER_LIBS = $(BUILD_WHISPER)/src/libwhisper.a
 MINIAUDIO_LIBS = $(BUILD_MINIAUDIO)/libminiaudio.a
 
````

README.md

Lines changed: 21 additions & 1 deletion
````diff
@@ -11,9 +11,10 @@
 * **Offline-First**: No server dependencies or internet connection required.
 * **Composable SQL Interface**: AI + relational logic in a single unified layer.
 * **Audio Transcription**: Speech-to-text via Whisper models (WAV, MP3, FLAC).
+* **Vision / Multimodal**: Analyze images via multimodal models (JPG, PNG, BMP, GIF).
 * **Supports any GGUF model**: available on Huggingface; Qwen, Gemma, Llama, DeepSeek and more
 
-SQLite-AI supports **text embedding generation** for search and classification, a **chat-like interface with history and token streaming**, **automatic context save and restore** across sessions, and **audio transcription** via Whisper models — making it ideal for building conversational agents, memory-aware assistants, and voice-enabled applications.
+SQLite-AI supports **text embedding generation** for search and classification, a **chat-like interface with history and token streaming**, **automatic context save and restore** across sessions, **audio transcription** via Whisper models, and **vision/multimodal** image understanding — making it ideal for building conversational agents, memory-aware assistants, and voice-enabled applications.
 
 ## Getting Started
 
````

````diff
@@ -82,6 +83,25 @@ SELECT audio_model_transcribe('./audio/speech.mp3', 'language=it,translate=1');
 SELECT audio_model_transcribe(audio_data) FROM recordings WHERE id = 1;
 ```
 
+### Vision / Multimodal
+
+```sql
+-- Load a multimodal model and its vision projector
+SELECT llm_model_load('./models/Gemma-3-4B-IT-Q4_K_M.gguf', 'gpu_layers=99');
+SELECT llm_context_create_textgen();
+SELECT llm_vision_load('./models/mmproj-Gemma-3-4B-IT-f16.gguf');
+
+-- Describe an image
+SELECT llm_text_generate('Describe this image', './photos/cat.jpg');
+
+-- Use vision in a chat conversation
+SELECT llm_context_create_chat();
+SELECT llm_chat_respond('What do you see in this photo?', './photos/landscape.jpg');
+
+-- Analyze multiple images
+SELECT llm_text_generate('Compare these two images', './img1.jpg', './img2.jpg');
+```
+
 ## Documentation
 
 For detailed information on all available functions, their parameters, and examples, refer to the [comprehensive API Reference](./API.md).
````
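Since only JPG, PNG, BMP, and GIF are accepted as image inputs, application code may want to sanity-check bytes before handing them to the extension. A sketch using the standard magic numbers for those formats (this pre-check is an application-side convenience I am assuming, not part of SQLite-AI):

```python
def sniff_image_format(data: bytes):
    """Best-effort check that a buffer looks like one of the image
    formats the extension accepts (JPG, PNG, BMP, GIF). Returns the
    format name or None. Magic numbers are the standard ones for
    each format."""
    if data.startswith(b"\xff\xd8\xff"):
        return "JPG"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if data.startswith(b"BM"):
        return "BMP"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    return None

assert sniff_image_format(b"GIF89a" + b"\x00" * 10) == "GIF"
assert sniff_image_format(b"\x00\x01") is None
```

Rejecting unsupported buffers before the SQL call keeps format errors out of the model-inference path.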
