You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
AI-powered semantic image comparison using Vision Language Models via Ollama.
3
+
Hybrid image comparison combining pixelmatch for objective difference detection and Vision Language Models (via Ollama) for human-noticeability analysis.
4
+
5
+
## Architecture Flow
6
+
7
+
```text
8
+
VLM Comparison Request
9
+
│
10
+
▼
11
+
Run Pixelmatch Comparison
12
+
│
13
+
├─→ No Differences Found → Return OK Status
14
+
│
15
+
└─→ Differences Found
16
+
│
17
+
▼
18
+
Save Diff Image
19
+
│
20
+
▼
21
+
Run VLM with 3 Images:
22
+
(Baseline, Comparison, Diff)
23
+
│
24
+
├─→ Not Noticeable → Override: Return OK Status
25
+
│
26
+
└─→ Noticeable → Return Unresolved with VLM Description
27
+
```
4
28
5
29
## Quick Start
6
30
@@ -18,10 +42,9 @@ ollama serve
18
42
19
43
```bash
20
44
# Recommended for accuracy
21
-
ollama pull llava:7b
45
+
ollama pull gemma3:12b
22
46
23
-
# Or for speed (smaller, less accurate)
24
-
ollama pull moondream
47
+
# Note: Smaller models do not show proper results - use gemma3:12b only
Set project's image comparison to `vlm` with config:
37
60
```json
38
61
{
39
-
"model": "llava:7b",
62
+
"model": "gemma3:12b",
40
63
"temperature": 0.1
41
64
}
42
65
```
43
66
44
67
Optional custom prompt (replaces default system prompt):
45
68
```json
46
69
{
47
-
"model": "llava:7b",
70
+
"model": "gemma3:12b",
48
71
"prompt": "Focus on button colors and text changes",
49
72
"temperature": 0.1
50
73
}
51
74
```
52
75
53
-
**Note:** The `prompt` field replaces the entire system prompt. If omitted, a default system prompt is used that focuses on semantic differences while ignoring rendering artifacts.
76
+
**Note:** The `prompt` field replaces the entire system prompt. If omitted, a default system prompt is used that analyzes the diff image to determine if highlighted differences are noticeable to humans.
|`minicpm-v`| 5.5GB | ⚡⚡ | ⭐⭐⭐ | Good alternative |
80
+
| Model | Size |
81
+
|-------|------|
82
+
|`gemma3:12b`|~12GB - **Recommended**|
83
+
84
+
**Note:** Models smaller than the default (`gemma3:12b`) have been tested and do not show proper results. They fail to follow structured output formats reliably and may produce incorrect or inconsistent responses. For production use, only use `gemma3:12b` or `llava:13b`.
64
85
65
86
## Configuration
66
87
67
88
| Option | Type | Default | Description |
68
89
|--------|------|---------|-------------|
69
-
|`model`| string |`llava:7b`| Ollama vision model name |
90
+
|`model`| string |`gemma3:12b`| Ollama vision model name |
70
91
|`prompt`| string | System prompt (see below) | Custom prompt for image comparison |
71
92
|`temperature`| number |`0.1`| Lower = more consistent results (0.0-1.0) |
72
93
73
-
## How It Works
74
-
75
-
1. VLM analyzes both images semantically
76
-
2. Returns JSON with `{"identical": true/false, "description": "..."}`
77
-
3.`identical: true` = images match (pass), `identical: false` = differences found (fail)
0 commit comments