
Commit 6d82bec

Update readme
1 parent 832a8bd commit 6d82bec

3 files changed

Lines changed: 239 additions & 7 deletions

README.md

Lines changed: 8 additions & 7 deletions
@@ -58,7 +58,7 @@ Add `:image_vision` to `mix.exs` along with whichever optional ML backends you n
5858
```elixir
5959
def deps do
6060
[
61-
{:image_vision, "~> 0.2"},
61+
{:image_vision, "~> 0.3"},
6262

6363
# Required for Image.Classification and Image.Classification.embed/2
6464
{:bumblebee, "~> 0.6"},
@@ -199,9 +199,10 @@ config :image_vision, :classifier,
199199

200200
## Guides
201201

202-
- [Classification](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/classification.md) — classifying images and computing embeddings
203-
- [Detection](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/detection.md) — bounding-box object detection
204-
- [Segmentation](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/segmentation.md) — promptable and panoptic segmentation
205-
- [Background removal](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/background.md) — class-agnostic foreground cutout
206-
- [Captioning](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/captioning.md) — natural-language image descriptions
207-
- [Zero-shot classification](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/zero_shot.md) — classify against arbitrary labels via CLIP
202+
- [Overview](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/overview.md) — the four computer-vision task families (classification, detection, segmentation, description), a decision table mapping intent → module, worked recipes per module, and a section on composing tasks (face-aware crops, background swap, visual search, custom-taxonomy moderation)
203+
- [Classification](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/classification.md) — classifying images and computing embeddings
204+
- [Detection](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/detection.md) — bounding-box object detection
205+
- [Segmentation](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/segmentation.md) — promptable and panoptic segmentation
206+
- [Background removal](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/background.md) — class-agnostic foreground cutout
207+
- [Captioning](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/captioning.md) — natural-language image descriptions
208+
- [Zero-shot classification](https://github.com/elixir-image/image_vision/blob/v0.3.0/guides/zero_shot.md) — classify against arbitrary labels via CLIP

guides/overview.md

Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
1+
# Computer-vision tasks at a glance
2+
3+
`image_vision` ships seven public modules, each wrapping a different ONNX or Bumblebee-served model. This guide is a map: it explains which module solves which kind of problem, shows the call shape for each, and points you to the per-module guides for the details.
4+
5+
Four task families cover everything in the library:
6+
7+
* **Classification** — "what is this image about?". Returns a list of labels.
8+
* **Detection** — "where are the things?". Returns bounding boxes plus labels.
9+
* **Segmentation** — "which pixels belong to which thing?". Returns pixel masks.
10+
* **Description** — "what's happening in this image, in words?". Returns a natural-language sentence.
11+
12+
Most computer-vision tasks reduce to one of these four. Some applications — face-aware crops, background removal, content moderation — stitch two or three together.
13+
14+
## The decision table
15+
16+
| You want… | Use | Returns |
17+
| --- | --- | --- |
18+
| One-line label for what's in the image | `Image.Classification.labels/2` | `["Blenheim spaniel"]` |
19+
| Embedding vector for similarity / search | `Image.Classification.embed/2` | `Nx.Tensor` (1000-dim by default) |
20+
| Classify against custom labels (no retraining) | `Image.ZeroShot.classify/3` | `[%{label: "cat", score: 0.92}, …]` |
21+
| Bounding boxes around every detected object | `Image.Detection.detect/2` | `[%{box: {x, y, w, h}, label: "dog", score: 0.87}, …]` |
22+
| Bounding boxes around every face (plus landmarks) | `Image.FaceDetection.detect/2` | `[%{box: …, score: …, landmarks: [{x, y}, …]}, …]` |
23+
| Cut out a specific object from a click or box | `Image.Segmentation.segment/2` | `Vix.Vips.Image` mask |
24+
| Class-labeled regions for the whole scene | `Image.Segmentation.segment_panoptic/2` | `[%{label: "sky", mask: …}, …]` |
25+
| Extract just the foreground subject | `Image.Background.remove/2` | `Vix.Vips.Image` with alpha |
26+
| Natural-language caption for the image | `Image.Captioning.caption/2` | `"a dog sitting on a park bench"` |
27+
28+
The same `t:Vix.Vips.Image.t/0` is the input for everything; the modules differ only in what they extract from it.
29+
30+
## Classification — "what is this image about?"
31+
32+
`Image.Classification` runs an ImageNet-trained backbone (ConvNeXt by default) and returns the top labels. It's the cheapest task in the library — fast inference, small model, no per-pixel work — so it's a good first reach for content tagging, search indexing, or anything that needs a coarse "what kind of image is this?" signal.
33+
34+
```elixir
35+
puppy = Image.open!("puppy.jpg")
36+
37+
Image.Classification.labels(puppy)
38+
#=> ["Blenheim spaniel"]
39+
40+
Image.Classification.labels(puppy, min_score: 0.1)
41+
#=> ["Blenheim spaniel", "cocker spaniel", "papillon"]
42+
```
43+
44+
For similarity search — "find me other images that look like this" — use `embed/2` instead. It returns the model's penultimate-layer activations as an `Nx.Tensor`, which you can dot-product against pre-computed embeddings of your library:
45+
46+
```elixir
47+
query_vec = Image.Classification.embed(puppy)
48+
# Compare against pre-stored vectors via cosine similarity, etc.
49+
```
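The comparison step mentioned in the comment above can be sketched in plain Elixir. This is a minimal sketch, not part of `image_vision`: the `SimilaritySketch` module and its functions are hypothetical, and it assumes each embedding has already been converted to a flat list of floats (for example with `Nx.to_flat_list/1`).

```elixir
defmodule SimilaritySketch do
  # Hypothetical helper for illustration; not part of image_vision.
  # Vectors are plain lists of floats of equal length.

  # Cosine similarity: dot product divided by the product of the norms.
  # Raises ArithmeticError if either vector is all zeros.
  def cosine(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    dot / (norm(a) * norm(b))
  end

  # Rank stored {id, vector} pairs by similarity to the query, best first.
  def top_k(query, library, k) do
    library
    |> Enum.map(fn {id, vec} -> {id, cosine(query, vec)} end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
    |> Enum.take(k)
  end

  defp norm(v) do
    v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
  end
end
```

For a library of more than a few thousand images, a vector database or batched `Nx` operations would likely be a better fit than list arithmetic; the sketch only shows the shape of the computation.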
50+
51+
When ImageNet's 1000 categories don't cover your labels — "branded vs non-branded", "indoor vs outdoor", "edited vs unedited" — reach for `Image.ZeroShot` instead. It uses CLIP to classify against arbitrary labels you supply at request time:
52+
53+
```elixir
54+
Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a bird"])
55+
#=> [%{label: "a dog", score: 0.97}, %{label: "a cat", score: 0.02}, …]
56+
```
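Callers usually need to turn the scored list into a decision. The following is a minimal sketch of one way to do that, assuming results shaped like the `%{label: _, score: _}` maps shown above; `ZeroShotDecisionSketch` and the threshold value are hypothetical and not part of the library.

```elixir
defmodule ZeroShotDecisionSketch do
  # Hypothetical helper for illustration; not part of image_vision.
  # `results` is a list of %{label: String.t(), score: float()} maps.

  # No candidates at all: nothing to decide on.
  def decide([], _min_score), do: :uncertain

  # Accept the best-scoring label only if it clears the threshold.
  def decide(results, min_score) do
    best = Enum.max_by(results, & &1.score)
    if best.score >= min_score, do: {:ok, best.label}, else: :uncertain
  end
end
```

Returning `:uncertain` instead of always taking the top label lets the caller route ambiguous images to a fallback, such as manual review.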
57+
58+
CLIP is heavier than ConvNeXt (a vision transformer plus a text transformer) but the labels are free-form, which makes it the right tool for any taxonomy you didn't train against.
59+
60+
See [`classification.md`](classification.md) and [`zero_shot.md`](zero_shot.md) for the full API surface.
61+
62+
## Detection — "where are the things?"
63+
64+
`Image.Detection` and `Image.FaceDetection` both return bounding boxes; they differ in what they look for and how rich the per-box metadata is.
65+
66+
`Image.Detection` uses RT-DETR (real-time detection transformer) trained on COCO, so it knows ~80 object categories — people, vehicles, animals, household items, food. Box plus label plus confidence:
67+
68+
```elixir
69+
park = Image.open!("park.jpg")
70+
71+
Image.Detection.detect(park)
72+
#=> [
73+
#=> %{box: {120, 80, 240, 380}, label: "person", score: 0.95},
74+
#=> %{box: {410, 220, 180, 160}, label: "dog", score: 0.91},
75+
#=> %{box: {620, 30, 300, 200}, label: "kite", score: 0.82}
76+
#=> ]
77+
```
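Because the result is a plain list of maps, post-processing needs nothing beyond `Enum`. A minimal sketch, assuming detections shaped like the example output above; the `DetectionFilterSketch` module and its option names are hypothetical, not library API.

```elixir
defmodule DetectionFilterSketch do
  # Hypothetical helper for illustration; not part of image_vision.
  # Operates on maps shaped like %{box: _, label: _, score: _}.

  # Keep detections at or above :min_score (default 0.5) and, if :labels
  # is given, only those whose label is in that list.
  def keep(detections, opts \\ []) do
    min_score = Keyword.get(opts, :min_score, 0.5)
    labels = Keyword.get(opts, :labels, :any)

    Enum.filter(detections, fn %{label: label, score: score} ->
      score >= min_score and (labels == :any or label in labels)
    end)
  end

  # Count how many detections were found for each label.
  def count_by_label(detections) do
    Enum.frequencies_by(detections, & &1.label)
  end
end
```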
78+
79+
`Image.FaceDetection` uses YuNet, a much smaller and faster model specialised for faces. The trade-off: only one class (faces), but you also get five-point landmarks (eyes, nose, mouth corners) per box:
80+
81+
```elixir
82+
portrait = Image.open!("portrait.jpg")
83+
84+
Image.FaceDetection.detect(portrait)
85+
#=> [
86+
#=> %{
87+
#=> box: {142, 84, 218, 240},
88+
#=> score: 0.98,
89+
#=> landmarks: [{195, 162}, {275, 160}, {235, 200}, {200, 240}, {270, 240}]
90+
#=> }
91+
#=> ]
92+
```
93+
94+
Both modules ship a `draw/2` helper that overlays the boxes on the image — useful for debugging and for the "show the user what the model saw" UX.
95+
96+
`Image.FaceDetection` is the basis for `image_plug`'s `gravity: :face` crop and `Ops.PixelateFaces`. See [`detection.md`](detection.md) for the general object case and [`image_plug`'s face-aware guide](https://hexdocs.pm/image_plug/face_aware.html) for the face-aware crop integration.
97+
98+
## Segmentation — "which pixels belong to which thing?"
99+
100+
`Image.Segmentation` is the heaviest task by far — it produces per-pixel masks rather than just boxes. Two flavours, with very different shapes:
101+
102+
* **Promptable** (`segment/2`) — you give it a hint ("here's where the object is") and it returns one mask. Backed by SAM 2. Good for "the user clicked here, cut out what they pointed at".
103+
104+
* **Panoptic** (`segment_panoptic/2`) — no prompt, returns one mask per detected region with a class label. Good for scene parsing — sky, road, building, person, car all separated automatically.
105+
106+
Promptable, with a click point:
107+
108+
```elixir
109+
photo = Image.open!("scene.jpg")
110+
mask = Image.Segmentation.segment(photo, point: {340, 220})
111+
112+
# Apply the mask as an alpha band: pixels outside the segmented object become transparent.
113+
{:ok, isolated} = Vix.Vips.Operation.bandjoin([photo, mask])
114+
```
115+
116+
Promptable, with a bounding box:
117+
118+
```elixir
119+
mask = Image.Segmentation.segment(photo, box: {120, 80, 280, 320})
120+
```
121+
122+
Panoptic — every region gets a class:
123+
124+
```elixir
125+
Image.Segmentation.segment_panoptic(photo)
126+
#=> [
127+
#=> %{label: "sky", mask: %Vix.Vips.Image{...}},
128+
#=> %{label: "person", mask: %Vix.Vips.Image{...}},
129+
#=> %{label: "building", mask: %Vix.Vips.Image{...}}
130+
#=> ]
131+
```
132+
133+
`Image.Background` is segmentation's narrowest and most ergonomic case: foreground vs background, no prompt needed. It's what you reach for when the only thing you actually want is "give me the subject with a transparent background":
134+
135+
```elixir
136+
{:ok, cutout} = Image.Background.remove(photo)
137+
# `cutout` is the original image with everything outside the detected
138+
# foreground subject made transparent.
139+
```
140+
141+
`Image.Background.remove/2` is a class-agnostic foreground extractor — it works on any subject (person, product, pet, plant) without needing to be told what to look for. For arbitrary user-pointed segmentation (the "click anywhere" UX), use `Image.Segmentation.segment/2` instead.
142+
143+
See [`segmentation.md`](segmentation.md) and [`background.md`](background.md) for the full call shapes.
144+
145+
## Description — "what's happening, in words?"
146+
147+
`Image.Captioning` produces a natural-language sentence describing the image, using BLIP (Bootstrapping Language-Image Pre-training):
148+
149+
```elixir
150+
photo = Image.open!("park.jpg")
151+
152+
Image.Captioning.caption(photo)
153+
#=> "a man walking his dog in a park"
154+
```
155+
156+
Captions are useful for accessibility (`alt` text generation), search (full-text indexing on caption strings), and content workflows (auto-tagging that would have to enumerate labels otherwise). The model is generative, so the same image can produce slightly different captions across runs unless you fix the seed; for stable indexing pin the generation parameters via `:generation_config`.
157+
158+
BLIP is the heaviest model in the library — a vision encoder plus a language decoder, a few hundred MB on disk and several seconds per inference on CPU. For high-throughput captioning you want a GPU-backed Nx backend (EXLA on CUDA or Metal); CPU works for low-volume use.
159+
160+
See [`captioning.md`](captioning.md) for prompt templates, batch processing, and the generation-parameter knobs.
161+
162+
## Composing tasks
163+
164+
A few common combinations:
165+
166+
* **Face-aware crop**: `FaceDetection.detect/2` → take the highest-confidence box → crop to that region with padding. This is what `image_plug`'s `gravity: :face` does end-to-end.
167+
168+
* **Pixelate faces**: `FaceDetection.detect/2` → for each box, apply pixelation only inside the region. This is `Image.Plug.Pipeline.Ops.PixelateFaces`.
169+
170+
* **Caption + tag**: `Captioning.caption/2` for a sentence + `Classification.labels/2` for indexable labels. The caption is human-readable and the labels are queryable; storing both gives you free-text search plus faceted filtering.
171+
172+
* **Background swap**: `Background.remove/2` to extract the subject → composite onto a different background image with `Image.compose/3`. The "studio shot" effect from arbitrary phone photos.
173+
174+
* **Object-aware blur**: `Detection.detect/2` → for each detection over a confidence threshold, blur outside the box. Useful for privacy redaction at scale.
175+
176+
* **Visual search**: `Classification.embed/2` on every image at index time → store the vectors in a vector database → at query time, `embed` the query image and dot-product. This is the standard "more like this" search.
177+
178+
* **Custom-taxonomy moderation**: `ZeroShot.classify/3` with your moderation labels (`["safe", "violent", "explicit", …]`). No training data, no model fine-tuning, just label strings.
179+
180+
The library is designed to be composable: the modules don't share state, every function takes a `Vix.Vips.Image`, and every output is either another `Vix.Vips.Image` (so you can pipe it into the next step) or a plain Elixir term (so you can serialise / store / route it).
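As an illustration of the face-aware crop recipe, the region arithmetic between detection and cropping can be sketched as below. This is a sketch under assumptions, not `image_plug`'s implementation: `FaceCropSketch` is a hypothetical module, boxes are assumed to be `{x, y, w, h}` in pixels as in the decision table, and `padding` is a fraction of the box size.

```elixir
defmodule FaceCropSketch do
  # Hypothetical helper for illustration; not part of image_vision or image_plug.
  # Faces are maps shaped like %{box: {x, y, w, h}, score: float()}.

  def crop_region([], _image_size, _padding), do: :no_face

  def crop_region(faces, {img_w, img_h}, padding) do
    # Take the highest-confidence face.
    %{box: {x, y, w, h}} = Enum.max_by(faces, & &1.score)

    # Grow the box by `padding` on every side, clamped to the image bounds.
    pad_x = round(w * padding)
    pad_y = round(h * padding)
    left = max(x - pad_x, 0)
    top = max(y - pad_y, 0)
    right = min(x + w + pad_x, img_w)
    bottom = min(y + h + pad_y, img_h)

    {:ok, {left, top, right - left, bottom - top}}
  end
end
```

The returned `{left, top, width, height}` tuple could then be handed to a crop call on the original image; which crop function to use is left open here.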
181+
182+
## Optional dependencies and model loading
183+
184+
Different modules need different runtimes:
185+
186+
| Module | Runtime | Notes |
187+
| --- | --- | --- |
188+
| `Image.Classification`, `Image.Captioning`, `Image.ZeroShot` | Bumblebee + Nx + EXLA | Hugging Face model loading; GPU-friendly via EXLA backend. |
189+
| `Image.Detection`, `Image.Segmentation`, `Image.Background` | Ortex + Nx | ONNX Runtime; faster cold-start than Bumblebee, no GPU acceleration on macOS. |
190+
| `Image.FaceDetection` | Ortex + Nx | YuNet ONNX model, ~340 KB on disk. |
191+
192+
Add `:image_vision` to your deps plus the runtime stack you need. The minimum to use everything is:
193+
194+
```elixir
195+
def deps do
196+
[
197+
{:image_vision, "~> 0.3"},
198+
{:bumblebee, "~> 0.6"},
199+
{:ortex, "~> 0.1"},
200+
{:nx, "~> 0.10"},
201+
{:exla, "~> 0.10"}
202+
]
203+
end
204+
```
205+
206+
If you only need Ortex-backed tasks (Detection / Segmentation / Background / FaceDetection), you can drop Bumblebee. If you only need Bumblebee-backed tasks (Classification / Captioning / ZeroShot), you can drop Ortex.
207+
208+
### Model cache
209+
210+
All modules download their weights from Hugging Face on first use and cache them on disk. The default cache directory is OS-dependent:
211+
212+
* macOS: `~/Library/Caches/image_vision/`
213+
* Linux: `~/.cache/image_vision/`
214+
* configurable via `config :image_vision, :cache_dir, "/path/to/cache"`
215+
216+
In a containerised deployment, mount that directory as a volume so the model weights survive container restarts. See `image_playground`'s Dockerfile for an example.
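For a container, the cache path is often easiest to set at runtime so it can point at the mounted volume. A minimal sketch, assuming the `:cache_dir` key listed above; the `IMAGE_VISION_CACHE_DIR` environment variable and the fallback path are hypothetical names chosen for this example.

```elixir
# config/runtime.exs
import Config

# IMAGE_VISION_CACHE_DIR is a hypothetical variable name used for this
# sketch; the :cache_dir key is the one documented above.
cache_dir = System.get_env("IMAGE_VISION_CACHE_DIR", "/var/cache/image_vision")

config :image_vision, :cache_dir, cache_dir
```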
217+
218+
### CPU vs GPU
219+
220+
Every model runs on CPU out of the box. For higher throughput, configure EXLA with a CUDA or Metal backend; both `bumblebee` and `nx_image` honour the configured backend automatically. On Apple Silicon the Metal backend is the realistic option; on Linux + NVIDIA, CUDA is the standard. See [EXLA's documentation](https://hexdocs.pm/exla/EXLA.html) for backend setup.
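Selecting EXLA as the default Nx backend is typically a one-line application config. A minimal sketch, assuming `:exla` is already in your deps as in the list above; choosing a specific accelerator involves additional EXLA settings that are not shown here.

```elixir
# config/config.exs
import Config

# Make EXLA the default backend for Nx tensors.
# Assumes {:exla, ...} is in deps, as in the dependency list above.
config :nx, :default_backend, EXLA.Backend
```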
221+
222+
## Related
223+
224+
* [`classification.md`](classification.md) — full Classification API including `embed/2` and the model-config knobs.
225+
* [`zero_shot.md`](zero_shot.md) — CLIP-based custom-label classification.
226+
* [`detection.md`](detection.md) — RT-DETR object detection plus `draw/2` overlay.
227+
* [`segmentation.md`](segmentation.md) — promptable (SAM 2) and panoptic segmentation.
228+
* [`background.md`](background.md) — class-agnostic foreground extraction.
229+
* [`captioning.md`](captioning.md) — BLIP-based natural-language captioning.
230+
* [`image_plug`'s face-aware guide](https://hexdocs.pm/image_plug/face_aware.html): `Image.FaceDetection` integrated into the URL-driven transform pipeline.

mix.exs

Lines changed: 1 addition & 0 deletions
@@ -129,6 +129,7 @@ defmodule ImageVision.MixProject do
129129
"README.md",
130130
"LICENSE.md",
131131
"CHANGELOG.md",
132+
"guides/overview.md",
132133
"guides/classification.md",
133134
"guides/segmentation.md",
134135
"guides/detection.md",
