
Commit 96aca92

Add Image.Background, Image.Captioning, Image.ZeroShot for v0.2.0
1 parent 0f28c66 commit 96aca92

17 files changed

Lines changed: 1696 additions & 18 deletions

CHANGELOG.md

Lines changed: 18 additions & 0 deletions
@@ -1,5 +1,23 @@
# Changelog

+## ImageVision v0.2.0
+
+### Added
+
+* **`Image.Background`** — class-agnostic foreground/background separation. `remove/2` returns the input image with the background made transparent (alpha mask applied); `mask/2` returns the foreground mask alone for custom compositing. Default model is [BiRefNet lite](https://huggingface.co/onnx-community/BiRefNet_lite-ONNX) (MIT, ~210 MB), powered by Ortex.
+
+* **`Image.Captioning`** — natural-language description of an image. `caption/2` returns a string like `"a man riding a horse with a bird of prey"`. Default model is [BLIP base](https://huggingface.co/Salesforce/blip-image-captioning-base) (BSD-3-Clause, ~990 MB), powered by Bumblebee. Heavy enough that it is not autostarted by default; configure `autostart: true` or add the child spec to your supervisor.
+
+* **`Image.ZeroShot`** — classify an image against arbitrary labels you supply at call time, no retraining. `classify/3` returns `[%{label, score}]` sorted descending; `label/3` returns just the best label; `similarity/3` computes CLIP-space cosine similarity between two images. Default model is [OpenAI CLIP ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) (MIT, ~600 MB), powered by Bumblebee. Default prompt template `"a photo of {label}"` boosts accuracy on bare-noun labels; override or disable as needed.
+
+* New flags `--background`, `--caption`, and `--zero-shot` for `mix image_vision.download_models` to pre-fetch the new defaults.
+
+### Changed
+
+* The `:files` list in `mix.exs` now ships `logo.jpg` so the docs render the project logo on hexdocs.pm.
+
+See the [README](https://github.com/elixir-image/image_vision/blob/v0.2.0/README.md) for the full feature list and the [background](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/background.md), [captioning](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/captioning.md), and [zero-shot](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/zero_shot.md) guides for detail on the new tasks.
+
## ImageVision v0.1.0

`image_vision` is a thin, opinionated wrapper around the Elixir ML ecosystem (Bumblebee, Ortex, Nx) that sits next to the [`image`](https://hex.pm/packages/image) library. It exposes three vision tasks through a small API designed for developers who are not ML experts: pass a `t:Vix.Vips.Image.t/0` in, get useful results out. Strong, permissively-licensed defaults handle model selection, backend configuration, and weight downloads automatically.

README.md

Lines changed: 24 additions & 6 deletions
@@ -1,6 +1,6 @@
# ImageVision

-`ImageVision` is a simple, opinionated image vision library for Elixir. It sits alongside the [`image`](https://hex.pm/packages/image) library and answers three questions about any image — **what's in it**, **where are the objects**, and **which pixels belong to which object** — with strong defaults and no ML expertise required.
+`ImageVision` is a simple, opinionated image vision library for Elixir. It sits alongside the [`image`](https://hex.pm/packages/image) library and answers common questions about an image — **what's in it**, **where are the objects**, **which pixels belong to which object**, **what's the foreground**, **how would you describe it in words**, **does it match these labels** — with strong defaults and no ML expertise required.

## Quick start

@@ -34,6 +34,17 @@ iex> {:ok, cutout} = Image.Segmentation.apply_mask(puppy, mask)
# Embedding — 768-dim feature vector for similarity search
iex> Image.Classification.embed(puppy)
#Nx.Tensor<f32[768]>
+
+# Background removal — class-agnostic foreground cutout
+iex> {:ok, cutout} = Image.Background.remove(puppy)
+
+# Image captioning — natural-language description
+iex> Image.Captioning.caption(puppy)
+"a small brown and white puppy sitting on a wooden floor"
+
+# Zero-shot classification — your labels, no retraining required
+iex> Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a horse"])
+[%{label: "a dog", score: 0.998}, %{label: "a cat", score: 0.002}, ...]
```

## Installation
@@ -43,7 +54,7 @@ Add `:image_vision` to `mix.exs` along with whichever optional ML backends you need
```elixir
def deps do
  [
-    {:image_vision, "~> 0.1"},
+    {:image_vision, "~> 0.2"},

    # Required for Image.Classification and Image.Classification.embed/2
    {:bumblebee, "~> 0.6"},
@@ -83,9 +94,13 @@ Model weights are downloaded on first call and cached on disk. Across all three
| Task | Default model | Size |
|---|---|---|
| Classification | `facebook/convnext-tiny-224` | ~110 MB |
+| Embedding | `facebook/dinov2-base` | ~340 MB |
| Detection | `onnx-community/rtdetr_r50vd` | ~175 MB |
| Segmentation (SAM 2) | `SharpAI/sam2-hiera-tiny-onnx` | ~150 MB |
| Segmentation (panoptic) | `Xenova/detr-resnet-50-panoptic` | ~175 MB |
+| Background removal | `onnx-community/BiRefNet_lite-ONNX` | ~210 MB |
+| Captioning | `Salesforce/blip-image-captioning-base` | ~990 MB |
+| Zero-shot classification | `openai/clip-vit-base-patch32` | ~605 MB |

The first call to each task therefore appears to "hang" while weights download — that's expected, not a bug.

@@ -95,7 +110,7 @@ To pre-download all default models before first use (recommended for production
mix image_vision.download_models
```

-Pass `--classify`, `--detect`, or `--segment` to limit scope.
+Pass `--classify`, `--detect`, `--segment`, `--background`, `--caption`, or `--zero-shot` to limit scope.

### Livebook Desktop

@@ -177,6 +192,9 @@ config :image_vision, :classifier,

## Guides

-- [Classification](https://github.com/elixir-image/image_vision/blob/v0.1.0/guides/classification.md) — classifying images and computing embeddings
-- [Detection](https://github.com/elixir-image/image_vision/blob/v0.1.0/guides/detection.md) — bounding-box object detection
-- [Segmentation](https://github.com/elixir-image/image_vision/blob/v0.1.0/guides/segmentation.md) — promptable and panoptic segmentation
+- [Classification](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/classification.md) — classifying images and computing embeddings
+- [Detection](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/detection.md) — bounding-box object detection
+- [Segmentation](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/segmentation.md) — promptable and panoptic segmentation
+- [Background removal](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/background.md) — class-agnostic foreground cutout
+- [Captioning](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/captioning.md) — natural-language image descriptions
+- [Zero-shot classification](https://github.com/elixir-image/image_vision/blob/v0.2.0/guides/zero_shot.md) — classify against arbitrary labels via CLIP

guides/background.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Background Removal

`Image.Background` answers "what is the foreground here?" by separating subject from background. The model is class-agnostic — it doesn't care whether the foreground is a person, a product, an animal, or a piece of furniture; it decides what the salient subject is and isolates it.

## Removing the background

The simplest call returns the input image with the background made transparent (alpha channel applied):

```elixir
iex> photo = Image.open!("portrait.jpg")
iex> {:ok, cutout} = Image.Background.remove(photo)
iex> Image.write!(cutout, "portrait-cutout.png")
```

The result has four bands (RGB + alpha). Save as PNG or any other format that supports transparency.
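Because the cutout already carries its alpha channel, it can be layered straight onto another image. Here is a minimal sketch using the `image` library's compose API — `Image.compose!/3` and the option names are assumptions to check against the `image` docs for your version:

```elixir
# Sketch: layer the cutout over a new backdrop. The backdrop file and the
# :x/:y offsets are illustrative, not defaults of this library.
photo = Image.open!("portrait.jpg")
backdrop = Image.open!("studio-backdrop.jpg")

{:ok, cutout} = Image.Background.remove(photo)

backdrop
|> Image.compose!(cutout, x: 0, y: 0)
|> Image.write!("portrait-on-backdrop.png")
```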
## Getting just the mask

If you want the foreground mask itself — for compositing onto a different background, layering, or further processing — call `mask/2`:

```elixir
iex> photo = Image.open!("portrait.jpg")
iex> mask = Image.Background.mask(photo)
iex> Image.write!(mask, "mask.png")
```

The mask is a single-band greyscale image with the same dimensions as the input. Pixel intensity reflects model confidence: pure white (255) is "definitely foreground", pure black (0) is "definitely background", and values in between reflect uncertainty at boundaries.

## When to use this vs. segmentation

`image_vision` has three different ways to produce masks; pick by what you actually want:

- **`Image.Background.remove/2`** — class-agnostic, no input beyond the image. Best for "isolate the subject of this photo". Works without prompts. Single foreground/background distinction.

- **`Image.Segmentation.segment/2` (SAM 2)** — promptable. You click a point or draw a box and SAM masks *that specific object*. Best when an image has multiple distinct objects and you want one of them, or when the salient-object heuristic of background removal disagrees with what you actually want.

- **`Image.Segmentation.segment_panoptic/2`** — labels every region in the image with a class. Best when you want to enumerate everything in a scene, not just the foreground.

## Using a different model

`remove/2` and `mask/2` accept `:repo` and `:model_file` to swap in any compatible BiRefNet ONNX export:

```elixir
# Full BiRefNet (~890 MB, higher quality, slower)
iex> Image.Background.remove(image, repo: "onnx-community/BiRefNet-ONNX")
```

Both functions also share the `ImageVision.ModelCache` cache root with the segmentation and detection models, so configure it once via `config :image_vision, :cache_dir, ...`.
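For example, in `config/runtime.exs` (the path shown is purely illustrative, not a default):

```elixir
# Cache downloaded model weights under a directory you control.
config :image_vision, :cache_dir, "/var/cache/image_vision"
```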
## Pre-downloading

To populate the cache before first use:

```bash
mix image_vision.download_models --background
```

## Default model

[BiRefNet lite](https://huggingface.co/onnx-community/BiRefNet_lite-ONNX) is MIT licensed and ~210 MB. It's a distilled variant of the full BiRefNet that trades a small amount of accuracy for a much smaller model and faster inference. State-of-the-art for salient-object detection / dichotomous image segmentation as of mid-2024.

## Dependencies

Background removal requires `:ortex`. Add to `mix.exs`:

```elixir
{:ortex, "~> 0.1"}
```

guides/captioning.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
# Image Captioning

`Image.Captioning` answers "describe this image in plain English". Useful for accessibility (alt text), image search and indexing (caption as a search field), assistive tooling, content moderation pipelines that want a quick natural-language summary, and any product surface where you'd otherwise need a human to write descriptions for thousands of images.

## Quick start

The captioner is a heavyweight Bumblebee serving — it loads ~990 MB of weights and JIT-compiles a transformer decoder. It is **not** autostarted by default. Before calling `caption/2`, either configure autostart:

```elixir
# config/runtime.exs
config :image_vision, :captioner, autostart: true
```

…or add the child spec to your own supervision tree:

```elixir
# application.ex
children = [Image.Captioning.captioner()]
Supervisor.start_link(children, strategy: :one_for_one)
```

Then:

```elixir
iex> photo = Image.open!("photo.jpg")
iex> Image.Captioning.caption(photo)
"a man riding a horse with a bird of prey on his arm"
```

## Choosing a model

The default is BLIP base (`Salesforce/blip-image-captioning-base`, BSD-3-Clause, ~990 MB) — a solid baseline that produces concise, generally accurate captions.

For higher-quality captions at ~3× the size:

```elixir
config :image_vision, :captioner,
  model: {:hf, "Salesforce/blip-image-captioning-large"},
  featurizer: {:hf, "Salesforce/blip-image-captioning-large"},
  tokenizer: {:hf, "Salesforce/blip-image-captioning-large"},
  generation_config: {:hf, "Salesforce/blip-image-captioning-large"}
```

Avoid BLIP-2 (`Salesforce/blip2-*`) — its OPT decoder has non-commercial licensing.

## Tuning generation

The default generation config is whatever the model ships with — usually around 20 tokens, greedy decoding. To tune it (more tokens, a different decoding strategy), load the model's generation config with Bumblebee, adjust the relevant fields, and use the result when the serving is constructed. For most use cases the defaults are fine; a sketch of the Bumblebee side follows.
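A minimal sketch of that flow using Bumblebee directly — the `max_new_tokens` value is arbitrary, and exactly how a tuned config is handed back to `image_vision`'s `:captioner` serving is an assumption to verify against the `Image.Captioning` docs:

```elixir
# Sketch: load BLIP's generation config, raise the caption length cap, and
# build an image-to-text serving that uses the tuned config.
repo = {:hf, "Salesforce/blip-image-captioning-base"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

# Allow longer captions than the shipped default (~20 tokens).
generation_config = Bumblebee.configure(generation_config, max_new_tokens: 40)

serving =
  Bumblebee.Vision.image_to_text(model_info, featurizer, tokenizer, generation_config,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )
```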
## Pre-downloading

The default BLIP weights are ~990 MB — by far the heaviest of the library's defaults. Pre-download to avoid blocking on first call:

```bash
mix image_vision.download_models --caption
```

## Why captioning is heavy

A vision classifier runs the encoder once per image and is done. A captioner runs the encoder once, then runs the text decoder *autoregressively* — one forward pass per generated token. A 20-word caption is ~20 forward passes through the decoder. That's why generation is slower than classification, and why the captioner runs as a long-lived serving (started via `autostart: true` or your supervision tree) so the model only loads once across the lifetime of the application.

## Default model

[BLIP base captioning](https://huggingface.co/Salesforce/blip-image-captioning-base) — Salesforce, BSD-3-Clause licensed, ~990 MB. Trained on 129M image-text pairs. The base variant trades some descriptive richness for size and speed compared to the large variant.

## Dependencies

Captioning requires `:bumblebee`, `:nx`, and an Nx backend such as `:exla`. Add to `mix.exs`:

```elixir
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
```

guides/zero_shot.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Zero-Shot Classification

`Image.ZeroShot` lets you classify an image against arbitrary labels you provide at call time, without training or fine-tuning anything. Where `Image.Classification` is constrained to the 1,000 ImageNet labels its model was trained on, zero-shot says "here are five categories I care about right now — which fits best?".

This is enormously useful when your label space:

- Doesn't exist in standard datasets (custom product categories, brand-specific taxonomies, ad-hoc tagging)
- Changes over time (new categories appear without retraining)
- Is unknown until query time (interactive search, user-driven filtering)

Powered by [CLIP](https://openai.com/research/clip), a contrastive vision-language model that learned a shared embedding space for images and text from 400 million image-caption pairs.

## Classifying

```elixir
iex> photo = Image.open!("portrait.jpg")
iex> Image.ZeroShot.classify(photo, [
...>   "a person on a horse",
...>   "a person walking a dog",
...>   "a parked car",
...>   "an empty street"
...> ])
[
  %{label: "a person on a horse", score: 0.94},
  %{label: "a person walking a dog", score: 0.04},
  %{label: "an empty street", score: 0.01},
  %{label: "a parked car", score: 0.01}
]
```

Scores sum to `1.0` (softmax over the candidate set). Results are sorted by descending score.
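As a tiny worked example of that normalisation step (the logits are invented for illustration):

```elixir
# Softmax over three made-up per-label logits: the same normalisation that
# turns raw similarities into scores summing to 1.0.
logits = [2.1, -0.3, -1.5]
exps = Enum.map(logits, &:math.exp/1)
total = Enum.sum(exps)
Enum.map(exps, &Float.round(&1 / total, 3))
#=> [0.894, 0.081, 0.024]
```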
## Just the best label

When you only want the winner:

```elixir
iex> Image.ZeroShot.label(photo, ["dog", "cat", "horse"])
"horse"
```

## Image-to-image similarity

CLIP's image embeddings live in the same space as its text embeddings, so two images can also be compared directly:

```elixir
iex> a = Image.open!("dog1.jpg")
iex> b = Image.open!("dog2.jpg")
iex> c = Image.open!("car.jpg")
iex>
iex> Image.ZeroShot.similarity(a, b)
0.82
iex> Image.ZeroShot.similarity(a, c)
0.41
```

Useful for "find similar images" without standing up a vector database. For larger collections, compute embeddings once and cache them.
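A sketch of that pattern over a small folder of candidates (the paths are illustrative; beyond a handful of images you would cache embeddings rather than recompute them per comparison):

```elixir
# Rank candidate images against a query by pairwise similarity and keep the
# five closest matches. Every call re-embeds both images, so this is only
# sensible for small collections.
query = Image.open!("query.jpg")

"photos/*.jpg"
|> Path.wildcard()
|> Enum.map(fn path -> {path, Image.ZeroShot.similarity(query, Image.open!(path))} end)
|> Enum.sort_by(fn {_path, score} -> score end, :desc)
|> Enum.take(5)
```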
## Prompt templates matter

CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. Wrapping each label in a simple template reliably improves accuracy. The default is `"a photo of {label}"` — every label is wrapped before tokenisation.

You can override:

```elixir
# Photography-domain template
iex> Image.ZeroShot.classify(image, ["sunset", "rain", "snow"],
...>   template: "a high-quality photograph of {label}")

# Document-domain template
iex> Image.ZeroShot.classify(scan, ["invoice", "receipt", "letter"],
...>   template: "a scanned {label}")

# Disable templating if your labels are already full sentences
iex> Image.ZeroShot.classify(image,
...>   ["a black cat sitting on a chair", "an empty room with a chair"],
...>   template: nil)
```

A general rule: if your labels are nouns, leave the default template. If they're already descriptive sentences, use `template: nil`. If they're domain-specific (medical imagery, document scans, product photography), a domain-tailored template can help.

## Choosing a model

The default is `openai/clip-vit-base-patch32` — MIT licensed, ~600 MB, broad training coverage, well-validated. For higher quality at a larger size:

```elixir
iex> Image.ZeroShot.classify(image, labels, repo: "openai/clip-vit-large-patch14")
```

(About 1.7 GB — roughly 3× the default.)

## Pre-downloading

```bash
mix image_vision.download_models --zero-shot
```

## Default model

[OpenAI CLIP ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) — MIT licensed, ~600 MB. The original CLIP, the most well-validated and broadly applicable variant. Contains both a vision encoder (ViT-B/32) and a text encoder (transformer) that produce vectors in a shared 512-dim space.

## Dependencies

Zero-shot classification requires `:bumblebee`, `:nx`, and an Nx backend such as `:exla`. Add to `mix.exs`:

```elixir
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
```

lib/application.ex

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ defmodule ImageVision.Application do
  if ImageVision.bumblebee_configured?() do
    @services [
      {{Image.Classification, :classifier, []}, false},
-     {{Image.Classification, :embedder, []}, false}
+     {{Image.Classification, :embedder, []}, false},
+     {{Image.Captioning, :captioner, []}, false}
    ]

    defp children(true) do
