# Zero-Shot Classification

`Image.ZeroShot` lets you classify an image against arbitrary labels you provide at call time, without training or fine-tuning anything. Where `Image.Classification` is constrained to the 1,000 ImageNet labels its model was trained on, zero-shot says "here are five categories I care about right now — which fits best?".

This is enormously useful when your label space:

- Doesn't exist in standard datasets (custom product categories, brand-specific taxonomies, ad-hoc tagging)
- Changes over time (new categories appear without retraining)
- Is unknown until query time (interactive search, user-driven filtering)

Powered by [CLIP](https://openai.com/research/clip), a contrastive vision-language model that learned a shared embedding space for images and text from 400 million image-caption pairs.

## Classifying

```elixir
iex> photo = Image.open!("portrait.jpg")
iex> Image.ZeroShot.classify(photo, [
...>   "a person on a horse",
...>   "a person walking a dog",
...>   "a parked car",
...>   "an empty street"
...> ])
[
  %{label: "a person on a horse", score: 0.94},
  %{label: "a person walking a dog", score: 0.04},
  %{label: "an empty street", score: 0.01},
  %{label: "a parked car", score: 0.01}
]
```

Scores sum to `1.0` (softmax over the candidate set). Results are sorted by descending score.
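
For intuition, the normalisation is an ordinary softmax over the raw image-text similarity scores (CLIP scales cosine similarities by a learned temperature before the softmax). A minimal sketch in Nx, with made-up logit values:

```elixir
# Hypothetical raw similarity logits for four candidate labels.
logits = Nx.tensor([24.3, 21.1, 19.8, 19.7])

# Softmax: exponentiate, then normalise so the scores sum to 1.0.
probs = Nx.divide(Nx.exp(logits), Nx.sum(Nx.exp(logits)))
```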

## Just the best label

When you only want the winner:

```elixir
iex> Image.ZeroShot.label(photo, ["dog", "cat", "horse"])
"horse"
```
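
Because `label/2` returns a plain string, it slots neatly into ordinary pipelines. A sketch of tagging every image in a directory, where the `photos/` path and the label set are purely illustrative:

```elixir
labels = ["product photo", "lifestyle photo", "screenshot", "document scan"]

tags =
  "photos"
  |> File.ls!()
  |> Enum.map(fn file ->
    image = Image.open!(Path.join("photos", file))
    {file, Image.ZeroShot.label(image, labels)}
  end)
  |> Map.new()
```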

## Image-to-image similarity

CLIP's image embeddings live in the same space as its text embeddings, so two images can also be compared directly:

```elixir
iex> a = Image.open!("dog1.jpg")
iex> b = Image.open!("dog2.jpg")
iex> c = Image.open!("car.jpg")
iex>
iex> Image.ZeroShot.similarity(a, b)
0.82
iex> Image.ZeroShot.similarity(a, c)
0.41
```

Useful for "find similar images" without standing up a vector database. For larger collections, compute embeddings once and cache them.
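
As a sketch, a naive nearest-image search over a small collection can be built directly on `similarity/2`. Note that this re-encodes every image on each call, which is why caching embeddings (or using a vector store) pays off as the collection grows. The file names here are illustrative:

```elixir
query = Image.open!("query.jpg")

["dog1.jpg", "dog2.jpg", "car.jpg", "beach.jpg"]
|> Enum.map(fn path -> {path, Image.ZeroShot.similarity(query, Image.open!(path))} end)
|> Enum.sort_by(fn {_path, score} -> score end, :desc)
|> List.first()
```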

## Prompt templates matter

CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. Wrapping each label in a simple template reliably improves accuracy. The default is `"a photo of {label}"` — every label is wrapped before tokenisation.
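
The wrapping itself is plain string substitution. A sketch of the presumed behaviour (not the library's internal code):

```elixir
template = "a photo of {label}"
labels = ["dog", "cat", "horse"]

# Each bare label is expanded into a full caption before tokenisation.
Enum.map(labels, &String.replace(template, "{label}", &1))
#=> ["a photo of dog", "a photo of cat", "a photo of horse"]
```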

You can override:

```elixir
# Photography-domain template
iex> Image.ZeroShot.classify(image, ["sunset", "rain", "snow"],
...>   template: "a high-quality photograph of {label}")

# Document-domain template
iex> Image.ZeroShot.classify(scan, ["invoice", "receipt", "letter"],
...>   template: "a scanned {label}")

# Disable templating if your labels are already full sentences
iex> Image.ZeroShot.classify(image,
...>   ["a black cat sitting on a chair", "an empty room with a chair"],
...>   template: nil)
```

A general rule: if your labels are nouns, leave the default template. If they're already descriptive sentences, use `template: nil`. If they're domain-specific (medical imagery, document scans, product photography), a domain-tailored template can help.

## Choosing a model

The default is `openai/clip-vit-base-patch32` — MIT licensed, ~600 MB, broad training coverage, well-validated. For higher quality at a larger size:

```elixir
iex> Image.ZeroShot.classify(image, labels, repo: "openai/clip-vit-large-patch14")
```

(About 1.7 GB — roughly 3× the size of the default.)

## Pre-downloading

```bash
mix image_vision.download_models --zero-shot
```

## Default model

[OpenAI CLIP ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) — MIT licensed, ~600 MB. This is the original CLIP release and the most well-validated, broadly applicable variant. It contains both a vision encoder (ViT-B/32) and a text encoder (a transformer) that produce vectors in a shared 512-dimensional space.
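
Because both encoders project into that shared space, every comparison (image against label, or image against image) reduces to the cosine similarity of two vectors. A sketch in Nx, using small stand-in vectors in place of real 512-dimensional embeddings:

```elixir
# Stand-in embeddings; in practice these come from the CLIP encoders.
a = Nx.tensor([0.1, 0.7, -0.2, 0.4])
b = Nx.tensor([0.2, 0.5, -0.1, 0.3])

# Cosine similarity: dot product divided by the product of the norms.
cosine = Nx.divide(Nx.dot(a, b), Nx.multiply(Nx.LinAlg.norm(a), Nx.LinAlg.norm(b)))
```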

## Dependencies

Zero-shot classification requires `:bumblebee`, `:nx`, and an Nx backend such as `:exla`. Add to `mix.exs`:

```elixir
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
```
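
If you use EXLA, you will usually also want it set as the default Nx backend so inference runs on the compiled backend rather than the pure-Elixir one. The usual configuration, in `config/config.exs`:

```elixir
import Config

# Run Nx (and therefore Bumblebee inference) on EXLA by default.
config :nx, default_backend: EXLA.Backend
```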