# Computer-vision tasks at a glance

`image_vision` ships seven public modules, each wrapping a different ONNX or Bumblebee-served model. This guide is a map: it explains which module solves which kind of problem, shows the call shape for each, and points you to the per-module guides for the details.

Four task families cover everything in the library:

* **Classification** — "what is this image about?". Returns a list of labels.
* **Detection** — "where are the things?". Returns bounding boxes plus labels.
* **Segmentation** — "which pixels belong to which thing?". Returns pixel masks.
* **Description** — "what's happening in this image, in words?". Returns a natural-language sentence.

Most computer-vision tasks reduce to one of these four. Some applications — face-aware crops, background removal, content moderation — stitch two or three together.

## The decision table

| You want… | Use | Returns |
| --- | --- | --- |
| One-line label for what's in the image | `Image.Classification.labels/2` | `["Blenheim spaniel"]` |
| Embedding vector for similarity / search | `Image.Classification.embed/2` | `Nx.Tensor` (1000-dim by default) |
| Classify against custom labels (no retraining) | `Image.ZeroShot.classify/3` | `[%{label: "cat", score: 0.92}, …]` |
| Bounding boxes around every detected object | `Image.Detection.detect/2` | `[%{box: {x, y, w, h}, label: "dog", score: 0.87}, …]` |
| Bounding boxes around every face (plus landmarks) | `Image.FaceDetection.detect/2` | `[%{box: …, score: …, landmarks: [{x, y}, …]}, …]` |
| Cut out a specific object from a click or box | `Image.Segmentation.segment/2` | `Vix.Vips.Image` mask |
| Class-labeled regions for the whole scene | `Image.Segmentation.segment_panoptic/2` | `[%{label: "sky", mask: …}, …]` |
| Extract just the foreground subject | `Image.Background.remove/2` | `Vix.Vips.Image` with alpha |
| Natural-language caption for the image | `Image.Captioning.caption/2` | `"a dog sitting on a park bench"` |

The same `t:Vix.Vips.Image.t/0` is the input for everything; the modules differ only in what they extract from it.

## Classification — "what is this image about?"

`Image.Classification` runs an ImageNet-trained backbone (ConvNeXt by default) and returns the top labels. It's one of the cheapest tasks in the library (fast inference, a comparatively small model, no per-pixel work), so it's a good first choice for content tagging, search indexing, or anything that needs a coarse "what kind of image is this?" signal.

```elixir
puppy = Image.open!("puppy.jpg")

Image.Classification.labels(puppy)
#=> ["Blenheim spaniel"]

Image.Classification.labels(puppy, min_score: 0.1)
#=> ["Blenheim spaniel", "cocker spaniel", "papillon"]
```

For similarity search ("find me other images that look like this"), use `embed/2` instead. It returns the model's penultimate-layer activations as an `Nx.Tensor`, which you can compare against pre-computed embeddings of your library, for example by cosine similarity:

```elixir
query_vec = Image.Classification.embed(puppy)
# Compare against pre-stored vectors via cosine similarity, etc.
```
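
As a rough illustration of the comparison step, the sketch below ranks stored vectors by cosine similarity. It works on plain lists of floats so that it runs without extra dependencies, and it assumes `embed/2` yields a rank-1 tensor that you convert with `Nx.to_flat_list/1`. `SimilaritySketch` and the file names are hypothetical and not part of `image_vision`.

```elixir
defmodule SimilaritySketch do
  # Hypothetical helper module, not part of image_vision.

  # Cosine similarity between two equal-length, non-zero lists of numbers.
  def cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    dot / (norm(a) * norm(b))
  end

  # Ranks `{id, vector}` pairs by similarity to `query`, best match first.
  def rank(query, library) do
    library
    |> Enum.map(fn {id, vec} -> {id, cosine(query, vec)} end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
  end

  # Euclidean length of a vector.
  defp norm(v) do
    v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
  end
end

# Toy three-dimensional vectors standing in for real embeddings.
library = [
  {"beach.jpg", [0.0, 1.0, 0.0]},
  {"spaniel.jpg", [0.9, 0.1, 0.0]}
]

ranked = SimilaritySketch.rank([1.0, 0.0, 0.0], library)
# `ranked` lists "spaniel.jpg" first, because its vector points the same way as the query.
```

For a large library you would more likely keep the vectors as tensors or hand the ranking to a vector database; the list version is only meant to show the arithmetic.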

When ImageNet's 1000 categories don't cover your labels — "branded vs non-branded", "indoor vs outdoor", "edited vs unedited" — reach for `Image.ZeroShot` instead. It uses CLIP to classify against arbitrary labels you supply at request time:

```elixir
Image.ZeroShot.classify(puppy, ["a dog", "a cat", "a bird"])
#=> [%{label: "a dog", score: 0.97}, %{label: "a cat", score: 0.02}, …]
```
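
In practice you often want a single decision rather than the full score list. A small sketch, assuming the result maps carry the `:label` and `:score` keys shown above; `ZeroShotSketch`, the 0.5 threshold and the `"unknown"` fallback are illustrative choices, not library API:

```elixir
defmodule ZeroShotSketch do
  # Hypothetical helper module, not part of image_vision.

  # Returns the best-scoring label, or `fallback` when the result list is
  # empty or no label reaches `min_score`.
  def top_label(results, min_score \\ 0.5, fallback \\ "unknown")

  def top_label([], _min_score, fallback), do: fallback

  def top_label(results, min_score, fallback) do
    best = Enum.max_by(results, & &1.score)
    if best.score >= min_score, do: best.label, else: fallback
  end
end

results = [%{label: "a dog", score: 0.97}, %{label: "a cat", score: 0.02}]
decision = ZeroShotSketch.top_label(results)
```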

CLIP is heavier than ConvNeXt (a vision transformer plus a text transformer) but the labels are free-form, which makes it the right tool for any taxonomy you didn't train against.

See [`classification.md`](classification.md) and [`zero_shot.md`](zero_shot.md) for the full API surface.

## Detection — "where are the things?"

`Image.Detection` and `Image.FaceDetection` both return bounding boxes; they differ in what they look for and how rich the per-box metadata is.

`Image.Detection` uses RT-DETR (real-time detection transformer) trained on COCO, so it knows 80 object categories: people, vehicles, animals, household items, food. Each detection carries a box, a label and a confidence score:

```elixir
park = Image.open!("park.jpg")

Image.Detection.detect(park)
#=> [
#=>   %{box: {120, 80, 240, 380}, label: "person", score: 0.95},
#=>   %{box: {410, 220, 180, 160}, label: "dog", score: 0.91},
#=>   %{box: {620, 30, 300, 200}, label: "kite", score: 0.82}
#=> ]
```
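
Because detections are plain maps, ordinary `Enum` functions are enough to post-process them. A minimal sketch that reuses the sample output above with an arbitrary 0.9 threshold (the variable names are illustrative):

```elixir
detections = [
  %{box: {120, 80, 240, 380}, label: "person", score: 0.95},
  %{box: {410, 220, 180, 160}, label: "dog", score: 0.91},
  %{box: {620, 30, 300, 200}, label: "kite", score: 0.82}
]

# Arbitrary cut-off for this example; tune it for your own data.
min_score = 0.9

# Keep only the detections the model is reasonably sure about.
confident = Enum.filter(detections, &(&1.score >= min_score))

# Count how many confident detections there are per label.
counts = Enum.frequencies_by(confident, & &1.label)
#=> %{"dog" => 1, "person" => 1}
```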

`Image.FaceDetection` uses YuNet, a much smaller and faster model specialised for faces. The trade-off: only one class (faces), but you also get five-point landmarks (eyes, nose, mouth corners) per box:

```elixir
portrait = Image.open!("portrait.jpg")

Image.FaceDetection.detect(portrait)
#=> [
#=>   %{
#=>     box: {142, 84, 218, 240},
#=>     score: 0.98,
#=>     landmarks: [{195, 162}, {275, 160}, {235, 200}, {200, 240}, {270, 240}]
#=>   }
#=> ]
```
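
One thing you might do with this output is derive a crop rectangle around the most confident face. The sketch below assumes `box` is `{x, y, width, height}` measured from the top-left corner, as in the `Image.Detection` results. `FaceCropSketch` and its `padding` argument are hypothetical helpers, not library API.

```elixir
defmodule FaceCropSketch do
  # Hypothetical helper module, not part of image_vision.

  # Returns `{:ok, {left, top, width, height}}` for the highest-scoring face,
  # padded by `padding` (a fraction of the box size) and clamped to the image
  # bounds, or `:error` when no faces were detected.
  def crop_rect(faces, image_size, padding \\ 0.25)

  def crop_rect([], _image_size, _padding), do: :error

  def crop_rect(faces, {image_w, image_h}, padding) do
    %{box: {x, y, w, h}} = Enum.max_by(faces, & &1.score)

    pad_x = round(w * padding)
    pad_y = round(h * padding)

    # Grow the box on every side, but never past the image edges.
    left = max(x - pad_x, 0)
    top = max(y - pad_y, 0)
    right = min(x + w + pad_x, image_w)
    bottom = min(y + h + pad_y, image_h)

    {:ok, {left, top, right - left, bottom - top}}
  end
end

faces = [%{box: {142, 84, 218, 240}, score: 0.98, landmarks: []}]
result = FaceCropSketch.crop_rect(faces, {640, 480}, 0.5)
```

The resulting rectangle could then be handed to a crop function such as `Image.crop/5`, with the image size taken from `Image.width/1` and `Image.height/1`.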

Both modules ship a `draw/2` helper that overlays the boxes on the image — useful for debugging and for the "show the user what the model saw" UX.

`Image.FaceDetection` is the basis for `image_plug`'s `gravity: :face` crop and `Ops.PixelateFaces`. See [`detection.md`](detection.md) for the general object case and [`image_plug`'s face-aware guide](https://hexdocs.pm/image_plug/face_aware.html) for the face-aware crop integration.

## Segmentation — "which pixels belong to which thing?"

`Image.Segmentation` is the heaviest task by far — it produces per-pixel masks rather than just boxes. Two flavours, with very different shapes:

* **Promptable** (`segment/2`) — you give it a hint ("here's where the object is") and it returns one mask. Backed by SAM 2. Good for "the user clicked here, cut out what they pointed at".

* **Panoptic** (`segment_panoptic/2`) — no prompt, returns one mask per detected region with a class label. Good for scene parsing — sky, road, building, person, car all separated automatically.

Promptable, with a click point:

```elixir
photo = Image.open!("scene.jpg")
mask = Image.Segmentation.segment(photo, point: {340, 220})

# Attach the mask as an extra band so that everything outside the segmented
# object becomes transparent (assumes `photo` has no alpha band of its own).
{:ok, isolated} = Vix.Vips.Operation.bandjoin([photo, mask])
```

Promptable, with a bounding box:

```elixir
mask = Image.Segmentation.segment(photo, box: {120, 80, 280, 320})
```

Panoptic — every region gets a class:

```elixir
Image.Segmentation.segment_panoptic(photo)
#=> [
#=>   %{label: "sky", mask: %Vix.Vips.Image{...}},
#=>   %{label: "person", mask: %Vix.Vips.Image{...}},
#=>   %{label: "building", mask: %Vix.Vips.Image{...}}
#=> ]
```

`Image.Background` is segmentation's narrowest and most ergonomic case: foreground vs background, no prompt needed. It's what you reach for when the only thing you actually want is "give me the subject with a transparent background":

```elixir
{:ok, cutout} = Image.Background.remove(photo)
# `cutout` is the original image with everything outside the detected
# foreground subject made transparent.
```

`Image.Background.remove/2` is a class-agnostic foreground extractor — it works on any subject (person, product, pet, plant) without needing to be told what to look for. For arbitrary user-pointed segmentation (the "click anywhere" UX), use `Image.Segmentation.segment/2` instead.

See [`segmentation.md`](segmentation.md) and [`background.md`](background.md) for the full call shapes.

## Description — "what's happening, in words?"

`Image.Captioning` produces a natural-language sentence describing the image, using BLIP (Bootstrapping Language-Image Pre-training):

```elixir
photo = Image.open!("park.jpg")

Image.Captioning.caption(photo)
#=> "a man walking his dog in a park"
```

Captions are useful for accessibility (`alt` text generation), search (full-text indexing on caption strings), and content workflows (auto-tagging without having to enumerate a fixed label set up front). The model is generative, so the same image can produce slightly different captions across runs unless you fix the seed; for stable indexing, pin the generation parameters via `:generation_config`.

BLIP is also among the heaviest models in the library: a vision encoder plus a language decoder, a few hundred MB on disk and several seconds per inference on CPU. For high-throughput captioning you want a GPU-backed Nx backend (EXLA on CUDA or Metal); CPU works for low-volume use.

See [`captioning.md`](captioning.md) for prompt templates, batch processing, and the generation-parameter knobs.

## Composing tasks

A few common combinations:

* **Face-aware crop** — `FaceDetection.detect/2` → take the highest-confidence box → crop to that region with padding. This is what `image_plug`'s `gravity: :face` does end-to-end.

* **Pixelate faces** — `FaceDetection.detect/2` → for each box, apply pixelation only inside the region. This is `Image.Plug.Pipeline.Ops.PixelateFaces`.

* **Caption + tag** — `Captioning.caption/2` for a sentence + `Classification.labels/2` for indexable labels. Caption is human-readable, labels are queryable; storing both gives you free-text search plus faceted filtering.

* **Background swap** — `Background.remove/2` to extract the subject → composite onto a different background image with `Image.compose/3`. The "studio shot" effect from arbitrary phone photos.

* **Object-aware blur** — `Detection.detect/2` → for each detection over a confidence threshold, blur the region inside the box. Useful for privacy redaction at scale.

* **Visual search** — `Classification.embed/2` on every image at index time → store the vectors in a vector database → at query time, `embed` the query image and rank the stored vectors by dot product or cosine similarity. This is the standard "more like this" search.

* **Custom-taxonomy moderation** — `ZeroShot.classify/3` with your moderation labels (`["safe", "violent", "explicit", …]`). No training data, no model fine-tuning — just label strings.

The library is designed to be composable: the modules don't share state, every function takes a `Vix.Vips.Image`, and every output is either another `Vix.Vips.Image` (so you can pipe it into the next step) or a plain Elixir term (so you can serialise / store / route it).

## Optional dependencies and model loading

Different modules need different runtimes:

| Module | Runtime | Notes |
| --- | --- | --- |
| `Image.Classification`, `Image.Captioning`, `Image.ZeroShot` | Bumblebee + Nx + EXLA | Hugging Face model loading; GPU-friendly via EXLA backend. |
| `Image.Detection`, `Image.Segmentation`, `Image.Background` | Ortex + Nx | ONNX Runtime; faster cold-start than Bumblebee, no GPU acceleration on macOS. |
| `Image.FaceDetection` | Ortex + Nx | YuNet ONNX model, ~340 KB on disk. |

Add `:image_vision` to your deps plus the runtime stack you need. The minimum to use everything is:

```elixir
def deps do
  [
    {:image_vision, "~> 0.3"},
    {:bumblebee, "~> 0.6"},
    {:ortex, "~> 0.1"},
    {:nx, "~> 0.10"},
    {:exla, "~> 0.10"}
  ]
end
```

If you only need Ortex-backed tasks (Detection / Segmentation / Background / FaceDetection), you can drop Bumblebee. If you only need Bumblebee-backed tasks (Classification / Captioning / ZeroShot), you can drop Ortex.

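For the Ortex-only case, the dependency list might shrink to something like the following. This is a sketch: the version requirements are copied from the full list, and leaving out `:exla` is an inference from the runtime table (which lists only Ortex + Nx for these modules) rather than something the library documents here.

```elixir
def deps do
  [
    {:image_vision, "~> 0.3"},
    # ONNX Runtime bindings for Detection / Segmentation / Background / FaceDetection.
    {:ortex, "~> 0.1"},
    {:nx, "~> 0.10"}
  ]
end
```
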
### Model cache

All modules download their weights from Hugging Face on first use and cache them on disk. The default cache directory is OS-dependent:

* macOS: `~/Library/Caches/image_vision/`
* Linux: `~/.cache/image_vision/`
* configurable via `config :image_vision, :cache_dir, "/path/to/cache"`

In a containerised deployment, mount that directory as a volume so the model weights survive container restarts. See `image_playground`'s Dockerfile for an example.

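In a container it can be convenient to take the cache path from the environment. A minimal sketch, assuming the library reads `:cache_dir` at runtime; `IMAGE_VISION_CACHE_DIR` is a hypothetical variable name chosen for this example, not one the library defines:

```elixir
# config/runtime.exs
import Config

# IMAGE_VISION_CACHE_DIR is a made-up name for this sketch. Point it at the
# mounted volume; when it is unset, the OS-dependent default applies.
if cache_dir = System.get_env("IMAGE_VISION_CACHE_DIR") do
  config :image_vision, :cache_dir, cache_dir
end
```
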
### CPU vs GPU

Every model runs on CPU out of the box. For higher throughput, configure EXLA with a CUDA or Metal backend; both `bumblebee` and `nx_image` honour the configured backend automatically. On Apple Silicon the Metal backend is the realistic option; on Linux + NVIDIA, CUDA is the standard. See [EXLA's documentation](https://hexdocs.pm/exla/EXLA.html) for backend setup.

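A typical way to make EXLA the default backend looks roughly like this. It is a sketch of the general Nx/EXLA configuration, not something specific to `image_vision`, and the accelerator itself is chosen when EXLA is compiled (see the `XLA_TARGET` environment variable in EXLA's documentation for the supported values).

```elixir
# config/config.exs
import Config

# Allocate tensors on EXLA instead of the default pure-Elixir binary backend.
config :nx, default_backend: EXLA.Backend

# Compile `defn` numerical functions with EXLA as well.
config :nx, :default_defn_options, compiler: EXLA
```
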
## Related

* [`classification.md`](classification.md) — full Classification API including `embed/2` and the model-config knobs.
* [`zero_shot.md`](zero_shot.md) — CLIP-based custom-label classification.
* [`detection.md`](detection.md) — RT-DETR object detection plus `draw/2` overlay.
* [`segmentation.md`](segmentation.md) — promptable (SAM 2) and panoptic segmentation.
* [`background.md`](background.md) — class-agnostic foreground extraction.
* [`captioning.md`](captioning.md) — BLIP-based natural-language captioning.
* [`image_plug`'s face-aware guide](https://hexdocs.pm/image_plug/face_aware.html) — `Image.FaceDetection` integrated into the URL-driven transform pipeline.