# Zero-Shot Classification

`Image.ZeroShot` lets you classify an image against arbitrary labels you provide at call time, without training or fine-tuning anything. Where `Image.Classification` is constrained to the 1,000 ImageNet labels its model was trained on, zero-shot says "here are five categories I care about right now — which fits best?".

This is enormously useful when your label space:

- Doesn't exist in standard datasets (custom product categories, brand-specific taxonomies, ad-hoc tagging)
- Changes over time (new categories appear without retraining)
- Is unknown until query time (interactive search, user-driven filtering)

Powered by [CLIP](https://openai.com/research/clip), a contrastive vision-language model that learned a shared embedding space for images and text from 400 million image-caption pairs.

## Classifying

```elixir
iex> photo = Image.open!("portrait.jpg")
iex> Image.ZeroShot.classify(photo, [
...>   "a person on a horse",
...>   "a person walking a dog",
...>   "a parked car",
...>   "an empty street"
...> ])
[
  %{label: "a person on a horse", score: 0.94},
  %{label: "a person walking a dog", score: 0.04},
  %{label: "an empty street", score: 0.01},
  %{label: "a parked car", score: 0.01}
]
```

Scores sum to `1.0` (softmax over the candidate set). Results are sorted by descending score.
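
For intuition, the normalisation is an ordinary softmax over the raw image-text similarity scores (CLIP scales cosine similarities by a learned temperature before the softmax). A minimal sketch in Nx, with made-up logit values:

```elixir
# Hypothetical raw similarity logits for four candidate labels.
logits = Nx.tensor([24.3, 21.1, 19.8, 19.7])

# Softmax: exponentiate, then normalise so the scores sum to 1.0.
probs = Nx.divide(Nx.exp(logits), Nx.sum(Nx.exp(logits)))
```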

## Just the best label

When you only want the winner:

```elixir
iex> Image.ZeroShot.label(photo, ["dog", "cat", "horse"])
"horse"
```
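
Because `label/2` returns a plain string, it slots neatly into ordinary pipelines. A sketch of tagging every image in a directory, where the `photos/` path and the label set are purely illustrative:

```elixir
labels = ["product photo", "lifestyle photo", "screenshot", "document scan"]

tags =
  "photos"
  |> File.ls!()
  |> Enum.map(fn file ->
    image = Image.open!(Path.join("photos", file))
    {file, Image.ZeroShot.label(image, labels)}
  end)
  |> Map.new()
```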

## Image-to-image similarity

CLIP's image embeddings live in the same space as its text embeddings, so two images can also be compared directly:

```elixir
iex> a = Image.open!("dog1.jpg")
iex> b = Image.open!("dog2.jpg")
iex> c = Image.open!("car.jpg")
iex>
iex> Image.ZeroShot.similarity(a, b)
0.82
iex> Image.ZeroShot.similarity(a, c)
0.41
```

Useful for "find similar images" without standing up a vector database. For larger collections, compute embeddings once and cache them.
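
As a sketch, a naive nearest-image search over a small collection can be built directly on `similarity/2`. Note that this re-encodes every image on each call, which is why caching embeddings (or using a vector store) pays off as the collection grows. The file names here are illustrative:

```elixir
query = Image.open!("query.jpg")

["dog1.jpg", "dog2.jpg", "car.jpg", "beach.jpg"]
|> Enum.map(fn path -> {path, Image.ZeroShot.similarity(query, Image.open!(path))} end)
|> Enum.sort_by(fn {_path, score} -> score end, :desc)
|> List.first()
```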

## Prompt templates matter

CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. Wrapping each label in a simple template reliably improves accuracy. The default is `"a photo of {label}"` — every label is wrapped before tokenisation.
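
The wrapping itself is plain string substitution. A sketch of the presumed behaviour (not the library's internal code):

```elixir
template = "a photo of {label}"
labels = ["dog", "cat", "horse"]

# Each bare label is expanded into a full caption before tokenisation.
Enum.map(labels, &String.replace(template, "{label}", &1))
#=> ["a photo of dog", "a photo of cat", "a photo of horse"]
```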

You can override:

```elixir
# Photography-domain template
iex> Image.ZeroShot.classify(image, ["sunset", "rain", "snow"],
...>   template: "a high-quality photograph of {label}")

# Document-domain template
iex> Image.ZeroShot.classify(scan, ["invoice", "receipt", "letter"],
...>   template: "a scanned {label}")

# Disable templating if your labels are already full sentences
iex> Image.ZeroShot.classify(image,
...>   ["a black cat sitting on a chair", "an empty room with a chair"],
...>   template: nil)
```

A general rule: if your labels are nouns, leave the default template. If they're already descriptive sentences, use `template: nil`. If they're domain-specific (medical imagery, document scans, product photography), a domain-tailored template can help.

## Choosing a model

The default is `openai/clip-vit-base-patch32` — MIT licensed, ~600 MB, broad training coverage, well-validated. For higher quality at a larger size:

```elixir
iex> Image.ZeroShot.classify(image, labels, repo: "openai/clip-vit-large-patch14")
```

(About 1.7 GB — roughly 3× the size of the default.)

## Pre-downloading

```bash
mix image_vision.download_models --zero-shot
```

## Default model

[OpenAI CLIP ViT-B/32](https://huggingface.co/openai/clip-vit-base-patch32) — MIT licensed, ~600 MB. This is the original CLIP release and the most well-validated, broadly applicable variant. It contains both a vision encoder (ViT-B/32) and a text encoder (a transformer) that produce vectors in a shared 512-dimensional space.
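
Because both encoders project into that shared space, every comparison (image against label, or image against image) reduces to the cosine similarity of two vectors. A sketch in Nx, using small stand-in vectors in place of real 512-dimensional embeddings:

```elixir
# Stand-in embeddings; in practice these come from the CLIP encoders.
a = Nx.tensor([0.1, 0.7, -0.2, 0.4])
b = Nx.tensor([0.2, 0.5, -0.1, 0.3])

# Cosine similarity: dot product divided by the product of the norms.
cosine = Nx.divide(Nx.dot(a, b), Nx.multiply(Nx.LinAlg.norm(a), Nx.LinAlg.norm(b)))
```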

## Dependencies

Zero-shot classification requires `:bumblebee`, `:nx`, and an Nx backend such as `:exla`. Add to `mix.exs`:

```elixir
{:bumblebee, "~> 0.6"},
{:nx, "~> 0.10"},
{:exla, "~> 0.10"}
```
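
If you use EXLA, you will usually also want it set as the default Nx backend so inference runs on the compiled backend rather than the pure-Elixir one. The usual configuration, in `config/config.exs`:

```elixir
import Config

# Run Nx (and therefore Bumblebee inference) on EXLA by default.
config :nx, default_backend: EXLA.Backend
```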