
Commit 70f64c1

Improve install docs
1 parent 2380905 commit 70f64c1

5 files changed

Lines changed: 169 additions & 7 deletions


README.md

Lines changed: 65 additions & 2 deletions
```diff
@@ -47,8 +47,8 @@ def deps do
 
       # Required for Image.Classification and Image.Classification.embed/2
       {:bumblebee, "~> 0.6"},
-      {:nx, "~> 0.11"},
-      {:exla, "~> 0.9"},   # or {:torchx, "~> 0.9"} for Torch backend
+      {:nx, "~> 0.10"},
+      {:exla, "~> 0.10"},  # or {:torchx, "~> 0.10"} for Torch backend
 
       # Required for Image.Detection and Image.Segmentation
       {:ortex, "~> 0.1"}
```
All ML deps are optional — omit any you do not use. The library compiles cleanly without them.
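For example, a project that only needs detection and segmentation can skip the Bumblebee/Nx group entirely. A sketch (only the `:ortex` line comes from the snippet above; the rest is placeholder):

```elixir
defp deps do
  [
    # ...the library itself and any non-ML deps...
    {:ortex, "~> 0.1"}  # detection and segmentation only
  ]
end
```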
## Prerequisites

For the vast majority of users on Linux x86_64, macOS (Intel and Apple Silicon), and Windows x86_64, **no native toolchain is required**. The libraries used here ship precompiled native binaries for those platforms, and `mix deps.get` is all you need.

If your platform isn't covered by precompiled binaries — uncommon Linux distros, ARM Linux, glibc mismatches — you'll need:

* **A Rust toolchain** for `:ortex` (the ONNX runtime wrapper used by detection and segmentation) and for `:tokenizers` (pulled in transitively by `:bumblebee`). Install via [rustup](https://rustup.rs):

  ```bash
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  ```

* **A C compiler** for `:vix` (the libvips wrapper used by `:image`). On Linux, install `build-essential` (Debian/Ubuntu) or `gcc` (Fedora/RHEL); on macOS, install the Xcode Command Line Tools (`xcode-select --install`).

* **`libvips`** if you need advanced libvips features beyond what the precompiled NIF includes. On macOS: `brew install vips`. On Linux: your distro's `libvips-dev` / `vips-devel` package. Most users don't need this.

If you see Cargo or `cc` errors during `mix deps.compile`, you've likely landed on a platform without precompiled coverage — install the toolchain above and re-run.
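Before installing anything, you can check what is already visible to your shell. A hedged sketch (output varies per machine, and `pkg-config` may itself be absent, which is fine):

```shell
# Check optional build tools; none are needed on platforms with
# precompiled binaries.
command -v cargo >/dev/null 2>&1 && echo "cargo: $(cargo --version)" || echo "cargo: missing"
command -v cc >/dev/null 2>&1 && echo "cc: found" || echo "cc: missing"
pkg-config --exists vips 2>/dev/null && echo "libvips: $(pkg-config --modversion vips)" || echo "libvips: not found (usually fine)"
```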
### Disk space and first-call latency

Model weights are downloaded on first call and cached on disk. Across the default models the download sizes are approximately:

| Task | Default model | Size |
|---|---|---|
| Classification | `facebook/convnext-tiny-224` | ~110 MB |
| Detection | `onnx-community/rtdetr_r50vd` | ~175 MB |
| Segmentation (SAM 2) | `SharpAI/sam2-hiera-tiny-onnx` | ~150 MB |
| Segmentation (panoptic) | `Xenova/detr-resnet-50-panoptic` | ~175 MB |

The first call to each task therefore appears to "hang" while weights download — that's expected, not a bug.

To pre-download all default models before first use (recommended for production deployments and CI):

```bash
mix image_vision.download_models
```

Pass `--classify`, `--detect`, or `--segment` to limit scope.
### Livebook Desktop

Livebook Desktop launches as a GUI application and **does not inherit your shell's `PATH`**. Tools installed via `rustup`, `mise`, `asdf`, or Homebrew aren't visible to it by default — even if `cargo` works fine in your terminal.

If you hit "cargo: command not found" or similar during `Mix.install` inside Livebook Desktop, create `~/.livebookdesktop.sh` and add the relevant directories to `PATH`. A reasonable starting point:

```bash
# ~/.livebookdesktop.sh

# Rust (rustup)
export PATH="$HOME/.cargo/bin:$PATH"

# Homebrew (Apple Silicon)
export PATH="/opt/homebrew/bin:$PATH"

# mise — uncomment if you use it
# eval "$(mise activate bash)"

# asdf — uncomment if you use it
# . "$HOME/.asdf/asdf.sh"
```

Restart Livebook Desktop after creating this file. See the [Livebook Desktop documentation](https://github.com/livebook-dev/livebook/blob/main/README.md#livebook-desktop) for details.
## Default models

All models are permissively licensed. Weights are downloaded automatically on first call and cached on disk — no manual setup required.

guides/classification.md

Lines changed: 18 additions & 0 deletions
```elixir
config :image_vision, :classifier,
  featurizer: {:hf, "facebook/convnext-large-224-22k-1k"}
```

The embedder is configured the same way, independently:

```elixir
config :image_vision, :embedder,
  model: {:hf, "facebook/dinov2-large"},
  featurizer: {:hf, "facebook/dinov2-large"}
```

Any image classification or image embedding model that Bumblebee can load works as a drop-in replacement — anything from the [HuggingFace image-classification](https://huggingface.co/models?pipeline_tag=image-classification) or [feature-extraction](https://huggingface.co/models?pipeline_tag=feature-extraction) catalogue with a corresponding featurizer. Larger models trade speed and memory for accuracy. The label set will be whatever the chosen model was trained on — for example, `convnext-large-224-22k-1k` returns the same 1000 ImageNet labels as the default, while a model fine-tuned on a different dataset returns that dataset's labels.
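Hypothetically, pointing the classifier at a fine-tuned model from that catalogue (the repo name below is a placeholder) looks like:

```elixir
config :image_vision, :classifier,
  model: {:hf, "your-org/convnext-finetuned-plants"},
  featurizer: {:hf, "your-org/convnext-finetuned-plants"}
```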
To manage the serving yourself (e.g. in an umbrella app):

```elixir
config :image_vision, :classifier, autostart: false

children = [Image.Classification.classifier()]
```
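Put together, a host application might look like this. A sketch: `MyApp` and the supervision strategy are placeholders; only `Image.Classification.classifier/0` comes from the library:

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Started here because autostart: false is set in config
      Image.Classification.classifier()
      # ...your other children...
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```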
Configuration changes take effect at application start. After editing `config/runtime.exs`, restart the application; the new model is downloaded on first call and cached.

To pre-download a configured model so first-call latency is eliminated:

```bash
mix image_vision.download_models --classify
```

## Dependencies
Classification requires `:bumblebee`, `:nx`, and an Nx backend such as `:exla`. Add to `mix.exs`:

guides/detection.md

Lines changed: 24 additions & 2 deletions
```elixir
iex> Image.Detection.classes()
["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", ...]
```

## Using a different model

`detect/2` accepts `:repo` and `:filename` to swap in any RT-DETR-family ONNX model from HuggingFace:

```elixir
# Smaller R18 backbone (~80 MB) — faster, slightly less accurate
iex> Image.Detection.detect(image, repo: "onnx-community/rtdetr_r18vd")

# Quantized variant of the default (~45 MB INT8) — much smaller download, some accuracy cost
iex> Image.Detection.detect(image, filename: "onnx/model_quantized.onnx")
```
For one-off use, pass options per call. To make a non-default model the project default, you can wrap the call:

```elixir
defp detect(image), do: Image.Detection.detect(image, repo: "onnx-community/rtdetr_r18vd")
```

To pre-download a model into the cache:

```bash
mix image_vision.download_models --detect
```

(The download task fetches the configured default; if you've changed the repo for a single call site, the cache will populate on first use.)
### Caveat: COCO 80 labels are hardcoded

`detect/2` maps class indices to label strings using a baked-in COCO 80 list ([detection.ex](https://github.com/elixir-image/image_vision/blob/main/lib/detection.ex)). RT-DETR models trained on a different label set (e.g. Open Images, custom domains) will produce indices the wrapper can't translate — labels will be wrong even though boxes and scores are correct. For non-COCO models, use the underlying `Ortex.run/2` directly with the model's own `id2label`.
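A rough, hypothetical sketch of that escape hatch (the preprocessing, output destructuring, and `id2label` contents below are assumptions to be checked against the actual model card):

```elixir
# Hypothetical sketch only: real RT-DETR exports differ in input size,
# normalisation, and output tensor layout; consult the model card.
model = Ortex.load("priv/models/my_rtdetr.onnx")

# `input` is a {1, 3, H, W} image tensor preprocessed per the model card.
{boxes, logits} = Ortex.run(model, input)

# Translate class indices with the model's own id2label (from its
# config.json), instead of the wrapper's baked-in COCO 80 list.
id2label = %{0 => "my-class-a", 1 => "my-class-b"}

class_ids = logits |> Nx.argmax(axis: -1) |> Nx.to_flat_list()
labels = Enum.map(class_ids, &Map.get(id2label, &1, "unknown"))
```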
## Default model
RT-DETR (`onnx-community/rtdetr_r50vd`) is a real-time transformer-based detector that outperforms YOLOv8 on COCO while being **Apache 2.0 licensed** (YOLOv8/11 are AGPL). It is NMS-free — no Non-Maximum Suppression post-processing is needed.

guides/segmentation.md

Lines changed: 48 additions & 0 deletions
Adjust transparency with `:alpha` (default `0.5`):

```elixir
iex> overlay = Image.Segmentation.compose_overlay(street, segments, alpha: 0.3)
```
## Using a different model

Both `segment/2` and `segment_panoptic/2` accept options to swap models. They are passed per call rather than via app config — neither function uses a long-running serving, so there is no autostart cost to overriding on a single call.

### Promptable (SAM 2)

```elixir
# Use a larger SAM 2 variant for better quality on small or thin objects
iex> Image.Segmentation.segment(image,
...>   prompt: {:point, 320, 240},
...>   repo: "SharpAI/sam2-hiera-small-onnx")
```

`segment/2` accepts:

- `:repo` — any HuggingFace repo containing a SAM 2 ONNX export with separate encoder and decoder files
- `:encoder_file` — encoder filename within the repo (default `"encoder.onnx"`)
- `:decoder_file` — decoder filename within the repo (default `"decoder.onnx"`)

The expected protocol matches `SharpAI/sam2-hiera-tiny-onnx` (separate encoder/decoder, the standard SAM 2 ONNX export shape). Repos that bundle both into a single file or use a different I/O layout will not work without changes to the wrapper.
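For a repo that uses non-default filenames, the file options combine with `:repo`. A sketch (the repo name and filenames are placeholders):

```elixir
iex> Image.Segmentation.segment(image,
...>   prompt: {:point, 320, 240},
...>   repo: "your-org/sam2-export",
...>   encoder_file: "sam2_encoder.onnx",
...>   decoder_file: "sam2_decoder.onnx")
```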
### Class-labeled (DETR-panoptic)

```elixir
# Quantized variant — much smaller, some accuracy cost
iex> Image.Segmentation.segment_panoptic(image, model_file: "onnx/model_quantized.onnx")

# A different ONNX-exported DETR-panoptic repo
iex> Image.Segmentation.segment_panoptic(image, repo: "your-org/detr-panoptic-onnx")
```

`segment_panoptic/2` accepts:

- `:repo` — any HuggingFace repo with a DETR-panoptic ONNX export and a `config.json` providing `id2label`
- `:model_file` — ONNX filename within the repo (default `"onnx/model.onnx"`)

Labels are read from the repo's `config.json`. Where that config has placeholder `LABEL_n` entries, the wrapper falls back to the canonical [COCO panoptic taxonomy](https://github.com/cocodataset/panopticapi/blob/master/panoptic_coco_categories.json), so common stuff classes (`sky-other-merged`, `mountain-merged`, `grass-merged`, …) resolve correctly even on repos with incomplete configs.
### Pre-downloading

To populate the cache before first use:

```bash
mix image_vision.download_models --segment
```

This fetches the configured defaults. For non-default repos, the cache populates on first call to `segment/2` or `segment_panoptic/2`.

## Dependencies
Segmentation requires `:ortex`. Add to `mix.exs`:

livebooks/image_vision.livemd

Lines changed: 14 additions & 3 deletions
# Image Vision

> #### Livebook Desktop users {: .info}
>
> Livebook Desktop launches as a GUI app and does **not** inherit your terminal's `PATH`. If `Mix.install` below fails with "cargo: command not found" or similar, your Rust toolchain isn't visible to Livebook Desktop. Fix by creating `~/.livebookdesktop.sh` with at minimum:
>
> ```sh
> export PATH="$HOME/.cargo/bin:$PATH"
> # If you use mise/asdf, also activate them here.
> ```
>
> Restart Livebook Desktop after editing the file. See the project [README](https://github.com/elixir-image/image_vision#prerequisites) for details on which toolchains are needed.
## Object Detection

Upload an image and the model returns bounding boxes with labels for every recognised object. The default model is `onnx-community/rtdetr_r50vd` (RT-DETR, Apache 2.0, ~175 MB), which detects 80 COCO object classes.

```elixir
detection_input = Kino.Input.image("Image for object detection")
```
### Panoptic segmentation

The default model is `Xenova/detr-resnet-50-panoptic` (Apache 2.0, ~175 MB). It labels every pixel across 133 COCO panoptic categories — both things (cat, car) and stuff (sky, grass).

```elixir
panoptic_input = Kino.Input.image("Image for panoptic segmentation")
```
### Promptable segmentation (SAM 2)

The default model is `SharpAI/sam2-hiera-tiny-onnx` (Apache 2.0, ~150 MB total — encoder + decoder). Click a point inside the object you want to isolate and SAM 2 returns a precise pixel mask for it.

```elixir
sam_input = Kino.Input.image("Image for SAM 2 segmentation")
```
