1 change: 1 addition & 0 deletions README.md
@@ -10,6 +10,7 @@ Currently targets the ONNX backend. The pipeline is designed to support addition
|-------------------------------------------------------------------------------|---------|-------------------------------------------------|
| [`denoise-nafnet`](models/denoise-nafnet/README.md) | denoise | NAFNet denoiser trained on SIDD dataset |
| [`denoise-nind`](models/denoise-nind/README.md) | denoise | UNet denoiser trained on NIND dataset |
| [`embed-openclip-rn101-yfcc15m`](models/embed-openclip-rn101-yfcc15m/README.md) | embed | OpenCLIP RN101 trained on Flickr CC photos (YFCC15M) |
| [`mask-object-sam21-base-plus`](models/mask-object-sam21-base-plus/README.md) | mask | SAM 2.1 Hiera Base Plus for interactive masking |
| [`mask-object-sam21-small`](models/mask-object-sam21-small/README.md) | mask | SAM 2.1 Hiera Small for interactive masking |
| [`mask-object-sam21-tiny`](models/mask-object-sam21-tiny/README.md) | mask | SAM 2.1 Hiera Tiny for interactive masking |
96 changes: 96 additions & 0 deletions models/embed-openclip-rn101-yfcc15m/README.md
@@ -0,0 +1,96 @@
# OpenCLIP RN101 (YFCC15M)

ResNet-101 image embedder from the OpenCLIP project, trained on YFCC15M –
15 million Flickr photos with Creative Commons licences. Produces
512-dimensional embeddings used in darktable for centroid-based tag
suggestion and image-similarity search.

## Source

- Architecture & weights: [mlfoundations/open_clip](https://github.com/mlfoundations/open_clip), `RN101 / yfcc15m`
- License: MIT
- Paper: [Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143)
- Training data: [YFCC15M](https://www.researchgate.net/publication/270275998_The_New_Data_and_New_Challenges_in_Multimedia_Research_-_YFCC100M_The_New_Data_and_New_Challenges_in_Multimedia_Research) (Thomee et al.) – subset of YFCC100M curated by Radford et al. for CLIP training

## Why this model

YFCC15M is the only widely available CLIP training dataset where every image
was opt-in licensed by its author (Flickr Creative Commons). Other CLIP-family
training corpora (LAION, WebLI, OpenAI's WIT-400M, DataComp, MetaCLIP) are
all web scrapes without per-image consent. For darktable's open-source
compliance criteria – *"Models trained on undisclosed or scraped personal
data without consent are not accepted"* – YFCC15M is the right fit.

The trade-off is benchmark accuracy: ~31% ImageNet zero-shot top-1, vs ~67%
for LAION-trained ViT-B-32 or ~76% for SigLIP 2. For darktable's actual use
case (photo-domain centroid matching against a user's library) the gap is
expected to be narrower than the benchmark suggests, because YFCC's training
distribution is photo-aligned rather than the everything-on-the-web mix that
boosts LAION's ImageNet score.

## How it's used in darktable

Two cooperating workflows:

- **Per-user tag taxonomy.** Average the embeddings of all images carrying a
given user-applied tag to get a centroid in 512-dim space. Suggest that tag
on new images whose embedding is close to the centroid.
- **Cold-start defaults.** For users without enough examples of their own,
fall back to 86 precomputed centroids covering common photographic
concepts (see [`tags.md`](tags.md)). Hierarchical names use darktable's
`|` separator: `genre|landscape`, `subject|animal|dog`,
`lighting|golden hour`, etc.
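
The per-tag centroid idea can be sketched in a few lines of NumPy. This is an illustration only, not darktable's implementation: the function names, the re-normalisation of each centroid, and the `0.8` cosine threshold are assumptions made for the example. Because the embeddings are L2-normalised, a dot product with a unit-length centroid is the cosine similarity.

```python
import numpy as np


def l2_normalise(v: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def build_centroids(embeddings: np.ndarray,
                    tags_per_image: list[set[str]]) -> dict[str, np.ndarray]:
    """Average the embeddings of all images carrying each tag.

    Each centroid is re-normalised (an assumption of this sketch) so that
    a dot product with an image embedding is a cosine similarity.
    """
    by_tag: dict[str, list[np.ndarray]] = {}
    for embedding, tags in zip(embeddings, tags_per_image):
        for tag in tags:
            by_tag.setdefault(tag, []).append(embedding)
    return {tag: l2_normalise(np.mean(vectors, axis=0))
            for tag, vectors in by_tag.items()}


def suggest_tags(embedding: np.ndarray,
                 centroids: dict[str, np.ndarray],
                 threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return (tag, cosine similarity) pairs at or above the threshold, best first.

    The default threshold is a placeholder, not a value taken from darktable.
    """
    scores = [(tag, float(embedding @ centroid))
              for tag, centroid in centroids.items()]
    return sorted((s for s in scores if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)
```

The same `suggest_tags` call would work unchanged against the cold-start centroids, since those live in the same 512-dim space as the per-user ones.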

The text encoder is run *only at convert time* to bake those centroids
into `tags.json`. At runtime, darktable runs only the image encoder.

## Architecture

| Property | Value |
| -------- | ----- |
| Visual backbone | Modified ResNet-101 (CLIP variant: anti-aliased pooling, attention pool head) |
| Input | 224×224 RGB, normalised with OpenAI CLIP stats (mean 0.481 / 0.458 / 0.408, std 0.269 / 0.261 / 0.276) |
| Embedding dim | 512 |
| Parameters | ~130M |

## ONNX Assets

| File         | Purpose                                                                   | Size          |
|--------------|---------------------------------------------------------------------------|---------------|
| `model.onnx` | image → 512-dim L2-normalised embedding (mean/std normalisation baked in) | ~60 MB (FP16) |
| `tags.json`  | precomputed 512-dim centroids for the 86-tag default taxonomy             | ~270 KB       |

The convert wrapper bakes mean/std normalisation (subtract the per-channel
mean, divide by the standard deviation) and L2 normalisation of the output
into the ONNX graph, so the caller only needs to feed `[0, 1]` RGB pixels.
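
As a rough sketch of what "feed `[0, 1]` RGB pixels" means for a caller, the helper below turns an already-resized 224×224 `uint8` image into a float tensor. The NCHW layout, the single graph output, and the function names are assumptions of this example and are not confirmed by the model card; resizing and cropping are left out. `baked_in_normalisation` only shows the arithmetic the graph is described as doing internally, so callers should not repeat it.

```python
import numpy as np

# OpenAI CLIP per-channel normalisation constants, RGB order.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def to_model_input(rgb_u8: np.ndarray) -> np.ndarray:
    """Convert a 224x224x3 uint8 RGB image to a 1x3x224x224 float32 tensor in [0, 1].

    The NCHW layout is an assumption; check the exported graph's input shape.
    """
    if rgb_u8.shape != (224, 224, 3) or rgb_u8.dtype != np.uint8:
        raise ValueError("expected a 224x224x3 uint8 RGB image")
    scaled = rgb_u8.astype(np.float32) / 255.0
    return np.ascontiguousarray(scaled.transpose(2, 0, 1)[np.newaxis])


def baked_in_normalisation(x01: np.ndarray) -> np.ndarray:
    """For illustration: the (x - mean) / std step that already lives in the graph."""
    return (x01 - CLIP_MEAN[None, :, None, None]) / CLIP_STD[None, :, None, None]


def embed(session, rgb_u8: np.ndarray) -> np.ndarray:
    """Run an existing onnxruntime.InferenceSession on a single image.

    Assumes the graph has exactly one input and one output.
    """
    input_name = session.get_inputs()[0].name
    (embeddings,) = session.run(None, {input_name: to_model_input(rgb_u8)})
    return embeddings[0]
```

With ONNX Runtime installed, `session` would typically be created with `onnxruntime.InferenceSession("model.onnx")`.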

The visual backbone runs in FP16; the graph's input and output stay
FP32 because the casts happen inside the export wrapper (after mean/std
normalisation on the input, before L2 normalisation on the output).
Callers do not need to add any cast logic.
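
The order of the casts can be illustrated with a small NumPy sketch. It is not the actual export wrapper: the ResNet is replaced by a hypothetical stand-in (a single FP16 matrix product), and all names are invented for the example. The point is only where the dtype changes: FP32 normalisation, FP16 backbone, FP32 L2 normalisation. Keeping the final norm in FP32 is a plausible design choice because FP16 tops out at 65504, which leaves little headroom for a sum of squares.

```python
import numpy as np

# OpenAI CLIP normalisation constants, shaped for NCHW broadcasting.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32).reshape(1, 3, 1, 1)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32).reshape(1, 3, 1, 1)


def standin_backbone_fp16(x16: np.ndarray, weights16: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the FP16 ResNet: flatten, then one projection."""
    assert x16.dtype == np.float16 and weights16.dtype == np.float16
    return x16.reshape(x16.shape[0], -1) @ weights16


def wrapped_forward(x01: np.ndarray, weights16: np.ndarray) -> np.ndarray:
    """FP32 in, FP32 out, with the FP16 section hidden in the middle."""
    # 1. Mean/std normalisation in FP32, on [0, 1] pixels.
    normalised = (x01.astype(np.float32) - MEAN) / STD
    # 2. Cast down and run the backbone in FP16.
    features16 = standin_backbone_fp16(normalised.astype(np.float16), weights16)
    # 3. Cast back up, then L2-normalise in FP32.
    features32 = features16.astype(np.float32)
    return features32 / np.linalg.norm(features32, axis=-1, keepdims=True)
```

A caller of `wrapped_forward` only ever sees FP32 arrays, which mirrors the behaviour described for the exported graph.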

## Notes

- **No text encoder ONNX in the runtime package.** The text encoder runs only
at darktable-ai build time to produce `tags.json`. Free-text search would
require shipping the text encoder separately – deferred until that feature
lands in darktable.
- **Multilingual upgrade path is clean.** A future multilingual text encoder
distilled to match this YFCC15M-aligned space could be added as an optional
sidecar package without invalidating users' indexed image embeddings.
- **No re-indexing required for text-encoder additions.** As long as the
vision encoder stays unchanged, existing image embeddings remain valid.

## Selection Criteria

| Property | Value |
| -------- | ----- |
| Model license | MIT |
| OSAID v1.0 | Open Source AI |
| MOF | Class I (Open Science) |
| Training data license | Creative Commons (mixed: BY, BY-SA, BY-NC, BY-ND, BY-NC-SA, BY-NC-ND, public-domain) |
| Training data provenance | YFCC100M (Yahoo Flickr Creative Commons 100M), subset of 15M curated by Radford et al. for CLIP training. All images uploaded to Flickr by their authors under Creative Commons licences. |
| Training code | Open: [open_clip](https://github.com/mlfoundations/open_clip) MIT |
| Known limitations | ResNet-101 backbone reaches modest benchmark scores compared to web-scrape-trained ViT models; the trade-off is deliberate to honour darktable's consent-based training-data criterion. |
| Published research | [Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143) |
| Inference | Local only, no cloud dependencies |
| Scope | Image feature extraction for tagging and similarity (no image synthesis or modification) |
| Reproducibility | Full: `open_clip.create_model_and_transforms("RN101", pretrained="yfcc15m")` reproduces the weights; `convert.py` regenerates the ONNX export and tag centroids deterministically. |
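
One way to sanity-check an export against the reference weights is sketched below. It assumes `torch`, `open_clip`, and Pillow are installed; both function names are hypothetical and this is not the project's `convert.py`. `reference_embedding` uses the `open_clip` call quoted in the table and downloads the weights on first use. `cosine_agreement` compares that result with an embedding obtained from `model.onnx`.

```python
import numpy as np


def reference_embedding(image_path: str) -> np.ndarray:
    """Embed one image with the original PyTorch weights."""
    # Imported lazily so cosine_agreement() works without these packages.
    import torch
    import open_clip
    from PIL import Image

    # Returns (model, train transform, eval transform); the eval one is used here.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "RN101", pretrained="yfcc15m")
    model.eval()
    pixels = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        features = model.encode_image(pixels)
    # L2-normalise so the result is comparable with the ONNX output.
    features = features / features.norm(dim=-1, keepdim=True)
    return features[0].cpu().numpy().astype(np.float32)


def cosine_agreement(a, b) -> float:
    """Cosine similarity between two embeddings of the same image."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

An agreement slightly below 1.0 would be unsurprising given the FP16 backbone, and the comparison is only meaningful when both paths see identically resized and cropped pixels.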