| 1 | +# OpenCLIP RN101 (YFCC15M) |
| 2 | + |
| 3 | +ResNet-101 image embedder from the OpenCLIP project, trained on YFCC15M – |
| 4 | +15 million Flickr photos with Creative Commons licences. Produces |
| 5 | +512-dimensional embeddings used in darktable for centroid-based tag |
| 6 | +suggestion and image-similarity search. |
| 7 | + |
| 8 | +## Source |
| 9 | + |
| 10 | +- Architecture & weights: [mlfoundations/open_clip](https://github.com/mlfoundations/open_clip), `RN101 / yfcc15m` |
| 11 | +- License: MIT |
| 12 | +- Paper: [Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143) |
| 13 | +- Training data: [YFCC15M](https://www.researchgate.net/publication/270275998_The_New_Data_and_New_Challenges_in_Multimedia_Research_-_YFCC100M_The_New_Data_and_New_Challenges_in_Multimedia_Research) (Thomee et al.) – subset of YFCC100M curated by Radford et al. for CLIP training |
| 14 | + |
| 15 | +## Why this model |
| 16 | + |
| 17 | +YFCC15M is the only widely available CLIP training dataset where every image
| 18 | +was opt-in licensed by its author (Flickr Creative Commons). Other CLIP-family |
| 19 | +training corpora (LAION, WebLI, OpenAI's WIT-400M, DataComp, MetaCLIP) are |
| 20 | +all web scrapes without per-image consent. For darktable's open-source |
| 21 | +compliance criteria – *"Models trained on undisclosed or scraped personal |
| 22 | +data without consent are not accepted"* – YFCC15M is the right fit. |
| 23 | + |
| 24 | +The trade-off: ~31% ImageNet zero-shot top-1, vs ~67% for LAION-trained |
| 25 | +ViT-B-32 or ~76% for SigLIP 2. For darktable's actual use case (photo-domain |
| 26 | +centroid matching against a user's library) the gap is expected to be narrower than the
| 27 | +benchmark suggests, because YFCC's training distribution is photo-aligned |
| 28 | +rather than the everything-on-the-web mix that boosts LAION's ImageNet score. |
| 29 | + |
| 30 | +## How it's used in darktable |
| 31 | + |
| 32 | +Two cooperating workflows: |
| 33 | + |
| 34 | +- **Per-user tag taxonomy.** Average image embeddings per user-applied tag |
| 35 | + → centroid in 512-dim space. Suggest the same tag on new images whose |
| 36 | + embedding is close to that centroid. |
| 37 | +- **Cold-start defaults.** For users without enough examples of their own, |
| 38 | + fall back to 86 precomputed centroids covering common photographic |
| 39 | + concepts (see [`tags.md`](tags.md)). Hierarchical names use darktable's |
| 40 | + `|` separator: `genre|landscape`, `subject|animal|dog`, |
| 41 | + `lighting|golden hour`, etc. |
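
The per-tag averaging and nearest-centroid suggestion described in the two workflows above can be sketched in plain Python. This is a minimal illustration, not darktable's implementation: the function names, the re-normalisation of the centroid, the use of cosine similarity as the closeness measure, and the 0.8 threshold are all assumptions. Toy 3-dimensional vectors stand in for the 512-dimensional embeddings.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def tag_centroid(embeddings):
    """Average the embeddings of all images carrying one tag, then re-normalise."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalise(mean)


def suggest_tags(embedding, centroids, threshold=0.8):
    """Return (tag, cosine similarity) pairs at or above the threshold, best first.

    Query and centroids are unit length, so the dot product equals the
    cosine similarity. The threshold is an illustrative value only.
    """
    query = l2_normalise(embedding)
    scored = [
        (tag, sum(a * b for a, b in zip(query, centroid)))
        for tag, centroid in centroids.items()
    ]
    hits = [(tag, score) for tag, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical user library: two images tagged as dog, one as landscape.
centroids = {
    "subject|animal|dog": tag_centroid([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]),
    "genre|landscape": tag_centroid([[0.0, 0.0, 1.0]]),
}
suggestions = suggest_tags([0.9, 0.3, 0.0], centroids)
```

In this toy library only the dog tag clears the threshold for the query image. The cold-start centroids could be passed to `suggest_tags` in the same way once loaded from `tags.json`, whose on-disk layout is not specified here.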
| 42 | + |
| 43 | +The text encoder is run *only at convert time* to bake those centroids |
| 44 | +into `tags.json`. At runtime, darktable runs only the image encoder. |
| 45 | + |
| 46 | +## Architecture |
| 47 | + |
| 48 | +| Property | Value | |
| 49 | +| -------- | ----- | |
| 50 | +| Visual backbone | Modified ResNet-101 (CLIP variant: anti-aliased pooling, attention pool head) | |
| 51 | +| Input | 224×224 RGB, normalised with OpenAI CLIP stats (mean 0.481 / 0.458 / 0.408, std 0.269 / 0.261 / 0.276) |
| 52 | +| Embedding dim | 512 | |
| 53 | +| Parameters | ~130M | |
| 54 | + |
| 55 | +## ONNX Assets |
| 56 | + |
| 57 | +| File | Purpose | Size | |
| 58 | +|--------------|--------------------------------------------------------------------------|---------------| |
| 59 | +| `model.onnx` | image → 512-dim L2-normalised embedding (mean/std subtraction baked in) | ~60 MB (FP16) | |
| 60 | +| `tags.json` | precomputed 512-dim centroids for the 86-tag default taxonomy | ~270 KB | |
| 61 | + |
| 62 | +The convert wrapper bakes mean/std subtraction and L2 normalisation into the |
| 63 | +ONNX graph so the caller only needs to feed `[0, 1]` RGB pixels. |
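
For reference, the two steps that the graph is described as absorbing can be written out in plain Python. This is a sketch of the arithmetic only, not the code in the convert wrapper: the helper names are hypothetical, and the constants are the full-precision OpenAI CLIP statistics as defined in open_clip (`OPENAI_DATASET_MEAN` and `OPENAI_DATASET_STD`).

```python
import math

# Per-channel statistics (R, G, B) from open_clip's OpenAI CLIP defaults.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalise_pixel(rgb):
    """Map one [0, 1] RGB pixel to the standardised values the backbone sees."""
    return tuple((v - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def l2_normalise(vec):
    """Scale an embedding to unit Euclidean length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Input side: what happens to a mid-grey pixel before the backbone.
mid_grey = normalise_pixel((0.5, 0.5, 0.5))

# Output side: a toy 2-dim vector stands in for the 512-dim embedding.
unit = l2_normalise([3.0, 4.0])
```

Because both steps live inside the exported graph, a caller that applied them again would normalise twice on the input side and skew the embeddings.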
| 64 | + |
| 65 | +The visual backbone runs in FP16; the graph's input and output stay |
| 66 | +FP32 because the casts happen inside the export wrapper (after mean/std |
| 67 | +subtraction on the input, before L2 normalisation on the output). |
| 68 | +Callers do not need to add any cast logic. |
| 69 | + |
| 70 | +## Notes |
| 71 | + |
| 72 | +- **No text encoder ONNX in the runtime package.** The text encoder runs only |
| 73 | + at darktable-ai build time to produce `tags.json`. Free-text search would |
| 74 | + require shipping the text encoder separately – deferred until that feature |
| 75 | + lands in darktable. |
| 76 | +- **Multilingual upgrade path is clean.** A future multilingual text encoder |
| 77 | + distilled to match this YFCC15M-aligned space could be added as an optional |
| 78 | + sidecar package without invalidating users' indexed image embeddings. |
| 79 | +- **No re-indexing required for text-encoder additions.** As long as the
| 80 | +  vision encoder stays unchanged, existing image embeddings remain valid.
| 81 | + |
| 82 | +## Selection Criteria |
| 83 | + |
| 84 | +| Property | Value | |
| 85 | +| -------- | ----- | |
| 86 | +| Model license | MIT | |
| 87 | +| OSAID v1.0 | Open Source AI | |
| 88 | +| MOF | Class I (Open Science) | |
| 89 | +| Training data license | Creative Commons (mixed: BY, BY-SA, BY-NC, BY-ND, BY-NC-SA, BY-NC-ND, public-domain) | |
| 90 | +| Training data provenance | YFCC100M (Yahoo Flickr Creative Commons 100M); the 15M-image subset was curated by Radford et al. for CLIP training. All images were uploaded to Flickr by their authors under Creative Commons licences. |
| 91 | +| Training code | Open: [open_clip](https://github.com/mlfoundations/open_clip) MIT | |
| 92 | +| Known limitations | ResNet-101 backbone reaches modest benchmark scores compared to web-scrape-trained ViT models; the trade-off is deliberate to honour darktable's consent-based training-data criterion. | |
| 93 | +| Published research | [Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143) | |
| 94 | +| Inference | Local only, no cloud dependencies | |
| 95 | +| Scope | Image feature extraction for tagging and similarity (no image synthesis or modification) | |
| 96 | +| Reproducibility | Full: `open_clip.create_model_and_transforms("RN101", pretrained="yfcc15m")` loads the published weights; `convert.py` regenerates the ONNX export and tag centroids deterministically. |
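
The call named in the Reproducibility row can be wrapped as follows. This is a minimal sketch that assumes the `open_clip` package is installed and that the weights can be downloaded on first use; the function name `load_reference_model` is hypothetical and is not part of `convert.py`.

```python
def load_reference_model():
    """Load the reference PyTorch model and its inference-time preprocessing."""
    # Imported lazily so that open_clip is only required when the function runs.
    import open_clip

    # create_model_and_transforms returns (model, train transform, eval transform).
    model, _preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
        "RN101", pretrained="yfcc15m"
    )
    model.eval()  # switch to inference mode before computing embeddings
    return model, preprocess_val
```

The returned `preprocess_val` transform applies the resize and mean/std normalisation itself, so it is useful for comparing PyTorch embeddings against the ONNX export, which expects un-normalised `[0, 1]` pixels.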