Commit c2c86fc

Add embed-openclip-rn101-yfcc15m model (FP16, 86-tag default vocab)
1 parent 5454d7a commit c2c86fc

8 files changed

Lines changed: 952 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ Currently targets the ONNX backend. The pipeline is designed to support addition
|-------------------------------------------------------------------------------|---------|-------------------------------------------------|
| [`denoise-nafnet`](models/denoise-nafnet/README.md) | denoise | NAFNet denoiser trained on SIDD dataset |
| [`denoise-nind`](models/denoise-nind/README.md) | denoise | UNet denoiser trained on NIND dataset |
| [`embed-openclip-rn101-yfcc15m`](models/embed-openclip-rn101-yfcc15m/README.md) | embed | OpenCLIP RN101 trained on Flickr CC photos (YFCC15M) |
| [`mask-object-sam21-base-plus`](models/mask-object-sam21-base-plus/README.md) | mask | SAM 2.1 Hiera Base Plus for interactive masking |
| [`mask-object-sam21-small`](models/mask-object-sam21-small/README.md) | mask | SAM 2.1 Hiera Small for interactive masking |
| [`mask-object-sam21-tiny`](models/mask-object-sam21-tiny/README.md) | mask | SAM 2.1 Hiera Tiny for interactive masking |
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# OpenCLIP RN101 (YFCC15M)

ResNet-101 image embedder from the OpenCLIP project, trained on YFCC15M –
15 million Flickr photos with Creative Commons licences. Produces
512-dimensional embeddings used in darktable for centroid-based tag
suggestion and image-similarity search.

## Source

- Architecture & weights: [mlfoundations/open_clip](https://github.com/mlfoundations/open_clip), `RN101 / yfcc15m`
- License: MIT
- Paper: [Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143)
- Training data: [YFCC15M](https://www.researchgate.net/publication/270275998_The_New_Data_and_New_Challenges_in_Multimedia_Research_-_YFCC100M_The_New_Data_and_New_Challenges_in_Multimedia_Research) (Thomee et al.) – subset of YFCC100M curated by Radford et al. for CLIP training

## Why this model

YFCC15M is the only widely available CLIP training dataset where every image
was opt-in licensed by its author (Flickr Creative Commons). Other CLIP-family
training corpora (LAION, WebLI, OpenAI's WIT-400M, DataComp, MetaCLIP) are
all web scrapes without per-image consent. For darktable's open-source
compliance criteria – *"Models trained on undisclosed or scraped personal
data without consent are not accepted"* – YFCC15M is the right fit.

The trade-off: ~31% ImageNet zero-shot top-1, vs ~67% for LAION-trained
ViT-B-32 or ~76% for SigLIP 2. For darktable's actual use case (photo-domain
centroid matching against a user's library) the gap is narrower than the
benchmark suggests, because YFCC's training distribution is photo-aligned
rather than the everything-on-the-web mix that boosts LAION's ImageNet score.

## How it's used in darktable

Two cooperating workflows:

- **Per-user tag taxonomy.** Average image embeddings per user-applied tag
  → centroid in 512-dim space. Suggest the same tag on new images whose
  embedding is close to that centroid.
- **Cold-start defaults.** For users without enough examples of their own,
  fall back to 86 precomputed centroids covering common photographic
  concepts (see [`tags.md`](tags.md)). Hierarchical names use darktable's
  `|` separator: `genre|landscape`, `subject|animal|dog`,
  `lighting|golden hour`, etc.

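The centroid workflow above can be sketched in a few lines of plain Python. This is an illustration only, not darktable's implementation: the function names, the re-normalisation of the averaged centroid, and the `threshold` value are all assumptions, and the toy vectors in the usage note stand in for real 512-dim embeddings.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def tag_centroid(embeddings):
    """Average the embeddings of all images carrying one tag.

    Re-normalising the mean is an assumption made here so that dot
    products against the centroid stay in the cosine-similarity range.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalise(mean)

def suggest_tags(embedding, centroids, threshold=0.8):
    """Return (tag, similarity) pairs at or above `threshold`, best first.

    `centroids` maps a tag name to a unit-length centroid. For unit
    vectors the dot product equals the cosine similarity. The default
    threshold is a hypothetical value, not one taken from darktable.
    """
    scored = [
        (tag, sum(a * b for a, b in zip(embedding, centroid)))
        for tag, centroid in centroids.items()
    ]
    hits = [(tag, score) for tag, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

For example, with toy 3-dim vectors, `suggest_tags(l2_normalise([1.0, 0.05, 0.05]), {"subject|animal|dog": tag_centroid([l2_normalise([1.0, 0.1, 0.0]), l2_normalise([0.9, 0.0, 0.1])])})` returns a single hit for `subject|animal|dog`. The same function would serve both workflows, since per-user and cold-start centroids live in the same embedding space.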

The text encoder is run *only at convert time* to bake those centroids
into `tags.json`. At runtime, darktable runs only the image encoder.

## Architecture

| Property | Value |
| -------- | ----- |
| Visual backbone | Modified ResNet-101 (CLIP variant: anti-aliased pooling, attention pool head) |
| Input | 224×224 RGB, normalised with OpenAI CLIP stats (mean 0.481 / 0.458 / 0.408, std 0.269 / 0.261 / 0.276) |
| Embedding dim | 512 |
| Parameters | ~130M |

## ONNX Assets

| File         | Purpose                                                                    | Size          |
|--------------|----------------------------------------------------------------------------|---------------|
| `model.onnx` | image → 512-dim L2-normalised embedding (mean/std normalisation baked in) | ~60 MB (FP16) |
| `tags.json`  | precomputed 512-dim centroids for the 86-tag default taxonomy              | ~270 KB       |

The convert wrapper bakes mean/std normalisation and L2 normalisation into the
ONNX graph so the caller only needs to feed `[0, 1]` RGB pixels.

The visual backbone runs in FP16; the graph's input and output stay
FP32 because the casts happen inside the export wrapper (after mean/std
normalisation on the input, before L2 normalisation on the output).
Callers do not need to add any cast logic.
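
As a reference for what the graph computes internally, the two baked-in steps can be written out in plain Python. This is a sketch for illustration, not code from `convert.py`; the function names are hypothetical. The constants are the standard OpenAI CLIP channel statistics. Callers of `model.onnx` do not need to run any of this themselves.

```python
import math

# OpenAI CLIP per-channel statistics, in R, G, B order.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalise_pixel(rgb):
    """Map one [0, 1] RGB pixel to (value - mean) / std per channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

def l2_normalise(vec):
    """Scale an embedding to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def cosine_similarity(a, b):
    """Dot product of two vectors; equals cosine similarity when both are unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Because the exported embeddings are already unit length, comparing two images or an image against a tag centroid reduces to the dot product in `cosine_similarity`.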

## Notes

- **No text encoder ONNX in the runtime package.** The text encoder runs only
  at darktable-ai build time to produce `tags.json`. Free-text search would
  require shipping the text encoder separately – deferred until that feature
  lands in darktable.
- **Multilingual upgrade path is clean.** A future multilingual text encoder
  distilled to match this YFCC15M-aligned space could be added as an optional
  sidecar package without invalidating users' indexed image embeddings.
- **No re-indexing required for text-encoder additions.** As long as the
  vision encoder stays unchanged, image embeddings remain valid forever.

## Selection Criteria

| Property | Value |
| -------- | ----- |
| Model license | MIT |
| OSAID v1.0 | Open Source AI |
| MOF | Class I (Open Science) |
| Training data license | Creative Commons (mixed: BY, BY-SA, BY-NC, BY-ND, BY-NC-SA, BY-NC-ND, public-domain) |
| Training data provenance | YFCC100M (Yahoo Flickr Creative Commons 100M), subset of 15M curated by Radford et al. for CLIP training. All images uploaded to Flickr by their authors under Creative Commons licences. |
| Training code | Open: [open_clip](https://github.com/mlfoundations/open_clip) MIT |
| Known limitations | ResNet-101 backbone reaches modest benchmark scores compared to web-scrape-trained ViT models; the trade-off is deliberate to honour darktable's consent-based training-data criterion. |
| Published research | [Reproducible scaling laws for contrastive language-image learning](https://arxiv.org/abs/2212.07143) |
| Inference | Local only, no cloud dependencies |
| Scope | Image feature extraction for tagging and similarity (no image synthesis or modification) |
| Reproducibility | Full: `open_clip.create_model_and_transforms("RN101", pretrained="yfcc15m")` reproduces the weights; `convert.py` regenerates the ONNX export and tag centroids deterministically. |
