# docs: Update image and text embeddings documentation (#449)
## Description
Changed:
- moved the note about normalization below the supported models tables
- updated the CLIP model names
- updated the CLIP model description
- added a note about cosine similarity (we no longer normalize text and image embeddings ourselves by default; normalization is done inside the models)
### Type of change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [x] Documentation update (improves or adds clarity to existing
documentation)
### Checklist
- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings
#### `docs/docs/02-hooks/01-natural-language-processing/useTextEmbeddings.md` (25 additions, 15 deletions)
To run the model, you can use the `forward` method. It accepts one argument, which is a string representing the text you want to embed. The function returns a promise, which can resolve either to an error or to an array of numbers representing the embedding. (The `:::info` note about normalization that previously followed this paragraph was moved below the supported models table.)
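A minimal usage sketch of the hook. The config shape (`modelSource`/`tokenizerSource`), the `isReady` flag, and the `ALL_MINILM_L6_V2` constants are assumptions based on the parameter descriptions and benchmark names in this page; check the hook reference for the exact exports.

```tsx
import {
  useTextEmbeddings,
  ALL_MINILM_L6_V2,
  ALL_MINILM_L6_V2_TOKENIZER,
} from 'react-native-executorch';

function SimilarityCheck() {
  // Assumed config shape and constant names; verify against the library's exports.
  const model = useTextEmbeddings({
    modelSource: ALL_MINILM_L6_V2,
    tokenizerSource: ALL_MINILM_L6_V2_TOKENIZER,
  });

  const embed = async (text: string) => {
    if (!model.isReady) return; // wait until the model has loaded
    // `forward` takes a string and resolves to the embedding (number[]).
    const embedding = await model.forward(text);
    console.log(embedding.length); // 384 for all-MiniLM-L6-v2
  };

  return null; // render your UI here
}
```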
The supported models table was updated: the Max Tokens values were corrected (256 → 254, 384 → 382, 511 → 509, 512 → 510) and a row for the CLIP text encoder was added.

| Model | Language | Max Tokens | Embedding Dimensions | Description |
| ----- | -------- | :--------: | :------------------: | ----------- |
|[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)| English | 254 | 384 | All-round model tuned for many use cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
|[all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)| English | 382 | 768 | All-round model tuned for many use cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
|[multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1)| English | 509 | 384 | This model was tuned for semantic search: given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
|[multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1)| English | 510 | 768 | This model was tuned for semantic search: given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
|[clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32)| English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP embeds images and text into the same vector space, which makes it possible to find similar images and to implement image search. This is the text encoder part of the CLIP model. To embed images, check out [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |
**`Max Tokens`** - the maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.
**`Embedding Dimensions`** - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input text.
The normalization note now appears here:

:::info
For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity: just calculate the dot product of two vectors to get the cosine similarity score.
:::
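Because the vectors are unit-length, cosine similarity reduces to a plain dot product. A small helper in plain TypeScript, with no library assumptions:

```ts
// For normalized (unit-length) vectors, the dot product equals the cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embeddings must have the same dimensionality');
  }
  let dot = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
  }
  return dot; // ~1 = very similar, ~0 = unrelated, negative = dissimilar
}
```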
## Benchmarks
### Model size
The CLIP entry was renamed from `CLIP_TEXT_ENCODER` to `CLIP_VIT_BASE_PATCH32_TEXT`:

| Model | XNNPACK [MB] |
| -------------------------- | :----------: |
| ALL_MPNET_BASE_V2 | 438 |
| MULTI_QA_MINILM_L6_COS_V1 | 91 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 438 |
| CLIP_VIT_BASE_PATCH32_TEXT | 254 |
### Memory usage
The same rename applies here:

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------------------------- | :--------------------: | :----------------: |
| ALL_MPNET_BASE_V2 | 520 | 470 |
| MULTI_QA_MINILM_L6_COS_V1 | 160 | 225 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 540 | 500 |
| CLIP_VIT_BASE_PATCH32_TEXT | 275 | 250 |
### Inference time
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
#### `docs/docs/02-hooks/02-computer-vision/useImageEmbeddings.md`

The `modelSource` parameter is a string that specifies the location of the model binary. To run the model, you can use the `forward` method. It accepts one argument, which is a URI/URL of the image you want to encode. The function returns a promise, which can resolve either to an error or to an array of numbers representing the embedding. (As on the text embeddings page, the normalization note that previously followed this paragraph was moved below the supported models table.)
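A sketch of the hook in use. The config shape, the `isReady` flag, and the `CLIP_VIT_BASE_PATCH32_IMAGE` constant (mirroring the benchmark entry below) are assumptions; verify the actual exports.

```tsx
import {
  useImageEmbeddings,
  CLIP_VIT_BASE_PATCH32_IMAGE,
} from 'react-native-executorch';

function ImageEmbedder() {
  // Constant name is assumed from the benchmark table; verify the export.
  const model = useImageEmbeddings({ modelSource: CLIP_VIT_BASE_PATCH32_IMAGE });

  const embedImage = async (uri: string) => {
    if (!model.isReady) return; // wait until the model has loaded
    // `forward` takes an image URI/URL and resolves to number[]
    // (512 dimensions for CLIP ViT-B/32).
    return await model.forward(uri);
  };

  return null; // render your UI here
}
```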
The model was renamed from `clip-vit-base-patch32-image-encoder` (previously described only as "trained using contrastive learning for image search use cases") to `clip-vit-base-patch32-image`, with a fuller description:

| Model | Language | Image size | Embedding Dimensions | Description |
| ----- | -------- | :--------: | :------------------: | ----------- |
|[clip-vit-base-patch32-image](https://huggingface.co/openai/clip-vit-base-patch32)| English | 224 x 224 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP embeds images and text into the same vector space, which makes it possible to find similar images and to implement image search. This is the image encoder part of the CLIP model. To embed text, check out [clip-vit-base-patch32-text](../01-natural-language-processing/useTextEmbeddings.md#supported-models). |

**`Image size`** - the size of the image that the model takes as input. Resizing happens automatically.

**`Embedding Dimensions`** - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input image.

:::info
For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity: just calculate the dot product of two vectors to get the cosine similarity score.
:::
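To illustrate the shared text/image vector space, here is a hedged sketch of the image search use case the description mentions: rank precomputed image embeddings against a text-query embedding. It operates on plain `number[]` vectors, so it assumes nothing about the API beyond `forward` returning the embedding; the query embedding would come from the text encoder, the index entries from the image encoder.

```ts
type IndexedImage = { uri: string; embedding: number[] };

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

// Because CLIP text and image embeddings live in the same normalized space,
// the dot product of a text-query embedding with each image embedding is a
// cosine-similarity score we can sort by.
function searchImages(
  queryEmbedding: number[],
  index: IndexedImage[],
  topK = 5,
): IndexedImage[] {
  return [...index]
    .sort(
      (a, b) =>
        dot(b.embedding, queryEmbedding) - dot(a.embedding, queryEmbedding),
    )
    .slice(0, topK);
}
```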
## Benchmarks
### Model size
The benchmark entry was renamed from `CLIP_VIT_BASE_PATCH_32` to `CLIP_VIT_BASE_PATCH32_IMAGE`:

| Model | XNNPACK [MB] |
| --------------------------- | :----------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 352 |
### Memory usage
| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |

### Inference time

:::info
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization. Performance also heavily depends on image size, because resizing is an expensive operation, especially on low-end devices.
:::

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
#### `docs/docs/03-typescript-api/02-computer-vision/ImageEmbeddingsModule.md` (0 additions, 4 deletions)

To load the model, use the `load` method. It accepts the `modelSource`, which is the location of the model binary.
## Running the model
To run the model, use the `forward` method. It accepts one argument, which is a URI/URL of the image you want to encode. The function returns a promise, which can resolve either to an error or to an array of numbers representing the embedding. The `:::info` note stating that the returned embedding vector is normalized was removed from this page; that information now lives in the hooks documentation below the supported models tables.
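A sketch against the module API described here. Whether the module is used statically or via an instance may depend on the library version, and the model URL is a placeholder; only `load(modelSource)` and `forward(uri)` are taken from this page.

```ts
import { ImageEmbeddingsModule } from 'react-native-executorch';

// Placeholder URL: point this at your exported CLIP image-encoder binary.
const MODEL_SOURCE = 'https://example.com/clip-vit-base-patch32-image.pte';

async function embedImage(uri: string): Promise<number[]> {
  // `load` accepts the modelSource (the location of the model binary).
  await ImageEmbeddingsModule.load(MODEL_SOURCE);
  // `forward` takes an image URI/URL and resolves to the embedding.
  return ImageEmbeddingsModule.forward(uri);
}
```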