
Commit 274703b

docs: Update image and text embeddings documentation (#449)
## Description

Changed:

- moved note about normalization
- changed CLIP model names
- changed CLIP model description
- added cosine similarity (now we don't normalize text and image embeddings by default), normalization is done inside the models

### Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] Documentation update (improves or adds clarity to existing documentation)

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings
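As a side note on the last bullet: a plain dot product only equals cosine similarity when both embedding vectors are unit-length, which is why the updated examples compute cosine similarity explicitly. A minimal TypeScript sketch (not part of the commit) of the difference:

```typescript
// Minimal sketch: dot product vs. explicit cosine similarity.
const dotProduct = (a: number[], b: number[]): number =>
  a.reduce((sum, val, i) => sum + val * b[i], 0);

const cosineSimilarity = (a: number[], b: number[]): number =>
  dotProduct(a, b) /
  (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));

// For unit-length (normalized) vectors the two agree:
console.log(dotProduct([0.6, 0.8], [1, 0])); // 0.6
console.log(cosineSimilarity([0.6, 0.8], [1, 0])); // 0.6

// For unnormalized vectors only cosine similarity stays in [-1, 1]:
console.log(dotProduct([6, 8], [1, 0])); // 6
console.log(cosineSimilarity([6, 8], [1, 0])); // 0.6
```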
1 parent 87e8371 commit 274703b

3 files changed

Lines changed: 51 additions & 41 deletions


docs/docs/02-hooks/01-natural-language-processing/useTextEmbeddings.md

Lines changed: 25 additions & 15 deletions
@@ -66,10 +66,6 @@ A string that specifies the location of the tokenizer JSON file.
 
 To run the model, you can use the `forward` method. It accepts one argument, which is a string representing the text you want to embed. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.
 
-:::info
-The returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
-:::
-
 ## Example
 
 ```typescript
@@ -82,6 +78,13 @@ import {
 const dotProduct = (a: number[], b: number[]) =>
   a.reduce((sum, val, i) => sum + val * b[i], 0);
 
+const cosineSimilarity = (a: number[], b: number[]) => {
+  const dot = dotProduct(a, b);
+  const normA = Math.sqrt(dotProduct(a, a));
+  const normB = Math.sqrt(dotProduct(b, b));
+  return dot / (normA * normB);
+};
+
 function App() {
   const model = useTextEmbeddings({
     modelSource: ALL_MINILM_L6_V2,
@@ -94,8 +97,10 @@ function App() {
       const helloWorldEmbedding = await model.forward('Hello World!');
       const goodMorningEmbedding = await model.forward('Good Morning!');
 
-      // The embeddings are normalized, so we can use dot product to calculate cosine similarity
-      const similarity = dotProduct(helloWorldEmbedding, goodMorningEmbedding);
+      const similarity = cosineSimilarity(
+        helloWorldEmbedding,
+        goodMorningEmbedding
+      );
 
       console.log(`Cosine similarity: ${similarity}`);
     } catch (error) {
@@ -108,17 +113,22 @@ function App() {
 
 ## Supported models
 
-| Model | Language | Max Tokens | Embedding Dimensions | Description |
-| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 256 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 384 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 511 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
-| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 512 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| Model | Language | Max Tokens | Embedding Dimensions | Description |
+| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the text encoder part of the CLIP model. To embed images checkout [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |
 
 **`Max Tokens`** - the maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.
 
 **`Embedding Dimensions`** - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input text.
 
+:::info
+For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
+:::
+
 ## Benchmarks
 
 ### Model size
@@ -129,7 +139,7 @@ function App() {
 | ALL_MPNET_BASE_V2 | 438 |
 | MULTI_QA_MINILM_L6_COS_V1 | 91 |
 | MULTI_QA_MPNET_BASE_DOT_V1 | 438 |
-| CLIP_TEXT_ENCODER | 254 |
+| CLIP_VIT_BASE_PATCH32_TEXT | 254 |
 
 ### Memory usage
 
@@ -139,7 +149,7 @@ function App() {
 | ALL_MPNET_BASE_V2 | 520 | 470 |
 | MULTI_QA_MINILM_L6_COS_V1 | 160 | 225 |
 | MULTI_QA_MPNET_BASE_DOT_V1 | 540 | 500 |
-| CLIP_TEXT_ENCODER | 275 | 250 |
+| CLIP_VIT_BASE_PATCH32_TEXT | 275 | 250 |
 
 ### Inference time
 
@@ -153,4 +163,4 @@ Times presented in the tables are measured as consecutive runs of the model. Ini
 | ALL_MPNET_BASE_V2 | 352 | 423 | 478 | 521 | 527 |
 | MULTI_QA_MINILM_L6_COS_V1 | 135 | 166 | 180 | 158 | 165 |
 | MULTI_QA_MPNET_BASE_DOT_V1 | 503 | 598 | 680 | 694 | 743 |
-| CLIP_TEXT_ENCODER | 35 | 48 | 49 | 40 | - |
+| CLIP_VIT_BASE_PATCH32_TEXT | 35 | 48 | 49 | 40 | - |
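The multi-qa rows in the table above are described as tuned for semantic search, i.e. ranking passages against a query by embedding similarity. A rough sketch of that pattern with the hook (not taken from the changed docs; the `tokenizerSource` prop name and its value are assumptions to verify against the library):

```typescript
// Hypothetical semantic-search sketch. useTextEmbeddings, forward() and
// ALL_MINILM_L6_V2 appear in the docs above; tokenizerSource is an assumed
// prop name for the tokenizer JSON file the hook requires.
import { useTextEmbeddings, ALL_MINILM_L6_V2 } from 'react-native-executorch';

const cosineSimilarity = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (normA * normB);
};

function usePassageSearch(tokenizerSource: string, passages: string[]) {
  const model = useTextEmbeddings({
    modelSource: ALL_MINILM_L6_V2,
    tokenizerSource, // assumed prop name
  });

  // Rank all passages by cosine similarity to the query, best match first.
  const search = async (query: string) => {
    const queryEmbedding = await model.forward(query);
    const scored: { passage: string; score: number }[] = [];
    for (const passage of passages) {
      const passageEmbedding = await model.forward(passage);
      scored.push({
        passage,
        score: cosineSimilarity(queryEmbedding, passageEmbedding),
      });
    }
    return scored.sort((a, b) => b.score - a.score);
  };

  return search;
}
```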

docs/docs/02-hooks/02-computer-vision/useImageEmbeddings.md

Lines changed: 26 additions & 22 deletions
@@ -27,12 +27,10 @@ It is recommended to use models provided by us, which are available at our [Hugg
 ```typescript
 import {
   useImageEmbeddings,
-  CLIP_VIT_BASE_PATCH_32_IMAGE_ENCODER_MODEL,
+  CLIP_VIT_BASE_PATCH32_IMAGE,
 } from 'react-native-executorch';
 
-const model = useImageEmbeddings({
-  modelSource: CLIP_VIT_BASE_PATCH_32_IMAGE_ENCODER_MODEL,
-});
+const model = useImageEmbeddings(CLIP_VIT_BASE_PATCH32_IMAGE);
 
 try {
   const imageEmbedding = await model.forward('https://url-to-image.jpg');
@@ -62,23 +60,25 @@ A string that specifies the location of the model binary. For more information,
 
 To run the model, you can use the `forward` method. It accepts one argument which is a URI/URL to an image you want to encode. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.
 
-:::info
-The returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
-:::
-
 ## Example
 
 ```typescript
 const dotProduct = (a: number[], b: number[]) =>
   a.reduce((sum, val, i) => sum + val * b[i], 0);
 
+const cosineSimilarity = (a: number[], b: number[]) => {
+  const dot = dotProduct(a, b);
+  const normA = Math.sqrt(dotProduct(a, a));
+  const normB = Math.sqrt(dotProduct(b, b));
+  return dot / (normA * normB);
+};
+
 try {
   // we assume you've provided catImage and dogImage
   const catImageEmbedding = await model.forward(catImage);
   const dogImageEmbedding = await model.forward(dogImage);
 
-  // The embeddings are normalized, so we can use dot product to calculate cosine similarity
-  const similarity = dotProduct(catImageEmbedding, dogImageEmbedding);
+  const similarity = cosineSimilarity(catImageEmbedding, dogImageEmbedding);
 
   console.log(`Cosine similarity: ${similarity}`);
 } catch (error) {
@@ -88,34 +88,38 @@ try {
 
 ## Supported models
 
-| Model | Language | Image size | Embedding Dimensions | Description |
-| ------------------------------------------------------------------------------------------ | :------: | :--------: | :------------------: | -------------------------------------------------------------- |
-| [clip-vit-base-patch32-image-encoder](https://huggingface.co/openai/clip-vit-base-patch32) | English | 224 x 224 | 512 | Trained using contrastive learning for image search use cases. |
+| Model | Language | Image size | Embedding Dimensions | Description |
+| ---------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [clip-vit-base-patch32-image](https://huggingface.co/openai/clip-vit-base-patch32) | English | 224 x 224 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the image encoder part of the CLIP model. To embed text checkout [clip-vit-base-patch32-text](../01-natural-language-processing/useTextEmbeddings.md#supported-models). |
 
 **`Image size`** - the size of an image that the model takes as an input. Resize will happen automatically.
 
 **`Embedding Dimensions`** - the size of the output embedding vector. This is the number of dimensions in the vector representation of the input image.
 
+:::info
+For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
+:::
+
 ## Benchmarks
 
 ### Model size
 
-| Model | XNNPACK [MB] |
-| ---------------------- | :----------: |
-| CLIP_VIT_BASE_PATCH_32 | 352 |
+| Model | XNNPACK [MB] |
+| --------------------------- | :----------: |
+| CLIP_VIT_BASE_PATCH32_IMAGE | 352 |
 
 ### Memory usage
 
-| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
-| ---------------------- | :--------------------: | :----------------: |
-| CLIP_VIT_BASE_PATCH_32 | 324 | 347 |
+| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
+| --------------------------- | :--------------------: | :----------------: |
+| CLIP_VIT_BASE_PATCH32_IMAGE | 324 | 347 |
 
 ### Inference time
 
 :::warning warning
 Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization. Performance also heavily depends on image size, because resize is expansive operation, especially on low-end devices.
 :::
 
-| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
-| ---------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: |
-| CLIP_VIT_BASE_PATCH_32 | 104 | 120 | 280 | 265 |
+| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
+| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: |
+| CLIP_VIT_BASE_PATCH32_IMAGE | 104 | 120 | 280 | 265 |
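The new CLIP rows in both tables describe the two encoders as projecting into one shared 512-dimensional space, which is what makes text-to-image search possible. A hedged sketch of that idea (not part of the commit; the text-encoder model and tokenizer sources are passed in as parameters because their exported constant names are not shown in this diff):

```typescript
// Hypothetical text-to-image search combining the two CLIP encoders.
// useTextEmbeddings, useImageEmbeddings and CLIP_VIT_BASE_PATCH32_IMAGE come
// from the docs above; clipTextModelSource / clipTokenizerSource stand in for
// the clip-vit-base-patch32-text binary and its tokenizer JSON.
import {
  useTextEmbeddings,
  useImageEmbeddings,
  CLIP_VIT_BASE_PATCH32_IMAGE,
} from 'react-native-executorch';

const cosineSimilarity = (a: number[], b: number[]) => {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (normA * normB);
};

function useImageSearch(
  clipTextModelSource: string, // assumed: clip-vit-base-patch32-text model
  clipTokenizerSource: string, // assumed: matching tokenizer JSON
  imageUris: string[]
) {
  const textModel = useTextEmbeddings({
    modelSource: clipTextModelSource,
    tokenizerSource: clipTokenizerSource, // assumed prop name
  });
  const imageModel = useImageEmbeddings(CLIP_VIT_BASE_PATCH32_IMAGE);

  // Embed the query once, then score every image in the shared vector space.
  const findBestMatch = async (query: string) => {
    const queryEmbedding = await textModel.forward(query);
    let best = { uri: '', score: -Infinity };
    for (const uri of imageUris) {
      const imageEmbedding = await imageModel.forward(uri);
      const score = cosineSimilarity(queryEmbedding, imageEmbedding);
      if (score > best.score) best = { uri, score };
    }
    return best;
  };

  return findBestMatch;
}
```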

docs/docs/03-typescript-api/02-computer-vision/ImageEmbeddingsModule.md

Lines changed: 0 additions & 4 deletions
@@ -45,7 +45,3 @@ To load the model, use the `load` method. It accepts the `modelSource` which is
 ## Running the model
 
 It accepts one argument, which is a URI/URL to an image you want to encode. The function returns a promise, which can resolve either to an error or an array of numbers representing the embedding.
-
-:::info
-The returned embedding vector is normalized, meaning that its length is equal to 1. This allows for easier comparison of vectors using cosine similarity, just calculate the dot product of two vectors to get the cosine similarity score.
-:::
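The module doc above only names a `load` method that takes the `modelSource` and a `forward` call that takes an image URI. A heavily hedged sketch of direct (non-hook) usage follows; whether the module is instantiated with `new` or used statically, and which source constant applies, are assumptions to check against the TypeScript API reference:

```typescript
// Hedged sketch of the TypeScript API described above. Only load() and
// forward() are named in the doc; instantiating with `new` and reusing
// CLIP_VIT_BASE_PATCH32_IMAGE as the modelSource are assumptions.
import {
  ImageEmbeddingsModule,
  CLIP_VIT_BASE_PATCH32_IMAGE,
} from 'react-native-executorch';

const embedImage = async (imageUri: string): Promise<number[]> => {
  const module = new ImageEmbeddingsModule(); // assumed instance-based API
  await module.load(CLIP_VIT_BASE_PATCH32_IMAGE); // load() takes the modelSource
  return module.forward(imageUri); // resolves to the embedding array
};
```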
