
Commit de57e5d

docs: Update benchmarks for embeddings (#459)
## Description

Updated image and text embeddings benchmarks

### Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] Documentation update (improves or adds clarity to existing documentation)

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings
1 parent 536e1c4 commit de57e5d

5 files changed

Lines changed: 67 additions & 30 deletions


docs/docs/02-hooks/01-natural-language-processing/useTextEmbeddings.md

Lines changed: 20 additions & 16 deletions
@@ -113,12 +113,12 @@ function App() {
 
 ## Supported models
 
-| Model | Language | Max Tokens | Embedding Dimensions | Description |
-| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
-| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| Model | Language | Max Tokens | Embedding Dimensions | Description |
+| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
 | [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the text encoder part of the CLIP model. To embed images checkout [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |
 
 **`Max Tokens`** - the maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.
@@ -145,11 +145,11 @@ For the supported models, the returned embedding vector is normalized, meaning t
 
 | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
 | -------------------------- | :--------------------: | :----------------: |
-| ALL_MINILM_L6_V2 | 150 | 190 |
-| ALL_MPNET_BASE_V2 | 520 | 470 |
-| MULTI_QA_MINILM_L6_COS_V1 | 160 | 225 |
-| MULTI_QA_MPNET_BASE_DOT_V1 | 540 | 500 |
-| CLIP_VIT_BASE_PATCH32_TEXT | 275 | 250 |
+| ALL_MINILM_L6_V2 | 85 | 100 |
+| ALL_MPNET_BASE_V2 | 390 | 465 |
+| MULTI_QA_MINILM_L6_COS_V1 | 115 | 130 |
+| MULTI_QA_MPNET_BASE_DOT_V1 | 415 | 490 |
+| CLIP_VIT_BASE_PATCH32_TEXT | 195 | 250 |
 
 ### Inference time
 
@@ -159,8 +159,12 @@ Times presented in the tables are measured as consecutive runs of the model. Ini
 
 | Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) | OnePlus 12 (XNNPACK) [ms] |
 | -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: |
-| ALL_MINILM_L6_V2 | 53 | 69 | 78 | 60 | 65 |
-| ALL_MPNET_BASE_V2 | 352 | 423 | 478 | 521 | 527 |
-| MULTI_QA_MINILM_L6_COS_V1 | 135 | 166 | 180 | 158 | 165 |
-| MULTI_QA_MPNET_BASE_DOT_V1 | 503 | 598 | 680 | 694 | 743 |
-| CLIP_VIT_BASE_PATCH32_TEXT | 35 | 48 | 49 | 40 | - |
+| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 |
+| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 |
+| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 |
+| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 |
+| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 |
+
+:::info
+Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
+:::
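As the updated docs note, the supported text embedding models return normalized vectors, so cosine similarity reduces to a plain dot product. The TypeScript sketch below is illustrative and not part of this commit; the `useTextEmbeddings` hook is named in these docs, but the `forward` call shown in the comments is an assumed API.

```typescript
// Illustrative sketch, not part of this commit. Because the supported models
// return normalized embeddings, the dot product of two embeddings already
// equals their cosine similarity (no extra normalization step is needed).
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embeddings must have the same dimensionality');
  }
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Hypothetical usage with the useTextEmbeddings hook (method names are assumptions):
// const model = useTextEmbeddings({ modelSource: ALL_MINILM_L6_V2 });
// const query = await model.forward('What is the capital of France?');
// const passage = await model.forward('Paris is the capital of France.');
// console.log('cosine similarity:', dotProduct(query, passage));
```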

docs/docs/02-hooks/02-computer-vision/useImageEmbeddings.md

Lines changed: 10 additions & 6 deletions
@@ -88,8 +88,8 @@ try {
 
 ## Supported models
 
-| Model | Language | Image size | Embedding Dimensions | Description |
-| ---------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Model | Language | Image size | Embedding Dimensions | Description |
+| ---------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | [clip-vit-base-patch32-image](https://huggingface.co/openai/clip-vit-base-patch32) | English | 224 x 224 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the image encoder part of the CLIP model. To embed text checkout [clip-vit-base-patch32-text](../01-natural-language-processing/useTextEmbeddings.md#supported-models). |
 
 **`Image size`** - the size of an image that the model takes as an input. Resize will happen automatically.
@@ -112,14 +112,18 @@ For the supported models, the returned embedding vector is normalized, meaning t
 
 | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
 | --------------------------- | :--------------------: | :----------------: |
-| CLIP_VIT_BASE_PATCH32_IMAGE | 324 | 347 |
+| CLIP_VIT_BASE_PATCH32_IMAGE | 350 | 340 |
 
 ### Inference time
 
 :::warning warning
 Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization. Performance also heavily depends on image size, because resize is expansive operation, especially on low-end devices.
 :::
 
-| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
-| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: |
-| CLIP_VIT_BASE_PATCH32_IMAGE | 104 | 120 | 280 | 265 |
+| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
+| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
+| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 |
+
+:::info
+Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
+:::
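Because the CLIP text and image encoders documented above embed into the same 512-dimensional space, an image embedding can be ranked against candidate caption embeddings. The sketch below is illustrative and not part of this commit; the hook and method calls in the comments are assumptions.

```typescript
// Illustrative sketch, not part of this commit: rank candidate captions against
// an image embedding. With normalized CLIP embeddings, the dot product is the
// cosine similarity, so higher scores mean closer matches.
type ScoredCaption = { text: string; score: number };

function rankCaptions(
  imageEmbedding: number[],
  captions: { text: string; embedding: number[] }[]
): ScoredCaption[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, value, i) => sum + value * b[i], 0);
  return captions
    .map(({ text, embedding }) => ({ text, score: dot(imageEmbedding, embedding) }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical usage (hook and method names are assumptions):
// const imageModel = useImageEmbeddings({ modelSource: CLIP_VIT_BASE_PATCH32_IMAGE });
// const textModel = useTextEmbeddings({ modelSource: CLIP_VIT_BASE_PATCH32_TEXT });
// const imageEmbedding = await imageModel.forward(imageUri);
// const captions = await Promise.all(
//   ['a dog on a beach', 'a city skyline at night'].map(async (text) => ({
//     text,
//     embedding: await textModel.forward(text),
//   }))
// );
// console.log(rankCaptions(imageEmbedding, captions));
```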

docs/docs/04-benchmarks/inference-time.md

Lines changed: 19 additions & 4 deletions
@@ -103,7 +103,22 @@ Average time for decoding one token in sequence of 100 tokens, with encoding con
 
 | Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) | OnePlus 12 (XNNPACK) [ms] |
 | -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: |
-| ALL_MINILM_L6_V2 | 53 | 69 | 78 | 60 | 65 |
-| ALL_MPNET_BASE_V2 | 352 | 423 | 478 | 521 | 527 |
-| MULTI_QA_MINILM_L6_COS_V1 | 135 | 166 | 180 | 158 | 165 |
-| MULTI_QA_MPNET_BASE_DOT_V1 | 503 | 598 | 680 | 694 | 743 |
+| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 |
+| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 |
+| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 |
+| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 |
+| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 |
+
+:::info
+Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
+:::
+
+## Image Embeddings
+
+| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
+| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
+| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 |
+
+:::info
+Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
+:::
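The warnings in these docs state that the benchmark tables report consecutive runs and that the initial run can be up to 2x slower because of model loading and initialization. The sketch below (not part of this commit) shows one way such a measurement could be reproduced; it assumes only a generic async `run` callback standing in for any model forward call.

```typescript
// Illustrative sketch, not part of this commit: average the latency of consecutive
// runs while excluding the initial (cold) run, mirroring how the benchmark tables
// describe their measurements. `run` stands for any model forward call.
async function averageInferenceTimeMs(
  run: () => Promise<unknown>,
  iterations: number = 10
): Promise<number> {
  await run(); // warm-up run, excluded because loading/initialization skews it
  const times: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    times.push(performance.now() - start);
  }
  return times.reduce((sum, t) => sum + t, 0) / times.length;
}
```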
