docs/docs/02-hooks/01-natural-language-processing/useTextEmbeddings.md

## Supported models

| Model | Language | Max Tokens | Embedding Dimensions | Description |
| ----------------------------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP embeds images and text into the same vector space, which makes it possible to find similar images and to implement image search. This is the text encoder part of the CLIP model. To embed images, check out [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |

**`Max Tokens`** - the maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.
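
For context, here is a minimal usage sketch of the hook these tables describe. It assumes the library exposes a `useTextEmbeddings` hook (per this page's title) with `modelSource`, `isReady`, and `forward` members; those names, the package name, and the model constant are assumptions for illustration, not the confirmed API.

```tsx
import React, { useEffect } from 'react';
import { Text } from 'react-native';
// Assumed package name and exports -- check the library's install/API docs.
import { useTextEmbeddings, ALL_MINILM_L6_V2 } from 'react-native-executorch';

export function EmbeddingDemo() {
  // Assumed hook shape: loads the model and exposes `forward` once `isReady` is true.
  const { isReady, forward } = useTextEmbeddings({ modelSource: ALL_MINILM_L6_V2 });

  useEffect(() => {
    if (!isReady) return;
    // all-MiniLM-L6-v2: input beyond 254 tokens is truncated; output is 384-dim.
    forward('The quick brown fox jumps over the lazy dog.').then((embedding) => {
      console.log('embedding dims:', embedding.length); // expected: 384
    });
  }, [isReady, forward]);

  return <Text>{isReady ? 'Model ready' : 'Loading model…'}</Text>;
}
```
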
For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1.
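
Because the vectors are unit-length, cosine similarity reduces to a plain dot product; there is no need to divide by the norms. A minimal self-contained sketch in plain TypeScript (no library calls involved):

```ts
// For normalized (unit-length) embeddings, cosine similarity is just the dot product.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Toy unit vectors stand in for real 384- or 768-dim embeddings:
console.log(cosineSimilarity([1, 0, 0], [0.6, 0.8, 0])); // 0.6
```

Scores close to 1 indicate semantically similar inputs; scores near 0 indicate unrelated ones.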

### Memory usage

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| -------------------------- | :--------------------: | :----------------: |
| ALL_MINILM_L6_V2 | 85 | 100 |
| ALL_MPNET_BASE_V2 | 390 | 465 |
| MULTI_QA_MINILM_L6_COS_V1 | 115 | 130 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 415 | 490 |
| CLIP_VIT_BASE_PATCH32_TEXT | 195 | 250 |

### Inference time

:::warning warning
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
:::

| Model                      | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: |
| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 |
| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 |
| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 |
| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 |

:::info
Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
:::

docs/docs/02-hooks/02-computer-vision/useImageEmbeddings.md

## Supported models

| Model | Language | Image size | Embedding Dimensions | Description |
| ---------------------------------------------------------------------------------- | :------: | :--------: | :------------------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [clip-vit-base-patch32-image](https://huggingface.co/openai/clip-vit-base-patch32) | English | 224 x 224 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP embeds images and text into the same vector space, which makes it possible to find similar images and to implement image search. This is the image encoder part of the CLIP model. To embed text, check out [clip-vit-base-patch32-text](../01-natural-language-processing/useTextEmbeddings.md#supported-models). |

**`Image size`** - the size of the image that the model takes as input. Resizing happens automatically.
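
Because the text and image CLIP encoders map into the same 512-dimensional space, the dot product of their outputs scores how well an image matches a text query. Below is a sketch of that idea written against plain embedding functions; `embedImage` and `embedText` are placeholders for the `forward` functions the `useImageEmbeddings` and `useTextEmbeddings` hooks are assumed to return:

```ts
// Score an image against a text query. Both encoders are assumed to return
// normalized 512-dim vectors in the same embedding space.
async function imageTextScore(
  embedImage: (imageUri: string) => Promise<number[]>,
  embedText: (query: string) => Promise<number[]>,
  imageUri: string,
  query: string,
): Promise<number> {
  const [img, txt] = await Promise.all([embedImage(imageUri), embedText(query)]);
  // Dot product of unit vectors equals their cosine similarity.
  return img.reduce((sum, v, i) => sum + v * txt[i], 0);
}
```

Ranking a photo library by this score for a fixed query is the basic building block of on-device image search.
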
For the supported models, the returned embedding vector is normalized, meaning that its length is equal to 1.

### Memory usage

| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
| --------------------------- | :--------------------: | :----------------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 350 | 340 |

### Inference time

:::warning warning
Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization. Performance also depends heavily on image size, because resizing is an expensive operation, especially on low-end devices.
:::

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 |

:::info
Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
:::

docs/docs/04-benchmarks/inference-time.md

## Text Embeddings

| Model                      | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: |
| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 |
| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 |
| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 |
| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 |

:::info
Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
:::

## Image Embeddings

| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 |

:::info
Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
:::