
Commit 35e33b3

feat: add PARAPHRASE_MULTILINGUAL_MINILM_L12_V2 text embeddings model (#1115)
## Description

Adds the `paraphrase-multilingual-MiniLM-L12-v2` sentence-transformer model, the second multilingual embeddings model after distiluse, completing #945. Ships **only the XNNPACK 8da4w variant** under `MODEL_REGISTRY.ALL_MODELS` (see "Why a single variant" below). 384-d output, max 126 tokens, 50+ languages.

The tokenizer is Unigram + Precompiled normalizer + Metaspace decoder and **requires the bumped `pytorch/extension/llm/tokenizers` runtime from #1114**, so this PR blocks on that landing first and should be rebased onto main once #1114 merges.

HF repo: [software-mansion/react-native-executorch-paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/software-mansion/react-native-executorch-paraphrase-multilingual-MiniLM-L12-v2) (`v0.9.0` tag; layout mirrors distiluse).

**Why a single variant:** TL;DR: the 8da4w variant was faster than all the others and among the smallest, with no loss in precision.

Longer answer: unlike distiluse, where Core ML fp32 won on iPhone thanks to ANE acceleration, benchmarks on iPhone 17 Pro + OnePlus 12 (~80-token input, 50 measured forwards after 3 warmups) showed that the XNNPACK 8da4w variant Pareto-dominates the other three on both platforms: it is faster than XNNPACK fp32, Core ML fp32, *and* Core ML fp16 on iPhone, and has a ~36% smaller steady-state memory footprint than the next-best variant. Likely cause: paraphrase-multilingual-MiniLM-L12-v2 is a smaller model (~118 M params, 12 layers), and Core ML's runtime doesn't push enough of its work onto the ANE for the precision-conversion overhead to pay off. fp16 being slower than fp32 on Core ML for this model is a tell that the runtime is falling back to slower compute units. Shipping only `_8DA4W` keeps the public surface aligned with the data; if a future Core ML or model update flips the verdict, it is easy to add the other variants back.

**Memory methodology note:** the new paraphrase row in `docs/docs/02-benchmarks/memory-usage.md` reports RSS / `phys_footprint` deltas from a clean app baseline (loaded − idle), captured on-device at the same conceptual point. The existing distiluse rows there (36 / 44 MB) come from an older measurement pass with a different methodology (not reconstructable from the diff), so the two rows are not directly comparable. Re-measuring distiluse and the other rows with the same methodology would be a good follow-up.

### Introduces a breaking change?

- [ ] Yes
- [x] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [ ] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [x] Android

### Testing instructions

1. `cd apps/text-embeddings && npx expo run:ios` (or `run:android`).
2. Pick **"Multilingual Paraphrase (8da4w)"** in the model picker.
3. Add a sentence in one language, then query with an aligned sentence in another (e.g. the Polish "Słoneczko" against "It's so sunny outside!"). The cross-lingual pair should top the matches.

### Related issues

Closes the paraphrase-multilingual half of #945 (the distiluse half landed in #1098).

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes

Blocks on #1114.
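The cross-lingual matching in testing step 3 boils down to ranking stored sentences by dot product against the query embedding (the demo app imports a `dotProduct` helper from `utils/math`). A minimal standalone sketch of that ranking, with toy 3-d vectors standing in for the model's 384-d unit-normalized embeddings (the numbers are illustrative, not real model output):

```typescript
// On unit-normalized embeddings, dot product equals cosine similarity.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Rank stored entries by similarity to the query embedding, best first.
function rankMatches(
  query: number[],
  entries: { text: string; embedding: number[] }[]
): { text: string; score: number }[] {
  return entries
    .map((e) => ({ text: e.text, score: dotProduct(query, e.embedding) }))
    .sort((a, b) => b.score - a.score);
}

// Toy vectors: an aligned Polish/English pair should top the list.
const query = [0.8, 0.6, 0.0]; // "It's so sunny outside!"
const entries = [
  { text: 'Słoneczko', embedding: [0.78, 0.62, 0.05] },
  { text: 'Pada deszcz', embedding: [0.1, 0.2, 0.97] },
];
console.log(rankMatches(query, entries)[0].text); // → Słoneczko
```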
1 parent e7b7529 commit 35e33b3

7 files changed

Lines changed: 57 additions & 35 deletions


apps/text-embeddings/app/text-embeddings/index.tsx

Lines changed: 5 additions & 0 deletions
```diff
@@ -20,6 +20,7 @@ import {
   MULTI_QA_MPNET_BASE_DOT_V1,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
+  PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED,
   TextEmbeddingsProps,
 } from 'react-native-executorch';

@@ -38,6 +39,10 @@ const MODELS: { label: string; value: TextEmbeddingModel }[] = [
     label: 'Multilingual DistilUSE (CoreML)',
     value: DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
   },
+  {
+    label: 'Multilingual Paraphrase (8da4w)',
+    value: PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED,
+  },
 ];
 import { useIsFocused } from '@react-navigation/native';
 import { dotProduct } from '../../utils/math';
```

docs/docs/02-benchmarks/inference-time.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -180,15 +180,16 @@ Average time to synthesize speech from an input text of approximately 60 tokens,
 Benchmark times for text embeddings are highly dependent on the sentence length. The numbers below are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
 :::

-| Model / Device                                       | iPhone 17 Pro [ms] | OnePlus 12 [ms] |
-| ---------------------------------------------------- | :----------------: | :-------------: |
-| ALL_MINILM_L6_V2 (XNNPACK)                           | 7                  | 21              |
-| ALL_MPNET_BASE_V2 (XNNPACK)                          | 24                 | 90              |
-| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                  | 7                  | 19              |
-| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                 | 24                 | 88              |
-| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                 | 14                 | 39              |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w) | 16                 | 15              |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)  | 15                 | -               |
+| Model / Device                                        | iPhone 17 Pro [ms] | OnePlus 12 [ms] |
+| ----------------------------------------------------- | :----------------: | :-------------: |
+| ALL_MINILM_L6_V2 (XNNPACK)                            | 7                  | 21              |
+| ALL_MPNET_BASE_V2 (XNNPACK)                           | 24                 | 90              |
+| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                   | 7                  | 19              |
+| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                  | 24                 | 88              |
+| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                  | 14                 | 39              |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w)  | 16                 | 15              |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)   | 15                 | -               |
+| PARAPHRASE_MULTILINGUAL_MINILM_L12_V2 (XNNPACK 8da4w) | 14                 | 15              |

 ## Image Embeddings
```
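The timings above follow the methodology from the PR description: 3 warmup forwards, then an average over 50 measured forwards. A hedged sketch of such a loop; `forward` is a stand-in for a single model inference call, not a library API:

```typescript
// Warmup-then-measure benchmarking loop, as described in the PR body.
// `forward` stands in for one model inference; nothing here is library API.
async function benchmarkMs(
  forward: () => Promise<void>,
  warmups = 3,
  runs = 50
): Promise<number> {
  // Warmup forwards let caches and delegate initialization settle.
  for (let i = 0; i < warmups; i++) {
    await forward();
  }
  // Time each measured forward individually, then average.
  let totalMs = 0;
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await forward();
    totalMs += Date.now() - start;
  }
  return totalMs / runs;
}

// Toy usage with a no-op "model"; logs the average latency in ms.
benchmarkMs(async () => {}, 1, 5).then((ms) => console.log(ms.toFixed(1)));
```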
docs/docs/02-benchmarks/memory-usage.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -98,15 +98,16 @@ The reported memory usage values include the memory footprint of the Phonemis pa

 ## Text Embeddings

-| Model / Device                                       | iPhone 17 Pro [MB] | OnePlus 12 [MB] |
-| ---------------------------------------------------- | :----------------: | :-------------: |
-| ALL_MINILM_L6_V2 (XNNPACK)                           | 110                | 95              |
-| ALL_MPNET_BASE_V2 (XNNPACK)                          | 455                | 405             |
-| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                  | 140                | 120             |
-| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                 | 455                | 435             |
-| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                 | 280                | 200             |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w) | 36                 | 44              |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)  | 55                 | -               |
+| Model / Device                                        | iPhone 17 Pro [MB] | OnePlus 12 [MB] |
+| ----------------------------------------------------- | :----------------: | :-------------: |
+| ALL_MINILM_L6_V2 (XNNPACK)                            | 110                | 95              |
+| ALL_MPNET_BASE_V2 (XNNPACK)                           | 455                | 405             |
+| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                   | 140                | 120             |
+| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                  | 455                | 435             |
+| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                  | 280                | 200             |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w)  | 36                 | 44              |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)   | 55                 | -               |
+| PARAPHRASE_MULTILINGUAL_MINILM_L12_V2 (XNNPACK 8da4w) | 131                | 141             |

 ## Image Embeddings
```
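Per the PR's methodology note, the new paraphrase row was measured as a delta from a clean app baseline (loaded − idle). A minimal sketch of that subtraction; the field names and sample values are illustrative, not a real profiler API:

```typescript
// Illustrative only: compute the loaded-minus-idle memory delta described in
// the PR's methodology note. Field names are made up for this sketch.
interface FootprintSample {
  rssMB: number; // resident set size (Android-style metric)
  physFootprintMB: number; // phys_footprint (iOS-style metric)
}

function memoryDelta(
  idle: FootprintSample,
  loaded: FootprintSample
): FootprintSample {
  return {
    rssMB: loaded.rssMB - idle.rssMB,
    physFootprintMB: loaded.physFootprintMB - idle.physFootprintMB,
  };
}

// Hypothetical samples producing a 131 MB delta like the paraphrase iOS row.
const delta = memoryDelta(
  { rssMB: 140, physFootprintMB: 120 },
  { rssMB: 275, physFootprintMB: 251 }
);
console.log(delta.physFootprintMB); // → 131
```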
docs/docs/02-benchmarks/model-size.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -119,15 +119,16 @@ title: Model Size

 ## Text Embeddings

-| Model                                       | Size [MB] |
-| ------------------------------------------- | :-------: |
-| ALL_MINILM_L6_V2                            | 91        |
-| ALL_MPNET_BASE_V2                           | 438       |
-| MULTI_QA_MINILM_L6_COS_V1                   | 91        |
-| MULTI_QA_MPNET_BASE_DOT_V1                  | 438       |
-| CLIP_VIT_BASE_PATCH32_TEXT                  | 254       |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W  | 393       |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML | 541       |
+| Model                                           | Size [MB] |
+| ----------------------------------------------- | :-------: |
+| ALL_MINILM_L6_V2                                | 91        |
+| ALL_MPNET_BASE_V2                               | 438       |
+| MULTI_QA_MINILM_L6_COS_V1                       | 91        |
+| MULTI_QA_MPNET_BASE_DOT_V1                      | 438       |
+| CLIP_VIT_BASE_PATCH32_TEXT                      | 254       |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W      | 393       |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML     | 541       |
+| PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED | 397       |

 ## Image Embeddings
```
docs/docs/03-hooks/01-natural-language-processing/useTextEmbeddings.md

Lines changed: 9 additions & 8 deletions
```diff
@@ -101,14 +101,15 @@ function App() {

 ## Supported models

-| Model | Language | Max Tokens | Embedding Dimensions | Description |
-| ----- | :------: | :--------: | :------------------: | ----------- |
-| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
-| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 50+ languages | 126 | 512 | Multilingual DistilBERT with a 768→512 projection head. Recommended when broader language coverage matters more than the exact English quality of MiniLM/MPNet. |
-| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the text encoder part of the CLIP model. To embed images checkout [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |
+| Model | Language | Max Tokens | Embedding Dimensions | Description |
+| ----- | :------: | :--------: | :------------------: | ----------- |
+| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 50+ languages | 126 | 512 | Multilingual DistilBERT with a 768→512 projection head. Recommended when broader language coverage matters more than the exact English quality of MiniLM/MPNet. |
+| [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 50+ languages | 126 | 384 | Multilingual MiniLM-L12 distilled from paraphrase-multilingual-mpnet-base-v2. Compact (≈118 M params) sentence encoder for cross-lingual semantic similarity and retrieval across 50+ languages. |
+| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the text encoder part of the CLIP model. To embed images checkout [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |

 **`Max Tokens`** - The maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.
```
packages/react-native-executorch/src/constants/modelUrls.ts

Lines changed: 12 additions & 0 deletions
```diff
@@ -1102,6 +1102,8 @@ const MULTI_QA_MPNET_BASE_DOT_V1_TOKENIZER = `${URL_PREFIX}-multi-qa-mpnet-base-
 const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W_MODEL = `${URL_PREFIX}-distiluse-base-multilingual-cased-v2/${NEXT_VERSION_TAG}/xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte`;
 const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_MODEL = `${URL_PREFIX}-distiluse-base-multilingual-cased-v2/${NEXT_VERSION_TAG}/coreml/distiluse-base-multilingual-cased-v2_coreml_fp32.pte`;
 const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_TOKENIZER = `${URL_PREFIX}-distiluse-base-multilingual-cased-v2/${NEXT_VERSION_TAG}/tokenizer.json`;
+const PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED_MODEL = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte`;
+const PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_TOKENIZER = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/tokenizer.json`;
 const CLIP_VIT_BASE_PATCH32_TEXT_MODEL = `${URL_PREFIX}-clip-vit-base-patch32/${VERSION_TAG}/xnnpack/clip_vit_base_patch32_text_xnnpack_fp32.pte`;
 const CLIP_VIT_BASE_PATCH32_TEXT_TOKENIZER = `${URL_PREFIX}-clip-vit-base-patch32/${VERSION_TAG}/tokenizer.json`;

@@ -1159,6 +1161,15 @@ export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML = {
   tokenizerSource: DISTILUSE_BASE_MULTILINGUAL_CASED_V2_TOKENIZER,
 } as const;

+/**
+ * @category Models - Text Embeddings
+ */
+export const PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED = {
+  modelName: 'paraphrase-multilingual-minilm-l12-v2-quantized',
+  modelSource: PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED_MODEL,
+  tokenizerSource: PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_TOKENIZER,
+} as const;
+
 /**
  * @category Models - Text Embeddings
  */

@@ -1349,6 +1360,7 @@ export const MODEL_REGISTRY = {
   MULTI_QA_MPNET_BASE_DOT_V1,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
+  PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED,
   CLIP_VIT_BASE_PATCH32_TEXT,
   BK_SDM_TINY_VPRED_512,
   BK_SDM_TINY_VPRED_256,
```
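The two new constants reuse the same URL-template pattern as the distiluse entries. A standalone sketch of that pattern; `URL_PREFIX` here is a placeholder value (the real one, like `NEXT_VERSION_TAG`, is defined elsewhere in `modelUrls.ts`):

```typescript
// Standalone sketch of the URL-template pattern from modelUrls.ts.
// URL_PREFIX is a placeholder here; NEXT_VERSION_TAG mirrors the HF repo
// tag mentioned in the PR body.
const URL_PREFIX = 'https://example.com/react-native-executorch'; // placeholder
const NEXT_VERSION_TAG = 'v0.9.0';

const MODEL_URL = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte`;
const TOKENIZER_URL = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/tokenizer.json`;

// The backend directory ("xnnpack") and quantization suffix ("8da4w") are
// encoded directly in the model artifact path.
console.log(MODEL_URL.endsWith('_8da4w.pte')); // → true
console.log(TOKENIZER_URL.endsWith('tokenizer.json')); // → true
```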
packages/react-native-executorch/src/types/textEmbeddings.ts

Lines changed: 1 addition & 0 deletions
```diff
@@ -12,6 +12,7 @@ export type TextEmbeddingsModelName =
   | 'multi-qa-mpnet-base-dot-v1'
   | 'distiluse-base-multilingual-cased-v2-8da4w'
   | 'distiluse-base-multilingual-cased-v2-coreml'
+  | 'paraphrase-multilingual-minilm-l12-v2-quantized'
   | 'clip-vit-base-patch32-text';

 /**
```

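A practical consequence of extending the string-literal union: consumers who key maps by model name get a compile-time error until they add the new entry. A hedged sketch with a trimmed-down union and an illustrative label map (not library code):

```typescript
// Trimmed-down stand-in for the library's TextEmbeddingsModelName union,
// shown only to illustrate the exhaustiveness check below.
type TextEmbeddingsModelName =
  | 'distiluse-base-multilingual-cased-v2-8da4w'
  | 'paraphrase-multilingual-minilm-l12-v2-quantized';

// `satisfies` makes TypeScript reject this map if any union member lacks a
// label, so adding a model name forces every such map to be updated.
const LABELS = {
  'distiluse-base-multilingual-cased-v2-8da4w': 'Multilingual DistilUSE (8da4w)',
  'paraphrase-multilingual-minilm-l12-v2-quantized':
    'Multilingual Paraphrase (8da4w)',
} satisfies Record<TextEmbeddingsModelName, string>;

console.log(LABELS['paraphrase-multilingual-minilm-l12-v2-quantized']);
```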