
Commit 35e33b3

feat: add PARAPHRASE_MULTILINGUAL_MINILM_L12_V2 text embeddings model (#1115)
## Description

Adds the `paraphrase-multilingual-MiniLM-L12-v2` sentence-transformer model, the second multilingual embeddings model after distiluse, completing #945. Ships **only the XNNPACK 8da4w variant** under `MODEL_REGISTRY.ALL_MODELS` (see "Why a single variant" below). 384-d output, max 126 tokens, 50+ languages.

The tokenizer is Unigram + Precompiled normalizer + Metaspace decoder and **requires the bumped `pytorch/extension/llm/tokenizers` runtime from #1114**, so this PR blocks on that landing first and should be rebased onto main once #1114 merges.

HF repo: [software-mansion/react-native-executorch-paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/software-mansion/react-native-executorch-paraphrase-multilingual-MiniLM-L12-v2) (`v0.9.0` tag; layout mirrors distiluse).

**Why a single variant:** TL;DR: the 8da4w variant was faster than all the others and among the smallest, with no loss in precision.

Longer answer: unlike distiluse, where Core ML fp32 won on iPhone thanks to ANE acceleration, benchmarks on iPhone 17 Pro + OnePlus 12 (~80-token input, 50 measured forwards after 3 warmups) showed that the XNNPACK 8da4w variant Pareto-dominates the other three on both platforms: it is faster than XNNPACK fp32, Core ML fp32, *and* Core ML fp16 on iPhone, and has a ~36% smaller steady-state memory footprint than the next-best variant. Likely cause: paraphrase-multilingual-MiniLM-L12-v2 is a smaller model (~118 M params, 12 layers), and Core ML's runtime doesn't push enough of its work onto the ANE for the precision-conversion overhead to pay off. fp16 being slower than fp32 on Core ML for this model is a tell that the runtime is falling back to slower compute units. Shipping only `_8DA4W` keeps the public surface aligned with the data; if a future Core ML or model update flips the verdict, it is easy to add the other variants back.

**Memory methodology note:** the new paraphrase row in `docs/docs/02-benchmarks/memory-usage.md` reports RSS / `phys_footprint` deltas from a clean app baseline (loaded − idle), captured on-device at the same conceptual point. The existing distiluse rows there (36 / 44 MB) come from an older measurement pass with a different methodology (not reconstructable from the diff), so the two rows are not directly comparable. Re-measuring distiluse and the other rows with the same methodology would be a good follow-up.

### Introduces a breaking change?

- [ ] Yes
- [x] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [ ] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [x] Android

### Testing instructions

1. `cd apps/text-embeddings && npx expo run:ios` (or `run:android`).
2. Pick **"Multilingual Paraphrase (8da4w)"** in the model picker.
3. Add a sentence in one language, then query with an aligned sentence in another (e.g. the Polish "Słoneczko" against "It's so sunny outside!"). The cross-lingual pair should top the matches.

### Related issues

Closes the paraphrase-multilingual half of #945 (the distiluse half landed in #1098).

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes

Blocks on #1114.
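The cross-lingual matching in testing step 3 boils down to ranking stored sentences by dot product against the query embedding (the demo app imports a `dotProduct` helper from `utils/math`). A minimal standalone sketch of that ranking, with toy 3-d vectors standing in for the model's 384-d unit-normalized embeddings (the numbers are illustrative, not real model output):

```typescript
// On unit-normalized embeddings, dot product equals cosine similarity.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Rank stored entries by similarity to the query embedding, best first.
function rankMatches(
  query: number[],
  entries: { text: string; embedding: number[] }[]
): { text: string; score: number }[] {
  return entries
    .map((e) => ({ text: e.text, score: dotProduct(query, e.embedding) }))
    .sort((a, b) => b.score - a.score);
}

// Toy vectors: an aligned Polish/English pair should top the list.
const query = [0.8, 0.6, 0.0]; // "It's so sunny outside!"
const entries = [
  { text: 'Słoneczko', embedding: [0.78, 0.62, 0.05] },
  { text: 'Pada deszcz', embedding: [0.1, 0.2, 0.97] },
];
console.log(rankMatches(query, entries)[0].text); // → Słoneczko
```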
1 parent e7b7529 commit 35e33b3

7 files changed

Lines changed: 57 additions & 35 deletions


apps/text-embeddings/app/text-embeddings/index.tsx

Lines changed: 5 additions & 0 deletions
```diff
@@ -20,6 +20,7 @@ import {
   MULTI_QA_MPNET_BASE_DOT_V1,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
+  PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED,
   TextEmbeddingsProps,
 } from 'react-native-executorch';

@@ -38,6 +39,10 @@ const MODELS: { label: string; value: TextEmbeddingModel }[] = [
     label: 'Multilingual DistilUSE (CoreML)',
     value: DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
   },
+  {
+    label: 'Multilingual Paraphrase (8da4w)',
+    value: PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED,
+  },
 ];
 import { useIsFocused } from '@react-navigation/native';
 import { dotProduct } from '../../utils/math';
```

docs/docs/02-benchmarks/inference-time.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -180,15 +180,16 @@ Average time to synthesize speech from an input text of approximately 60 tokens,
 Benchmark times for text embeddings are highly dependent on the sentence length. The numbers below are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
 :::

-| Model / Device                                       | iPhone 17 Pro [ms] | OnePlus 12 [ms] |
-| ---------------------------------------------------- | :----------------: | :-------------: |
-| ALL_MINILM_L6_V2 (XNNPACK)                           | 7                  | 21              |
-| ALL_MPNET_BASE_V2 (XNNPACK)                          | 24                 | 90              |
-| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                  | 7                  | 19              |
-| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                 | 24                 | 88              |
-| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                 | 14                 | 39              |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w) | 16                 | 15              |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)  | 15                 | -               |
+| Model / Device                                        | iPhone 17 Pro [ms] | OnePlus 12 [ms] |
+| ----------------------------------------------------- | :----------------: | :-------------: |
+| ALL_MINILM_L6_V2 (XNNPACK)                            | 7                  | 21              |
+| ALL_MPNET_BASE_V2 (XNNPACK)                           | 24                 | 90              |
+| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                   | 7                  | 19              |
+| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                  | 24                 | 88              |
+| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                  | 14                 | 39              |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w)  | 16                 | 15              |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)   | 15                 | -               |
+| PARAPHRASE_MULTILINGUAL_MINILM_L12_V2 (XNNPACK 8da4w) | 14                 | 15              |

 ## Image Embeddings
```
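The timings above follow the methodology from the PR description: 3 warmup forwards, then an average over 50 measured forwards. A hedged sketch of such a loop; `forward` is a stand-in for a single model inference call, not a library API:

```typescript
// Warmup-then-measure benchmarking loop, as described in the PR body.
// `forward` stands in for one model inference; nothing here is library API.
async function benchmarkMs(
  forward: () => Promise<void>,
  warmups = 3,
  runs = 50
): Promise<number> {
  // Warmup forwards let caches and delegate initialization settle.
  for (let i = 0; i < warmups; i++) {
    await forward();
  }
  // Time each measured forward individually, then average.
  let totalMs = 0;
  for (let i = 0; i < runs; i++) {
    const start = Date.now();
    await forward();
    totalMs += Date.now() - start;
  }
  return totalMs / runs;
}

// Toy usage with a no-op "model"; logs the average latency in ms.
benchmarkMs(async () => {}, 1, 5).then((ms) => console.log(ms.toFixed(1)));
```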
docs/docs/02-benchmarks/memory-usage.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -98,15 +98,16 @@ The reported memory usage values include the memory footprint of the Phonemis pa

 ## Text Embeddings

-| Model / Device                                       | iPhone 17 Pro [MB] | OnePlus 12 [MB] |
-| ---------------------------------------------------- | :----------------: | :-------------: |
-| ALL_MINILM_L6_V2 (XNNPACK)                           | 110                | 95              |
-| ALL_MPNET_BASE_V2 (XNNPACK)                          | 455                | 405             |
-| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                  | 140                | 120             |
-| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                 | 455                | 435             |
-| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                 | 280                | 200             |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w) | 36                 | 44              |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)  | 55                 | -               |
+| Model / Device                                        | iPhone 17 Pro [MB] | OnePlus 12 [MB] |
+| ----------------------------------------------------- | :----------------: | :-------------: |
+| ALL_MINILM_L6_V2 (XNNPACK)                            | 110                | 95              |
+| ALL_MPNET_BASE_V2 (XNNPACK)                           | 455                | 405             |
+| MULTI_QA_MINILM_L6_COS_V1 (XNNPACK)                   | 140                | 120             |
+| MULTI_QA_MPNET_BASE_DOT_V1 (XNNPACK)                  | 455                | 435             |
+| CLIP_VIT_BASE_PATCH32_TEXT (XNNPACK)                  | 280                | 200             |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (XNNPACK 8da4w)  | 36                 | 44              |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2 (Core ML FP32)   | 55                 | -               |
+| PARAPHRASE_MULTILINGUAL_MINILM_L12_V2 (XNNPACK 8da4w) | 131                | 141             |

 ## Image Embeddings
```
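Per the PR's methodology note, the new paraphrase row was measured as a delta from a clean app baseline (loaded − idle). A minimal sketch of that subtraction; the field names and sample values are illustrative, not a real profiler API:

```typescript
// Illustrative only: compute the loaded-minus-idle memory delta described in
// the PR's methodology note. Field names are made up for this sketch.
interface FootprintSample {
  rssMB: number; // resident set size (Android-style metric)
  physFootprintMB: number; // phys_footprint (iOS-style metric)
}

function memoryDelta(
  idle: FootprintSample,
  loaded: FootprintSample
): FootprintSample {
  return {
    rssMB: loaded.rssMB - idle.rssMB,
    physFootprintMB: loaded.physFootprintMB - idle.physFootprintMB,
  };
}

// Hypothetical samples producing a 131 MB delta like the paraphrase iOS row.
const delta = memoryDelta(
  { rssMB: 140, physFootprintMB: 120 },
  { rssMB: 275, physFootprintMB: 251 }
);
console.log(delta.physFootprintMB); // → 131
```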
docs/docs/02-benchmarks/model-size.md

Lines changed: 10 additions & 9 deletions
```diff
@@ -119,15 +119,16 @@ title: Model Size

 ## Text Embeddings

-| Model                                       | Size [MB] |
-| ------------------------------------------- | :-------: |
-| ALL_MINILM_L6_V2                            | 91        |
-| ALL_MPNET_BASE_V2                           | 438       |
-| MULTI_QA_MINILM_L6_COS_V1                   | 91        |
-| MULTI_QA_MPNET_BASE_DOT_V1                  | 438       |
-| CLIP_VIT_BASE_PATCH32_TEXT                  | 254       |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W  | 393       |
-| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML | 541       |
+| Model                                           | Size [MB] |
+| ----------------------------------------------- | :-------: |
+| ALL_MINILM_L6_V2                                | 91        |
+| ALL_MPNET_BASE_V2                               | 438       |
+| MULTI_QA_MINILM_L6_COS_V1                       | 91        |
+| MULTI_QA_MPNET_BASE_DOT_V1                      | 438       |
+| CLIP_VIT_BASE_PATCH32_TEXT                      | 254       |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W      | 393       |
+| DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML     | 541       |
+| PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED | 397       |

 ## Image Embeddings
```
docs/docs/03-hooks/01-natural-language-processing/useTextEmbeddings.md

Lines changed: 9 additions & 8 deletions
```diff
@@ -101,14 +101,15 @@ function App() {

 ## Supported models

-| Model | Language | Max Tokens | Embedding Dimensions | Description |
-| ----- | :------: | :--------: | :------------------: | ----------- |
-| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
-| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
-| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 50+ languages | 126 | 512 | Multilingual DistilBERT with a 768→512 projection head. Recommended when broader language coverage matters more than the exact English quality of MiniLM/MPNet. |
-| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the text encoder part of the CLIP model. To embed images checkout [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |
+| Model | Language | Max Tokens | Embedding Dimensions | Description |
+| ----- | :------: | :--------: | :------------------: | ----------- |
+| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | English | 254 | 384 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | English | 382 | 768 | All-round model tuned for many use-cases. Trained on a large and diverse dataset of over 1 billion training pairs. |
+| [multi-qa-MiniLM-L6-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) | English | 509 | 384 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [multi-qa-mpnet-base-dot-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) | English | 510 | 768 | This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs. |
+| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 50+ languages | 126 | 512 | Multilingual DistilBERT with a 768→512 projection head. Recommended when broader language coverage matters more than the exact English quality of MiniLM/MPNet. |
+| [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) | 50+ languages | 126 | 384 | Multilingual MiniLM-L12 distilled from paraphrase-multilingual-mpnet-base-v2. Compact (≈118 M params) sentence encoder for cross-lingual semantic similarity and retrieval across 50+ languages. |
+| [clip-vit-base-patch32-text](https://huggingface.co/openai/clip-vit-base-patch32) | English | 74 | 512 | CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. CLIP allows to embed images and text into the same vector space. This allows to find similar images as well as to implement image search. This is the text encoder part of the CLIP model. To embed images checkout [clip-vit-base-patch32-image](../02-computer-vision/useImageEmbeddings.md#supported-models). |

 **`Max Tokens`** - The maximum number of tokens that can be processed by the model. If the input text exceeds this limit, it will be truncated.
```
packages/react-native-executorch/src/constants/modelUrls.ts

Lines changed: 12 additions & 0 deletions
```diff
@@ -1102,6 +1102,8 @@ const MULTI_QA_MPNET_BASE_DOT_V1_TOKENIZER = `${URL_PREFIX}-multi-qa-mpnet-base-
 const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W_MODEL = `${URL_PREFIX}-distiluse-base-multilingual-cased-v2/${NEXT_VERSION_TAG}/xnnpack/distiluse-base-multilingual-cased-v2_xnnpack_8da4w.pte`;
 const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML_MODEL = `${URL_PREFIX}-distiluse-base-multilingual-cased-v2/${NEXT_VERSION_TAG}/coreml/distiluse-base-multilingual-cased-v2_coreml_fp32.pte`;
 const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_TOKENIZER = `${URL_PREFIX}-distiluse-base-multilingual-cased-v2/${NEXT_VERSION_TAG}/tokenizer.json`;
+const PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED_MODEL = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte`;
+const PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_TOKENIZER = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/tokenizer.json`;
 const CLIP_VIT_BASE_PATCH32_TEXT_MODEL = `${URL_PREFIX}-clip-vit-base-patch32/${VERSION_TAG}/xnnpack/clip_vit_base_patch32_text_xnnpack_fp32.pte`;
 const CLIP_VIT_BASE_PATCH32_TEXT_TOKENIZER = `${URL_PREFIX}-clip-vit-base-patch32/${VERSION_TAG}/tokenizer.json`;

@@ -1159,6 +1161,15 @@ export const DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML = {
   tokenizerSource: DISTILUSE_BASE_MULTILINGUAL_CASED_V2_TOKENIZER,
 } as const;

+/**
+ * @category Models - Text Embeddings
+ */
+export const PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED = {
+  modelName: 'paraphrase-multilingual-minilm-l12-v2-quantized',
+  modelSource: PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED_MODEL,
+  tokenizerSource: PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_TOKENIZER,
+} as const;
+
 /**
  * @category Models - Text Embeddings
  */

@@ -1349,6 +1360,7 @@ export const MODEL_REGISTRY = {
   MULTI_QA_MPNET_BASE_DOT_V1,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_8DA4W,
   DISTILUSE_BASE_MULTILINGUAL_CASED_V2_COREML,
+  PARAPHRASE_MULTILINGUAL_MINILM_L12_V2_QUANTIZED,
   CLIP_VIT_BASE_PATCH32_TEXT,
   BK_SDM_TINY_VPRED_512,
   BK_SDM_TINY_VPRED_256,
```
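The two new constants reuse the same URL-template pattern as the distiluse entries. A standalone sketch of that pattern; `URL_PREFIX` here is a placeholder value (the real one, like `NEXT_VERSION_TAG`, is defined elsewhere in `modelUrls.ts`):

```typescript
// Standalone sketch of the URL-template pattern from modelUrls.ts.
// URL_PREFIX is a placeholder here; NEXT_VERSION_TAG mirrors the HF repo
// tag mentioned in the PR body.
const URL_PREFIX = 'https://example.com/react-native-executorch'; // placeholder
const NEXT_VERSION_TAG = 'v0.9.0';

const MODEL_URL = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/xnnpack/paraphrase-multilingual-MiniLM-L12-v2_xnnpack_8da4w.pte`;
const TOKENIZER_URL = `${URL_PREFIX}-paraphrase-multilingual-MiniLM-L12-v2/${NEXT_VERSION_TAG}/tokenizer.json`;

// The backend directory ("xnnpack") and quantization suffix ("8da4w") are
// encoded directly in the model artifact path.
console.log(MODEL_URL.endsWith('_8da4w.pte')); // → true
console.log(TOKENIZER_URL.endsWith('tokenizer.json')); // → true
```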
packages/react-native-executorch/src/types/textEmbeddings.ts

Lines changed: 1 addition & 0 deletions
```diff
@@ -12,6 +12,7 @@ export type TextEmbeddingsModelName =
   | 'multi-qa-mpnet-base-dot-v1'
   | 'distiluse-base-multilingual-cased-v2-8da4w'
   | 'distiluse-base-multilingual-cased-v2-coreml'
+  | 'paraphrase-multilingual-minilm-l12-v2-quantized'
   | 'clip-vit-base-patch32-text';

 /**
```

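A practical consequence of extending the string-literal union: consumers who key maps by model name get a compile-time error until they add the new entry. A hedged sketch with a trimmed-down union and an illustrative label map (not library code):

```typescript
// Trimmed-down stand-in for the library's TextEmbeddingsModelName union,
// shown only to illustrate the exhaustiveness check below.
type TextEmbeddingsModelName =
  | 'distiluse-base-multilingual-cased-v2-8da4w'
  | 'paraphrase-multilingual-minilm-l12-v2-quantized';

// `satisfies` makes TypeScript reject this map if any union member lacks a
// label, so adding a model name forces every such map to be updated.
const LABELS = {
  'distiluse-base-multilingual-cased-v2-8da4w': 'Multilingual DistilUSE (8da4w)',
  'paraphrase-multilingual-minilm-l12-v2-quantized':
    'Multilingual Paraphrase (8da4w)',
} satisfies Record<TextEmbeddingsModelName, string>;

console.log(LABELS['paraphrase-multilingual-minilm-l12-v2-quantized']);
```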