Commit 52756c7
chore: add support for quantized versions of CV models CLIP, Style Transfer, EfficientNetV2, SSDLite (#940)
## Description

Adds support for quantized versions of the CV models CLIP, Style Transfer, EfficientNetV2, and SSDLite, and updates paths to non-quantized models exported with ExecuTorch v1.1.0.

### Introduces a breaking change?

- [ ] Yes
- [x] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [ ] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [x] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [x] Android

### Testing instructions

1. Run the Computer Vision example app:
   - Object detection with model set to:
     - `SSDLITE_320_MOBILENET_V3_LARGE`
   - Classification with model set to:
     - `EFFICIENTNET_V2_S`
     - `EFFICIENTNET_V2_S_QUANTIZED`
   - Style transfer with model set to:
     - `STYLE_TRANSFER_CANDY`
     - `STYLE_TRANSFER_MOSAIC`
     - `STYLE_TRANSFER_UDNIE`
     - `STYLE_TRANSFER_RAIN_PRINCESS`
     - `STYLE_TRANSFER_CANDY_QUANTIZED`
     - `STYLE_TRANSFER_MOSAIC_QUANTIZED`
     - `STYLE_TRANSFER_UDNIE_QUANTIZED`
     - `STYLE_TRANSFER_RAIN_PRINCESS_QUANTIZED`
2. Run the Text Embeddings example app:
   - CLIP embeddings with image model set to:
     - `CLIP_VIT_BASE_PATCH32_IMAGE`
     - `CLIP_VIT_BASE_PATCH32_IMAGE_QUANTIZED`
3. Check HF pages for updated models:
   - https://huggingface.co/software-mansion/react-native-executorch-style-transfer-candy
   - https://huggingface.co/software-mansion/react-native-executorch-style-transfer-mosaic
   - https://huggingface.co/software-mansion/react-native-executorch-style-transfer-rain-princess
   - https://huggingface.co/software-mansion/react-native-executorch-style-transfer-udnie
   - https://huggingface.co/software-mansion/react-native-executorch-efficientnet-v2-s
   - https://huggingface.co/software-mansion/react-native-executorch-ssdlite320-mobilenet-v3-large
   - https://huggingface.co/software-mansion/react-native-executorch-clip-vit-base-patch32

### Related issues

Closes #719

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings
1 parent 6dd8fb6 commit 52756c7

File tree

9 files changed: +335 −84 lines changed

apps/computer-vision/app/classification/index.tsx

Lines changed: 5 additions & 2 deletions

```diff
@@ -1,6 +1,9 @@
 import Spinner from '../../components/Spinner';
 import { getImage } from '../../utils';
-import { useClassification, EFFICIENTNET_V2_S } from 'react-native-executorch';
+import {
+  useClassification,
+  EFFICIENTNET_V2_S_QUANTIZED,
+} from 'react-native-executorch';
 import { View, StyleSheet, Image, Text, ScrollView } from 'react-native';
 import { BottomBar } from '../../components/BottomBar';
 import React, { useContext, useEffect, useState } from 'react';
@@ -13,7 +16,7 @@ export default function ClassificationScreen() {
   );
   const [imageUri, setImageUri] = useState('');
 
-  const model = useClassification({ model: EFFICIENTNET_V2_S });
+  const model = useClassification({ model: EFFICIENTNET_V2_S_QUANTIZED });
   const { setGlobalGenerating } = useContext(GeneratingContext);
   useEffect(() => {
     setGlobalGenerating(model.isGenerating);
```
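For context on what the classification screen does with the model's output: the hook yields per-class probabilities, and the app only needs the few highest-scoring labels. A minimal, framework-free sketch of that step (the `topK` helper and the label-to-probability map shape are illustrative assumptions, not the library's actual API):

```typescript
// Illustrative helper: pick the k highest-scoring labels from a
// label -> probability map (the map shape is an assumption for this sketch).
function topK(
  scores: Record<string, number>,
  k: number
): Array<{ label: string; score: number }> {
  return Object.entries(scores)
    .map(([label, score]) => ({ label, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const demo = { tabby: 0.72, 'tiger cat': 0.18, lynx: 0.1 };
console.log(topK(demo, 2)); // highest-probability labels first
```

Quantized and FP32 variants return the same kind of scores, so downstream handling like this is unchanged when switching model constants.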

apps/computer-vision/app/semantic_segmentation/index.tsx

Lines changed: 2 additions & 2 deletions

```diff
@@ -2,7 +2,7 @@ import Spinner from '../../components/Spinner';
 import { BottomBar } from '../../components/BottomBar';
 import { getImage } from '../../utils';
 import {
-  DEEPLAB_V3_RESNET50,
+  DEEPLAB_V3_MOBILENET_V3_LARGE_QUANTIZED,
   useSemanticSegmentation,
 } from 'react-native-executorch';
 import {
@@ -46,7 +46,7 @@ export default function SemanticSegmentationScreen() {
   const { setGlobalGenerating } = useContext(GeneratingContext);
   const { isReady, isGenerating, downloadProgress, forward } =
     useSemanticSegmentation({
-      model: DEEPLAB_V3_RESNET50,
+      model: DEEPLAB_V3_MOBILENET_V3_LARGE_QUANTIZED,
     });
   const [imageUri, setImageUri] = useState('');
   const [imageSize, setImageSize] = useState({ width: 0, height: 0 });
```
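For readers following the segmentation example: a segmentation model emits per-class scores for every pixel, and the final label map comes from a per-pixel argmax. A minimal sketch under assumed shapes (flat per-class score arrays; this is not the library's actual output format):

```typescript
// Illustrative per-pixel argmax: given one flat score array per class
// (all of equal length), return the winning class index for each pixel.
function argmaxLabelMap(classScores: number[][]): number[] {
  const numPixels = classScores[0].length;
  const labels = new Array<number>(numPixels);
  for (let p = 0; p < numPixels; p++) {
    let best = 0;
    for (let c = 1; c < classScores.length; c++) {
      if (classScores[c][p] > classScores[best][p]) best = c;
    }
    labels[p] = best;
  }
  return labels;
}

// 2 classes, 3 pixels: pixel 1 favours class 1, the others class 0.
console.log(argmaxLabelMap([[0.9, 0.2, 0.6], [0.1, 0.8, 0.4]])); // → [0, 1, 0]
```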

apps/computer-vision/app/style_transfer/index.tsx

Lines changed: 2 additions & 2 deletions

```diff
@@ -3,15 +3,15 @@ import { BottomBar } from '../../components/BottomBar';
 import { getImage } from '../../utils';
 import {
   useStyleTransfer,
-  STYLE_TRANSFER_CANDY,
+  STYLE_TRANSFER_CANDY_QUANTIZED,
 } from 'react-native-executorch';
 import { View, StyleSheet, Image } from 'react-native';
 import React, { useContext, useEffect, useState } from 'react';
 import { GeneratingContext } from '../../context';
 import ScreenWrapper from '../../ScreenWrapper';
 
 export default function StyleTransferScreen() {
-  const model = useStyleTransfer({ model: STYLE_TRANSFER_CANDY });
+  const model = useStyleTransfer({ model: STYLE_TRANSFER_CANDY_QUANTIZED });
   const { setGlobalGenerating } = useContext(GeneratingContext);
   useEffect(() => {
     setGlobalGenerating(model.isGenerating);
```

apps/text-embeddings/app/clip-embeddings/index.tsx

Lines changed: 4 additions & 2 deletions

```diff
@@ -15,7 +15,7 @@ import {
   useTextEmbeddings,
   useImageEmbeddings,
   CLIP_VIT_BASE_PATCH32_TEXT,
-  CLIP_VIT_BASE_PATCH32_IMAGE,
+  CLIP_VIT_BASE_PATCH32_IMAGE_QUANTIZED,
 } from 'react-native-executorch';
 import { launchImageLibrary } from 'react-native-image-picker';
 import { useIsFocused } from '@react-navigation/native';
@@ -29,7 +29,9 @@ export default function ClipEmbeddingsScreenWrapper() {
 
 function ClipEmbeddingsScreen() {
   const textModel = useTextEmbeddings({ model: CLIP_VIT_BASE_PATCH32_TEXT });
-  const imageModel = useImageEmbeddings({ model: CLIP_VIT_BASE_PATCH32_IMAGE });
+  const imageModel = useImageEmbeddings({
+    model: CLIP_VIT_BASE_PATCH32_IMAGE_QUANTIZED,
+  });
 
   const [inputSentence, setInputSentence] = useState('');
   const [sentencesWithEmbeddings, setSentencesWithEmbeddings] = useState<
```
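Since this screen pairs CLIP text and image embeddings, comparing them typically comes down to cosine similarity between the two vectors. A self-contained sketch of that comparison (the function name and plain-array inputs are assumptions for illustration; the hooks themselves produce the embeddings):

```typescript
// Illustrative cosine similarity between two embedding vectors:
// dot(a, b) / (|a| * |b|). Assumes equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // → 1 (identical direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // → 0 (orthogonal)
```

Because the quantized image encoder targets the same embedding space as the FP32 one, similarity scores should stay comparable after the model swap.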

docs/docs/02-benchmarks/inference-time.md

Lines changed: 105 additions & 22 deletions

```diff
@@ -3,29 +3,84 @@ title: Inference Time
 ---
 
 :::warning
-Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
+Times presented in the tables are measured as consecutive runs of the model.
+Initial run times may be up to 2x longer due to model loading and
+initialization.
 :::
 
 ## Classification
 
-| Model             | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
-| ----------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
-| EFFICIENTNET_V2_S | 64 | 68 | 217 | 205 | 198 |
+:::info
+Inference times are measured directly from native C++ code, wrapping only the
+model's forward pass, excluding input-dependent pre- and post-processing (e.g.
+image resizing, normalization) and any overhead from React Native runtime.
+:::
+
+:::info
+For this model all input images, whether larger or smaller, are resized before
+processing. Resizing is typically fast for small images but may be noticeably
+slower for very large images, which can increase total time.
+:::
+
+| Model / Device                   | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
+| :------------------------------- | :----------------: | :------------------: |
+| EFFICIENTNET_V2_S (XNNPACK FP32) | 70                 | 100                  |
+| EFFICIENTNET_V2_S (XNNPACK INT8) | 22                 | 38                   |
+| EFFICIENTNET_V2_S (Core ML FP32) | 12                 | -                    |
+| EFFICIENTNET_V2_S (Core ML FP16) | 5                  | -                    |
 
 ## Object Detection
 
-| Model                          | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
-| ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
-| SSDLITE_320_MOBILENET_V3_LARGE | 71 | 74 | 257 | 115 | 109 |
+:::info
+Inference times are measured directly from native C++ code, wrapping only the
+model's forward pass, excluding input-dependent pre- and post-processing (e.g.
+image resizing, normalization) and any overhead from React Native runtime.
+:::
+
+:::info
+For this model all input images, whether larger or smaller, are resized before
+processing. Resizing is typically fast for small images but may be noticeably
+slower for very large images, which can increase total time.
+:::
+
+| Model / Device                                | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
+| :-------------------------------------------- | :----------------: | :------------------: |
+| SSDLITE_320_MOBILENET_V3_LARGE (XNNPACK FP32) | 20                 | 18                   |
+| SSDLITE_320_MOBILENET_V3_LARGE (Core ML FP32) | 18                 | -                    |
+| SSDLITE_320_MOBILENET_V3_LARGE (Core ML FP16) | 8                  | -                    |
 
 ## Style Transfer
 
-| Model                        | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
-| ---------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
-| STYLE_TRANSFER_CANDY | 1400 | 1485 | 4255 | 2510 | 2355 |
-| STYLE_TRANSFER_MOSAIC | 1400 | 1485 | 4255 | 2510 | 2355 |
-| STYLE_TRANSFER_UDNIE | 1400 | 1485 | 4255 | 2510 | 2355 |
-| STYLE_TRANSFER_RAIN_PRINCESS | 1400 | 1485 | 4255 | 2510 | 2355 |
+:::info
+Inference times are measured directly from native C++ code, wrapping only the
+model's forward pass, excluding input-dependent pre- and post-processing (e.g.
+image resizing, normalization) and any overhead from React Native runtime.
+:::
+
+:::info
+For this model all input images, whether larger or smaller, are resized before
+processing. Resizing is typically fast for small images but may be noticeably
+slower for very large images, which can increase total time.
+:::
+
+| Model / Device                              | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
+| :------------------------------------------ | :----------------: | :------------------: |
+| STYLE_TRANSFER_CANDY (XNNPACK FP32)         | 1192               | 1025                 |
+| STYLE_TRANSFER_CANDY (XNNPACK INT8)         | 272                | 430                  |
+| STYLE_TRANSFER_CANDY (Core ML FP32)         | 100                | -                    |
+| STYLE_TRANSFER_CANDY (Core ML FP16)         | 150                | -                    |
+| STYLE_TRANSFER_MOSAIC (XNNPACK FP32)        | 1192               | 1025                 |
+| STYLE_TRANSFER_MOSAIC (XNNPACK INT8)        | 272                | 430                  |
+| STYLE_TRANSFER_MOSAIC (Core ML FP32)        | 100                | -                    |
+| STYLE_TRANSFER_MOSAIC (Core ML FP16)        | 150                | -                    |
+| STYLE_TRANSFER_UDNIE (XNNPACK FP32)         | 1192               | 1025                 |
+| STYLE_TRANSFER_UDNIE (XNNPACK INT8)         | 272                | 430                  |
+| STYLE_TRANSFER_UDNIE (Core ML FP32)         | 100                | -                    |
+| STYLE_TRANSFER_UDNIE (Core ML FP16)         | 150                | -                    |
+| STYLE_TRANSFER_RAIN_PRINCESS (XNNPACK FP32) | 1192               | 1025                 |
+| STYLE_TRANSFER_RAIN_PRINCESS (XNNPACK INT8) | 272                | 430                  |
+| STYLE_TRANSFER_RAIN_PRINCESS (Core ML FP32) | 100                | -                    |
+| STYLE_TRANSFER_RAIN_PRINCESS (Core ML FP16) | 150                | -                    |
 
 ## OCR
 
@@ -109,23 +164,51 @@ Benchmark times for text embeddings are highly dependent on the sentence length.
 
 ## Image Embeddings
 
-| Model                       | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
-| --------------------------- | :--------------------------: | :-----------------------: |
-| CLIP_VIT_BASE_PATCH32_IMAGE | 18 | 55 |
+:::info
+Inference times are measured directly from native C++ code, wrapping only the
+model's forward pass, excluding input-dependent pre- and post-processing (e.g.
+image resizing, normalization) and any overhead from React Native runtime.
+:::
 
 :::info
-Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
+For this model all input images, whether larger or smaller, are resized before
+processing. Resizing is typically fast for small images but may be noticeably
+slower for very large images, which can increase total time.
 :::
 
+| Model / Device                             | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
+| :----------------------------------------- | :----------------: | :------------------: |
+| CLIP_VIT_BASE_PATCH32_IMAGE (XNNPACK FP32) | 14                 | 68                   |
+| CLIP_VIT_BASE_PATCH32_IMAGE (XNNPACK INT8) | 11                 | 31                   |
+
 ## Semantic Segmentation
 
-:::warning
-Times presented in the tables are measured as consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
+:::info
+Inference times are measured directly from native C++ code, wrapping only the
+model's forward pass, excluding input-dependent pre- and post-processing (e.g.
+image resizing, normalization) and any overhead from React Native runtime.
+:::
+
+:::info
+For this model all input images, whether larger or smaller, are resized before
+processing. Resizing is typically fast for small images but may be noticeably
+slower for very large images, which can increase total time.
 :::
 
-| Model             | iPhone 16 Pro (Core ML) [ms] | iPhone 14 Pro Max (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
-| ----------------- | ---------------------------- | -------------------------------- | --------------------------------- |
-| DEELABV3_RESNET50 | 1000 | 670 | 700 |
+| Model / Device                               | iPhone 17 Pro [ms] | Google Pixel 10 [ms] |
+| :------------------------------------------- | :----------------: | :------------------: |
+| DEEPLAB_V3_RESNET50 (XNNPACK FP32)           | 2000               | 2200                 |
+| DEEPLAB_V3_RESNET50 (XNNPACK INT8)           | 118                | 380                  |
+| DEEPLAB_V3_RESNET101 (XNNPACK FP32)          | 2900               | 3300                 |
+| DEEPLAB_V3_RESNET101 (XNNPACK INT8)          | 174                | 660                  |
+| DEEPLAB_V3_MOBILENET_V3_LARGE (XNNPACK FP32) | 131                | 153                  |
+| DEEPLAB_V3_MOBILENET_V3_LARGE (XNNPACK INT8) | 17                 | 40                   |
+| LRASPP_MOBILENET_V3_LARGE (XNNPACK FP32)     | 13                 | 36                   |
+| LRASPP_MOBILENET_V3_LARGE (XNNPACK INT8)     | 12                 | 20                   |
+| FCN_RESNET50 (XNNPACK FP32)                  | 1800               | 2160                 |
+| FCN_RESNET50 (XNNPACK INT8)                  | 100                | 320                  |
+| FCN_RESNET101 (XNNPACK FP32)                 | 2600               | 3160                 |
+| FCN_RESNET101 (XNNPACK INT8)                 | 160                | 620                  |
 
 ## Text to image
 
```