react-native-executorch/docs/docs/02-benchmarks/inference-time.md at b5449534c2e48e102b1f12989924878b904b3026 · software-mansion/react-native-executorch

title	Inference Time

:::warning
Times presented in the tables are measured over consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
:::
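To reproduce the distinction between the first run and the consecutive runs on your own device, a minimal sketch could look like the one below. The helper names `timeRuns` and `summarizeRuns` are hypothetical and not part of react-native-executorch, and this is not necessarily how the numbers on this page were collected. `run` stands in for whatever model call you want to time.

```typescript
// Hypothetical helpers for illustration; not part of react-native-executorch.

// Times `totalRuns` consecutive calls of `run` and returns each duration in ms.
async function timeRuns(
  run: () => Promise<unknown>,
  totalRuns: number,
): Promise<number[]> {
  const timesMs: number[] = [];
  for (let i = 0; i < totalRuns; i++) {
    const start = Date.now();
    await run();
    timesMs.push(Date.now() - start);
  }
  return timesMs;
}

// Separates the first (cold) run from the average of the remaining runs.
function summarizeRuns(timesMs: number[], warmupRuns = 1) {
  if (timesMs.length <= warmupRuns) {
    throw new Error("Need more runs than warm-up runs");
  }
  const steady = timesMs.slice(warmupRuns);
  const averageMs = steady.reduce((sum, t) => sum + t, 0) / steady.length;
  return { firstRunMs: timesMs[0], averageMs };
}
```

Reporting `firstRunMs` and `averageMs` separately keeps the one-off loading cost from skewing the steady-state figure.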

Classification

| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| EFFICIENTNET_V2_S | 64 | 68 | 217 | 205 | 198 |

Object Detection

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| SSDLITE_320_MOBILENET_V3_LARGE | 71 | 74 | 257 | 115 | 109 |

Style Transfer

| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| STYLE_TRANSFER_CANDY | 1400 | 1485 | 4255 | 2510 | 2355 |
| STYLE_TRANSFER_MOSAIC | 1400 | 1485 | 4255 | 2510 | 2355 |
| STYLE_TRANSFER_UDNIE | 1400 | 1485 | 4255 | 2510 | 2355 |
| STYLE_TRANSFER_RAIN_PRINCESS | 1400 | 1485 | 4255 | 2510 | 2355 |

OCR

Note that the recognizer models were executed between 3 and 7 times during a single recognition. The values below are averages across all runs for the benchmark image.

| Model | iPhone 17 Pro [ms] | iPhone 16 Pro [ms] | iPhone SE 3 [ms] | Samsung Galaxy S24 [ms] | OnePlus 12 [ms] |
| --- | --- | --- | --- | --- | --- |
| Total Inference Time | 652 | 600 | 2855 | 1092 | 1034 |
| Detector (CRAFT) `forward_800` | 220 | 221 | 1740 | 521 | 492 |
| Recognizer (CRNN) `forward_512` | 45 | 38 | 110 | 40 | 38 |
| Recognizer (CRNN) `forward_256` | 21 | 18 | 54 | 20 | 19 |
| Recognizer (CRNN) `forward_128` | 11 | 9 | 27 | 10 | 10 |

Vertical OCR

Note that the recognizer models, as well as the detector's `forward_320` method, were executed between 4 and 21 times during a single recognition. The values below are averages across all runs for the benchmark image.

| Model | iPhone 17 Pro [ms] | iPhone 16 Pro [ms] | iPhone SE 3 [ms] | Samsung Galaxy S24 [ms] | OnePlus 12 [ms] |
| --- | --- | --- | --- | --- | --- |
| Total Inference Time | 1104 | 1113 | 8840 | 2845 | 2640 |
| Detector (CRAFT) `forward_1280` | 501 | 507 | 4317 | 1405 | 1275 |
| Detector (CRAFT) `forward_320` | 125 | 121 | 1060 | 338 | 299 |
| Recognizer (CRNN) `forward_512` | 46 | 42 | 109 | 47 | 37 |
| Recognizer (CRNN) `forward_64` | 5 | 6 | 14 | 7 | 6 |

LLMs

| Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] |
| --- | --- | --- | --- | --- | --- |
| LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 |
| LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 |
| LLAMA3_2_1B_QLORA | 31.8 | 11.4 | 11.2 | 37.3 | 44.4 |
| LLAMA3_2_3B | ❌ | ❌ | ❌ | ❌ | 7.1 |
| LLAMA3_2_3B_SPINQUANT | 17.2 | 8.2 | ❌ | 16.2 | 19.4 |
| LLAMA3_2_3B_QLORA | 14.5 | ❌ | ❌ | 14.8 | 18.1 |

❌ - Insufficient RAM.
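A throughput figure in tokens/s can be turned into a rough generation time by dividing the number of tokens by the rate. The sketch below uses a hypothetical helper, `estimateGenerationSeconds`, and assumes a constant rate. It ignores prompt processing and any other per-request overhead, so treat the result as a lower bound rather than a measured value.

```typescript
// Hypothetical helper: estimates generation time from a tokens/s figure,
// assuming the rate stays constant for the whole response.
function estimateGenerationSeconds(
  tokenCount: number,
  tokensPerSecond: number,
): number {
  if (tokensPerSecond <= 0) {
    throw new Error("tokensPerSecond must be positive");
  }
  return tokenCount / tokensPerSecond;
}

// LLAMA3_2_1B_SPINQUANT on iPhone 16 Pro: 40.6 tokens/s (from the table above).
const seconds = estimateGenerationSeconds(100, 40.6);
console.log(seconds.toFixed(2)); // prints "2.46"
```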

Speech to Text

Encoding

Average time to encode audio of a given length, measured over 10 runs. For the Whisper model we list only 30-second audio chunks, since Whisper does not accept other lengths (shorter audio must be padded with silence to 30 seconds).

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny (30s) | 248 | 254 | 1145 | 435 | 526 |

Decoding

Average time to decode one token in a sequence of approximately 100 tokens, with the encoding context obtained from audio of the noted length.

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny (30s) | 23 | 25 | 121 | 92 | 115 |

Text to Speech

Average time to synthesize speech from an input text of approximately 60 tokens, resulting in 2 to 5 seconds of audio depending on the input and selected voice.

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| Kokoro-small | 2051 | 1548 |
| Kokoro-medium | 2124 | 1625 |
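One way to read these numbers is as a real-time factor: synthesis time divided by the duration of the produced audio, where values below 1 mean speech is generated faster than it plays back. The helper `realTimeFactor` below is hypothetical, and the 2 to 5 second range is the audio length stated above, not a per-device measurement.

```typescript
// Hypothetical helper: synthesis time relative to the length of the audio produced.
function realTimeFactor(synthesisMs: number, audioMs: number): number {
  if (audioMs <= 0) {
    throw new Error("audioMs must be positive");
  }
  return synthesisMs / audioMs;
}

// Kokoro-small on iPhone 17 Pro (2051 ms) for the shortest and longest outputs.
const shortestOutput = realTimeFactor(2051, 2000); // slightly above 1
const longestOutput = realTimeFactor(2051, 5000); // roughly 0.41
console.log(shortestOutput, longestOutput);
```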

Text Embeddings

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| ALL_MINILM_L6_V2 | 7 | 21 |
| ALL_MPNET_BASE_V2 | 24 | 90 |
| MULTI_QA_MINILM_L6_COS_V1 | 7 | 19 |
| MULTI_QA_MPNET_BASE_DOT_V1 | 24 | 88 |
| CLIP_VIT_BASE_PATCH32_TEXT | 14 | 39 |

:::info
Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly.
:::

Image Embeddings

| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- |
| CLIP_VIT_BASE_PATCH32_IMAGE | 18 | 55 |

:::info
Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time.
:::

Semantic Segmentation

:::warning
Times presented in the tables are measured over consecutive runs of the model. Initial run times may be up to 2x longer due to model loading and initialization.
:::

| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 14 Pro Max (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] |
| --- | --- | --- | --- |
| DEEPLABV3_RESNET50 | 1000 | 670 | 700 |

Text to Image

| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
| --- | --- | --- | --- | --- | --- |
| BK_SDM_TINY_VPRED_256 | 21184 | 21021 | ❌ | 18834 | 16617 |
