
Commit d8e4ffd

[DOC] Document common computer vision patterns (pytorch#19518)

Fixes pytorch#8831

### Summary

- Add an Advanced docs page for working with computer vision models in ExecuTorch.
- Cover preprocessing placement, resize/crop behavior, dtype conversion, normalization, and `NCHW` versus `NHWC` tensor layouts.
- Add Android `Bitmap` and iOS `UIImage` image-to-tensor examples, plus quick guidance for classifier, segmentation, detection, and instance segmentation outputs.
- Link the page from the Advanced section, Advanced pathway, and Getting Started MobileNet flow.

### Test plan

- `git diff --check origin/main..HEAD`
- `python -m compileall -q docs/source/conf.py docs/source/custom_directives.py`
- PyTorch export smoke test for the preprocessing wrapper shown in the new page
- `cd docs && MPLCONFIGDIR=/tmp/matplotlib-executorch-8831 PYTHONPATH=../pip-out/lib.linux-x86_64-cpython-312 python -m sphinx -D plot_gallery=0 -b html --keep-going source _build/html source/working-with-cv-models.md source/advanced-topics-section.md source/pathway-advanced.md source/getting-started.md`
- Verified the rendered changed pages contain no local `WARNING`, `ERROR`, `CRITICAL`, `system-message`, or `problematic` artifacts

cc @mergennachin @AlannaBurke @byjlw

1 parent aa06dbc · commit d8e4ffd

4 files changed

Lines changed: 307 additions & 6 deletions


docs/source/advanced-topics-section.md

Lines changed: 14 additions & 1 deletion

````diff
@@ -29,6 +29,17 @@ Key topics:
 - Hardware Backend Selection & Optimization
 - Dynamic Shapes & Advanced Model Features
 
+## Computer Vision Models
+
+Patterns for image model preprocessing, tensor layout, and task-specific output decoding.
+
+**→ {doc}`working-with-cv-models` — Working with computer vision models**
+
+Key topics:
+
+- Resize, crop, dtype conversion, and normalization placement
+- Android and iOS image-to-tensor conversion
+- Classifier, segmentation, detection, and instance segmentation outputs
 
 ## Kernel Library
 
@@ -95,7 +106,7 @@ Key topics:
 
 After exploring advanced topics:
 
-- **{doc}`tools-sdk-section`** - Developer tools for debugging and profiling
+- **{doc}`tools-section`** - Developer tools for debugging and profiling
 - **{doc}`api-section`** - Complete API reference documentation
 
 ```{toctree}
@@ -105,8 +116,10 @@ After exploring advanced topics:
 
 quantization-optimization
 using-executorch-export
+working-with-cv-models
 kernel-library-advanced
 backend-delegate-advanced
 runtime-integration-advanced
 compiler-ir-advanced
 file-formats-advanced
+```
````

docs/source/getting-started.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -79,6 +79,8 @@ After successfully generating a .pte file, it is common to use the Python runtim
 
 For the MobileNet V2 model from torchvision used in this example, image inputs are expected as a normalized, float32 tensor with dimensions of (batch, channels, height, width). The output is a tensor containing class logits. See [torchvision.models.mobilenet_v2](https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v2.html) for more information on the input and output tensor format for this model.
 
+For more guidance on image preprocessing, channels-first and channels-last layouts, and CV output decoding, see [Working with Computer Vision Models](working-with-cv-models.md).
+
 ```python
 import torch
 from executorch.runtime import Runtime
````

docs/source/pathway-advanced.md

Lines changed: 37 additions & 5 deletions

````diff
@@ -58,6 +58,35 @@ Understanding the Backend Dialect IR and how it differs from Edge Dialect — es
 
 ---
 
+### Computer Vision Models
+
+Computer vision apps need a precise contract for image resizing, crop behavior, tensor layout, dtype conversion, normalization, and output decoding.
+
+::::{grid} 2
+:gutter: 2
+
+:::{grid-item-card} Working with Computer Vision Models
+:link: working-with-cv-models
+:link-type: doc
+
+Guidance for preprocessing placement, Android and iOS image-to-tensor conversion, and classifier or segmentation output interpretation.
+
+**Difficulty:** Intermediate
+:::
+
+:::{grid-item-card} Getting Started with ExecuTorch
+:link: getting-started
+:link-type: doc
+
+End-to-end MobileNet V2 export, validation, and mobile runtime links for a first image classification workflow.
+
+**Difficulty:** Beginner
+:::
+
+::::
+
+---
+
 ### Memory Planning and Runtime Optimization
 
 Memory planning is critical for constrained devices. ExecuTorch provides ahead-of-time memory planning to eliminate runtime allocations.
@@ -470,21 +499,24 @@ If you prefer a structured progression rather than topic-based navigation, follo
     - {doc}`using-executorch-export`
     - Master advanced export options
   * - 3
+    - {doc}`working-with-cv-models`
+    - Define image preprocessing, tensor layout, and output decoding for CV apps
+  * - 4
     - {doc}`quantization-optimization`
     - Apply production-grade quantization
-  * - 4
+  * - 5
     - {doc}`compiler-memory-planning`
     - Optimize memory for constrained devices
-  * - 5
+  * - 6
     - {doc}`compiler-custom-compiler-passes`
     - Write custom graph transformations
-  * - 6
+  * - 7
     - {doc}`backend-development`
     - Implement a custom backend delegate
-  * - 7
+  * - 8
     - {doc}`running-a-model-cpp-tutorial`
     - Master the low-level C++ runtime
-  * - 8
+  * - 9
     - {doc}`devtools-tutorial`
     - Profile and debug production models
 ```
````

docs/source/working-with-cv-models.md

Lines changed: 254 additions & 0 deletions (new file; contents below)

(working-with-cv-models)=

# Working with Computer Vision Models

Computer vision deployments depend on a precise boundary between the app and the exported program. Before exporting, write down the tensor contract that your app will satisfy:

- input shape, including whether the model expects `NCHW` (`[batch, channels, height, width]`) or `NHWC` (`[batch, height, width, channels]`)
- input dtype, such as `float32` normalized image values or `uint8` image bytes
- color channel order, such as RGB or BGR
- resize, crop, and normalization rules
- output tensors and the post-processing expected for each task

ExecuTorch runs the graph that you export. It does not infer image layout, resize policy, label mappings, or task-specific post-processing from the model file.
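
One lightweight way to keep that contract from drifting is to record it next to the export script and assert it against the export-time example inputs. A minimal sketch, where `TensorContract` is an illustrative helper rather than an ExecuTorch API:

```python
from dataclasses import dataclass

import torch


@dataclass(frozen=True)
class TensorContract:
    """Illustrative record of the app/model boundary (not an ExecuTorch API)."""

    shape: tuple[int, ...]   # e.g. NCHW: (1, 3, 224, 224)
    dtype: torch.dtype       # e.g. torch.float32 image values in [0, 1]
    channel_order: str       # "RGB" or "BGR"


CONTRACT = TensorContract((1, 3, 224, 224), torch.float32, "RGB")

# The export-time example input must satisfy the same contract the app will.
example = torch.zeros(CONTRACT.shape, dtype=CONTRACT.dtype)
assert example.shape == CONTRACT.shape and example.dtype == CONTRACT.dtype
```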

## Choose where preprocessing runs

Treat preprocessing placement as part of the backend contract. Keep platform-dependent image work, such as decoding, orientation handling, resizing, cropping, and UI-driven transforms, in the app. Put tensor-only preprocessing in the exported graph when exact parity with the PyTorch reference matters more and the target backend supports the operations.

For Core ML deployments, fixed-shape model inputs are often preferable. The Core ML backend documentation notes that true dynamic shapes use `RangeDim` and fall back to CPU or GPU instead of the Apple Neural Engine (ANE). When targeting the ANE, resize or crop images with iOS image APIs before tensor creation, or use enumerated shapes when the model needs a finite set of input sizes. See {doc}`backends/coreml/coreml-partitioner` for details.

This example accepts already resized and cropped `float32` `NCHW` RGB input in `[0, 1]`, normalizes it, and then calls the image classifier.

```python
import torch
from torch import nn
from torchvision.models import MobileNet_V2_Weights, mobilenet_v2


class ImageClassifierWithNormalization(nn.Module):
    def __init__(self, model: nn.Module) -> None:
        super().__init__()
        self.model = model
        self.register_buffer(
            "mean",
            torch.tensor([0.485, 0.456, 0.406], dtype=torch.float32).view(1, 3, 1, 1),
        )
        self.register_buffer(
            "std",
            torch.tensor([0.229, 0.224, 0.225], dtype=torch.float32).view(1, 3, 1, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        image = (image - self.mean) / self.std
        return self.model(image)


# Any NCHW float32 image classifier works here; MobileNet V2 matches the
# Getting Started flow.
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)

wrapped_model = ImageClassifierWithNormalization(model).eval()
sample_inputs = (torch.zeros(1, 3, 224, 224, dtype=torch.float32),)
exported_program = torch.export.export(wrapped_model, sample_inputs)
```

Validate app-side preprocessing against the same PyTorch preprocessing used during export. For example, if the app resizes and crops before creating the tensor, keep a small reference test that compares the packed tensor or final model output with the PyTorch path.
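
A desktop-side parity check might look like the following sketch, where `app_preprocess` is a hypothetical stand-in for the app's resize, crop, and pack path, and the tolerance is a starting point rather than a requirement:

```python
import torch
from torchvision import transforms

# Reference preprocessing that matches the export-time contract.
reference_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # float32 CHW in [0, 1]
])


def check_preprocessing_parity(pil_image, app_preprocess, atol=1e-2):
    """Compare the app's packed tensor against the PyTorch reference path.

    app_preprocess is a hypothetical reimplementation (or capture) of the
    app-side resize/crop/pack logic that returns a [1, 3, 224, 224] tensor.
    """
    reference = reference_preprocess(pil_image).unsqueeze(0)
    candidate = app_preprocess(pil_image)
    torch.testing.assert_close(candidate, reference, atol=atol, rtol=0)
```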

If the model expects a crop after resizing, keep that policy in exactly one place. A fixed center crop can be implemented in the wrapper for backends where those tensor operations preserve the desired delegation. Camera- or UI-dependent crops, and iOS/Core ML paths that should keep ANE-friendly static shapes, are usually better handled before packing pixels into the input tensor.
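
As a sketch, a fixed square resize plus center crop that lives entirely in the exported graph could look like this. Unlike TorchVision's aspect-preserving short-side resize, it squashes the input to a square for simplicity, and accepting variable input heights and widths requires dynamic shapes at export time, which the Core ML note above cautions about:

```python
import torch
import torch.nn.functional as F
from torch import nn


class ResizeCenterCrop(nn.Module):
    """Square-resize to `resize_to`, center-crop `crop_to`, then run the model."""

    def __init__(self, model: nn.Module, resize_to: int = 256, crop_to: int = 224) -> None:
        super().__init__()
        self.model = model
        self.resize_to = resize_to
        self.crop_to = crop_to

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: [1, 3, H, W] float32 RGB in [0, 1]
        image = F.interpolate(
            image,
            size=(self.resize_to, self.resize_to),
            mode="bilinear",
            align_corners=False,
        )
        offset = (self.resize_to - self.crop_to) // 2
        image = image[:, :, offset : offset + self.crop_to, offset : offset + self.crop_to]
        return self.model(image)
```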

## Convert images to tensors in app code

Most mobile image APIs expose decoded pixels as interleaved rows. Most PyTorch vision models expect channels-first tensors. If preprocessing stays in the app, explicitly pack pixels into the model's expected layout.

### Android

For production Android preprocessing, handle decoding, EXIF orientation, and camera-specific transforms before packing pixels into the input tensor. The following Kotlin helper keeps the layout conversion explicit: it resizes a `Bitmap`, reads RGB pixels, applies ImageNet-style normalization, and packs the result as `NCHW` `float32` data for `Tensor.fromBlob`.

```kotlin
import android.graphics.Bitmap
import org.pytorch.executorch.Tensor

fun bitmapToNchwTensor(
    bitmap: Bitmap,
    size: Int,
    mean: FloatArray = floatArrayOf(0.485f, 0.456f, 0.406f),
    std: FloatArray = floatArrayOf(0.229f, 0.224f, 0.225f)
): Tensor {
    val resized = Bitmap.createScaledBitmap(bitmap, size, size, true)
    val pixels = IntArray(size * size)
    resized.getPixels(pixels, 0, size, 0, 0, size, size)

    val input = FloatArray(3 * size * size)
    for (i in pixels.indices) {
        val pixel = pixels[i]
        val r = ((pixel shr 16) and 0xff) / 255.0f
        val g = ((pixel shr 8) and 0xff) / 255.0f
        val b = (pixel and 0xff) / 255.0f

        input[i] = (r - mean[0]) / std[0]
        input[size * size + i] = (g - mean[1]) / std[1]
        input[2 * size * size + i] = (b - mean[2]) / std[2]
    }

    return Tensor.fromBlob(input, longArrayOf(1, 3, size.toLong(), size.toLong()))
}
```

If the exported model accepts `uint8` image bytes instead, use `Tensor.fromBlobUnsigned(...)` and keep dtype conversion inside the exported graph.

```kotlin
val inputBytes = ByteArray(3 * width * height)
// Pack bytes in the same layout expected by the model. For NCHW RGB,
// write all red values, then green values, then blue values.
val inputTensor = Tensor.fromBlobUnsigned(
    inputBytes,
    longArrayOf(1, 3, height.toLong(), width.toLong())
)
```
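
On the export side, the matching pattern is a wrapper that accepts `uint8` input and performs the dtype conversion, scaling, and normalization in-graph. A sketch:

```python
import torch
from torch import nn


class Uint8ImageClassifier(nn.Module):
    """Accepts uint8 NCHW RGB bytes and converts and normalizes in-graph."""

    def __init__(self, model: nn.Module) -> None:
        super().__init__()
        self.model = model
        self.register_buffer(
            "mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
        )
        self.register_buffer(
            "std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
        )

    def forward(self, image_bytes: torch.Tensor) -> torch.Tensor:
        # uint8 [0, 255] -> float32 [0, 1], then ImageNet-style normalization.
        image = image_bytes.to(torch.float32) / 255.0
        return self.model((image - self.mean) / self.std)


# model: any float32 NCHW classifier, e.g. the MobileNet V2 shown earlier.
sample_inputs = (torch.zeros(1, 3, 224, 224, dtype=torch.uint8),)
exported_program = torch.export.export(Uint8ImageClassifier(model).eval(), sample_inputs)
```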

### iOS

For production iOS preprocessing, prefer platform image APIs and Accelerate, such as vImage for resizing and color conversion and vDSP for normalization, especially for camera frames or other hot paths. The following Swift helper keeps the layout conversion explicit so the tensor contract is easy to inspect: it draws a `UIImage` into a fixed-size RGB buffer, uses vDSP to normalize RGB channels, and creates a channels-first `Tensor<Float>`.

```swift
import Accelerate
import CoreGraphics
import ExecuTorch
import UIKit

func imageToNchwTensor(
  _ image: UIImage,
  size: Int,
  mean: [Float] = [0.485, 0.456, 0.406],
  std: [Float] = [0.229, 0.224, 0.225]
) -> Tensor<Float>? {
  guard size > 0, mean.count == 3, std.count == 3,
        let cgImage = image.cgImage else {
    return nil
  }

  let pixelCount = size * size
  var rgba = [UInt8](repeating: 0, count: pixelCount * 4)
  let colorSpace = CGColorSpaceCreateDeviceRGB()

  let didDraw = rgba.withUnsafeMutableBytes { buffer -> Bool in
    guard let baseAddress = buffer.baseAddress,
          let context = CGContext(
            data: baseAddress,
            width: size,
            height: size,
            bitsPerComponent: 8,
            bytesPerRow: size * 4,
            space: colorSpace,
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue |
              CGBitmapInfo.byteOrder32Big.rawValue
          ) else {
      return false
    }
    context.draw(cgImage, in: CGRect(x: 0, y: 0, width: size, height: size))
    return true
  }
  guard didDraw else {
    return nil
  }

  let count = vDSP_Length(pixelCount)
  var input = [Float](repeating: 0, count: 3 * pixelCount)
  rgba.withUnsafeBufferPointer { rgbaBuffer in
    input.withUnsafeMutableBufferPointer { inputBuffer in
      guard let rgbaBase = rgbaBuffer.baseAddress,
            let inputBase = inputBuffer.baseAddress else {
        return
      }

      for channel in 0..<3 {
        let source = rgbaBase.advanced(by: channel)
        let destination = inputBase.advanced(by: channel * pixelCount)
        var scale = 1.0 / (255.0 * std[channel])
        var bias = -mean[channel] / std[channel]

        vDSP_vfltu8(source, 4, destination, 1, count)
        vDSP_vsmsa(destination, 1, &scale, &bias, destination, 1, count)
      }
    }
  }

  return Tensor<Float>(input, shape: [1, 3, size, size])
}
```

On either platform, if your model is exported for `NHWC`, keep the same decoded pixels but pack them in row-major `[height, width, channels]` order and use shape `[1, height, width, 3]`.
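
The relationship between the two layouts is easy to sanity-check against desktop tensors. A sketch using the same interleaved pixel order a mobile decoder hands back:

```python
import torch

height, width = 224, 224
# Decoded pixels as most image APIs provide them: interleaved rows, [H, W, C].
pixels = torch.randint(0, 256, (height, width, 3), dtype=torch.uint8)

# NHWC input: the decoded buffer is already in the right memory order.
nhwc = pixels.unsqueeze(0)                    # [1, H, W, 3]

# NCHW input: the same pixels must be transposed to channels-first.
nchw = pixels.permute(2, 0, 1).unsqueeze(0)   # [1, 3, H, W]

assert nchw[0, 0, 10, 20] == nhwc[0, 10, 20, 0]  # same red value
```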

## Decode common CV outputs

Output tensors are model-specific. Preserve the output schema used during export and keep a small validation test that compares app-side post-processing with PyTorch post-processing.

For TorchVision models, check the [models and pre-trained weights documentation](https://docs.pytorch.org/vision/stable/models.html) for model-specific transforms, categories, and task conventions.
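
For example, TorchVision weights enums bundle the matching preprocessing and label metadata, which makes them a convenient source of truth for the contract:

```python
from torchvision.models import MobileNet_V2_Weights

weights = MobileNet_V2_Weights.DEFAULT
preprocess = weights.transforms()        # resize, crop, dtype, normalization
categories = weights.meta["categories"]  # label list for decoding outputs
```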

### Image classification

Image classifiers commonly return a logits tensor with shape `[1, num_classes]`. For top-1 classification, find the largest logit and map the index through the same labels file used during training or evaluation.

```kotlin
import org.pytorch.executorch.EValue

// module is the org.pytorch.executorch.Module loaded from the .pte file;
// inputTensor comes from a helper like bitmapToNchwTensor above.
val output = module.forward(EValue.from(inputTensor))[0].toTensor()
val logits = output.dataAsFloatArray

var topIndex = 0
for (i in 1 until logits.size) {
    if (logits[i] > logits[topIndex]) {
        topIndex = i
    }
}
val topScore = logits[topIndex]
```

Use `softmax` only when the UI needs probabilities. Ranking classes by logits and by softmax probabilities gives the same order.

### Semantic segmentation

Semantic segmentation models commonly return class scores with shape `[1, classes, height, width]`. For each output pixel, choose the class channel with the largest score, then resize the mask back to the displayed image size if needed.

```kotlin
// scores is the flattened [1, classes, height, width] output buffer.
fun argmaxMask(scores: FloatArray, classes: Int, height: Int, width: Int): IntArray {
    val mask = IntArray(height * width)
    for (y in 0 until height) {
        for (x in 0 until width) {
            val offset = y * width + x
            var bestClass = 0
            var bestScore = scores[offset]
            for (c in 1 until classes) {
                val score = scores[c * height * width + offset]
                if (score > bestScore) {
                    bestScore = score
                    bestClass = c
                }
            }
            mask[offset] = bestClass
        }
    }
    return mask
}
```

See the [DeepLabV3 Android demo](https://github.com/meta-pytorch/executorch-examples/tree/main/dl3/android/DeepLabV3Demo) for an end-to-end ExecuTorch segmentation app that exports a model, runs it on Android, and overlays the predicted mask on an image.

### Object detection and instance segmentation

Detection and instance segmentation models do not have a single universal output format. Common patterns include:

- boxes as `[num_detections, 4]`, usually in `xyxy` or `xywh` coordinates
- labels as `[num_detections]`
- scores as `[num_detections]`
- masks as `[num_detections, height, width]` or `[num_detections, 1, height, width]`

Check whether thresholding, non-maximum suppression, box decoding, and mask resizing are already part of the exported graph. If they are not, keep those steps in the app and document the expected coordinate system. When the model runs on a resized or cropped image, map boxes and masks back to the original image coordinates before rendering overlays.
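
As a sketch, mapping `xyxy` boxes from model-input coordinates back to the original image, with app-side score thresholding, might look like this. It assumes a plain resize to `model_size` x `model_size` with no letterboxing; a letterboxed pipeline must also undo the padding:

```python
def rescale_boxes(boxes, scores, model_size, original_width, original_height,
                  score_threshold=0.5):
    """Map xyxy boxes from model-input coordinates to original-image pixels.

    boxes: iterable of (x1, y1, x2, y2) in model-input coordinates.
    scores: per-detection confidence scores, same length as boxes.
    """
    scale_x = original_width / model_size
    scale_y = original_height / model_size
    kept = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score < score_threshold:
            continue  # drop low-confidence detections before rendering
        kept.append((x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y, score))
    return kept
```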

## Validate the model and app contract

Before shipping a CV model, validate these items:

- The app sends the same dtype, shape, layout, color order, and normalization that the exported graph expects.
- The app uses the same labels, palette, score threshold, and coordinate convention as the PyTorch reference.
- A known image produces matching top classes, masks, or detections in PyTorch and in the ExecuTorch app.
- Preprocessing is applied exactly once: do not normalize in both the app and the exported model.
- The output code handles model-specific shapes instead of assuming all CV models return classifier logits.

For the basic export and runtime flow, start with {doc}`getting-started`. For mobile runtime integration, see {doc}`using-executorch-android` and {doc}`using-executorch-ios`.
