While integrating LFM2-VL-1.6B into a multilingual Expo app on Android (Samsung Galaxy S26 Ultra) against react-native-executorch@0.8.3 + react-native-executorch-expo-resource-fetcher@0.8.0, a number of papercuts and one real sampler-side bug surfaced. Filing them here as a single aggregated tracking issue. Each item is independently actionable and will be addressed in its own PR.
1. LLMModule.generate() does not auto-shape multimodal mediaPath messages
Native throws "More images paths provided than '<image>' placeholders in prompt".
LLMController.sendMessage transforms user messages with a mediaPath into content: [{type:'image'}, {type:'text', text}] so the chat template emits the image placeholder. LLMController.generate does not — it only collects imagePaths and renders the template with the original string content. Callers using the lower-level generate() API must pre-shape the structured array themselves.
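For anyone on a version without the fix, a minimal sketch of that pre-shaping, assuming the content-part shape sendMessage produces internally (ChatMessage stands in for the library's Message type, and the helper name is mine):

```ts
// Hypothetical helper mirroring LLMController.sendMessage's internal
// transform: a message carrying a mediaPath gets its string content expanded
// into [{type:'image'}, {type:'text', text}] so the chat template emits the
// <image> placeholder. The `as unknown as string` cast works around the
// typing gap described in item 2.
interface ChatMessage {
  role: string;
  content: string;
  mediaPath?: string;
}

function shapeForTemplate(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) =>
    m.mediaPath
      ? {
          ...m,
          content: [
            { type: 'image' },
            { type: 'text', text: m.content },
          ] as unknown as string,
        }
      : m
  );
}
```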
Proposed fix: apply the same historyForTemplate transformation inside generate() for any message with a mediaPath.
SOLVED by #1089
2. Message.content is typed string but the chat template accepts an array of content parts
This forces users of generate() (after the fix for point 1 lands, or as a workaround today) to write as unknown as string casts. The type and the runtime contract are out of sync.
File: src/types/llm.ts — content: string.
Proposed fix: make content a union string | Array<{type:'image'} | {type:'text', text:string}> (sketched below). Likely bundles with item 1 into one PR.
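Sketched as a declaration (fields other than content are illustrative):

```ts
// Proposed shape for src/types/llm.ts; the array elements form the
// discriminated union, keyed on `type`.
type ContentPart = { type: 'image' } | { type: 'text'; text: string };

interface Message {
  role: 'user' | 'assistant' | 'system';
  content: string | ContentPart[];
}
```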
SOLVED by #1089
3. Image-path format requirement for multimodal generate() is undocumented and inconsistent
LLMModule.generate() with vision capability throws "Read image error: invalid argument" if the path lacks a file:// prefix.
- ResourceFetcher.fetch() docstring says it returns paths without file://.
- Vision module examples in source comments show 'file:///path/to/image.jpg'.
A user threading the resource fetcher result into a vision call will hit this.
Proposed fix: normalize internally on the multimodal call (accept either form) or document the requirement loudly on LLMModule.generate/sendMessage and on every vision module.
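A minimal sketch of the normalization option (helper name is hypothetical):

```ts
// Accept both bare paths and file:// URIs; the multimodal call appears to
// require the file:// form, per the "Read image error" above.
function normalizeImagePath(path: string): string {
  return path.startsWith('file://') ? path : `file://${path}`;
}

// e.g. threading a ResourceFetcher.fetch() result (returned without file://)
// into a vision call:
// const imagePath = normalizeImagePath(fetchedPath);
```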
SOLVED by #1090
4. ResourceFetcher class docstring promises methods that don't exist on the class
src/utils/ResourceFetcher.ts class-level docstring lists .getFilesTotalSize(), .listDownloadedFiles(), .listDownloadedModels(), .deleteResources() as available. They only exist on adapter implementations (e.g. ExpoResourceFetcher in the separate react-native-executorch-expo-resource-fetcher package). Pure docs fix; good warm-up PR.
Proposed fix: update the class docstring to point at the adapter packages, OR expose them as static methods that delegate to the configured adapter.
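A sketch of the delegating-statics option; the adapter interface below is a guess at the contract, not the package's actual API:

```ts
// Assumed adapter surface, matching the methods the docstring promises.
interface ResourceFetcherAdapter {
  getFilesTotalSize(): Promise<number>;
  listDownloadedFiles(): Promise<string[]>;
  listDownloadedModels(): Promise<string[]>;
  deleteResources(paths: string[]): Promise<void>;
}

class ResourceFetcher {
  private static adapter: ResourceFetcherAdapter | null = null;

  static register(adapter: ResourceFetcherAdapter): void {
    this.adapter = adapter;
  }

  static getFilesTotalSize(): Promise<number> {
    if (!this.adapter) {
      throw new Error('No resource fetcher adapter registered');
    }
    return this.adapter.getFilesTotalSize();
  }

  // listDownloadedFiles / listDownloadedModels / deleteResources would
  // delegate the same way.
}
```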
SOLVED by #1087
5. Expo init step is easy to forget
Forgetting initExecutorch({ resourceFetcher: ExpoResourceFetcher }) produces a runtime error. The error message is actually good (it directs users to the right packages), but the boilerplate itself is a footgun that every Expo consumer has to remember.
Proposed fix: provide a side-effect import like react-native-executorch/expo that auto-registers the adapter, similar to how expo-router/entry works.
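The proposed entry module could be this small (the module path is the proposal, not an existing file, and the import sources are assumed):

```ts
// react-native-executorch/expo (proposed): side-effect module that registers
// the Expo adapter on import, modeled on expo-router/entry.
import { initExecutorch } from 'react-native-executorch';
import { ExpoResourceFetcher } from 'react-native-executorch-expo-resource-fetcher';

initExecutorch({ resourceFetcher: ExpoResourceFetcher });
```

Consumers would then add a single import 'react-native-executorch/expo'; at the app entry point.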
Decided not to proceed with this one: #1092
6. No documented sampling presets for OCR / structured extraction
Default sampling settings on LFM2-VL-1.6B produce noisy, inconsistent JSON output for structured extraction tasks. Settled on temperature: 0.1, topp: 0.9 empirically (see also point 8).
Proposed fix: document recommended sampling presets per use case (chat / OCR-extraction / code-completion) in the docs, or expose named presets that can be passed to configure().
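A sketch of the presets option; only the OCR/extraction numbers come from this issue, the other two rows are placeholders, and no preset API exists today:

```ts
// Proposed named presets. ocrExtraction uses the empirically settled values
// from above; chat and codeCompletion are illustrative placeholders.
const SAMPLING_PRESETS = {
  chat: { temperature: 0.7, topp: 0.9 },
  ocrExtraction: { temperature: 0.1, topp: 0.9 },
  codeCompletion: { temperature: 0.2, topp: 0.95 },
} as const;

// Hypothetical usage; configure()'s exact signature may differ:
// await llm.configure({ generationConfig: SAMPLING_PRESETS.ocrExtraction });
```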
This one awaits #1094
7. Multimodal generate() does not downscale large input images
Observed on: Samsung Galaxy S26 Ultra (Android), JPEG photos straight from the device camera (roughly 4032×3024).
Passing raw camera images directly to LLMModule.generate() with LFM2-VL-1.6B produced visibly worse OCR than after preprocessing. Adding a resize step (expo-image-manipulator → circa 1024 px wide, JPEG q=0.95) measurably improved title and subtitle recognition on the same cover image. Not verified whether the JPEG re-encode contributes anything beyond the resize, or whether HEIC inputs from iOS work at all.
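The preprocessing step, roughly as used in the test app (the width and quality values are the ones mentioned above; expo-image-manipulator's manipulateAsync does the work):

```ts
import { manipulateAsync, SaveFormat } from 'expo-image-manipulator';

// Downscale a raw camera JPEG (~4032×3024) to ~1024 px wide and re-encode
// before handing it to the multimodal generate(). Giving only width keeps
// the aspect ratio.
async function preprocessForVision(uri: string): Promise<string> {
  const result = await manipulateAsync(
    uri,
    [{ resize: { width: 1024 } }],
    { compress: 0.95, format: SaveFormat.JPEG }
  );
  return result.uri; // a file:// URI, which the vision call expects (see item 3)
}
```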
Proposed fix: either downscale internally on the native side before vision-encoder tokenization, or document the recommended max input dimension prominently on the multimodal LLM docs page so callers know preprocessing is expected.
8. GenerationConfig has no repetition penalty / no-repeat-ngram-size knob
At low temperature (≤0.2), LFM2-VL-1.6B on hard inputs falls into greedy traps: it latches onto a single token or short phrase (observed both a U+00A0 NBSP and a 24-character title phrase) and emits it for hundreds of tokens straight. Concrete logged outputs are available in the testing app session history.
GenerationConfig only exposes temperature, topp, outputTokenBatchSize, batchTimeInterval — no way to reach the underlying sampler's repetition controls from JS. Workaround in the testing app: temperature raised to 0.3 plus a regex-based isDegenerateResponse detector (sketched below). Hacky.
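A rough reconstruction of that detector's shape; the test app's exact heuristics may differ and both regexes are illustrative:

```ts
// Flags the two observed failure modes: one character (e.g. U+00A0) repeated
// in a long run, and a short phrase repeated back-to-back many times.
function isDegenerateResponse(text: string): boolean {
  // any single character repeated 50+ times in a row (catches the NBSP run)
  if (/(.)\1{49,}/.test(text)) return true;
  // a 4-30 character chunk repeated 8+ consecutive times (the title-phrase trap)
  if (/(.{4,30}?)\1{7,}/s.test(text)) return true;
  return false;
}
```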
Proposed fix: add repetitionPenalty?: number and/or noRepeatNgramSize?: number to GenerationConfig, plumb through to the native sampler. Both are standard in transformers/llama.cpp/ExecuTorch samplers.
This one will be resolved in #1094
Docs nudge
Small VLMs occasionally hallucinate non-Latin script (Arabic, Cyrillic) when uncertain on Latin-script inputs. Not an RNE bug — model behavior — but a "sanitize VLM output" note in the multimodal docs would have saved time.
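For what it's worth, a minimal sanitizer sketch; which scripts to strip (or allowlist instead) is app-specific:

```ts
// Strip hallucinated Arabic/Cyrillic runs from otherwise Latin-script output.
// The script list is illustrative; a genuinely multilingual pipeline would
// need an allowlist keyed on the expected input language instead.
function stripUnexpectedScripts(text: string): string {
  return text
    .replace(/[\p{Script=Arabic}\p{Script=Cyrillic}]+/gu, ' ')
    .replace(/ {2,}/g, ' ')
    .trim();
}
```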