Integration findings - aggregated improvements #1086

@msluszniak

Description

While integrating LFM2-VL-1.6B into a multilingual Expo app on Android (Samsung Galaxy S26 Ultra) against react-native-executorch@0.8.3 + react-native-executorch-expo-resource-fetcher@0.8.0, a number of papercuts and one real sampler-side bug surfaced. Filing them here as a single aggregated tracking issue. Each item is independently actionable and will be addressed in its own PR.

1. LLMModule.generate() does not auto-shape multimodal mediaPath messages

Native throws "More images paths provided than '<image>' placeholders in prompt".

LLMController.sendMessage transforms user messages with a mediaPath into content: [{type:'image'}, {type:'text', text}] so the chat template emits the image placeholder. LLMController.generate does not — it only collects imagePaths and renders the template with the original string content. Callers using the lower-level generate() API must pre-shape the structured array themselves.

Proposed fix: apply the same historyForTemplate transformation inside generate() for any message with a mediaPath.
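
For reference, a minimal sketch of what that transformation could look like (type and helper names here are assumptions, not the actual internals):

```ts
// Sketch only: mirrors what LLMController.sendMessage reportedly does,
// applied to the history passed to generate(). Names are assumed.
type ContentPart = { type: 'image' } | { type: 'text'; text: string };

interface ChatMessage {
  role: 'user' | 'assistant' | 'system';
  content: string | ContentPart[];
  mediaPath?: string;
}

function shapeForTemplate(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) =>
    m.mediaPath && typeof m.content === 'string'
      ? { ...m, content: [{ type: 'image' }, { type: 'text', text: m.content }] }
      : m
  );
}
```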

SOLVED by #1089

2. Message.content is typed string but the chat template accepts an array of content parts

Forces users of generate() (after the fix for point 1 lands, or as a workaround today) to write as unknown as string casts. The type and the runtime contract are out of sync.

File: src/types/llm.ts, which declares content: string.

Proposed fix: widen content to string | Array<{type:'image'} | {type:'text', text:string}> (the array parts form a discriminated union on type). Likely bundles with item 1 into one PR.
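
A sketch of the proposed type (exact shape to be confirmed against what the chat template consumes):

```ts
// Plain string stays valid for text-only messages; the array form is a
// discriminated union on `type` for multimodal messages.
type MessageContent =
  | string
  | Array<{ type: 'image' } | { type: 'text'; text: string }>;
```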

SOLVED by #1089

3. Image-path format requirement for multimodal generate() is undocumented and inconsistent

  • LLMModule.generate() with vision capability throws "Read image error: invalid argument" if the path lacks a file:// prefix.
  • ResourceFetcher.fetch() docstring says it returns paths without file://.
  • Vision module examples in source comments show 'file:///path/to/image.jpg'.

A user threading the resource fetcher result into a vision call will hit this.

Proposed fix: normalize internally on the multimodal call (accept either form) or document the requirement loudly on LLMModule.generate/sendMessage and on every vision module.
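
The internal normalization could be as small as (helper name hypothetical):

```ts
// Accept both `/abs/path.jpg` and `file:///abs/path.jpg` forms.
const toFileUri = (path: string): string =>
  path.startsWith('file://') ? path : `file://${path}`;
```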

SOLVED by #1090

4. ResourceFetcher class docstring promises methods that don't exist on the class

src/utils/ResourceFetcher.ts class-level docstring lists .getFilesTotalSize(), .listDownloadedFiles(), .listDownloadedModels(), .deleteResources() as available. They only exist on adapter implementations (e.g. ExpoResourceFetcher in the separate react-native-executorch-expo-resource-fetcher package). Pure docs fix; good warm-up PR.

Proposed fix: update the class docstring to point at the adapter packages, OR expose them as static methods that delegate to the configured adapter.
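
If the second option is chosen, a rough shape of the delegation (adapter interface and wiring are assumptions, not the package's actual internals):

```ts
// Assumed adapter surface; the real method list lives in the adapter packages.
interface ResourceFetcherAdapter {
  listDownloadedModels(): Promise<string[]>;
  deleteResources(uris: string[]): Promise<void>;
}

let adapter: ResourceFetcherAdapter | undefined; // set via initExecutorch

export class ResourceFetcher {
  static listDownloadedModels(): Promise<string[]> {
    if (!adapter) {
      throw new Error('No resource fetcher configured; call initExecutorch first.');
    }
    return adapter.listDownloadedModels();
  }
}
```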

SOLVED by #1087

5. Expo init step is easy to forget

Forgetting initExecutorch({ resourceFetcher: ExpoResourceFetcher }) produces a runtime error. The error message is actually good (it directs users to the right packages), but the boilerplate itself is a footgun that every Expo consumer has to remember.

Proposed fix: provide a side-effect import like react-native-executorch/expo that auto-registers the adapter, similar to how expo-router/entry works.
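
The entire side-effect module would be roughly the following (hypothetical entry point, import paths assumed; see the decision below):

```ts
// react-native-executorch/expo (sketch): importing this module once at app
// entry would register the Expo adapter as a side effect.
import { initExecutorch } from 'react-native-executorch';
import { ExpoResourceFetcher } from 'react-native-executorch-expo-resource-fetcher';

initExecutorch({ resourceFetcher: ExpoResourceFetcher });
```

Consumers would then write import 'react-native-executorch/expo'; at the top of their entry file, as with expo-router/entry.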

Decided not to proceed with this one: #1092

6. No documented sampling presets for OCR / structured extraction

Default sampling settings on LFM2-VL-1.6B produce noisy, inconsistent JSON output for structured extraction tasks. Empirically settled on temperature: 0.1, topp: 0.9 (see also point 8).

Proposed fix: document recommended sampling presets per use case (chat / OCR-extraction / code-completion) in the docs, or expose named presets that can be passed to configure().
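
For example, the preset used in the testing app (how GenerationConfig reaches configure() is assumed here; exact call shape per the docs):

```ts
// Empirical preset for OCR / structured-JSON extraction with LFM2-VL-1.6B.
const OCR_EXTRACTION_PRESET = {
  temperature: 0.1, // near-greedy: keeps JSON keys and structure stable
  topp: 0.9,
};

// Hypothetical handle; the configure() signature is an assumption.
declare const llm: { configure(cfg: { generationConfig: object }): void };
llm.configure({ generationConfig: OCR_EXTRACTION_PRESET });
```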

This one awaits #1094

7. Multimodal generate() does not downscale large input images

Observed on: Samsung Galaxy S26 Ultra (Android), JPEG photos straight from the device camera (roughly 4032×3024).

Passing raw camera images directly to LLMModule.generate() with LFM2-VL-1.6B produced visibly worse OCR than after preprocessing. Adding a resize step (expo-image-manipulator → circa 1024px wide, JPEG q=0.95) measurably improved title and subtitle recognition on the same cover. Not verified whether the JPEG re-encode contributes anything beyond the resize, or whether HEIC inputs from iOS work at all.
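
The preprocessing step, roughly as used in the testing app:

```ts
import { manipulateAsync, SaveFormat } from 'expo-image-manipulator';

// Downscale a raw camera JPEG to ~1024px wide before the multimodal call.
// Height scales automatically to preserve the aspect ratio.
async function preprocessForVision(uri: string): Promise<string> {
  const { uri: resizedUri } = await manipulateAsync(
    uri,
    [{ resize: { width: 1024 } }],
    { compress: 0.95, format: SaveFormat.JPEG }
  );
  return resizedUri; // typically already file://-prefixed (cf. item 3)
}
```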

Proposed fix: either downscale internally on the native side before vision-encoder tokenization, or document the recommended max input dimension prominently on the multimodal LLM docs page so callers know preprocessing is expected.

8. GenerationConfig has no repetition penalty / no-repeat-ngram-size knob

At low temperature (≤0.2), LFM2-VL-1.6B on hard inputs falls into greedy traps: it latches onto a single token or short phrase (observed with both U+00A0 NBSP and a 24-char title phrase) and emits it for hundreds of tokens straight. Concrete logged outputs are available in the testing app session history.

GenerationConfig only exposes temperature, topp, outputTokenBatchSize, batchTimeInterval — no way to reach the underlying sampler's repetition controls from JS. Workaround in testing app: raised temperature to 0.3 + a regex-based isDegenerateResponse detector. Hacky.
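
The detector is roughly the following (hypothetical implementation; the real one lives in the testing app):

```ts
// Flags output dominated by one short token or phrase repeated many times,
// e.g. runs of U+00A0 or a ~24-char phrase looping for hundreds of tokens.
function isDegenerateResponse(text: string): boolean {
  return /(.{1,24}?)\1{9,}/s.test(text);
}
```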

Proposed fix: add repetitionPenalty?: number and/or noRepeatNgramSize?: number to GenerationConfig, plumb through to the native sampler. Both are standard in transformers/llama.cpp/ExecuTorch samplers.
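
Sketch of the extended config (proposed names follow transformers/llama.cpp conventions; the existing fields are as listed above):

```ts
interface GenerationConfig {
  temperature?: number;
  topp?: number;
  outputTokenBatchSize?: number;
  batchTimeInterval?: number;
  // Proposed additions:
  repetitionPenalty?: number; // e.g. 1.1; values > 1 penalize repeated tokens
  noRepeatNgramSize?: number; // e.g. 3; forbids exact n-gram repeats
}
```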

This one will be resolved in #1094

Docs nudge

Small VLMs occasionally hallucinate non-Latin script (Arabic, Cyrillic) when uncertain on Latin-script inputs. Not an RNE bug — model behavior — but a "sanitize VLM output" note in the multimodal docs would have saved time.
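
For illustration, a sanitizer along these lines would have sufficed in the app (example only; the right filter depends on which scripts the app actually expects):

```ts
// Strip Cyrillic (U+0400–U+04FF) and Arabic (U+0600–U+06FF) runs from
// output that is expected to be Latin-script.
const sanitizeLatinOutput = (s: string): string =>
  s.replace(/[\u0400-\u04FF\u0600-\u06FF]+/g, '').trim();
```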
