fix/feat!: LLMs context management #819
Conversation
I will update docs after the code gets approved
```cpp
int32_t Runner::count_text_tokens(const std::string &text) const {
  auto encodeResult =
      tokenizer_->encode(text, numOfAddedBoSTokens, numOfAddedEoSTokens);

  if (!encodeResult.ok()) {
    throw rnexecutorch::RnExecutorchError(
        rnexecutorch::RnExecutorchErrorCode::TokenizerError,
        "Encoding failed during token count check.");
  }

  return static_cast<int32_t>(encodeResult.get().size());
}
```
I'm wondering if calling the encoder just to get the size of the encoded text is the most efficient way to obtain it. Can we compute this during the encoding phase instead?
If I understand this correctly, that's not possible for 2 reasons:
- I need to calculate the tokens across the whole messages history + the new message, so I don't know the number of tokens for the new message ahead of time
- I can't reliably reuse the token counts from the runner, because for reasoning models it counts reasoning tokens as well (and later on, these reasoning tokens are not included in the jinja template) - this could lead to discrepancies
Alternatively, maybe we could implement these context strategies in the C++ code, but then we would give no flexibility to the user (?)
Also:
- the context management strategy is configurable, so it can be switched on/off or replaced with another strategy - by default, we do not use the sliding window, but the strategy based on the number of messages (like it was before)
- I think that the encoding phase should not take that long
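To make the "configurable strategy" idea concrete, here is a minimal TypeScript sketch of what a message-count-based strategy could look like. All names (`Message`, `ContextStrategy`, `MessageCountContextStrategy`) and shapes are hypothetical illustrations, not the library's actual interfaces:

```typescript
// Hypothetical shapes -- the real interfaces live in the library code.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ContextStrategy {
  // Returns the subset of messages that should be sent to the runner.
  apply(messages: Message[]): Message[];
}

// Keeps only the most recent `maxMessages` messages, always preserving
// a leading system prompt if one is present.
class MessageCountContextStrategy implements ContextStrategy {
  constructor(private maxMessages: number) {}

  apply(messages: Message[]): Message[] {
    const system = messages[0]?.role === "system" ? [messages[0]] : [];
    const rest = messages.slice(system.length);
    return [...system, ...rest.slice(-this.maxMessages)];
  }
}
```

Because the strategy only sees the plain message array, swapping it for a token-budget or no-op variant requires no changes on the runner side.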
We don't need to bother ourselves with computational complexity here; this is just tokenizer encoding, which is very cheap and only performed once for a given prompt, so it only impacts Time To First Token.
msluszniak
left a comment
Please also resolve conflicts
I changed only docs for
msluszniak
left a comment
In a separate PR I would add a TS code snippet with the llm configuration.
```cpp
void LLM::reset() {
  if (!runner || !runner->is_loaded()) {
    throw RnExecutorchError(RnExecutorchErrorCode::ModuleNotLoaded,
                            "Can't interrupt a model that's not loaded");
```

Suggested change:

```diff
-                            "Can't interrupt a model that's not loaded");
+                            "Can't reset a model that's not loaded");
```
```ts
 *
 * @category Utils
 */
export class NaiveContextStrategy implements ContextStrategy {
```
imo sth like `NoOpContextStrategy` is better; `Naive` suggests that there is some simple logic underneath.
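To illustrate the naming point: under a hypothetical `ContextStrategy` interface (names assumed for illustration, not taken from the library), a no-op strategy would simply pass the history through unchanged, which `NoOp` communicates more directly than `Naive`:

```typescript
// Hypothetical interface, sketched for the naming discussion only.
interface Message {
  role: string;
  content: string;
}

interface ContextStrategy {
  apply(messages: Message[]): Message[];
}

// "NoOp" makes it explicit that no trimming happens:
// the full history is forwarded to the runner as-is.
class NoOpContextStrategy implements ContextStrategy {
  apply(messages: Message[]): Message[] {
    return messages;
  }
}
```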
## Description

This PR fixes a few bugs related to the LLMs, caused by mixing two approaches - functional (as we pass the whole messages history each time) and stateful (as we keep `pos_` in the runner, representing at which position the KV cache is), which resulted in 3 bugs:

- broken KV cache for reasoning models - in the runner, we counted tokens generated for the reasoning and included these in the KV cache (`pos_ += num_generated_tokens`), but in the next turns, the jinja template removed these reasoning tokens from the messages history - as a result, the KV cache was incoherent
- duplicated tokens in the KV cache - we were passing the whole messages history to the runner (functional approach), but we were also appending all tokens (prompt and generated) to the KV cache (whose position is represented by `pos_`) - as a result, tokens were "duplicated" in the KV cache and we were running out of available tokens very fast (exceeding `context_window_length`)
- stateful TS functional API - even though our `generate()` method is called functional, it kept internal state in the runner (e.g. `pos_`)

These bugs were fixed by resetting the runner before each generation, which makes it truly functional - old messages are prefilled and the KV cache can still be used during the generation phase.

Additionally, this PR adds `ContextStrategy` to the `ChatConfig` interface, so now it's possible to define (or use one of the already implemented) strategies for managing context (e.g. naive, message count based, sliding window) - this gives us more flexibility and lets the user decide what's best for their use case. From now on, `SlidingWindowContextStrategy` is also configured as the default one.

### Introduces a breaking change?

- [x] Yes
- [ ] No

These changes will not break anything unless the max number of messages is modified (I removed `contextWindowLength` from `ChatConfig` and replaced it with `contextStrategy`).

### Type of change

- [x] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [ ] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [x] Android

### Testing instructions

Run the example llm app, open the executorch logs (`adb logcat | grep -i "executorch"` for example) and check whether the token counts are properly aligned and `pos_` is correct.

To test different context management strategies, change `contextStrategy` in the llm app and modify the model configuration.

### Screenshots

<!-- Add screenshots here, if applicable -->

### Related issues

#776

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes

Position in KV cache, number of prompt tokens, and number of generated tokens for both non-reasoning and reasoning models BEFORE the changes.

LLAMA 3.2 1B SPINQUANT (without reasoning)

| pos_ | Prompt tokens | Generated tokens |
|------------------|---------------|------------------|
| 0 | 335 | 269 |
| 604=269+335 | 872 | 372 |
| 1848=604+872+372 | 1513 | CRASH |

QWEN 3.0 0.6B QUANTIZED (with reasoning)

| pos_ | Prompt tokens | Generated tokens |
|------------------|---------------|------------------|
| 0 | 309 | 457 |
| 766=309+457 | 617 (<766!) | 192 |
| 1575=766+617+192 | 925 (<1575!) | CRASH |
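For readers skimming the PR, a minimal sketch of what a token-budget sliding window could look like. This is an illustration under assumed names (`Message`, `SlidingWindowContextStrategy`, a `countTokens` callback such as one backed by the runner's `count_text_tokens`), not the library's actual implementation:

```typescript
// Hypothetical message shape for this sketch.
interface Message {
  role: string;
  content: string;
}

// Keeps as many of the newest messages as fit into a token budget,
// walking the history from the newest message backwards.
class SlidingWindowContextStrategy {
  constructor(
    private maxTokens: number,
    private countTokens: (text: string) => number,
  ) {}

  apply(messages: Message[]): Message[] {
    const kept: Message[] = [];
    let used = 0;
    for (let i = messages.length - 1; i >= 0; i--) {
      const cost = this.countTokens(messages[i].content);
      if (used + cost > this.maxTokens) break; // budget exhausted
      used += cost;
      kept.unshift(messages[i]); // preserve chronological order
    }
    return kept;
  }
}
```

Counting tokens per message (rather than trusting the runner's `pos_`) sidesteps the reasoning-token discrepancy described above, since only the text that actually survives the jinja template is measured.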