
[CMSIS-NN] Fix stateful execution and batch-major striding for CMSIS-NN LSTM #3564

Open
veblush wants to merge 1 commit into
tensorflow:mainfrom
veblush:cm-lstm

@veblush veblush commented May 21, 2026

Collaborator

Problem

The current CMSIS-NN LSTM wrapper uses arm_lstm_unidirectional_s8 and arm_lstm_unidirectional_s16. These CMSIS-NN functions are designed for stateless sequence evaluation: they explicitly wipe the cell state at t=0 and ignore any initial hidden state, returning only the sequence outputs.

This breaks TFLM's streaming and embedded ML workloads, which rely on stateful LSTMs whose CellStateTensor and HiddenStateTensor persist as variable tensors across Invoke() calls.

Furthermore, CMSIS-NN's internal implementation for batch-major tensors (time_major=false with batch_size > 1) uses time_steps as the batch stride for every buffer it reads. That stride is correct for the input sequence but not for the contiguous hidden_state buffer, so it causes an out-of-bounds read there.
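The stride mismatch can be illustrated with plain index arithmetic. The sketch below is a simplified model, not CMSIS-NN code: `batch_start` and `read_in_bounds` are hypothetical helpers, and the assumption that batch `b` starts at `b * batch_offset * vec_len` is inferred from the description above rather than copied from the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: first element of batch b when consecutive batches
 * are `batch_offset` vectors of length `vec_len` apart. */
static size_t batch_start(size_t b, size_t batch_offset, size_t vec_len) {
  return b * batch_offset * vec_len;
}

/* Returns true if reading one vector for every batch stays inside a buffer
 * of `buffer_len` elements. */
static bool read_in_bounds(size_t batch_size, size_t vec_len,
                           size_t batch_offset, size_t buffer_len) {
  assert(batch_size > 0);
  /* One past the last element touched by the final batch. */
  size_t end = batch_start(batch_size - 1, batch_offset, vec_len) + vec_len;
  return end <= buffer_len;
}
```

With batch_size=2, time_steps=4 and hidden_size=5, a stride of time_steps makes batch 1 of the 10-element hidden_state buffer start at element 20, while a stride of 1 keeps it at element 5. The same stride of time_steps is valid for a batch-major input buffer, which holds time_steps vectors per batch.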

Solution

  1. Fallback to explicit looping: Implemented a manual time/batch loop within CMSIS_NN_EvalInteger8x8_16Lstm and CMSIS_NN_EvalInteger16x8_16Lstm that bypasses the stateless sequence evaluator and instead iteratively calls the single-step CMSIS-NN kernels (arm_nn_lstm_step_s8 and arm_nn_lstm_step_s16).
  2. State Persistence: The fallback loop properly preserves the CellStateTensor and HiddenStateTensor across timesteps and invocations.
  3. Stride Bug Bypass: For time_major=false, the loop evaluates one batch at a time (batch_size=1 passed to the kernel), which guarantees cache-friendly contiguous memory reads and avoids CMSIS-NN's batch striding bug entirely.
  4. Future-proofing: Introduced #ifdef CMSIS_NN_STATEFUL_LSTM. Once ARM merges an upstream fix that adds the optional hidden_state context pointer, defining this flag switches the wrapper back to the native CMSIS-NN sequence evaluator. (See "Fixed LSTM", ARM-software/CMSIS-NN#219.)
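The loop structure described in items 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `toy_step` is a hypothetical stand-in for a single-step kernel such as arm_nn_lstm_step_s8 (it is a simple accumulator, not LSTM math, and its signature is invented), and `eval_batch_major` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-step kernel: updates the state in place and writes
 * one output vector. Stands in for the real CMSIS-NN step function. */
static void toy_step(const int32_t *x, int32_t *hidden, int32_t *cell,
                     int32_t *out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    cell[i] += x[i];
    hidden[i] += cell[i];
    out[i] = hidden[i];
  }
}

/* Batch-major layout: input/output are [batch][time][n],
 * hidden/cell are [batch][n] and are NOT cleared here, so they carry over
 * between timesteps and between calls. */
static void eval_batch_major(const int32_t *input, int32_t *output,
                             int32_t *hidden, int32_t *cell,
                             size_t batch_size, size_t time_steps, size_t n) {
  assert(input && output && hidden && cell);
  for (size_t b = 0; b < batch_size; ++b) {
    /* One batch at a time: the step kernel only ever sees batch_size = 1. */
    int32_t *h = hidden + b * n;
    int32_t *c = cell + b * n;
    for (size_t t = 0; t < time_steps; ++t) {
      size_t off = (b * time_steps + t) * n;
      toy_step(input + off, h, c, output + off, n);
    }
  }
}
```

Calling `eval_batch_major` twice with the same state buffers continues from where the first call stopped, which is the behavior a streaming model expects across Invoke() calls. Because each inner loop walks a single batch, every read of the input, hidden and cell buffers is contiguous.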

BUG=N/A

@veblush veblush requested a review from a team as a code owner May 21, 2026 18:24
@veblush veblush added the ci:full Triggers the comprehensive cross-platform test suite. label May 21, 2026
@veblush veblush enabled auto-merge June 16, 2026 23:29
mansnils pushed a commit to ARM-software/CMSIS-NN that referenced this pull request Jun 17, 2026
This PR fixes two critical issues in `arm_lstm_unidirectional_s8` and
`s16` that prevent state persistence in streaming models and cause
out-of-bounds reads during non-time-major inference. These issues are
closely related to
tensorflow/tflite-micro#3564.

Problem:

- State Wiping: By default, `arm_lstm_unidirectional_*` unconditionally
sets `hidden_in` to `NULL` and memsets `cell_state` to 0. This discards
the `HiddenStateTensor` and `CellStateTensor` that TFLM relies on to
persist state across `Invoke()` calls for streaming models.
- Striding Bug: In the `time_major` = `false` block of
`arm_lstm_unidirectional_*`, CMSIS-NN attempts to jump between batches
by passing `batch_offset` = `params->time_steps` to
`arm_nn_lstm_step_*`. However, `arm_nn_lstm_step_*` forwards this
`batch_offset` to `arm_nn_vec_mat_mul_result_acc_s8_s16` for both the
`data_in` and `hidden_in` pointers. Since the `hidden_state` buffer is
contiguous (stride 1) and not strided like `data_in`, passing
`batch_offset` = `params->time_steps` causes out-of-bounds reads on the
`hidden_in` buffer at timestep t=0.

Solution:

- Adding a `hidden_state` pointer to `cmsis_nn_lstm_context`.
- Forwarding this `hidden_state` as `hidden_in` when present, skipping
the `cell_state` wiping if so.
- Explicitly iterating over the `batch_size` in the `time_major` =
`false` case when computing step sizes, which forces `batch_offset` = 1
and avoids the buggy out-of-bounds stride entirely while writing to the
final memory buffer sequentially.
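
The optional-pointer behavior in the first two bullets can be sketched as follows. This is an illustration only: `toy_lstm_context` is a hypothetical, reduced stand-in for `cmsis_nn_lstm_context` (the real struct's other fields and exact types are not shown here), and `prepare_state` is an invented helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced stand-in for cmsis_nn_lstm_context. */
typedef struct {
  int16_t *cell_state;  /* always provided by the caller */
  int8_t *hidden_state; /* optional: NULL selects the stateless path */
} toy_lstm_context;

/* Returns the pointer to use as hidden_in for the first timestep.
 * Stateful path: hidden_state is set, so the cell state is left untouched.
 * Stateless path: the cell state is zeroed and hidden_in is NULL. */
static const int8_t *prepare_state(toy_lstm_context *ctx, size_t cell_count) {
  assert(ctx != NULL && ctx->cell_state != NULL);
  if (ctx->hidden_state != NULL) {
    return ctx->hidden_state;
  }
  memset(ctx->cell_state, 0, cell_count * sizeof(int16_t));
  return NULL;
}
```

Keeping the pointer optional means existing callers that never set it still get the original stateless behavior.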