feat: context sliding window (infinite generation beyond max_seq_len)#31

Open
cstroie wants to merge 15 commits into RightNow-AI:main from cstroie:sliding-window

cstroie commented May 7, 2026

Summary

  • Implements KV cache sliding window so generation continues past max_seq_len without truncation
  • When pos - kv_shift >= max_seq_len, kvcache_slide() evicts half of the non-prefix entries with a memmove in every layer, then increments kv_shift. The physical slot for a token is always logical_pos - kv_shift
  • Adds -w <int> flag (keep_prefix) to pin BOS / system-prompt tokens at the front of the cache so they are never evicted
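
The eviction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the struct `kv_sketch_t`, its field layout (one contiguous FP16 buffer holding all layers), and the function name `kvcache_slide_sketch` are hypothetical, and a real cache would hold separate key and value buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified cache: n_layers * max_seq_len slots,
 * each slot holding kv_dim FP16 values. The real layout may differ. */
typedef struct {
    uint16_t *data;   /* n_layers * max_seq_len * kv_dim values */
    int n_layers;
    int max_seq_len;
    int kv_dim;
    int keep_prefix;  /* slots [0, keep_prefix) are never evicted */
    int kv_shift;     /* total number of evicted logical positions */
} kv_sketch_t;

/* Evict the oldest half of the non-prefix slots in every layer by moving
 * the surviving slots down to sit right after the pinned prefix.
 * Returns the number of slots evicted. */
int kvcache_slide_sketch(kv_sketch_t *kv)
{
    assert(kv->keep_prefix >= 0 && kv->keep_prefix < kv->max_seq_len);

    int movable = kv->max_seq_len - kv->keep_prefix;
    int evict = movable / 2;
    if (evict < 1) evict = 1;            /* always free at least one slot */
    int survivors = movable - evict;
    size_t slot_bytes = (size_t)kv->kv_dim * sizeof(uint16_t);

    for (int l = 0; l < kv->n_layers; l++) {
        uint16_t *base = kv->data + (size_t)l * kv->max_seq_len * kv->kv_dim;
        uint16_t *dst = base + (size_t)kv->keep_prefix * kv->kv_dim;
        uint16_t *src = dst + (size_t)evict * kv->kv_dim;
        /* Regions overlap, so memmove (not memcpy) is required. */
        memmove(dst, src, (size_t)survivors * slot_bytes);
    }

    kv->kv_shift += evict;
    return evict;
}
```

With max_seq_len = 8 and keep_prefix = 2, one call evicts 3 slots, so the next token at logical position 8 lands in physical slot 8 - 3 = 5. No allocation happens; the function only touches the buffer it is given.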

Compliance with CONTRIBUTING.md

| Rule | Status |
| --- | --- |
| Zero dependencies (libc/libm/libpthread only) | kvcache_slide uses only memmove |
| No malloc during inference | ✅ operates entirely on pre-allocated KV cache buffers |
| Works on 256 MB RAM | ✅ memory footprint unchanged; same fixed KV cache |
| Plain C11, no C++ | |
| snake_case / 4-space indent / type_t suffix | |
Verification output (x86-64, AVX2, TinyLlama 1.1B Q4_K_M)

# make clean && make native
gcc -O3 -std=c11 -D_GNU_SOURCE -Wall -Wextra -Wpedantic -march=native -o picolm ...

# greedy test
$ ./picolm model.gguf -p "The capital of France is" -n 20 -t 0
 Paris.
2. B.C. The capital of ancient Rome was Rome.
3...

# JSON mode
$ ./picolm model.gguf --json -p "Return JSON with a name" -n 50 -t 0.3
[]

# memory check
$ ./picolm model.gguf -p "Hello" -n 10 2>&1 | grep Memory
Memory: 45.17 MB runtime state (FP16 KV cache)

Hardware tested

  • x86-64 Linux, AVX2 (Intel/AMD Haswell+)

🤖 Generated with Claude Code

cstroie and others added 15 commits April 16, 2026 10:52
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add,
  vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit
(no AVX2 required); only float accumulators widen to 256-bit.
Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge simd branch cleanups into avx2:
- quant.h: unify detection into single hierarchy block (AVX2→AVX→SSE3→SSE2),
  drop redundant old guards; add hsum_avx under #ifdef PICOLM_AVX
- quant.c: trim AVX2/AVX block comments to single lines per style guide
- picolm.c: keep SIMD-tier startup print, drop removed --mem print
- model.c: remove stray blank line
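
A single detection hierarchy of the kind described for quant.h could look roughly like the sketch below. It is an assumption-laden illustration: the commit messages name PICOLM_AVX2, PICOLM_AVX and PICOLM_SSE2, while PICOLM_SSE3, the `SKETCH_TIER_NAME` macro and `simd_tier_name()` are hypothetical. Deriving the tiers from the compiler's predefined `__AVX2__` / `__AVX__` / `__SSE3__` / `__SSE2__` macros is also an assumption; the project may set them from its make targets instead.

```c
/* One block decides the tier, highest first (AVX2 -> AVX -> SSE3 -> SSE2).
 * Here a higher tier also enables every lower one, so helpers guarded by a
 * lower-tier macro stay visible to the higher tiers. */
#if defined(__AVX2__)
#  define PICOLM_AVX2 1
#endif
#if defined(PICOLM_AVX2) || defined(__AVX__)
#  define PICOLM_AVX 1
#endif
#if defined(PICOLM_AVX) || defined(__SSE3__)
#  define PICOLM_SSE3 1   /* hypothetical name, not confirmed by the source */
#endif
#if defined(PICOLM_SSE3) || defined(__SSE2__)
#  define PICOLM_SSE2 1
#endif

/* Dispatch chain: exactly one branch is compiled in. */
#if defined(PICOLM_AVX2)
#  define SKETCH_TIER_NAME "AVX2"
#elif defined(PICOLM_AVX)
#  define SKETCH_TIER_NAME "AVX"
#elif defined(PICOLM_SSE3)
#  define SKETCH_TIER_NAME "SSE3"
#elif defined(PICOLM_SSE2)
#  define SKETCH_TIER_NAME "SSE2"
#else
#  define SKETCH_TIER_NAME "scalar"
#endif

/* Name of the tier selected at compile time, e.g. for a startup print. */
const char *simd_tier_name(void)
{
    return SKETCH_TIER_NAME;
}
```

Keeping detection in one place means the per-function dispatch chains only test the PICOLM_* macros and never the compiler macros directly.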

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it
invisible to the AVX and SSE2 branches that use it. Move it to a
dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before
the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from
-march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat: add --mem parameter to load model into RAM instead of mmap

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* fix: fix model_load call and signed comparison warning

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* feat: add --mem option for model loading mode selection

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* feat: add fast mode with optimized parameters for better performance

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

* feat: enhance performance with SSE2 optimizations and update build configurations

* feat: add AVX support for optimized vector operations and enhance performance

* docs: update README for AVX support, build targets, and --mem option

- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: enhance SIMD support with AVX2 optimizations and update build configurations

* feat: add SSE2/SSE3/AVX SIMD tiers for x86 inference

Adds compile-time x86 SIMD tiers via make sse2/sse3/avx targets:
- AVX (8-wide float): rmsnorm, softmax, rope, elemwise_mul, vec_add,
  vec_dot_f32, vec_dot_q4_K, vec_dot_q6_K
- SSE3: cleaner RoPE rotation using _mm_addsub_ps (no sign-mask workaround)
- SSE2: 4-wide baseline for all the above ops

Q4_K and Q6_K dot products keep integer nibble extraction at 128-bit
(no AVX2 required); only float accumulators widen to 256-bit.
Scalar fallback preserved for all paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: move Q6K_CONV macro outside #ifdef chain; document make static SSE2 intent

Q6K_CONV was defined inside the #ifdef PICOLM_AVX2 block, making it
invisible to the AVX and SSE2 branches that use it. Move it to a
dedicated #if defined(PICOLM_AVX) || defined(PICOLM_SSE2) guard before
the dispatch chain so all consuming branches see it.

Add a comment to `make static` explaining the deliberate switch from
-march=native to -msse2 (portable static binary, runs on any x86-64).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…document Q6K_CONV idiom

- Add -mfma to make avx and make avx2 so GCC can fuse multiply-add pairs
  in hot dot-product loops; AVX2 CPUs (Haswell and later) support FMA3,
  but AVX-only CPUs such as Sandy Bridge and Ivy Bridge do not
- Fix vec_dot_f32_f32 AVX path: add 8-wide cleanup pass between the
  16-wide main loop and scalar tail so sizes that are multiples
  of 8 but not 16 don't leave 8 elements to the scalar tail
- Add comment to Q6K_CONV explaining the non-obvious unpacklo/srai
  sign-extension idiom (byte→int16 widening without SSE4.1)
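
The unpacklo/srai widening mentioned in the last bullet can be illustrated with the sketch below. It shows the general SSE2 idiom only and is not the actual Q6K_CONV macro; the function name `widen_i8_to_i16_lo8` is hypothetical, and a scalar fallback is included so the sketch also builds on non-x86 targets.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Widen 8 signed bytes at src into 8 int16 values at dst. */
void widen_i8_to_i16_lo8(const int8_t *src, int16_t *dst)
{
#if defined(__SSE2__)
    /* Load 8 bytes into the low half of a 128-bit register. */
    __m128i v = _mm_loadl_epi64((const __m128i *)src);
    /* Interleave zero (low byte) with each input byte (high byte), so every
     * 16-bit lane holds the input byte shifted left by 8. */
    __m128i hi = _mm_unpacklo_epi8(_mm_setzero_si128(), v);
    /* The arithmetic right shift by 8 moves the byte back down while
     * replicating its sign bit, giving a sign-extended int16. */
    __m128i w = _mm_srai_epi16(hi, 8);
    _mm_storeu_si128((__m128i *)dst, w);
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; i++) dst[i] = (int16_t)src[i];
#endif
}
```

SSE4.1 offers `_mm_cvtepi8_epi16` for the same conversion in one instruction; the two-step form above needs only SSE2, which is presumably why the commit uses it.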

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When pos reaches max_seq_len, kvcache_slide() evicts half the
non-prefix entries (memmove across all layers) and increments
kv_shift so physical_slot = logical_pos - kv_shift stays in bounds.
The attention loop iterates over filled physical slots, and RoPE
lookup wraps via pos % max_seq_len to reuse the pre-computed table.

- Add keep_prefix to model_config_t (-w flag) to pin BOS/system prompt
- Add kv_shift to run_state_t to track evicted positions
- Remove total_steps cap in picolm.c; generation now runs past max_seq_len
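
The index arithmetic in this commit message can be traced with a small sketch of the caller-side bookkeeping. Everything here is illustrative: `window_sketch_t`, `window_slot` and `rope_index` are hypothetical names, the cache memmove itself is omitted, and positions are assumed to arrive one at a time in increasing order.

```c
#include <assert.h>

/* Hypothetical bookkeeping state; only the fields needed for indexing. */
typedef struct {
    int max_seq_len;
    int keep_prefix;
    int kv_shift;     /* number of logical positions evicted so far */
} window_sketch_t;

/* Return the physical cache slot for logical position pos, sliding the
 * window first when the cache is full. A real implementation would also
 * move the cached entries at the point where kv_shift changes. */
int window_slot(window_sketch_t *w, int pos)
{
    if (pos - w->kv_shift >= w->max_seq_len) {
        int evict = (w->max_seq_len - w->keep_prefix) / 2;
        if (evict < 1) evict = 1;
        w->kv_shift += evict;
    }
    int slot = pos - w->kv_shift;
    assert(slot >= 0 && slot < w->max_seq_len);
    return slot;
}

/* Index into a RoPE table that was precomputed for max_seq_len positions. */
int rope_index(const window_sketch_t *w, int pos)
{
    return pos % w->max_seq_len;
}
```

For max_seq_len = 8 and keep_prefix = 2, positions 0 to 7 map to slots 0 to 7; position 8 triggers a slide of 3 and maps to slot 5, and position 11 triggers a second slide and again maps to slot 5.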

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>