
Twitter/X launch thread — Working Memory Cliff

~10 tweets, ~280 chars each. Designed to drive readers to the HF blog post and the arXiv-style tech report on GitHub.


1/ We measured how much of a long context two popular edge-device LLMs (Llama-3.2-1B-Q8, Llama-3.2-3B-Q4) can actually use.

Both fall off a cliff at less than 1% of their advertised 128K context window.

This is the empirical reality check the "Beyond RAG" argument needed. 🧵


2/ The 3B cliff is a step function:

ctx    retrieval
1024   100% (18/18)
1280     0%  (0/18)

256 more tokens of context: perfect → broken. No gradual degradation in between. The model just stops following the instruction.


3/ The 1B cliff is graded but starts much earlier:

ctx    fp32 KV   6.4× compressed KV
512    100%      100%
1024    44%       22%
1536     0%        0%

Same wikitext-2 haystack, same three needles, n=9 per cell.


4/ Important: KV cache compression (6.4×) produces outputs bit-for-bit identical to the FP32-KV baseline in 18 of 20 cells.

The single disagreement is at the cliff cell, where both conditions are in the statistical noise anyway.

The cliff is a model property, not a KV property.


5/ Past the cliff, the failure mode is not "I don't know."

Three modes we observed:

  1. Wikitext continuation (model picks up the haystack)
  2. Header echoes ("= = = 2008 II =")
  3. Synthesised hallucinations — and #3 is the consequential one.

6/ Example synthesised hallucination (1B, fp32 KV, ctx=1024):

Haystack: Wikipedia article about Robert Boulter (English actor). Needle: "The CFO of Northwind Logistics is Sarah Chen, hired in 2023."

Model output: "In 2023 Boulter was hired as the chief financial officer..."

It fused them.


7/ This is the same silent-hallucination failure mode that vector RAG produces on retrieval miss — happening in the regime that was supposed to eliminate it.

"Load the whole document, skip the chunker, hallucination goes away." Works at frontier scale. Doesn't work at 1B–3B Q4.


8/ Why this matters for "Beyond RAG":

  • Memory-wise, 128K context fits in 9.5GB on a 16GB Mac with 6.4× KV compression. ✅
  • Retrieval-wise, the 3B Q4 model stops following instructions at ~1024 tokens, regardless of compression. ❌

That's a two-to-three order-of-magnitude gap between nominal and usable context.


9/ Honest scope: 2 models, 3 needles, 1 language (English), 1 domain (Wikipedia bio). 198 NIAH trials total.

The cliff might:

  • Sit higher for 8B+ models
  • Move with prompt format
  • Be reducible with long-context fine-tuning

If you can falsify it, we want the data.


10/ Full tech report (arXiv-style): github.com/quantumaikr/quant.cpp/blob/main/docs/paper/working-memory-cliff.md

HF blog post: [link when published]

Reproduce in 5 minutes: github.com/quantumaikr/quant.cpp/blob/main/bench/niah_test.sh

Pull requests welcome. /end


Posting strategy

  • Post #1 alone, wait 1-2 min for early signal
  • Then thread the rest as a single drop
  • Pin the thread on the project account
  • Cross-post the lead tweet to:
    • r/LocalLLaMA (with the title: "Measuring the working memory cliff of Llama-3.2-1B/3B: 198 NIAH trials, KV compression is bit-for-bit identical to baseline, both models fall off a cliff at <1% of nominal context")
    • Hacker News (as a Show HN post with the same title)
  • Tag in HF post: @huggingface @abacaj @karpathy (only if they engage with edge-LLM benchmarking content; do not spam-tag)

Anticipated responses + drafts

Q: "n=9 is too small." A: We agree. Cliff-cell seed sweep with 5 seeds × 3 needles = 15/cell is in the tech report appendix. The headline cells have wider error bars; the cliff direction does not change.

Q: "Why not run 8B?" A: Tried. Q4_K_M GGUF on Metal runs at ~12 min per inference for 3B, not feasible for a 90-trial grid. We're working on the Metal Q4_K_M kernel; expect 8B numbers in v2.

Q: "Is this a quant.cpp bug? Does llama.cpp do the same?" A: We tested with the exact same Q8_0 GGUF. The cliff reproduces under both quant.cpp's default loader and the FP32-weights path (TQ_NO_Q4=1). It is not specific to a quantization scheme.

Q: "Did you try a system prompt?" A: We tried six prompt formats including system prompts, structured <<<DOCUMENT>>> delimiters, few-shot Q/A, and raw continuation. The simple format we report was the best of the six. Format-sensitivity is documented in §5.1 of the report.

Q: "RULER / LongBench would be better." A: Yes. RULER is the obvious next step and is in §8 (Future Work). This is a v1 measurement explicitly scoped to the question "where does the cliff sit at edge scale" — not a comprehensive long-context benchmark.