Migration from `LlamaChatSession.prompt(flatString)` to `LlamaChat.generateResponse(ChatHistoryItem[])` for proper system/user/model role separation.
Status: ALL ISSUES RESOLVED (22 total across 4 audit passes, verified in 5th pass)
- Severity: CRITICAL (20-40% speed regression)
- What happened: `InputLookupTokenPredictor` was passed per-call to `session.prompt()`. When migrating to `LlamaChat.generateResponse()`, which doesn't accept `tokenPredictor` as a per-call option, it was simply removed.
- Root cause: `tokenPredictor` belongs on the `LlamaContextSequence`, not the generation call.
- Status: ✅ FIXED — `tokenPredictor` now passed to `context.getSequence({ tokenPredictor })` in initialize(), resetSession(), and all electron-main.js recovery paths (see the sketch below).
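A minimal sketch of the fixed wiring, assuming node-llama-cpp v3's `LlamaChat`/`InputLookupTokenPredictor` API; the initialization shape and model path are illustrative, and predictor construction options are omitted.

```js
import {getLlama, LlamaChat, InputLookupTokenPredictor} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"});
const context = await model.createContext();

// The predictor belongs to the sequence, not to the generation call:
const sequence = context.getSequence({
    tokenPredictor: new InputLookupTokenPredictor()
});
const chat = new LlamaChat({contextSequence: sequence});
```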
- Severity: HIGH (suboptimal context truncation across turns)
- What happened:
lastEvaluation.contextShiftMetadatawas stored but never passed back to subsequentgenerateResponse()calls. - Root cause: Oversight — only
lastEvaluationContextWindow.historywas passed, not the context shift metadata. - Status: ✅ FIXED —
contextShift.lastEvaluationMetadatanow passed fromthis.lastEvaluation.contextShiftMetadata.
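A sketch of the per-turn pass-back under node-llama-cpp's external chat-state pattern (it also shows the explicit `contextShift.strategy` covered by a later issue); `chat` is the `LlamaChat` instance from the previous sketch, and the variable names are illustrative.

```js
let chatHistory = chat.chatWrapper.generateInitialChatHistory();
let lastEvaluation;   // kept from the previous generateResponse() call

chatHistory.push({type: "user", text: "Hello!"});

const res = await chat.generateResponse(chatHistory, {
    lastEvaluationContextWindow: {
        history: lastEvaluation?.contextWindow          // reuse the already-evaluated window
    },
    contextShift: {
        strategy: "eraseFirstResponseAndKeepFirstSystem",
        lastEvaluationMetadata: lastEvaluation?.contextShiftMetadata   // the missing pass-back
    }
});
lastEvaluation = res.lastEvaluation;                     // store for the next turn
```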
- Severity: MEDIUM (role confusion for small models + stale tool defs/memory/RAG)
- What happened: Tool execution results were sent as plain strings or user turns. System context (tool definitions, memory, RAG, file context) was only set on the first iteration.
- Root cause: Nudge/continuation/summary prompts were plain strings instead of structured `{systemContext, userMessage}`.
- Status: ✅ FIXED — Every prompt in the agentic loop now uses structured format with `buildBasePrompt()` for fresh system context on every iteration: browser nudges, hallucination nudges, truncation nudges, normal tool-feedback continuations, final summary, and find-bug handler (see the sketch below).
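An illustrative shape for one tool-feedback continuation prompt; only `buildBasePrompt()` and the `{systemContext, userMessage}` structure come from the text above, the tool variables are assumptions.

```js
const currentPrompt = {
    // Rebuilt every iteration: tool definitions, memory, RAG, and file context stay fresh.
    systemContext: buildBasePrompt(),
    // The tool result goes in as a proper user turn instead of a bare string.
    userMessage: `Tool "${toolCall.name}" returned:\n${toolResultText}\n\nContinue with the original task.`
};
```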
- Severity: MEDIUM (one-time ~500ms re-encode penalty after utility calls)
- What happened:
generate()used a temporary history on the same LlamaChat/sequence. After it ran,lastEvaluationwas nulled, forcing full re-encode on nextgenerateStream(). - Root cause: Single sequence shared between one-shot utility calls and the agentic loop.
- Status: ✅ FIXED —
generate()now creates a temporary secondary sequence viacontext.getSequence(), uses a temporaryLlamaChaton that sequence, then disposes both in afinallyblock. Falls back to main chat if the context doesn't support additional sequences, withlastEvaluation = nullonly in the fallback case.
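A sketch of the isolated one-shot path under those assumptions; `_generateOnMainChat()` is a hypothetical stand-in for the fallback through the main chat, and `res.response` assumes the plain-text response shape.

```js
async generate(prompt, options = {}) {
    let tempSequence;
    try {
        tempSequence = this.context.getSequence();   // throws when no sequence slot is free
    } catch {
        this.lastEvaluation = null;                  // only the fallback pays the re-encode cost
        return this._generateOnMainChat(prompt, options);
    }

    const tempChat = new LlamaChat({contextSequence: tempSequence});
    try {
        const res = await tempChat.generateResponse([
            {type: "system", text: this._getSystemPrompt()},
            {type: "user", text: prompt}
        ], options);
        return res.response;
    } finally {
        // Dispose both so the slot is actually released (see the leak fix later in this list).
        await tempChat.dispose();
        tempSequence.dispose?.();
    }
}
```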
- Severity: MEDIUM (memory + token waste in long automation sessions)
- What happened: Each agentic iteration adds user+model entries. Over 50+ iterations, chatHistory can have 100+ entries before the 80% rotation triggers.
- Root cause: No explicit trimming of chatHistory between rotations.
- Status: ✅ FIXED — New `_compactHistory()` method trims old entries when chatHistory exceeds 200 entries (~100 exchanges), preserving the system message and the most recent 80%. Called before every `generateResponse()` in `generateStream()`. Invalidates `lastEvaluation` when compaction occurs so node-llama-cpp re-encodes with the trimmed history (see the sketch below).
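A minimal sketch of the compaction, using the thresholds stated above; the real method may trim on exchange boundaries rather than a raw slice.

```js
_compactHistory() {
    const MAX_ENTRIES = 200;
    if (this.chatHistory.length <= MAX_ENTRIES) return;

    const [systemMessage, ...rest] = this.chatHistory;
    const keep = Math.floor(MAX_ENTRIES * 0.8);                 // most recent 80%
    this.chatHistory = [systemMessage, ...rest.slice(-keep)];

    // The trimmed history no longer matches the KV cache, so force a fresh
    // encode on the next generateResponse() call.
    this.lastEvaluation = null;
}
```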
- Severity: LOW (pre-existing behavior, not a regression)
- What happened: After the first iteration, nudge/tool-feedback prompts were plain strings — the system message in chatHistory wasn't refreshed.
- Status: ✅ FIXED (merged with #3) — All prompts now use structured format, which calls `buildBasePrompt()` fresh on every iteration to refresh tool defs, memory, RAG, and file context in the system message.
- Severity: LOW (false abortion of long generations)
- What happened: The repetition counter in `generateStream()` only incremented, never decayed. After enough tokens, even normal text would accumulate a count > 5 and abort.
- Status: ✅ FIXED — Counter now decays by 1 when the current window shows no repetition: `repetitionCount = Math.max(0, repetitionCount - 1)`.
- Severity: MEDIUM (memory leak on context rotation)
- What happened:
resetSession()could try to reusethis.chat.sequencewithout first disposing the oldLlamaChatinstance, leaving it alive and holding references. - Status: ✅ FIXED —
resetSession()now always callsthis.chat.dispose()before anything else.
- Severity: LOW (last few characters of a response silently swallowed)
- What happened: The
tagBufferused for partial<think>tag detection held trailing characters that were never emitted if the response ended mid-buffer. - Status: ✅ FIXED — After
generateResponse()returns, any remaining non-thinking tagBuffer content is flushed tofullResponseand emitted viaonToken.
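A sketch of the post-generation flush; `insideThinkTag` is an assumed flag, the other names follow the text above.

```js
const result = await this.chat.generateResponse(this.chatHistory, options);

// Trailing characters that looked like the start of a <think> tag but never
// completed one: emit them so the tail of the response isn't swallowed.
if (tagBuffer.length > 0 && !insideThinkTag) {
    fullResponse += tagBuffer;
    onToken?.(tagBuffer);
    tagBuffer = "";
}
```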
- Severity: LOW (conversation context silently trimmed)
- What happened: On abort with no
fullResponse, the user message was popped from chatHistory (error handler) and never re-added, so the model wouldn't know what the user asked. - Status: ✅ FIXED — Both abort paths (with and without partial response) now always re-add the user message and add a placeholder model response
['[Generation cancelled]'].
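A sketch of the repaired abort handling, folded into one hypothetical helper (`_recordAbortedTurn` is not the project's name); sanitizing the partial response is the separate fix documented further below.

```js
_recordAbortedTurn(userMessage, fullResponse) {
    // The user turn always stays so the model still knows what was asked.
    this.chatHistory.push({type: "user", text: userMessage});
    // The model turn records either the sanitized partial output or a placeholder.
    this.chatHistory.push({
        type: "model",
        response: [fullResponse ? this._sanitizeResponse(fullResponse) : "[Generation cancelled]"]
    });
}
```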
- Severity: LOW (fragile default dependency)
- What happened: The
contextShiftoptions only includedstrategywhenlastEvaluationexisted (via the metadata pass-back). On the first call, no strategy was set, relying on node-llama-cpp's undocumented default. - Status: ✅ FIXED —
contextShift.strategy: 'eraseFirstResponseAndKeepFirstSystem'is now explicitly set on EVERY call, both with and without priorlastEvaluation.
- Severity: MEDIUM (resource waste, potential sequence limit exhaustion)
- What happened:
llmEngine.context.getSequence()was called in two places (post-generation context usage + proactive rotation check) just to readnTokens. Each call creates a NEWLlamaContextSequencethat's never disposed. - Status: ✅ FIXED — Both locations now use
llmEngine.sequence.nTokens(the existing sequence reference) instead of creating new ones.
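Before/after in sketch form; `contextSize` and the 80% threshold are assumptions drawn from the rotation behaviour described earlier.

```js
// Before (leak): llmEngine.context.getSequence().nTokens allocated a fresh
// LlamaContextSequence on every check and never disposed it.

// After: read the sequence the engine already owns.
const used = llmEngine.sequence.nTokens;
const capacity = llmEngine.context.contextSize;
if (used / capacity > 0.8) {
    await llmEngine.resetSession();   // proactive rotation
}
```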
- Severity: TRIVIAL (code hygiene)
- What happened:
_getStopSequences()was defined but never called anywhere. - Status: ✅ FIXED — Removed.
- Severity: CRITICAL (method was broken)
- What happened: During the
_getStopSequences()removal edit, the beginning of_sanitizeResponse()was accidentally eaten, leaving the file with a garbled method body. - Status: ✅ FIXED — Full method restored: JSDoc, function signature, regex cleanup, dedup loop, whitespace normalization.
- Severity: MODERATE (KV cache reuse optimization defeated)
- What happened: After
generateResponse(), we manually pushed{ type: 'model', response: [text] }to chatHistory instead of usingresult.lastEvaluation.cleanHistory— the canonical post-generation history from node-llama-cpp. - Root cause: Didn't follow node-llama-cpp's official external-chat-state pattern:
chatHistory = res.lastEvaluation.cleanHistory. - Impact: Our manually managed chatHistory could diverge from the actual KV cache state, causing
lastEvaluationContextWindowto fail overlap checks and re-encode the entire history from scratch on every call. - Status: ✅ FIXED — Now uses
result.lastEvaluation.cleanHistoryas the canonical chatHistory after successful generation. Falls back to manual push for edge cases (older API versions).
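A sketch of the canonical-history adoption; the optional-chaining guard is the fallback mentioned above, and `result.response` assumes the plain-text response shape.

```js
const result = await this.chat.generateResponse(this.chatHistory, options);

if (result.lastEvaluation?.cleanHistory) {
    // Canonical: this is exactly what node-llama-cpp has in the KV cache.
    this.chatHistory = result.lastEvaluation.cleanHistory;
} else {
    // Older API shapes: fall back to the manual push.
    this.chatHistory.push({type: "model", response: [result.response]});
}
this.lastEvaluation = result.lastEvaluation;
```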
- Severity: MODERATE (reinforces the original "assistant\nassistant" bug)
- What happened: When generation was aborted with a partial response, the raw
fullResponsewas stored in chatHistory without sanitization. The streaminggarbageTokenRegexcatches special tokens like<|im_start|>, but NOT bare turn indicators (assistant,useron their own line). Those are only caught by_sanitizeResponse(), which was only applied to the returned text, not the stored text. - Root cause:
_sanitizeResponse()was called for the return value but not for the chatHistory entry. - Status: ✅ FIXED — Abort path now calls
this._sanitizeResponse(fullResponse)before pushing to chatHistory.
- Severity: MINOR (sequence slot leak, mitigated by fallback)
- What happened: In
generate(), a temporary sequence was created viacontext.getSequence(), its tokens erased in thefinallyblock, but the sequence handle itself was never disposed. Over manygenerate()calls, sequence slots could be exhausted. - Root cause: Oversight — only token erasure was implemented, not handle disposal.
- Mitigation: When slots run out,
getSequence()throws and the code falls back to the main chat. But this is sloppy. - Status: ✅ FIXED — Added
tempSequence.dispose?.()in thefinallyblock after token erasure.
- Severity: SIGNIFICANT (runtime crash on app shutdown)
- What happened: The
app.on('before-quit', async () => { playwrightBrowser.dispose(); ... })block was corrupted, withappmerged intoplaywrightBrowserto formapplaywrightBrowser.dispose();and the callback split onto the next line asp.on('before-quit', ...). - Root cause: Likely a prior editing accident (NOT introduced by this migration).
- Impact:
applaywrightBrowserandpare undefined — runtimeReferenceErroron app quit, meaning cleanup (memory disposal, model disposal) never runs. - Status: ✅ FIXED — Restored to correct
app.on('before-quit', async () => { playwrightBrowser.dispose(); ... })structure.
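The restored handler in sketch form; the two extra cleanup calls are stand-ins for the memory and model disposal the text mentions, with assumed names, not an exact listing.

```js
app.on('before-quit', async () => {
    await playwrightBrowser.dispose();   // close the automation browser
    await memoryStore?.dispose?.();      // memory disposal (name assumed)
    await llmEngine?.dispose?.();        // model disposal (name assumed)
});
```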
- Severity: CRITICAL (directly causes the original "assistant\nassistant" stuttering bug)
- What happened: The combined regex in
_sanitizeResponse()FAILED to match<|im_start|>,<|im_start|>assistant, and<|im_start|>system— the exact ChatML tokens causing the original bug. The streaminggarbageTokenRegexcaught them correctly, but_sanitizeResponse()(used on abort paths, final cleanup, AND stored in chatHistory) did NOT. - Root cause:
<|im_start|>was nested INSIDE a<|...|>wrapper group:<\|(?:...|im_start\|>(?:...)|...)\|>. The|>inim_start|>was consumed as the group's closing\|>, leaving the optional(?:system|user|assistant)?and the outer\|>unable to match. - Evidence: Proven via
node -etest — old regex returned MISSED for allim_startvariants. New regex matched all 19 token types. - Fix: Made
<\|im_start\|>(?:system|user|assistant)?a top-level alternative in the regex, not nested inside<\|(...)\|>. - Status: ✅ FIXED — Regex verified against 19 test tokens (all MATCHED) via terminal.
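A cut-down reproduction in the spirit of that `node -e` check; both patterns here are simplified stand-ins covering only three token families, not the project's full regex.

```js
const broken = /<\|(?:im_end|im_start\|>(?:system|user|assistant)?|endoftext)\|>/;
const fixed  = /<\|im_start\|>(?:system|user|assistant)?|<\|(?:im_end|endoftext)\|>/;

for (const token of ["<|im_start|>", "<|im_start|>assistant", "<|im_start|>system", "<|im_end|>"]) {
    console.log(
        token.padEnd(22),
        "broken:", broken.test(token) ? "MATCHED" : "MISSED",
        "fixed:", fixed.test(token) ? "MATCHED" : "MISSED"
    );
}
// The broken pattern misses every <|im_start|> variant: its inner "im_start\|>"
// consumes the "|>" that the wrapper group's trailing \|> still needs to match.
```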
- Severity: MODERATE (unnecessary KV cache re-encoding overhead)
- What happened: In
generateStream(), the system message atchatHistory[0]was replaced with a new object on EVERY iteration, even when the text was identical. After adoptingcleanHistory(fix #15), this destroyed the object from node-llama-cpp's output and forced re-tokenization. - Root cause: Missing equality check — always created a new
{ type: 'system', text: fullSystemText }object. - Impact: In multi-iteration agentic loops where
systemContextdoesn't change between iterations (common case), every iteration would force node-llama-cpp to re-tokenize and potentially re-encode the entire system prefix. - Fix: Added string equality check: only replace the system message object when
this.chatHistory[0].text !== fullSystemText. - Status: ✅ FIXED
- Severity: TRIVIAL (code hygiene)
- What happened:
_getSystemPrompt()ended with;;— a double semicolon creating an empty statement. - Status: ✅ FIXED
- Severity: TRIVIAL (code hygiene)
- What happened:
let isFirstIteration = true;was declared in the agentic loop setup but never read anywhere. - Status: ✅ FIXED — Removed.
Files changed:
- `main/llmEngine.js` — All engine-level fixes
- `electron-main.js` — Structured prompts, sequence reference fixes, find-bug handler
- `LLAMACHAT_MIGRATION_ISSUES.md` — This document
Verification checklist:
- Every `currentPrompt =` assignment uses structured `{systemContext, userMessage}` format
- No calls to `context.getSequence()` outside of initialize/resetSession/recovery/generate-utility
- `_compactHistory()` called before every `generateResponse()`
- `tagBuffer` flushed after generation completes
- `repetitionCount` decays on non-repetitive windows
- `resetSession()` disposes old chat before creating new one
- `contextShift.strategy` explicitly set on every call
- Abort handler preserves user message in both paths
- `generate()` uses separate sequence with proper cleanup
- `_getStopSequences()` dead code removed
- `_sanitizeResponse()` fully restored and verified
- `cleanHistory` from `lastEvaluation` used as canonical chatHistory
- Abort path sanitizes partial response before storing in chatHistory
- `tempSequence.dispose()` called in generate() finally block
- `app.on('before-quit')` corruption repaired
- `_sanitizeResponse()` regex matches ALL ChatML tokens (19/19 verified via `node -e`)
- System message equality check prevents unnecessary KV re-encoding
- Double semicolon in `_getSystemPrompt()` removed
- Dead `isFirstIteration` variable removed from electron-main.js
- 5th verification pass: all flows traced end-to-end (normal, abort, overflow, one-shot, reset, dispose) — zero new issues