
Commit 68ed26c

author
Brendan Gray
committed
v1.8.54: Architectural fix — remove maxTokens cap, enable native context shift mid-generation
- Replace single maxResponseTokens with generationMaxTokens (=totalCtx) + budgetResponseTokens (=25% cap for prompt assembly only)
- Model generates freely; native context shift fires mid-generation to manage KV cache
- Raise pre-gen compression threshold from 75% to 90% (safety net only)
- Raise pre-gen compression target from 55% to 70% (preserve more context)
- Root cause: three competing systems created a death spiral (maxTokens cap + pre-gen compression + never-firing native shift)
- Frontend verified: no conflicts in ChatPanel, streaming, code block merge
- Backend verified: all maxResponseTokens refs eliminated, merge order confirms caller params win
- Includes prior fixes: A-F, G, H-overlap, I2, D5, D6, Fix 8/9, Fix C-Extended, nativeContextStrategy
1 parent 1fd8a8c commit 68ed26c

21 files changed

Lines changed: 1944 additions & 748 deletions
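The budget split described in the commit message can be sketched as plain arithmetic. This is a minimal illustration, not the repository's code: `computeTokenBudgets` and `planPreGenCompression` are hypothetical names, and the `>=` comparison against the threshold is an assumption. Only `generationMaxTokens`, `budgetResponseTokens`, and the 25% / 90% / 70% figures come from the commit message.

```javascript
// Hypothetical helpers illustrating the commit message; not taken from the repository.
const RESPONSE_BUDGET_FRACTION = 0.25;      // "25% cap for prompt assembly only"
const PRE_GEN_COMPRESSION_THRESHOLD = 0.90; // raised from 0.75 (safety net only)
const PRE_GEN_COMPRESSION_TARGET = 0.70;    // raised from 0.55 (preserve more context)

// Generation may use the whole context window; the 25% figure only
// reserves room for the response while the prompt is being assembled.
function computeTokenBudgets(totalCtx) {
  if (!Number.isInteger(totalCtx) || totalCtx <= 0) {
    throw new RangeError('totalCtx must be a positive integer');
  }
  return {
    generationMaxTokens: totalCtx,
    budgetResponseTokens: Math.floor(totalCtx * RESPONSE_BUDGET_FRACTION),
  };
}

// Pre-generation compression is needed only once usage reaches the threshold.
// Assumption: the comparison is inclusive (>=).
function planPreGenCompression(usedTokens, totalCtx) {
  const needed = usedTokens / totalCtx >= PRE_GEN_COMPRESSION_THRESHOLD;
  return {
    needed,
    targetTokens: needed
      ? Math.floor(totalCtx * PRE_GEN_COMPRESSION_TARGET)
      : usedTokens,
  };
}
```

With an 8192-token context, this sketch reserves 2048 tokens for prompt assembly and leaves the history untouched until roughly 7373 tokens are in use.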

WEB_TEST_RULES.md

Lines changed: 151 additions & 4 deletions
@@ -1,7 +1,36 @@
 # guIDE Web Testing Environment — Complete Reference

 > **To start a test session:** Tell the agent "read WEB_TEST_RULES.md" — this file contains everything needed.
-> Last updated: 2026-03-13
+> Last updated: 2026-03-23
+
+---
+
+## RULE ZERO OF TESTING — NO CHEERLEADING — THIS OVERRIDES EVERYTHING
+
+**This is the single most important rule in this entire document. It is placed first because it is violated the most.**
+
+**CHEERLEADING IS BANNED. COMPLETELY. ABSOLUTELY. NO EXCEPTIONS. NO SOFT VERSIONS.**
+
+What cheerleading looks like (ALL of these are violations):
+- "The model is generating at 9 tok/s" (implying progress is acceptable)
+- "Generation is progressing" / "Output is growing" / "Code is being produced"
+- "Looking good so far" / "Things seem to be working" / "No issues yet"
+- "The model successfully generated..." / "It produced a coherent..."
+- Any sentence that frames the test outcome positively before defects are fully catalogued
+- Describing generation speed, line count growth, or output volume with implicit approval
+- Using exclamation marks to express satisfaction about test results
+- Saying "no defects found" without specifying EXACTLY which dimensions were checked
+- ANY positive adjective about the model's output: "good", "nice", "clean", "solid", "decent"
+
+What to do instead:
+- Report ONLY defects and specific factual measurements (line count, context %, token count)
+- If no defect is found in a specific dimension, say: "No defect found in [dimension name]" — NOT "it's working" or "it passed"
+- If all dimensions show no defect, INCREASE TEST DIFFICULTY — you are not testing hard enough
+- Every observation must read like a hostile quality audit finding, not a progress update
+
+**WHY THIS RULE EXISTS:** Agents repeatedly describe test progress with positive framing ("generating at 9 tok/s", "model is producing output", "looking good"), creating a false sense that things are working when they haven't been properly evaluated. This wastes the user's time and masks real defects. The agent's job during testing is to FIND PROBLEMS, not narrate progress.
+
+**IF YOU CATCH YOURSELF CHEERLEADING:** Stop. Delete the sentence. Rewrite it as a factual observation or defect report. Then re-read this section before continuing.

 ---

@@ -146,13 +175,19 @@ This is the single most important rule.
 If a test fails, the failure IS the finding. Log it. Diagnose the root cause.
 Only make changes that would help ALL users with ALL prompts on ALL hardware.

-### Rule 2 — Be a Normal User
+### Rule 2 — Be a Normal User (NO HAND-HOLDING)
 When testing, use exactly the prompts a real user would type:
 - Typos, run-on sentences, ambiguous phrasing
 - Multi-part requests ("can you do X and also Y?")
 - Follow-up messages that reference prior context
 - Edge cases: very short messages, very long messages, code pastes
 - NO hand-holding prompts designed to succeed ("please call the read_file tool and...")
+- **NEVER instruct the model on HOW to generate its output** — do NOT say "make sure to close HTML tags",
+  "end with proper closing tags", "include opening and closing brackets", or ANY instruction that
+  coaches the model on output format. A real user says "build me a website for a pizza shop" — they
+  do NOT say "build me a website and make sure you close all the HTML tags at the end." If the model
+  can't produce complete output without being coached, THAT IS A DEFECT TO LOG — not a prompt to fix.
+- The test prompt should describe WHAT the user wants, never HOW the model should structure its response

 ### Rule 3 — Score All Three Dimensions
 Every test response must be evaluated on all three:
@@ -192,10 +227,14 @@ nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
 If free VRAM < 2800MB, do NOT run inference tests. Results on VRAM-constrained
 hardware are hardware-degraded and cannot be trusted for diagnosing pipeline issues.

-### Rule 8 — No Cheerleading
+### Rule 8 — No Cheerleading (see also RULE ZERO OF TESTING at top of this document)
 Test results are reported as defects found. If nothing broke, state exactly what
 was checked and that no defect was found in those dimensions. Never say "great results",
-"working well", "looking good", or any positive framing. Report facts only.
+"working well", "looking good", "progressing", "generating nicely", or any positive framing.
+Report facts only. No positive adjectives. No narrating progress as though it's achievement.
+Describing generation speed or line count growth with implied approval is cheerleading.
+Saying "no issues yet" is cheerleading. The ONLY acceptable framing is defect reports and
+raw numeric measurements without editorial commentary.

 ---

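The VRAM gate in the hunk above (no inference tests when free VRAM is below 2800 MB) could be automated with a small parser over the `nvidia-smi` output. This is a sketch only: `parseFreeVramMb` and `canRunInferenceTests` are hypothetical names, and treating the first listed GPU as the inference device is an assumption. It expects the one-integer-per-GPU output produced by `--format=csv,noheader,nounits`.

```javascript
// Hypothetical helpers; names are illustrative and not part of the repository.
const MIN_FREE_VRAM_MB = 2800; // threshold quoted in the rule above

// Parses stdout of:
//   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
// which prints one integer (free MiB) per GPU, one per line.
function parseFreeVramMb(stdout) {
  return stdout
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => Number.parseInt(line, 10))
    .filter((value) => Number.isFinite(value));
}

// Returns true only when the gate passes. Unparseable or empty output
// fails closed, so a broken probe never green-lights a test run.
function canRunInferenceTests(stdout, minMb = MIN_FREE_VRAM_MB) {
  const perGpu = parseFreeVramMb(stdout);
  if (perGpu.length === 0) return false;
  // Assumption: the first GPU listed is the one used for inference.
  return perGpu[0] >= minMb;
}
```

A caller might feed it the result of `child_process.execFileSync('nvidia-smi', [...], { encoding: 'utf8' })` and skip the test session when it returns `false`.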
@@ -649,6 +688,114 @@ ACCOUNTABILITY CHECK
 ```
 If ANY answer is wrong, fix it before proceeding.

+---
+
+## 15. SEAMLESS CONTINUATION TEST PROTOCOL — MANDATORY
+
+> Added 2026-03-23. This section governs ALL seamless continuation testing.
+> When the user says "test seamless continuation", follow this section EXACTLY.
+
+### What We Are Testing
+Seamless continuation = when the model hits maxResponseTokens mid-generation, the pipeline
+automatically triggers a new generation that picks up EXACTLY where the model left off.
+The intended behavior:
+
+1. Model starts generating a large code file (thousands of lines)
+2. Model hits maxResponseTokens → generation stops
+3. Pipeline detects incomplete output → automatically triggers continuation
+4. New generation picks up in the SAME code block, from the EXACT line where it stopped
+5. Process repeats until the file is complete
+6. Final output is ONE coherent file with no gaps, no duplicates, no broken blocks
+
+**ANYTHING other than this is a failure.** Specifically:
+- Code block splitting into multiple blocks at continuation boundary = FAILURE
+- Duplicate lines at the continuation seam = FAILURE
+- Model restarting from the beginning of the file = FAILURE
+- Model losing track of what function/class it was writing = FAILURE
+- Naked code appearing outside code blocks after finalization = FAILURE
+- Code block not closing properly = FAILURE
+- Model stopping before completing the file = FAILURE
+
+### How to Run the Test
+1. Clear logs
+2. Check VRAM (>2800MB required)
+3. Open a new conversation in the browser
+4. Select a local model (Qwen3.5-4B or larger)
+5. Send a **10-paragraph prompt** describing a realistic website/application with:
+   - Specific business type (car dealership, restaurant, gym, etc. — rotate every test)
+   - Detailed requirements: specific JavaScript functions (12+), CSS animations, WebGL effects,
+     responsive design, form validation, API integration, localStorage, etc.
+   - Do NOT specify line counts — let the model decide how long the output needs to be
+   - The prompt must be detailed enough that the model NATURALLY produces 1000+ lines
+6. Take a screenshot IMMEDIATELY after sending
+7. Monitor every 5 seconds with screenshots (see monitoring rules below)
+
+### Monitoring Rules During Seamless Continuation
+- **Screenshot every 5 seconds** — no exceptions. Every 5 seconds.
+- **After each screenshot, analyze it** — describe what you see. Is the code block still open?
+  Has the line count changed? Is there a "Reasoning..." indicator? Has continuation triggered?
+- **Check backend logs every 5 seconds** — look for:
+  - `seamless continuation` / `maxTokens reached` / `continuation triggered`
+  - `context rotation` / `compaction`
+  - `Natural stop with unclosed code fence`
+  - stopReason values
+- **The INSTANT seamless continuation triggers**: Watch the next screenshot with extreme
+  scrutiny. The model must continue in the SAME code block from the EXACT line where it stopped.
+  If ANYTHING else happens (new block, duplicate lines, restart, naked code), that is an
+  IMMEDIATE FAILURE. Stop monitoring and investigate.
+- **Do NOT wait until the end** to check for failures — check at every continuation boundary
+- **If the model is still generating after 10 minutes**, continue monitoring. Do not stop early.
+
+### What Constitutes Immediate Failure
+Any of these observed during generation = STOP and investigate immediately:
+1. Code splits into multiple code blocks at continuation boundary
+2. Continuation starts a new ``` fence instead of continuing in the existing one
+3. Text appears BETWEEN code blocks that should be contiguous
+4. Line count drops (content regression)
+5. Model restarts the file from scratch
+6. Generation freezes for more than 3 minutes with no new tokens
+7. After finalization: code that was in one block during streaming appears scattered
+
+### Reporting Format for Seamless Continuation Tests
+```
+SEAMLESS CONTINUATION TEST REPORT
+==================================
+PROMPT: [exact prompt sent — or summary if >500 chars]
+MODEL: [name, size, context allocated]
+TOTAL GENERATION TIME: [minutes:seconds]
+CONTINUATION COUNT: [how many times seamless continuation triggered]
+ROTATION COUNT: [how many context rotations triggered]
+FINAL OUTPUT: [line count, whether file is complete, whether code block is intact]
+
+CONTINUATION BOUNDARIES:
+- Continuation 1 at [time]: [what happened — did code stay in same block?]
+- Continuation 2 at [time]: [what happened]
+...
+
+DEFECTS FOUND:
+- [specific defect with screenshot evidence and log evidence]
+- [or "None"]
+
+FINALIZATION CHECK:
+- Code block intact after response finalized? [yes/no]
+- Naked code outside blocks? [yes/no]
+- Duplicate sections? [yes/no]
+
+LOG EVIDENCE:
+[paste relevant log lines showing continuation triggers, stop reasons, etc.]
+```
+
+### ABSOLUTE BANS During Seamless Continuation Testing
+- No cheerleading. No "looking good." No "progressing well." No exclamation marks.
+- No saying "good" about anything. Not the code. Not the continuation. Not the progress.
+- No positive framing of any kind. Report defects and facts ONLY.
+- If zero defects are found, state exactly what was checked and that no defect was found.
+  Do NOT celebrate this. Increase test difficulty instead.
+- Do NOT end the test early. Let the model finish completely.
+- Do NOT modify source code during the test. Pipeline is frozen during observation.
+
+---
+
 ### Rules That Apply at ALL Times During Testing
 - **NO CHEERLEADING — PRIME RULE** — never say "looking good", "great progress", "improvement", "strong results", or any positive framing whatsoever. Report defects and facts ONLY. If you find zero defects, increase test difficulty — you're not testing hard enough.
 - **Screenshots every 5 seconds** — not 10, not 15, not 30. Every 5 seconds during active generation.

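Two of the failure conditions in the seamless continuation protocol (a code block split or left unclosed, and duplicate lines at the continuation seam) lend themselves to a mechanical check on captured text. The sketch below is one possible approach under stated assumptions: `inspectFences` and `duplicatedSeamLines` are hypothetical helpers, and they operate on plain strings saved from the test run, not on the application's internal state.

```javascript
// Hypothetical helpers for post-run checks; not part of the repository.
const FENCE = /^\s*`{3}/; // a line that opens or closes a fenced code block

// Counts fence lines in the finalized response. For a single-file test the
// expected result is exactly one complete block and no unclosed fence.
function inspectFences(text) {
  const fenceLines = text.split('\n').filter((line) => FENCE.test(line)).length;
  return {
    blocks: Math.floor(fenceLines / 2), // complete open/close pairs
    unclosed: fenceLines % 2 === 1,     // an odd count means a dangling fence
  };
}

// Returns how many trailing lines of `before` are repeated verbatim at the
// start of `after` (0 means no duplication was detected at the seam).
function duplicatedSeamLines(before, after, maxWindow = 20) {
  const a = before.split('\n');
  const b = after.split('\n');
  const limit = Math.min(maxWindow, a.length, b.length);
  for (let k = limit; k >= 1; k--) {
    const tail = a.slice(a.length - k);
    const head = b.slice(0, k);
    if (tail.every((line, i) => line === head[i])) return k;
  }
  return 0;
}
```

Blank lines compare equal, so a seam that falls between two empty lines is reported as one duplicated line; a stricter check would ignore whitespace-only lines. Any nonzero result is a candidate defect to confirm against screenshots and logs, not a verdict on its own.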
main/agenticChat.js

Lines changed: 1 addition & 0 deletions
@@ -143,6 +143,7 @@ function register(ctx) {
       MAX_AGENTIC_ITERATIONS,
     });
   } catch (error) {
+    console.error('[AgenticChat] Pipeline error:', error.stack || error.message || error);
     return { success: false, error: error.message };
   }
 });

main/llmEngine.js

Lines changed: 39 additions & 8 deletions
@@ -8,6 +8,7 @@ const { EventEmitter } = require('events');
 const { getModelProfile, getModelSamplingParams, getEffectiveContextSize, getSizeTier } = require('./modelProfiles');
 const { detectFamily, detectParamSize } = require('./modelDetection');
 const { sanitizeResponse } = require('./sanitize');
+const { buildContextShiftOptions } = require('./pipeline/nativeContextStrategy');

 // ─── Constants ───
 const STALL_TIMEOUT_GPU_MS = 90_000;
@@ -26,6 +27,11 @@ const MAX_PARALLEL_FUNCTION_CALLS = 4;
 const CONTEXT_ABSOLUTE_CEILING = 131_072;
 const VRAM_PADDING_FLOOR_MB = 800;

+// ─── Testing Override ───
+// Set TEST_MAX_CONTEXT=6000 (or any number) to force small context for faster rotation testing
+const TEST_MAX_CONTEXT = process.env.TEST_MAX_CONTEXT ? parseInt(process.env.TEST_MAX_CONTEXT, 10) : null;
+if (TEST_MAX_CONTEXT) console.log(`[LLM] TEST_MAX_CONTEXT override active: ${TEST_MAX_CONTEXT} tokens`);
+
 let _genCounter = 0;

 class LLMEngine extends EventEmitter {
@@ -396,8 +402,13 @@ class LLMEngine extends EventEmitter {
 let nativeTrainCtx = 0;
 try { nativeTrainCtx = loadedModel.trainContextSize || 0; } catch (_) {}
 // Respect the model's actual train context ceiling
-const clampedCtx = nativeTrainCtx > 0 ? Math.min(targetCtx, nativeTrainCtx) : targetCtx;
-console.log(`[LLM DIAG] Context creation: mode=${mode}, targetCtx=${targetCtx}, clampedCtx=${clampedCtx}, trainCtx=${nativeTrainCtx}, modelSizeGB=${gpuConfig.modelSizeGB.toFixed(2)}`);
+let clampedCtx = nativeTrainCtx > 0 ? Math.min(targetCtx, nativeTrainCtx) : targetCtx;
+// TEST_MAX_CONTEXT override for faster rotation testing
+if (TEST_MAX_CONTEXT && clampedCtx > TEST_MAX_CONTEXT) {
+  console.log(`[LLM] TEST_MAX_CONTEXT: clamping context from ${clampedCtx} to ${TEST_MAX_CONTEXT}`);
+  clampedCtx = TEST_MAX_CONTEXT;
+}
+console.log(`[LLM DIAG] Context creation: mode=${mode}, targetCtx=${targetCtx}, clampedCtx=${clampedCtx}, trainCtx=${nativeTrainCtx}, modelSizeGB=${gpuConfig.modelSizeGB.toFixed(2)}${TEST_MAX_CONTEXT ? `, testOverride=${TEST_MAX_CONTEXT}` : ''}`);

 const ctxRequest = {
   contextSize: clampedCtx,
@@ -770,11 +781,13 @@ class LLMEngine extends EventEmitter {
 tagBuffer += ch;

 if (tagBuffer === '<think>' || tagBuffer.endsWith('<think>')) {
+  if (!insideThinkBlock) console.log(`[LLM] Think block OPENED (${fullResponse.length} output chars so far)`);
   insideThinkBlock = true;
   tagBuffer = '';
   continue;
 }
 if (tagBuffer === '</think>' || tagBuffer.endsWith('</think>')) {
+  if (insideThinkBlock) console.log(`[LLM] Think block CLOSED (${thinkingTokenCount} think chars total)`);
   insideThinkBlock = false;
   tagBuffer = '';
   continue;
@@ -869,6 +882,14 @@ class LLMEngine extends EventEmitter {
 }
 const tokensUsed = this.sequence?.nextTokenIndex || 0;
 console.log(`[LLM] Post-gen: stopReason=${finalStopReason}, responseChars=${fullResponse.length}, tokensUsed=${tokensUsed}, maxTokens=${merged.maxTokens}, llamaStopReason=${result?.metadata?.stopReason || 'unknown'}`);
+// Content logging: show first/last 200 chars so we can diagnose what the model produced
+if (fullResponse.length > 0) {
+  const head = fullResponse.slice(0, 200).replace(/\n/g, '\\n');
+  const tail = fullResponse.slice(-200).replace(/\n/g, '\\n');
+  console.log(`[LLM] Content HEAD: ${head}`);
+  if (fullResponse.length > 400) console.log(`[LLM] Content TAIL: ${tail}`);
+  console.log(`[LLM] Think tokens this gen: ${thinkingTokenCount}`);
+}
 return {
   text: sanitized,
   rawText: fullResponse,
@@ -928,6 +949,13 @@ class LLMEngine extends EventEmitter {
 else if (thoughtBudget === 0) budgets.thoughtTokens = 0;
 else budgets.thoughtTokens = thoughtBudget;

+// Use native context shift strategy with custom compression logic
+// Solution A: Let node-llama-cpp handle WHEN to shift, we define WHAT happens
+const contextShiftOpts = buildContextShiftOptions(this);
+if (useKvCache) {
+  contextShiftOpts.lastEvaluationMetadata = this.lastEvaluation?.contextShiftMetadata;
+}
+
 return this.chat.generateResponse(this.chatHistory, {
   maxTokens: params.maxTokens || this.defaultParams.maxTokens,
   temperature: params.temperature,
@@ -944,9 +972,7 @@ class LLMEngine extends EventEmitter {
   history: this.lastEvaluation?.contextWindow,
   minimumOverlapPercentageToPreventContextShift: 0.5,
 } : undefined,
-contextShift: useKvCache ? {
-  lastEvaluationMetadata: this.lastEvaluation?.contextShiftMetadata,
-} : undefined,
+contextShift: contextShiftOpts,
 budgets,
 signal: this.abortController?.signal,
 tokenPredictor: this.tokenPredictor,
@@ -1182,6 +1208,13 @@ class LLMEngine extends EventEmitter {
 else if (thoughtBudget === 0) budgets.thoughtTokens = 0;
 else budgets.thoughtTokens = thoughtBudget;

+// Use native context shift strategy with custom compression logic
+// Solution A: Let node-llama-cpp handle WHEN to shift, we define WHAT happens
+const contextShiftOpts = buildContextShiftOptions(this);
+if (useKvCache) {
+  contextShiftOpts.lastEvaluationMetadata = this.lastEvaluation?.contextShiftMetadata;
+}
+
 const result = await this.chat.generateResponse(this.chatHistory, {
   functions,
   maxParallelFunctionCalls: MAX_PARALLEL_FUNCTION_CALLS,
@@ -1200,9 +1233,7 @@ class LLMEngine extends EventEmitter {
   history: this.lastEvaluation?.contextWindow,
   minimumOverlapPercentageToPreventContextShift: 0.5,
 } : undefined,
-contextShift: useKvCache ? {
-  lastEvaluationMetadata: this.lastEvaluation?.contextShiftMetadata,
-} : undefined,
+contextShift: contextShiftOpts,
 budgets,
 signal: this.abortController?.signal,
 tokenPredictor: this.tokenPredictor,

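The context-size clamp added to `main/llmEngine.js` can be isolated into two pure functions for unit testing. This is a sketch under assumptions, not the repository's code: `parseTestMaxContext` and `resolveContextSize` are hypothetical names. It follows the same order as the diff (train-context ceiling first, then the `TEST_MAX_CONTEXT` override), with one deliberate difference: non-positive or non-numeric override values are rejected, whereas the one-line `parseInt` in the diff would treat a negative number as an active override.

```javascript
// Hypothetical helpers mirroring the clamp in the diff; names are illustrative.

// Reads TEST_MAX_CONTEXT from an env-like object. Returns a positive
// integer, or null when the variable is unset, non-numeric, or <= 0.
function parseTestMaxContext(env = process.env) {
  const parsed = Number.parseInt(env.TEST_MAX_CONTEXT ?? '', 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : null;
}

// Applies the two ceilings in the same order as the diff:
// 1. the model's training context size (when known, i.e. > 0)
// 2. the optional test override (only ever lowers the result)
function resolveContextSize(targetCtx, trainCtx, testMaxContext) {
  let ctx = trainCtx > 0 ? Math.min(targetCtx, trainCtx) : targetCtx;
  if (testMaxContext && ctx > testMaxContext) ctx = testMaxContext;
  return ctx;
}
```

For example, `resolveContextSize(32768, 0, parseTestMaxContext({ TEST_MAX_CONTEXT: '6000' }))` yields 6000, which forces context rotation to occur early in a test run.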