|
1 | 1 | # guIDE Web Testing Environment — Complete Reference |
2 | 2 |
|
3 | 3 | > **To start a test session:** Tell the agent "read WEB_TEST_RULES.md" — this file contains everything needed. |
4 | | -> Last updated: 2026-03-13 |
| 4 | +> Last updated: 2026-03-23 |
| 5 | +
|
| 6 | +--- |
| 7 | + |
| 8 | +## RULE ZERO OF TESTING — NO CHEERLEADING — THIS OVERRIDES EVERYTHING |
| 9 | + |
| 10 | +**This is the single most important rule in this entire document. It is placed first because it is violated the most.** |
| 11 | + |
| 12 | +**CHEERLEADING IS BANNED. COMPLETELY. ABSOLUTELY. NO EXCEPTIONS. NO SOFT VERSIONS.** |
| 13 | + |
| 14 | +What cheerleading looks like (ALL of these are violations): |
| 15 | +- "The model is generating at 9 tok/s" (implying progress is acceptable) |
| 16 | +- "Generation is progressing" / "Output is growing" / "Code is being produced" |
| 17 | +- "Looking good so far" / "Things seem to be working" / "No issues yet" |
| 18 | +- "The model successfully generated..." / "It produced a coherent..." |
| 19 | +- Any sentence that frames the test outcome positively before defects are fully catalogued |
| 20 | +- Describing generation speed, line count growth, or output volume with implicit approval |
| 21 | +- Using exclamation marks to express satisfaction about test results |
| 22 | +- Saying "no defects found" without specifying EXACTLY which dimensions were checked |
| 23 | +- ANY positive adjective about the model's output: "good", "nice", "clean", "solid", "decent" |
| 24 | + |
| 25 | +What to do instead: |
| 26 | +- Report ONLY defects and specific factual measurements (line count, context %, token count) |
| 27 | +- If no defect is found in a specific dimension, say: "No defect found in [dimension name]" — NOT "it's working" or "it passed" |
| 28 | +- If all dimensions show no defect, INCREASE TEST DIFFICULTY — you are not testing hard enough |
| 29 | +- Every observation must read like a hostile quality audit finding, not a progress update |
| 30 | + |
| 31 | +**WHY THIS RULE EXISTS:** Agents repeatedly describe test progress with positive framing ("generating at 9 tok/s", "model is producing output", "looking good"), creating a false sense that things are working when they haven't been properly evaluated. This wastes the user's time and masks real defects. The agent's job during testing is to FIND PROBLEMS, not narrate progress. |
| 32 | + |
| 33 | +**IF YOU CATCH YOURSELF CHEERLEADING:** Stop. Delete the sentence. Rewrite it as a factual observation or defect report. Then re-read this section before continuing. |
5 | 34 |
|
6 | 35 | --- |
7 | 36 |
|
@@ -146,13 +175,19 @@ This is the single most important rule. |
146 | 175 | If a test fails, the failure IS the finding. Log it. Diagnose the root cause. |
147 | 176 | Only make changes that would help ALL users with ALL prompts on ALL hardware. |
148 | 177 |
|
149 | | -### Rule 2 — Be a Normal User |
| 178 | +### Rule 2 — Be a Normal User (NO HAND-HOLDING) |
150 | 179 | When testing, use exactly the prompts a real user would type: |
151 | 180 | - Typos, run-on sentences, ambiguous phrasing |
152 | 181 | - Multi-part requests ("can you do X and also Y?") |
153 | 182 | - Follow-up messages that reference prior context |
154 | 183 | - Edge cases: very short messages, very long messages, code pastes |
155 | 184 | - NO hand-holding prompts designed to succeed ("please call the read_file tool and...") |
| 185 | +- **NEVER instruct the model on HOW to generate its output** — do NOT say "make sure to close HTML tags", |
| 186 | + "end with proper closing tags", "include opening and closing brackets", or ANY instruction that |
| 187 | + coaches the model on output format. A real user says "build me a website for a pizza shop" — they |
| 188 | + do NOT say "build me a website and make sure you close all the HTML tags at the end." If the model |
| 189 | + can't produce complete output without being coached, THAT IS A DEFECT TO LOG — not a prompt to fix. |
| 190 | +- The test prompt should describe WHAT the user wants, never HOW the model should structure its response |
156 | 191 |
|
157 | 192 | ### Rule 3 — Score All Three Dimensions |
158 | 193 | Every test response must be evaluated on all three: |
@@ -192,10 +227,14 @@ nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits |
192 | 227 | If free VRAM < 2800MB, do NOT run inference tests. Results on VRAM-constrained |
193 | 228 | hardware are hardware-degraded and cannot be trusted for diagnosing pipeline issues. |
194 | 229 |
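The VRAM gate above could be automated along these lines. This is a minimal sketch, not part of the project: the function names and `MIN_FREE_VRAM_MB` are hypothetical, and treating a multi-GPU machine as passing only when every GPU meets the threshold is an assumption the rule itself does not state.

```python
import subprocess

MIN_FREE_VRAM_MB = 2800  # threshold stated in the rule above


def parse_free_vram_mb(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MB) per GPU from csv,noheader,nounits output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def vram_ok(free_mb_per_gpu: list[int], minimum_mb: int = MIN_FREE_VRAM_MB) -> bool:
    """True only if at least one GPU was reported and all have >= minimum_mb free.

    Requiring every GPU to pass is an assumption made for this sketch.
    """
    return bool(free_mb_per_gpu) and min(free_mb_per_gpu) >= minimum_mb


def query_free_vram_mb() -> list[int]:
    """Run the nvidia-smi query quoted in the rule and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_free_vram_mb(result.stdout)
```

A test runner would call `vram_ok(query_free_vram_mb())` before sending any prompt and skip inference tests when it returns `False`.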
|
195 | | -### Rule 8 — No Cheerleading |
| 230 | +### Rule 8 — No Cheerleading (see also RULE ZERO OF TESTING at top of this document) |
196 | 231 | Test results are reported as defects found. If nothing broke, state exactly what |
197 | 232 | was checked and that no defect was found in those dimensions. Never say "great results", |
198 | | -"working well", "looking good", or any positive framing. Report facts only. |
| 233 | +"working well", "looking good", "progressing", "generating nicely", or any positive framing. |
| 234 | +Report facts only. No positive adjectives. No narrating progress as though it's achievement. |
| 235 | +Describing generation speed or line count growth with implied approval is cheerleading. |
| 236 | +Saying "no issues yet" is cheerleading. The ONLY acceptable framing is defect reports and |
| 237 | +raw numeric measurements without editorial commentary. |
199 | 238 |
|
200 | 239 | --- |
201 | 240 |
|
@@ -649,6 +688,114 @@ ACCOUNTABILITY CHECK |
649 | 688 | ``` |
650 | 689 | If ANY answer is wrong, fix it before proceeding. |
651 | 690 |
|
| 691 | +--- |
| 692 | + |
| 693 | +## 15. SEAMLESS CONTINUATION TEST PROTOCOL — MANDATORY |
| 694 | + |
| 695 | +> Added 2026-03-23. This section governs ALL seamless continuation testing. |
| 696 | +> When the user says "test seamless continuation", follow this section EXACTLY. |
| 697 | +
|
| 698 | +### What We Are Testing |
| 699 | +Seamless continuation = when the model hits maxResponseTokens mid-generation, the pipeline |
| 700 | +automatically triggers a new generation that picks up EXACTLY where the model left off. |
| 701 | +The intended behavior: |
| 702 | + |
| 703 | +1. Model starts generating a large code file (thousands of lines) |
| 704 | +2. Model hits maxResponseTokens → generation stops |
| 705 | +3. Pipeline detects incomplete output → automatically triggers continuation |
| 706 | +4. New generation picks up in the SAME code block, from the EXACT line where it stopped |
| 707 | +5. Process repeats until the file is complete |
| 708 | +6. Final output is ONE coherent file with no gaps, no duplicates, no broken blocks |
| 709 | + |
| 710 | +**ANYTHING other than this is a failure.** Specifically: |
| 711 | +- Code block splitting into multiple blocks at continuation boundary = FAILURE |
| 712 | +- Duplicate lines at the continuation seam = FAILURE |
| 713 | +- Model restarting from the beginning of the file = FAILURE |
| 714 | +- Model losing track of what function/class it was writing = FAILURE |
| 715 | +- Naked code appearing outside code blocks after finalization = FAILURE |
| 716 | +- Code block not closing properly = FAILURE |
| 717 | +- Model stopping before completing the file = FAILURE |
| 718 | + |
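Two of the failure modes listed above (block splitting and duplicate lines at the seam) can be checked mechanically on captured text. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the response is assumed to be available as Markdown text, and the two sides of a continuation boundary are assumed to be available as lists of lines.

```python
FENCE = "`" * 3  # a Markdown code fence marker


def count_fenced_blocks(markdown: str) -> int:
    """Count fenced code blocks; an unclosed trailing fence counts as one block."""
    fence_lines = sum(
        1 for line in markdown.splitlines() if line.lstrip().startswith(FENCE)
    )
    return (fence_lines + 1) // 2


def seam_overlap(before: list[str], after: list[str]) -> int:
    """Return how many lines ending `before` are repeated at the start of `after`."""
    for size in range(min(len(before), len(after)), 0, -1):
        if before[-size:] == after[:size]:
            return size
    return 0
```

A block count above 1 for a single-file response, or a non-zero overlap at a boundary, is a candidate defect to confirm against screenshots and logs. A one-line overlap can also be a legitimately repeated line such as a lone `}`, so the overlap value is evidence to inspect, not a verdict.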
| 719 | +### How to Run the Test |
| 720 | +1. Clear logs |
| 721 | +2. Check VRAM (at least 2800MB free required) |
| 722 | +3. Open a new conversation in the browser |
| 723 | +4. Select a local model (Qwen3.5-4B or larger) |
| 724 | +5. Send a **10-paragraph prompt** describing a realistic website/application with: |
| 725 | + - Specific business type (car dealership, restaurant, gym, etc. — rotate every test) |
| 726 | + - Detailed requirements: specific JavaScript functions (12+), CSS animations, WebGL effects, |
| 727 | + responsive design, form validation, API integration, localStorage, etc. |
| 728 | + - Do NOT specify line counts — let the model decide how long the output needs to be |
| 729 | + - The prompt must be detailed enough that the model NATURALLY produces 1000+ lines |
| 730 | +6. Take a screenshot IMMEDIATELY after sending |
| 731 | +7. Monitor every 5 seconds with screenshots (see monitoring rules below) |
| 732 | + |
| 733 | +### Monitoring Rules During Seamless Continuation |
| 734 | +- **Screenshot every 5 seconds** — no exceptions. Every 5 seconds. |
| 735 | +- **After each screenshot, analyze it** — describe what you see. Is the code block still open? |
| 736 | + Has the line count changed? Is there a "Reasoning..." indicator? Has continuation triggered? |
| 737 | +- **Check backend logs every 5 seconds** — look for: |
| 738 | + - `seamless continuation` / `maxTokens reached` / `continuation triggered` |
| 739 | + - `context rotation` / `compaction` |
| 740 | + - `Natural stop with unclosed code fence` |
| 741 | + - stopReason values |
| 742 | +- **The INSTANT seamless continuation triggers**: Watch the next screenshot with extreme |
| 743 | + scrutiny. The model must continue in the SAME code block from the EXACT line where it stopped. |
| 744 | + If ANYTHING else happens (new block, duplicate lines, restart, naked code), that is an |
| 745 | + IMMEDIATE FAILURE. Stop monitoring and investigate. |
| 746 | +- **Do NOT wait until the end** to check for failures — check at every continuation boundary |
| 747 | +- **If the model is still generating after 10 minutes**, continue monitoring. Do not stop early. |
| 748 | + |
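The log check in the monitoring rules above amounts to scanning for a fixed set of marker strings. A minimal sketch, assuming the backend log is available as plain text; the function name is hypothetical, the case-insensitive substring match is an assumption, and the marker list is copied from the bullets above.

```python
LOG_MARKERS = (
    "seamless continuation",
    "maxTokens reached",
    "continuation triggered",
    "context rotation",
    "compaction",
    "Natural stop with unclosed code fence",
    "stopReason",
)


def find_marker_lines(log_text: str, markers=LOG_MARKERS):
    """Return (line_number, marker, line) for each log line containing a marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                hits.append((number, marker, line))
                break  # report each line once, under the first marker that matches
    return hits
```

Running this on only the log lines added since the previous poll keeps each 5-second check small; the matched lines can be pasted directly into the LOG EVIDENCE part of the report.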
| 749 | +### What Constitutes Immediate Failure |
| 750 | +Any of these observed during generation = STOP and investigate immediately: |
| 751 | +1. Code splits into multiple code blocks at continuation boundary |
| 752 | +2. Continuation starts a new ``` fence instead of continuing in the existing one |
| 753 | +3. Text appears BETWEEN code blocks that should be contiguous |
| 754 | +4. Line count drops (content regression) |
| 755 | +5. Model restarts the file from scratch |
| 756 | +6. Generation freezes for more than 3 minutes with no new tokens |
| 757 | +7. After finalization: code that was in one block during streaming appears scattered |
| 758 | + |
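Items 4 and 6 above (line count drop and a freeze longer than 3 minutes) can be detected from the periodic observations. This is an illustrative sketch only: `Sample` and `first_failure` are hypothetical names, and using visible line count as a proxy for "new tokens" is an assumption, since a screenshot shows lines, not tokens.

```python
from typing import NamedTuple, Optional

STALL_LIMIT_SECONDS = 180  # "more than 3 minutes with no new tokens"


class Sample(NamedTuple):
    seconds: float   # time since the prompt was sent
    line_count: int  # lines visible in the code block at that moment


def first_failure(samples: list[Sample]) -> Optional[str]:
    """Return a description of the first regression or stall, or None."""
    last_growth = samples[0].seconds if samples else 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur.line_count < prev.line_count:
            return (
                f"line count dropped {prev.line_count} -> {cur.line_count} "
                f"at {cur.seconds:.0f}s"
            )
        if cur.line_count > prev.line_count:
            last_growth = cur.seconds
        elif cur.seconds - last_growth > STALL_LIMIT_SECONDS:
            return f"no new lines for {cur.seconds - last_growth:.0f}s at {cur.seconds:.0f}s"
    return None
```

The returned string is worded as a bare measurement so it can go into DEFECTS FOUND without editorial framing.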
| 759 | +### Reporting Format for Seamless Continuation Tests |
| 760 | +``` |
| 761 | +SEAMLESS CONTINUATION TEST REPORT |
| 762 | +================================== |
| 763 | +PROMPT: [exact prompt sent — or summary if >500 chars] |
| 764 | +MODEL: [name, size, context allocated] |
| 765 | +TOTAL GENERATION TIME: [minutes:seconds] |
| 766 | +CONTINUATION COUNT: [how many times seamless continuation triggered] |
| 767 | +ROTATION COUNT: [how many context rotations triggered] |
| 768 | +FINAL OUTPUT: [line count, whether file is complete, whether code block is intact] |
| 769 | +
|
| 770 | +CONTINUATION BOUNDARIES: |
| 771 | + - Continuation 1 at [time]: [what happened — did code stay in same block?] |
| 772 | + - Continuation 2 at [time]: [what happened] |
| 773 | + ... |
| 774 | +
|
| 775 | +DEFECTS FOUND: |
| 776 | + - [specific defect with screenshot evidence and log evidence] |
| 777 | + - [or "No defect found in: <list each dimension checked>"] |
| 778 | +
|
| 779 | +FINALIZATION CHECK: |
| 780 | + - Code block intact after response finalized? [yes/no] |
| 781 | + - Naked code outside blocks? [yes/no] |
| 782 | + - Duplicate sections? [yes/no] |
| 783 | +
|
| 784 | +LOG EVIDENCE: |
| 785 | + [paste relevant log lines showing continuation triggers, stop reasons, etc.] |
| 786 | +``` |
| 787 | + |
| 788 | +### ABSOLUTE BANS During Seamless Continuation Testing |
| 789 | +- No cheerleading. No "looking good." No "progressing well." No exclamation marks. |
| 790 | +- No saying "good" about anything. Not the code. Not the continuation. Not the progress. |
| 791 | +- No positive framing of any kind. Report defects and facts ONLY. |
| 792 | +- If zero defects are found, state exactly what was checked and that no defect was found. |
| 793 | + Do NOT celebrate this. Increase test difficulty instead. |
| 794 | +- Do NOT end the test early. Let the model finish completely. |
| 795 | +- Do NOT modify source code during the test. Pipeline is frozen during observation. |
| 796 | + |
| 797 | +--- |
| 798 | + |
652 | 799 | ### Rules That Apply at ALL Times During Testing |
653 | 800 | - **NO CHEERLEADING — PRIME RULE** — never say "looking good", "great progress", "improvement", "strong results", or any positive framing whatsoever. Report defects and facts ONLY. If you find zero defects, increase test difficulty — you're not testing hard enough. |
654 | 801 | - **Screenshots every 5 seconds** — not 10, not 15, not 30. Every 5 seconds during active generation. |
|