| 1 | +# Environment Variables |
| 2 | + |
| 3 | +Reference for `TQ_*` runtime env vars. Grouped by purpose. Everything |
| 4 | +here is opt-in; defaults are the tested production path. |
| 5 | + |
| 6 | +## Performance / resource controls |
| 7 | + |
| 8 | +| Var | Default | Purpose | |
| 9 | +|---|---|---| |
| 10 | +| `TQ_NO_METAL` | off | Skip Metal (Apple GPU) path; force CPU-only | |
| 11 | +| `TQ_NO_MLOCK` | off | Don't `mlock` the mmap'd weights; lets OS page out cold experts on small machines | |
| 12 | +| `TQ_NO_Q4` | off | Skip load-time FP32→internal-Q4 recompression; use on-the-fly GGUF dequant. Quality tradeoff — see `state.md` R5 | |
| 13 | +| `TQ_NO_BATCH_PREFILL` | off | Force per-token prefill (disables batched matrix prefill path) | |
| 14 | +| `TQ_NO_MOE_BATCH` | off | Opt out of batched MoE dispatch (on by default). Restores the per-token MoE forward pass | |
| 15 | +| `TQ_NO_MOE_BATCH_DYNAMIC` | off | Opt out of FCFS dynamic dispatch (on by default). Uses wave-mode expert dispatch instead | |
| 16 | +| `TQ_MOE_BATCH_CHUNK` | 8 | Tokens per batched MoE call (sensible range 1-20). Larger chunks give more speedup, but numerical stability degrades above ~20 | |
| 17 | +| `TQ_MOE_BATCH_SELFTEST` | off | Route N=1 MoE through batch(N=1) kernel — proves equivalence vs per-token path | |
| 18 | +| `TQ_PHI3_SPLIT` | 0 | Split Phi-3 fused QKV/FFN into separate Q4 weights. **Off by default** because it degrades chat quality per feedback/perf_commits_need_chat_test | |
| 19 | +| `TQ_MOE_FAST_EXP` | off | Use Schraudolph fast-exp in MoE SwiGLU instead of the exact `expf` default. ~2% per-call error; may reintroduce long-generation drift | |
| 20 | + |
| 21 | +## Quality / correctness |
| 22 | + |
| 23 | +| Var | Default | Purpose | |
| 24 | +|---|---|---| |
| 25 | +| `TQ_NO_AUTO_SERIAL` | off | Opt out of Qwen3.6 auto single-thread mode. Multi-threaded decode is non-deterministic at T=0, so the default forces `-j 1` on the qwen35moe+DeltaNet hybrid. Cost of the forced serial mode: ~2-3× slower decode | |
| 26 | +| `TQ_FORCE_QK_NORM` | off | Force QK-norm on Qwen hybrid (normally disabled for that arch) | |
| 27 | +| `TQ_ROPE_PAIRS` | off | Force LLaMA-style interleaved RoPE pairs (overrides NEOX auto-detect) | |
| 28 | +| `TQ_NO_PLE` | off | Disable Gemma-4 per-layer-embedding path | |
| 29 | + |
| 30 | +## Debugging — general |
| 31 | + |
| 32 | +| Var | Default | Purpose | |
| 33 | +|---|---|---| |
| 34 | +| `TQ_DEBUG` | off | Prints per-layer output norms, attention range, tokenized prompt, etc. | |
| 35 | +| `TQ_DEBUG_PREFILL` | off | Per-layer `final x sum` / `sumabs` during prefill (layers 0-3) | |
| 36 | +| `TQ_DEBUG_WQ` | off | L0 pre-norm RMS at first token | |
| 37 | + |
| 38 | +## Debugging — refparity framework |
| 39 | + |
| 40 | +The `tools/refparity/` framework uses these to produce comparable dumps |
| 41 | +against the HF FP32 reference. Do not enable in production: each dump |
| 42 | +is an fsync'd file. |
| 43 | + |
| 44 | +| Var | Value | Purpose | |
| 45 | +|---|---|---| |
| 46 | +| `TQ_DUMP_HIDDEN` | `/path/to/dir` | Dump `emb.bin`, `h0.bin`…`hN.bin`, `post_norm.bin`, `logits.bin` (one raw FP32 file per slot) | |
| 47 | +| `TQ_DUMP_POS` | `0` (default) or `N` or `all` | Which token position to dump. `all` is expensive (28 × seq_len files) | |
| 48 | +| `TQ_DUMP_INTERMEDIATE` | off | Also dump per-layer sub-stage: `h{l}_in/postattn/preffn/ffnout` — bisects attention vs FFN divergences | |
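
Since each dump is described above as one raw FP32 file per slot, a quick sanity check is that a file's byte size is a multiple of 4 and matches the element count you expect. The following is a minimal sketch, assuming the files carry no header; the helper name `fp32_count` and the zero-filled stand-in file are hypothetical and not part of the engine or `tools/refparity/`.

```bash
# Hypothetical helper: print how many FP32 values a raw dump file holds.
# Assumes the file is headerless and every value occupies 4 bytes.
fp32_count() {
  local bytes
  bytes=$(wc -c < "$1") || return 1
  bytes=$((bytes))                      # normalize any padding in wc output
  if [ $((bytes % 4)) -ne 0 ]; then
    echo "$1: size $bytes is not a multiple of 4" >&2
    return 1
  fi
  echo $((bytes / 4))
}

# Stand-in for a real dump: 1024 zero-valued floats (4096 bytes).
demo_dir=$(mktemp -d)
head -c 4096 /dev/zero > "$demo_dir/h0.bin"
fp32_count "$demo_dir/h0.bin"           # prints 1024
```

Against a real run, point the helper at a file under the `TQ_DUMP_HIDDEN` directory and compare the count with the model's hidden size (or vocabulary size for `logits.bin`).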
| 49 | + |
| 50 | +## Debugging — DeltaNet (Qwen3.5/3.6) |
| 51 | + |
| 52 | +Added in the 2026-04-21 DeltaNet investigation. Probe or ablate the |
| 53 | +recurrent state to localize drift. |
| 54 | + |
| 55 | +| Var | Value | Purpose | |
| 56 | +|---|---|---| |
| 57 | +| `TQ_DELTA_PROBE` | `call1,call2,...` | Print per-layer `delta_state` L2 norm at listed layer-0 call counts. E.g. `TQ_DELTA_PROBE=50,100,115,120` | |
| 58 | +| `TQ_DELTA_RESET_EVERY` | `N` | Zero `delta_state` + `conv_state` every N-th layer-0 call. Diagnostic only (destroys useful context) | |
| 59 | +| `TQ_DELTA_RESET_LAYER` | `N` or unset | Combined with `RESET_EVERY`, clears only that layer's slice. `-1` or unset = all layers | |
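
`TQ_DELTA_PROBE` takes a comma-separated list of call counts, so an evenly spaced sweep can be generated rather than typed by hand. A small sketch, assuming `seq` and `paste` are available on the machine; the range 100 to 120 in steps of 5 is purely illustrative.

```bash
# Build a comma-separated list of call counts: every 5th call from 100 to 120.
# The range is illustrative; pick values around the drift you are chasing.
probe_list=$(seq 100 5 120 | paste -sd, -)
echo "TQ_DELTA_PROBE=$probe_list"       # prints TQ_DELTA_PROBE=100,105,110,115,120
```

The resulting value can then be passed on the command line as `TQ_DELTA_PROBE="$probe_list"` in front of the engine invocation.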
| 60 | + |
| 61 | +## Examples |
| 62 | + |
| 63 | +**Reproduce the BPE UTF-8 regression suite**: |
| 64 | +```bash |
| 65 | +bash scripts/test_models.sh # runs test_tokenizer.sh at the end |
| 66 | +``` |
| 67 | + |
| 68 | +**Reference-parity diff on one model**: |
| 69 | +```bash |
| 70 | +export PYTHONPATH=tools/pillar1/venv/lib/python3.12/site-packages |
| 71 | +python tools/refparity/hf_reference.py --model Qwen/Qwen3-0.6B --prompt "Hello" --out /tmp/ref.npz |
| 72 | +TQ_DUMP_HIDDEN=/tmp/eng TQ_NO_METAL=1 TQ_NO_MLOCK=1 TQ_NO_BATCH_PREFILL=1 TQ_NO_AUTO_SERIAL=1 \ |
| 73 | + ./build/quant models/Qwen3-0.6B-Q4_K_M.gguf -p "Hello" -n 1 -T 0 |
| 74 | +python tools/refparity/diff_layers.py /tmp/ref.npz /tmp/eng |
| 75 | +``` |
| 76 | + |
| 77 | +**Probe Qwen3.6 DeltaNet state at drift boundary**: |
| 78 | +```bash |
| 79 | +TQ_DELTA_PROBE=50,100,115,118,120 \ |
| 80 | + ./build/quant models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \ |
| 81 | + -p "Once upon a time in a faraway land" -n 125 -T 0 2>&1 | grep delta-probe |
| 82 | +``` |
| 83 | + |
| 84 | +**35B best-quality user config**: |
| 85 | +```bash |
| 86 | +./build/quant models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \ |
| 87 | + -p "<your prompt>" -n 200 -T 0 --rep-penalty 1.3 |
| 88 | +``` |
| 89 | + |
| 90 | +## Notes |
| 91 | + |
| 92 | +- Most `TQ_NO_*` envs exist because the default path has a correctness |
| 93 | + or quality tradeoff someone wanted to A/B. Flipping them usually |
| 94 | + trades speed for determinism or vice versa. Read `state.md` and |
| 95 | + `bench/results/` for the measured impact before relying on any. |
| 96 | +- New envs land with `state.md` entries documenting *why* they exist. |
| 97 | + Don't add undocumented envs. |
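
The A/B use mentioned in the first note can be scripted so that both runs use exactly the same command line. This is a minimal sketch, assuming that setting a flag to `1` enables it, as the examples in this document do; the helper name `ab_run` is hypothetical, and the demo uses a stand-in command instead of the engine.

```bash
# Hypothetical helper: run the same command twice, once with the variable
# unset (default path) and once with it set to 1, labelling each run.
ab_run() {
  local var="$1"; shift
  echo "== default (${var} unset) =="
  ( unset "$var"; "$@" )
  echo "== ${var}=1 =="
  ( export "${var}=1"; "$@" )
}

# Demo with a stand-in command that only reports what it sees.
ab_run TQ_NO_METAL sh -c 'echo "TQ_NO_METAL=${TQ_NO_METAL:-unset}"'
```

With the engine, the same helper would wrap a real invocation, for example `ab_run TQ_NO_BATCH_PREFILL ./build/quant models/Qwen3-0.6B-Q4_K_M.gguf -p "Hello" -n 1 -T 0`, so the two outputs can be compared directly.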