diff --git a/.gemini/models/qwen-3.6-35b-a3b.md b/.gemini/models/qwen-3.6-35b-a3b.md
Lines changed: 73 additions & 0 deletions
diff --git a/.gemini/skills/llm-stack-optimizer/SKILL.md b/.gemini/skills/llm-stack-optimizer/SKILL.md
Lines changed: 97 additions & 0 deletions
diff --git a/.gemini/skills/llm-stack-optimizer/references/template.conf b/.gemini/skills/llm-stack-optimizer/references/template.conf
Lines changed: 45 additions & 0 deletions
diff --git a/.gemini/skills/llm-stack-optimizer/references/template.json b/.gemini/skills/llm-stack-optimizer/references/template.json
Lines changed: 35 additions & 0 deletions
diff --git a/GEMINI.md b/GEMINI.md
Lines changed: 55 additions & 0 deletions
diff --git a/scripts-local/baselines.json b/scripts-local/baselines.json
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,73 @@
+# Technical Optimization Profile: Qwen 3.6 35B-A3B-APEX-I-Balanced
+
+## 1. Documentation & Primary Sources
+This profile is synthesized from the following official technical specifications:
+*   **APEX Implementation:** [mudler/Qwen3.6-35B-A3B-APEX-GGUF](https://huggingface.co/mudler/Qwen3.6-35B-A3B-APEX-GGUF/raw/main/README.md)
+*   **Model Weights/Specs:** [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B/raw/main/README.md)
+*   **APEX Methodology:** [mudler/apex-quant](https://github.com/mudler/apex-quant/blob/main/README.md)
+
+---
+
+## 2. Model Architecture & Deployment Rationale
+
+### **Model Overview**
+*   **Base:** Qwen 3.6 35B-A3B (256 Experts).
+*   **Specialization:** Optimized for "Thinking Mode" reasoning, agentic coding (terminal/repo-level), and Model Context Protocol (MCP) tool-calling.
+*   **Context Limit:** 262,144 (Native). 
+*   **Deployment Goal:** Stable 128K context window on 16GB VRAM (RTX 4070 Ti Super) with peak generation speed.
+
+### **Hardware Calibration (16GB RTX 4070 Ti Super + 7950X3D)**
+*   **CCD Focus:** `CPU_AFFINITY="0-7,24-31"` (CCD0 V-Cache) ensures logic-heavy MoE branch routing stays on the high-cache CCD.
+*   **Expert Offloading:** `N_CPU_MOE=192` (64 experts on GPU, 192 on CPU).
+    *   *Finding:* Values < 192 are silently overridden by `llama-server` to fit in 16GB VRAM.
+*   **Memory Locks:** `MLOCK=true` is mandatory to eliminate page-fault latency spikes when the router hits CPU-offloaded experts.
+*   **KV Cache:** `CTX_SIZE=131072` (128K) with `q4_0` K/V types fits within the ~15GB safety limit.
+
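The KV cache bullet above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the cache is stored as `q4_0` (18 bytes per block of 32 values) and that only full-attention layers hold a growing K/V cache. The layer count of 10 is derived from the "40 blocks" and "75% of layers" linear-attention figures quoted elsewhere in this PR. `n_kv_heads` and `head_dim` are placeholder values, not taken from the model card, so substitute the real ones before relying on the result.

```python
Q4_0_BLOCK_ELEMS = 32  # q4_0 packs 32 values per block
Q4_0_BLOCK_BYTES = 18  # 16 bytes of 4-bit values plus a 2-byte fp16 scale


def kv_cache_bytes_q4_0(n_attn_layers: int, n_kv_heads: int,
                        head_dim: int, ctx_size: int) -> int:
    """Approximate bytes needed for K and V caches stored as q4_0."""
    elems_per_tensor = n_attn_layers * n_kv_heads * head_dim * ctx_size
    total_elems = 2 * elems_per_tensor  # one K tensor and one V tensor
    return total_elems * Q4_0_BLOCK_BYTES // Q4_0_BLOCK_ELEMS


if __name__ == "__main__":
    # HYPOTHETICAL shape: n_kv_heads and head_dim are placeholders.
    size = kv_cache_bytes_q4_0(n_attn_layers=10, n_kv_heads=2,
                               head_dim=256, ctx_size=131072)
    print(f"KV cache estimate: {size / 2**30:.2f} GiB")
```

The estimate covers the K/V tensors only; compute buffers and any recurrent state for the linear-attention layers come on top of it.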
+---
+
+## 3. APEX (Adaptive Precision for EXpert Models) Strategy
+
+### **Quantization Details (Extracted from mudler/apex-quant)**
+*   **Layer-wise Precision Gradient:** Not all layers are equal. Sensitivity is non-linear across the 40 blocks.
+*   **Edge Layers:** First and last 5 layers are most sensitive; kept at higher precision (e.g., Q6_K).
+*   **Shared Experts:** Must be at least **Q8_0** to maintain routing stability and prevent "logic drift."
+*   **Imatrix (APEX-I):** Uses a diverse calibration dataset (chat, code, reasoning, tool-calling), trading a negligible perplexity increase for better real-world accuracy.
+*   **Tiering:** APEX-I Balanced (24GB GGUF) is the recommended tier for high-accuracy local deployment on 16GB-24GB cards.
+
+---
+
+## 4. llama.cpp Server Configuration & Findings
+
+### **Thinking Mode Parameters (Official Qwen 3.6 Specs)**
+*   **Thinking mode** is enabled by default to generate reasoning traces within `<think>` blocks.
+*   **Reasoning Format:** `deepseek` (Default). Maps `<think>` blocks to `reasoning_content` in API responses.
+*   **Agentic Continuity:** `preserve_thinking: true` and `enable_thinking: true` must be enabled via `--chat-template-kwargs`.
+    *   *Source Detail:* `preserve_thinking` allows the model to retain reasoning context from historical messages, which is critical for decision consistency in multi-turn agentic workflows.
+*   **Reasoning Budget:** `REASONING_BUDGET=-1` (Infinite) to prevent truncating complex reasoning traces.
+
+### **Sampling (Official Qwen 3.6 "General Thinking" Specs)**
+| Parameter | Value | Rationale |
+| :--- | :--- | :--- |
+| `TEMP` | 1.0 | Standard randomness for MoE diversity. |
+| `TOP_P` | 0.95 | Nucleus sampling. |
+| `TOP_K` | 20 | Tight top-K for reasoning precision. |
+| `MIN_P` | 0.0 | Official recommendation for general thinking tasks. |
+| `REPEAT_PENALTY` | 1.0 | Official spec recommendation (no penalty). |
+| `PRESENCE_PENALTY` | 1.5 | High penalty to force creative topic expansion. |
+| `SAMPLERS` | `dry;top_k;top_p;min_p;temperature` | Optimized chain order. |
+
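As an illustration of how the table above could reach the server, the sketch below turns the values into `llama-server` command-line arguments. The flags themselves (`--temp`, `--top-p`, `--top-k`, `--min-p`, `--repeat-penalty`, `--presence-penalty`, `--samplers`) are standard llama.cpp options. The mapping from the `.conf` variable names to those flags is an assumption, since the wrapper script that does the real translation is not part of this diff.

```python
# Values copied from the sampling table.
SAMPLING = {
    "TEMP": 1.0,
    "TOP_P": 0.95,
    "TOP_K": 20,
    "MIN_P": 0.0,
    "REPEAT_PENALTY": 1.0,
    "PRESENCE_PENALTY": 1.5,
    "SAMPLERS": "dry;top_k;top_p;min_p;temperature",
}

# ASSUMED mapping from .conf variable names to llama-server flags.
FLAG_FOR = {
    "TEMP": "--temp",
    "TOP_P": "--top-p",
    "TOP_K": "--top-k",
    "MIN_P": "--min-p",
    "REPEAT_PENALTY": "--repeat-penalty",
    "PRESENCE_PENALTY": "--presence-penalty",
    "SAMPLERS": "--samplers",
}


def sampling_args(values: dict) -> list[str]:
    """Flatten the settings into an argv-style list of flag/value pairs."""
    args: list[str] = []
    for name, value in values.items():
        args += [FLAG_FOR[name], str(value)]
    return args


if __name__ == "__main__":
    print(" ".join(sampling_args(SAMPLING)))
```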
+---
+
+## 5. Benchmark Baselines (128K Context)
+
+| Slot Configuration | Generation Speed (avg) | Aggregate Throughput |
+| :--- | :--- | :--- |
+| **1-Slot (Peak)** | **55.9 tok/s** | 55.9 tok/s |
+| **3-Slot (Concurrent)** | **26.7 tok/s** | **75.4 tok/s** |
+
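The two rows above can be compared with a little arithmetic. The sketch below derives the aggregate speedup and per-slot efficiency from the table's figures; the function name and the returned keys are illustrative only.

```python
def slot_scaling(single_tps: float, aggregate_tps: float, slots: int) -> dict:
    """Summarize how aggregate throughput scales relative to one slot."""
    speedup = aggregate_tps / single_tps
    return {
        "speedup": speedup,                       # aggregate vs. single-slot peak
        "efficiency": speedup / slots,            # 1.0 would be perfect scaling
        "per_slot_share": aggregate_tps / slots,  # even split of the aggregate
    }


if __name__ == "__main__":
    result = slot_scaling(single_tps=55.9, aggregate_tps=75.4, slots=3)
    for key, value in result.items():
        print(f"{key}: {value:.2f}")
```

With the table's numbers, three slots deliver roughly 1.35x the single-slot throughput. The even split of the aggregate (about 25.1 tok/s) is not the same quantity as the measured per-slot average of 26.7 tok/s, so the two need not match.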
+---
+
+## 6. Maintenance & Sync Notes
+*   **Update Script:** Always use `./scripts-local/rebuild-llama.sh scripts-local/qwen-3.6-35b-a3b.conf`.
+*   **UI Sync:** The script automatically generates the OpenWeb-UI JSON profile with the mandatory `preserve_thinking` flags.
+*   **Baseline File:** Baselines are stored in `scripts-local/baselines.json`.
@@ -0,0 +1,97 @@
+---
+name: llm-stack-optimizer
+description: Generate optimized llama-server configs and OpenWeb-UI JSON profiles by researching model specs, benchmarks, and cloning from established high-quality templates.
+---
+
+# LLM Stack Optimizer
+
+This skill researches a model's published specifications and benchmarks, then generates a `llama-server` configuration and OpenWeb-UI profiles tuned for the local hardware, using established templates as a base.
+
+## Input
+- **Base Model Name**: (e.g., `google_gemma-4-26B-A4B-it-Q6_K.gguf`)
+
+## Workflow
+
+### 1. Preparation Phase (Upstream Sync)
+- **Sync Check**: Ask the user if they want to synchronize the local repository with upstream before proceeding.
+- **Action**: If yes, execute `./scripts-local/sync-fork.sh`. 
+- **Build Requirement**: If a sync was performed, the deployment step **MUST** include the `--build` flag in `rebuild-llama.sh` to update the server binary.
+
+### 2. Intelligence Phase (Deep Research & Architecture)
+- **Search Objective**: Perform a comprehensive Google search for the model's official release notes, HuggingFace model card, and benchmark reports (PPL, MMLU, HumanEval).
+- **Architecture Analysis**: Identify whether the model is dense or MoE (Mixture of Experts), its layer count, and its attention mechanism (GQA/MQA).
+- **Research Summary**: Provide a detailed technical report including:
+    - **Model Details**: Architecture, parameter count, expert count (if MoE).
+    - **Performance**: Key benchmark scores (HumanEval, MMLU).
+    - **Sources**: List the URLs/citations used for the data.
+    - **Llama-Server Specs**: Proposed `llama-server` flags based on the architecture (e.g., `--reasoning`, `--chat-template`).
+- **User Input**: **Ask the user if they have additional resources** (local documents, URLs, or notes) to refine the config.
+- **Confirmation**: **Ask for explicit user confirmation** to proceed to creating the `.conf` and `.json` files.
+
+### 3. Physical Phase (Hardware Constraints)
+- **VRAM Verification**: Use `python3 scripts-local/vram-linter.py [TEMP_CONFIG]` to validate usage for the 16GB RTX 4070 Ti Super.
+- **Safety Threshold**: Target < 15.5GB total usage. If the linter fails, increase `N_CPU_MOE` or decrease `CTX_SIZE`.
+- **MoE Expert Offloading**: 
+    - **Stability baseline (Gemma-4/128-expert)**: 112-120 experts on CPU (out of 128).
+    - **Stability baseline (Qwen 3.6/256-expert)**: 240-244 experts on CPU (out of 256).
+- **Context Scaling (DeltaNet)**: For Hybrid DeltaNet models (Qwen 3.6), `CTX_SIZE=262144` is achievable on 16GB VRAM due to linear-attention efficiency (75% of layers).
+- **Affinity**: Always use `CPU_AFFINITY="0-15"` and `THREADS=16` for the 7950X3D.
+
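The Safety Threshold rule in this phase amounts to a small back-off loop. The sketch below is one possible reading of it: raise `N_CPU_MOE` first, then shrink `CTX_SIZE`, using the step sizes (2 and 16,384) quoted under "Hard-Won Lessons". The `StackConfig` class, the `estimate_gb` callable, and the `max_cpu_moe` bound are hypothetical stand-ins, because the real estimate comes from `scripts-local/vram-linter.py`, whose interface is not shown in this diff.

```python
from dataclasses import dataclass, replace
from typing import Callable

VRAM_LIMIT_GB = 15.5  # target is strictly below this value
MOE_STEP = 2          # N_CPU_MOE increment per attempt
CTX_STEP = 16_384     # CTX_SIZE decrement per attempt


@dataclass(frozen=True)
class StackConfig:
    """Hypothetical container for the two knobs being tuned."""
    n_cpu_moe: int
    ctx_size: int


def fit_to_vram(cfg: StackConfig,
                estimate_gb: Callable[[StackConfig], float],
                max_cpu_moe: int,
                min_ctx: int = CTX_STEP) -> StackConfig:
    """Offload more to the CPU first, then shrink context, until it fits."""
    while estimate_gb(cfg) >= VRAM_LIMIT_GB:
        if cfg.n_cpu_moe + MOE_STEP <= max_cpu_moe:
            cfg = replace(cfg, n_cpu_moe=cfg.n_cpu_moe + MOE_STEP)
        elif cfg.ctx_size - CTX_STEP >= min_ctx:
            cfg = replace(cfg, ctx_size=cfg.ctx_size - CTX_STEP)
        else:
            raise ValueError("no configuration fits under the VRAM limit")
    return cfg
```

Offloading is tried before cutting context on the assumption that context length is the more valuable resource; swap the two branches if generation speed matters more.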
+### 4. Generation Phase (Templated Cloning)
+- **llama-server Config**: 
+    - **Base**: Clone from `scripts-local/gemma-4-26b-a4b-prism-pro-dq.gguf.conf`.
+    - **Mandatory Flags**: `--cache-reuse 256`, `--prio 2`, `--context-shift`, `--kv-unified`.
+    - **Qwen 3.6 / Agentic Specifics**:
+        - Include `--chat-template-kwargs '{"preserve_thinking":true}'` for agentic preservation.
+        - **Pristine History**: Set `REASONING_BUDGET_MESSAGE=""` (empty) to prevent non-native filler from disrupting recursive logic.
+        - **Watchdog**: Set a global `REASONING_BUDGET=2048` to prevent infinite thinking loops.
+- **OpenWeb-UI Profiles**: 
+    - **Automation**: Do NOT manually create JSON profiles. Execute `python3 scripts-local/generate-ui-profiles.py` after saving the `.conf` file to generate all 8 standard personas (FIM, Coder, Pro, etc.).
+    - **Standards**: The script automatically enforces the **10/10 UI Standard** (H3 headers, emoji anchoring, float types).
+    - **Recursive Logic**: Ensure high-logic personas (Pro/Coder/Research) instruct the model to "recursively validate against previous reasoning traces."
+    - **Tool Calling**: For Qwen 3.6 Coder models, ensure the persona uses the XML-based `qwen3_coder` format (detected automatically by modern llama-server).
+
+### 5. Review & Confirmation Phase (Pre-Deployment)
+- **Technical Summary**: Present a final summary of the proposed stack:
+    - **Model**: [Model Name & Quant]
+    - **VRAM**: [Estimated Usage] GB / 15.5 GB
+    - **Compute**: [N_CPU_MOE] experts on CPU, [N_GPU_LAYERS] layers on GPU.
+    - **Context**: [CTX_SIZE] (Total) / [PARALLEL] (Slots).
+    - **Reasoning**: [Budget/Format settings].
+- **Build Notification**: State if a full rebuild is triggered (mandatory if sync was performed).
+- **Mandatory Stop**: **Ask for explicit user confirmation** before proceeding to deployment.
+
+### 6. Integration Phase (Deployment & Health Check)
+- **Deployment**: Execute `./scripts-local/rebuild-llama.sh scripts-local/[CONFIG_NAME].conf` (include `--build` if sync was performed).
+- **Initialization Delay**: **Wait 10 seconds** to allow the model to load into VRAM and the KV cache to initialize.
+- **Health Check**: Perform a health check via `curl -s http://localhost:8080/health`. 
+- **Verification**: Ensure the response returns `{"status": "ok"}`. If the service is not healthy, check `systemctl status llama-server` and report the error.
+
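The health check above can also be done by polling instead of sleeping for a fixed interval. The sketch below retries the `/health` endpoint until it reports `{"status": "ok"}`. The URL and the expected payload come from the steps above; the attempt count, delay, and function names are illustrative assumptions.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8080/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> dict:
    """Return the decoded JSON body of the health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_healthy(fetch: Callable[[], dict] = fetch_health,
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll until the server reports status 'ok' or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not reachable or not ready yet; retry
        time.sleep(delay)
    return False
```

If `wait_until_healthy()` returns `False`, fall back to `systemctl status llama-server` as described in the Verification step.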
+### 7. Deployment & Validation Phase (Benchmarking & Baseline)
+- **Single-Slot Benchmark**: Run `python3 scripts-local/bench-llama.py 1 -p 1` to measure peak single-stream generation speed.
+- **Parallel-Slot Benchmark**: Run `python3 scripts-local/bench-llama.py 1 -p [PARALLEL]` (where `PARALLEL` matches the value in the `.conf` file) to measure aggregate throughput under load.
+- **Establish Baseline**: Once performance is optimized and stable across both tests, run `./scripts-local/save-baseline.sh scripts-local/[CONFIG_NAME].conf` to set the **Golden Baseline**.
+
+## Hard-Won Lessons (Reference)
+
+### The 10/10 UI/UX Standard (Mandatory)
+- **Header Scaling**: Use `###` (Header 3) as the maximum header size. Never use `##` or `#` (prevents oversized text on mobile).
+- **F-Scan Anchoring**: Always place emojis at the **absolute start** of the header (e.g., `### 🏗️ Engineering`) to enable vertical marginal scanning.
+- **Breathing Room**: Every system prompt must include the instruction: **"Use double-newlines and keep paragraphs under 3 lines."**
+- **Surgical Capabilities**: Only enable metadata `capabilities` (Vision, Search, Code Interpreter) that are directly relevant to the persona. Prune others to keep the UI clean.
+- **ID Hygiene**: Use short, lowercase, hyphenated IDs (e.g., `mythos-coder`) instead of long filenames or quant-heavy strings.
+- **Technical Rigor**: For Math/Logic profiles, explicitly define symbols in the system prompt (e.g., "Use $\therefore$ for 'therefore' and $\implies$ for 'implies'") to force formal derivation logic.
+- **The "No-Filler" Mandate**: For tool-like profiles (Extractor, FIM), use the phrase: **"Return ONLY the data. No conversational filler or introductory text."**
+- **Type Standardization**: All numerical parameters (e.g., `repeat_penalty: 1.0`) must be explicitly defined as **floats**, never integers, for parser robustness.
+
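The Type Standardization rule above can be enforced mechanically. The sketch below walks a profile's parameters and converts integers to floats, leaving booleans and strings untouched. The function name is illustrative, and the actual `generate-ui-profiles.py` may handle this differently.

```python
import json


def coerce_numbers_to_float(value):
    """Recursively convert ints to floats; leave bools and strings alone."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return value
    if isinstance(value, int):
        return float(value)
    if isinstance(value, dict):
        return {k: coerce_numbers_to_float(v) for k, v in value.items()}
    if isinstance(value, list):
        return [coerce_numbers_to_float(v) for v in value]
    return value


if __name__ == "__main__":
    params = {"repeat_penalty": 1, "top_k": 20, "system": "Use ### headers."}
    print(json.dumps(coerce_numbers_to_float(params)))
```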
+### Optimization Baseline
+- **MoE Ratio for Q6_K**: For **26B-A4B-Q6_K**, `N_CPU_MOE=14` with `CTX_SIZE=65536` (32K per slot across 2 slots) is the sweet spot for speed and stability on 16GB VRAM.
+- **OOM Prevention**: If the server fails with `status=11/SEGV`, increase `N_CPU_MOE` by 2 or decrease `CTX_SIZE` by 16,384.
+- **Sampler Stability**: Use `TEMP=0.7`, `MIN_P=0.02`, and `REPEAT_PENALTY=1.08` for Gemma-4 MoE models.
+- **DRY Sampler**: Always include DRY (`multiplier=0.8`, `base=1.75`) to prevent loops.
+- **KV Cache**: Use `CACHE_TYPE_K="q4_0"` and `CACHE_TYPE_V="q4_0"` by default.
+- **Autocomplete/FIM Profiles**: For high-speed "Turbo" Autocomplete/FIM profiles:
+    - **Omit Parameters**: Completely omit `thinking_budget_tokens` and `reasoning`.
+    - **Disable Thinking**: Set `"chat_template_kwargs": { "enable_thinking": false }`.
+    - **Thin Sampler Chain**: Use `"samplers": "top_k;top_p"` to minimize latency.
+- **Reasoning Budget Message**: When setting `REASONING_BUDGET_MESSAGE` in `.conf`, avoid using control tokens like `<channel|>` as they can prematurely interrupt the model's output stream. Use a clean string like `" [Logic Finalized] "`.
@@ -0,0 +1,45 @@
+# Configuration for {{model_name}}
+# Optimized for: AMD 7950X3D | RTX 4070 Ti Super (16GB) | CachyOS
+
+# --- PATHS ---
+MODEL_PATH="{{model_path}}"
+MMPRJ_PATH="{{mmproj_path}}"
+SERVICE_NAME="llama-server.service"
+
+# --- HARDWARE ---
+CPU_AFFINITY="0-15"
+THREADS=16
+N_GPU_LAYERS=999
+N_CPU_MOE={{n_cpu_moe}}
+
+# --- MEMORY ---
+CACHE_TYPE_K="{{cache_type_k}}"
+CACHE_TYPE_V="{{cache_type_v}}"
+CTX_SIZE={{ctx_size}}
+PARALLEL={{parallel}}
+CACHE_RAM=32768
+BATCH_SIZE=2048
+UBATCH_SIZE=2048
+
+# --- SAMPLING ---
+TEMP={{temperature}}
+MIN_P={{min_p}}
+XTC_PROBABILITY={{xtc_prob}}
+XTC_THRESHOLD={{xtc_thresh}}
+TOP_P=0.95
+TOP_K=50
+REPEAT_PENALTY=1.05
+REPEAT_LAST_N=64
+
+# --- DRY SAMPLER ---
+DRY_MULTIPLIER={{dry_mult}}
+DRY_BASE=1.75
+DRY_ALLOWED_LENGTH=2
+DRY_PENALTY_LAST_N=4096
+
+# --- INFRASTRUCTURE ---
+REASONING_BUDGET=-1
+REASONING_BUDGET_MESSAGE=" [Thinking budget reached, concluding logic...] "
+HOST="0.0.0.0"
+PORT=8080
+SLEEP_IDLE_SECONDS=300
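The `{{placeholder}}` fields in the template above need to be filled before the file is usable. A minimal rendering sketch follows; it fails loudly on any placeholder without a value so a half-rendered config cannot reach the server. The function name and the sample values are illustrative, and the skill's own tooling may render the template differently.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(text: str, values: dict) -> str:
    """Replace every {{name}} with values[name]; raise if one is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    snippet = "CTX_SIZE={{ctx_size}}\nPARALLEL={{parallel}}\n"
    print(render_template(snippet, {"ctx_size": 131072, "parallel": 3}))
```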
@@ -0,0 +1,35 @@
+[
+    {
+        "id": "{{model_id}}-fast",
+        "base_model_id": "{{model_filename}}",
+        "name": "⚡ Fast | {{display_name}}",
+        "params": {
+            "system": "You are a high-speed assistant. Perform a rapid factual check using your budget before answering. Use topic-relevant emojis.",
+            "temperature": {{temp_analytical}},
+            "min_p": 0.02,
+            "thinking_budget_tokens": {{budget_fast}}
+        }
+    },
+    {
+        "id": "{{model_id}}-coder",
+        "base_model_id": "{{model_filename}}",
+        "name": "🚀 Coder | {{display_name}}",
+        "params": {
+            "system": "You are a Lead Software Engineer. MANDATORY: Perform a line-by-line mental dry-run of code logic. Use --- dividers.",
+            "temperature": {{temp_coder}},
+            "min_p": 0.02,
+            "thinking_budget_tokens": {{budget_coder}}
+        }
+    },
+    {
+        "id": "{{model_id}}-pro",
+        "base_model_id": "{{model_filename}}",
+        "name": "🧠 Pro | {{display_name}}",
+        "params": {
+            "system": "You are an Elite Analytical Agent. REQUIRED: Explore 3+ distinct paths. Use semantic visual anchors.",
+            "temperature": {{temp_analytical}},
+            "min_p": 0.01,
+            "thinking_budget_tokens": {{budget_pro}}
+        }
+    }
+]
@@ -0,0 +1,55 @@
+# Workspace Core Mandates & Engineering Standards
+
+This file contains foundational mandates, architectural patterns, and hard-won lessons specific to this `llama.cpp` and OpenWeb-UI deployment repository. These rules take absolute precedence over general defaults.
+
+## 1. System Architecture & Deployment
+
+- **Deployment Script**: Always use `./scripts-local/rebuild-llama.sh [CONFIG_FILE]` to deploy or update the server. Never launch `llama-server` manually without the wrapper script.
+  - **VRAM Pre-flight**: Automatically checks if your config fits in the 16GB RTX 4070 Ti Super before starting. Aborts if >15.5GB.
+  - **Manual UI Sync**: OpenWeb-UI JSON profiles are no longer regenerated automatically. Use the `--generate-ui` flag in `rebuild-llama.sh` to trigger updates.
+- **Active Configurations**:
+  - `qwen-3.6-35b-a3b.conf`: APEX-I-Balanced (Port 8080 | CCD0 focus).
+  - `gemma-4-26b-a4b.conf`: PRISM-PRO-DQ (Port 8081 | CCD1 focus).
+- **Upstream Sync**: Use `./scripts-local/sync-fork.sh` to synchronize the local fork with the upstream repository and rebase active branches.
+
+## 2. Hardware Calibration (16GB VRAM | AMD 7950X3D)
+
+When tuning `.conf` files for MoE models, respect these physical limits and findings:
+- **CPU Affinity Strategy**:
+  - **Mixed Affinity (Preferred for Burst)**: Use cross-CCD strings like `0-7,24-31` (Qwen) and `16-23,8-15` (Mythos) to maximize single-model burst performance (up to 71 tok/s for 35B).
+  - **Strict Isolation (Preferred for Multitasking)**: Use strict ranges like `0-7,16-23` (CCD0) and `8-15,24-31` (CCD1) to reduce prompt evaluation stutter by ~24% during simultaneous load.
+- **CUDA Optimization**:
+  - **Peer Copy**: Always build with `-DGGML_CUDA_NO_PEER_COPY=OFF`. This is critical for MoE models when experts are split between VRAM and System RAM (DDR5), significantly reducing expert-switching latency.
+  - **Binary Compression**: For CUDA 12.8+, use `-DGGML_CUDA_COMPRESSION_MODE=speed` to optimize link times and kernel performance.
+  - **Architectures**: Set `CMAKE_CUDA_ARCHITECTURES="89"` for the RTX 4070 Ti Super.
+- **VRAM Safeguards**:
+  - Max combined Context + Weights must not exceed 15.5GB.
+  - Use `CACHE_TYPE_K="q4_0"` and `CACHE_TYPE_V="q4_0"` for maximum context (128K+) on 16GB cards.
+  - For Qwen 35B, use `N_CPU_MOE=32` to keep most experts on GPU while fitting in VRAM.
+
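To check which CCDs an affinity string from the CPU Affinity Strategy bullets actually touches, the sketch below expands the string and looks up each CPU. The CCD mapping follows the "Strict Isolation" bullet (`0-7,16-23` is CCD0 and `8-15,24-31` is CCD1); it is an assumption about this particular machine's CPU numbering and should be confirmed with `lscpu -e` before reuse.

```python
# ASSUMED topology, taken from the "Strict Isolation" bullet above.
CCD_OF_CPU = {
    **{cpu: 0 for cpu in (*range(0, 8), *range(16, 24))},
    **{cpu: 1 for cpu in (*range(8, 16), *range(24, 32))},
}


def parse_cpu_list(spec: str) -> set[int]:
    """Expand a string such as '0-7,24-31' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        start, _, end = part.partition("-")
        cpus.update(range(int(start), int(end or start) + 1))
    return cpus


def ccds_used(spec: str) -> set[int]:
    """Return the CCD indices covered by an affinity string."""
    return {CCD_OF_CPU[cpu] for cpu in parse_cpu_list(spec)}


if __name__ == "__main__":
    for spec in ("0-7,24-31", "0-7,16-23", "8-15,24-31"):
        print(spec, "->", sorted(ccds_used(spec)))
```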
+## 3. The 10/10 UI/UX Standard for OpenWeb-UI Profiles
+
+When generating or editing JSON profiles, you MUST strictly adhere to these standards:
+
+1. **Header Scaling**: Use `###` (Header 3) as the maximum header size. Never use `##` or `#`.
+2. **F-Scan Anchoring**: Always place emojis at the **absolute start** of the header (e.g., `### 🏗️ Engineering`) to enable vertical marginal scanning.
+3. **Mandatory Breathing Room**: Every system prompt must include: **"Use double-newlines and keep paragraphs under 3 lines."**
+4. **The FIM Mission**: Instant/FIM profiles must include: `"Predict the next most likely tokens. No conversational filler."`
+5. **Type Standardization**: All numerical parameters (budgets, penalties, samplers) must be explicitly defined as **floats** (e.g., `1024.0`, `1.05`) for OpenWeb-UI parser robustness.
+6. **ID Hygiene**: Use short, lowercase, hyphenated IDs starting with model type (e.g., `qw3-35b-fast`, `gemma4-26b-pro`).
+7. **The "No-Leak" Mandate**: For reasoning personas, use the phrase: **"Deliver the final solution only; strictly exclude internal reasoning traces from this response block."**
+
+## 4. Reasoning & Sampler Mechanics (MoE / APEX Optimization)
+
+- **APEX-I-Balanced Sampling (Qwen 3.6)**:
+  - **Temperature**: Maintain `0.40 - 0.55`. Lowering below `0.25` causes MoE "Expert Stagnation."
+  - **Min-P**: Use `0.05 - 0.1` to filter quantization noise while preserving the APEX imatrix diversity.
+  - **Repetition**: Use `repeat_penalty: 1.05` with `repeat_last_n: 256.0` to prevent MoE "End-of-Thought" stutter.
+- **Reasoning Budget Enforcement**:
+  - **Format**: Always use `REASONING_FORMAT="auto"` for non-DeepSeek models to ensure native tag detection.
+  - **Stop-Anchor**: Use `REASONING_BUDGET_MESSAGE=" [Logic Finalized] "` in `.conf` files to provide a graceful termination sequence.
+  - **Jinja Hygiene**: Set `preserve_thinking: false` in chat template kwargs (or omit it) to allow the C++ engine to natively manage and hide thinking tokens. This is required for `thinking_budget_tokens` to be respected.
+- **Standard Sampler Chains**:
+  - **Creative/Pro**: `"samplers": "dry;top_p;temperature"`
+  - **Logic/Coder**: `"samplers": "dry;top_k;min_p"`
+  - **DRY Sampler**: Always include DRY (`multiplier=0.8`, `base=1.75`) to prevent repetitive MoE loops.
@@ -0,0 +1,17 @@
+{
+    "gemma-4-26B-A4B-it-APEX-I-Quality.gguf.conf": {
+        "avg_gen": 29.2,
+        "agg_thr": 78.7,
+        "timestamp": "2026-04-16T22:09:56+00:00"
+    },
+    "gemma-4-26b-a4b-prism-pro-dq.gguf.conf": {
+        "avg_gen": 33.3,
+        "agg_thr": 93.2,
+        "timestamp": "2026-04-16T20:57:05+00:00"
+    },
+    "Qwen3.6-35B-A3B-APEX-I-Quality.gguf.conf": {
+        "avg_gen": 24.2,
+        "agg_thr": 72.5,
+        "timestamp": "2026-04-17T14:56:24+00:00"
+    }
+}
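A baseline file like the one above lends itself to a simple regression check after each rebuild. The sketch below assumes `avg_gen` is the average per-slot generation speed and `agg_thr` the aggregate throughput, both in tok/s, which is inferred from the benchmark table headings rather than stated in the file. The 5% tolerance and the function name are hypothetical.

```python
import json


def check_against_baseline(baselines: dict, config_name: str,
                           avg_gen: float, agg_thr: float,
                           tolerance: float = 0.05) -> list[str]:
    """Return a message for each metric that fell below baseline - tolerance."""
    base = baselines[config_name]
    problems = []
    for key, measured in (("avg_gen", avg_gen), ("agg_thr", agg_thr)):
        floor = base[key] * (1.0 - tolerance)
        if measured < floor:
            problems.append(
                f"{key}: {measured} < {floor:.1f} (baseline {base[key]})")
    return problems


if __name__ == "__main__":
    with open("scripts-local/baselines.json", encoding="utf-8") as fh:
        stored = json.load(fh)
    name = "gemma-4-26b-a4b-prism-pro-dq.gguf.conf"
    # Measured values here are placeholders, not real benchmark results.
    print(check_against_baseline(stored, name, avg_gen=33.0, agg_thr=93.0))
```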