Commit 901bd96

chore: reset to master while preserving .gemini, scripts-local, and GEMINI.md
1 parent cc7200b commit 901bd96

26 files changed

Lines changed: 3343 additions & 0 deletions

.gemini/models/qwen-3.6-35b-a3b.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
# Technical Optimization Profile: Qwen 3.6 35B-A3B-APEX-I-Balanced
2+
3+
## 1. Documentation & Primary Sources
4+
This profile is synthesized from the following official technical specifications:
5+
* **APEX Implementation:** [mudler/Qwen3.6-35B-A3B-APEX-GGUF](https://huggingface.co/mudler/Qwen3.6-35B-A3B-APEX-GGUF/raw/main/README.md)
6+
* **Model Weights/Specs:** [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B/raw/main/README.md)
7+
* **APEX Methodology:** [mudler/apex-quant](https://github.com/mudler/apex-quant/blob/main/README.md)
8+
9+
---
10+
11+
## 2. Model Architecture & Deployment Rationale
12+
13+
### **Model Overview**
14+
* **Base:** Qwen 3.6 35B-A3B (256 Experts).
15+
* **Specialization:** Optimized for "Thinking Mode" reasoning, agentic coding (terminal/repo-level), and Model Context Protocol (MCP) tool-calling.
16+
* **Context Limit:** 262,144 (Native).
17+
* **Deployment Goal:** Stable 128K context window on 16GB VRAM (RTX 4070 Ti Super) with peak generation speed.
18+
19+
### **Hardware Calibration (16GB RTX 4070 Ti Super + 7950X3D)**
20+
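* **CCD Focus:** `CPU_AFFINITY="0-7,24-31"` keeps logic-heavy MoE branch routing on the V-Cache cores of CCD0 (logical CPUs 0-7). Note that `GEMINI.md` classifies this string as a mixed, cross-CCD affinity and gives `0-7,16-23` as the strict CCD0 range.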
21+
* **Expert Offloading:** `N_CPU_MOE=192` (64 experts on GPU, 192 on CPU).
22+
* *Finding:* Values < 192 are silently overridden by `llama-server` to fit in 16GB VRAM.
23+
* **Memory Locks:** `MLOCK=true` is mandatory to eliminate page-fault latency spikes when the router hits CPU-offloaded experts.
24+
* **KV Cache:** `CTX_SIZE=131072` (128K) with `q4_0` K/V types fits within the ~15GB safety limit.
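
The KV cache bullet can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the GGML `q4_0` layout (32 values packed into an 18-byte block) and uses placeholder attention dimensions (`n_attn_layers`, `n_kv_heads`, `head_dim`) that are illustrative only and not taken from the model card; substitute the real values before trusting the result.

```python
# q4_0 packs 32 values into 18 bytes (16 B of 4-bit values + 2 B scale).
Q4_0_BYTES_PER_VALUE = 18 / 32

def kv_cache_bytes(ctx_size, n_attn_layers, n_kv_heads, head_dim,
                   bytes_per_value=Q4_0_BYTES_PER_VALUE):
    """Estimate K+V cache size for the layers that keep a full attention cache."""
    values_per_token = 2 * n_attn_layers * n_kv_heads * head_dim  # K and V
    return int(ctx_size * values_per_token * bytes_per_value)

# Hypothetical dimensions, for illustration only (not from the model card).
size = kv_cache_bytes(ctx_size=131072, n_attn_layers=10, n_kv_heads=2, head_dim=256)
print(f"{size / 2**30:.2f} GiB")
```

Only layers that use standard attention contribute to this figure, so the estimate should be treated as a lower bound on total cache-related memory.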
25+
26+
---
27+
28+
## 3. APEX (Adaptive Precision for EXpert Models) Strategy
29+
30+
### **Quantization Details (Extracted from mudler/apex-quant)**
31+
* **Layer-wise Precision Gradient:** Not all layers are equal. Sensitivity is non-linear across the 40 blocks.
32+
* **Edge Layers:** First and last 5 layers are most sensitive; kept at higher precision (e.g., Q6_K).
33+
* **Shared Experts:** Must be at least **Q8_0** to maintain routing stability and prevent "logic drift."
34+
* **Imatrix (APEX-I):** Uses a diverse calibration dataset (chat, code, reasoning, tool-calling), accepting a negligible perplexity increase in exchange for significant gains in real-world accuracy.
35+
* **Tiering:** APEX-I Balanced (24GB GGUF) is the recommended tier for high-accuracy local deployment on 16GB-24GB cards.
36+
37+
---
38+
39+
## 4. llama.cpp Server Configuration & Findings
40+
41+
### **Thinking Mode Parameters (Official Qwen 3.6 Specs)**
42+
* **Thinking mode** is enabled by default to generate reasoning traces within `<think>` blocks.
43+
* **Reasoning Format:** `deepseek` (Default). Maps `<think>` blocks to `reasoning_content` in API responses.
44+
* **Agentic Continuity:** `preserve_thinking: true` and `enable_thinking: true` must be enabled via `--chat-template-kwargs`.
45+
* *Source Detail:* `preserve_thinking` allows the model to retain reasoning context from historical messages, which is critical for decision consistency in multi-turn agentic workflows.
46+
* **Reasoning Budget:** `REASONING_BUDGET=-1` (Infinite) to prevent truncating complex reasoning traces.
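
The settings above could be turned into `llama-server` arguments roughly as follows. This is a minimal sketch: the helper name `thinking_args` is hypothetical, and the assumption that the wrapper script maps `REASONING_BUDGET` onto `--reasoning-budget` is not confirmed by this profile.

```python
import json
import shlex

def thinking_args(preserve_thinking=True, enable_thinking=True,
                  reasoning_format="deepseek", reasoning_budget=-1):
    """Build the llama-server arguments that correspond to the bullets above."""
    kwargs = {"preserve_thinking": preserve_thinking,
              "enable_thinking": enable_thinking}
    return [
        "--reasoning-format", reasoning_format,
        "--reasoning-budget", str(reasoning_budget),
        # The template kwargs are passed as a single JSON string.
        "--chat-template-kwargs", json.dumps(kwargs),
    ]

args = thinking_args()
print(shlex.join(args))  # shell-quoted form, suitable for pasting into a script
```

Serializing the kwargs with `json.dumps` avoids hand-written quoting mistakes, since the flag expects valid JSON (`true`, not `True`).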
47+
48+
### **Sampling (Official Qwen 3.6 "General Thinking" Specs)**
49+
| Parameter | Value | Rationale |
50+
| :--- | :--- | :--- |
51+
| `TEMP` | 1.0 | Standard randomness for MoE diversity. |
52+
| `TOP_P` | 0.95 | Nucleus sampling. |
53+
| `TOP_K` | 20 | Tight top-K for reasoning precision. |
54+
| `MIN_P` | 0.0 | Official recommendation for general thinking tasks. |
55+
| `REPEAT_PENALTY` | 1.0 | Official spec recommendation (no penalty). |
56+
| `PRESENCE_PENALTY` | 1.5 | High penalty to force creative topic expansion. |
57+
| `SAMPLERS` | `dry;top_k;top_p;min_p;temperature` | Optimized chain order. |
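
To make the chain order concrete, here is a toy sketch of how `top_k`, `top_p`, `min_p`, and `temperature` narrow a distribution when applied in that sequence. It is a simplified illustration, not llama.cpp's implementation, and it leaves out the DRY stage because DRY depends on the token history.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, scaling by temperature first."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_chain(logits, top_k=20, top_p=0.95, min_p=0.0, temperature=1.0):
    """Return {token_index: probability} after top_k -> top_p -> min_p -> temperature."""
    # top_k: keep the k highest-logit tokens, best first.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in kept])
    # top_p: keep the shortest prefix whose cumulative probability reaches top_p.
    cumulative, cut = 0.0, len(kept)
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cut = n
            break
    kept, probs = kept[:cut], probs[:cut]
    # min_p: drop tokens whose probability is below min_p times the top probability.
    floor = min_p * probs[0]
    kept = [i for i, p in zip(kept, probs) if p >= floor]
    # temperature: rescale the surviving logits and renormalize.
    final = softmax([logits[i] for i in kept], temperature)
    return dict(zip(kept, final))

result = apply_chain([2.0, 1.0, 0.0, -5.0], top_k=3, top_p=0.9)
print(result)
```

With `MIN_P=0.0` the `min_p` stage is a no-op, so in the table's configuration the candidate set is decided by `top_k` and `top_p` alone.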
58+
59+
---
60+
61+
## 5. Benchmark Baselines (128K Context)
62+
63+
| Slot Configuration | Generation Speed (avg) | Aggregate Throughput |
64+
| :--- | :--- | :--- |
65+
| **1-Slot (Peak)** | **55.9 tok/s** | 55.9 tok/s |
66+
| **3-Slot (Concurrent)** | **26.7 tok/s** | **75.4 tok/s** |
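
The trade-off in the table can be expressed as two ratios. The short calculation below uses only the figures reported above; the variable names are illustrative.

```python
single_slot = 55.9   # tok/s, 1-slot generation speed
per_slot_3 = 26.7    # tok/s, average per slot with 3 slots
aggregate_3 = 75.4   # tok/s, aggregate throughput with 3 slots

throughput_gain = aggregate_3 / single_slot      # total tokens/s relative to one slot
per_stream_cost = 1 - per_slot_3 / single_slot   # share of speed each stream gives up
print(f"aggregate gain: {throughput_gain:.2f}x, per-stream slowdown: {per_stream_cost:.0%}")
```

In other words, three concurrent slots deliver about a third more total throughput while each individual stream runs at roughly half the single-slot speed.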
67+
68+
---
69+
70+
## 6. Maintenance & Sync Notes
71+
* **Update Script:** Always use `./scripts-local/rebuild-llama.sh scripts-local/qwen-3.6-35b-a3b.conf`.
72+
* **UI Sync:** The script automatically generates the OpenWeb-UI JSON profile with the mandatory `preserve_thinking` flags.
73+
* **Baseline File:** Baselines are stored in `scripts-local/baselines.json`.
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
---
2+
name: llm-stack-optimizer
3+
description: Generate optimized llama-server configs and OpenWeb-UI JSON profiles by researching model specs, benchmarks, and cloning from established high-quality templates.
4+
---
5+
6+
# LLM Stack Optimizer
7+
8+
This skill researches a new model's specifications and benchmarks, then generates a `llama-server` configuration tuned for your hardware, using established high-quality templates as a base.
9+
10+
## Input
11+
- **Base Model Name**: (e.g., `google_gemma-4-26B-A4B-it-Q6_K.gguf`)
12+
13+
## Workflow
14+
15+
### 1. Preparation Phase (Upstream Sync)
16+
- **Sync Check**: Ask the user if they want to synchronize the local repository with upstream before proceeding.
17+
- **Action**: If yes, execute `./scripts-local/sync-fork.sh`.
18+
- **Build Requirement**: If a sync was performed, the deployment step **MUST** include the `--build` flag in `rebuild-llama.sh` to update the server binary.
19+
20+
### 2. Intelligence Phase (Deep Research & Architecture)
21+
- **Search Objective**: Perform a comprehensive Google search for the model's official release notes, HuggingFace model card, and benchmark reports (PPL, MMLU, HumanEval).
22+
- **Architecture Analysis**: Identify if the model is Dense or MoE (Mixture of Experts), number of layers, and attention mechanisms (GQA/MQA).
23+
- **Research Summary**: Provide a detailed technical report including:
24+
- **Model Details**: Architecture, parameter count, expert count (if MoE).
25+
- **Performance**: Key benchmark scores (HumanEval, MMLU).
26+
- **Sources**: List the URLs/citations used for the data.
27+
- **Llama-Server Specs**: Proposed `llama-server` flags based on the architecture (e.g., `--reasoning`, `--chat-template`).
28+
- **User Input**: **Ask the user if they have additional resources** (local documents, URLs, or notes) to refine the config.
29+
- **Confirmation**: **Ask for explicit user confirmation** to proceed to creating the `.conf` and `.json` files.
30+
31+
### 3. Physical Phase (Hardware Constraints)
32+
- **VRAM Verification**: Use `python3 scripts-local/vram-linter.py [TEMP_CONFIG]` to validate usage for the 16GB RTX 4070 Ti Super.
33+
- **Safety Threshold**: Target < 15.5GB total usage. If the linter fails, increase `N_CPU_MOE` or decrease `CTX_SIZE`.
34+
- **MoE Expert Offloading**:
35+
- **Stability baseline (Gemma-4/128-expert)**: 112-120 experts on CPU (out of 128).
36+
- **Stability baseline (Qwen 3.6/256-expert)**: 240-244 experts on CPU (out of 256).
37+
- **Context Scaling (DeltaNet)**: For Hybrid DeltaNet models (Qwen 3.6), `CTX_SIZE=262144` is achievable on 16GB VRAM due to linear-attention efficiency (75% of layers).
38+
- **Affinity**: Always use `CPU_AFFINITY="0-15"` and `THREADS=16` for the 7950X3D.
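
The safety-threshold rule can be sketched as a small decision helper. This is not the logic of `vram-linter.py` (which is not shown here); the function name, the step sizes (borrowed from the OOM Prevention lesson later in this file), and the choice to offload experts before shrinking context are all assumptions.

```python
VRAM_LIMIT_GB = 15.5  # safety threshold from this phase

def next_adjustment(estimated_gb, n_cpu_moe, ctx_size, max_cpu_moe,
                    moe_step=2, ctx_step=16384, min_ctx=16384):
    """Return (n_cpu_moe, ctx_size) for the next attempt, or None if the config fits."""
    if estimated_gb < VRAM_LIMIT_GB:
        return None
    if n_cpu_moe + moe_step <= max_cpu_moe:
        return n_cpu_moe + moe_step, ctx_size   # offload more expert weights first
    if ctx_size - ctx_step >= min_ctx:
        return n_cpu_moe, ctx_size - ctx_step   # then shrink the context window
    raise ValueError("no remaining adjustment keeps the config within limits")

print(next_adjustment(16.0, n_cpu_moe=14, ctx_size=65536, max_cpu_moe=30))
```

A caller would re-run the linter after each adjustment and stop as soon as the helper returns `None`.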
39+
40+
### 4. Generation Phase (Templated Cloning)
41+
- **llama-server Config**:
42+
- **Base**: Clone from `scripts-local/gemma-4-26b-a4b-prism-pro-dq.gguf.conf`.
43+
- **Mandatory Flags**: `--cache-reuse 256`, `--prio 2`, `--context-shift`, `--kv-unified`.
44+
- **Qwen 3.6 / Agentic Specifics**:
45+
- Include `--chat-template-kwargs '{"preserve_thinking":true}'` for agentic preservation.
46+
- **Pristine History**: Set `REASONING_BUDGET_MESSAGE=""` (empty) to prevent non-native filler from disrupting recursive logic.
47+
- **Watchdog**: Set a global `REASONING_BUDGET=2048` to prevent infinite thinking loops.
48+
- **OpenWeb-UI Profiles**:
49+
- **Automation**: Do NOT manually create JSON profiles. Execute `python3 scripts-local/generate-ui-profiles.py` after saving the `.conf` file to generate all 8 standard personas (FIM, Coder, Pro, etc.).
50+
- **Standards**: The script automatically enforces the **10/10 UI Standard** (H3 headers, emoji anchoring, float types).
51+
- **Recursive Logic**: Ensure high-logic personas (Pro/Coder/Research) instruct the model to "recursively validate against previous reasoning traces."
52+
- **Tool Calling**: For Qwen 3.6 Coder models, ensure the persona uses the XML-based `qwen3_coder` format (detected automatically by modern llama-server).
53+
54+
### 5. Review & Confirmation Phase (Pre-Deployment)
55+
- **Technical Summary**: Present a final summary of the proposed stack:
56+
- **Model**: [Model Name & Quant]
57+
- **VRAM**: [Estimated Usage] GB / 15.5 GB
58+
- **Compute**: [N_CPU_MOE] experts on CPU, [N_GPU_LAYERS] layers on GPU.
59+
- **Context**: [CTX_SIZE] (Total) / [PARALLEL] (Slots).
60+
- **Reasoning**: [Budget/Format settings].
61+
- **Build Notification**: State if a full rebuild is triggered (mandatory if sync was performed).
62+
- **Mandatory Stop**: **Ask for explicit user confirmation** before proceeding to deployment.
63+
64+
### 6. Integration Phase (Deployment & Health Check)
65+
- **Deployment**: Execute `./scripts-local/rebuild-llama.sh scripts-local/[CONFIG_NAME].conf` (include `--build` if sync was performed).
66+
- **Initialization Delay**: **Wait 10 seconds** to allow the model to load into VRAM and the KV cache to initialize.
67+
- **Health Check**: Perform a health check via `curl -s http://localhost:8080/health`.
68+
- **Verification**: Ensure the response returns `{"status": "ok"}`. If the service is not healthy, check `systemctl status llama-server` and report the error.
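
The delay-then-check sequence could be scripted as below. This is a minimal sketch: the function names are hypothetical, and the `fetch` parameter exists only so the logic can be exercised without a running server.

```python
import json
import time
import urllib.request

def is_healthy(body):
    """Return True when a /health response body reports status 'ok'."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False  # not JSON, or JSON that is not an object

def wait_for_health(url="http://localhost:8080/health", delay=10, fetch=None):
    """Sleep for the initialization delay, then query the endpoint once."""
    time.sleep(delay)
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.read().decode("utf-8")
    try:
        return is_healthy(fetch(url))
    except OSError:  # connection refused, timeout, or HTTP error status
        return False
```

Calling `wait_for_health()` with no arguments follows the steps above; a `False` result is the cue to inspect `systemctl status llama-server`.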
69+
70+
### 7. Deployment & Validation Phase (Benchmarking & Baseline)
71+
- **Single-Slot Benchmark**: Run `python3 scripts-local/bench-llama.py 1 -p 1` to measure peak single-stream generation speed.
72+
- **Parallel-Slot Benchmark**: Run `python3 scripts-local/bench-llama.py 1 -p [PARALLEL]` (where `PARALLEL` matches the value in the `.conf` file) to measure aggregate throughput under load.
73+
- **Establish Baseline**: Once performance is optimized and stable across both tests, run `./scripts-local/save-baseline.sh scripts-local/[CONFIG_NAME].conf` to set the **Golden Baseline**.
74+
75+
## Hard-Won Lessons (Reference)
76+
77+
### The 10/10 UI/UX Standard (Mandatory)
78+
- **Header Scaling**: Use `###` (Header 3) as the maximum header size. Never use `##` or `#` (prevents oversized text on mobile).
79+
- **F-Scan Anchoring**: Always place emojis at the **absolute start** of the header (e.g., `### 🏗️ Engineering`) to enable vertical marginal scanning.
80+
- **Breathing Room**: Every system prompt must include the instruction: **"Use double-newlines and keep paragraphs under 3 lines."**
81+
- **Surgical Capabilities**: Only enable metadata `capabilities` (Vision, Search, Code Interpreter) that are directly relevant to the persona. Prune others to keep the UI clean.
82+
- **ID Hygiene**: Use short, lowercase, hyphenated IDs (e.g., `mythos-coder`) instead of long filenames or quant-heavy strings.
83+
- **Technical Rigor**: For Math/Logic profiles, explicitly define symbols in the system prompt (e.g., "Use $\therefore$ for 'therefore' and $\implies$ for 'implies'") to force formal derivation logic.
84+
- **The "No-Filler" Mandate**: For tool-like profiles (Extractor, FIM), use the phrase: **"Return ONLY the data. No conversational filler or introductory text."**
85+
- **Type Standardization**: All numerical parameters (e.g., `repeat_penalty: 1.0`) must be explicitly defined as **floats**, never integers, for parser robustness.
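
The Type Standardization rule is easy to enforce mechanically. The sketch below is one possible approach, not the logic of `generate-ui-profiles.py`; the helper name `coerce_floats` is hypothetical.

```python
import json

def coerce_floats(params):
    """Return a copy of a profile's params with int values converted to float."""
    fixed = {}
    for key, value in params.items():
        if isinstance(value, dict):
            fixed[key] = coerce_floats(value)   # recurse into nested objects
        elif isinstance(value, int) and not isinstance(value, bool):
            fixed[key] = float(value)           # 1 -> 1.0; booleans are left alone
        else:
            fixed[key] = value
    return fixed

params = {"repeat_penalty": 1, "temperature": 0.7,
          "chat_template_kwargs": {"enable_thinking": False}}
print(json.dumps(coerce_floats(params)))
```

The explicit `bool` check matters because Python treats `True` and `False` as integers, and flags such as `enable_thinking` must stay boolean.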
86+
87+
### Optimization Baseline
88+
- **MoE Ratio for Q6_K**: For **26B-A4B-Q6_K**, `N_CPU_MOE=14` + `CTX_SIZE=65536` (32K per slot for 2 slots) is the sweet spot for speed and stability on 16GB VRAM.
89+
- **OOM Prevention**: If the server fails with `status=11/SEGV`, increase `N_CPU_MOE` by 2 or decrease `CTX_SIZE` by 16,384.
90+
- **Sampler Stability**: Use `TEMP=0.7`, `MIN_P=0.02`, and `REPEAT_PENALTY=1.08` for Gemma-4 MoE models.
91+
- **DRY Sampler**: Always include DRY (`multiplier=0.8`, `base=1.75`) to prevent loops.
92+
- **KV Cache**: Use `CACHE_TYPE_K="q4_0"` and `CACHE_TYPE_V="q4_0"` by default.
93+
- **Autocomplete/FIM Profiles**: For high-speed "Turbo" Autocomplete/FIM profiles:
94+
- **Omit Parameters**: Completely omit `thinking_budget_tokens` and `reasoning`.
95+
- **Disable Thinking**: Set `"chat_template_kwargs": { "enable_thinking": false }`.
96+
- **Thin Sampler Chain**: Use `"samplers": "top_k;top_p"` to minimize latency.
97+
- **Reasoning Budget Message**: When setting `REASONING_BUDGET_MESSAGE` in `.conf`, avoid using control tokens like `<channel|>` as they can prematurely interrupt the model's output stream. Use a clean string like `" [Logic Finalized] "`.
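
As an illustration of the Autocomplete/FIM bullets, a profile entry might be assembled like this. The field layout mirrors the JSON profile template in this commit; the `-fim` suffix, the display name, and the helper name are assumptions.

```python
import json

def make_fim_profile(model_id, base_model_id, display_name):
    """Build a low-latency FIM persona following the bullets above."""
    return {
        "id": f"{model_id}-fim",
        "base_model_id": base_model_id,
        "name": f"FIM | {display_name}",
        "params": {
            # thinking_budget_tokens and reasoning are intentionally absent.
            "chat_template_kwargs": {"enable_thinking": False},
            "samplers": "top_k;top_p",
            "system": "Return ONLY the data. No conversational filler or introductory text.",
        },
    }

profile = make_fim_profile("example-model", "example-model.gguf", "Example Model")
print(json.dumps(profile, indent=2))
```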
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
# Configuration for {{model_name}}
2+
# Optimized for: AMD 7950X3D | RTX 4070 Ti Super (16GB) | CachyOS
3+
4+
# --- PATHS ---
5+
MODEL_PATH="{{model_path}}"
6+
MMPRJ_PATH="{{mmproj_path}}"
7+
SERVICE_NAME="llama-server.service"
8+
9+
# --- HARDWARE ---
10+
CPU_AFFINITY="0-15"
11+
THREADS=16
12+
N_GPU_LAYERS=999
13+
N_CPU_MOE={{n_cpu_moe}}
14+
15+
# --- MEMORY ---
16+
CACHE_TYPE_K="{{cache_type_k}}"
17+
CACHE_TYPE_V="{{cache_type_v}}"
18+
CTX_SIZE={{ctx_size}}
19+
PARALLEL={{parallel}}
20+
CACHE_RAM=32768
21+
BATCH_SIZE=2048
22+
UBATCH_SIZE=2048
23+
24+
# --- SAMPLING ---
25+
TEMP={{temperature}}
26+
MIN_P={{min_p}}
27+
XTC_PROBABILITY={{xtc_prob}}
28+
XTC_THRESHOLD={{xtc_thresh}}
29+
TOP_P=0.95
30+
TOP_K=50
31+
REPEAT_PENALTY=1.05
32+
REPEAT_LAST_N=64
33+
34+
# --- DRY SAMPLER ---
35+
DRY_MULTIPLIER={{dry_mult}}
36+
DRY_BASE=1.75
37+
DRY_ALLOWED_LENGTH=2
38+
DRY_PENALTY_LAST_N=4096
39+
40+
# --- INFRASTRUCTURE ---
41+
REASONING_BUDGET=-1
42+
REASONING_BUDGET_MESSAGE=" [Thinking budget reached, concluding logic...] "
43+
HOST="0.0.0.0"
44+
PORT=8080
45+
SLEEP_IDLE_SECONDS=300
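
The `{{placeholder}}` fields in this template can be filled with a few lines of Python. This is a sketch of one way to do it; the repository's own generator is not shown, so the function name and the strict missing-key behaviour are assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def render(template, values):
    """Replace every {{name}} with values[name]; fail loudly on a missing name."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(values[name])
    return PLACEHOLDER.sub(substitute, template)

snippet = 'CTX_SIZE={{ctx_size}}\nPARALLEL={{parallel}}\nCACHE_TYPE_K="{{cache_type_k}}"'
rendered = render(snippet, {"ctx_size": 131072, "parallel": 1, "cache_type_k": "q4_0"})
print(rendered)
```

Raising on an unknown placeholder prevents a half-rendered `.conf` file, with literal `{{...}}` text, from reaching the deployment script.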
Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
[
2+
{
3+
"id": "{{model_id}}-fast",
4+
"base_model_id": "{{model_filename}}",
5+
"name": "⚡ Fast | {{display_name}}",
6+
"params": {
7+
"system": "You are a high-speed assistant. Perform a rapid factual check using your budget before answering. Use topic-relevant emojis.",
8+
"temperature": {{temp_analytical}},
9+
"min_p": 0.02,
10+
"thinking_budget_tokens": {{budget_fast}}
11+
}
12+
},
13+
{
14+
"id": "{{model_id}}-coder",
15+
"base_model_id": "{{model_filename}}",
16+
"name": "🚀 Coder | {{display_name}}",
17+
"params": {
18+
"system": "You are a Lead Software Engineer. MANDATORY: Perform a line-by-line mental dry-run of code logic. Use --- dividers.",
19+
"temperature": {{temp_coder}},
20+
"min_p": 0.02,
21+
"thinking_budget_tokens": {{budget_coder}}
22+
}
23+
},
24+
{
25+
"id": "{{model_id}}-pro",
26+
"base_model_id": "{{model_filename}}",
27+
"name": "🧠 Pro | {{display_name}}",
28+
"params": {
29+
"system": "You are an Elite Analytical Agent. REQUIRED: Explore 3+ distinct paths. Use semantic visual anchors.",
30+
"temperature": {{temp_analytical}},
31+
"min_p": 0.01,
32+
"thinking_budget_tokens": {{budget_pro}}
33+
}
34+
}
35+
]

GEMINI.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,55 @@
1+
# Workspace Core Mandates & Engineering Standards
2+
3+
This file contains foundational mandates, architectural patterns, and hard-won lessons specific to this `llama.cpp` and OpenWeb-UI deployment repository. These rules take absolute precedence over general defaults.
4+
5+
## 1. System Architecture & Deployment
6+
7+
- **Deployment Script**: Always use `./scripts-local/rebuild-llama.sh [CONFIG_FILE]` to deploy or update the server. Never launch `llama-server` manually without the wrapper script.
8+
- **VRAM Pre-flight**: The script checks whether the config fits in the 16GB RTX 4070 Ti Super before starting and aborts if usage would exceed 15.5GB.
9+
- **Manual UI Sync**: OpenWeb-UI JSON profiles are no longer regenerated automatically. Use the `--generate-ui` flag in `rebuild-llama.sh` to trigger updates.
10+
- **Active Configurations**:
11+
- `qwen-3.6-35b-a3b.conf`: APEX-I-Balanced (Port 8080 | CCD0 focus).
12+
- `gemma-4-26b-a4b.conf`: PRISM-PRO-DQ (Port 8081 | CCD1 focus).
13+
- **Upstream Sync**: Use `./scripts-local/sync-fork.sh` to synchronize the local fork with the upstream repository and rebase active branches.
14+
15+
## 2. Hardware Calibration (16GB VRAM | AMD 7950X3D)
16+
17+
When tuning `.conf` files for MoE models, respect these physical limits and findings:
18+
- **CPU Affinity Strategy**:
19+
- **Mixed Affinity (Preferred for Burst)**: Use cross-CCD strings like `0-7,24-31` (Qwen) and `16-23,8-15` (Mythos) to maximize single-model burst performance (up to 71 tok/s for 35B).
20+
- **Strict Isolation (Preferred for Multitasking)**: Use strict ranges like `0-7,16-23` (CCD0) and `8-15,24-31` (CCD1) to reduce prompt evaluation stutter by ~24% during simultaneous load.
21+
- **CUDA Optimization**:
22+
- **Peer Copy**: Always build with `-DGGML_CUDA_NO_PEER_COPY=OFF`. This is critical for MoE models when experts are split between VRAM and System RAM (DDR5), significantly reducing expert-switching latency.
23+
- **Binary Compression**: For CUDA 12.8+, use `-DGGML_CUDA_COMPRESSION_MODE=speed` to optimize link times and kernel performance.
24+
- **Architectures**: Set `CMAKE_CUDA_ARCHITECTURES="89"` for the RTX 4070 Ti Super.
25+
- **VRAM Safeguards**:
26+
- Max combined Context + Weights must not exceed 15.5GB.
27+
- Use `CACHE_TYPE_K="q4_0"` and `CACHE_TYPE_V="q4_0"` for maximum context (128K+) on 16GB cards.
28+
- For Qwen 35B, use `N_CPU_MOE=32` to keep most experts on GPU while fitting in VRAM.
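
The two affinity strategies can be told apart programmatically. The sketch below assumes the logical-CPU layout implied by the Strict Isolation ranges above (`0-7,16-23` on CCD0, `8-15,24-31` on CCD1); verify it against `lscpu` on the target machine before relying on it.

```python
def parse_affinity(spec):
    """Expand a string such as '0-7,24-31' into a set of logical CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

# Assumed layout, taken from the "Strict Isolation" ranges above.
CCD0 = parse_affinity("0-7,16-23")
CCD1 = parse_affinity("8-15,24-31")

def classify(spec):
    """Label an affinity string as 'CCD0', 'CCD1', or 'mixed'."""
    cpus = parse_affinity(spec)
    if cpus <= CCD0:
        return "CCD0"
    if cpus <= CCD1:
        return "CCD1"
    return "mixed"

for spec in ("0-7,24-31", "0-7,16-23", "8-15,24-31"):
    print(spec, "->", classify(spec))
```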
29+
30+
## 3. The 10/10 UI/UX Standard for OpenWeb-UI Profiles
31+
32+
When generating or editing JSON profiles, you MUST strictly adhere to these standards:
33+
34+
1. **Header Scaling**: Use `###` (Header 3) as the maximum header size. Never use `##` or `#`.
35+
2. **F-Scan Anchoring**: Always place emojis at the **absolute start** of the header (e.g., `### 🏗️ Engineering`) to enable vertical marginal scanning.
36+
3. **Mandatory Breathing Room**: Every system prompt must include: **"Use double-newlines and keep paragraphs under 3 lines."**
37+
4. **The FIM Mission**: Instant/FIM profiles must include: `"Predict the next most likely tokens. No conversational filler."`
38+
5. **Type Standardization**: All numerical parameters (budgets, penalties, sampler thresholds) must be explicitly defined as **floats** (e.g., `1024.0`, `1.05`) for OpenWeb-UI parser robustness.
39+
6. **ID Hygiene**: Use short, lowercase, hyphenated IDs starting with model type (e.g., `qw3-35b-fast`, `gemma4-26b-pro`).
40+
7. **The "No-Leak" Mandate**: For reasoning personas, use the phrase: **"Deliver the final solution only; strictly exclude internal reasoning traces from this response block."**
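
Two of the standards above (Header Scaling and ID Hygiene) lend themselves to automated checks. The following is a minimal sketch with hypothetical helper names; the 32-character cap on IDs is an assumed reading of "short".

```python
import re

ID_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
HEADER_PATTERN = re.compile(r"^(#+)\s")

def valid_id(profile_id, max_length=32):
    """Short, lowercase, hyphen-separated IDs only (max_length is an assumed cap)."""
    return len(profile_id) <= max_length and bool(ID_PATTERN.match(profile_id))

def oversized_headers(system_prompt):
    """Return Markdown header lines that use '#' or '##' instead of '###' or deeper."""
    bad = []
    for line in system_prompt.splitlines():
        match = HEADER_PATTERN.match(line)
        if match and len(match.group(1)) < 3:
            bad.append(line)
    return bad

print(valid_id("qw3-35b-fast"))
print(oversized_headers("## Overview\n### 🏗️ Engineering\nBody text."))
```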
41+
42+
## 4. Reasoning & Sampler Mechanics (MoE / APEX Optimization)
43+
44+
- **APEX-I-Balanced Sampling (Qwen 3.6)**:
45+
- **Temperature**: Maintain `0.40 - 0.55`. Lowering below `0.25` causes MoE "Expert Stagnation."
46+
- **Min-P**: Use `0.05 - 0.1` to filter quantization noise while preserving the APEX imatrix diversity.
47+
- **Repetition**: Use `repeat_penalty: 1.05` with `repeat_last_n: 256.0` to prevent MoE "End-of-Thought" stutter.
48+
- **Reasoning Budget Enforcement**:
49+
- **Format**: Always use `REASONING_FORMAT="auto"` for non-DeepSeek models to ensure native tag detection.
50+
- **Stop-Anchor**: Use `REASONING_BUDGET_MESSAGE=" [Logic Finalized] "` in `.conf` files to provide a graceful termination sequence.
51+
- **Jinja Hygiene**: Set `preserve_thinking: false` in chat template kwargs (or omit it) to allow the C++ engine to natively manage and hide thinking tokens. This is required for `thinking_budget_tokens` to be respected.
52+
- **Standard Sampler Chains**:
53+
- **Creative/Pro**: `"samplers": "dry;top_p;temperature"`
54+
- **Logic/Coder**: `"samplers": "dry;top_k;min_p"`
55+
- **DRY Sampler**: Always include DRY (`multiplier=0.8`, `base=1.75`) to prevent repetitive MoE loops.

scripts-local/baselines.json

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
{
2+
"gemma-4-26B-A4B-it-APEX-I-Quality.gguf.conf": {
3+
"avg_gen": 29.2,
4+
"agg_thr": 78.7,
5+
"timestamp": "2026-04-16T22:09:56+00:00"
6+
},
7+
"gemma-4-26b-a4b-prism-pro-dq.gguf.conf": {
8+
"avg_gen": 33.3,
9+
"agg_thr": 93.2,
10+
"timestamp": "2026-04-16T20:57:05+00:00"
11+
},
12+
"Qwen3.6-35B-A3B-APEX-I-Quality.gguf.conf": {
13+
"avg_gen": 24.2,
14+
"agg_thr": 72.5,
15+
"timestamp": "2026-04-17T14:56:24+00:00"
16+
}
17+
}
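
A stored baseline is most useful when new benchmark runs are compared against it. The sketch below embeds two of the entries above (timestamps omitted) and flags metrics that regress beyond a tolerance; the function name and the 5% default are assumptions, not behaviour of `save-baseline.sh`.

```python
import json

BASELINES = json.loads("""
{
  "gemma-4-26b-a4b-prism-pro-dq.gguf.conf": {"avg_gen": 33.3, "agg_thr": 93.2},
  "Qwen3.6-35B-A3B-APEX-I-Quality.gguf.conf": {"avg_gen": 24.2, "agg_thr": 72.5}
}
""")

def regressions(config, avg_gen, agg_thr, tolerance=0.05):
    """List metrics that fell more than `tolerance` (a fraction) below the baseline."""
    base = BASELINES[config]
    measured = {"avg_gen": avg_gen, "agg_thr": agg_thr}
    return [name for name, value in measured.items()
            if value < base[name] * (1 - tolerance)]

print(regressions("Qwen3.6-35B-A3B-APEX-I-Quality.gguf.conf", avg_gen=24.0, agg_thr=60.0))
```

In practice the dictionary would be loaded from `scripts-local/baselines.json` rather than embedded in the script.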
