Commit 5a05548 (parent: d770bce)

chore: Update Gemma 4 benchmark metrics and add comprehensive testing suite

13 files changed: 835 additions & 111 deletions

README.md

Lines changed: 35 additions & 7 deletions
@@ -49,16 +49,42 @@ Reference implementations: [`turboquant-mlx`](https://github.com/sharpner/turboq
 
 ## 💻 Tested Hardware & Benchmarks
 
-To reliably run massive 122B parameter MoE models over SSD streaming, `SwiftLM` was designed and benchmarked natively on the following hardware:
+SwiftLM is designed to push Apple Silicon Unified Memory to its limits. All first-party benchmarks are generated on an **Apple M5 Pro (64 GB Unified Memory)**.
 
-- **Machine**: MacBook Pro, Apple M5 Pro
-- **Memory**: 64 GB Unified Memory
-- **Model**: Qwen3.5-122B-A10B-4bit
-- **SSD**: Internal Apple NVMe (Zero-Copy Streaming)
+Since we have limited access to the full spectrum of Apple hardware, **we welcome pull requests with benchmark results** from other devices (especially base M1/M2/M3 chips and Mac Studios).
 
-> **⚠️ Quantization Disclaimer**: While heavier quantization shrinks the required memory footprint, **4-bit quantization** remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like `\name\` instead of `"name"`—which systematically breaks OpenAI-compatible tool calling.
+To run the extreme context benchmark suite on your device, execute:
+```bash
+bash tests/run_extreme_context.sh <model-id>
+```
 
----
+### Extreme Context Performance (100K Tokens)
+Tested on an M5 Pro (64 GB) handling a monolithic **100,000 token** system prompt with **TurboKV Acceleration** enabled.
+
+| Model | Configuration | Time To First Token (TTFT) | Peak GPU Memory (w/ TurboKV) |
+|---|---|---|---|
+| `gemma-4-e4b-it-8bit` | Dense (4B) | 60.92s | 11.83 GB |
+| `gemma-4-26b-a4b-it-4bit` | MoE (26B) | 66.99s | 16.86 GB |
+| `gemma-4-31b-it-4bit` | MoE (31B) | 533.37s | 29.23 GB |
+
+### Throughput & Inference Memory Profile
+Tested by generating exactly 20 tokens under standard conversational evaluation (`--prefill-size 512`) to capture token generation speed (tok/s) and the peak Apple Metal memory footprint:
+
+| Model | Time To First Token (s) | Generation Speed (tok/s) | Peak GPU Memory (GB) |
+|---|---|---|---|
+| `gemma-4-e2b-it-4bit` | 0.08s | 116.27 tok/s | 1.37 GB |
+| `gemma-4-e4b-it-8bit` | 0.33s | 48.21 tok/s | 7.64 GB |
+| `gemma-4-26b-a4b-it-4bit` | 0.14s | 85.49 tok/s | 13.46 GB |
+| `gemma-4-31b-it-4bit` | 0.55s | 14.82 tok/s | 16.83 GB |
+
+To run the automated suite on your machine for these models, execute:
+```bash
+python3 tests/run_4models_benchmark.py
+```
+
+> **🧠 How it works:** SwiftLM implements **Chunked Prefill** (controlled via `--prefill-size`, defaulting to 512). This is functionally equivalent to `llama.cpp`'s `--batch-size` parameter and mirrors the [`mlx-lm` Python library](https://github.com/ml-explore/mlx/tree/main/mlx_lm)'s reference approach to preventing $O(N^2)$ Unified Memory over-allocation when prefilling very long sequences.
+
+> **⚠️ Quantization Disclaimer**: While heavier quantization shrinks the required memory footprint, **4-bit quantization** remains the strict production standard for MoE models. Our metrics indicate that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like `\name\` instead of `"name"`—which systematically breaks OpenAI-compatible tool calling.
 
 ---
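The Chunked Prefill idea described in the README note above can be sketched in a few lines of Python. This is a toy illustration, not SwiftLM's actual Swift implementation; only the `--prefill-size` default of 512 is taken from the document.

```python
# Toy sketch of chunked prefill: the prompt is fed to the model in fixed-size
# chunks so the attention workspace scales with the chunk size rather than the
# full prompt length. The real implementation lives in Swift; this just shows
# the chunking arithmetic.
def chunked_prefill(token_ids, prefill_size=512):
    """Yield successive prompt chunks for incremental KV-cache construction."""
    for start in range(0, len(token_ids), prefill_size):
        yield token_ids[start:start + prefill_size]

prompt = list(range(100_000))           # stand-in for a 100K-token prompt
chunks = list(chunked_prefill(prompt))
print(len(chunks), len(chunks[-1]))     # 196 chunks; the last holds 160 tokens
```

Each chunk extends the KV cache before the next is processed, which is why peak memory in the 100K benchmark stays bounded even though the prompt is monolithic.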

@@ -173,6 +199,7 @@ curl http://localhost:5413/v1/chat/completions \
 | `--port` | `5413` | Port to listen on |
 | `--host` | `127.0.0.1` | Host to bind |
 | `--max-tokens` | `2048` | Max tokens limit per generation |
+| `--prefill-size` | `512` | Prompt prefill chunk size (micro-batching for long contexts) |
 | `--gpu-layers` | `model_default` | Restrict the number of layers allocated to GPU hardware |
 | `--stream-experts` | `false` | Enable experimental SSD streaming for MoE model expert matrices |

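For reference, a client request against the server options documented above might be built like this. The endpoint path and defaults come from the options table and the `curl` example in the diff context; the model id is a placeholder.

```python
import json

# Sketch of an OpenAI-compatible chat request body for the SwiftLM server.
# Endpoint path and defaults taken from the options table above; the model
# id is a placeholder, not a required value.
url = "http://127.0.0.1:5413/v1/chat/completions"   # default --host / --port
payload = {
    "model": "mlx-community/gemma-4-e4b-it-8bit",   # placeholder model id
    "max_tokens": 2048,                             # server default
    "stream": True,
    "messages": [{"role": "user", "content": "Hello!"}],
}
body = json.dumps(payload).encode("utf-8")          # POST this to `url`
```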
@@ -187,6 +214,7 @@ curl http://localhost:5413/v1/chat/completions \
 
 Built entirely on the hard work of the Apple MLX community.
 - [mlx-swift](https://github.com/ml-explore/mlx-swift) — Apple MLX framework for Swift
+- [mlx-lm](https://github.com/ml-explore/mlx/tree/main/mlx_lm) — Python reference implementation for MLX language models (inspiration for the prompt chunking architecture)
 - [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Event-driven Swift HTTP server
 - [flash-moe](https://github.com/danveloper/flash-moe) — Reference for SSD Expert Streaming

tests/benchmark.sh

Lines changed: 0 additions & 103 deletions
This file was deleted.

tests/run_4models_100k.py

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
```python
import subprocess
import time
import urllib.request
import json

models = [
    "mlx-community/gemma-4-e4b-it-8bit",
    "mlx-community/gemma-4-26b-a4b-it-4bit",
    "mlx-community/gemma-4-31b-it-4bit"
]

port = 5440
# 1 word ~ 1.3 tokens. "test " repeated 80,000 times gives ~100k tokens.
prompt_text = "Please write a story about a little bird. " + ("test " * 80000)
results = []

print("==========================================================")
print(" 🚀 EXTREME CONTEXT ISOLATION (100K TOKENS) MATRIX")
print("==========================================================\n")

for idx, model in enumerate(models):
    print("\n======================================")
    print(f"[{idx+1}/3] Benchmarking: {model}")
    print("======================================")

    # Enable turbo-kv for extreme context
    server_cmd = [
        ".build/release/SwiftLM",
        "--model", model,
        "--port", str(port),
        "--turbo-kv"
    ]

    with open("benchmark_100k_server.log", "w") as log_file:
        server_proc = subprocess.Popen(server_cmd, stdout=log_file, stderr=subprocess.STDOUT)

    # Wait for the server to load the model (poll /health for up to 20 minutes)
    loaded = False
    for _ in range(1200):
        try:
            req = urllib.request.Request(f"http://127.0.0.1:{port}/health", method="GET")
            with urllib.request.urlopen(req) as response:
                if response.status == 200:
                    loaded = True
                    break
        except Exception:
            pass
        time.sleep(1.0)

    if not loaded:
        print(f"Error: Server failed to start for {model}")
        server_proc.terminate()
        server_proc.wait()
        continue

    print(f"Model {model} loaded. Submitting 100K context...")
    time.sleep(2)  # Stabilize memory

    # Fire off evaluation
    payload = {
        "model": model,
        "stream": True,
        "max_tokens": 10,
        "messages": [{"role": "user", "content": prompt_text}]
    }

    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/chat/completions",
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method="POST"
    )

    start_time = time.time()
    ttft = None
    tokens = 0

    try:
        # This blocks until the first byte arrives (prefill duration)
        with urllib.request.urlopen(req, timeout=1200) as response:
            for line in response:
                line = line.decode('utf-8').strip()
                if not line or line == "data: [DONE]":
                    continue
                if line.startswith("data: "):
                    data_str = line[6:]
                    try:
                        data = json.loads(data_str)
                        if data.get('choices') and data['choices'][0]['delta'].get('content'):
                            if ttft is None:
                                ttft = time.time() - start_time
                                print(f" -> First Token received in {ttft:.2f}s!")
                            tokens += 1
                    except json.JSONDecodeError:
                        continue
    except Exception as e:
        print(f"Generation failed: {e}")

    duration = time.time() - start_time
    if ttft is None:
        ttft = duration

    # Get peak memory from the health endpoint
    peak_gb = 0
    try:
        req = urllib.request.Request(f"http://127.0.0.1:{port}/health", method="GET")
        with urllib.request.urlopen(req) as response:
            data = json.loads(response.read().decode('utf-8'))
            peak_mb = data.get("memory", {}).get("peak_mb", 0)
            peak_gb = peak_mb / 1024.0
    except Exception as e:
        print(f"Failed to fetch memory: {e}")

    # Teardown
    server_proc.terminate()
    try:
        server_proc.wait(timeout=5)
    except subprocess.TimeoutExpired:
        server_proc.kill()

    # Decode throughput excludes the prefill phase (TTFT)
    tps = tokens / (duration - ttft) if (duration - ttft) > 0 and tokens > 1 else 0
    print(f"--- Results for {model} ---")
    print(f"TTFT: {ttft:.2f}s | TPS: {tps:.2f} tok/s | Peak RAM: {peak_gb:.2f} GB | Tokens: {tokens}")

    results.append({
        "Model": model.split("/")[-1],
        "TTFT (s)": round(ttft, 2),
        "TPS": round(tps, 2),
        "Peak Mem (GB)": round(peak_gb, 2)
    })

print("\n\n=== FINAL 100K CONTEXT MARKDOWN TABLE ===")
print("| Model | 100K Time To First Token | Generation Speed | Peak GPU Memory (w/ TurboKV) |")
print("|---|---|---|---|")
for r in results:
    print(f"| `{r['Model']}` | {r['TTFT (s)']}s | {r['TPS']} tok/s | {r['Peak Mem (GB)']} GB |")
```
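As a sanity check on the script's prompt-size heuristic (using the script's own ~1.3 tokens-per-word estimate; real counts depend on each model's tokenizer):

```python
# The script repeats "test " 80,000 times; at the commented ~1.3 tokens/word
# heuristic this targets roughly 100K prompt tokens (tokenizer-dependent).
words = 80_000
est_tokens = int(words * 1.3)
print(est_tokens)  # 104000
```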
