Llama ccp WinUI3 Control Panel is a native Windows app for running and managing local GGUF models through bundled llama.cpp-based backends. It is designed as a local model/runtime control surface rather than a full chat application: pick a model, load it into memory, tune runtime settings, expose a local OpenAI-compatible endpoint, monitor resources, and keep the backend resident until you unload it.
The app targets users who want a lightweight local LLM server with more direct control than a black-box runner. It can run standard llama.cpp-compatible GGUF text models and DiffusionGemma GGUF models through separate bundled backends.
- Runs local GGUF models through
llama-server.exe. - Runs DiffusionGemma GGUF models through a bundled DiffusionGemma worker wrapper.
- Keeps one selected model loaded until the user unloads it.
- Exposes a local OpenAI-compatible endpoint for third-party tools.
- Lets users switch models from the UI.
- Downloads public Hugging Face GGUF files.
- Provides an HF Library browser with most-popular/newest sorting and hardware fit estimates.
- Shows CPU, RAM, process memory, uptime, GPU usage, and VRAM usage where available.
- Redirects backend logs into an in-app runtime console.
- Provides a small Chat Test page for checking the currently loaded model.
- Supports app startup registration and close-to-system-tray behavior.
- Select a custom app models folder.
- Optionally point the UI at the local Ollama model store for discovery.
- Scan for
.gguffiles. - Hide models from the app library without deleting files.
- Reveal selected models in Explorer.
- Detect model architecture and likely capabilities.
- Detect vision/projector companion files such as
mmprojwhere available.
- Search public Hugging Face repositories.
- Sort compatible results by newest or most popular.
- List up to 100 compatible repositories.
- Show available GGUF files in a selected repo.
- Display file size, quantization, backend type, capabilities, and local fit estimates.
- Color-code fit:
- Green: good fit.
- Yellow: likely usable or unknown.
- Orange: tight fit or partial GPU offload likely.
- Red: too large or auxiliary/not standalone.
- Download public GGUF files into the selected app models folder.
- Resume interrupted downloads where the server supports byte ranges.
- Load, unload, and reload the selected model.
- Run backends without opening a separate console window.
- Configure host and port.
- Copy the local endpoint.
- Detect port conflicts.
- Keep the backend resident until explicitly unloaded.
- Stop backend processes on app exit.
- Use a Windows process lifetime guard so child backends are closed if the app exits unexpectedly.
- Context size.
- CPU threads.
- GPU layers.
- Batch size.
- UBatch size.
- Parallel slots.
- Flash attention.
- mmap/mlock toggles.
- Diffusion steps for DiffusionGemma.
- Custom additional backend arguments.
- Per-model runtime settings.
- Hardware-aware optimization button.
- CPU usage.
- RAM usage.
- Backend process memory.
- Backend uptime.
- Current model.
- Current host and port.
- GPU usage through
nvidia-smiwhen available. - VRAM usage through
nvidia-smiwhen available. - Clear fallback text when GPU metrics are unavailable.
- Live stdout/stderr capture.
- Timestamped log lines.
- Warning/error highlighting.
- Copy logs.
- Save logs.
- Clear logs.
The app is not intended to be a full chat client, but it includes a basic test page so users can verify the currently loaded endpoint. It supports text prompts and image attachment attempts for compatible vision models.
Used for standard GGUF text, embedding, tool-capable, reasoning, and vision-capable models that are supported by the bundled llama.cpp build.
Used for DiffusionGemma GGUF models. The app keeps the worker process resident and wraps it with local /v1/chat/completions and /v1/completions endpoints.
When a model is loaded, the app exposes a local endpoint such as:
http://127.0.0.1:8080
Useful paths:
/health
/v1/completions
/v1/chat/completions
This allows other local tools to connect as long as they support OpenAI-style local server URLs.
This repository and release package do not include model files.
Users can:
- Download public GGUF models from the HF Library page.
- Paste a public Hugging Face repository URL.
- Paste a direct public
.gguffile URL. - Place
.gguffiles manually in the configured models folder.
Private, gated, or token-protected Hugging Face downloads are not currently supported.
For first tests, use smaller quantized GGUF files such as Q4 or Q5 variants. They load faster and are more likely to fit on consumer GPUs.
Examples of useful search terms in the HF Library:
gemma gguf
qwen3 gguf
embedding gguf
diffusiongemma
- CPU-only mode can work but will usually be slower.
- NVIDIA GPUs are detected through
nvidia-smi. - CUDA-enabled bundled backends are included in the ready-to-run package.
- GPU/VRAM monitoring depends on available drivers and
nvidia-smi. - Large BF16/F16 models may require more VRAM than consumer GPUs provide.
- Quantized Q4/Q5 models are usually better for local interactive use.
-
Download the release zip.
-
Extract it to a normal user-writable folder, for example:
C:\Tools\Llama ccp WinUI3 Control Panel -
Run:
Llama ccp WinUI3 Control Panel.exe -
Open Settings if you want to enable:
- Run on startup.
- Close to system tray.
-
Open Models or HF Library to add GGUF models.
-
Select a model and click Load model on the Dashboard or Runtime page.
Requirements:
- Windows 10/11.
- .NET SDK compatible with the target framework in the project file.
- Windows App SDK dependencies restored through NuGet.
Build:
dotnet restore
dotnet buildThe build copies a runnable unpackaged app into:
local-launch\
The public source repository does not store bundled backend binaries or model files. Official release downloads include the ready-to-run app package with the bundled backend files already included.
The release package should be created from the runnable output, not from bin, obj, downloaded models, or local settings.
The app stores user settings under the current Windows user profile in:
%LOCALAPPDATA%\Llama ccp WinUI3 Control Panel
Those settings are created on the user's device at runtime. The release package does not include local settings, personal paths, downloaded models, or machine-specific data.
Do not include:
Models/bin/obj/local-launch-next/- smoke test logs
- local settings from
%LOCALAPPDATA% - downloaded model files
Include:
- the root launcher
.exe - app DLLs/runtime files
- bundled backend binaries
- required backend DLLs
Assets/newlogo.*- screenshots for GitHub documentation
README.mdLICENSE- third-party notices, when distributing backend binaries
- Private/gated Hugging Face models are not supported.
- GPU metrics are best-effort and NVIDIA-focused.
- Vision models may require matching projector files.
- The Chat Test page is intentionally basic.
- DiffusionGemma support is a local wrapper around a resident worker process, not a full general diffusion/image generation system.
- This app does not currently package image-generation backends such as Stable Diffusion, SDXL, or Flux.
This is an MVP-level local runtime manager. It is functional, but still evolving around backend tuning, model compatibility detection, and release packaging.





