@@ -10,12 +10,21 @@ Run a local LLM server with a REST API, manage GGUF models, and use the `go-llam
 ## Features
 
 - **Command Line Interface**: Interactive chat and completion tooling
-- **HTTP API Server**: REST endpoints for chat, completion, embeddings, and model management (not yet OpenAI or Anthropic compatible)
+- **HTTP API Server**: REST endpoints for chat, completion, embeddings, and model management
 - **Model Management**: Pull, cache, load, unload, and delete GGUF models
 - **Streaming**: Incremental token streaming for chat and completion
 - **GPU Support**: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
 - **Docker Support**: Pre-built images for CPU, CUDA, and Vulkan targets
 
+Some work remains on the chat endpoint. The following are not yet included, but will eventually be supported:
+
+- Multi-modal support (images, audio, PDFs, etc.)
+- Reasoning/Thinking support
+- OpenAI or Anthropic compatible API
+- Tool calling
+- Grammar (JSON format output)
+- Text-to-Speech (audio output)
+
 ## Quick Start
 
 Start the server with Docker:
@@ -33,18 +42,18 @@ Then use the CLI to interact with the server:
 export GOLLAMA_ADDR="localhost:8083"
 
 # Pull a model (Hugging Face URL or hf:// scheme)
-go-llama pull https://huggingface.co/unsloth/phi-4-GGUF/blob/main/phi-4-q4_k_m.gguf
+go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf
 
 # List models
 go-llama models
 
 # Load a model into memory
-go-llama load phi-4-q4_k_m.gguf
+go-llama load Llama-3.2-1B-Instruct-Q4_K_M.gguf
 
 # Chat (interactive)
-go-llama chat phi-4-q4_k_m.gguf "You are a helpful assistant"
+go-llama chat Llama-3.2-1B-Instruct-Q4_K_M.gguf "You are a helpful assistant"
 # Completion
-go-llama complete phi-4-q4_k_m.gguf "Explain KV cache in two sentences"
+go-llama complete Llama-3.2-1B-Instruct-Q4_K_M.gguf "Explain KV cache in two sentences"
 ~~~
 
 ## Model Support
@@ -54,23 +63,23 @@ go-llama complete phi-4-q4_k_m.gguf "Explain KV cache in two sentences"
 - `https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf`
 - `hf://<org>/<repo>/<file>.gguf`
 
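The two reference forms listed above share the same `<org>/<repo>/<file>` components; the URL form additionally names a branch. As a rough illustration of how they relate, the sketch below rewrites an `hf://` reference into the Hugging Face URL form. The function name `hf_to_url` and the default branch `main` are assumptions made for this example; how `go-llama pull` resolves `hf://` references internally is not described here.

```shell
#!/bin/sh
# hf_to_url: hypothetical helper (not part of go-llama) that rewrites
#   hf://<org>/<repo>/<file>.gguf
# into
#   https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf
# The branch defaults to "main", which is an assumption of this sketch.
hf_to_url() {
  ref="${1#hf://}"
  # Reject input that did not start with the hf:// scheme.
  [ "$ref" != "$1" ] || return 1
  branch="${2:-main}"
  org="${ref%%/*}"     # text before the first slash
  rest="${ref#*/}"     # text after the first slash
  repo="${rest%%/*}"   # text before the second slash
  file="${rest#*/}"    # remainder of the reference
  # Require org, repo, and file to all be present and non-empty.
  [ -n "$org" ] && [ -n "$repo" ] && [ -n "$file" ] \
    && [ "$rest" != "$ref" ] && [ "$file" != "$rest" ] || return 1
  printf 'https://huggingface.co/%s/%s/blob/%s/%s\n' "$org" "$repo" "$branch" "$file"
}
```

For example, `hf_to_url hf://org/repo/model.gguf` prints `https://huggingface.co/org/repo/blob/main/model.gguf`, and a second argument selects a different branch.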
-The default model cache directory is the system cache folder (e.g., `~/.cache/go-llama` on Linux, `~/Library/Caches/go-llama` on macOS) and can be overridden with `GOLLAMA_DIR` or `--models`.
+The default model cache directory is `${XDG_CACHE_HOME}/go-llama` (or the system temporary directory) and can be overridden with `GOLLAMA_DIR`.
 
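The lookup order described in the added line can be sketched as a small shell function. This is an illustration under stated assumptions, not the actual go-llama implementation: the function name `models_dir` is hypothetical, and the source does not say whether the temporary-directory fallback uses a `go-llama` subfolder or which variable locates the temp directory, so `${TMPDIR:-/tmp}/go-llama` is a guess.

```shell
#!/bin/sh
# models_dir: hypothetical helper that prints the model cache directory.
# Order: GOLLAMA_DIR override, then ${XDG_CACHE_HOME}/go-llama,
# then a folder under the system temp directory.
models_dir() {
  if [ -n "${GOLLAMA_DIR:-}" ]; then
    # An explicit override wins.
    printf '%s\n' "$GOLLAMA_DIR"
  elif [ -n "${XDG_CACHE_HOME:-}" ]; then
    printf '%s\n' "$XDG_CACHE_HOME/go-llama"
  else
    # Assumption: the fallback is a go-llama folder under the temp dir.
    printf '%s\n' "${TMPDIR:-/tmp}/go-llama"
  fi
}
```

Running `GOLLAMA_DIR=/data/models models_dir` prints `/data/models` regardless of the other variables, which is a quick way to check which directory a given environment would select under these assumptions.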
 ## Docker Deployment
 
 Docker containers are published for Linux AMD64 and ARM64. Variants include:
 
-- **CPU**: `ghcr.io/mutablelogic/go-llama`
+- **CPU and Vulkan**: `ghcr.io/mutablelogic/go-llama`
 - **CUDA**: `ghcr.io/mutablelogic/go-llama-cuda`
-- **Vulkan**: `ghcr.io/mutablelogic/go-llama-vulkan`
 
 Use the `run` command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.
 
 ## CLI Usage Examples
 
+Client-only commands:
+
 | Command | Description | Example |
 |---------|-------------|---------|
-| `gpuinfo` | Show GPU information | `go-llama gpuinfo` |
 | `models` | List available models | `go-llama models` |
 | `model` | Get model details | `go-llama model phi-4-q4_k_m.gguf` |
 | `pull` | Download a model | `go-llama pull hf://org/repo/model.gguf` |
@@ -82,9 +91,13 @@ Use the `run` command inside the container to start the server. For GPU usage, e
 | `embed` | Generate embeddings | `go-llama embed phi-4-q4_k_m.gguf "text"` |
 | `tokenize` | Convert text to tokens | `go-llama tokenize phi-4-q4_k_m.gguf "text"` |
 | `detokenize` | Convert tokens to text | `go-llama detokenize phi-4-q4_k_m.gguf 1 2 3` |
-| `run` | Run the HTTP server | `go-llama run --http.addr localhost:8083` |
 
-Use `go-llama --help` or `go-llama <command> --help` for full options.
+Use `go-llama --help` or `go-llama <command> --help` for full options. Server commands:
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `gpuinfo` | Show GPU information | `go-llama gpuinfo` |
+| `run` | Run the HTTP server | `go-llama run --http.addr localhost:8083` |
 
 ## Development
 