Commit bbffeb6

Updated README (#13)
* Updated README
* Renamed
1 parent 27d0b9b commit bbffeb6

1 file changed

Lines changed: 24 additions & 11 deletions

README.md

@@ -10,12 +10,21 @@ Run a local LLM server with a REST API, manage GGUF models, and use the `go-llam
 ## Features

 - **Command Line Interface**: Interactive chat and completion tooling
-- **HTTP API Server**: REST endpoints for chat, completion, embeddings, and model management (not yet OpenAI or Anthropic compatible)
+- **HTTP API Server**: REST endpoints for chat, completion, embeddings, and model management
 - **Model Management**: Pull, cache, load, unload, and delete GGUF models
 - **Streaming**: Incremental token streaming for chat and completion
 - **GPU Support**: CUDA, Vulkan, and Metal (macOS) acceleration via llama.cpp
 - **Docker Support**: Pre-built images for CPU, CUDA, and Vulkan targets

+Some work remains on the chat endpoint. The following are not yet included but will eventually be supported:
+
+- Multi-modal support (images, audio, PDFs, etc.)
+- Reasoning/Thinking support
+- OpenAI or Anthropic compatible API
+- Tool calling
+- Grammar (JSON format output)
+- Text-to-Speech (Audio output)
+
 ## Quick Start

 Start the server with Docker:
@@ -33,18 +42,18 @@ Then use the CLI to interact with the server:
 export GOLLAMA_ADDR="localhost:8083"

 # Pull a model (Hugging Face URL or hf:// scheme)
-go-llama pull https://huggingface.co/unsloth/phi-4-GGUF/blob/main/phi-4-q4_k_m.gguf
+go-llama pull hf://bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q4_K_M.gguf

 # List models
 go-llama models

 # Load a model into memory
-go-llama load phi-4-q4_k_m.gguf
+go-llama load Llama-3.2-1B-Instruct-Q4_K_M.gguf

 # Chat (interactive)
-go-llama chat phi-4-q4_k_m.gguf "You are a helpful assistant"
+go-llama chat Llama-3.2-1B-Instruct-Q4_K_M.gguf "You are a helpful assistant"
 # Completion
-go-llama complete phi-4-q4_k_m.gguf "Explain KV cache in two sentences"
+go-llama complete Llama-3.2-1B-Instruct-Q4_K_M.gguf "Explain KV cache in two sentences"
 ~~~

 ## Model Support
@@ -54,23 +63,23 @@ go-llama complete phi-4-q4_k_m.gguf "Explain KV cache in two sentences"
 - `https://huggingface.co/<org>/<repo>/blob/<branch>/<file>.gguf`
 - `hf://<org>/<repo>/<file>.gguf`
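Since the Quick Start now uses the `hf://` form, a short sketch of how the two reference forms listed above relate may help readers with an existing browser URL. The `to_hf_scheme` helper is hypothetical and not part of `go-llama`. It assumes the `hf://` form refers to the same file on the repository's default branch, which the README does not state, and it assumes the branch name contains no `/`.

```shell
#!/bin/sh
# Hypothetical helper (not part of go-llama): rewrite a Hugging Face "blob" URL
# into the shorter hf:// form. The branch segment is dropped, on the ASSUMPTION
# that hf:// resolves against the repository's default branch.
to_hf_scheme() {
  _url=$1
  case "$_url" in
    hf://*)
      # Already in the short form: pass it through unchanged.
      printf '%s\n' "$_url"
      ;;
    https://huggingface.co/*/blob/*)
      _rest=${_url#https://huggingface.co/}   # <org>/<repo>/blob/<branch>/<file>.gguf
      _repo=${_rest%%/blob/*}                 # <org>/<repo>
      _tail=${_rest#*/blob/}                  # <branch>/<file>.gguf
      _file=${_tail#*/}                       # <file>.gguf (branch assumed to have no "/")
      printf 'hf://%s/%s\n' "$_repo" "$_file"
      ;;
    *)
      echo "unsupported model reference: $_url" >&2
      return 1
      ;;
  esac
}

to_hf_scheme "https://huggingface.co/unsloth/phi-4-GGUF/blob/main/phi-4-q4_k_m.gguf"
```

The result could then be passed to `go-llama pull`, which accepts either form according to the list above.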

-The default model cache directory is the system cache folder (e.g., `~/.cache/go-llama` on Linux, `~/Library/Caches/go-llama` on macOS) and can be overridden with `GOLLAMA_DIR` or `--models`.
+The default model cache directory is `${XDG_CACHE_HOME}/go-llama` (or system temp) and can be overridden with `GOLLAMA_DIR`.
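The new wording implies a lookup order that may be easier to follow as a sketch. The `resolve_model_dir` function below is hypothetical and only mirrors the sentence above. In particular, it assumes the "system temp" fallback means `${TMPDIR:-/tmp}/go-llama`, which the README does not spell out.

```shell
#!/bin/sh
# Hypothetical helper (not part of go-llama): print the directory where models
# would be cached, following the precedence described in the README text.
resolve_model_dir() {
  if [ -n "${GOLLAMA_DIR:-}" ]; then
    # An explicit override wins.
    printf '%s\n' "$GOLLAMA_DIR"
  elif [ -n "${XDG_CACHE_HOME:-}" ]; then
    # Default: a go-llama folder under the XDG cache directory.
    printf '%s\n' "$XDG_CACHE_HOME/go-llama"
  else
    # ASSUMPTION: "system temp" means $TMPDIR, falling back to /tmp.
    printf '%s\n' "${TMPDIR:-/tmp}/go-llama"
  fi
}

resolve_model_dir
```

Running the script with `GOLLAMA_DIR` set prints that value unchanged, which is a quick way to confirm which directory an override would select.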

 ## Docker Deployment

 Docker containers are published for Linux AMD64 and ARM64. Variants include:

-- **CPU**: `ghcr.io/mutablelogic/go-llama`
+- **CPU and Vulkan**: `ghcr.io/mutablelogic/go-llama`
 - **CUDA**: `ghcr.io/mutablelogic/go-llama-cuda`
-- **Vulkan**: `ghcr.io/mutablelogic/go-llama-vulkan`

 Use the `run` command inside the container to start the server. For GPU usage, ensure the host has the appropriate drivers and runtime.

 ## CLI Usage Examples

+Client-only commands:
+
 | Command | Description | Example |
 |---------|-------------|---------|
-| `gpuinfo` | Show GPU information | `go-llama gpuinfo` |
 | `models` | List available models | `go-llama models` |
 | `model` | Get model details | `go-llama model phi-4-q4_k_m.gguf` |
 | `pull` | Download a model | `go-llama pull hf://org/repo/model.gguf` |
@@ -82,9 +91,13 @@ Use the `run` command inside the container to start the server. For GPU usage, e
 | `embed` | Generate embeddings | `go-llama embed phi-4-q4_k_m.gguf "text"` |
 | `tokenize` | Convert text to tokens | `go-llama tokenize phi-4-q4_k_m.gguf "text"` |
 | `detokenize` | Convert tokens to text | `go-llama detokenize phi-4-q4_k_m.gguf 1 2 3` |
-| `run` | Run the HTTP server | `go-llama run --http.addr localhost:8083` |

-Use `go-llama --help` or `go-llama <command> --help` for full options.
+Use `go-llama --help` or `go-llama <command> --help` for full options. Server commands:
+
+| Command | Description | Example |
+|---------|-------------|---------|
+| `gpuinfo` | Show GPU information | `go-llama gpuinfo` |
+| `run` | Run the HTTP server | `go-llama run --http.addr localhost:8083` |

 ## Development
