Commit fddac55

gHashTag and ona-agent committed
Add supported models documentation
Co-authored-by: Ona <no-reply@ona.com>
1 parent 3602d37 commit fddac55

1 file changed

Lines changed: 96 additions & 0 deletions

docs/SUPPORTED_MODELS.md

@@ -0,0 +1,96 @@
# TRINITY Supported Models

## Quick Start

```bash
# General chat (fast, small)
./bin/vibee chat --model models/smollm-135m-instruct-q8_0.gguf --prompt "Hello"

# Coding (requires more RAM)
./bin/vibee chat --model models/qwen2.5-coder-0.5b-instruct-q8_0.gguf --prompt "Write fibonacci in Python"
```

## Supported Models

### General Chat

| Model | Size | RAM | Speed | Quality |
|-------|------|-----|-------|---------|
| SmolLM-135M-Instruct | 139 MB | 1 GB | 12 tok/s | ★★☆☆☆ |
| TinyLlama-1.1B | 1.1 GB | 4 GB | 4 tok/s | ★★★☆☆ |

### Coding

| Model | Size | RAM | Speed | Quality |
|-------|------|-----|-------|---------|
| Qwen2.5-Coder-0.5B | 600 MB | 2 GB | 6 tok/s | ★★★☆☆ |
| Qwen2.5-Coder-1.5B | 1.8 GB | 6 GB | 3 tok/s | ★★★★☆ |
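
The tables above can be read as a simple lookup: pick the highest-rated model whose RAM requirement fits the machine. The sketch below is only an illustration of that reading. The `Model` type and `pick_model` helper are hypothetical names and are not part of TRINITY; the numbers are copied from the tables.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    category: str  # "chat" or "coding"
    ram_gb: int    # RAM column from the tables above
    stars: int     # Quality column, counted as filled stars


# Values copied from the General Chat and Coding tables.
MODELS = [
    Model("SmolLM-135M-Instruct", "chat", 1, 2),
    Model("TinyLlama-1.1B", "chat", 4, 3),
    Model("Qwen2.5-Coder-0.5B", "coding", 2, 3),
    Model("Qwen2.5-Coder-1.5B", "coding", 6, 4),
]


def pick_model(category: str, available_ram_gb: float) -> Optional[Model]:
    """Return the highest-quality model of `category` that fits in RAM, or None."""
    fitting = [
        m for m in MODELS
        if m.category == category and m.ram_gb <= available_ram_gb
    ]
    return max(fitting, key=lambda m: m.stars, default=None)


# A 4 GB machine that needs a coding model.
print(pick_model("coding", 4))
```

Ranking by stars alone ignores speed; a caller that cares about tokens per second would need to add that column to the key.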

## Download Models

```bash
cd models/

# SmolLM-135M (recommended for demos)
curl -L -o smollm-135m-instruct-q8_0.gguf \
  "https://huggingface.co/HuggingFaceTB/smollm-135M-instruct-v0.2-Q8_0-GGUF/resolve/main/smollm-135m-instruct-add-basics-q8_0.gguf"

# Qwen2.5-Coder-0.5B (coding, small)
curl -L -o qwen2.5-coder-0.5b-instruct-q8_0.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF/resolve/main/qwen2.5-coder-0.5b-instruct-q8_0.gguf"

# Qwen2.5-Coder-1.5B (coding, better quality)
curl -L -o qwen2.5-coder-1.5b-instruct-q8_0.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct-GGUF/resolve/main/qwen2.5-coder-1.5b-instruct-q8_0.gguf"
```
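
Because `curl -L -o` writes whatever the server returns, a failed download can leave an HTML error page saved under a `.gguf` name. A quick sanity check is to confirm that each file starts with the four-byte GGUF magic (`GGUF`). This is a minimal sketch; the `looks_like_gguf` helper is a hypothetical name, and it checks only the magic bytes, not the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path) -> bool:
    """Return True if the file at `path` starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # Check every .gguf file in the models/ directory used by the commands above.
    for p in sorted(Path("models").glob("*.gguf")):
        status = "ok" if looks_like_gguf(p) else "NOT a GGUF file"
        print(f"{p.name}: {status}")
```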

## Quantization Support

| Format | Supported | Notes |
|--------|-----------|-------|
| Q8_0 | ✅ Yes | Best quality, larger size |
| Q4_0 | ✅ Yes | Good balance |
| Q4_K_M | ⚠️ Partial | K-quants, experimental |
| F32 | ✅ Yes | Full precision |
| F16 | ✅ Yes | Half precision |
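
To make the size trade-off in the table concrete, the sketch below estimates weight storage per format. It relies on the GGML block layout for Q8_0 and Q4_0 (blocks of 32 weights plus one 16-bit scale each, giving 8.5 and 4.5 bits per weight). Q4_K_M is left out because it mixes several block types. The `estimate_weight_mb` helper and the 135-million parameter count are illustrative assumptions, and real files also carry metadata and may store some tensors in other types.

```python
# Bits per weight for the formats in the table above.
# Q8_0 and Q4_0 store weights in blocks of 32 with one 16-bit scale per block.
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
    "Q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per weight
}


def estimate_weight_mb(n_params: int, fmt: str) -> float:
    """Estimate weight storage in MB (10^6 bytes), ignoring file metadata."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e6


# Hypothetical example: a model with 135 million parameters.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {estimate_weight_mb(135_000_000, fmt):.0f} MB")
```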

## Chat Templates

TRINITY auto-detects the model and uses the appropriate chat template:

| Model | Template |
|-------|----------|
| Qwen* | ChatML (`<\|im_start\|>`) |
| SmolLM* | ChatML |
| TinyLlama* | TinyLlama format |
| Llama-2* | Llama2 format |
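
As an illustration of the table above, the sketch below matches a model name against the listed glob patterns and renders a conversation in ChatML, where each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers. This is not TRINITY's implementation: `detect_template` and `format_chatml` are hypothetical helpers, and matching on the model name is an assumption about how detection could work.

```python
from fnmatch import fnmatchcase

# Glob patterns and template names copied from the table above.
TEMPLATES = [
    ("Qwen*", "chatml"),
    ("SmolLM*", "chatml"),
    ("TinyLlama*", "tinyllama"),
    ("Llama-2*", "llama2"),
]


def detect_template(model_name: str):
    """Return the template name for the first matching pattern, else None."""
    for pattern, template in TEMPLATES:
        if fnmatchcase(model_name, pattern):
            return template
    return None


def format_chatml(messages) -> str:
    """Render messages as ChatML and open an assistant turn for generation."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if detect_template("Qwen2.5-Coder-0.5B") == "chatml":
    print(format_chatml([{"role": "user", "content": "Hello"}]))
```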

## Fly.io Deployment

The current deployment uses SmolLM-135M, which fits in 2 GB of RAM:

```bash
# API endpoint
curl https://trinity-llm.fly.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
```
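
The same request can be built from Python with only the standard library. The sketch below mirrors the curl call: same URL, same header, same JSON body. `build_chat_request` and `send_chat` are hypothetical helper names, and the response is returned as raw text because the source does not document its schema.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://trinity-llm.fly.dev/v1/chat/completions"


def build_chat_request(prompt: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying a single user message as JSON."""
    body = json.dumps(
        {"messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `send_chat("Hello")` performs the network request; `build_chat_request` alone is enough to inspect the payload offline.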

For coding models, deploy with more RAM. The 4 GB below covers Qwen2.5-Coder-0.5B; the Coding table above lists 6 GB for Qwen2.5-Coder-1.5B:

```toml
# fly.toml
[[vm]]
size = "shared-cpu-4x"
memory = "4gb"
```

## Ternary Mode

Enable ternary mode for 16x memory savings (experimental):

```bash
./bin/vibee chat --model models/smollm-135m-instruct-q8_0.gguf --ternary
```

Note: Output quality may degrade for non-BitNet models.
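
The source does not say which baseline the 16x figure uses. One reading that reproduces it is an F32 baseline with each ternary weight (-1, 0, or +1) packed into 2 bits. The sketch below works through that arithmetic; the 2-bit packing and the `savings_vs` helper are assumptions for illustration, not a description of TRINITY's storage format.

```python
import math

# A ternary weight takes one of three values: -1, 0, or +1.
ENTROPY_BITS = math.log2(3)  # information content, about 1.58 bits per weight
PACKED_BITS = 2              # assumption: each weight is stored in 2 bits


def savings_vs(baseline_bits: float, ternary_bits: float = PACKED_BITS) -> float:
    """Return how many times smaller ternary storage is than the baseline."""
    return baseline_bits / ternary_bits


print(f"vs F32:  {savings_vs(32):.2f}x")
print(f"vs F16:  {savings_vs(16):.2f}x")
print(f"vs Q8_0: {savings_vs(8.5):.2f}x")  # Q8_0 uses 8.5 bits per weight
```

Under these assumptions, the saving relative to a Q8_0 file such as the one in the command above would be smaller than the saving relative to F32.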
