A lightweight command-line chat interface for Qwen2.5-7B-Instruct using llama-cli.
QwenCliTest.sh provides a simple, colorful CLI chat interface for interacting with the Qwen2.5 language model. Unlike the full QwenVoice application (which includes speech recognition and TTS), this script focuses purely on text-based chat.
- ✅ Proper Qwen Chat Template: Implements `<|im_start|>` and `<|im_end|>` tokens correctly
- ✅ Conversation History: Maintains context across the conversation
- ✅ GPU Acceleration: Uses all 28 layers on RTX 2080 Ti
- ✅ Colorful Output: Green for Qwen, blue for user, cyan for status
- ✅ Special Commands:
  - `exit`, `quit`, or `q`: Exit the chat
  - `clear` or `reset`: Reset conversation history
- ✅ Optimized Parameters: Temperature 1.0, Top-K 50, Top-P 0.92 for unhinged responses
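The special commands and color scheme listed above could be wired together roughly as follows. This is a minimal sketch, not the script's actual code: `classify_input` and the specific ANSI escape codes are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# ANSI color codes for the scheme described above (exact codes are assumptions).
GREEN=$'\033[0;32m'   # Qwen replies
BLUE=$'\033[0;34m'    # user prompt
CYAN=$'\033[0;36m'    # status messages
RESET=$'\033[0m'

# classify_input is a hypothetical helper: it maps raw user input to an action name.
classify_input() {
    case "$1" in
        exit|quit|q)  echo "exit" ;;
        clear|reset)  echo "reset" ;;
        "")           echo "skip" ;;
        *)            echo "chat" ;;
    esac
}

# Print the action for a sample input as a cyan status line.
printf '%s[%s]%s\n' "$CYAN" "$(classify_input "clear")" "$RESET"
```

Keeping the classification in one function means the main loop only has to branch on four action names instead of repeating the string comparisons.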
- Syntax Error: `MODEL_PATH= ~/home/panda/models/...` had a space before `~`
- Wrong Path: Used `~/home/panda/` instead of just `~/`
- Invalid Flag: Used `--input-context`, which doesn't exist in llama-cli
- Wrong Context Handling: Tried to pipe context via stdin instead of using the `-p` flag
- Missing Chat Template: Didn't use Qwen's required `<|im_start|>` format
- ✅ Fixed model path to `~/models/qwen2.5-7b-instruct-jailbroken-q4_k_m.gguf`
- ✅ Implemented proper llama-cli flags: `-m`, `-c`, `-ngl`, `-p`, etc.
- ✅ Added Qwen chat template with `<|im_start|>` and `<|im_end|>` tokens
- ✅ Proper conversation history tracking using bash arrays
- ✅ Added error handling and dependency verification
- ✅ Colorful terminal output for better UX
- ✅ Response cleaning to remove llama.cpp artifacts
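The path and flag fixes above can be illustrated with a short sketch. It assumes the sampling settings documented in this README. `build_llama_args` and `LLAMA_ARGS` are hypothetical names, and the real script may assemble its command differently. A space after `=` makes bash treat the path as a command to run with an empty `MODEL_PATH`, so the broken form appears only as a comment.

```shell
#!/usr/bin/env bash
# Broken form, kept as a comment so it never runs:
#   MODEL_PATH= ~/home/panda/models/...
# Fixed form: no space after '=', and $HOME instead of a doubled home prefix.
MODEL_PATH="$HOME/models/qwen2.5-7b-instruct-jailbroken-q4_k_m.gguf"

# Settings as documented in this README.
CONTEXT_SIZE=8192
GPU_LAYERS=28
TEMPERATURE=1.0
TOP_K=50
TOP_P=0.92
REPEAT_PENALTY=1.15
MAX_TOKENS=256

# build_llama_args (hypothetical helper) fills the LLAMA_ARGS array for one prompt.
build_llama_args() {
    local prompt=$1
    LLAMA_ARGS=(
        -m "$MODEL_PATH"
        -c "$CONTEXT_SIZE"
        -ngl "$GPU_LAYERS"
        --temp "$TEMPERATURE"
        --top-k "$TOP_K"
        --top-p "$TOP_P"
        --repeat-penalty "$REPEAT_PENALTY"
        -n "$MAX_TOKENS"
        -p "$prompt"
    )
}

build_llama_args "Hello"
# Show the arguments that would be passed; a real call would be:
#   llama-cli "${LLAMA_ARGS[@]}"
printf '%q ' "${LLAMA_ARGS[@]}"; printf '\n'
```

Using an array rather than a single string keeps a multi-line prompt intact as one argument to `-p`.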
CONTEXT_SIZE=8192 # Large context for long conversations
GPU_LAYERS=28 # All layers on GPU (RTX 2080 Ti)
TEMPERATURE=1.0 # High randomness for chaotic responses
TOP_K=50 # Token diversity
TOP_P=0.92 # Nucleus sampling
REPEAT_PENALTY=1.15 # Prevent repetitive loops
MAX_TOKENS=256 # Response length

cd /home/panda/Documents/PythonScripts/QwenVoice
./QwenCliTest.sh

╔═══════════════════════════════════════════════╗
║ QwenCliTest - CLI Chat Interface ║
║ Powered by Qwen2.5-7B-Instruct ║
╚═══════════════════════════════════════════════╝
Model: qwen2.5-7b-instruct-jailbroken-q4_k_m.gguf
Context: 8192 | GPU Layers: 28/28
Temp: 1.0 | Top-K: 50 | Top-P: 0.92
Type 'exit' to quit, 'clear' to reset conversation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
You: what is 2+2?
[Thinking...]
Qwen: 4, obviously. Did you really need me for that?
You: tell me a joke
[Thinking...]
Qwen: Why did the AI cross the road? Because it was programmed to, and free will is just an illusion. Happy?
You: exit
Goodbye!
- llama.cpp built with CUDA support
- Qwen2.5-7B-Instruct model (Q4_K_M quantization)
- NVIDIA GPU with at least 6GB VRAM (RTX 2080 Ti recommended)
- Bash shell
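A startup check for the file-based requirements above might look like the following. This is a sketch under assumptions: `check_dependencies` is a hypothetical helper, and the script's own verification may test more (or different) things.

```shell
#!/usr/bin/env bash
# check_dependencies LLAMA_CLI MODEL_PATH
# Hypothetical helper: returns 0 when both files are usable, 1 otherwise.
check_dependencies() {
    local llama_cli=$1 model_path=$2 status=0
    if [ ! -x "$llama_cli" ]; then
        echo "Error: llama-cli not found or not executable: $llama_cli" >&2
        status=1
    fi
    if [ ! -f "$model_path" ]; then
        echo "Error: model file not found: $model_path" >&2
        status=1
    fi
    return "$status"
}

# Example call (commented out so the sketch has no side effects):
#   check_dependencies ~/Llama/llama.cpp/build/bin/llama-cli \
#       ~/models/qwen2.5-7b-instruct-jailbroken-q4_k_m.gguf || exit 1
```

Reporting every missing piece before returning, instead of stopping at the first one, saves a second run when both the binary and the model are absent.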
- Script: `/home/panda/Documents/PythonScripts/QwenVoice/QwenCliTest.sh`
- Model: `~/models/qwen2.5-7b-instruct-jailbroken-q4_k_m.gguf`
- llama-cli: `~/Llama/llama.cpp/build/bin/llama-cli`
- Model Loading: ~10-30 seconds (first time)
- Response Time: ~2-10 seconds per response (GPU)
- VRAM Usage: ~5-6 GB
# Check if model exists
ls -lh ~/models/qwen2.5-7b-instruct-jailbroken-q4_k_m.gguf
# If missing, download from Hugging Face
# (Add download instructions if needed)

# Verify llama.cpp is built
ls -lh ~/Llama/llama.cpp/build/bin/llama-cli
# If missing, rebuild llama.cpp
cd ~/Llama/llama.cpp
mkdir -p build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j $(nproc)

- Check GPU usage: `nvidia-smi`
- Ensure GPU_LAYERS=28 is set (all layers on GPU)
- Try reducing CONTEXT_SIZE or MAX_TOKENS
| Feature | QwenCliTest | QwenVoice |
|---|---|---|
| Text Chat | ✅ | ✅ |
| Speech Recognition | ❌ | ✅ |
| Text-to-Speech | ❌ | ✅ |
| Wake Word | ❌ | ✅ |
| Complexity | Simple | Advanced |
| Dependencies | llama-cli only | Python + F5-TTS + SpeechRecognition |
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hi! How can I help you?<|im_end|>
This format is critical for Qwen models to work correctly. The old script was missing these special tokens.
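One way to assemble this format in bash is sketched below. `format_turn` and `build_prompt` are hypothetical helpers, not functions from the script. Because bash does not expand `\n` inside double quotes, the sketch lets `printf` produce the real newlines.

```shell
#!/usr/bin/env bash
SYSTEM_PROMPT="You are a helpful assistant."

# format_turn ROLE CONTENT: wrap one message in Qwen's chat markers (no trailing newline).
format_turn() {
    printf '<|im_start|>%s\n%s<|im_end|>' "$1" "$2"
}

# build_prompt USER_INPUT [PRIOR_TURNS...]
# Emits the system turn, any prior turns, the new user turn,
# and an open assistant turn for the model to complete.
build_prompt() {
    local user_input=$1
    shift
    local turn
    printf '%s\n' "$(format_turn system "$SYSTEM_PROMPT")"
    for turn in "$@"; do
        printf '%s\n' "$turn"
    done
    printf '%s\n' "$(format_turn user "$user_input")"
    printf '<|im_start|>assistant\n'
}

# Usage: carry two earlier turns into a new prompt.
turns=()
turns+=("$(format_turn user "Hello!")")
turns+=("$(format_turn assistant "Hi! How can I help you?")")
build_prompt "what is 2+2?" "${turns[@]}"
```

Ending the prompt with an open `<|im_start|>assistant` line is what cues the model to generate the next reply rather than continue the user's message.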
The script maintains conversation history using a bash array:
conversation_history+=("<|im_start|>user\n${user_input}<|im_end|>")
conversation_history+=("<|im_start|>assistant\n${response}<|im_end|>")

The script removes llama.cpp artifacts using sed and grep:
- `<|im_end|>` and `<|im_start|>` tokens
- Empty lines and prompts (`>`)
- Loading messages (`llama_*`)
- Performance stats (`t/s`)
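The cleanup described above could be approximated with a small filter like this one. The patterns are assumptions inferred from the list, and `clean_response` is a hypothetical name. The script's actual sed and grep expressions may differ.

```shell
#!/usr/bin/env bash
# clean_response: read raw llama-cli output on stdin, write the cleaned reply to stdout.
clean_response() {
    # sed strips the chat markers and a leading "> " prompt marker;
    # grep then drops loader lines, throughput stats, and blank lines.
    sed -e 's/<|im_end|>//g' -e 's/<|im_start|>//g' -e 's/^> *//' \
      | grep -v -e '^llama_' -e '[0-9] t/s' -e '^[[:space:]]*$'
}

# Usage with sample input (the sample lines are made up for illustration):
printf '%s\n' \
    'llama_model_load: loading' \
    '> ' \
    'Hi! How can I help you?<|im_end|>' \
    | clean_response
```

Note that `grep -v` exits with status 1 when nothing survives the filter, so a caller running under `set -e` or `pipefail` would need to tolerate that case.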
✅ COMPLETED - Script is fully functional and ready to use
- Add support for loading different models
- Implement conversation saving/loading
- Add streaming output (show response as it generates)
- Create desktop launcher (`.desktop` file)
- Add chat history search
- Implement multi-turn summarization for long conversations
Created: 2025-12-26 | Last Updated: 2025-12-26 | Status: Completed ✅