This example runs Qwen3.5-4B on Jetson Orin with llama.cpp and exposes an OpenAI-compatible API server.

It uses:
- a prebuilt Docker image archive imported locally on first run
- the `unsloth/Qwen3.5-4B-GGUF` model in `Q4_K_M` format
Supported JetPack/L4T targets:
- JetPack 6.1 -> L4T 36.4.0
- JetPack 6.2 -> L4T 36.4.3
- JetPack 6.2.1 -> L4T 36.4.4
Test status:
- validated on JetPack 6.2
- expected to work on JetPack 6.1 to 6.2.1
Prerequisites:
- NVIDIA Jetson Orin device
- Docker installed and available
- `aria2` installed
Install from PyPI:

```shell
pip install jetson-examples
```

Or install from GitHub:

```shell
git clone https://github.com/Seeed-Projects/jetson-examples
cd jetson-examples
pip install .
```

Start the demo:

```shell
reComputer run qwen3.5-4b
```

The first run downloads the image archive and model, then starts the server on:
http://127.0.0.1:8080
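The first start can take a while because the image archive and model are downloaded. A small helper that polls the server's `/v1/models` endpoint until it answers can be handy; this helper is illustrative and not part of jetson-examples:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 300.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused, timeout, ...): retry.
            time.sleep(1)
    return False
```

For example, call `wait_for_server("http://127.0.0.1:8080/v1/models")` before issuing chat requests.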
Check the model list:

```shell
curl http://127.0.0.1:8080/v1/models
```

Chat via the OpenAI-compatible API:
```shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512
  }'
```

Python example:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Environment variables:
- `QWEN35_PORT`: host port, default `8080`
- `QWEN35_CTX_SIZE`: context length, default `8192`
- `QWEN35_GPU_LAYERS`: override automatic GPU layer selection
- `QWEN35_MODELS_DIR`: model cache directory, default `$HOME/models`
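These settings resolve roughly as sketched below; this is an illustration of the documented defaults, not the actual launcher code:

```python
import os

# Sketch of how the documented defaults resolve (not the real launcher code).
port = int(os.environ.get("QWEN35_PORT", "8080"))
ctx_size = int(os.environ.get("QWEN35_CTX_SIZE", "8192"))
gpu_layers = os.environ.get("QWEN35_GPU_LAYERS")  # unset -> automatic selection
models_dir = os.environ.get(
    "QWEN35_MODELS_DIR",
    os.path.join(os.path.expanduser("~"), "models"),
)
```

Assuming the variables are read at launch, they can be set inline, e.g. `QWEN35_PORT=9090 reComputer run qwen3.5-4b`.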
Stop and remove the container:

```shell
reComputer clean qwen3.5-4b
```

The downloaded image archive and model cache are kept for faster startup next time.