This project is a demo app providing a quick-start example of Llama Stack, with two UIs: one built with Chainlit and one with Streamlit.
We describe two quick start approaches to running this application:
- Docker
- "Native" execution
If you are on a Linux machine with an AMD64-compatible CPU, the Docker option is the fastest way to try out the application. On other machines, or if you use an alternative to Docker such as Podman, use the "native" execution option.
For convenient, quick invocation, you can run the app using Docker.
Warning
At this time, we only recommend this approach if you are working on a Linux system with an AMD64-compatible processor. Also, if you are a Podman user, note that this quick start uses `docker compose`, and `podman compose` doesn't appear to be sufficiently compatible to serve as a replacement.
- Install Docker. You will also need Docker Compose.
- Copy the environment file and customize it, as needed:

  ```shell
  cp .env.example .env
  ```

- Start all services:

  ```shell
  docker compose up --detach
  ```

- Wait for all services to be healthy (this can take a minute...) and then access the applications in your browser. These are the default ports if you did not customize them in the `.env` file:
  - Chainlit Chat Interface: localhost:9090
  - Llama Stack Playground: localhost:8501
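If you script against these services, the default URLs above can be derived from the same environment settings. Here is a minimal sketch; the variable names `CHAINLIT_PORT` and `PLAYGROUND_PORT` are assumptions for illustration — check `.env.example` for the names your `.env` actually uses:

```python
import os

def service_urls(env=os.environ):
    # CHAINLIT_PORT and PLAYGROUND_PORT are hypothetical variable names;
    # the defaults match the ports listed above.
    chainlit_port = env.get("CHAINLIT_PORT", "9090")
    playground_port = env.get("PLAYGROUND_PORT", "8501")
    return {
        "chainlit": f"http://localhost:{chainlit_port}",
        "playground": f"http://localhost:{playground_port}",
    }

# With no overrides, this returns the default URLs from the list above.
urls = service_urls({})
print(urls["chainlit"])    # http://localhost:9090
print(urls["playground"])  # http://localhost:8501
```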
This approach involves more steps, but it works on a broader set of platforms and CPU architectures. We use `uv` to manage Python dependencies and run the applications. If you prefer not to use `uv`, manage the dependencies with `pip` or another alternative and remove the `uv run` prefixes shown.
Note
You may notice that some port numbers used in what follows differ from those in `docker-compose.yml` and the `Dockerfile.*` files used above. Some ports that are effectively hidden inside containers can collide with common services running on host operating systems, like macOS.
- Install `uv`.
- Run `uv sync` to install the Python dependencies. (This will create a `.venv` folder.)
Start the ollama server:

```shell
uv run ollama serve
```

Open a new terminal window and pull down the llama3.2:1b model we will use, then verify the list of models contains it:

```shell
uv run ollama pull llama3.2:1b
uv run ollama list
```

The `ollama list` output should contain the llama3.2:1b model.
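If you want to automate this verification step, a small sketch that scans `ollama list` output for the model name (the sample table below is illustrative; ollama prints the model name in the first column):

```python
def has_model(ollama_list_output: str, model: str = "llama3.2:1b") -> bool:
    # Scan each row of the table; the first column is the model name.
    for line in ollama_list_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(model):
            return True
    return False

# Illustrative output in the shape `ollama list` prints.
sample = """NAME          ID            SIZE      MODIFIED
llama3.2:1b   baf6a787fdff  1.3 GB    2 minutes ago"""

print(has_model(sample))  # True
```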
Build and run the server:

```shell
ENABLE_OLLAMA=ollama \
OLLAMA_INFERENCE_MODEL=llama3.2:1b \
LLAMA_STACK_PORT=5001 \
uv run --with llama-stack llama stack build \
  --template starter --image-type venv --run
```

Warning
Note that OLLAMA_INFERENCE_MODEL=llama3.2:1b doesn't have ollama/ before the model name. This is the identifier ollama expects for the model. In contrast, commands you'll see below use ollama/llama3.2:1b, which is the identifier Llama Stack uses.
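The two identifier forms differ only by the provider prefix, so converting between them is mechanical. A sketch of helpers for that conversion (not part of the demo code):

```python
def to_ollama_name(identifier: str) -> str:
    # "ollama/llama3.2:1b" (Llama Stack form) -> "llama3.2:1b" (ollama form)
    return identifier.split("/", 1)[1] if "/" in identifier else identifier

def to_stack_identifier(name: str, provider: str = "ollama") -> str:
    # "llama3.2:1b" (ollama form) -> "ollama/llama3.2:1b" (Llama Stack form)
    return name if "/" in name else f"{provider}/{name}"

print(to_ollama_name("ollama/llama3.2:1b"))  # llama3.2:1b
print(to_stack_identifier("llama3.2:1b"))    # ollama/llama3.2:1b
```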
It can take a moment to come up. It is ready when you see a message like this:
```
INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
```
Open a new terminal window and check that you can get the list of models Llama Stack knows about:
```shell
curl -f http://localhost:5001/v1/models
```

If this works successfully, you'll get some JSON back with the list of models. If you have `jq` installed, piping the output through `jq .` yields this result (and probably other models listed, too):
```json
{
  "data": [
    {
      "identifier": "ollama/llama3.2:1b",
      "provider_resource_id": "llama3.2:1b",
      "provider_id": "ollama",
      "type": "model",
      "metadata": {},
      "model_type": "llm"
    }
  ]
}
```
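In a script, the same check can be done by parsing the JSON. A sketch that filters a payload like the one above down to the identifiers of the LLMs (other `model_type` values, such as embedding models, may also appear in the list):

```python
import json

# The JSON payload returned by GET /v1/models, as shown above.
payload = """
{
  "data": [
    {
      "identifier": "ollama/llama3.2:1b",
      "provider_resource_id": "llama3.2:1b",
      "provider_id": "ollama",
      "type": "model",
      "metadata": {},
      "model_type": "llm"
    }
  ]
}
"""

def llm_identifiers(models_response: dict) -> list[str]:
    # Keep only entries whose model_type is "llm".
    return [m["identifier"] for m in models_response["data"]
            if m.get("model_type") == "llm"]

print(llm_identifiers(json.loads(payload)))  # ['ollama/llama3.2:1b']
```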
If it appears you connected successfully to the Llama Stack server, but some sort of error was returned, look at the terminal window where the Llama Stack server is running and see what errors are reported. For a successful response, you would see something like this:
```
INFO: ::1:52582 - "GET /v1/models HTTP/1.1" 200 OK
20:35:58.868 [START] /v1/models
20:35:58.871 [END] /v1/models [StatusCode.OK] (2.70ms)
```
Now you can run one or both of two GUI environments.
First, a GUI app built with Streamlit, called the Llama Stack Playground in the Docker quick start discussed above. This example uses a UI that comes with the `llama_stack` distribution, which is why we use a glob in the next command to locate the file inside the `.venv` directory:
```shell
LLAMA_STACK_ENDPOINT=http://localhost:5001 \
uv run --with streamlit,fireworks streamlit run \
  .venv/lib/python3.*/site-packages/llama_stack/distribution/ui/app.py \
  --server.port 8501 --server.address localhost
```

It should pop up a browser window with the GUI at http://localhost:8501. Select the ollama/llama3.2:1b model in the drop-down menu. If you don't, the first model in the list is used, which will most likely trigger an exception when you submit a query!
Note
If you plan to use this GUI regularly, consider installing streamlit and fireworks with `uv`:
```shell
uv add streamlit fireworks
```

Then you can remove `--with streamlit,fireworks` from the previous command.
A second GUI environment is a chat app built with Chainlit.
```shell
INFERENCE_MODEL=ollama/llama3.2:1b \
LLAMA_STACK_ENDPOINT=http://localhost:5001 \
uv run --with chainlit,fireworks chainlit run demo_01_app.py --host localhost --port 8000
```

Note
The model environment variable is specified with the `ollama/` prefix. If you don't specify a model, `demo_01_app.py` will grab the first LLM returned by the Llama Stack server, which likely won't work, causing a 500 error to be returned to the browser.
The command should pop up a browser window with the GUI at URL http://localhost:8000, where you can enter prompts.
You can also uv add chainlit fireworks, etc., if you prefer, as discussed for the first GUI.
If you used the Docker-based quick start, keep the containers running for what follows. If you used the "native" quick start execution, you'll need to keep the ollama and llama-stack services running.
Just as we did for the "native" quick start execution above, we'll use `uv`. Install `uv` if you haven't done this already, then run `uv sync` to install the Python dependencies.
If you don't want to use uv, then install the dependencies in pyproject.toml another way and omit the uv run command prefixes used next.
The llama-stack-client CLI is an alternative to the GUI apps. Try a few commands to verify connectivity to the services. First, knowing how to get help is useful. (We won't show the output for the next several commands, but obviously you shouldn't get errors!)
```shell
uv run llama-stack-client --help
```

For details about the sub-commands, e.g., for models:
```shell
uv run llama-stack-client models --help
uv run llama-stack-client models list --help
```

Tip
If you use a Llama Stack server endpoint different from the default http://localhost:5001 we are using, pass the `--endpoint http://server:port` option after `llama-stack-client` and before the sub-commands (`models` in this example).
Let's try models list:
```shell
uv run llama-stack-client models list
```

The content should be the same as the `curl` command used previously (`curl -f http://localhost:5001/v1/models`), except a nicely-formatted table is printed instead of JSON.
Try an inference call with the client CLI's `inference chat-completion` sub-command:
```shell
uv run llama-stack-client \
  inference chat-completion \
  --model-id 'ollama/llama3.2:1b' \
  --message "write a haiku for meta's llama models"
```

Note how the model id is specified, with the `ollama/` prefix. If you omit it, you'll get an error that the model couldn't be found.
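Under the hood, the CLI posts to Llama Stack's OpenAI-compatible endpoint (the httpx log line in the output shows `POST .../v1/openai/v1/chat/completions`). Here is a sketch of the request body such a call would send; it only builds the payload, since actually sending it requires the running server from the quick start:

```python
import json

# Endpoint from the httpx log line; adjust if you changed LLAMA_STACK_PORT.
ENDPOINT = "http://localhost:5001/v1/openai/v1/chat/completions"

def chat_request(model_id: str, prompt: str) -> dict:
    return {
        "model": model_id,  # note the ollama/ prefix expected by Llama Stack
        "messages": [{"role": "user", "content": prompt}],
    }

body = chat_request("ollama/llama3.2:1b", "write a haiku for meta's llama models")
print(json.dumps(body, indent=2))
```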
You should get output similar to this:
```
INFO:httpx:HTTP Request: POST http://localhost:5001/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
OpenAIChatCompletion(
    id='chatcmpl-36b421f0-dbf0-4e14-bd60-dbe947f5f0cb',
    choices=[
        OpenAIChatCompletionChoice(
            finish_reason='stop',
            index=0,
            message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
                role='assistant',
                content='Galloping digital\nReflections of the self stare back\nVirtual mystics',
                name=None,
                tool_calls=None,
                refusal=None,
                annotations=None,
                audio=None,
                function_call=None
            ),
            logprobs=None
        )
    ],
    created=1753474963,
    model='llama3.2:1b',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='fp_ollama',
    usage={
        'completion_tokens': 17,
        'prompt_tokens': 34,
        'total_tokens': 51,
        'completion_tokens_details': None,
        'prompt_tokens_details': None
    }
)
```
If you want to try an interactive session, replace the --message "..." arguments with --session. The prompt will be >>>. To exit the loop, use control-d or control-c.
Finally, run the demo client:
```shell
INFERENCE_MODEL=ollama/llama3.2:1b \
LLAMA_STACK_ENDPOINT=http://localhost:5001 \
uv run demo_01_client.py
```

- RAG (Retrieval Augmented Generation): The Chainlit app includes document ingestion and RAG capabilities
- Multiple UIs: Choose between the official Llama Stack Playground or the custom Chainlit interface
- Dockerized Setup: All services run in containers with proper health checks and dependencies
- Auto Model Pulling: Ollama automatically pulls the specified model on startup
- Tool calling with small models is inconsistent; sometimes it works, sometimes it doesn't. Use a bigger model for more consistent results.
- The Chainlit app automatically ingests documents on startup, which may take some time.
- All services use environment variables for configuration; customize via the `.env` file.
The project consists of four main services:
- Ollama: Provides local LLM inference
- Llama Stack: API server that interfaces with Ollama
- Llama Stack Playground: Official web UI for testing
- Chainlit App: Custom chat interface with RAG capabilities
In the Docker execution option described above, all services are orchestrated via Docker Compose with proper health checks and startup dependencies. The "native" execution alternative shows how to run each service individually, along with commands to use as health checks.
Here are some improvements we are considering. What do you think?
- Improve demo UI
- Add RAG steps (like AllyCat)
- Add AI Alliance branding (like AllyCat)
- Explore other UI frameworks (e.g. open-webui)
- Merge llama-stack-playground with llama-stack container
- Document Llama Stack issues
- undeclared dependencies for client: fire, requests
- ollama distribution embedding model name mismatch: `all-MiniLM-L6-v2` vs. `all-minilm:latest`