
Open WebUI with OpenVINO Model Server {#ovms_demos_integration_with_open_webui}

Description

Open WebUI is a popular web user interface for generative models. It supports use cases such as text generation, RAG, and image generation, and it can delegate execution to remote services compatible with standard APIs like OpenAI chat completion and image generation.

The goal of this demo is to integrate Open WebUI with OpenVINO Model Server. It includes instructions for deploying the server with a set of models and configuring Open WebUI to delegate generation to the serving endpoints.


Setup

Prerequisites

This demo deploys OpenVINO Model Server on Linux with Docker containers or on Windows with a binary package. Open WebUI is installed via Python pip.

Requirements:

  • Host with x86_64 architecture
  • Linux or Windows
  • Docker Engine installed (Linux only)
  • Python 3.11 with pip
  • HuggingFace account to download models

There are other ways to fulfill the prerequisites, such as deploying OpenVINO Model Server on bare metal Linux or Windows, or installing Open WebUI with Docker. The steps in this demo can be reused across these options, and the references for each step cover both deployment types.

This demo can be followed without changes on a Panther Lake host with 64GB RAM and the VRAM allocation to the GPU extended using Intel Graphics Software. That way, all the mentioned models can be loaded simultaneously. It is also possible to use the llama-swap integration to reload models automatically. On hosts with less VRAM available, use a subset of the models, use other models, or configure a different target device like CPU or NPU. Check this list of preconfigured OpenVINO models.

Step 1: Pull model and start the OVMS server

::::{tab-set}
:::{tab-item} Windows
:sync: Windows

mkdir models
ovms.exe --pull --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path models --tool_parser gptoss --reasoning_parser gptoss --task text_generation --target_device GPU
ovms.exe --add_to_config --config_path  models\config.json --model_path OpenVINO\gpt-oss-20b-int4-ov --model_name ovms-model
ovms.exe --rest_port 8000 --config_path models\config.json --allowed_media_domains raw.githubusercontent.com

:::
:::{tab-item} Linux (using Docker)
:sync: Linux

mkdir models
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly --pull --source_model OpenVINO/gpt-oss-20b-int4-ov --model_repository_path /models --task text_generation --tool_parser gptoss --reasoning_parser gptoss --target_device GPU
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models openvino/model_server:weekly --add_to_config --config_path  /models/config.json --model_path OpenVINO/gpt-oss-20b-int4-ov --model_name ovms-model
docker run -d -u $(id -u):$(id -g) -v $PWD/models:/models -p 8000:8000 --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly --rest_port 8000 --config_path /models/config.json --allowed_media_domains raw.githubusercontent.com

:::
::::

Here is the basic call to check if it works:

curl http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"ovms-model\",\"messages\":[{\"role\":\"system\",\"content\":\"You are a helpful assistant.\"},{\"role\":\"user\",\"content\":\"Say this is a test\"}]}"
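The same check can be done from Python. The sketch below builds the chat completion payload and, when run against a live server, POSTs it to the endpoint; it assumes OVMS is listening on localhost:8000 with a model named ovms-model, as configured above.

```python
import json
import urllib.request

# Endpoint from the server started above (assumed to be running locally).
OVMS_URL = "http://localhost:8000/v3/chat/completions"

def build_chat_request(user_text, model="ovms-model"):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }

def send_chat_request(payload, url=OVMS_URL):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    reply = send_chat_request(build_chat_request("Say this is a test"))
    print(reply["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI API, the official `openai` Python client can also be pointed at `http://localhost:8000/v3` with any API key.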

Step 2: Install and start Open WebUI

Install Open WebUI:

pip install --no-cache-dir open-webui --extra-index-url "https://download.pytorch.org/whl/cpu"

Running Open WebUI:

open-webui serve

Go to http://localhost:8080 and create an admin account to get started.

get started with Open WebUI

Important Note: When using an NPU device for acceleration, it is recommended to disable Follow-Up Auto-Generation in the Settings > Interface menu. This improves response time and avoids queuing requests.

References

https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching.html

https://docs.openwebui.com


Chat

Step 1: Connections Setting

  1. Go to Admin Panel -> Settings -> Connections (http://localhost:8080/admin/settings/connections)
  2. Click +Add Connection under OpenAI API
    • URL: http://localhost:8000/v3
    • Model IDs: put ovms-model and click + to add the model, or leave empty to include all models
  3. Click Save

connection setting

Step 2: Start Chatting

Click New Chat and select the model to start chatting

chat demo

(optional) Step 3: Set request parameters

There are multiple configurable parameters in OVMS; all parameters for the /v3/chat/completions endpoint are listed in the chat API documentation.

To configure them in OpenWebUI with an example of turning off reasoning:

  1. Go to Admin Panel -> Settings -> Models (http://localhost:8080/admin/settings/models)
  2. Click on desired model, unfold Advanced Params.
  3. Click + Add Custom Parameter.
  4. Change parameter name to chat_template_kwargs and content to {"reasoning_effort": "low"}.
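The custom parameter configured above is passed through in the request body. The illustrative payload below (the user message is just an example) shows where chat_template_kwargs ends up in the resulting chat completion request:

```python
import json

# Illustrative request body; OVMS forwards "chat_template_kwargs" to the
# model's chat template, here lowering the reasoning effort.
payload = {
    "model": "ovms-model",
    "messages": [{"role": "user", "content": "Summarize OVMS in one sentence."}],
    "chat_template_kwargs": {"reasoning_effort": "low"},
}
print(json.dumps(payload, indent=2))
```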

parameter set

Reference

https://docs.openwebui.com/getting-started/quick-start/starting-with-openai-compatible


RAG

Step 1: Model Preparation

In addition to text generation, endpoints for embedding and reranking in Retrieval Augmented Generation can also be deployed with OpenVINO Model Server. In this demo, the embedding model is OpenVINO/Qwen3-Embedding-0.6B-fp16-ov and the reranking model is OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov. Run ovms with the --pull parameter to download the models and add them to the configuration:

::::{tab-set}
:::{tab-item} Windows
:sync: Windows

ovms.exe --pull --source_model OpenVINO/Qwen3-Embedding-0.6B-fp16-ov --model_repository_path models --task embeddings --target_device GPU
ovms.exe --add_to_config --config_path models\config.json --model_path OpenVINO\Qwen3-Embedding-0.6B-fp16-ov --model_name OpenVINO/Qwen3-Embedding-0.6B-fp16-ov
ovms.exe --pull --source_model OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov --model_repository_path models --task rerank --target_device GPU
ovms.exe --add_to_config --config_path models\config.json --model_path OpenVINO\Qwen3-Reranker-0.6B-seq-cls-fp16-ov --model_name OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov

:::
:::{tab-item} Linux (using Docker)
:sync: Linux

docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly --pull --source_model OpenVINO/Qwen3-Embedding-0.6B-fp16-ov --model_repository_path /models --task embeddings --target_device GPU
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models openvino/model_server:weekly --add_to_config --config_path /models/config.json  --model_path OpenVINO/Qwen3-Embedding-0.6B-fp16-ov --model_name OpenVINO/Qwen3-Embedding-0.6B-fp16-ov
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly --pull --source_model OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov --model_repository_path /models --task rerank --target_device GPU
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models openvino/model_server:weekly --add_to_config --config_path /models/config.json  --model_path OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov --model_name OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov

:::
::::

Keep the model server running or restart it. Here are the basic calls to check if they work:

curl http://localhost:8000/v3/embeddings -H "Content-Type: application/json" -d "{\"model\":\"OpenVINO/Qwen3-Embedding-0.6B-fp16-ov\",\"input\":\"hello world\"}"
curl http://localhost:8000/v3/rerank -H "Content-Type: application/json" -d "{\"model\":\"OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov\",\"query\":\"welcome\",\"documents\":[\"good morning\",\"farewell\"]}"
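A reranker returns a relevance score per document rather than reordered text. The sketch below shows how a client could reorder documents from a /v3/rerank response; the response shape is an assumption based on the common rerank API format (a "results" list with per-document "index" and "relevance_score"), and the scores are made up for illustration.

```python
def order_documents(documents, rerank_response):
    """Return documents sorted from most to least relevant."""
    ranked = sorted(
        rerank_response["results"],
        key=lambda r: r["relevance_score"],
        reverse=True,
    )
    return [documents[r["index"]] for r in ranked]

docs = ["good morning", "farewell"]
# Illustrative response for the curl call above (scores are invented).
response = {"results": [
    {"index": 0, "relevance_score": 0.92},
    {"index": 1, "relevance_score": 0.08},
]}
print(order_documents(docs, response))  # most relevant first
```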

Step 2: Documents Setting

  1. Go to Admin Panel -> Settings -> Documents (http://localhost:8080/admin/settings/documents)
  2. Select OpenAI for Embedding Model Engine
    • URL: http://localhost:8000/v3
    • Set Engine type to OpenAI
    • Embedding Model: OpenVINO/Qwen3-Embedding-0.6B-fp16-ov
    • Put anything in API key
  3. Enable Hybrid Search
  4. Select External for Reranking Engine
    • URL: http://localhost:8000/v3/rerank
    • Set Engine type to External
    • Reranking Model: OpenVINO/Qwen3-Reranker-0.6B-seq-cls-fp16-ov
  5. Click Save

embedding and retrieval setting

Step 3: Knowledge Base

  1. Prepare the Documentation

    The documentation used in this demo is https://github.com/open-webui/docs/archive/refs/heads/main.zip. Download and extract it to get the folder.

  2. Go to Workspace -> Knowledge -> + New Knowledge (http://localhost:8080/workspace/knowledge/create)

  3. Name and describe the knowledge base

  4. Click Create Knowledge

  5. Click + Add Content -> Upload directory, then select the extracted folder. This will upload all files with suitable extensions.

create a knowledge base

Step 4: Chat with RAG

  1. Click New Chat and enter the # symbol
  2. Select the documents that appear above the chat box for retrieval. Document icons will then appear above Send a Message.

select documents

  3. Enter a query and send

chat with RAG demo

Step 5: RAG-enabled Model

  1. Go to Workspace -> Models -> + New Model (http://localhost:8080/workspace/models/create)
  2. Configure the Model:
    • Name the model
    • Select a base model from the list
    • Click Select Knowledge and select a knowledge base for retrieval
  3. Click Save & Create

create and configure the RAG-enabled model

  4. Click the created model and start chatting

RAG-enabled model demo

Reference

https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching_rag.html

https://docs.openwebui.com/tutorials/tips/rag-tutorial


Image Generation

Step 1: Model Preparation

The image generation model used in this demo is OpenVINO/FLUX.1-schnell-int4-ov. Run ovms with the --pull parameter to download the model and add it to the configuration:

::::{tab-set}
:::{tab-item} Windows
:sync: Windows

ovms.exe --pull --source_model OpenVINO/FLUX.1-schnell-int4-ov --model_repository_path models --model_name OpenVINO/FLUX.1-schnell-int4-ov --task image_generation --default_num_inference_steps 3 --target_device GPU
ovms.exe --add_to_config --config_path models\config.json --model_path OpenVINO\FLUX.1-schnell-int4-ov --model_name OpenVINO/FLUX.1-schnell-int4-ov

:::
:::{tab-item} Linux (using Docker)
:sync: Linux

docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly --pull --source_model OpenVINO/FLUX.1-schnell-int4-ov --model_repository_path /models --model_name OpenVINO/FLUX.1-schnell-int4-ov --task image_generation --default_num_inference_steps 3 --target_device GPU
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models openvino/model_server:weekly  --add_to_config --config_path /models/config.json  --model_path OpenVINO/FLUX.1-schnell-int4-ov --model_name OpenVINO/FLUX.1-schnell-int4-ov

:::
::::

Keep the model server running or restart it. Here is the basic call to check if it works:

curl http://localhost:8000/v3/images/generations -H "Content-Type: application/json" -d "{\"model\":\"OpenVINO/FLUX.1-schnell-int4-ov\",\"prompt\":\"anime\",\"num_inference_steps\":1,\"size\":\"256x256\",\"response_format\":\"b64_json\"}"
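With response_format set to b64_json, the image comes back base64-encoded in the OpenAI images response format ({"data": [{"b64_json": ...}]}). A sketch of decoding and saving the first image, using a tiny placeholder payload instead of a real server response:

```python
import base64

def save_first_image(response, path="output.png"):
    """Decode the first b64_json image in the response and write it to disk."""
    image_bytes = base64.b64decode(response["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(image_bytes)
    return len(image_bytes)

# Illustrative response; a real call returns actual PNG bytes here.
fake_response = {"data": [{"b64_json": base64.b64encode(b"not-a-real-png").decode()}]}
print(save_first_image(fake_response, "demo.bin"))  # prints the byte count: 14
```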

Step 2: Image Generation Setting

Note: The instructions below were tested with Open WebUI v0.8.x. If you are using an older version (pre-v0.7.0), the settings UI and image generation methods may differ.

  1. Go to Admin Panel -> Settings -> Images (http://localhost:8080/admin/settings/images)

admin panel

  2. Set the Image Generation Engine to OpenAI
  3. Configure the OpenAI API connection:
    • URL: http://localhost:8000/v3
    • Put anything in API key
  4. Enable Image Generation (Experimental)
    • Set Default Model: e.g. OpenVINO/FLUX.1-schnell-int4-ov
    • Set Image Size. Must be in WxH format, example: 256x256
  5. Click Save

image generation setting

Step 3: Generate Image

  1. In the chat window, expand the Integrations menu
  2. Toggle the Image switch to on

generate prompt

  3. Enter a prompt describing the image you want and send

result image

Alternative methods (Open WebUI v0.7.0+):

Restore "Generate Image" Button: the built-in button on assistant messages was removed in v0.7.0. You can restore it by importing a community action: click Get to import, then enable it in Admin Panel -> Functions. Assistant messages will then show a Generate Image icon in the action bar.

Reference

https://docs.openvino.ai/2026/model-server/ovms_demos_image_generation.html

https://docs.openwebui.com/features/chat-conversations/image-generation-and-editing/openai

https://docs.openwebui.com/features/chat-conversations/image-generation-and-editing/usage/


VLM

Step 1: Model Preparation

The vision language model used in this demo is Junrui2021/Qwen3-VL-8B-Instruct-int4. Run ovms with the --pull parameter to download the model and add it to the configuration:

::::{tab-set}
:::{tab-item} Windows
:sync: Windows

ovms.exe --pull --source_model Junrui2021/Qwen3-VL-8B-Instruct-int4 --model_repository_path models --model_name ovms-model-vl --task text_generation --pipeline_type VLM_CB --target_device GPU
ovms.exe --add_to_config --config_path models\config.json --model_path Junrui2021\Qwen3-VL-8B-Instruct-int4 --model_name ovms-model-vl

:::
:::{tab-item} Linux (using Docker)
:sync: Linux

docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) openvino/model_server:weekly --pull --source_model Junrui2021/Qwen3-VL-8B-Instruct-int4 --model_repository_path /models --model_name ovms-model-vl --task text_generation --pipeline_type VLM_CB --target_device GPU
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models openvino/model_server:weekly --add_to_config --config_path /models/config.json  --model_path Junrui2021/Qwen3-VL-8B-Instruct-int4 --model_name ovms-model-vl

:::
::::

Keep the model server running or restart it. Here is the basic call to check if it works:

curl http://localhost:8000/v3/chat/completions  -H "Content-Type: application/json" -d "{ \"model\": \"ovms-model-vl\", \"messages\":[{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"what is in the picture?\"},{\"type\": \"image_url\", \"image_url\": {\"url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}}]}], \"max_completion_tokens\": 100}"
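The curl call above references a public image URL, which must match one of the --allowed_media_domains. Assuming OVMS also accepts OpenAI-style base64 data URLs for images (as the OpenAI chat API does), a local file can be embedded inline instead; a sketch of building such a message, using three placeholder bytes in place of a real JPEG:

```python
import base64

def image_message(text, image_bytes, mime="image/jpeg"):
    """Build a user message pairing a text prompt with an inline image."""
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

# Placeholder bytes; in practice read the image file with open(path, "rb").read().
msg = image_message("what is in the picture?", b"\xff\xd8\xff")
print(msg["content"][1]["image_url"]["url"][:30])
```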

Step 2: Chat with VLM

  1. Start a New Chat and choose the ovms-model-vl model
  2. Click + More to upload images, either by capturing the screen or by uploading files. The image used in this demo is http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg.

upload images

  3. Enter a query and send

chat with VLM demo

Reference

https://docs.openvino.ai/2026/model-server/ovms_demos_continuous_batching_vlm.html


AI agent with Tools

Step 1: Start Tool Server

Start an OpenAPI tool server available in the openapi-servers repo. The server used in this demo is https://github.com/open-webui/openapi-servers/tree/main/servers/weather. Run it locally at http://localhost:9000:

pip install mcpo
pip install mcp_weather_server
mcpo --port 9000 -- python -m mcp_weather_server

Step 2: Tools Setting

  1. Go to Admin Panel -> Settings -> Integrations
  2. Click +Manage Tool Servers
    • URL: http://localhost:9000
    • Name the tool
  3. Click Save

tools setting

Step 3: Chat with AI Agent

  1. Click Integrations -> Tools and toggle the tool on

activate the tool

  2. Enter a query and send

chat with AI Agent demo

Reference

https://docs.openwebui.com/features/extensibility/plugin/tools/openapi-servers/open-webui

Using Web Search

Step 1: Configure Web Search

  1. Go to Admin Panel -> Settings -> Web Search
  2. Enable Web Search
  3. Choose Web Search Engine
  4. Add API Key
  5. Click Save

web search configuration

Step 2: Enable Web Search in model

  1. Go to Admin Panel -> Settings -> Models
  2. Choose desired model
  3. Enable Web Search capability
  4. In Default Features enable Web Search or toggle it in the chat
  5. In Advanced Parameters set Function Calling to Native

function calling native

web search model configuration

Step 3: Use Web Search in the chat

  1. Open new Chat
  2. Enable Web Search if it is not already displayed as a blue icon below the prompt field.
  3. Send the prompt

web search usage

Reference

https://docs.openwebui.com/features/chat-conversations/web-search/agentic-search/

Adding Context to the prompt

In Open WebUI, users can add additional context to their chats using the Memory feature. This allows models to access shared information across all conversations.

To configure it:

  1. Go to Settings -> Personalization
  2. Enable Memory
  3. Click Manage
  4. Click Add Memory
  5. Enter the information

add memory

It's possible to have multiple manageable memory records.

multiple memory records

Then create a workspace model:

  1. Go to Workspace -> Models
  2. Choose a model or create one.
  3. In the Built-in Tools section, enable Memory
  4. In Advanced Parameters set Function Calling to Native

function calling native

model memory config

It's now available in all chats:

memory usage

Note: Open WebUI does not search memory automatically at the beginning of a conversation. Tell the model to use its memory for it to take effect.

Reference

https://docs.openwebui.com/features/chat-conversations/memory/

Code Interpreter

The Code Interpreter feature can be used in Open WebUI. To enable it:

  1. Go to Admin Panel -> Settings -> Models
  2. Choose desired model
  3. Enable Code Interpreter capability
  4. In Default Features enable Code Interpreter or toggle it in the chat
  5. In Advanced Parameters set Function Calling to Native

function calling native

  1. Go to Admin Panel -> Settings -> Code Execution
  2. Enable Code Interpreter and Code Execution

Code Interpreter is then ready to use: in a new chat, toggle Code Interpreter on and write a prompt.

code execution

Audio

Note: To ensure audio features work correctly, download FFmpeg and add its executable directory to your system's PATH environment variable.

Step 1: Models Preparation

Start by downloading the export_model.py script and run it to download and export the model for speech generation:

curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
python export_model.py text2speech --source_model microsoft/speecht5_tts --weight-format fp32 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --vocoder microsoft/speecht5_hifigan

Next, download the transcription model and add it to the configuration:

::::{tab-set}
:::{tab-item} Windows
:sync: Windows

ovms.exe --pull --source_model OpenVINO/whisper-base-fp16-ov --model_repository_path models --task speech2text --target_device GPU
ovms.exe --add_to_config --config_path  models\config.json --model_path OpenVINO\whisper-base-fp16-ov --model_name OpenVINO/whisper-base-fp16-ov

:::
:::{tab-item} Linux (using Docker)
:sync: Linux

docker run --rm -u $(id -u):$(id -g) --device /dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -v $PWD/models:/models openvino/model_server:weekly --pull --source_model OpenVINO/whisper-base-fp16-ov --model_repository_path /models --task speech2text --target_device GPU
docker run --rm -u $(id -u):$(id -g) -v $PWD/models:/models openvino/model_server:weekly --add_to_config --config_path /models/config.json --model_path OpenVINO/whisper-base-fp16-ov --model_name OpenVINO/whisper-base-fp16-ov

:::
::::
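When Open WebUI's TTS engine is set to OpenAI, it sends an OpenAI-style speech request to the server. The sketch below builds such a payload; the exact field set accepted by the /v3/audio/speech endpoint is an assumption based on the OpenAI speech API (the response body is binary audio that can be written straight to a file).

```python
import json

def build_speech_request(text, model="microsoft/speecht5_tts"):
    """Build an OpenAI-style text-to-speech request payload."""
    return {"model": model, "input": text}

print(json.dumps(build_speech_request("OpenVINO Model Server speaking.")))
```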

Step 2: Audio Settings

  1. Go to Admin Panel -> Settings -> Audio
  2. Select OpenAI for both engines
    • URL: http://localhost:8000/v3
    • Set Engine type to OpenAI
    • STT Model: OpenVINO/whisper-base-fp16-ov
    • TTS Model: microsoft/speecht5_tts
    • Put anything in API key
  3. Click Save

audio settings

Step 3: Use Voice Mode

  1. Click the Voice mode icon.
  2. Start talking.

voice mode

Reference

https://docs.openwebui.com/features/#%EF%B8%8F-audio-voice--accessibility