This session focuses on building production-ready chat applications using Chainlit and Microsoft Foundry Local. You'll learn to create modern web interfaces for AI conversations, implement streaming responses, and deploy robust chat applications with proper error handling and user experience design.
What You'll Build:
- Chainlit Chat App: Modern web UI with streaming responses
- WebGPU Demo: Browser-based inference for privacy-first applications
- Open WebUI Integration: Professional chat interface with Foundry Local
- Production Patterns: Error handling, monitoring, and deployment strategies
Learning Objectives:
- Build production-ready chat applications with Chainlit
- Implement streaming responses for enhanced user experience
- Master Foundry Local SDK integration patterns
- Apply proper error handling and graceful degradation
- Deploy and configure chat applications for different environments
- Understand modern web UI patterns for conversational AI
Prerequisites:
- Foundry Local: Installed and running (Installation Guide)
- Python: 3.10 or later with virtual environment capability
- Model: At least one model loaded (foundry model run phi-4-mini)
- Browser: Modern web browser with WebGPU support (Chrome/Edge)
- Docker: For Open WebUI integration (optional)
User Browser ←→ Chainlit UI ←→ Python Backend ←→ Foundry Local ←→ AI Model
↓ ↓ ↓ ↓ ↓
Web UI Event Handlers OpenAI Client HTTP API Local GPU
Foundry Local SDK Patterns:
- FoundryLocalManager(alias): Automatic service management
- manager.endpoint and manager.api_key: Connection details
- manager.get_model_info(alias).id: Model identification
Chainlit Framework:
- @cl.on_chat_start: Initialize chat sessions
- @cl.on_message: Handle incoming user messages
- cl.Message().stream_token(): Real-time streaming
- Automatic UI generation and WebSocket management
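The streaming pattern can be tried out without Chainlit or a running model. The sketch below uses only the standard library. `fake_token_stream` and `collect_stream` are hypothetical names, and the generator merely stands in for the chunks a streaming chat completion would yield, including the empty deltas a handler should skip.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional


async def fake_token_stream(tokens: List[Optional[str]]) -> AsyncIterator[Optional[str]]:
    """Hypothetical stand-in for a streaming completion: yields tokens one by one."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield token


async def collect_stream(stream: AsyncIterator[Optional[str]],
                         on_token: Callable[[str], None]) -> str:
    """Forward each non-empty token to on_token and return the full text."""
    parts = []
    async for token in stream:
        if not token:  # skip None or empty deltas (for example role-only chunks)
            continue
        on_token(token)
        parts.append(token)
    return "".join(parts)


def demo() -> str:
    """Run the sketch once and return the assembled reply."""
    shown: List[str] = []
    stream = fake_token_stream(["Hel", None, "lo", "", "!"])
    text = asyncio.run(collect_stream(stream, shown.append))
    assert shown == ["Hel", "lo", "!"]
    return text
```

In a Chainlit handler, the role of `on_token` would be played by awaiting `msg.stream_token(...)`; the synchronous callback here only keeps the sketch free of dependencies.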
| Aspect | Local (Foundry) | Cloud (Azure OpenAI) |
|---|---|---|
| Latency | 🚀 50-200ms (no network) | ⏱️ 200-2000ms (network dependent) |
| Privacy | 🔒 Data never leaves device | ⚠️ Data is sent to a cloud service |
| Cost | 💰 Free after hardware | 💸 Pay per token |
| Offline | ✅ Works without internet | ❌ Requires internet |
| Model Size | ⚠️ Limited by local hardware | ✅ Access to largest models |
| Scaling | ⚠️ Limited by local hardware | ✅ Scales on demand |
Local-First with Fallback:
async def hybrid_completion(prompt: str, complexity_threshold: int = 100):
    if len(prompt.split()) < complexity_threshold:
        return await local_completion(prompt)  # Fast, private
    else:
        return await cloud_completion(prompt)  # Complex reasoning

Task-Based Routing:
async def smart_routing(prompt: str, task_type: str):
    routing_rules = {
        "code_generation": "local",   # Privacy-sensitive
        "creative_writing": "cloud",  # Benefits from larger models
        "data_analysis": "local",     # Fast iteration needed
        "research": "cloud"           # Requires broad knowledge
    }
    if routing_rules.get(task_type) == "local":
        return await foundry_completion(prompt)
    else:
        return await azure_completion(prompt)

# Navigate to Module08 directory
cd Module08
# Start your preferred model
foundry model run phi-4-mini
# Run the Chainlit application (avoiding port conflicts)
chainlit run samples\04\app.py -w --port 8080

The application automatically opens at http://localhost:8080 with a modern chat interface.
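Because the run command picks a port explicitly to avoid conflicts, it can help to check that a port is free before launching. The following is a minimal sketch using only the standard library. `is_port_free` and `pick_port` are hypothetical helper names, not part of Chainlit or Foundry Local.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when a listener accepts the connection
        return sock.connect_ex((host, port)) != 0


def pick_port(candidates, host: str = "127.0.0.1"):
    """Return the first free port from candidates, or None if all are in use."""
    for port in candidates:
        if is_port_free(port, host):
            return port
    return None
```

For example, `pick_port([8080, 3000, 5000])` returns the first of those ports that is not in use, and the result can be passed to `--port`.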
The Sample 04 application demonstrates production-ready patterns:
Automatic Service Discovery:
import os

import chainlit as cl
from openai import OpenAI
from foundry_local import FoundryLocalManager

# Global variables for client and model
client = None
model_name = None

async def initialize_client():
    global client, model_name
    alias = os.environ.get("MODEL", "phi-4-mini")
    try:
        # Use FoundryLocalManager for proper service management
        manager = FoundryLocalManager(alias)
        model_info = manager.get_model_info(alias)
        client = OpenAI(
            base_url=manager.endpoint,
            api_key=manager.api_key or "not-required"
        )
        model_name = model_info.id if model_info else alias
        return True
    except Exception as e:
        # Fallback to manual configuration
        print(f"Foundry SDK setup failed ({e}); using manual configuration")
        base_url = os.environ.get("BASE_URL", "http://localhost:51211")
        client = OpenAI(base_url=f"{base_url}/v1", api_key="not-required")
        model_name = alias
        return True

Streaming Chat Handler:
@cl.on_message
async def main(message: cl.Message):
    # Create streaming response
    msg = cl.Message(content="")
    await msg.send()
    stream = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": message.content}
        ],
        stream=True
    )
    # Stream tokens in real-time, skipping chunks without content
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            await msg.stream_token(chunk.choices[0].delta.content)
    await msg.update()

Environment Variables:
| Variable | Description | Default | Example |
|---|---|---|---|
| MODEL | Model alias to use | phi-4-mini | qwen2.5-7b |
| BASE_URL | Foundry Local endpoint | Auto-detected | http://localhost:51211 |
| API_KEY | API key (optional for local) | "" | your-api-key |
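The variables in this table can be collected into a single configuration object at startup. The sketch below shows one way to do that with the defaults listed above. `AppConfig` and `load_config` are hypothetical names, and treating an unset `BASE_URL` as `None` (meaning "let the SDK auto-detect the endpoint") is an assumption, not the sample's documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppConfig:
    model: str
    base_url: Optional[str]  # None means "let the SDK auto-detect the endpoint"
    api_key: str


def load_config(env: Mapping[str, str] = os.environ) -> AppConfig:
    """Read MODEL, BASE_URL and API_KEY, applying the defaults from the table."""
    # Normalize the URL: drop surrounding whitespace and any trailing slash
    base_url = env.get("BASE_URL", "").strip().rstrip("/") or None
    return AppConfig(
        model=env.get("MODEL", "").strip() or "phi-4-mini",
        base_url=base_url,
        api_key=env.get("API_KEY", ""),
    )
```

Passing the environment as a parameter keeps the function easy to test: `load_config({"MODEL": "qwen2.5-7b"})` works without touching the real process environment.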
Advanced Usage:
# Use different model
set MODEL=qwen2.5-7b
chainlit run samples\04\app.py -w --port 8080
# Use different ports (avoid 51211 which is used by Foundry Local)
chainlit run samples\04\app.py -w --port 3000
chainlit run samples\04\app.py -w --port 5000

Sample 04 includes a comprehensive Jupyter notebook (chainlit_app.ipynb) that provides:
- 📚 Educational Content: Step-by-step learning materials
- 🔬 Interactive Exploration: Run and experiment with code cells
- 📊 Visual Demonstrations: Charts, diagrams, and output visualization
- 🛠️ Development Tools: Testing and debugging capabilities
# Ensure you're in the Module08 directory
cd Module08
# Activate your virtual environment
.venv\Scripts\activate
# Install Jupyter and dependencies
pip install jupyter notebook jupyterlab ipykernel
pip install -r requirements.txt
# Register the kernel for VS Code
python -m ipykernel install --user --name=foundry-local --display-name="Foundry Local"

Using VS Code:
- Open VS Code in the Module08 directory
- Create a new file with the .ipynb extension
- Select the "Foundry Local" kernel when prompted
- Start adding cells with your content
Using Jupyter Lab:
# Start Jupyter Lab
jupyter lab
# Navigate to samples/04/ and create new notebook
# Choose Python 3 kernel

# Cell 1: Imports and Setup
import os
import sys
import chainlit as cl
from openai import OpenAI
from foundry_local import FoundryLocalManager
print("✅ Libraries imported successfully")

# Cell 2: Configuration and Client Setup
class FoundryClientManager:
    def __init__(self, model_name="phi-4-mini"):
        self.model_name = model_name
        self.client = None

    def initialize_client(self):
        # Client initialization logic
        pass

# Initialize and test
client_manager = FoundryClientManager()
result = client_manager.initialize_client()
print(f"Client initialized: {result}")

# Test different configuration methods
configurations = [
    {"method": "foundry_sdk", "model": "phi-4-mini"},
    {"method": "manual", "base_url": "http://localhost:51211", "model": "qwen2.5-7b"},
]
for config in configurations:
    print(f"\n🧪 Testing {config['method']} configuration...")
    # Implementation here
    result = test_configuration(config)
    print(f"Result: {'✅ Success' if result['status'] == 'ok' else '❌ Failed'}")

import asyncio

async def simulate_streaming_response(text, delay=0.1):
    """Simulate how streaming works in Chainlit."""
    print("🌊 Simulating streaming response...")
    for char in text:
        print(char, end='', flush=True)
        await asyncio.sleep(delay)
    print("\n✅ Streaming complete!")

# Test the simulation
sample_text = "This is how streaming responses work in Chainlit applications!"
await simulate_streaming_response(sample_text)

WebGPU enables running AI models directly in the browser for maximum privacy and zero-install experiences. This sample demonstrates ONNX Runtime Web with WebGPU execution.
Browser Requirements:
- Chrome/Edge 113+ with WebGPU enabled
- Check: chrome://gpu → confirm "WebGPU" status
- Programmatic check: if (!('gpu' in navigator)) { /* no WebGPU */ }
Create directory: samples/04/webgpu-demo/
index.html:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>WebGPU + ONNX Runtime Demo</title>
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.webgpu.min.js"></script>
<style>
body { font-family: system-ui, sans-serif; margin: 2rem; }
pre { background: #f5f5f5; padding: 1rem; overflow: auto; }
.status { padding: 1rem; background: #e3f2fd; border-radius: 4px; }
</style>
</head>
<body>
<h1>🚀 WebGPU + Foundry Local Integration</h1>
<div id="status" class="status">Initializing...</div>
<pre id="output"></pre>
<script type="module" src="./main.js"></script>
</body>
</html>

main.js:
const statusEl = document.getElementById('status');
const outputEl = document.getElementById('output');

function log(msg) {
  outputEl.textContent += `${msg}\n`;
  console.log(msg);
}

(async () => {
  try {
    if (!('gpu' in navigator)) {
      statusEl.textContent = '❌ WebGPU not available';
      return;
    }
    statusEl.textContent = '🔍 WebGPU detected. Loading model...';

    // Use a small ONNX model for demo
    const modelUrl = 'https://huggingface.co/onnx/models/resolve/main/vision/classification/mnist-12/mnist-12.onnx';
    const session = await ort.InferenceSession.create(modelUrl, {
      executionProviders: ['webgpu']
    });
    log('✅ ONNX Runtime session created with WebGPU');
    log(`📊 Input names: ${session.inputNames.join(', ')}`);
    log(`📊 Output names: ${session.outputNames.join(', ')}`);

    // Create dummy input (MNIST expects 1x1x28x28)
    const inputData = new Float32Array(1 * 1 * 28 * 28).fill(0.1);
    const input = new ort.Tensor('float32', inputData, [1, 1, 28, 28]);
    const feeds = {};
    feeds[session.inputNames[0]] = input;
    const results = await session.run(feeds);
    const output = results[session.outputNames[0]];

    // Find prediction (argmax)
    let maxIdx = 0;
    for (let i = 1; i < output.data.length; i++) {
      if (output.data[i] > output.data[maxIdx]) maxIdx = i;
    }
    statusEl.textContent = '✅ WebGPU inference complete!';
    log(`🎯 Predicted class: ${maxIdx}`);
    log(`📈 Confidence scores: [${Array.from(output.data).map(x => x.toFixed(3)).join(', ')}]`);
  } catch (error) {
    statusEl.textContent = `❌ Error: ${error.message}`;
    log(`Error: ${error.message}`);
    console.error(error);
  }
})();

# Create demo directory
mkdir samples\04\webgpu-demo
cd samples\04\webgpu-demo
# Save HTML and JS files, then serve
python -m http.server 5173
# Open browser to http://localhost:5173

Open WebUI provides a professional ChatGPT-like interface that connects to Foundry Local's OpenAI-compatible API.
# Verify Foundry Local is running
foundry service status
# Start a model
foundry model run phi-4-mini
# Confirm API endpoint is accessible
curl http://localhost:51211/v1/models

# Pull Open WebUI image
docker pull ghcr.io/open-webui/open-webui:main
# Run with Foundry Local connection
docker run -d --name open-webui -p 3000:8080 ^
-e OPENAI_API_BASE_URL=http://host.docker.internal:51211/v1 ^
-e OPENAI_API_KEY=foundry-local-key ^
-v open-webui-data:/app/backend/data ^
ghcr.io/open-webui/open-webui:main

Note: host.docker.internal lets Docker containers reach services on the host machine (available with Docker Desktop on Windows and macOS).
- Open Browser: Navigate to http://localhost:3000
- Initial Setup: Create admin account
- Model Configuration:
  - Settings → Models → OpenAI API
  - Base URL: http://host.docker.internal:51211/v1
  - API Key: foundry-local-key (any value works)
- Test Connection: Models should appear in dropdown
Common Issues:
- Connection Refused:
  # Check Foundry Local status
  foundry service ps
  netstat -ano | findstr :51211
- Models Not Appearing:
  - Verify model is loaded: foundry model list
  - Check API response: curl http://localhost:51211/v1/models
  - Restart Open WebUI container
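The "Check API response" step can also be scripted. The sketch below assumes the endpoint returns the OpenAI-style list shape (`{"data": [{"id": ...}, ...]}`) and uses the port shown throughout this session. `parse_model_ids` and `fetch_model_ids` are hypothetical helper names, and the model ids in the example payload are illustrative.

```python
import json
import urllib.request
from typing import List


def parse_model_ids(payload: str) -> List[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON response."""
    body = json.loads(payload)
    return [item["id"] for item in body.get("data", []) if "id" in item]


def fetch_model_ids(base_url: str = "http://localhost:51211/v1",
                    timeout_s: float = 3.0) -> List[str]:
    """Query the models endpoint; raises URLError/OSError if the service is unreachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))
```

An empty list from `fetch_model_ids()` suggests the service is up but no model is being served, while an exception points to the "Connection Refused" case.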
Development Setup:
# Development with auto-reload and debugging
chainlit run samples\04\app.py -w --port 8080 --debug

Production Deployment:
# Production mode with optimizations
chainlit run samples\04\app.py --host 0.0.0.0 --port 8080 --no-cache

Port 51211 Conflict Prevention:
# Check what's using Foundry Local port
netstat -ano | findstr :51211
# Use different port for Chainlit
chainlit run samples\04\app.py -w --port 8080

Health Check Implementation:
@cl.on_chat_start
async def health_check():
    try:
        # Test model availability
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": "test"}],
            max_tokens=1
        )
        return {"status": "healthy", "model": model_name}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Session 4 covered building production-ready Chainlit applications for conversational AI. You learned about:
- ✅ Chainlit Framework: Modern UI and streaming support for chat applications
- ✅ Foundry Local Integration: SDK usage and configuration patterns
- ✅ WebGPU Inference: Browser-based AI for maximum privacy
- ✅ Open WebUI Setup: Professional chat interface deployment
- ✅ Production Patterns: Error handling, monitoring, and scaling
The Sample 04 application demonstrates best practices for building robust chat interfaces that leverage local AI models through Microsoft Foundry Local while providing excellent user experiences.
- Sample 04: Chainlit Application: Complete application with documentation
- Chainlit Educational Notebook: Interactive learning materials
- Foundry Local Documentation: Complete platform documentation
- Chainlit Documentation: Official framework documentation
- Open WebUI Integration Guide: Official tutorial