Session 4: Building Production Chat Applications with Chainlit

Overview

This session focuses on building production-ready chat applications using Chainlit and Microsoft Foundry Local. You'll learn to create modern web interfaces for AI conversations, implement streaming responses, and deploy robust chat applications with proper error handling and user experience design.

What You'll Build:

Chainlit Chat App: Modern web UI with streaming responses
WebGPU Demo: Browser-based inference for privacy-first applications
Open WebUI Integration: Professional chat interface with Foundry Local
Production Patterns: Error handling, monitoring, and deployment strategies

Learning Objectives

Build production-ready chat applications with Chainlit
Implement streaming responses for enhanced user experience
Master Foundry Local SDK integration patterns
Apply proper error handling and graceful degradation
Deploy and configure chat applications for different environments
Understand modern web UI patterns for conversational AI

Prerequisites

Foundry Local: Installed and running (Installation Guide)
Python: 3.10 or later with virtual environment capability
Model: At least one model loaded (foundry model run phi-4-mini)
Browser: Modern web browser with WebGPU support (Chrome/Edge)
Docker: For Open WebUI integration (optional)

Part 1: Understanding Modern Chat Applications

Architecture Overview

User Browser ←→ Chainlit UI ←→ Python Backend ←→ Foundry Local ←→ AI Model
      ↓              ↓              ↓              ↓            ↓
   Web UI      Event Handlers   OpenAI Client   HTTP API    Local GPU

Key Technologies

Foundry Local SDK Patterns:

FoundryLocalManager(alias): Automatic service management
manager.endpoint and manager.api_key: Connection details
manager.get_model_info(alias).id: Model identification

Chainlit Framework:

@cl.on_chat_start: Initialize chat sessions
@cl.on_message: Handle incoming user messages
cl.Message().stream_token(): Real-time streaming
Automatic UI generation and WebSocket management

Part 2: Local vs Cloud Decision Matrix

Performance Characteristics

Aspect	Local (Foundry)	Cloud (Azure OpenAI)
Latency	🚀 50-200ms (no network)	⏱️ 200-2000ms (network dependent)
Privacy	🔒 Data never leaves device	⚠️ Data sent to cloud
Cost	💰 Free after hardware	💸 Pay per token
Offline	✅ Works without internet	❌ Requires internet
Model Size	⚠️ Limited by hardware	✅ Access to largest models
Scaling	⚠️ Hardware dependent	✅ Unlimited scaling

Hybrid Strategy Patterns

Local-First with Fallback:

async def hybrid_completion(prompt: str, complexity_threshold: int = 100):
    if len(prompt.split()) < complexity_threshold:
        return await local_completion(prompt)  # Fast, private
    else:
        return await cloud_completion(prompt)   # Complex reasoning
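
The function above routes by prompt length. A fallback in the stricter sense, where the cloud is used only when the local call fails or times out, could be sketched as follows. This is a minimal sketch under assumptions: `local_completion` and `cloud_completion` are hypothetical stand-ins, and the `timeout_s` parameter is an illustrative addition.

```python
import asyncio

async def local_completion(prompt: str) -> str:
    # Hypothetical stand-in for a Foundry Local call; it fails here to exercise the fallback.
    raise ConnectionError("local service unavailable")

async def cloud_completion(prompt: str) -> str:
    # Hypothetical stand-in for a cloud call.
    return f"[cloud] {prompt}"

async def local_first(prompt: str, timeout_s: float = 5.0) -> str:
    """Try the local model first; fall back to the cloud on error or timeout."""
    try:
        return await asyncio.wait_for(local_completion(prompt), timeout=timeout_s)
    except (ConnectionError, asyncio.TimeoutError):
        return await cloud_completion(prompt)

print(asyncio.run(local_first("hello")))
```

Catching only connection errors and timeouts keeps genuine programming errors visible instead of silently sending traffic to the cloud.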

Task-Based Routing:

async def smart_routing(prompt: str, task_type: str):
    routing_rules = {
        "code_generation": "local",     # Privacy-sensitive
        "creative_writing": "cloud",    # Benefits from larger models
        "data_analysis": "local",       # Fast iteration needed
        "research": "cloud"             # Requires broad knowledge
    }
    
    if routing_rules.get(task_type) == "local":
        return await foundry_completion(prompt)
    else:
        return await azure_completion(prompt)
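
In `smart_routing` above, a task type that is missing from the table goes to the cloud, because `routing_rules.get` returns `None` for unknown keys. For a privacy-first application it may be preferable to make that default explicit. The sketch below assumes a hypothetical helper `pick_backend` with a `default` parameter; neither name comes from the sample.

```python
ROUTING_RULES = {
    "code_generation": "local",    # Privacy-sensitive
    "creative_writing": "cloud",   # Benefits from larger models
    "data_analysis": "local",      # Fast iteration needed
    "research": "cloud",           # Requires broad knowledge
}

def pick_backend(task_type: str, default: str = "local") -> str:
    """Return 'local' or 'cloud' for a task type, using `default` for unknown types."""
    return ROUTING_RULES.get(task_type, default)

print(pick_backend("research"), pick_backend("translation"))
```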

Part 3: Sample 04 - Chainlit Chat Application

Quick Start

# Navigate to Module08 directory  
cd Module08

# Start your preferred model
foundry model run phi-4-mini

# Run the Chainlit application (avoiding port conflicts)
chainlit run samples\04\app.py -w --port 8080

The application automatically opens at http://localhost:8080 with a modern chat interface.

Core Implementation

The Sample 04 application demonstrates production-ready patterns:

Automatic Service Discovery:

import os

import chainlit as cl
from openai import OpenAI
from foundry_local import FoundryLocalManager

# Global variables for client and model
client = None
model_name = None

async def initialize_client():
    global client, model_name
    alias = os.environ.get("MODEL", "phi-4-mini")
    
    try:
        # Use FoundryLocalManager for proper service management
        manager = FoundryLocalManager(alias)
        model_info = manager.get_model_info(alias)
        
        client = OpenAI(
            base_url=manager.endpoint,
            api_key=manager.api_key or "not-required"
        )
        model_name = model_info.id if model_info else alias
        return True
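    except Exception as e:
        # Fallback to manual configuration; log the reason so failures are not silent
        print(f"FoundryLocalManager unavailable ({e}); falling back to BASE_URL")
        base_url = os.environ.get("BASE_URL", "http://localhost:51211")
        client = OpenAI(base_url=f"{base_url}/v1", api_key="not-required")
        model_name = alias
        return True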

Streaming Chat Handler:

@cl.on_message
async def main(message: cl.Message):
    # Create streaming response
    msg = cl.Message(content="")
    await msg.send()
    
    stream = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": message.content}
        ],
        stream=True
    )
    
    # Stream tokens in real-time (skip chunks that carry no choices or no text)
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            await msg.stream_token(chunk.choices[0].delta.content)
    
    await msg.update()
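
A streamed response can include chunks that carry no text, for example a chunk with an empty `choices` list or a delta whose `content` is `None`. The sketch below isolates that filtering so it can be tested without a running model. `iter_text` and `fake_chunk` are hypothetical names, and `fake_chunk` only mimics the shape of the chunk objects used in the handler above.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

def iter_text(stream: Iterable) -> Iterator[str]:
    """Yield non-empty text deltas, skipping chunks with no choices or no content."""
    for chunk in stream:
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            yield text

def fake_chunk(text=None, empty=False):
    # Hypothetical test double that mimics the shape of a streamed chunk.
    if empty:
        return SimpleNamespace(choices=[])
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

chunks = [fake_chunk("Hel"), fake_chunk(None), fake_chunk(empty=True), fake_chunk("lo")]
print("".join(iter_text(chunks)))  # prints "Hello"
```

Inside a Chainlit handler, each yielded string would be passed to `msg.stream_token`.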

Configuration Options

Environment Variables:

Variable	Description	Default	Example
`MODEL`	Model alias to use	`phi-4-mini`	`qwen2.5-7b`
`BASE_URL`	Foundry Local endpoint	Auto-detected	`http://localhost:51211`
`API_KEY`	API key (optional for local)	`""`	`your-api-key`
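
One way to resolve the variables in the table is sketched below. It is an illustration under assumptions rather than the sample's actual code: `AppConfig`, `load_config`, and `normalize_base_url` are hypothetical names, and the `"not-required"` placeholder mirrors the value used in the client setup earlier. Normalizing the URL avoids ending up with `/v1/v1` when `BASE_URL` already includes the suffix.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class AppConfig:
    model: str
    base_url: Optional[str]  # None means "let FoundryLocalManager auto-detect"
    api_key: str

def normalize_base_url(url: str) -> str:
    """Strip trailing slashes and make sure the URL ends with /v1 exactly once."""
    url = url.rstrip("/")
    return url if url.endswith("/v1") else f"{url}/v1"

def load_config(env: Mapping[str, str] = os.environ) -> AppConfig:
    """Read MODEL, BASE_URL, and API_KEY, applying the defaults from the table."""
    raw_url = env.get("BASE_URL")
    return AppConfig(
        model=env.get("MODEL", "phi-4-mini"),
        base_url=normalize_base_url(raw_url) if raw_url else None,
        api_key=env.get("API_KEY") or "not-required",
    )

print(load_config({"BASE_URL": "http://localhost:51211/"}))
```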

Advanced Usage:

# Use different model
set MODEL=qwen2.5-7b
chainlit run samples\04\app.py -w --port 8080

# Use different ports (avoid 51211 which is used by Foundry Local)
chainlit run samples\04\app.py -w --port 3000
chainlit run samples\04\app.py -w --port 5000

Part 4: Creating and Using Jupyter Notebooks

Overview of Notebook Support

Sample 04 includes a Jupyter notebook (chainlit_app.ipynb) that provides:

📚 Educational Content: Step-by-step learning materials
🔬 Interactive Exploration: Run and experiment with code cells
📊 Visual Demonstrations: Charts, diagrams, and output visualization
🛠️ Development Tools: Testing and debugging capabilities

Creating Your Own Notebooks

Step 1: Set Up Jupyter Environment

# Ensure you're in the Module08 directory
cd Module08

# Activate your virtual environment
.venv\Scripts\activate

# Install Jupyter and dependencies
pip install jupyter notebook jupyterlab ipykernel
pip install -r requirements.txt

# Register the kernel for VS Code
python -m ipykernel install --user --name=foundry-local --display-name="Foundry Local"

Step 2: Create a New Notebook

Using VS Code:

Open VS Code in the Module08 directory
Create a new file with .ipynb extension
Select the "Foundry Local" kernel when prompted
Start adding cells with your content

Using Jupyter Lab:

# Start Jupyter Lab
jupyter lab

# Navigate to samples/04/ and create new notebook
# Choose the "Foundry Local" kernel registered in Step 1 (or Python 3 if you skipped that step)

Notebook Structure Best Practices

Cell Organization

# Cell 1: Imports and Setup
import os
import sys
import chainlit as cl
from openai import OpenAI
from foundry_local import FoundryLocalManager

print("✅ Libraries imported successfully")

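# Cell 2: Configuration and Client Setup
class FoundryClientManager:
    def __init__(self, model_name="phi-4-mini"):
        self.model_name = model_name  # Model alias; replaced by the resolved model ID on init
        self.client = None

    def initialize_client(self):
        """Attach to Foundry Local and build an OpenAI client; return True on success."""
        try:
            manager = FoundryLocalManager(self.model_name)
            self.client = OpenAI(
                base_url=manager.endpoint,
                api_key=manager.api_key or "not-required",
            )
            self.model_name = manager.get_model_info(self.model_name).id
            return True
        except Exception as e:
            print(f"❌ Initialization failed: {e}")
            return False

# Initialize and test
client_manager = FoundryClientManager()
result = client_manager.initialize_client()
print(f"Client initialized: {result}")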

Interactive Examples and Exercises

Exercise 1: Client Configuration Testing

# Test different configuration methods
configurations = [
    {"method": "foundry_sdk", "model": "phi-4-mini"},
    {"method": "manual", "base_url": "http://localhost:51211", "model": "qwen2.5-7b"},
]

for config in configurations:
    print(f"\n🧪 Testing {config['method']} configuration...")
    # test_configuration() is the helper you implement in this exercise;
    # it should return a dict such as {"status": "ok"}
    result = test_configuration(config)
    print(f"Result: {'✅ Success' if result['status'] == 'ok' else '❌ Failed'}")

Exercise 2: Streaming Response Simulation

import asyncio

async def simulate_streaming_response(text, delay=0.1):
    """Simulate how streaming works in Chainlit."""
    print("🌊 Simulating streaming response...")
    
    for char in text:
        print(char, end='', flush=True)
        await asyncio.sleep(delay)
    
    print("\n✅ Streaming complete!")

# Test the simulation
sample_text = "This is how streaming responses work in Chainlit applications!"
await simulate_streaming_response(sample_text)

Part 5: WebGPU Browser Inference Demo

Overview

WebGPU enables running AI models directly in the browser for maximum privacy and zero-install experiences. This sample demonstrates ONNX Runtime Web with WebGPU execution.

Step 1: Check WebGPU Support

Browser Requirements:

Chrome/Edge 113+ with WebGPU enabled
Check: chrome://gpu → confirm "WebGPU" status
Programmatic check: if (!('gpu' in navigator)) { /* no WebGPU */ }

Step 2: Create WebGPU Demo

Create directory: samples/04/webgpu-demo/

index.html:

<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>WebGPU + ONNX Runtime Demo</title>
    <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.webgpu.min.js"></script>
    <style>
        body { font-family: system-ui, sans-serif; margin: 2rem; }
        pre { background: #f5f5f5; padding: 1rem; overflow: auto; }
        .status { padding: 1rem; background: #e3f2fd; border-radius: 4px; }
    </style>
</head>
<body>
    <h1>🚀 WebGPU + Foundry Local Integration</h1>
    <div id="status" class="status">Initializing...</div>
    <pre id="output"></pre>
    <script type="module" src="./main.js"></script>
</body>
</html>

main.js:

const statusEl = document.getElementById('status');
const outputEl = document.getElementById('output');

function log(msg) {
    outputEl.textContent += `${msg}\n`;
    console.log(msg);
}

(async () => {
    try {
        if (!('gpu' in navigator)) {
            statusEl.textContent = '❌ WebGPU not available';
            return;
        }
        
        statusEl.textContent = '🔍 WebGPU detected. Loading model...';
        
        // Use a small ONNX model for demo
        const modelUrl = 'https://huggingface.co/onnx/models/resolve/main/vision/classification/mnist-12/mnist-12.onnx';
        
        const session = await ort.InferenceSession.create(modelUrl, {
            executionProviders: ['webgpu']
        });
        
        log('✅ ONNX Runtime session created with WebGPU');
        log(`📊 Input names: ${session.inputNames.join(', ')}`);
        log(`📊 Output names: ${session.outputNames.join(', ')}`);
        
        // Create dummy input (MNIST expects 1x1x28x28)
        const inputData = new Float32Array(1 * 1 * 28 * 28).fill(0.1);
        const input = new ort.Tensor('float32', inputData, [1, 1, 28, 28]);
        
        const feeds = {};
        feeds[session.inputNames[0]] = input;
        
        const results = await session.run(feeds);
        const output = results[session.outputNames[0]];
        
        // Find prediction (argmax)
        let maxIdx = 0;
        for (let i = 1; i < output.data.length; i++) {
            if (output.data[i] > output.data[maxIdx]) maxIdx = i;
        }
        
        statusEl.textContent = '✅ WebGPU inference complete!';
        log(`🎯 Predicted class: ${maxIdx}`);
        log(`📈 Confidence scores: [${Array.from(output.data).map(x => x.toFixed(3)).join(', ')}]`);
        
    } catch (error) {
        statusEl.textContent = `❌ Error: ${error.message}`;
        log(`Error: ${error.message}`);
        console.error(error);
    }
})();
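
The demo labels the raw output values "Confidence scores". If the model emits unnormalized scores (logits) rather than probabilities, which is worth checking for whichever model you load, a softmax step turns them into values that sum to 1 without changing the predicted class. The idea is sketched here in Python for consistency with the rest of the session; the input values are made up and are not real model output.

```python
import math
from typing import Sequence

def softmax(scores: Sequence[float]) -> list[float]:
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties), like the loop in main.js."""
    return max(range(len(values)), key=values.__getitem__)

logits = [1.0, 3.0, 0.5]  # made-up scores for illustration
probs = softmax(logits)
print(argmax(probs), [round(p, 3) for p in probs])
```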

Step 3: Run the Demo

# Create demo directory
mkdir samples\04\webgpu-demo
cd samples\04\webgpu-demo

# Save HTML and JS files, then serve
python -m http.server 5173

# Open browser to http://localhost:5173

Part 6: Open WebUI Integration

Overview

Open WebUI provides a professional ChatGPT-like interface that connects to Foundry Local's OpenAI-compatible API.

Step 1: Prerequisites

# Verify Foundry Local is running
foundry service status

# Start a model
foundry model run phi-4-mini

# Confirm API endpoint is accessible
curl http://localhost:51211/v1/models
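
The same endpoint check can be scripted. The sketch below assumes the endpoint returns the OpenAI-style list format (`{"data": [{"id": ...}]}`) and uses the port shown above; `parse_model_ids` and `fetch_model_ids` are hypothetical helper names, and the sample payload ID is invented.

```python
import json
import urllib.request

def parse_model_ids(payload: str) -> list[str]:
    """Extract model IDs from an OpenAI-style list response."""
    body = json.loads(payload)
    return [item["id"] for item in body.get("data", []) if "id" in item]

def fetch_model_ids(base_url: str = "http://localhost:51211/v1",
                    timeout_s: float = 3.0) -> list[str]:
    """GET {base_url}/models and return the model IDs; raises URLError if unreachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))

# Sample payload in the assumed list format; the ID is made up for illustration.
sample = '{"object": "list", "data": [{"id": "example-model", "object": "model"}]}'
print(parse_model_ids(sample))
```

Calling `fetch_model_ids()` with Foundry Local running should return the IDs that Open WebUI will later show in its model dropdown.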

Step 2: Docker Setup (Recommended)

# Pull Open WebUI image
docker pull ghcr.io/open-webui/open-webui:main

# Run with Foundry Local connection
docker run -d --name open-webui -p 3000:8080 ^
  -e OPENAI_API_BASE_URL=http://host.docker.internal:51211/v1 ^
  -e OPENAI_API_KEY=foundry-local-key ^
  -v open-webui-data:/app/backend/data ^
  ghcr.io/open-webui/open-webui:main

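Note: host.docker.internal resolves to the host machine from inside a container on Docker Desktop (Windows and macOS). On Linux, add --add-host=host.docker.internal:host-gateway to the docker run command.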

Step 3: Configuration

Open Browser: Navigate to http://localhost:3000
Initial Setup: Create admin account
Model Configuration:
- Settings → Models → OpenAI API
- Base URL: http://host.docker.internal:51211/v1
- API Key: foundry-local-key (any value works)
Test Connection: Models should appear in dropdown

Troubleshooting

Common Issues:

Connection Refused:

# Check Foundry Local status
foundry service ps
netstat -ano | findstr :51211

Models Not Appearing:
- Verify model is loaded: foundry service ps (foundry model list shows the catalog of available models, not what is loaded)
- Check API response: curl http://localhost:51211/v1/models
- Restart Open WebUI container

Part 7: Production Deployment Considerations

Environment Configuration

Development Setup:

# Development with auto-reload and debugging
chainlit run samples\04\app.py -w --port 8080 --debug

Production Deployment:

# Production: listen on all interfaces, no file watching, third-party cache disabled
chainlit run samples\04\app.py --host 0.0.0.0 --port 8080 --no-cache

Common Port Issues and Solutions

Port 51211 Conflict Prevention:

# Check what's using Foundry Local port
netstat -ano | findstr :51211

# Use different port for Chainlit
chainlit run samples\04\app.py -w --port 8080
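
The port check can also be done from Python before launching Chainlit. This is a minimal sketch using only the standard library; `is_port_in_use` and `first_free_port` are hypothetical helpers, and the candidate ports are the ones used elsewhere in this session.

```python
import socket

def is_port_in_use(port: int, host: str = "127.0.0.1", timeout_s: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        return sock.connect_ex((host, port)) == 0

def first_free_port(candidates=(8080, 3000, 5000)) -> int:
    """Pick the first candidate port with no listener; raise if all are taken."""
    for port in candidates:
        if not is_port_in_use(port):
            return port
    raise RuntimeError("no free port among candidates")
```

The result of `first_free_port()` can be passed to `chainlit run` via `--port`.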

Performance Monitoring

Health Check Implementation:

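@cl.on_chat_start
async def health_check():
    try:
        # Test model availability with a minimal one-token request
        client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": "test"}],
            max_tokens=1
        )
        status = {"status": "healthy", "model": model_name}
    except Exception as e:
        status = {"status": "unhealthy", "error": str(e)}
        # Chainlit does not display the handler's return value, so surface failures in the UI
        await cl.Message(content=f"⚠️ Model unavailable: {e}").send()
    return status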
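
For monitoring, it can help to record how long the probe takes as well as whether it succeeds. The sketch below wraps any zero-argument callable and is independent of Chainlit; `timed_health_check` and the `latency_ms` field are hypothetical additions, not part of the sample.

```python
import time
from typing import Callable

def timed_health_check(probe: Callable[[], object], model: str) -> dict:
    """Run `probe` once and report status plus wall-clock latency in milliseconds."""
    start = time.perf_counter()
    try:
        probe()
        status, error = "healthy", None
    except Exception as exc:  # report any failure rather than crash the check
        status, error = "unhealthy", str(exc)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"status": status, "model": model, "latency_ms": latency_ms, "error": error}

print(timed_health_check(lambda: None, "phi-4-mini")["status"])
```

In the application, the probe would be a lambda that issues the one-token request shown above, with `model_name` passed as the second argument.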

Summary

Session 4 covered building production-ready Chainlit applications for conversational AI. You learned about:

✅ Chainlit Framework: Modern UI and streaming support for chat applications
✅ Foundry Local Integration: SDK usage and configuration patterns
✅ WebGPU Inference: Browser-based AI for maximum privacy
✅ Open WebUI Setup: Professional chat interface deployment
✅ Production Patterns: Error handling, monitoring, and scaling

The Sample 04 application demonstrates best practices for building robust chat interfaces that leverage local AI models through Microsoft Foundry Local while providing excellent user experiences.

References

Sample 04: Chainlit Application: Complete application with documentation
Chainlit Educational Notebook: Interactive learning materials
Foundry Local Documentation: Complete platform documentation
Chainlit Documentation: Official framework documentation
Open WebUI Integration Guide: Official tutorial
