Session 4: Building Production Chat Applications with Chainlit

Overview

This session focuses on building production-ready chat applications using Chainlit and Microsoft Foundry Local. You'll learn to create modern web interfaces for AI conversations, implement streaming responses, and deploy robust chat applications with proper error handling and user experience design.

What You'll Build:

  • Chainlit Chat App: Modern web UI with streaming responses
  • WebGPU Demo: Browser-based inference for privacy-first applications
  • Open WebUI Integration: Professional chat interface with Foundry Local
  • Production Patterns: Error handling, monitoring, and deployment strategies

Learning Objectives

  • Build production-ready chat applications with Chainlit
  • Implement streaming responses for enhanced user experience
  • Master Foundry Local SDK integration patterns
  • Apply proper error handling and graceful degradation
  • Deploy and configure chat applications for different environments
  • Understand modern web UI patterns for conversational AI

Prerequisites

  • Foundry Local: Installed and running (Installation Guide)
  • Python: 3.10 or later with virtual environment capability
  • Model: At least one model loaded (foundry model run phi-4-mini)
  • Browser: Modern web browser with WebGPU support (Chrome/Edge)
  • Docker: For Open WebUI integration (optional)

Part 1: Understanding Modern Chat Applications

Architecture Overview

User Browser ←→ Chainlit UI ←→ Python Backend ←→ Foundry Local ←→ AI Model
      ↓              ↓              ↓              ↓            ↓
   Web UI      Event Handlers   OpenAI Client   HTTP API    Local GPU

Key Technologies

Foundry Local SDK Patterns:

  • FoundryLocalManager(alias): Automatic service management
  • manager.endpoint and manager.api_key: Connection details
  • manager.get_model_info(alias).id: Model identification

Chainlit Framework:

  • @cl.on_chat_start: Initialize chat sessions
  • @cl.on_message: Handle incoming user messages
  • cl.Message().stream_token(): Real-time streaming
  • Automatic UI generation and WebSocket management

Part 2: Local vs Cloud Decision Matrix

Performance Characteristics

| Aspect | Local (Foundry) | Cloud (Azure OpenAI) |
|---|---|---|
| Latency | 🚀 50-200ms (no network) | ⏱️ 200-2000ms (network dependent) |
| Privacy | 🔒 Data never leaves device | ⚠️ Data sent to cloud |
| Cost | 💰 Free after hardware | 💸 Pay per token |
| Offline | ✅ Works without internet | ❌ Requires internet |
| Model Size | ⚠️ Limited by hardware | ✅ Access to largest models |
| Scaling | ⚠️ Hardware dependent | ✅ Unlimited scaling |

Hybrid Strategy Patterns

Local-First with Fallback:

async def hybrid_completion(prompt: str, complexity_threshold: int = 100):
    if len(prompt.split()) < complexity_threshold:
        return await local_completion(prompt)  # Fast, private
    else:
        return await cloud_completion(prompt)   # Complex reasoning
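
The router above chooses a backend by prompt length, but it does not retry elsewhere when the local call fails. A minimal sketch of a true fallback follows, assuming `local_completion` and `cloud_completion` are coroutines as in the snippet above. Both are stubbed here purely for illustration, and `local_first` and `timeout_s` are hypothetical names.

```python
import asyncio


async def local_completion(prompt: str) -> str:
    # Hypothetical stub: stands in for a call to Foundry Local.
    if "fail" in prompt:
        raise ConnectionError("local service unavailable")
    return f"local: {prompt}"


async def cloud_completion(prompt: str) -> str:
    # Hypothetical stub: stands in for a call to a cloud endpoint.
    return f"cloud: {prompt}"


async def local_first(prompt: str, timeout_s: float = 5.0) -> str:
    """Try the local backend first; fall back to cloud on error or timeout."""
    try:
        return await asyncio.wait_for(local_completion(prompt), timeout=timeout_s)
    except (ConnectionError, asyncio.TimeoutError):
        return await cloud_completion(prompt)


if __name__ == "__main__":
    print(asyncio.run(local_first("hello")))
    print(asyncio.run(local_first("please fail")))
```

Note that a fallback of this kind sends the prompt off the device, so it may not be appropriate for privacy-sensitive prompts.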

Task-Based Routing:

async def smart_routing(prompt: str, task_type: str):
    routing_rules = {
        "code_generation": "local",     # Privacy-sensitive
        "creative_writing": "cloud",    # Benefits from larger models
        "data_analysis": "local",       # Fast iteration needed
        "research": "cloud"             # Requires broad knowledge
    }
    
    if routing_rules.get(task_type) == "local":
        return await foundry_completion(prompt)
    else:
        return await azure_completion(prompt)
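
In the snippet above, a task type that is missing from `routing_rules` falls through to the cloud branch. If privacy is the priority, an explicit default may be preferable. The sketch below separates the routing decision into a small pure function; `choose_backend` and its `default` parameter are hypothetical names, not part of the sample.

```python
ROUTING_RULES = {
    "code_generation": "local",   # Privacy-sensitive
    "creative_writing": "cloud",  # Benefits from larger models
    "data_analysis": "local",     # Fast iteration needed
    "research": "cloud",          # Requires broad knowledge
}


def choose_backend(task_type: str, default: str = "local") -> str:
    """Return "local" or "cloud" for a task type, using `default` when unknown."""
    return ROUTING_RULES.get(task_type, default)


if __name__ == "__main__":
    for task in ("code_generation", "research", "translation"):
        print(task, "->", choose_backend(task))
```

Keeping the decision separate from the completion call also makes the rules easy to unit test.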

Part 3: Sample 04 - Chainlit Chat Application

Quick Start

# Navigate to Module08 directory  
cd Module08

# Start your preferred model
foundry model run phi-4-mini

# Run the Chainlit application (avoiding port conflicts)
chainlit run samples\04\app.py -w --port 8080

The application automatically opens at http://localhost:8080 with a modern chat interface.

Core Implementation

The Sample 04 application demonstrates production-ready patterns:

Automatic Service Discovery:

import os

import chainlit as cl
from openai import OpenAI
from foundry_local import FoundryLocalManager

# Global variables for client and model
client = None
model_name = None

async def initialize_client():
    global client, model_name
    alias = os.environ.get("MODEL", "phi-4-mini")
    
    try:
        # Use FoundryLocalManager for proper service management
        manager = FoundryLocalManager(alias)
        model_info = manager.get_model_info(alias)
        
        client = OpenAI(
            base_url=manager.endpoint,
            api_key=manager.api_key or "not-required"
        )
        model_name = model_info.id if model_info else alias
        return True
    except Exception as e:
        # Fallback to manual configuration
        base_url = os.environ.get("BASE_URL", "http://localhost:51211")
        client = OpenAI(base_url=f"{base_url}/v1", api_key="not-required")
        model_name = alias
        return True

Streaming Chat Handler:

@cl.on_message
async def main(message: cl.Message):
    # Create streaming response
    msg = cl.Message(content="")
    await msg.send()
    
    stream = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": message.content}
        ],
        stream=True
    )
    
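    # Stream tokens in real-time. The synchronous client blocks the event loop
    # between chunks; consider AsyncOpenAI when serving concurrent users.
    for chunk in stream:
        # Some chunks carry no choices or no text, so guard before reading
        if chunk.choices and chunk.choices[0].delta.content:
            await msg.stream_token(chunk.choices[0].delta.content)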
    
    await msg.update()
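
The handler above reads `chunk.choices[0].delta.content` for each chunk. In OpenAI-style streams, some chunks may have `content` set to `None` (for example a role-only first chunk), and some servers may send chunks with an empty `choices` list. The sketch below isolates that filtering so it can be tested without a running model; `iter_tokens` and `fake_chunk` are hypothetical helpers, and `fake_chunk` only mimics the attribute layout of a real chunk.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_tokens(chunks: Iterable) -> Iterator[str]:
    """Yield non-empty text deltas from OpenAI-style streaming chunks."""
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # skip None and empty strings
            yield content


def fake_chunk(content):
    # Hypothetical stand-in that mimics the attribute layout of a streaming chunk.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    stream = [
        fake_chunk(None),
        fake_chunk("Hel"),
        SimpleNamespace(choices=[]),
        fake_chunk("lo"),
    ]
    print("".join(iter_tokens(stream)))
```

Inside a Chainlit handler, each yielded token would be passed to `msg.stream_token(...)`.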

Configuration Options

Environment Variables:

| Variable | Description | Default | Example |
|---|---|---|---|
| MODEL | Model alias to use | phi-4-mini | qwen2.5-7b |
| BASE_URL | Foundry Local endpoint | Auto-detected | http://localhost:51211 |
| API_KEY | API key (optional for local) | "" | your-api-key |
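
One way to read these variables in a single place is sketched below. `ChatSettings` and `load_settings` are hypothetical names, and treating an unset `BASE_URL` as `None` (meaning "let the SDK auto-detect the endpoint") is an assumption about how an application might model the "Auto-detected" default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ChatSettings:
    model: str
    base_url: Optional[str]  # None means "let the SDK auto-detect the endpoint"
    api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> ChatSettings:
    """Read MODEL, BASE_URL and API_KEY, applying the defaults listed above."""
    base_url = env.get("BASE_URL") or None
    if base_url:
        base_url = base_url.rstrip("/")  # avoid a double slash when appending /v1
    return ChatSettings(
        model=env.get("MODEL", "phi-4-mini"),
        base_url=base_url,
        api_key=env.get("API_KEY", ""),
    )


if __name__ == "__main__":
    print(load_settings({"MODEL": "qwen2.5-7b"}))
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.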

Advanced Usage:

# Use different model
set MODEL=qwen2.5-7b
chainlit run samples\04\app.py -w --port 8080

# Use different ports (avoid 51211 which is used by Foundry Local)
chainlit run samples\04\app.py -w --port 3000
chainlit run samples\04\app.py -w --port 5000

Part 4: Creating and Using Jupyter Notebooks

Overview of Notebook Support

Sample 04 includes a Jupyter notebook (chainlit_app.ipynb) that provides:

  • 📚 Educational Content: Step-by-step learning materials
  • 🔬 Interactive Exploration: Run and experiment with code cells
  • 📊 Visual Demonstrations: Charts, diagrams, and output visualization
  • 🛠️ Development Tools: Testing and debugging capabilities

Creating Your Own Notebooks

Step 1: Set Up Jupyter Environment

# Ensure you're in the Module08 directory
cd Module08

# Activate your virtual environment
.venv\Scripts\activate

# Install Jupyter and dependencies
pip install jupyter notebook jupyterlab ipykernel
pip install -r requirements.txt

# Register the kernel for VS Code
python -m ipykernel install --user --name=foundry-local --display-name="Foundry Local"

Step 2: Create a New Notebook

Using VS Code:

  1. Open VS Code in the Module08 directory
  2. Create a new file with .ipynb extension
  3. Select the "Foundry Local" kernel when prompted
  4. Start adding cells with your content

Using Jupyter Lab:

# Start Jupyter Lab
jupyter lab

# Navigate to samples/04/ and create new notebook
# Choose the "Foundry Local" kernel registered in Step 1 (or the default Python 3 kernel)

Notebook Structure Best Practices

Cell Organization

# Cell 1: Imports and Setup
import os
import sys
import chainlit as cl
from openai import OpenAI
from foundry_local import FoundryLocalManager

print("✅ Libraries imported successfully")

# Cell 2: Configuration and Client Setup
class FoundryClientManager:
    def __init__(self, model_name="phi-4-mini"):
        self.model_name = model_name
        self.client = None
        
    def initialize_client(self):
        # Client initialization logic
        pass

# Initialize and test
client_manager = FoundryClientManager()
result = client_manager.initialize_client()
print(f"Client initialized: {result}")

Interactive Examples and Exercises

Exercise 1: Client Configuration Testing

# Test different configuration methods
configurations = [
    {"method": "foundry_sdk", "model": "phi-4-mini"},
    {"method": "manual", "base_url": "http://localhost:51211", "model": "qwen2.5-7b"},
]

for config in configurations:
    print(f"\n🧪 Testing {config['method']} configuration...")
    # Implementation here
    result = test_configuration(config)
    print(f"Result: {'✅ Success' if result['status'] == 'ok' else '❌ Failed'}")
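
The loop above calls `test_configuration`, which the exercise leaves to the reader. A minimal sketch of one possible implementation follows, under the hypothetical name `check_configuration`. It only covers the "manual" method by requesting `/v1/models` from the configured endpoint; the SDK method would instead take its endpoint from `FoundryLocalManager` and is reported as failed here. The shape of the returned dictionary is an assumption that matches the `result['status'] == 'ok'` check above.

```python
import json
import urllib.error
import urllib.request


def check_configuration(config: dict, timeout_s: float = 2.0) -> dict:
    """Probe an OpenAI-compatible endpoint and report {"status": "ok" | "failed"}."""
    if config.get("method") != "manual":
        # The SDK path would obtain the endpoint from FoundryLocalManager instead.
        return {"status": "failed", "error": "only the manual method is sketched here"}
    url = f"{config['base_url'].rstrip('/')}/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            payload = json.load(resp)
        # OpenAI-compatible servers list models under the "data" key.
        return {"status": "ok", "models": [m["id"] for m in payload.get("data", [])]}
    except (urllib.error.URLError, OSError, ValueError, KeyError) as exc:
        return {"status": "failed", "error": str(exc)}


if __name__ == "__main__":
    cfg = {"method": "manual", "base_url": "http://localhost:51211", "model": "qwen2.5-7b"}
    print(check_configuration(cfg))
```

To use it in the exercise, assign `test_configuration = check_configuration` before running the loop.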

Exercise 2: Streaming Response Simulation

import asyncio

async def simulate_streaming_response(text, delay=0.1):
    """Simulate how streaming works in Chainlit."""
    print("🌊 Simulating streaming response...")
    
    for char in text:
        print(char, end='', flush=True)
        await asyncio.sleep(delay)
    
    print("\n✅ Streaming complete!")

# Test the simulation
sample_text = "This is how streaming responses work in Chainlit applications!"
await simulate_streaming_response(sample_text)

Part 5: WebGPU Browser Inference Demo

Overview

WebGPU enables running AI models directly in the browser for maximum privacy and zero-install experiences. This sample demonstrates ONNX Runtime Web with WebGPU execution.

Step 1: Check WebGPU Support

Browser Requirements:

  • Chrome/Edge 113+ with WebGPU enabled
  • Check: chrome://gpu → confirm "WebGPU" status
  • Programmatic check: if (!('gpu' in navigator)) { /* no WebGPU */ }

Step 2: Create WebGPU Demo

Create directory: samples/04/webgpu-demo/

index.html:

<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>WebGPU + ONNX Runtime Demo</title>
    <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.webgpu.min.js"></script>
    <style>
        body { font-family: system-ui, sans-serif; margin: 2rem; }
        pre { background: #f5f5f5; padding: 1rem; overflow: auto; }
        .status { padding: 1rem; background: #e3f2fd; border-radius: 4px; }
    </style>
</head>
<body>
    <h1>🚀 WebGPU + Foundry Local Integration</h1>
    <div id="status" class="status">Initializing...</div>
    <pre id="output"></pre>
    <script type="module" src="./main.js"></script>
</body>
</html>

main.js:

const statusEl = document.getElementById('status');
const outputEl = document.getElementById('output');

function log(msg) {
    outputEl.textContent += `${msg}\n`;
    console.log(msg);
}

(async () => {
    try {
        if (!('gpu' in navigator)) {
            statusEl.textContent = '❌ WebGPU not available';
            return;
        }
        
        statusEl.textContent = '🔍 WebGPU detected. Loading model...';
        
        // Use a small ONNX model for demo
        const modelUrl = 'https://huggingface.co/onnx/models/resolve/main/vision/classification/mnist-12/mnist-12.onnx';
        
        const session = await ort.InferenceSession.create(modelUrl, {
            executionProviders: ['webgpu']
        });
        
        log('✅ ONNX Runtime session created with WebGPU');
        log(`📊 Input names: ${session.inputNames.join(', ')}`);
        log(`📊 Output names: ${session.outputNames.join(', ')}`);
        
        // Create dummy input (MNIST expects 1x1x28x28)
        const inputData = new Float32Array(1 * 1 * 28 * 28).fill(0.1);
        const input = new ort.Tensor('float32', inputData, [1, 1, 28, 28]);
        
        const feeds = {};
        feeds[session.inputNames[0]] = input;
        
        const results = await session.run(feeds);
        const output = results[session.outputNames[0]];
        
        // Find prediction (argmax)
        let maxIdx = 0;
        for (let i = 1; i < output.data.length; i++) {
            if (output.data[i] > output.data[maxIdx]) maxIdx = i;
        }
        
        statusEl.textContent = '✅ WebGPU inference complete!';
        log(`🎯 Predicted class: ${maxIdx}`);
        log(`📈 Raw output scores: [${Array.from(output.data).map(x => x.toFixed(3)).join(', ')}]`);
        
    } catch (error) {
        statusEl.textContent = `❌ Error: ${error.message}`;
        log(`Error: ${error.message}`);
        console.error(error);
    }
})();

Step 3: Run the Demo

# Create demo directory
mkdir samples\04\webgpu-demo
cd samples\04\webgpu-demo

# Save HTML and JS files, then serve
python -m http.server 5173

# Open browser to http://localhost:5173

Part 6: Open WebUI Integration

Overview

Open WebUI provides a professional ChatGPT-like interface that connects to Foundry Local's OpenAI-compatible API.

Step 1: Prerequisites

# Verify Foundry Local is running
foundry service status

# Start a model
foundry model run phi-4-mini

# Confirm API endpoint is accessible
curl http://localhost:51211/v1/models

Step 2: Docker Setup (Recommended)

# Pull Open WebUI image
docker pull ghcr.io/open-webui/open-webui:main

# Run with Foundry Local connection
docker run -d --name open-webui -p 3000:8080 ^
  -e OPENAI_API_BASE_URL=http://host.docker.internal:51211/v1 ^
  -e OPENAI_API_KEY=foundry-local-key ^
  -v open-webui-data:/app/backend/data ^
  ghcr.io/open-webui/open-webui:main

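Note: host.docker.internal lets a Docker container reach services on the host machine. It is available by default with Docker Desktop on Windows and macOS; on Linux, add --add-host=host.docker.internal:host-gateway to the docker run command.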

Step 3: Configuration

  1. Open Browser: Navigate to http://localhost:3000
  2. Initial Setup: Create admin account
  3. Model Configuration:
    • Settings → Models → OpenAI API
    • Base URL: http://host.docker.internal:51211/v1
    • API Key: foundry-local-key (any value works)
  4. Test Connection: Models should appear in dropdown

Troubleshooting

Common Issues:

  1. Connection Refused:

    # Check Foundry Local status
    foundry service ps
    netstat -ano | findstr :51211
  2. Models Not Appearing:

    • Verify model is loaded: foundry model list
    • Check API response: curl http://localhost:51211/v1/models
    • Restart Open WebUI container

Part 7: Production Deployment Considerations

Environment Configuration

Development Setup:

# Development with auto-reload and debugging
chainlit run samples\04\app.py -w --port 8080 --debug

Production Deployment:

# Production: listen on all interfaces, without the -w file watcher
chainlit run samples\04\app.py --host 0.0.0.0 --port 8080 --no-cache

Common Port Issues and Solutions

Port 51211 Conflict Prevention:

# Check what's using Foundry Local port
netstat -ano | findstr :51211

# Use different port for Chainlit
chainlit run samples\04\app.py -w --port 8080
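
The `netstat` check above is Windows-specific. A cross-platform alternative can be sketched in Python with the standard library; `port_in_use` is a hypothetical helper that simply tries to open a TCP connection to the port.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (51211, 8080):
        state = "in use" if port_in_use(port) else "free"
        print(f"Port {port}: {state}")
```

Running this before `chainlit run` shows whether the chosen port is already taken.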

Performance Monitoring

Health Check Implementation:

@cl.on_chat_start
async def health_check():
    # Chainlit ignores the handler's return value, so report the result in the chat
    try:
        # Test model availability with a one-token request
        client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": "test"}],
            max_tokens=1
        )
        await cl.Message(content=f"✅ Model ready: {model_name}").send()
    except Exception as e:
        await cl.Message(content=f"❌ Model unavailable: {e}").send()
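
The health check above reports only whether the model responds. For basic performance monitoring, the same probe can also be timed. The sketch below wraps any zero-argument callable; `run_health_check` is a hypothetical helper, and the `latency_ms` field is an addition that the sample does not include.

```python
import time
from typing import Callable


def run_health_check(probe: Callable[[], object]) -> dict:
    """Run `probe` once and report status plus wall-clock latency in milliseconds."""
    start = time.perf_counter()
    try:
        probe()
    except Exception as exc:  # any failure marks the backend unhealthy
        return {"status": "unhealthy", "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    return {"status": "healthy", "latency_ms": round(latency_ms, 1)}


if __name__ == "__main__":
    # Stand-in probe; a real one would issue the one-token completion request.
    print(run_health_check(lambda: time.sleep(0.01)))
```

With the client from the sample, the probe could be `lambda: client.chat.completions.create(model=model_name, messages=[{"role": "user", "content": "test"}], max_tokens=1)`.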

Summary

Session 4 covered building production-ready Chainlit applications for conversational AI. You learned about:

  • Chainlit Framework: Modern UI and streaming support for chat applications
  • Foundry Local Integration: SDK usage and configuration patterns
  • WebGPU Inference: Browser-based AI for maximum privacy
  • Open WebUI Setup: Professional chat interface deployment
  • Production Patterns: Error handling, monitoring, and scaling

The Sample 04 application demonstrates best practices for building robust chat interfaces that leverage local AI models through Microsoft Foundry Local while providing excellent user experiences.

References