Skip to content

Latest commit

 

History

History
448 lines (353 loc) · 11.4 KB

File metadata and controls

448 lines (353 loc) · 11.4 KB
layout default
title Chapter 5: LLM Integration & Configuration
parent RAGFlow Tutorial
nav_order 5

Chapter 5: LLM Integration & Configuration

Welcome to Chapter 5: LLM Integration & Configuration. In this part of RAGFlow Tutorial: Complete Guide to Open-Source RAG Engine, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

Connect RAGFlow with various Large Language Models for intelligent question answering.

🎯 Overview

This chapter covers how to integrate different Large Language Models (LLMs) with RAGFlow to power your RAG applications. You'll learn to configure various LLM providers and optimize their performance for document-based question answering.

🤖 Supported LLM Providers

RAGFlow supports a wide range of LLM providers for different use cases and deployment scenarios:

Cloud Providers

  • OpenAI - GPT-4, GPT-3.5-turbo
  • Anthropic - Claude 3, Claude 2
  • Google - Gemini 1.5, Gemini 1.0
  • Azure OpenAI - Enterprise-grade deployments
  • AWS Bedrock - Amazon's LLM service

Local & Self-Hosted

  • Ollama - Local model inference
  • LM Studio - Local model management
  • Hugging Face - Direct model integration
  • vLLM - High-throughput inference
  • LocalAI - Unified local AI API

Specialized Providers

  • Together AI - Optimized inference
  • Replicate - Model marketplace
  • Fireworks AI - Fast inference
  • DeepInfra - Cost-effective models

🔧 Configuration Steps

Step 1: Access LLM Settings

  1. Log into RAGFlow web interface
  2. Navigate to System Settings > Model Providers
  3. Click Add Provider to configure a new LLM

Step 2: Configure API Keys

# Set environment variables for different providers
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export GOOGLE_API_KEY="your-google-key"

Step 3: Provider-Specific Setup

OpenAI Configuration

{
  "provider": "OpenAI",
  "model": "gpt-4o",
  "api_key": "sk-...",
  "temperature": 0.1,
  "max_tokens": 2000,
  "top_p": 0.9
}

Anthropic Configuration

{
  "provider": "Anthropic",
  "model": "claude-3-5-sonnet-20241022",
  "api_key": "sk-ant-...",
  "temperature": 0.1,
  "max_tokens": 4000,
  "system_prompt": "You are a helpful assistant that answers questions based on provided context."
}

Local Ollama Setup

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
ollama pull llama3.1:8b

# Start Ollama service
ollama serve

Then configure in RAGFlow:

{
  "provider": "Ollama",
  "model": "llama3.1:8b",
  "base_url": "http://localhost:11434",
  "temperature": 0.1,
  "num_ctx": 4096
}

🎛️ Advanced LLM Configuration

Temperature & Sampling

{
  "temperature": 0.1,        // Lower = more deterministic
  "top_p": 0.9,              // Nucleus sampling
  "top_k": 40,               // Top-k sampling
  "repetition_penalty": 1.1, // Reduce repetition
  "max_tokens": 2000         // Response length limit
}

Context Window Management

{
  "max_context_length": 8192,    // Maximum context tokens
  "overlap_size": 200,           // Chunk overlap for retrieval
  "compression_ratio": 0.7,      // Context compression
  "hierarchical_retrieval": true // Multi-level retrieval
}

🔄 Model Switching & Fallbacks

Primary-Secondary Model Setup

{
  "models": [
    {
      "name": "gpt-4o",
      "provider": "OpenAI",
      "priority": 1,
      "fallback": false
    },
    {
      "name": "claude-3-5-sonnet",
      "provider": "Anthropic",
      "priority": 2,
      "fallback": true
    },
    {
      "name": "llama3.1:8b",
      "provider": "Ollama",
      "priority": 3,
      "fallback": true
    }
  ]
}

Load Balancing Configuration

{
  "load_balancing": {
    "enabled": true,
    "strategy": "round_robin",
    "health_check_interval": 30,
    "timeout": 10
  }
}

📊 Performance Optimization

Caching Strategies

{
  "caching": {
    "enabled": true,
    "ttl": 3600,              // Cache TTL in seconds
    "max_cache_size": "1GB",  // Maximum cache size
    "compression": true       // Enable response compression
  }
}

Batch Processing

{
  "batch_processing": {
    "enabled": true,
    "max_batch_size": 10,
    "timeout": 30,
    "concurrency_limit": 5
  }
}

🎯 Use Case Optimization

Different Configurations for Different Tasks

Document Q&A

{
  "task": "document_qa",
  "model": "gpt-4o",
  "temperature": 0.1,
  "max_tokens": 1000,
  "system_prompt": "Answer questions based solely on the provided document context."
}

Creative Writing

{
  "task": "creative_writing",
  "model": "claude-3-5-sonnet",
  "temperature": 0.8,
  "max_tokens": 2000,
  "system_prompt": "Generate creative content while staying relevant to the document context."
}

Code Generation

{
  "task": "code_generation",
  "model": "gpt-4o",
  "temperature": 0.2,
  "max_tokens": 1500,
  "system_prompt": "Generate code based on the documentation and requirements provided."
}

🔍 Monitoring & Analytics

Response Quality Metrics

{
  "monitoring": {
    "response_time_tracking": true,
    "token_usage_monitoring": true,
    "quality_scoring": true,
    "error_rate_tracking": true
  }
}

Custom Metrics Dashboard

{
  "dashboard": {
    "real_time_metrics": true,
    "historical_trends": true,
    "model_comparison": true,
    "cost_analysis": true
  }
}

🛠️ Troubleshooting Common Issues

API Rate Limits

# Monitor rate limits
curl -X GET "http://localhost:80/api/rate-limits" \
  -H "Authorization: Bearer YOUR_TOKEN"

# Implement exponential backoff
# RAGFlow handles this automatically with retry logic

Model Compatibility Issues

# Check model compatibility
curl -X POST "http://localhost:80/api/models/check-compatibility" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "task": "document_qa"}'

Context Window Overflow

{
  "error_handling": {
    "context_overflow_strategy": "truncate",
    "chunk_reduction_ratio": 0.8,
    "fallback_model": "gpt-3.5-turbo"
  }
}

🔐 Security Best Practices

API Key Management

# Use environment variables
export RAGFLOW_ENCRYPTION_KEY="your-encryption-key"

# Rotate keys regularly
curl -X POST "http://localhost:80/api/keys/rotate" \
  -H "Authorization: Bearer ADMIN_TOKEN"

Model Access Control

{
  "access_control": {
    "user_roles": ["admin", "editor", "viewer"],
    "model_permissions": {
      "gpt-4": ["admin", "editor"],
      "claude-3": ["admin", "editor", "viewer"],
      "local-models": ["admin", "editor", "viewer"]
    }
  }
}

🚀 Production Deployment Considerations

High Availability Setup

# docker-compose.prod.yml
version: '3.8'
services:
  ragflow:
    image: infiniflow/ragflow:latest
    environment:
      - LLM_PROVIDER_BACKUP=true
      - LOAD_BALANCER_ENABLED=true
      - CACHE_LAYER=redis
    depends_on:
      - redis
      - postgres

  redis:
    image: redis:7-alpine

  postgres:
    image: postgres:15

Scaling Strategies

{
  "scaling": {
    "auto_scaling": true,
    "min_instances": 2,
    "max_instances": 10,
    "cpu_threshold": 70,
    "memory_threshold": 80
  }
}

🎯 Best Practices

Model Selection Guidelines

Use Case Recommended Models Rationale
Document Q&A GPT-4, Claude 3 High accuracy, good context understanding
Creative Tasks Claude 3, GPT-4 Better at generating natural, creative responses
Code Generation GPT-4, Claude 3 Strong code understanding and generation
Cost-Effective GPT-3.5, Claude Instant Good balance of cost and performance
Local/Offline Llama 3, Mistral Privacy-focused, no API costs

Performance Optimization Tips

  1. Use Appropriate Model Sizes: Larger models for complex tasks, smaller for simple queries
  2. Implement Caching: Cache frequent queries and responses
  3. Monitor Usage: Track token consumption and costs
  4. Load Balancing: Distribute requests across multiple model instances
  5. Fallback Strategies: Have backup models for reliability

📈 Next Steps

Now that you have configured LLMs for RAGFlow, you're ready to:


Ready to build intelligent chatbots? Continue to Chapter 6: Chatbot Development! 🚀

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for model, temperature, provider so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

  • coupling core logic too tightly to one implementation path
  • missing the handoff boundaries between setup, execution, and validation
  • shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 5: LLM Integration & Configuration as an operating subsystem inside RAGFlow Tutorial: Complete Guide to Open-Source RAG Engine, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around max_tokens, your, claude as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 5: LLM Integration & Configuration usually follows a repeatable control path:

  1. Context bootstrap: initialize runtime config and prerequisites for model.
  2. Input normalization: shape incoming data so temperature receives stable contracts.
  3. Core execution: run the main logic branch and propagate intermediate state through provider.
  4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
  5. Output composition: return canonical result payloads for downstream consumers.
  6. Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

Suggested trace strategy:

  • search upstream code for model and temperature to map concrete implementation paths
  • compare docs claims against actual runtime/config code before reusing patterns in production

Chapter Connections