Skip to content

Latest commit

 

History

History
706 lines (515 loc) · 24.7 KB

File metadata and controls

706 lines (515 loc) · 24.7 KB
layout default
title Chapter 7: Advanced Techniques
parent LLaMA Factory Tutorial
nav_order 7

Chapter 7: Advanced Techniques

Welcome to the cutting edge! This chapter explores advanced techniques for pushing the boundaries of LLaMA fine-tuning, from continual learning to multi-modal models and beyond.

Advanced Techniques Overview

flowchart TD
    ADV[Advanced Techniques] --> CL[Continual Learning\nIncremental fine-tuning]
    ADV --> MULTI[Multi-GPU Training\nDeepSpeed / FSDP]
    ADV --> RLHF[RLHF Pipeline\nReward Modeling + PPO]
    ADV --> VLM[Vision-Language\nQwen2-VL, LLaVA support]
    ADV --> MERGE_TECH[Model Merging\nTIES, DARE, SLERP]

    MULTI --> DS[DeepSpeed ZeRO-3]
    MULTI --> FSDP[FSDP + QLoRA]
Loading

Continual Learning and Model Updates

Incremental Fine-tuning

class ContinualLearner:
    """Handles incremental learning without catastrophic forgetting"""

    def __init__(self, base_model, tokenizer):
        self.model = base_model
        self.tokenizer = tokenizer
        self.task_vectors = {}  # Store task-specific adaptations

    def learn_new_task(self, task_data: List[Dict], task_name: str):
        """Learn a new task while preserving previous knowledge"""

        # Create task-specific adapter
        adapter_config = LoraConfig(
            r=16,
            lora_alpha=32,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            lora_dropout=0.05,
        )

        # Add adapter for this task
        task_adapter_name = f"task_{task_name}"
        self.model.add_adapter(task_adapter_name, adapter_config)

        # Train only the new adapter
        training_args = TrainingArguments(
            output_dir=f"./results/{task_name}",
            num_train_epochs=3,
            per_device_train_batch_size=4,
            learning_rate=2e-5,
            save_strategy="epoch",
            evaluation_strategy="epoch",
        )

        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=task_data,
            tokenizer=self.tokenizer,
        )

        # Set active adapter for training
        self.model.set_adapter(task_adapter_name)
        trainer.train()

        # Store task vector
        self.task_vectors[task_name] = self.model.state_dict()

    def switch_task(self, task_name: str):
        """Switch to a different learned task"""
        if task_name in self.task_vectors:
            self.model.load_state_dict(self.task_vectors[task_name])
            self.model.set_adapter(f"task_{task_name}")

    def merge_knowledge(self, task_names: List[str]):
        """Merge knowledge from multiple tasks"""
        # Implement knowledge distillation or model merging
        merged_state = {}

        for key in self.task_vectors[task_names[0]].keys():
            if any('lora' in key for key in task_names):
                # Average LoRA weights across tasks
                merged_state[key] = sum(
                    self.task_vectors[task][key] for task in task_names
                ) / len(task_names)
            else:
                # Keep base weights unchanged
                merged_state[key] = self.task_vectors[task_names[0]][key]

        return merged_state

# Usage
continual_learner = ContinualLearner(model, tokenizer)

# Learn multiple tasks incrementally
continual_learner.learn_new_task(math_data, "math")
continual_learner.learn_new_task(code_data, "coding")
continual_learner.learn_new_task(writing_data, "creative_writing")

# Switch between tasks
continual_learner.switch_task("math")
# Now model is optimized for math tasks

continual_learner.switch_task("coding")
# Now model is optimized for coding tasks

Multi-Modal Fine-tuning

Vision-Language Models

from transformers import CLIPModel, CLIPProcessor
from torch.nn import functional as F

class MultiModalLLaMA:
    """LLaMA model extended with vision capabilities"""

    def __init__(self, text_model, vision_model_name="openai/clip-vit-base-patch32"):
        self.text_model = text_model
        self.vision_model = CLIPModel.from_pretrained(vision_model_name)
        self.vision_processor = CLIPProcessor.from_pretrained(vision_model_name)

        # Add projection layer to align vision and text embeddings
        self.vision_projection = torch.nn.Linear(
            self.vision_model.config.vision_config.hidden_size,
            self.text_model.config.hidden_size
        )

    def encode_image(self, image):
        """Encode image into text embedding space"""
        inputs = self.vision_processor(images=image, return_tensors="pt")

        with torch.no_grad():
            vision_outputs = self.vision_model.get_image_features(**inputs)

        # Project to text embedding space
        projected = self.vision_projection(vision_outputs)
        return projected

    def generate_with_image(self, image, prompt: str, max_length: int = 100):
        """Generate text conditioned on image"""

        # Encode image
        image_embedding = self.encode_image(image)

        # Tokenize text prompt
        text_inputs = self.tokenizer(prompt, return_tensors="pt")

        # Combine image and text embeddings
        # (This is a simplified approach - real implementation would be more complex)
        combined_embedding = torch.cat([image_embedding, text_inputs['input_ids']], dim=1)

        # Generate
        outputs = self.text_model.generate(
            inputs_embeds=combined_embedding,
            max_length=max_length,
            num_return_sequences=1
        )

        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Usage
multi_modal_model = MultiModalLLaMA(text_model)

# Generate caption for image
from PIL import Image
image = Image.open("path/to/image.jpg")

caption = multi_modal_model.generate_with_image(
    image,
    "Describe this image in detail:"
)
print(caption)

Federated Learning

Privacy-Preserving Training

class FederatedLearner:
    """Implements federated learning for distributed training"""

    def __init__(self, global_model, num_clients: int):
        self.global_model = global_model
        self.num_clients = num_clients
        self.client_models = []
        self.client_weights = []

    def create_client_models(self):
        """Create local models for each client"""

        for i in range(self.num_clients):
            # Create client-specific model (could be on different devices)
            client_model = copy.deepcopy(self.global_model)

            # Add differential privacy noise
            self.add_differential_privacy(client_model)

            self.client_models.append(client_model)
            self.client_weights.append(1.0)  # Equal weighting initially

    def train_clients_locally(self, client_datasets: List[Dataset]):
        """Train each client on their local data"""

        local_updates = []

        for i, (client_model, client_data) in enumerate(zip(self.client_models, client_datasets)):
            print(f"Training client {i+1}/{self.num_clients}")

            # Local training
            local_update = self.train_single_client(client_model, client_data)
            local_updates.append(local_update)

        return local_updates

    def aggregate_updates(self, local_updates: List[Dict]):
        """Aggregate local updates using FedAvg"""

        # Initialize aggregated update
        aggregated_update = {}

        for key in local_updates[0].keys():
            # Weighted average of updates
            aggregated_update[key] = sum(
                update[key] * weight
                for update, weight in zip(local_updates, self.client_weights)
            ) / sum(self.client_weights)

        return aggregated_update

    def update_global_model(self, aggregated_update: Dict):
        """Update global model with aggregated updates"""

        with torch.no_grad():
            for name, param in self.global_model.named_parameters():
                if name in aggregated_update:
                    param.copy_(aggregated_update[name])

    def add_differential_privacy(self, model, noise_scale: float = 0.01):
        """Add differential privacy noise to model parameters"""

        for param in model.parameters():
            noise = torch.randn_like(param) * noise_scale
            param.data.add_(noise)

    def train_single_client(self, model, dataset, local_epochs: int = 5):
        """Train a single client model"""

        optimizer = AdamW(model.parameters(), lr=2e-5)
        model.train()

        for epoch in range(local_epochs):
            for batch in DataLoader(dataset, batch_size=4, shuffle=True):
                outputs = model(**batch)
                loss = outputs.loss

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Return model update (difference from initial state)
        return {name: param.clone() for name, param in model.named_parameters()}

# Usage
federated_learner = FederatedLearner(global_model, num_clients=10)
federated_learner.create_client_models()

# Simulate federated training round
local_updates = federated_learner.train_clients_locally(client_datasets)
aggregated_update = federated_learner.aggregate_updates(local_updates)
federated_learner.update_global_model(aggregated_update)

Knowledge Distillation

Teacher-Student Training

class KnowledgeDistiller:
    """Implements knowledge distillation from teacher to student models"""

    def __init__(self, teacher_model, student_model, tokenizer):
        self.teacher_model = teacher_model
        self.student_model = student_model
        self.tokenizer = tokenizer
        self.temperature = 2.0
        self.alpha = 0.5  # Weight for distillation loss

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Compute distillation loss"""

        # Soft targets from teacher
        teacher_probs = F.softmax(teacher_logits / self.temperature, dim=-1)

        # Student predictions
        student_log_probs = F.log_softmax(student_logits / self.temperature, dim=-1)

        # Distillation loss
        distillation_loss = F.kl_div(
            student_log_probs,
            teacher_probs,
            reduction='batchmean'
        ) * (self.temperature ** 2)

        # Ground truth loss
        ce_loss = F.cross_entropy(student_logits, labels)

        # Combined loss
        total_loss = self.alpha * distillation_loss + (1 - self.alpha) * ce_loss

        return total_loss

    def distill_knowledge(self, train_dataset, num_epochs: int = 10):
        """Perform knowledge distillation"""

        optimizer = AdamW(self.student_model.parameters(), lr=1e-5)
        data_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

        self.teacher_model.eval()
        self.student_model.train()

        for epoch in range(num_epochs):
            total_loss = 0

            for batch in data_loader:
                inputs = {k: v.to(self.student_model.device) for k, v in batch.items()}

                # Teacher forward pass
                with torch.no_grad():
                    teacher_outputs = self.teacher_model(**inputs)
                    teacher_logits = teacher_outputs.logits

                # Student forward pass
                student_outputs = self.student_model(**inputs)
                student_logits = student_outputs.logits

                # Compute distillation loss
                loss = self.distillation_loss(
                    student_logits,
                    teacher_logits,
                    inputs['labels']
                )

                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            print(f"Epoch {epoch+1}: Average loss = {total_loss / len(data_loader):.4f}")

    def evaluate_distilled_model(self, eval_dataset):
        """Evaluate the distilled student model"""

        self.student_model.eval()
        evaluator = ModelEvaluator(self.student_model, self.tokenizer)

        metrics = evaluator.evaluate_generation_quality(eval_dataset, num_samples=100)

        return metrics

# Usage
# Assume we have a large teacher model and want to distill to smaller student
distiller = KnowledgeDistiller(teacher_model, student_model, tokenizer)
distiller.distill_knowledge(train_dataset, num_epochs=20)

# Evaluate compression results
metrics = distiller.evaluate_distilled_model(eval_dataset)
print(f"Distilled model performance: {metrics}")

Advanced Prompting Techniques

Chain-of-Thought Fine-tuning

def create_cot_training_data(base_dataset: List[Dict]) -> List[Dict]:
    """Create chain-of-thought training data"""

    cot_data = []

    for item in base_dataset:
        question = item['instruction']
        answer = item['output']

        # Generate step-by-step reasoning
        reasoning_steps = generate_reasoning_steps(question, answer)

        # Create CoT format
        cot_instruction = f"{question}\n\nLet's think step by step:"
        cot_output = ""

        for i, step in enumerate(reasoning_steps):
            cot_output += f"Step {i+1}: {step}\n"

        cot_output += f"\nFinal Answer: {answer}"

        cot_data.append({
            'instruction': cot_instruction,
            'input': '',
            'output': cot_output
        })

    return cot_data

def generate_reasoning_steps(question: str, answer: str) -> List[str]:
    """Generate intermediate reasoning steps"""

    # Use GPT-4 to generate reasoning steps
    prompt = f"""
    For the question: "{question}"
    With the answer: "{answer}"

    Generate 3-5 intermediate reasoning steps that would lead to this answer.
    Make them logical and educational.

    Return as a numbered list.
    """

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )

    steps_text = response.choices[0].message.content

    # Parse steps
    steps = []
    for line in steps_text.split('\n'):
        if line.strip() and (line[0].isdigit() or line.startswith('-')):
            step = line.split('.', 1)[-1].strip() if line[0].isdigit() else line[1:].strip()
            steps.append(step)

    return steps

# Usage
base_data = load_dataset("your-dataset")['train']
cot_data = create_cot_training_data(base_data[:1000])  # Sample for demonstration

# Fine-tune with CoT data
cot_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=cot_data,
    tokenizer=tokenizer
)

cot_trainer.train()

Meta-Learning and Few-Shot Adaptation

Model-Agnostic Meta-Learning (MAML)

class MAMLTrainer:
    """Implements MAML for few-shot adaptation"""

    def __init__(self, model, inner_lr: float = 0.01, meta_lr: float = 0.001):
        self.model = model
        self.inner_lr = inner_lr
        self.meta_lr = meta_lr
        self.meta_optimizer = AdamW(model.parameters(), lr=meta_lr)

    def adapt_to_task(self, task_data: List[Dict], num_steps: int = 5):
        """Adapt model to a new task using few-shot data"""

        # Create temporary copy for inner loop
        adapted_model = copy.deepcopy(self.model)
        inner_optimizer = SGD(adapted_model.parameters(), lr=self.inner_lr)

        # Inner loop: adapt to task
        for step in range(num_steps):
            batch = self.sample_batch(task_data)

            # Forward pass
            outputs = adapted_model(**batch)
            loss = outputs.loss

            # Inner update
            inner_optimizer.zero_grad()
            loss.backward()
            inner_optimizer.step()

        return adapted_model

    def meta_update(self, task_datasets: List[List[Dict]]):
        """Perform meta-learning update across multiple tasks"""

        meta_loss = 0

        for task_data in task_datasets:
            # Adapt to task
            adapted_model = self.adapt_to_task(task_data)

            # Evaluate on query set (held-out examples)
            query_batch = self.sample_batch(task_data, query=True)
            query_outputs = adapted_model(**query_batch)
            query_loss = query_outputs.loss

            meta_loss += query_loss

        # Meta update
        self.meta_optimizer.zero_grad()
        meta_loss.backward()
        self.meta_optimizer.step()

    def sample_batch(self, data: List[Dict], batch_size: int = 4, query: bool = False):
        """Sample a batch from task data"""
        # Implementation would sample support/query sets for meta-learning
        return data[:batch_size]  # Simplified

# Usage
maml_trainer = MAMLTrainer(model)

# Meta-training across multiple tasks
task_datasets = [math_tasks, coding_tasks, writing_tasks]
maml_trainer.meta_update(task_datasets)

# Few-shot adaptation to new task
new_task_data = load_new_task_data()
adapted_model = maml_trainer.adapt_to_task(new_task_data)

Advanced Evaluation Metrics

Human Preference Alignment

class PreferenceEvaluator:
    """Evaluate models based on human preferences"""

    def __init__(self, model_a, model_b, tokenizer):
        self.model_a = model_a
        self.model_b = model_b
        self.tokenizer = tokenizer

    def collect_preferences(self, prompts: List[str], num_comparisons: int = 100):
        """Collect human preferences between model outputs"""

        preferences = []

        for i in range(min(num_comparisons, len(prompts))):
            prompt = prompts[i]

            # Generate responses from both models
            response_a = self.generate_response(self.model_a, prompt)
            response_b = self.generate_response(self.model_b, prompt)

            # In practice, you would collect human judgments
            # For demo, we'll simulate preferences
            preference = self.simulate_human_preference(response_a, response_b)

            preferences.append({
                'prompt': prompt,
                'response_a': response_a,
                'response_b': response_b,
                'preferred': preference
            })

        return preferences

    def calculate_win_rate(self, preferences: List[Dict]) -> Dict[str, float]:
        """Calculate win rate for each model"""

        wins_a = sum(1 for p in preferences if p['preferred'] == 'A')
        wins_b = sum(1 for p in preferences if p['preferred'] == 'B')

        total = len(preferences)

        return {
            'model_a_win_rate': wins_a / total,
            'model_b_win_rate': wins_b / total
        }

    def simulate_human_preference(self, response_a: str, response_b: str) -> str:
        """Simulate human preference (in practice, collect real human feedback)"""

        # Simple heuristic: prefer longer, more detailed responses
        len_a = len(response_a.split())
        len_b = len(response_b.split())

        if len_a > len_b:
            return 'A'
        elif len_b > len_a:
            return 'B'
        else:
            return 'A' if hash(response_a) % 2 == 0 else 'B'

# Usage
preference_evaluator = PreferenceEvaluator(model_a, model_b, tokenizer)

prompts = [
    "Explain quantum computing",
    "Write a Python function to sort a list",
    "Describe the water cycle"
]

preferences = preference_evaluator.collect_preferences(prompts)
win_rates = preference_evaluator.calculate_win_rate(preferences)

print(f"Model A win rate: {win_rates['model_a_win_rate']:.3f}")
print(f"Model B win rate: {win_rates['model_b_win_rate']:.3f}")

What We've Accomplished

Outstanding! 🚀 You've mastered advanced LLaMA Factory techniques:

  1. Continual learning - Incremental learning without forgetting
  2. Multi-modal models - Vision-language integration
  3. Federated learning - Privacy-preserving distributed training
  4. Knowledge distillation - Model compression and efficiency
  5. Advanced prompting - Chain-of-thought and reasoning
  6. Meta-learning - Few-shot adaptation capabilities
  7. Preference learning - Human-aligned model evaluation

Next Steps

Ready for the grand finale? In Chapter 8: Production Case Studies, we'll explore real-world production deployments and lessons learned!


Practice what you've learned:

  1. Implement continual learning for your model
  2. Try knowledge distillation to create smaller, faster models
  3. Experiment with federated learning for privacy-preserving training
  4. Add chain-of-thought reasoning to your fine-tuning
  5. Implement meta-learning for few-shot adaptation

What's the most advanced technique you'll implement? 🧠

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for self, model, tokenizer so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

  • coupling core logic too tightly to one implementation path
  • missing the handoff boundaries between setup, execution, and validation
  • shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 7: Advanced Techniques as an operating subsystem inside LLaMA-Factory Tutorial: Unified Framework for LLM Training and Fine-tuning, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around loss, task, List as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 7: Advanced Techniques usually follows a repeatable control path:

  1. Context bootstrap: initialize runtime config and prerequisites for self.
  2. Input normalization: shape incoming data so model receives stable contracts.
  3. Core execution: run the main logic branch and propagate intermediate state through tokenizer.
  4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
  5. Output composition: return canonical result payloads for downstream consumers.
  6. Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

  • src/llamafactory/train/dpo/trainer.py The DPO (Direct Preference Optimization) trainer used for RLHF-style alignment training. Shows how preference pairs are processed and the DPO loss is computed against a reference model.

  • src/llamafactory/train/ppo/trainer.py PPO (Proximal Policy Optimization) trainer for reinforcement learning from human feedback. Implements reward model scoring, KL-divergence penalty, and policy gradient updates.

  • src/llamafactory/train/rm/trainer.py Reward model trainer used to learn human preferences from ranked response pairs. Required as the first stage of a full RLHF pipeline before PPO fine-tuning.

  • src/llamafactory/model/model_utils/visual.py Vision-language model integration utilities. Handles multi-modal input encoding, image patch embedding projection, and alignment with the language model's embedding space for VLM fine-tuning (e.g., Qwen2-VL, LLaVA).

  • examples/deepspeed/ DeepSpeed ZeRO configuration examples (ZeRO-2, ZeRO-3, offload variants) used for multi-GPU and multi-node distributed training. These YAML files are passed via --deepspeed to enable sharded optimizer states, gradients, and parameters.

  • src/llamafactory/hparams/finetuning_args.py Defines FinetuningArguments dataclass including fields for stage (sft/dpo/ppo/rm/kto), lora_rank, use_dora, pissa_init, rope_scaling, and model-merging method (merge_method: ties, dare, slerp). This is the authoritative source for all advanced training options.

Suggested trace strategy:

  • Follow src/llamafactory/train/dpo/trainer.pycompute_preference_loss() to understand how RLHF alignment differs from SFT
  • Check src/llamafactory/hparams/finetuning_args.py for the full list of supported merge_method values and VLM-specific flags
  • Review DeepSpeed ZeRO-3 config in examples/deepspeed/ds_z3_config.json before running multi-GPU jobs

Chapter Connections