
Overall Approach

Main Use Case: Using the RAG Pipeline

This project implements a Retrieval-Augmented Generation (RAG) pipeline for answering domain-specific questions using a large language model (LLM). It combines a precomputed FAISS vectorstore of embeddings with a Generator Model (i.e., the LLM) to provide accurate and context-aware responses.

FAISS vectorstore generation

Loading the vectorstore and model in the RAG pipeline

Setting up the prompt template and QA chain using LangChain

When a user submits a query, the pipeline first retrieves the most relevant context from the vectorstore using semantic similarity.

Retrieval of context by semantic similarity with LangChain's RetrievalQA
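The retrieval step can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the real pipeline goes through the FAISS vectorstore and LangChain, and the index may use a different metric (for example L2 distance) rather than the cosine similarity assumed here. The names `cosine`, `retrieve`, and `toy_index`, and the 3-dimensional embeddings, are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k chunks whose embeddings are most similar to query_vec.

    `index` is a hypothetical list of (chunk_text, embedding) pairs that
    stands in for the FAISS vectorstore.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy embeddings, invented for illustration only.
toy_index = [
    ("chunk about installation", [1.0, 0.0, 0.0]),
    ("chunk about configuration", [0.0, 1.0, 0.0]),
    ("chunk about installing plugins", [0.9, 0.1, 0.0]),
]
top_chunks = retrieve([1.0, 0.05, 0.0], toy_index, k=2)
```

Here `top_chunks` holds the two installation-related chunks, since their embeddings point in nearly the same direction as the query vector.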

This context is then inserted into a structured instruction-style prompt, which is fed to the LLM to generate an answer. The system parses the output into context, question, and answer components and logs each interaction in a JSON dataset for future reference.
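A rough sketch of the prompt, parse, and log steps follows. The section markers in `PROMPT_TEMPLATE`, the function names, and the log layout (a single JSON list on disk) are assumptions for illustration; the project's actual template and dataset format may differ.

```python
import json
import os

# Hypothetical instruction-style template; the real one may differ.
PROMPT_TEMPLATE = (
    "### Context:\n{context}\n\n"
    "### Question:\n{question}\n\n"
    "### Answer:\n"
)

def build_prompt(context, question):
    """Insert the retrieved context and the user question into the template."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

def parse_output(raw):
    """Split prompt-plus-completion text into context, question, and answer."""
    _, _, rest = raw.partition("### Context:\n")
    context, _, rest = rest.partition("\n\n### Question:\n")
    question, _, answer = rest.partition("\n\n### Answer:\n")
    return {
        "context": context.strip(),
        "question": question.strip(),
        "answer": answer.strip(),
    }

def log_interaction(record, path):
    """Append one parsed interaction to a JSON list stored at `path`."""
    records = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
    records.append(record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
```

Rewriting the whole file on every call is simple but scales poorly; an append-only JSONL file would be a reasonable alternative for larger logs.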

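Because the embeddings are precomputed and indexed in FAISS, answering a query needs only one query embedding and a nearest-neighbor lookup, which keeps retrieval efficient over large datasets.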

Query using a gold dataset, and save results

RAG LLM Optimization

Phase 1: Pairwise Preference Dataset Generation for Training the Reward Model

This phase generates pairwise preference data to train a reward model for instruction-following or code-related tasks.

For each query in the dataset, the process first retrieves relevant context from a precomputed FAISS vectorstore.

A structured prompt is constructed combining the retrieved context and the query, which is then passed to a generative language model (Mistral-7B-Instruct) to produce multiple candidate answers.

Generation of multiple candidates
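As a loose sketch of the candidate-collection loop, the snippet below takes the generator as a plain callable so it runs without a model. `generate_fn`, its `seed` argument, the deduplication step, and `toy_generate` are all assumptions for illustration; in the project this role is played by sampling from Mistral-7B-Instruct.

```python
def generate_candidates(prompt, generate_fn, n=4, max_attempts=12):
    """Collect up to n distinct, non-empty candidate answers for one prompt.

    `generate_fn(prompt, seed=...)` is a hypothetical callable that returns
    one sampled completion as a string.
    """
    candidates = []
    seen = set()
    for attempt in range(max_attempts):
        text = generate_fn(prompt, seed=attempt).strip()
        if text and text not in seen:
            seen.add(text)
            candidates.append(text)
        if len(candidates) == n:
            break
    return candidates

def toy_generate(prompt, seed):
    """Stand-in for an LLM: cycles through canned answers, with a repeat."""
    canned = ["Answer A", "Answer B", "Answer A", "Answer C", "  "]
    return canned[seed % len(canned)]

candidates = generate_candidates("What is FAISS?", toy_generate, n=3)
```

Dropping exact duplicates matters for the next step: two identical candidates would receive the same score and carry no preference signal.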

Each candidate is scored using a heuristic ranking system that combines: (1) similarity to the retrieved context (“grounding score”), (2) coverage of task-relevant keywords, and (3) a mild length penalty to discourage overly verbose answers.

Candidate scoring function
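The three-part heuristic could look roughly like the sketch below. It is a stand-in under stated assumptions, not the project's scoring function: token overlap replaces whatever similarity measure the grounding score actually uses, and the weights, `target_words`, and `rate` values are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def grounding_score(candidate, context):
    """Fraction of candidate tokens that also occur in the retrieved context."""
    cand = tokenize(candidate)
    return len(cand & tokenize(context)) / len(cand) if cand else 0.0

def keyword_coverage(candidate, keywords):
    """Fraction of task-relevant keywords that the candidate mentions."""
    if not keywords:
        return 0.0
    cand = tokenize(candidate)
    return sum(1 for kw in keywords if kw.lower() in cand) / len(keywords)

def length_penalty(candidate, target_words=60, rate=0.002):
    """Mild penalty that grows linearly with words beyond target_words."""
    extra = max(0, len(candidate.split()) - target_words)
    return rate * extra

def score_candidate(candidate, context, keywords, w_ground=0.6, w_kw=0.4):
    """Weighted sum of grounding and keyword coverage, minus the length penalty."""
    return (
        w_ground * grounding_score(candidate, context)
        + w_kw * keyword_coverage(candidate, keywords)
        - length_penalty(candidate)
    )
```

With these assumed defaults, an answer that is fully grounded and covers every keyword scores close to 1.0, and each word past the 60-word target subtracts 0.002.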

The top-scoring candidate is marked as “chosen” and the lowest-scoring candidate as “rejected,” forming a pair. These prompt–chosen–rejected triples are saved in a JSONL file (pairwise_prefs.jsonl) and provide training data for reward models that can later guide preference-aligned generation.
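Pair construction and JSONL output can be sketched as follows. The record keys `prompt`, `chosen`, and `rejected` follow the triple described above, but the exact schema of pairwise_prefs.jsonl is an assumption, as is the choice to skip prompts whose candidates all tie.

```python
import json

def build_pair(prompt, scored_candidates):
    """Turn (text, score) candidates into one prompt/chosen/rejected record.

    Returns None when there are fewer than two candidates or when the best
    and worst scores tie, since such cases carry no preference signal.
    """
    if len(scored_candidates) < 2:
        return None
    ranked = sorted(scored_candidates, key=lambda c: c[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen[1] == rejected[1]:
        return None
    return {"prompt": prompt, "chosen": chosen[0], "rejected": rejected[0]}

def append_jsonl(record, path):
    """Append one record as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

pair = build_pair(
    "What is FAISS?",
    [("A similarity search library.", 0.9), ("A fruit.", 0.1), ("A library.", 0.5)],
)
```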

This approach ensures that the reward model learns to prefer outputs that are both contextually grounded and relevant to the task.

See more details of the implementation of this phase...

Phase 2: Training a Pairwise Reward Model

In this stage, a Reward Model (RM) is trained using the preference pairs generated earlier.

The dataset consists of triplets: a prompt, a “chosen” answer (preferred), and a “rejected” answer (less preferred).

A pretrained base encoder (e.g., bert-base-uncased) is fine-tuned to assign a scalar reward score to each answer.

Model and data selection

Training uses a pairwise loss function of the form -log σ(r_chosen − r_rejected), which encourages the model to give higher scores to preferred answers compared to rejected ones. This setup aligns the model’s scoring function with human-like or heuristic preferences.

Training loop
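The pairwise loss -log σ(r_chosen − r_rejected) can be written out for scalar rewards as in the sketch below. The actual training loop presumably operates on batched tensors; this plain-Python version only illustrates the formula, using the identity -log σ(m) = log(1 + exp(-m)) in a form that avoids overflow for large margins. The function names are illustrative.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    margin = r_chosen - r_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_pairwise_loss(reward_pairs):
    """Average the loss over a batch of (r_chosen, r_rejected) pairs."""
    return sum(pairwise_loss(c, r) for c, r in reward_pairs) / len(reward_pairs)
```

When both answers receive the same reward, the loss equals log 2. It approaches 0 as the chosen answer's reward pulls ahead and grows roughly linearly when the rejected answer is scored higher.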

The process includes splitting data into training and validation sets, optimizing with AdamW, and monitoring both loss and validation accuracy. After each epoch, checkpoints are saved, and a log file tracks progress.

Validation
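The split and the validation metric might be sketched as below. The definition of validation accuracy is an assumption: here it is the fraction of held-out pairs for which the model scores the chosen answer strictly above the rejected one, a common choice for pairwise reward models. The 10% validation fraction and fixed seed are illustrative defaults.

```python
import random

def split_pairs(pairs, val_fraction=0.1, seed=0):
    """Shuffle the pairs reproducibly and split them into (train, validation)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def pairwise_accuracy(reward_pairs):
    """Fraction of (r_chosen, r_rejected) pairs with r_chosen > r_rejected."""
    if not reward_pairs:
        return 0.0
    correct = sum(1 for r_chosen, r_rejected in reward_pairs if r_chosen > r_rejected)
    return correct / len(reward_pairs)
```

An accuracy near 0.5 would mean the reward model cannot tell chosen from rejected answers better than chance.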

The trained reward model then serves as the evaluator for the reinforcement learning or direct preference optimization steps that follow.

See more details of the implementation of this phase...

Phase 3: Fine-tuning the RAG Generator Model (LLM) with PPO and Feedback from Reward Model

In this phase, the policy model (Mistral-7B-Instruct), the main RAG model we want to optimize, is fine-tuned using Proximal Policy Optimization (PPO) with guidance from the reward model trained on synthetic preference data.

In addition, during training, a frozen reference model is maintained to constrain policy updates and prevent instability.

Instantiation of the Policy Model (The RAG Model to be optimized)

Instantiation of the Reference Model (To be used as a frozen copy of the above model)

Instantiation of the Reward Model

Instead of relying on direct human annotations, the system uses heuristic rankings (context grounding, keyword coverage, and a length penalty) to generate “chosen vs. rejected” pairs. The reward model trained on these pairs then provides scalar rewards for policy outputs.

The pipeline samples prompts, generates candidate responses from the policy, scores them with the reward model, and updates the policy to maximize expected reward while staying close to the reference.
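Two ingredients commonly used for this kind of update are sketched below with scalar toy values: a reward shaped by a KL-style penalty against the reference model, and PPO's clipped surrogate objective. This is a generic illustration, not the project's training code. The function names, `beta`, and `eps` are assumptions, and a real implementation would work on per-token tensors, most likely through an RL library.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference.

    The penalty is a sample-based KL estimate: the sum over generated tokens
    of log p_policy(token) - log p_ref(token).
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample.

    `ratio` is p_new(action) / p_old(action); clipping it to
    [1 - eps, 1 + eps] limits how far a single update can move the policy.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the policy assigns the same log-probabilities as the reference, the shaped reward equals the raw reward-model score. The penalty only applies once the policy starts to diverge.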

This approach is a form of Reinforcement Learning from AI Feedback (RLAIF): synthetic preferences stand in for human judgments, enabling scalable alignment without manual labeling.

See more details of the implementation of this phase...
