**File:** `2_open_source_models/README.md`

# **Open-Source Model Experiments**

This directory contains four standalone experiments exploring
**local, open-source language models** for Retrieval-Augmented Generation
(RAG), model evaluation, recursive editing, and sustainability tracking
(energy & CO₂ emissions).
Each subfolder includes its own notebook, documentation, outputs, and
model-specific setup.

---

## Directory Structure

```text
2_open_source_models/
├── distilled_models/
│ └── rag_and_distilled_model/
├── quantized_models/
│ └── mistral7b/
└── slm/
├── google_gemma/
└── qwen/
```

Each subfolder contains a self-contained model with its own README,
notebook(s), generated outputs, and energy/emissions logs where applicable.

---

## Project Summaries

Below is a concise description of each project, so the purpose of the
overall folder can be grasped at a glance.

---

### **1. Distilled Models – RAG + Instruction-Tuned Distilled LMs**

**Folder:** `distilled_models/rag_and_distilled_model/`
**Notebook:** `Apollo11_rag&distilled.ipynb`

This project uses a lightweight **LaMini-Flan-T5-248M** distilled model
combined with a **MiniLM** embedding model to run a fully local
Retrieval-Augmented Generation pipeline on the Apollo 11 dataset.
It demonstrates:

* Local embeddings and ChromaDB vector storage
* RAG-based question answering
* Evaluation across several prompt types
* Emissions tracking and generated output logs

Ideal for showing how **compact distilled models** can handle
RAG efficiently on CPU or modest GPU hardware.
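The retrieval step at the heart of such a pipeline can be sketched in a few lines. This is a minimal, self-contained illustration: a toy bag-of-words embedding stands in for MiniLM, and a plain Python list stands in for the ChromaDB collection; the real notebook uses the actual models and vector store.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding standing in for MiniLM;
    # the real pipeline calls a sentence-transformers model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank chunks by cosine similarity to the query embedding (top-k retrieval).
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Apollo 11 launched on July 16, 1969 from Kennedy Space Center.",
    "The lunar module Eagle landed in the Sea of Tranquility.",
    "Neil Armstrong was the first human to walk on the Moon.",
]
context = retrieve("Where did the lunar module land?", chunks, k=1)
# The retrieved chunk is then stuffed into the generation prompt.
prompt = (
    f"Answer using only this context:\n{context[0]}\n\n"
    "Question: Where did the lunar module land?"
)
```

The same shape (embed, score, take top-k, build a grounded prompt) carries over unchanged when the toy pieces are swapped for MiniLM and ChromaDB.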

---

### **2. Quantized Models – Mistral 7B RAG Pipeline**

**Folder:** `quantized_models/mistral7b/`

This project evaluates a **quantized Mistral-7B (GGUF)** model running
fully locally via `llama-cpp-python`.
It focuses on:

* Retrieval-Augmented Generation using LlamaIndex
* Local inference using a 4-bit quantized LLM
* Document processing, embedding (BGE-small), and top-k retrieval
* Practical observations on feasibility and performance on a laptop

A strong example of how quantization enables
**large-model capability at small-device cost**.
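The storage saving behind 4-bit quantization can be illustrated with a simplified symmetric block-quantization scheme. This is an assumption-laden sketch of the idea, not the actual GGUF Q4 format (real GGUF variants such as Q4_K use more elaborate super-block layouts): each block of weights stores one floating-point scale plus small integer codes.

```python
import numpy as np

def quantize_q4(weights, block_size=32):
    # Symmetric 4-bit block quantization (simplified): one scale per block,
    # codes constrained to the signed 4-bit range [-8, 7].
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_q4(codes, scales):
    return (codes * scales).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 32)).astype(np.float32)
codes, scales = quantize_q4(w.ravel())
w_hat = dequantize_q4(codes, scales).reshape(w.shape)
err = np.abs(w - w_hat).max()
# 4-bit codes cut storage roughly 8x vs fp32 (plus one scale per block),
# at the cost of a small per-weight reconstruction error.
```

That trade (an 8x smaller weight file for a bounded reconstruction error) is what lets a 7B model fit in laptop RAM.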

---

### **3. Small Language Model (SLM): Google Gemma 2-2B**

**Folder:** `slm/google_gemma/`

This experiment implements a structured RAG workflow with Google’s lightweight
**Gemma 2-2B** model and a fixed Apollo 11 source text.
Key features include:

* Standardized 21-prompt evaluation set
* RAG pipeline with chunked retrieval
* Draft → Critic → Refiner multi-step generation
* Real-time emissions logging with CodeCarbon
* Fully reproducible testing and reporting

This project demonstrates how even very small open-weight models can
perform multi-step reasoning when paired with thoughtful prompting and revision
cycles.
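The Draft → Critic → Refiner control flow can be sketched independently of any particular model. Below, `generate` is a hypothetical callable standing in for a Gemma 2-2B inference call; the prompt wording is illustrative, not the notebook's exact prompts.

```python
def run_pipeline(question, context, generate):
    # generate(prompt) -> str; stands in for a call to the local model.
    draft = generate(
        f"Context:\n{context}\n\nAnswer the question: {question}"
    )
    critique = generate(
        f"Critique this answer for accuracy and clarity:\n{draft}"
    )
    refined = generate(
        "Rewrite the answer, addressing the critique.\n"
        f"Answer: {draft}\nCritique: {critique}"
    )
    return {"draft": draft, "critique": critique, "refined": refined}

# Stub model so the control flow can be exercised without loading weights.
calls = []
def stub(prompt):
    calls.append(prompt)
    return f"response-{len(calls)}"

result = run_pipeline(
    "Who flew on Apollo 11?", "Armstrong, Aldrin, Collins.", stub
)
```

Each revision cycle costs one extra generation per stage, which is why the emissions logging below tracks every call rather than only the final answer.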

---

### **4. Small Language Model (SLM): Qwen 2.5B + Recursive Editing**

**Folder:** `slm/qwen/`

This notebook experiments with **Qwen 2.5B**, integrating:

* RAG retrieval
* A recursive editing loop (Draft → Critic → Refine)
* Context retrieval through Hugging Face embeddings
* Energy + CO₂ logging for each query

Outputs are saved in markdown form with all iterations and emissions data.
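The per-query markdown log can be sketched as a small formatting helper. The section layout and field names here are illustrative assumptions, not the notebook's exact output format, and the emissions figure would come from a CodeCarbon `EmissionsTracker` in the real run.

```python
def log_query_md(question, iterations, emissions_kg):
    # iterations: ordered (label, text) pairs from the recursive editing loop.
    lines = [f"## {question}", ""]
    for label, text in iterations:
        lines += [f"### {label.title()}", text, ""]
    lines.append(f"*Emissions: {emissions_kg:.6f} kg CO₂eq*")
    return "\n".join(lines)

entry = log_query_md(
    "Who flew on Apollo 11?",
    [("draft", "Armstrong and Aldrin."),
     ("critique", "Collins is missing."),
     ("refined", "Armstrong, Aldrin, and Collins.")],
    emissions_kg=0.000012,
)
```

Keeping every iteration in the log, rather than only the final answer, makes the cost of each extra revision cycle visible alongside its quality gain.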

---

## Purpose of This Collection

This folder exists to:

* Compare how different **model sizes**, **architectures**, and
**inference strategies** behave on the **same tasks**.
* Demonstrate **fully local RAG pipelines** using only open-source components.
* Document **energy and carbon trade-offs** in local LLM usage.
* Provide reproducible examples that can be extended or rerun with other models.

Each subfolder is designed as a standalone experiment, but together they
form a cohesive study of open-source LLM efficiency and performance.

---

## Notes

* All code is intended to run locally.
* Each folder includes its own notebook and README with instructions.
* Energy/emissions reporting is included where relevant (via CodeCarbon).
* Datasets and prompts are standardized across projects for fairness and comparability.
**File:** `3_experiment/README.md`

# AI Model Comparison Experiment

## Evaluating Open-Source vs. Commercial Language Models

This folder contains the materials for our experiment comparing open-source and
commercial AI models through human evaluation. Participants were asked to read
pairs of AI-generated texts and judge their quality without knowing which model
produced which text.

---

## What This Experiment Is

We created a survey where each question includes two texts—**Text A** and
**Text B**—generated by different AI models. One text always comes from an
**open-source model**, and the other from a **commercial model**. Participants:

* Choose which text they prefer
* Guess which model type generated each text
* Rate both texts (accuracy, clarity, relevance, faithfulness)

All evaluations are blind to remove brand bias.

---

## Why We Did This

Open-source AI models are advancing quickly, and we wanted to understand
whether they are perceived as competitive alternatives to commercial systems.
While benchmarks can measure performance numerically, they don’t reflect how
humans actually experience AI-generated writing.

This experiment aims to answer questions like:

* Do people notice a consistent quality difference?
* Can users accurately identify commercial vs. open-source output?
* Are open-source models “good enough” for real-world tasks?

Understanding these perceptions is important for evaluating the viability of
sustainable, accessible, and transparent AI systems.

---

## Why We Chose This Method

We used a **paired, blind comparison** because it provides a clean way to
assess text quality without model reputation influencing the results.
Participants judge writing on its own merits, which helps us collect more
reliable data.
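The blinding itself reduces to randomizing which model's output appears as Text A for each item, while keeping a hidden answer key for analysis. A minimal sketch (function and field names are hypothetical, not the study's actual tooling):

```python
import random

def make_blind_item(open_text, commercial_text, rng):
    # Randomly assign which model appears as Text A, so raters cannot
    # infer model identity from position across items.
    pair = [("open", open_text), ("commercial", commercial_text)]
    rng.shuffle(pair)
    item = {"A": pair[0][1], "B": pair[1][1]}  # shown to participants
    key = {"A": pair[0][0], "B": pair[1][0]}   # withheld until analysis
    return item, key

rng = random.Random(42)  # fixed seed makes the assignment reproducible
item, key = make_blind_item("open output...", "commercial output...", rng)
```

Recording the key separately lets preferences and model-type guesses be scored after the fact without ever exposing model identity during the survey.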

We included multiple task types (summarization, paraphrasing, reasoning, and
creative writing) because each one tests a different aspect of model behavior.

This variety gives us a broader picture of model strengths and weaknesses.

---

## Why This Approach Works Well

This survey-based structure is simple and easy for participants to
understand. It mirrors how people naturally interact with AI systems: reading
text and forming opinions about quality. By keeping the evaluation blind, we
minimize bias and generate more meaningful insights into real user perception.

The method also helps determine whether open-source models, especially optimized
ones, can realistically serve as alternatives to commercial systems
in practical use.

---

## Contents of This Folder

```text
3_experiment/
├── survey_form.md # The form text used in the study
└── README.md # Explanation of the experiment (this file)
```