# 🌱 ELO2 – GREEN AI

***Comparing Commercial and Open-Source Language Models for***
***Sustainable AI***

This repository presents the **ELO2 – GREEN AI Project**, developed
within the **MIT Emerging Talent – AI & ML Program (2025)**. The work
investigates the technical performance, sustainability traits, and
human-perceived quality of **open-source language models**
compared to commercial systems.

---

## 🔍 Project Overview

### Research Question

**To what extent can open-source LLMs provide competitive output quality
while operating at significantly lower environmental cost?**

### Motivation

Large commercial LLMs deliver strong performance but demand substantial
compute and energy. This project examines whether **small, accessible,
and environmentally efficient open-source models**—especially when
enhanced with retrieval and refinement pipelines—can offer practical
alternatives for everyday tasks.

---

## 🧪 Methods

### 1. Model Families

The study evaluates several open-source model groups:

- **Quantized Model:** Mistral-7B (GGUF)
- **Distilled Model:** LaMini-Flan-T5-248M
- **Small Models:** Qwen, Gemma
- **Enhanced Pipelines (applied to all model families):**
  - **RAG (Retrieval-Augmented Generation)**
  - **Recursive Editing**
    - includes AI-based critique and iterative refinement

These configurations serve as the optimized open-source setups used in
the comparison against commercial models.

### 2. Tasks & Dataset

Evaluation tasks include:

- summarization
- factual reasoning
- paraphrasing
- short creative writing
- instruction following
- question answering

A targeted excerpt from the **Apollo-11 mission transcripts** served as
the central reference text for all evaluation tasks. All prompts were
constructed directly from this shared material. Using a single, consistent
source ensured that every model was tested under identical informational
conditions, allowing a clear and fair comparison of output quality and
relevance.
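
The prompt-construction step can be sketched as follows. The excerpt and task templates below are hypothetical stand-ins for illustration; the actual Apollo-11 material and prompts live in this repository:

```python
# Sketch: building task prompts from one shared reference text.
# REFERENCE_EXCERPT and the templates are illustrative stand-ins,
# not the project's actual prompts.

REFERENCE_EXCERPT = (
    "Houston, Tranquility Base here. The Eagle has landed."
)

TASK_TEMPLATES = {
    "summarization": "Summarize the following text in two sentences:\n\n{text}",
    "question_answering": "Using only the text below, answer: who has landed?\n\n{text}",
    "paraphrasing": "Rewrite the following text in your own words:\n\n{text}",
}

def build_prompts(text: str) -> dict[str, str]:
    """Instantiate every task template with the same reference text."""
    return {task: tpl.format(text=text) for task, tpl in TASK_TEMPLATES.items()}

prompts = build_prompts(REFERENCE_EXCERPT)
```

Because every prompt is instantiated from the same `text`, all models see identical informational conditions by construction.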

### 3. RAG Pipeline

Retrieval-Augmented Generation (RAG) was applied to multiple model
families. The pipeline includes:

- document indexing
- dense similarity retrieval
- context injection through prompt augmentation
- answer synthesis using guidance prompts

RAG improved factual grounding in nearly all models.
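
The four stages above can be sketched end-to-end. This is a minimal toy version: the "dense" retriever is a bag-of-words cosine similarity stand-in for a real sentence encoder, and the documents are illustrative, not the project corpus:

```python
# Toy sketch of the four RAG stages; not the project's actual pipeline.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in 'dense' vector: term counts (a real pipeline uses a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Document indexing
docs = [
    "The Eagle landed in the Sea of Tranquility.",
    "The crew reported low fuel during descent.",
]
index = [(d, embed(d)) for d in docs]

# 2. Dense similarity retrieval
def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

# 3. Context injection + 4. answer synthesis via a guidance prompt
def build_rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_rag_prompt("Where did the Eagle land?")
```

The final prompt is then sent to the generator model, which is how context injection grounds the answer in retrieved text rather than the model's parametric memory.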

### 4. Recursive Editing Framework

A lightweight iterative refinement procedure was implemented:

1. **Draft Generation:**
   The primary model produces an initial output.

2. **AI-Based Critique:**
   A secondary SLM evaluates clarity, accuracy, faithfulness, and relevance.

3. **Refinement Step:**
   A revision prompt integrates the critique and generates an improved text.

4. **Stopping Condition:**
   The cycle ends after a fixed number of iterations or when the critique
   stabilizes.

This approach allowed weaker SLMs to yield higher-quality results
without relying on large models.
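
The control flow of the loop can be sketched as follows. The `generate`, `critique`, and `refine` functions are hypothetical placeholders standing in for the actual SLM calls:

```python
# Sketch of the draft -> critique -> refine loop.
# generate/critique/refine are hypothetical stand-ins for real SLM calls.

def generate(prompt: str) -> str:
    return "Draft answer about the Apollo-11 landing"      # stand-in draft

def critique(text: str) -> str:
    # Stand-in critic: flags the draft once, then reports no issues.
    return "" if "refined" in text else "Add the landing site."

def refine(text: str, feedback: str) -> str:
    return text + " (refined: " + feedback + ")"           # stand-in revision

def recursive_edit(prompt: str, max_iters: int = 3) -> str:
    text = generate(prompt)                  # 1. draft generation
    for _ in range(max_iters):               # 4. fixed iteration budget
        feedback = critique(text)            # 2. AI-based critique
        if not feedback:                     # 4. critique has stabilized
            break
        text = refine(text, feedback)        # 3. refinement step
    return text

result = recursive_edit("Summarize the landing.")
```

An empty critique doubles as the stabilization signal, so the loop exits early once the critic stops finding issues.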

### 5. Environmental Measurement

Environmental footprint data was captured with **CodeCarbon**, recording:

- CPU/GPU energy usage
- carbon emissions
- PUE-adjusted overhead

These measurements enabled comparison with published metrics for
commercial LLMs.
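
The PUE adjustment itself is a simple multiplication on top of the measured energy. A minimal sketch of the arithmetic, with all numbers chosen purely for illustration (not project measurements):

```python
# Sketch: PUE-adjusted emissions from measured energy.
# All figures are illustrative placeholders, not project results.

measured_energy_kwh = 0.050   # CPU/GPU energy reported by the tracker
pue = 1.5                     # data-center Power Usage Effectiveness
carbon_intensity = 400.0      # grams CO2e per kWh of grid electricity

facility_energy_kwh = measured_energy_kwh * pue        # PUE-adjusted overhead
emissions_g = facility_energy_kwh * carbon_intensity   # grams CO2e

print(f"{emissions_g:.1f} g CO2e")
```

CodeCarbon applies this kind of adjustment internally when it converts tracked hardware energy into facility-level emissions estimates.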

### 6. Human Evaluation (Single-Blind)

A structured Google Form experiment collected:

- **source identification** (commercial vs. open-source)
- **quality ratings** on accuracy, faithfulness, relevance, and clarity
  (1–5 scale)

Outputs were randomized and anonymized to avoid bias. This provided a
perception-based counterpart to the technical evaluation.
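
The randomization step can be sketched as follows. Model names and outputs are hypothetical stand-ins; the point is that raters see only shuffled texts while the source key stays hidden:

```python
# Sketch: anonymizing and shuffling outputs for the single-blind study.
# Model names and texts are illustrative stand-ins.
import random

outputs = {
    "commercial-model": "Output A text",
    "open-source-model": "Output B text",
}

def anonymize(outputs: dict[str, str], seed: int = 42) -> tuple[list[str], dict[int, str]]:
    """Return shuffled texts plus a hidden key mapping position -> source."""
    items = list(outputs.items())
    random.Random(seed).shuffle(items)
    key = {i: name for i, (name, _) in enumerate(items)}  # kept hidden from raters
    texts = [text for _, text in items]                   # what raters see
    return texts, key

texts, key = anonymize(outputs)
```

A fixed seed keeps the shuffle reproducible for analysis while still decoupling presentation order from model identity.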

### 7. Analysing the Results

....

### 8. Publishing an Article

....

---

## 📊 Key Findings

- FINDING1.....
- FINDING2.....
- FINDING3.....
- FINDING4.....

---

## 🔮 Future Work

- Evaluate additional open-source model families across diverse tasks
- Test optimized pipelines in specialized domains (medical, legal, technical writing)
- Track the carbon footprint across the full lifecycle (training to deployment)
- Conduct ablation studies isolating the contributions of RAG vs. recursive editing

---

## 📢 Communication Strategy

The research findings will be shared through formats designed for different
audiences and purposes:

### For Researchers

A comprehensive research article will document the complete experimental design,
statistical analysis, and implications.

🔗 **[View Article](link1)**

### For Practitioners & Educators

An executive presentation provides a visual overview of the research question,
methodology, and key findings without requiring deep technical background.

🔗 **[View Presentation](link2)**

### For the Community

A public evaluation study invites participation in assessing AI-generated texts.
This crowdsourced data forms a critical component of the research.

🔗 **[Participate in Study](link3)**

### For Reproducibility

All materials (dataset, prompts, model outputs, evaluation scripts, and carbon
tracking logs) are publicly available in this repository.

🔗 **[Browse Repository](https://github.com/banuozyilmaz2-jpg/ELO2-GREEN-AI)**

---

## 👥 Contributors

- [Amro Mohamed](https://github.com/Elshikh-Amro)
- [Aseel Omer](https://github.com/AseelOmer)
- [Banu Ozyilmaz](https://github.com/doctorbanu)
- [Caesar Ghazi](https://github.com/CaesarGhazi)
- [Reem Osama](https://github.com/reunicorn1)
- [Safia Gibril Nouman](https://github.com/Safi222)

---

## 🙏 Acknowledgments

Special thanks to the **MIT Emerging Talent Program** for guidance and
feedback throughout the project.