Merge pull request #29 from MIT-Emerging-Talent/meeting_minutes

AseelOmer · web-flow · commit e6ecc5c83863 · 2025-11-22T13:57:47.000+02:00
Milestone 4: add milestone 3 meeting minutes to repo
diff --git a/meeting_minutes/README.md b/meeting_minutes/README.md
@@ -1,4 +1,5 @@
 <!-- markdownlint-disable MD013 -->
+<!-- Disabled MD013 (Line length) for better readability -->
 
 # 🗓️ Meeting Minutes – Environmental Impact of AI Models
 
@@ -21,10 +22,30 @@ By the end of Milestone 1, the project established its scope, research framework
 
 ## ⚙️ Milestone 2 – Tool Setup & Experiment Planning
 
-**Timeline:** October 15 – Ongoing
+**Timeline:** October 15 – November 6, 2025
 
 With the research framework and scope finalized in Milestone 1, **Milestone 2** focuses on preparing the experimental environment and defining how sustainability metrics will be measured. This phase involves setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers like AWS, Microsoft, and Google.
 
 The team also plans to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable is the **experimental design document**, which will outline the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.
 
-This milestone sets the foundation for **Milestone 3**, where real model experiments and energy tracking will begin.
+By the end of Milestone 2, the team completed the technical setup, finalized the measurement pipeline, and validated that all tracking tools operate consistently across model types—ensuring a smooth transition into Milestone 3, where full experiments will be executed.
+
+## 📊 Milestone 3 – Model Benchmarking & Data Collection
+
+**Timeline:** November 7 – November 18, 2025
+
+Milestone 3 marks the beginning of the full experimental phase. Using the measurement pipeline and tooling established in Milestone 2, the team runs benchmark tasks on both proprietary and open-source models to collect data on **accuracy** and **environmental impact**. This includes tracking **energy consumption and carbon emissions** for each tested model under consistent test conditions.
+
+During this phase, the team also validates accuracy results on selected reasoning and summarization tasks, investigates irregular outputs, and updates evaluation scripts when needed. Additional observations such as **inference time, token throughput**, and **hardware utilization** are recorded to support later analysis.
+
+By the end of Milestone 3, the project has produced a complete experimental dataset covering sustainability metrics and accuracy scores for all evaluated models, providing a strong foundation for **Milestone 4**, which focuses on human evaluation and qualitative assessment.
+
+## 🧪 Milestone 4 – Human Evaluation & Survey Analysis
+
+**Timeline:** November 19 – ongoing
+
+Milestone 4 centers on incorporating **human judgment** into the benchmarking process. The team prepares a Google Form survey designed to compare model outputs side by side. Participants evaluate **clarity, coherence, informativeness, factuality,** and **overall preference**.
+
+Once responses are collected, the team analyzes the results by aggregating scores, assessing agreement among reviewers, and comparing human preferences with automated accuracy metrics from earlier milestones. This helps identify where quantitative and qualitative assessments align or diverge.
+
+By the end of Milestone 4, the project will integrate the human evaluation results into the broader dataset, enabling a more nuanced understanding of model performance and laying the groundwork for **Milestone 5**.
diff --git a/meeting_minutes/milestone3_meetings.md b/meeting_minutes/milestone3_meetings.md
@@ -0,0 +1,155 @@
+<!-- markdownlint-disable MD024 -->
+<!-- Disabled MD024 (Multiple headings with the same content) rule
+because repeated headings (Summary, Action Items) are intentionally used
+across multiple sections for structural clarity.
+-->
+# Milestone 3 Meeting Minutes
+
+## Meeting 14
+
+**Date:** November 7, 2025 (Friday, 3:00 PM EST)  
+**Attendees:** Amro, Aseel, Caesar, Safia, Banu
+
+### Summary
+
+- **Amro** presented the latest progress on his work with **Mistral 7B** integrated
+  with **RAG**. The model showed improved accuracy and more contextually aligned
+  responses. Although latency increased (64 → 152), it was considered acceptable
+  given the model’s size and the team’s limited compute.
+- **Caesar** tested **CodeCarbon** and the **Emissions Tracker** and reported
+  inconsistent results. Amro suggested trying an **offline version of
+  CodeCarbon**.
+- **Safia** continued experiments with **SLM (TinyLlama) + RAG** and will begin
+  testing the **unified test prompts** next.
+- Based on **Evan’s feedback**, the team decided to **evaluate outputs rather
+  than models**. **Python evaluation libraries** produced weak results, so the
+  team will use a **hybrid evaluation**: AI-based plus human-based.
+- The previously discussed **Google Form** idea received positive feedback from
+  Evan and will serve as a **human-based evaluation tool**.
+- The **initial Google Form structure** was outlined:
+  - Intro section with a short project description.
+  - Following sections will show **model-generated texts** (open-source and
+    commercial) in **random order**.
+  - Participants will **guess the source** and rate clarity, relevance, and
+    accuracy on a **1–5 scale**.
+  - Per Evan’s suggestion, the **target audience** should be **diverse and
+    multicultural**, so the form will be shared in the **cohort group**.
+
+### Action Items
+
+- The team aims to **meet again tomorrow**, depending on availability.
+- The **recursive model approach** will be **re-evaluated**.
+- **README files** will be prepared for all selected models in the repository.
+- Work on the **Google Form** will begin, targeting a **November 15**
+  publication date.
+
+### Future Work
+
+- Once published, the form will remain open for **two weeks** for response
+  collection and analysis.
+- In the **first week of December (final week of ELO2)**, the data will be
+  analyzed manually and used to write the **research article**.
+
+---
+
+## Meeting 15
+
+**Date:** November 9, 2025 (Sunday, 2:30 PM EST)  
+**Attendees:** Amro, Aseel, Banu, Caesar, Reem, Safia
+
+### Summary
+
+- Due to scheduling conflicts, the meeting planned for Saturday was held today.
+- A brief recap of Meeting 14 was provided.
+- The team reviewed the general work plan and discussed the three models under
+  testing.
+- Caesar reported that the CodeCarbon issue was resolved using Amro’s prior
+  suggestion. Caesar’s distilled model initially failed on creative tasks.
+- Amro recommended refining prompts. After adjusting guidance, temperature, and
+  other parameters live during the meeting, Caesar’s model improved and produced
+  two correct creative outputs out of three.
+- Amro shared that Mistral 7B also improved in creative tasks after adopting the
+  new guidance prompts.
+- Reem presented her proposal to modify the recursive reasoning method into a
+  **recursive editing** approach. The process involves:
+  - One model generating a draft.
+  - A feedback provider model reviewing it.
+  - A refinement cycle combining draft and feedback.
+  - Iteration until quality is high or a limit is reached.
+  - Evaluation criteria vary by task.
+- The current model lineup:
+  - **Caesar:** LaMini (distilled open-source) + RAG
+  - **Amro:** Mistral 7B + RAG
+  - **Safia:** TinyLlama (SLM) + RAG
+- The team agreed to apply the recursive editing framework to Safia’s model as
+  the **fourth setup**.
+- Human evaluation plans were discussed, focusing on participant-based accuracy
+  and quality assessment. Final testing will be completed this week.
+
+### Action Items
+
+- Reem, Aseel, and Banu will implement **recursive editing** on the small model
+  (TinyLlama + RAG + Recursive Editing).
+- Caesar will apply recursive editing to the distilled model.
+- Amro will test the method on Mistral.
+- All members will prepare outputs and documentation before **Friday,
+  November 14**.
+- Friday’s meeting will review results and prepare the **Google Form**, which
+  will be finalized on **Saturday, November 15**.
+
+---
+
+## Meeting 16
+
+**Date:** November 14, 2025 (Friday, 2:30 PM EST)  
+**Attendees:** Amro, Aseel, Banu, Caesar, Reem, Safia
+
+### Summary
+
+- After asynchronous discussions, the team determined that **TinyLlama is not
+  suitable** for recursive editing. The team switched to **Microsoft’s Phi-3**,
+  but it was removed from Hugging Face. New SLMs were selected:
+  **Reem → Alibaba Qwen**, **Aseel → Google Gemma**.
+- **Caesar** reported **token limitation issues** during recursive editing.
+- **Reem** shared updates on recursive cycle experiments and issues related to
+  **context window tuning** and **token allocation**.
+- **Amro** proposed using **specialized small models** for focused tasks
+  (critique and refinement) instead of relying on large models. Retrieval and
+  initial generation could be done by the user, while a small model refines the
+  text.
+- Technical discussions included context window tuning, max-token adjustments,
+  and restoring capacity by splitting loops across notebook cells.
+- The team reaffirmed that **GPUs significantly outperform CPUs**.
+- The team began discussing the **Google Form methodology**, deciding to use
+  **vague model descriptions** to avoid bias.
+
+### Action Items
+
+- Team will **meet again tomorrow at 2:30 PM EST** to finalize remaining work.
+- **Safia and Banu** will develop the **Google Form** after the meeting and send
+  it to Evan by Monday.
+- **Reem and Aseel** will continue working on their models.
+- All members must finalize **implementations and documentation**.
+
+---
+
+## Meeting 17
+
+**Date:** November 15, 2025 (Saturday, 2:30 PM EST)  
+**Attendees:** Amro, Aseel, Caesar, Reem, Banu
+
+### Summary
+
+- **Aseel** confirmed she pushed her model and related work earlier today.
+- **Reem** reported that she is finalizing outputs for upload.
+- **Amro** improved his model documentation for clarity and organization.
+- The team discussed the **structure of the Google Form**, focusing on whether
+  to include only **open-source model outputs** or also **commercial outputs**.
+  The group will seek **Evan’s guidance** before finalizing the structure.
+
+### Action Items
+
+- **Create the first Google Form draft tomorrow** and prepare to present it on
+  Monday.
+- **Publish the final form on Monday** after incorporating Evan’s feedback.
+- **Finalize all model documentation** and begin reorganizing the repository.