Commit ef1fa8f

Merge branch 'main' of https://github.com/MIT-Emerging-Talent/ELO2_GREEN_AI into Commercial-models

2 parents 895dd29 + 45639cc

12 files changed: 1804 additions & 0 deletions

Lines changed: 77 additions & 0 deletions
# Model Testing Metrics

|Skill Type (Task)|What It Tests|Example Dataset|Metric to Measure Accuracy|
|-----------------|-------------|---------------|--------------------------|
|Reasoning / Logic|Mathematical reasoning|GSM8K|Accuracy (correct answers / total)|
|Commonsense QA|Everyday reasoning and knowledge|PIQA, BoolQ|Accuracy|
|Summarization|Condensing information|CNN/DailyMail, XSum|ROUGE-L, BERTScore|
|Code Generation|Logical structure|HumanEval-lite, MBPP|Pass@k|
## Datasets

### GSM8K (Grade School Math 8K)

GSM8K is a dataset of 8.5K high-quality, linguistically diverse
grade school math word problems. It was created to support the task
of question answering on basic mathematical problems that require multi-step
reasoning.
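Scoring GSM8K with the accuracy metric from the table above can be sketched as follows. This assumes the usual GSM8K convention that each reference solution ends with a line like `#### 72`; the helper names are ours, not part of the dataset.

```python
def extract_answer(solution: str) -> str:
    # GSM8K reference solutions conventionally end with "#### <answer>"
    return solution.split("####")[-1].strip()

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    # accuracy = correct answers / total questions
    correct = sum(
        pred.strip() == extract_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```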
### BoolQ

BoolQ is a question answering dataset of yes/no questions containing
15,942 examples. The questions are naturally occurring: they were
generated in unprompted and unconstrained settings.
### PIQA

This dataset introduces the task of physical commonsense reasoning through a
corresponding benchmark, Physical Interaction: Question Answering
(PIQA).
### Extreme Summarization (XSum) Dataset

Each example has three features:

- document: the input news article.
- summary: a one-sentence summary of the article.
- id: the BBC ID of the article.
### The CNN / DailyMail Dataset

This is an English-language dataset containing just over
300k unique news articles written by journalists at CNN and the Daily Mail.
The current version supports both extractive and abstractive summarization.
### The HumanEval Dataset

Released by OpenAI, HumanEval includes 164 programming problems, each
with a function signature, docstring, body, and several unit tests.
The problems were handwritten to ensure they are not included in the
training set of code generation models.
### MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming
problems, designed to be solvable by entry-level programmers, covering
programming fundamentals, standard library functionality, and so on.
Each problem consists of a task description, a code solution, and three
automated test cases.
## Metrics

### Pass@1

The percentage of problems for which the model’s first generated solution
passes all tests; it is the k = 1 case of the Pass@k metric listed in the
table above.
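Pass@k is conventionally estimated with the unbiased formula from the HumanEval paper: generate n samples per problem, count the c samples that pass all tests, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated, c of them pass."""
    if n - c < k:
        # every size-k draw must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, i.e., the fraction of first attempts that pass all tests.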
### BERTScore

It measures how similar two pieces of text are in meaning, not just in word
overlap. It uses BERT embeddings (or similar transformer embeddings) to
compare the semantic content of the generated text and the reference text.
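The matching arithmetic behind BERTScore can be illustrated without the model itself: each candidate token greedily matches its most similar reference token (precision) and vice versa (recall), and F1 combines the two. In this toy sketch, hypothetical plain vectors stand in for contextual BERT embeddings, and the names are ours:

```python
from math import sqrt

def cosine(u: list[float], v: list[float]) -> float:
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore_f1(cand_vecs: list[list[float]], ref_vecs: list[list[float]]) -> float:
    # Greedy matching: each candidate token takes its best reference match
    # (precision) and each reference token its best candidate match (recall).
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The real metric applies this same matching over contextual transformer embeddings (optionally IDF-weighted), which is what lets it credit paraphrases that share no surface words.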
### ROUGE, or Recall-Oriented Understudy for Gisting Evaluation

ROUGE is a set of metrics and a software package used for evaluating automatic
summarization and machine translation software in natural language processing.
The metrics compare an automatically produced summary or translation against
a reference (human-produced) summary or translation, or a set of references.
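ROUGE-L, the variant named in the table, scores the longest common subsequence (LCS) between candidate and reference. A minimal sketch of the sentence-level F-measure (function names are ours; the official package also offers ROUGE-N and weighted/multi-reference variants):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```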

meeting_minutes/README.md

Lines changed: 30 additions & 0 deletions
<!-- markdownlint-disable MD013 -->

# 🗓️ Meeting Minutes – Environmental Impact of AI Models

This directory documents the weekly progress and decision-making process for the research project on **the environmental and performance trade-offs between large proprietary and small open-source AI models**.

Each meeting entry outlines team discussions, feedback, experimental progress, and assigned tasks across project milestones.
## 🧭 Milestone 1 – Scoping & Research Question Refinement

**Timeline:** September 27 – October 14, 2025

The first milestone focused on refining the research direction and defining a clear, measurable problem within **Green AI**. After exploring various AI-related topics, the team finalized the project title — **“Green AI Benchmarking of Foundation Models”** — and the research question:

> Can open-source LLMs match the accuracy of commercial models while reducing environmental impact?

Key progress included reviewing literature on energy, carbon, and water use in AI systems, selecting benchmark tasks (**reasoning** and **summarization**), and identifying evaluation metrics for **accuracy** and **environmental footprint**. The team also chose comparison models (**GPT-4** and **Mistral-7B**), created shared documentation, and distributed responsibilities among members.

By the end of Milestone 1, the project established its scope, research framework, and collaborative infrastructure, setting the stage for **Milestone 2**, focused on tool setup and metric calibration.
## ⚙️ Milestone 2 – Tool Setup & Experiment Planning

**Timeline:** October 15 – Ongoing

With the research framework and scope finalized in Milestone 1, **Milestone 2** focuses on preparing the experimental environment and defining how sustainability metrics will be measured. This phase involves setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers like AWS, Microsoft, and Google.

The team also plans to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable is the **experimental design document**, which will outline the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.

This milestone sets the foundation for **Milestone 3**, where real model experiments and energy tracking will begin.
Lines changed: 180 additions & 0 deletions
<!-- markdownlint-disable MD024 -->
<!-- Disabled MD024 (Multiple headings with the same content) rule
because repeated headings (Summary, Action Items) are
intentionally used across multiple sections for structural clarity. -->
# Milestone 1 Meeting Minutes

## Meeting 1

**Date:** September 27, 2025 (Saturday, 10:00 AM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu

### Summary

- Group members met and introduced themselves.
- Project topic suggestions were presented:
  - *AI Jobs vs Real Jobs* (continuation of CDSP)
  - *Reddit Mental Health Text Analysis*
  - *Machine Learning for Climate–Environmental Data*

### Action Items

- Conduct a **domain search** on the proposed topics.
- Bring **alternative project ideas** to the next meeting.
- Create a [**Google Doc**](https://docs.google.com/document/d/1dk0j0GUoDWqBHmLArcS2xoW5ct5nOjdlCeX3P-yhhOw/edit?tab=t.0)
  to facilitate asynchronous collaboration.

---
## Meeting 2

**Date:** September 29, 2025 (Monday, 12:00 PM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu

### Summary

- Members presented new project ideas and ELO2 process plans:
  - *Mental Health of University Students in Sudan*
  - *Probabilistic Dental Triage System with Synthetic Data Generation for
    Resource-Limited Settings*
  - *Project: Green AI Benchmarking of Foundation Models*
  - *Green AI — Energy & Water Efficiency in Machine Learning*
- Previously proposed topics were dropped due to various constraints.
- The new ideas were discussed, but no final consensus was reached.

### Action Items

- All members will research the newly proposed topics before the next meeting.
- The group will reach a **final decision** on the project topic at the next
  session.

---
## Meeting 3

**Date:** September 30, 2025 (Tuesday, 1:30 PM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu

### Summary

- The topics discussed in the previous meeting were revisited.
- After evaluating the group’s collective knowledge, experience, and skills,
  the team decided that **“Project: Green AI Benchmarking of Foundation
  Models”** was the most suitable topic for the ELO2 project.

### Action Items

- Conduct **domain research** on the selected project topic.

---
## Meeting 4

**Date:** October 5, 2025 (Sunday, 12:00 PM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu, Safia

### Summary

- Safia officially joined the project team.
- Amro presented a [**two-month (ELO2 deadline) project plan**](https://docs.google.com/document/d/19OCqflqeRLHzdPs9URrRWPzIdh3g1uw9TgX7-d_SXp8/edit?tab=t.0#heading=h.qd58vuomlp42).
- The team discussed **how to kick off the project**, including **milestones,
  constraints, and deliverables**.
- During domain research, Reem found a [**recently published study**](https://mitemergingtalent.slack.com/files/U082U854W8Y/F09JUBJQ9C2/2505.09598v4.pdf)
  with striking methodological similarities to the group’s topic and shared it
  with the team.

### Action Items

- Seek **Evan’s feedback** on how to proceed with the project in light of the
  new findings.

---
## Meeting 5

**Date:** October 7, 2025 (Tuesday, 11:00 AM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu, Safia

### Summary

- Based on Evan’s feedback, the group decided to **extend the topic-finalization
  phase** by approximately two weeks and to adjust the project’s focus.
- Members proposed ways to **refine and make the project more original**, such
  as:
  - Comparing *Big AI vs Small AI* models
  - Evaluating *Accuracy vs Eco-Friendliness*

### Action Items

- Conduct **in-depth research** to refine and strengthen the project’s
  originality.
- Review the **sources cited in the research paper** previously shared by Reem.

---
## Meeting 6

**Date:** October 9, 2025 (Thursday, 10:30 AM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu, Safia

### Summary

- The group held a **brainstorming session** to further develop and differentiate
  the project topic.
- Amro drafted a [**preliminary project plan**](https://mitemergingtalent.slack.com/files/U082U854W8Y/F09KJCKUEUB/approach_.pdf)
  based on the discussion.

### Action Items

- Agreed to hold another meeting the following day to finalize the details.
- A GitHub repository will be created for the project.

---
## Meeting 7

**Date:** October 10, 2025 (Friday, 12:00 PM EST)
**Attendees:** Amro, Reem, Caesar, Banu

### Summary

- A **new and original research question** was finalized:
  *“To what extent can open-source LLMs achieve comparable accuracy to
  corporate (commercial) models while significantly reducing environmental
  footprint?”*
- A new [**Google Doc**](https://docs.google.com/document/d/1BAoWHe8D3c_QAEFugS1CNEUqU5jugBwg1dFJE6-LVQo/edit?tab=t.0)
  was created to share useful resources and references for the project.

### Action Items

- All members to gain **basic knowledge about RAG and distilled models**.
- **Banu and Aseel:** Select which models to use.
- **Caesar and Safia:** Define how to measure **accuracy metrics**.
- **Amro and Reem:** Define how to measure **environmental cost metrics**.

---
## Meeting 8

**Date:** October 14, 2025 (Tuesday, 1:30 PM EST)
**Attendees:** Amro, Aseel, Caesar, Banu, Safia

### Summary

- Members presented progress on their assigned tasks from the previous meeting.
- **Aseel & Banu:** Selected *GPT-4* (commercial) and *Mistral-7B* (open-source)
  models. Evaluation will focus on *reasoning* and *summarization* using *MMLU*
  and *Math* datasets. Detailed documentation can be found in the
  [Model Evaluation Report](https://docs.google.com/document/d/1oOYIdLDumoZyYqgsQuBXDlXr1yZfo1sJNNanmIEGD8I/edit?tab=t.0).
- **Caesar & Safia:** Suggested using the *LightEval* library with a customized
  dataset. Caesar demonstrated how to split the *GSM8K* dataset into a
  500-example subset. Detailed documentation can be found in the
  [Accuracy Notes](https://docs.google.com/document/d/19L4vX-67O-fNNSmY9S8QaHUULZwgzwmKVoJGdzSsUWo/edit?tab=t.0).
- **Amro & Reem:** Presented environmental metrics and detailed evaluation
  methods for environmental factors.

### Action Items

- Review all presented work by **October 16th**.
- Meet again on **October 16th** to **discuss task allocation** for the second
  milestone.
