Commit ef1fa8f

Merge branch 'main' of https://github.com/MIT-Emerging-Talent/ELO2_GREEN_AI into Commercial-models

2 parents 895dd29 + 45639cc

12 files changed: 1804 additions & 0 deletions

Lines changed: 77 additions & 0 deletions
# Model Testing Metrics

|Skill Type (Task)|What It Tests|Example Dataset|Metric to Measure Accuracy|
|-----------------|-------------|---------------|--------------------------|
|Reasoning / Logic|Mathematical reasoning|GSM8K|Accuracy (correct answers / total)|
|Commonsense QA|Everyday reasoning and knowledge|PIQA, BoolQ|Accuracy|
|Summarization|Condensing information|CNN/DailyMail, XSum|ROUGE-L, BERTScore|
|Code Generation|Logical structure|HumanEval-lite, MBPP|Pass@k|
## Datasets

### GSM8K (Grade School Math 8K)

GSM8K is a dataset of 8.5K high-quality, linguistically diverse
grade school math word problems. It was created to support the task
of question answering on basic mathematical problems that require multi-step
reasoning.
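Scoring GSM8K with the accuracy metric from the table above can be sketched as follows. This assumes the usual GSM8K convention that each reference solution ends with a line like `#### 72`; the helper names are ours, not part of the dataset.

```python
def extract_answer(solution: str) -> str:
    # GSM8K reference solutions conventionally end with "#### <answer>"
    return solution.split("####")[-1].strip()

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    # accuracy = correct answers / total questions
    correct = sum(
        pred.strip() == extract_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```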
### BoolQ

BoolQ is a question answering dataset of yes/no questions containing
15,942 examples. The questions are naturally occurring: they were
generated in unprompted and unconstrained settings.
### PIQA

This dataset introduces the task of physical commonsense reasoning through a
corresponding benchmark, Physical Interaction: Question Answering
(PIQA).
### Extreme Summarization (XSum) Dataset

Each example has three features:

- document: the input news article.
- summary: a one-sentence summary of the article.
- id: the BBC ID of the article.
### The CNN / DailyMail Dataset

This is an English-language dataset containing just over
300k unique news articles written by journalists at CNN and the Daily Mail.
The current version supports both extractive and abstractive summarization.
### The HumanEval Dataset

Released by OpenAI, HumanEval includes 164 programming problems, each
with a function signature, docstring, body, and several unit tests.
The problems were handwritten to ensure they are not included in the
training set of code generation models.
### MBPP

The benchmark consists of around 1,000 crowd-sourced Python programming
problems, designed to be solvable by entry-level programmers, covering
programming fundamentals, standard library functionality, and so on.
Each problem consists of a task description, a code solution, and three
automated test cases.
## Metrics

### Pass@1

The percentage of problems for which the model’s first generated solution
passes all tests; it is the k = 1 case of the Pass@k metric listed in the
table above.
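Pass@k is conventionally estimated with the unbiased formula from the HumanEval paper: generate n samples per problem, count the c samples that pass all tests, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated, c of them pass."""
    if n - c < k:
        # every size-k draw must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, i.e., the fraction of first attempts that pass all tests.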
### BERTScore

It measures how similar two pieces of text are in meaning, not just in word
overlap. It uses BERT embeddings (or similar transformer embeddings) to
compare the semantic content of the generated text and the reference text.
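The matching arithmetic behind BERTScore can be illustrated without the model itself: each candidate token greedily matches its most similar reference token (precision) and vice versa (recall), and F1 combines the two. In this toy sketch, hypothetical plain vectors stand in for contextual BERT embeddings, and the names are ours:

```python
from math import sqrt

def cosine(u: list[float], v: list[float]) -> float:
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore_f1(cand_vecs: list[list[float]], ref_vecs: list[list[float]]) -> float:
    # Greedy matching: each candidate token takes its best reference match
    # (precision) and each reference token its best candidate match (recall).
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The real metric applies this same matching over contextual transformer embeddings (optionally IDF-weighted), which is what lets it credit paraphrases that share no surface words.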
### ROUGE, or Recall-Oriented Understudy for Gisting Evaluation

ROUGE is a set of metrics and a software package used for evaluating automatic
summarization and machine translation software in natural language processing.
The metrics compare an automatically produced summary or translation against
a reference (human-produced) summary or translation, or a set of references.
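ROUGE-L, the variant named in the table, scores the longest common subsequence (LCS) between candidate and reference. A minimal sketch of the sentence-level F-measure (function names are ours; the official package also offers ROUGE-N and weighted/multi-reference variants):

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # classic dynamic-programming longest-common-subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```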

meeting_minutes/README.md

Lines changed: 30 additions & 0 deletions
<!-- markdownlint-disable MD013 -->

# 🗓️ Meeting Minutes – Environmental Impact of AI Models

This directory documents the weekly progress and decision-making process for the research project on **the environmental and performance trade-offs between large proprietary and small open-source AI models**.

Each meeting entry outlines team discussions, feedback, experimental progress, and assigned tasks across project milestones.
## 🧭 Milestone 1 – Scoping & Research Question Refinement

**Timeline:** September 27 – October 14, 2025

The first milestone focused on refining the research direction and defining a clear, measurable problem within **Green AI**. After exploring various AI-related topics, the team finalized the project title — **“Green AI Benchmarking of Foundation Models”** — and the research question:

> Can open-source LLMs match the accuracy of commercial models while reducing environmental impact?

Key progress included reviewing literature on energy, carbon, and water use in AI systems, selecting benchmark tasks (**reasoning** and **summarization**), and identifying evaluation metrics for **accuracy** and **environmental footprint**. The team also chose comparison models (**GPT-4** and **Mistral-7B**), created shared documentation, and distributed responsibilities among members.

By the end of Milestone 1, the project established its scope, research framework, and collaborative infrastructure, setting the stage for **Milestone 2**, focused on tool setup and metric calibration.
## ⚙️ Milestone 2 – Tool Setup & Experiment Planning

**Timeline:** October 15 – Ongoing

With the research framework and scope finalized in Milestone 1, **Milestone 2** focuses on preparing the experimental environment and defining how sustainability metrics will be measured. This phase involves setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers like AWS, Microsoft, and Google.

The team also plans to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable is the **experimental design document**, which will outline the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.

This milestone sets the foundation for **Milestone 3**, where real model experiments and energy tracking will begin.
Lines changed: 180 additions & 0 deletions
<!-- markdownlint-disable MD024 -->
<!-- Disabled MD024 (Multiple headings with the same content) rule
because repeated headings (Summary, Action Items) are
intentionally used across multiple sections for structural clarity. -->
# Milestone 1 Meeting Minutes

## Meeting 1

**Date:** September 27, 2025 (Saturday, 10:00 AM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu

### Summary

- Group members met and introduced themselves.
- Project topic suggestions were presented:
  - *AI Jobs vs Real Jobs* (continuation of CDSP)
  - *Reddit Mental Health Text Analysis*
  - *Machine Learning for Climate–Environmental Data*

### Action Items

- Conduct a **domain search** on the proposed topics.
- Bring **alternative project ideas** to the next meeting.
- Create a [**Google Doc**](https://docs.google.com/document/d/1dk0j0GUoDWqBHmLArcS2xoW5ct5nOjdlCeX3P-yhhOw/edit?tab=t.0)
  to facilitate asynchronous collaboration.

---
## Meeting 2

**Date:** September 29, 2025 (Monday, 12:00 PM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu

### Summary

- Members presented new project ideas and ELO2 process plans:
  - *Mental Health of University Students in Sudan*
  - *Probabilistic Dental Triage System with Synthetic Data Generation for
    Resource-Limited Settings*
  - *Project: Green AI Benchmarking of Foundation Models*
  - *Green AI — Energy & Water Efficiency in Machine Learning*
- Previously proposed topics were dropped due to various constraints.
- The new ideas were discussed, but no final consensus was reached.

### Action Items

- All members will research the newly proposed topics before the next meeting.
- The group will reach a **final decision** on the project topic at the next
  session.

---
## Meeting 3

**Date:** September 30, 2025 (Tuesday, 1:30 PM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu

### Summary

- The topics discussed in the previous meeting were revisited.
- After evaluating the group’s collective knowledge, experience, and skills,
  the team decided that **“Project: Green AI Benchmarking of Foundation
  Models”** was the most suitable topic for the ELO2 project.

### Action Items

- Conduct **domain research** on the selected project topic.

---
## Meeting 4

**Date:** October 5, 2025 (Sunday, 12:00 PM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu, Safia

### Summary

- Safia officially joined the project team.
- Amro presented a [**two-month (ELO2 deadline) project plan**](https://docs.google.com/document/d/19OCqflqeRLHzdPs9URrRWPzIdh3g1uw9TgX7-d_SXp8/edit?tab=t.0#heading=h.qd58vuomlp42).
- The team discussed **how to kick off the project**, including **milestones,
  constraints, and deliverables**.
- During domain research, Reem found a [**recently published study**](https://mitemergingtalent.slack.com/files/U082U854W8Y/F09JUBJQ9C2/2505.09598v4.pdf)
  with striking methodological similarities to the group’s topic and shared it
  with the team.

### Action Items

- Seek **Evan’s feedback** on how to proceed with the project in light of the
  new findings.

---
## Meeting 5

**Date:** October 7, 2025 (Tuesday, 11:00 AM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu, Safia

### Summary

- Based on Evan’s feedback, the group decided to **extend the topic-finalization
  phase** by approximately two weeks and to adjust the project’s focus.
- Members proposed ways to **refine and make the project more original**, such
  as:
  - Comparing *Big AI vs Small AI* models
  - Evaluating *Accuracy vs Eco-Friendliness*

### Action Items

- Conduct **in-depth research** to refine and strengthen the project’s
  originality.
- Review the **sources cited in the research paper** previously shared by Reem.

---
## Meeting 6

**Date:** October 9, 2025 (Thursday, 10:30 AM EST)
**Attendees:** Amro, Aseel, Reem, Caesar, Banu, Safia

### Summary

- The group held a **brainstorming session** to further develop and differentiate
  the project topic.
- Amro drafted a [**preliminary project plan**](https://mitemergingtalent.slack.com/files/U082U854W8Y/F09KJCKUEUB/approach_.pdf)
  based on the discussion.

### Action Items

- Agreed to hold another meeting the following day to finalize the details.
- A GitHub repository will be created for the project.

---
## Meeting 7

**Date:** October 10, 2025 (Friday, 12:00 PM EST)
**Attendees:** Amro, Reem, Caesar, Banu

### Summary

- A **new and original research question** was finalized:
  *“To what extent can open-source LLMs achieve comparable accuracy to
  corporate (commercial) models while significantly reducing environmental
  footprint?”*
- A new [**Google Doc**](https://docs.google.com/document/d/1BAoWHe8D3c_QAEFugS1CNEUqU5jugBwg1dFJE6-LVQo/edit?tab=t.0)
  was created to share useful resources and references for the project.

### Action Items

- All members to gain **basic knowledge about RAG and distilled models**.
- **Banu and Aseel:** Select which models to use.
- **Caesar and Safia:** Define how to measure **accuracy metrics**.
- **Amro and Reem:** Define how to measure **environmental cost metrics**.

---
## Meeting 8

**Date:** October 14, 2025 (Tuesday, 1:30 PM EST)
**Attendees:** Amro, Aseel, Caesar, Banu, Safia

### Summary

- Members presented progress on their assigned tasks from the previous meeting.
- **Aseel & Banu:** Selected *GPT-4* (commercial) and *Mistral-7B* (open-source)
  models. Evaluation will focus on *reasoning* and *summarization* using *MMLU*
  and *Math* datasets. Detailed documentation can be found in the
  [Model Evaluation Report](https://docs.google.com/document/d/1oOYIdLDumoZyYqgsQuBXDlXr1yZfo1sJNNanmIEGD8I/edit?tab=t.0).
- **Caesar & Safia:** Suggested using the *LightEval* library with a customized
  dataset. Caesar demonstrated how to split the *GSM8K* dataset into a
  500-example subset. Detailed documentation can be found in the
  [Accuracy Notes](https://docs.google.com/document/d/19L4vX-67O-fNNSmY9S8QaHUULZwgzwmKVoJGdzSsUWo/edit?tab=t.0).
- **Amro & Reem:** Presented environmental metrics and detailed evaluation
  methods for environmental factors.

### Action Items

- Review all presented work by **October 16th**.
- Meet again on **October 16th** to **discuss task allocation** for the second
  milestone.
