Commit e6ecc5c

Merge pull request #29 from MIT-Emerging-Talent/meeting_minutes
Milestone 4: add milestone 3 meeting minutes to repo
2 parents ca89768 + 1ba740e commit e6ecc5c

2 files changed

Lines changed: 178 additions & 2 deletions

meeting_minutes/README.md

Lines changed: 23 additions & 2 deletions
@@ -1,4 +1,5 @@
11
<!-- markdownlint-disable MD013 -->
2+
<!-- Disabled MD013 (Line length) for better readability -->
23

34
# 🗓️ Meeting Minutes – Environmental Impact of AI Models
45

@@ -21,10 +22,30 @@ By the end of Milestone 1, the project established its scope, research framework
2122

2223
## ⚙️ Milestone 2 – Tool Setup & Experiment Planning
2324

24-
**Timeline:** October 15 – Ongoing
25+
**Timeline:** October 15 – November 6, 2025
2526

2627
With the research framework and scope finalized in Milestone 1, **Milestone 2** focuses on preparing the experimental environment and defining how sustainability metrics will be measured. This phase involves setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers like AWS, Microsoft, and Google.
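
As a rough illustration of how the measured metrics relate to each other, the sketch below converts a measured energy figure into carbon and water estimates. It assumes carbon is energy multiplied by a grid carbon intensity and water is energy multiplied by a WUE value in litres per kWh. The function name, the dataclass, and all numeric inputs are hypothetical placeholders, not values from the project or from any tracking tool.

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    energy_kwh: float
    carbon_kg: float
    water_l: float


def estimate_footprint(energy_kwh: float,
                       grid_kg_co2_per_kwh: float,
                       wue_l_per_kwh: float) -> Footprint:
    """Derive carbon and water estimates from a measured energy value.

    All three inputs are supplied by the caller; nothing here is looked up
    from a real dataset.
    """
    if min(energy_kwh, grid_kg_co2_per_kwh, wue_l_per_kwh) < 0:
        raise ValueError("inputs must be non-negative")
    return Footprint(
        energy_kwh=energy_kwh,
        carbon_kg=energy_kwh * grid_kg_co2_per_kwh,
        water_l=energy_kwh * wue_l_per_kwh,
    )


# Placeholder inputs chosen only to show the arithmetic.
run = estimate_footprint(energy_kwh=0.5,
                         grid_kg_co2_per_kwh=0.4,
                         wue_l_per_kwh=1.8)
print(run)
```

In practice the energy value would come from one of the tracking tools, and the intensity and WUE factors from the provider or region being studied.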
2728

2829
The team also plans to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable is the **experimental design document**, which will outline the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.
2930

30-
This milestone sets the foundation for **Milestone 3**, where real model experiments and energy tracking will begin.
31+
By the end of Milestone 2, the team completed the technical setup, finalized the measurement pipeline, and validated that all tracking tools operate consistently across model types—ensuring a smooth transition into Milestone 3, where full experiments will be executed.
32+
33+
## 📊 Milestone 3 – Model Benchmarking & Data Collection
34+
35+
**Timeline:** November 7 – November 18, 2025
36+
37+
Milestone 3 marks the beginning of the full experimental phase. Using the measurement pipeline and tooling established in Milestone 2, the team runs benchmark tasks on both proprietary and open-source models to collect data on **accuracy** and **environmental impact**. This includes tracking **energy consumption and carbon emissions** for each tested model under consistent test conditions.
38+
39+
During this phase, the team also validates accuracy results on selected reasoning and summarization tasks, investigates irregular outputs, and updates evaluation scripts when needed. Additional observations such as **inference time, token throughput**, and **hardware utilization** are recorded to support later analysis.
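
One way to record inference time and token throughput is sketched below. It assumes the model is wrapped in a plain callable that takes a prompt and returns text, and it approximates the token count by splitting on whitespace. `measure_generation` and `fake_generate` are hypothetical names, and a real run would count tokens with the model's own tokenizer.

```python
import time
from typing import Callable


def measure_generation(generate: Callable[[str], str], prompt: str) -> dict:
    """Time one generation call and derive a rough tokens-per-second figure."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    # Whitespace split is only an approximation of the real token count.
    n_tokens = len(output.split())
    return {
        "latency_s": elapsed,
        "output_tokens": n_tokens,
        "tokens_per_s": n_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    time.sleep(0.01)
    return "a short placeholder answer"


stats = measure_generation(fake_generate, "Summarize the passage.")
print(stats)
```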
40+
41+
By the end of Milestone 3, the project has produced a complete experimental dataset covering sustainability metrics and accuracy scores for all evaluated models, providing a strong foundation for **Milestone 4**, which focuses on human evaluation and qualitative assessment.
42+
43+
## 🧪 Milestone 4 – Human Evaluation & Survey Analysis
44+
45+
**Timeline:** November 19 – ongoing
46+
47+
Milestone 4 centers on incorporating **human judgment** into the benchmarking process. The team prepares a Google Form survey designed to compare model outputs side-by-side. Participants evaluate **clarity, coherence, informativeness, factuality,** and **overall preference**.
48+
49+
Once responses are collected, the team analyzes the results by aggregating scores, assessing agreement among reviewers, and comparing human preferences with automated accuracy metrics from earlier milestones. This helps identify where quantitative and qualitative assessments align or diverge.
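
The aggregation and agreement steps could look roughly like the sketch below. It averages 1 to 5 ratings per model and uses Cohen's kappa as one possible agreement measure between two reviewers. The choice of kappa, the function names, and the sample ratings are illustrative assumptions rather than the team's documented method.

```python
from collections import Counter
from statistics import mean


def mean_scores(ratings: dict) -> dict:
    """Average the list of scores recorded for each model."""
    return {model: mean(scores) for model, scores in ratings.items()}


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical survey data: scores per model and two reviewers' preferences.
averages = mean_scores({"model_x": [4, 5, 3], "model_y": [2, 3, 3]})
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(averages, kappa)
```

With more than two reviewers per item, a multi-rater statistic would be a better fit than pairwise kappa.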
50+
51+
By the end of Milestone 4, the project integrates the human evaluation results into the broader dataset, enabling a more nuanced understanding of model performance and preparing the groundwork for **Milestone 5**.
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
1+
<!-- markdownlint-disable MD024 -->
2+
<!-- Disabled MD024 (Multiple headings with the same content) rule
3+
because repeated headings (Summary, Action Items) are intentionally used
4+
across multiple sections for structural clarity.
5+
-->
6+
# Milestone 3 Meeting Minutes
7+
8+
## Meeting 14
9+
10+
**Date:** November 7, 2025 (Friday, 3:00 PM EST)
11+
**Attendees:** Amro, Aseel, Caesar, Safia, Banu
12+
13+
### Summary
14+
15+
- **Amro** presented the latest progress on his work with **Mistral 7B** integrated
16+
with **RAG**. The model showed improved accuracy and more contextually aligned
17+
responses. Although latency increased (64 → 152), it was considered acceptable
18+
given the model’s size and the team’s limited compute.
19+
- **Caesar** tested **CodeCarbon** and the **Emissions Tracker** and reported
20+
inconsistent results. Amro suggested trying an **offline version of
21+
CodeCarbon**.
22+
- **Safia** continued experiments with **SLM (TinyLlama) + RAG** and will begin
23+
testing the **unified test prompts** next.
24+
- Based on **Evan’s feedback**, the team decided to **evaluate outputs rather
25+
than models**. **Python evaluation libraries** produced weak results, so the
26+
team will use a **hybrid evaluation**: AI-based plus human-based.
27+
- The previously discussed **Google Form** idea received positive feedback from
28+
Evan and will serve as a **human-based evaluation tool**.
29+
- The **initial Google Form structure** was outlined:
30+
- Intro section with a short project description.
31+
- Following sections will show **model-generated texts** (open-source and
32+
commercial) in **random order**.
33+
- Participants will **guess the source** and rate clarity, relevance, and
34+
accuracy on a **1–5 scale**.
35+
- Per Evan’s suggestion, the **target audience** should be **diverse and
36+
multicultural**, so the form will be shared in the **cohort group**.
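
The randomized, source-blind presentation described in the form structure above could be prepared along these lines. This is a minimal sketch: `build_blind_section`, the neutral "Text N" labels, and the sample outputs are all hypothetical, and the answer key is meant to stay with the team rather than appear in the form.

```python
import random


def build_blind_section(outputs: dict, seed: int):
    """Shuffle model outputs and replace model names with neutral labels.

    Returns the items to show participants and a private answer key that
    maps each neutral label back to its source.
    """
    rng = random.Random(seed)  # fixed seed keeps the order reproducible
    sources = list(outputs)
    rng.shuffle(sources)
    items = []
    answer_key = {}
    for position, source in enumerate(sources, start=1):
        label = f"Text {position}"
        items.append({"label": label, "text": outputs[source]})
        answer_key[label] = source
    return items, answer_key


# Placeholder outputs; real sections would hold the generated texts.
sample_outputs = {
    "open_source": "Output written by the open-source model.",
    "commercial": "Output written by the commercial model.",
}
items, answer_key = build_blind_section(sample_outputs, seed=7)
for item in items:
    print(item["label"], "-", item["text"])
```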
37+
38+
### Action Items
39+
40+
- The team aims to **meet again tomorrow**, depending on availability.
41+
- The **recursive model approach** will be **re-evaluated**.
42+
- **README files** will be prepared for all selected models in the repository.
43+
- Work on the **Google Form** will begin, targeting a **November 15**
44+
publication date.
45+
46+
### Future Work
47+
48+
- Once published, the form will remain open for **two weeks** for response
49+
collection and analysis.
50+
- In the **first week of December (final week of ELO2)**, the data will be
51+
analyzed manually and used to write the **research article**.
52+
53+
---
54+
55+
## Meeting 15
56+
57+
**Date:** November 9, 2025 (Sunday, 2:30 PM EST)
58+
**Attendees:** Amro, Aseel, Banu, Caesar, Reem, Safia
59+
60+
### Summary
61+
62+
- Due to scheduling conflicts, the meeting planned for Saturday was held today.
63+
- A brief recap of Meeting 14 was provided.
64+
- The team reviewed the general work plan and discussed the three models under
65+
testing.
66+
- Caesar reported that the CodeCarbon issue was resolved using Amro’s prior
67+
suggestion. His distilled model initially failed on creative tasks.
68+
- Amro recommended refining prompts. After adjusting guidance, temperature, and
69+
other parameters live during the meeting, Caesar’s model improved and produced
70+
two correct creative outputs out of three.
71+
- Amro shared that Mistral 7B also improved in creative tasks after adopting the
72+
new guidance prompts.
73+
- Reem presented her proposal to modify the recursive reasoning method into a
74+
**recursive editing** approach. The process involves:
75+
- One model generating a draft.
76+
- A feedback provider model reviewing it.
77+
- A refinement cycle combining draft and feedback.
78+
- Iteration until quality is high or a limit is reached.
79+
- Evaluation criteria vary by task.
80+
- The current model lineup:
81+
- **Caesar:** LaMini (distilled open-source) + RAG
82+
- **Amro:** Mistral 7B + RAG
83+
- **Safia:** TinyLlama (SLM) + RAG
84+
- The team agreed to apply the recursive editing framework to Safia’s model as
85+
the **fourth setup**.
86+
- Human evaluation plans were discussed, focusing on participant-based accuracy
87+
and quality assessment. Final testing will be completed this week.
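
The recursive editing cycle from Reem's proposal can be sketched as a simple loop. The version below assumes each role (drafting, feedback, refinement, scoring) is a plain callable, and it stops when a score threshold is met or a round limit is reached. The function names, the threshold, and the toy stand-ins are hypothetical and do not reflect the team's actual implementation.

```python
from typing import Callable


def recursive_edit(task: str,
                   draft_fn: Callable[[str], str],
                   feedback_fn: Callable[[str, str], str],
                   refine_fn: Callable[[str, str, str], str],
                   score_fn: Callable[[str, str], float],
                   threshold: float = 0.8,
                   max_rounds: int = 3):
    """Draft once, then alternate feedback and refinement.

    Stops early when score_fn reaches the threshold, otherwise after
    max_rounds refinements. Returns the final text and every version.
    """
    draft = draft_fn(task)
    history = [draft]
    for _ in range(max_rounds):
        if score_fn(task, draft) >= threshold:
            break
        feedback = feedback_fn(task, draft)
        draft = refine_fn(task, draft, feedback)
        history.append(draft)
    return draft, history


# Toy stand-ins for the models: each refinement appends a "+" and the
# score rises by 0.5 per "+", capped at 1.0.
final, history = recursive_edit(
    task="write a haiku",
    draft_fn=lambda task: "draft",
    feedback_fn=lambda task, draft: "add detail",
    refine_fn=lambda task, draft, feedback: draft + "+",
    score_fn=lambda task, draft: min(1.0, draft.count("+") / 2),
)
print(final, history)
```

Because the evaluation criteria vary by task, `score_fn` is the piece that would change most between task types.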
88+
89+
### Action Items
90+
91+
- Reem, Aseel, and Banu will implement **recursive editing** on the small model
92+
(TinyLlama + RAG + Recursive Editing).
93+
- Caesar will apply recursive editing to the distilled model.
94+
- Amro will test the method on Mistral.
95+
- All members will prepare outputs and documentation before **Friday,
96+
November 14**.
97+
- Friday’s meeting will review results and prepare the **Google Form**, which
98+
will be finalized on **Saturday, November 15**.
99+
100+
---
101+
102+
## Meeting 16
103+
104+
**Date:** November 14, 2025 (Friday, 2:30 PM EST)
105+
**Attendees:** Amro, Aseel, Banu, Caesar, Reem, Safia
106+
107+
### Summary
108+
109+
- After asynchronous discussions, the team determined that **TinyLlama is not
110+
suitable** for recursive editing. The team switched to **Microsoft’s Phi-3**,
111+
but it was removed from Hugging Face. New SLMs were selected:
112+
**Reem → Alibaba Qwen**, **Aseel → Google Gemma**.
113+
- **Caesar** reported **token limitation issues** during recursive editing.
114+
- **Reem** shared updates on recursive cycle experiments and issues related to
115+
**context window tuning** and **token allocation**.
116+
- **Amro** proposed using **specialized small models** for focused tasks
117+
(critique and refinement) instead of relying on large models. Retrieval and
118+
initial generation could be done by the user, while a small model refines the
119+
text.
120+
- Technical discussions included context window tuning, max-token adjustments,
121+
and restoring capacity by splitting loops across notebook cells.
122+
- The team reaffirmed that **GPUs outperform CPUs** significantly.
123+
- The team began discussing the **Google Form methodology**, deciding to use
124+
**vague model descriptions** to avoid bias.
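
The token allocation issue discussed above can be pictured as a budgeting problem: the draft, the feedback, and the space reserved for new output must all fit in one context window. The sketch below is an assumption-laden illustration, not the team's fix. It counts tokens by whitespace splitting, and `fit_to_budget` and its parameters are hypothetical names.

```python
def fit_to_budget(draft: str,
                  feedback: str,
                  context_window: int,
                  max_new_tokens: int,
                  feedback_pct: int = 30):
    """Truncate draft and feedback so the prompt leaves room for generation.

    feedback_pct is the percentage of the remaining prompt budget given to
    the feedback text; the draft receives the rest.
    """
    available = context_window - max_new_tokens
    if available <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    feedback_budget = available * feedback_pct // 100
    draft_budget = available - feedback_budget

    def truncate(text: str, budget: int) -> str:
        # Whitespace tokens stand in for real tokenizer output.
        return " ".join(text.split()[:budget])

    return truncate(draft, draft_budget), truncate(feedback, feedback_budget)


long_draft = " ".join(f"w{i}" for i in range(12))
short_draft, short_feedback = fit_to_budget(
    long_draft, "tighten ending", context_window=20, max_new_tokens=10
)
print(short_draft)
print(short_feedback)
```

A real pipeline would measure lengths with the model's tokenizer, since whitespace counts can differ substantially from actual token counts.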
125+
126+
### Action Items
127+
128+
- Team will **meet again tomorrow at 2:30 PM EST** to finalize remaining work.
129+
- **Safia and Banu** will develop the **Google Form** after the meeting and send
130+
it to Evan by Monday.
131+
- **Reem and Aseel** will continue working on their models.
132+
- All members must finalize **implementations and documentation**.
133+
134+
---
135+
136+
## Meeting 17
137+
138+
**Date:** November 15, 2025 (Saturday, 2:30 PM EST)
139+
**Attendees:** Amro, Aseel, Caesar, Reem, Banu
140+
141+
### Summary
142+
143+
- **Aseel** confirmed she pushed her model and related work earlier today.
144+
- **Reem** reported that she is finalizing outputs for upload.
145+
- **Amro** improved his model documentation for clarity and organization.
146+
- The team discussed the **structure of the Google Form**, focusing on whether
147+
to include only **open-source model outputs** or also **commercial outputs**.
148+
The group will seek **Evan’s guidance** before finalizing the structure.
149+
150+
### Action Items
151+
152+
- **Create the first Google Form draft tomorrow** and prepare to present it on
153+
Monday.
154+
- **Publish the final form on Monday** after incorporating Evan’s feedback.
155+
- **Finalize all model documentation** and begin reorganizing the repository.
