meeting_minutes/README.md (26 changes: 18 additions & 8 deletions)

**Timeline:** October 15 – November 6, 2025

With the research framework and scope finalized in Milestone 1, **Milestone 2** focused on preparing the experimental environment and defining how sustainability metrics would be measured. This phase involved setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers such as AWS, Microsoft, and Google.
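The conversions behind such a pipeline can be illustrated with a minimal sketch: a tracker such as CodeCarbon reports energy in kWh, which can then be turned into carbon and water footprints. The grid-intensity and WUE values below are illustrative placeholders, not figures from any provider's dataset.

```python
# Minimal sketch of the metric conversions behind the measurement pipeline.
# The grid intensity and WUE figures are illustrative placeholders only.

def carbon_kg(energy_kwh: float, grid_intensity_g_per_kwh: float) -> float:
    """CO2-equivalent emissions (kg) for a measured energy draw."""
    return energy_kwh * grid_intensity_g_per_kwh / 1000.0

def water_litres(energy_kwh: float, wue_l_per_kwh: float) -> float:
    """Water footprint (litres) using a Water Usage Effectiveness factor."""
    return energy_kwh * wue_l_per_kwh

# e.g. energy reported by a tracker such as CodeCarbon for one benchmark run
run_energy_kwh = 0.42
print(round(carbon_kg(run_energy_kwh, 400.0), 3))   # placeholder 400 gCO2/kWh grid
print(round(water_litres(run_energy_kwh, 1.8), 3))  # placeholder 1.8 L/kWh WUE
```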

The team also planned to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable was the **experimental design document**, which outlined the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.

By the end of Milestone 2, the team completed the technical setup, finalized the measurement pipeline, and validated that all tracking tools operated consistently across model types, ensuring a smooth transition into Milestone 3, where the full experiments would be executed.

## 📊 Milestone 3 – Model Benchmarking & Data Collection

**Timeline:** November 7 – November 18, 2025

Milestone 3 marked the beginning of the full experimental phase. Using the measurement pipeline and tooling established in Milestone 2, the team ran benchmark tasks on both proprietary and open-source models to collect data on **accuracy** and **environmental impact**. This included tracking **energy consumption and carbon emissions** for each tested model under consistent test conditions.

During this phase, the team also validated accuracy results on selected reasoning and summarization tasks, investigated irregular outputs, and updated evaluation scripts when needed. Additional observations such as **inference time, token throughput**, and **hardware utilization** were recorded to support later analysis.
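Recording latency and token throughput around a model call can be sketched roughly as below; `generate` here is a hypothetical stand-in for any model interface, not one of the project's actual evaluation scripts.

```python
import time

def run_with_metrics(generate, prompt):
    """Time a single model call and derive simple throughput figures.

    `generate` is a hypothetical callable returning a list of output tokens.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return {
        "latency_s": elapsed,
        "tokens": len(tokens),
        "tokens_per_s": len(tokens) / elapsed if elapsed > 0 else float("inf"),
    }

# Stand-in "model" for illustration: splits the prompt into words.
metrics = run_with_metrics(lambda p: p.split(), "the quick brown fox")
print(metrics["tokens"])  # 4
```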

By the end of Milestone 3, the project had produced a complete experimental dataset covering sustainability metrics and accuracy scores for all evaluated models, providing a strong foundation for **Milestone 4**, which focused on human evaluation and qualitative assessment.

## 🧪 Milestone 4 – Human Evaluation & Survey Analysis

**Timeline:** November 19 – December 3, 2025

Milestone 4 centered on incorporating **human judgment** into the benchmarking process and concluded successfully. The team prepared and published a Google Form survey to compare model outputs side-by-side, and participants evaluated **clarity, coherence, informativeness, factuality,** and **overall preference**.

To improve participation and focus, the survey scope was refined to eight questions across four categories—**Reasoning, Summarization, Creative Writing,** and **Paraphrasing**—and the **Retrieval/RAG** category was excluded due to its emphasis on factual lookup rather than generative quality.

Once responses were collected, the team analyzed the results by aggregating scores, assessing agreement among reviewers, and comparing human preferences. Initial insights, including distributional patterns and respondent demographics, were reviewed via Google Forms visualizations, and notable alignments and divergences between human judgments and quantitative metrics were documented to guide interpretation in the final analysis.
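Score aggregation of this kind can be sketched as follows; the field names and the 1-5 rating scale are assumptions for illustration, not the survey's actual schema.

```python
from collections import defaultdict
from statistics import mean

# Toy responses; field names and the 1-5 scale are assumptions, not the
# project's actual Google Form schema.
responses = [
    {"question": "Q1", "model": "open-source", "clarity": 4, "factuality": 5},
    {"question": "Q1", "model": "commercial",  "clarity": 5, "factuality": 4},
    {"question": "Q2", "model": "open-source", "clarity": 3, "factuality": 4},
]

# Average each response's criteria, then average per model.
per_model = defaultdict(list)
for r in responses:
    per_model[r["model"]].append(mean([r["clarity"], r["factuality"]]))

summary = {m: round(mean(scores), 2) for m, scores in per_model.items()}
print(summary)  # {'open-source': 4.0, 'commercial': 4.5}
```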

By the end of Milestone 4, the project integrated the human evaluation results into the broader dataset, consolidated the confirmed question set and model pairings, and prepared materials for downstream reporting. This provided a more nuanced understanding of model performance, completing the human evaluation phase and setting up the transition into **Milestone 5**.

## 📣 Milestone 5 – Communication of Results & Final Presentation

**Timeline:** December 4 – ongoing

Milestone 5 focuses on packaging and communicating the project’s findings while completing the final presentation and releasing the full set of artifacts. The team is synthesizing human evaluation results to produce a coherent analysis narrative, drafting and editing the presentation and article for publication, and finalizing an infographic and visual summary that will be embedded in both the article and the presentation.

In parallel, the repository is being cleaned and organized to publish the code, data, and analysis notebooks with clear usage notes and data access instructions. Everything will be finalized on December 7.
meeting_minutes/milestone4_meetings.md (180 changes: 180 additions & 0 deletions)
<!-- markdownlint-disable MD024 -->
<!-- Disabled MD024 (Multiple headings with the same content) rule
because repeated headings (Summary, Action Items) are
intentionally used across multiple sections for structural clarity.
-->
# Milestone 4 Meeting Minutes

## Meeting 18

**Date:** November 19, 2025 (Wednesday, 1:00 PM EST)
**Attendees:** Amro, Aseel, Banu, Caesar, Reem, Safia

### Summary

- The research question was refined with Evan's help:
  - During the group conversation with Evan on Slack, the team initially drafted a
precise question focusing on whether optimized open-source models (e.g., via
recursive editing, distillation) could become environmentally and
functionally viable alternatives to commercial models.
- However, Evan advised that the ELO2 project should remain **open-ended**,
shifting toward a broader guiding question:
**“How can we achieve similar results to large private models on smaller
devices and with less power consumption?”**
- As a result, the final deliverable will be a **comprehensive portfolio** of
experiments, benchmarks, comparisons, and promising directions—rather than a
single definitive answer.
- Based on Evan’s feedback, the upcoming **Google Form** will include both
**commercial** and **open-source** model responses.
- Initially, the plan was to pair small open-source SLMs with commercial models
of similar sizes, but this was not feasible due to limited access. The team
instead decided to **use accessible commercial LLMs (ChatGPT, Claude,
Gemini).**
- **All questions** across all categories will be used in the **Google Form**.
- Finalized model assignments:
- **Aseel** → ChatGPT
- **Caesar** → Claude Haiku 4.5
- **Amro** → Gemini Pro 3
- **Banu** → Gemini Fast (Flash 2.5)
- **Reem** → Gemini Flash 2.5 Lite (via API/HuggingFace)

### Action Plan

- Each member will generate **responses for all question prompts** using their
assigned model.
- All responses must be uploaded to the
  [shared document](https://docs.google.com/document/d/1CBYpsLvkeE5aLKp1-6vPaiizDXVf80o2uH42XL5gXtw/edit?tab=t.0)
  **by tomorrow**.
- In tomorrow’s meeting, the team will review all open-source and commercial
model outputs and **select the final answers** to include in the Google Form.

---

## Meeting 19

**Date:** November 20, 2025 (Thursday, 2:30 PM EST)
**Attendees:** Amro, Aseel, Caesar, Reem, Safia

### Summary

- The team revisited the original plan for the evaluation form. Initially,
**all 21 questions** across all task categories were intended to be included.
- However, because a 21-question survey would be too long for participants, the
group agreed to **select only two questions per category** to keep the form
manageable.
- During this selection process, the team decided to **exclude the
Retrieval/RAG category** entirely, since its questions require factual lookup
(dates, names, quantities), which does not align well with the survey’s goal
of evaluating reasoning or generation quality.
- As a result, the form will include **four categories**—Reasoning,
Summarization, Creative Writing, Paraphrasing—with **two questions per
category**, for a total of **8 questions**.
  _(Selected Q&A's can be found
  [here](https://docs.google.com/document/d/1CBYpsLvkeE5aLKp1-6vPaiizDXVf80o2uH42XL5gXtw/edit?tab=t.ugqqnecewdh7).)_
- The group reaffirmed that **each task category will be represented by one
model pair**, ensuring all models contribute to the study.
- The team decided to **pair each open-source model with the closest commercial
model** for comparative evaluation.
- The final task–model pairings were confirmed as:
- **Reasoning:** Gemma ↔ Claude Haiku 4.5
- **Summarization:** LaMini ↔ Gemini Flash
- **Creative Writing:** Mistral ↔ Gemini Pro 3
- **Paraphrasing:** Qwen ↔ ChatGPT

### Action Plan

- Add the selected model responses to the **Google Form** initially created by
Banu and finalize the form.
- Submit the form to **Evan** for feedback, then incorporate any revisions.
- **Publish** the finalized form to the cohort group and **collect responses
until November 30**.

---

## Meeting 20

**Date:** November 25, 2025 (Tuesday, 2:30 PM EST)
**Attendees:** Amro, Caesar, Reem, Banu

### Summary

- While the survey form was still collecting responses, the team reviewed the
  remaining deliverables for both **ELO2/Graduation requirements** and **the
  project itself**.

1. For **ELO2 and graduation**, the deliverables were revisited and confirmed
as:
- Repository
- Presentation
- 1000-word final testimonial _(individual)_
- 1000-word ELO2 retrospective _(individual)_
- Exit Survey _(individual)_

2. For **Green AI project’s final outputs**, the deliverables were identified
as:
- Repository
- Article
- Presentation
- Form analysis
- The team also discussed the expected article format and structure.
- The article should narrate the project process by explaining motivations and
the overall journey, with roughly **5–10%** on initial ideas, most of the
content focusing on the work done, and a concluding section with findings
and potential future directions.
- Reem volunteered to create an infographic or visual summary that can be used
for both the article and the presentation.
- It was also noted that **on November 28 (Friday)**, support may be requested
from Evan to help boost form participation through an announcement.
- Since the form was published later than anticipated, the team decided to
**close it on December 2nd instead of November 30th.**

### Action Plan

- **Caesar and Reem** to begin drafting the article for team review.
- **Amro** to work on repository updates and refine the main README draft
prepared by Banu earlier.
- **Reem** to create an infographic using a **visualization tool** to summarize
project results for the article, presentation, and Medium or similar platforms
_(after the form closes)_.
- The survey form will be **closed on December 2nd**, after which data analysis
will begin.
- **Banu, Safia, and Aseel** to work on the presentation, building on the
initial draft previously prepared by Banu.
- An announcement request **may be sent to Evan on November 28 (Friday)** to
encourage more form responses.

---

## Meeting 21

**Date:** December 2, 2025 (Tuesday, 12:30 PM EST)
**Attendees:** Caesar, Reem, Aseel

### Summary

- The team discussed the status of the survey form and agreed to close it by the
end of the day. After reviewing its current performance, they noted that most
initial insights and demographics could already be observed through the Google
Forms visualization tools, which provided a general overview of respondent
characteristics and early trends.
- The group revisited the remaining requirements and clarified what still needs
to be completed for the final deliverables. A key part of the discussion
focused on how the results will be structured and presented in both the
article and the visual summary. This included considering how to best
translate the survey findings into a clear narrative and an accompanying
infographic or visualization.
- The team confirmed that an additional meeting would be held on December 3rd to
examine the survey results in more depth. During that session, they will
identify any notable or unexpected findings that may require special emphasis
or separate formats within the final outputs.

### Action Plan

- Close the survey form by end of day on December 2.
- Begin outlining how survey results will be integrated into the article and
visual materials.
- Continue exploring the data using Google Forms visualizations to prepare for
deeper analysis.
- Meet again on December 3rd to review detailed results and determine standout
findings or sections that require special formatting.
meeting_minutes/milestone5_meetings.md (43 changes: 43 additions & 0 deletions)
<!-- markdownlint-disable MD024 -->
<!-- Disabled MD024 (Multiple headings with the same content) rule
because repeated headings (Summary, Action Items) are
intentionally used across multiple sections for structural clarity.
-->
# Milestone 5 Meeting Minutes

## Meeting 22

**Date:** December 4, 2025 (Thursday, 1:00 PM EST)
**Attendees:** Amro, Aseel, Banu, Caesar, Safia

### Summary

- The team reviewed the overall project status, noting that only a few days remained
and both the **presentation** and the **article** were close to completion.
- The group discussed the structure of the final presentation, agreeing on a
**category-based layout** (reasoning, summarization, creative generation, paraphrasing)
supported by charts, visuals, and short explanations.
- Aseel led the design direction, and the team confirmed a **road-themed visual style**
for consistency across slides.
- The group highlighted key research findings:
- Open-source models performed surprisingly well in **reasoning** and
**creative generation** tasks.
- Commercial LLMs still hold advantages for **highly specialized or complex domains**.
- Promising future directions include experimenting with larger parameter counts,
fine-tuning methods, and evaluating performance under different context window
sizes.

### Action Plan

- **Banu** will compile and document all experiment findings (questions, answers,
  charts) and upload the file to the **'findings' folder** of the GitHub repo.
- **Caesar** will continue to work on the **article and 'experiment' folder** of
the repo.
- **Amro** will finalize the **main README**.
- **Aseel** will continue leading the presentation design and **finalize the slides**.
- **Reem** will complete the **visual summaries** for the article and presentation.
- **Safia** will work on the **conclusion slide.**
- The team will meet again on **December 7** to review all components and
finalize the project.

---