
Commit 008d989 (1 parent: 26a86af)

docs(meetings): edit meeting_minutes README for milestone 4 and 5

2 files changed: 19 additions & 9 deletions

meeting_minutes/README.md (18 additions & 8 deletions)

@@ -24,28 +24,38 @@ By the end of Milestone 1, the project established its scope, research framework
**Timeline:** October 15 – November 6, 2025

- With the research framework and scope finalized in Milestone 1, **Milestone 2** focuses on preparing the experimental environment and defining how sustainability metrics will be measured. This phase involves setting up tools such as **CodeCarbon****CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers like AWS, Microsoft, and Google.
+ With the research framework and scope finalized in Milestone 1, **Milestone 2** focused on preparing the experimental environment and defining how sustainability metrics would be measured. This phase involved setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy and carbon usage, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers like AWS, Microsoft, and Google.

- The team also plans to configure testing environments for small open-source models (e.g., **Mistral****LLaMA-2**) using **Hugging Face Transformers****PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable is the **experimental design document**, which will outline the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.
+ The team also planned to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable was the **experimental design document**, which outlined the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.

By the end of Milestone 2, the team completed the technical setup, finalized the measurement pipeline, and validated that all tracking tools operate consistently across model types—ensuring a smooth transition into Milestone 3, where full experiments will be executed.

## 📊 Milestone 3 – Model Benchmarking & Data Collection

**Timeline:** November 7 – November 18, 2025

- Milestone 3 marks the beginning of the full experimental phase. Using the measurement pipeline and tooling established in Milestone 2, the team runs benchmark tasks on both proprietary and open-source models to collect data on **accuracy** and **environmental impact**. This includes tracking **energy consumption and carbon emissions** for each testing model under consistent test conditions.
+ Milestone 3 marked the beginning of the full experimental phase. Using the measurement pipeline and tooling established in Milestone 2, the team ran benchmark tasks on both proprietary and open-source models to collect data on **accuracy** and **environmental impact**. This included tracking **energy consumption and carbon emissions** for each model tested under consistent conditions.

- During this phase, the team also validates accuracy results on selected reasoning and summarization tasks, investigates irregular outputs, and updates evaluation scripts when needed. Additional observations such as **inference time, token throughput**, and **hardware utilization** are recorded to support later analysis.
+ During this phase, the team also validated accuracy results on selected reasoning and summarization tasks, investigated irregular outputs, and updated evaluation scripts when needed. Additional observations such as **inference time, token throughput**, and **hardware utilization** were recorded to support later analysis.

By the end of Milestone 3, the project has produced a complete experimental dataset covering sustainability metrics and accuracy scores for all evaluated models, providing a strong foundation for **Milestone 4**, which focuses on human evaluation and qualitative assessment.

## 🧪 Milestone 4 – Human Evaluation & Survey Analysis

- **Timeline:** November 19 – ongoing
+ **Timeline:** November 19 – December 3, 2025

- Milestone 4 centers on incorporating **human judgment** into the benchmarking process. The team prepares a Google Form survey designed to compare model outputs side-by-side. Participants evaluate **clarity, coherence, informativeness, factuality,** and **overall preference**.
+ Milestone 4 centered on incorporating **human judgment** into the benchmarking process and concluded successfully. The team prepared and published a Google Form survey to compare model outputs side-by-side, and participants evaluated **clarity, coherence, informativeness, factuality,** and **overall preference**.

- Once responses are collected, the team analyzes the results by aggregating scores, assessing agreement among reviewers, and comparing human preferences with automated accuracy metrics from earlier milestones. This helps identify where quantitative and qualitative assessments align or diverge.
+ To improve participation and focus, the survey scope was refined to eight questions across four categories—**Reasoning, Summarization, Creative Writing,** and **Paraphrasing**—and the **Retrieval/RAG** category was excluded due to its emphasis on factual lookup rather than generative quality.

- By the end of Milestone 4, the project integrates the human evaluation results into the broader dataset, enabling a more nuanced understanding of model performance and preparing the groundwork for **Milestone 5**.
+ Once responses were collected, the team analyzed the results by aggregating scores, assessing agreement among reviewers, and comparing human preferences. Initial insights, including distributional patterns and respondent demographics, were reviewed via Google Forms visualizations, and notable alignments and divergences between human judgments and quantitative metrics were documented to guide interpretation in the final analysis.
+
+ By the end of Milestone 4, the project integrated the human evaluation results into the broader dataset, consolidated the confirmed question set and model pairings, and prepared materials for downstream reporting. This provided a more nuanced understanding of model performance, completing the human evaluation phase and setting up the transition into **Milestone 5**.
+
+ ## 📣 Milestone 5 – Communication of Results & Final Presentation
+
+ **Timeline:** December 4 – ongoing
+
+ Milestone 5 focuses on packaging and communicating the project’s findings while completing the final presentation and releasing the full set of artifacts. The team is synthesizing human evaluation results to produce a coherent analysis narrative, drafting and editing the presentation and article for publication, and finalizing an infographic and visual summary that will be embedded in both the article and the presentation.
+
+ In parallel, the repository is being cleaned and organized to publish the code, data, and analysis notebooks with clear usage notes and data access instructions. Everything will be finalized on December 7.
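
The Milestone 2 text above describes a measurement pipeline that combines tracked energy use (CodeCarbon, CarbonTracker, Eco2AI) with grid carbon intensity and provider Water Usage Effectiveness (WUE) figures. As a minimal illustrative sketch — not the project's actual pipeline, and with purely hypothetical constants — the per-run derivation of carbon and water from measured energy could look like:

```python
# Illustrative conversion of a run's measured energy into the three
# sustainability metrics discussed in the milestones (energy, carbon, water).
# The constants below are hypothetical placeholders, not project data.

def sustainability_metrics(energy_kwh: float,
                           carbon_intensity_kg_per_kwh: float,
                           wue_l_per_kwh: float) -> dict:
    """Derive carbon (kg CO2eq) and water (litres) from measured energy (kWh)."""
    return {
        "energy_kwh": energy_kwh,
        "carbon_kg": energy_kwh * carbon_intensity_kg_per_kwh,
        "water_l": energy_kwh * wue_l_per_kwh,
    }

# Hypothetical example: a 0.5 kWh inference run on a grid emitting
# 0.4 kg CO2/kWh, in a data centre with a WUE of 1.8 L/kWh.
metrics = sustainability_metrics(0.5, 0.4, 1.8)
print(metrics)  # {'energy_kwh': 0.5, 'carbon_kg': 0.2, 'water_l': 0.9}
```

In practice the energy figure would come from a tracker such as CodeCarbon, while carbon intensity and WUE would be looked up per cloud region from the provider datasets the milestone mentions.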

meeting_minutes/milestone4_meetings.md (1 addition & 1 deletion)

@@ -58,7 +58,7 @@ intentionally used across multiple sections for structural clarity.
### Summary

- The team revisited the original plan for the evaluation form. Initially,
-   **all 21 questions** across all task categories were intended to be included.
+   **all 21 questions** across all task categories were intended to be included.
- However, because a 21-question survey would be too long for participants, the
  group agreed to **select only two questions per category** to keep the form
  manageable.
