With the research framework and scope finalized in Milestone 1, **Milestone 2** focuses on preparing the experimental environment and defining how sustainability metrics will be measured. This phase involves setting up tools such as **CodeCarbon**, **CarbonTracker**, and **Eco2AI** to monitor energy use and carbon emissions, and exploring **Water Usage Effectiveness (WUE)** datasets from major cloud providers such as AWS, Microsoft, and Google.
The team also plans to configure testing environments for small open-source models (e.g., **Mistral**, **LLaMA-2**) using **Hugging Face Transformers**, **PyTorch**, and GPU-enabled platforms such as **Colab**. Another core deliverable is the **experimental design document**, which will outline the metrics (energy, carbon, water, and accuracy), workflows, and methodology diagrams guiding the model evaluation process.
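The four metrics named above can be tied together in a single per-run record. The following is a minimal sketch, not the project's actual pipeline: it assumes energy is already measured in kWh by one of the tracking tools, and it derives carbon and water by multiplying that energy by a grid carbon intensity (kg CO2e per kWh) and a WUE value (liters per kWh). The names `RunFootprint` and `estimate_footprint`, and every number in the example, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class RunFootprint:
    """One row of the experimental dataset (hypothetical schema)."""
    model: str
    energy_kwh: float   # measured by a tracker such as CodeCarbon
    carbon_kg: float    # derived: energy x grid carbon intensity
    water_l: float      # derived: energy x WUE
    accuracy: float     # task score in [0, 1]


def estimate_footprint(model, energy_kwh, accuracy,
                       carbon_intensity_kg_per_kwh, wue_l_per_kwh):
    """Derive carbon and water estimates from measured energy.

    WUE is expressed in liters of water per kWh of IT energy, so the
    product below approximates on-site water use only.
    """
    return RunFootprint(
        model=model,
        energy_kwh=energy_kwh,
        carbon_kg=energy_kwh * carbon_intensity_kg_per_kwh,
        water_l=energy_kwh * wue_l_per_kwh,
        accuracy=accuracy,
    )


# Placeholder inputs for illustration only, not measured values.
run = estimate_footprint(
    model="example-model",
    energy_kwh=0.5,
    accuracy=0.8,
    carbon_intensity_kg_per_kwh=0.4,
    wue_l_per_kwh=1.8,
)
print(run)
```

Keeping the derived values next to the measured energy makes it easy to recompute carbon and water later if a different regional intensity or provider WUE figure is chosen.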
This milestone sets the foundation for **Milestone 3**, where real model experiments and energy tracking will begin.
By the end of Milestone 2, the team completed the technical setup, finalized the measurement pipeline, and validated that all tracking tools operate consistently across model types, ensuring a smooth transition into Milestone 3, where full experiments will be executed.
## 📊 Milestone 3 – Model Benchmarking & Data Collection
**Timeline:** November 7 – November 18, 2025
Milestone 3 marks the beginning of the full experimental phase. Using the measurement pipeline and tooling established in Milestone 2, the team runs benchmark tasks on both proprietary and open-source models to collect data on **accuracy** and **environmental impact**. This includes tracking **energy consumption and carbon emissions** for each model tested under consistent conditions.
During this phase, the team also validates accuracy results on selected reasoning and summarization tasks, investigates irregular outputs, and updates evaluation scripts when needed. Additional observations such as **inference time, token throughput**, and **hardware utilization** are recorded to support later analysis.
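Inference time and token throughput can be captured with a thin timing wrapper around whatever generation call is used. The sketch below is illustrative only: `measure_throughput` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real model call by sleeping briefly and splitting the prompt on whitespace.

```python
import time


def measure_throughput(generate, prompt):
    """Time one generation call and report tokens per second.

    `generate` is any callable that takes a prompt and returns a
    sequence of generated tokens.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    n_tokens = len(tokens)
    return {
        "inference_time_s": elapsed,
        "tokens_generated": n_tokens,
        "tokens_per_s": n_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate(prompt):
    """Stand-in for a model call: waits briefly, returns whitespace tokens."""
    time.sleep(0.01)
    return prompt.split()


stats = measure_throughput(fake_generate, "a short example prompt for timing")
print(stats)
```

With a real model, the wrapper would usually be run several times per prompt after a warm-up call, since the first call often includes one-off loading costs.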
By the end of Milestone 3, the project has produced a complete experimental dataset covering sustainability metrics and accuracy scores for all evaluated models, providing a strong foundation for **Milestone 4**, which focuses on human evaluation and qualitative assessment.
## 🧪 Milestone 4 – Human Evaluation & Survey Analysis
**Timeline:** November 19 – ongoing
Milestone 4 centers on incorporating **human judgment** into the benchmarking process. The team prepares a Google Form survey designed to compare model outputs side-by-side. Participants evaluate **clarity, coherence, informativeness, factuality,** and **overall preference**.
Once responses are collected, the team analyzes the results by aggregating scores, assessing agreement among reviewers, and comparing human preferences with automated accuracy metrics from earlier milestones. This helps identify where quantitative and qualitative assessments align or diverge.
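The aggregation and agreement steps described above could look roughly like the following. This is a sketch under assumptions: the source does not say which agreement statistic the team uses, so Cohen's kappa for two reviewers is swapped in here as one common choice (more than two reviewers would call for something like Fleiss' kappa). The function names, the response layout, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def mean_scores(responses):
    """Average scores per (model, criterion).

    `responses` is an iterable of (model, criterion, score) tuples,
    e.g. rows exported from the survey form.
    """
    buckets = defaultdict(list)
    for model, criterion, score in responses:
        buckets[(model, criterion)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two reviewers labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both reviewers must label the same non-empty item list")
    n = len(rater_a)
    # Share of items on which the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each reviewer's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up sample data for illustration only.
scores = mean_scores([
    ("model_x", "clarity", 4),
    ("model_x", "clarity", 5),
    ("model_y", "clarity", 3),
])
kappa = cohen_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])
print(scores, kappa)
```

The per-model means can then be placed alongside the automated accuracy scores from Milestone 3 to see where the two kinds of assessment align or diverge.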
By the end of Milestone 4, the project integrates the human evaluation results into the broader dataset, enabling a more nuanced understanding of model performance and preparing the groundwork for **Milestone 5**.