Meeting 14
Date: November 7, 2025 (Friday, 3:00 PM EST)
Attendees: Amro, Aseel, Caesar, Safia, Banu
Summary
- Amro presented the latest progress on his work with Mistral 7B integrated with RAG. The model demonstrated significantly improved accuracy and more contextually appropriate responses. Although latency increased (64 → 152), this was deemed acceptable given the model’s large size and the team’s limited computational resources.
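The RAG wiring around a generator such as Mistral 7B can be sketched as follows. This is a minimal illustration, not the team's implementation: the keyword-overlap retriever is a toy stand-in for a real embedding-based retriever, and the prompt template is an assumption.

```python
def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(
        corpus,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(question: str, corpus: list[str]) -> str:
    """Pack the top-k retrieved passages into the prompt sent to the model."""
    context = "\n".join(retrieve(question, corpus))
    return (
        "Use the context to answer.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting prompt string would then be passed to the generator; grounding the model in retrieved context is what drives the accuracy gains noted above.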
- Caesar tested CodeCarbon and its Emissions Tracker on his model and briefed the group on his findings: the results were not always consistent. Amro suggested trying CodeCarbon’s offline mode to address the issue.
- Safia continued her experiments with SLM (TinyLlama) + RAG and noted that she would begin testing the unified test prompts next.
- Based on Evan’s feedback, the team decided to evaluate outputs rather than models directly. It was noted by Amro that existing Python-based evaluation libraries yielded unsatisfactory results. Consequently, the group agreed to adopt a hybrid evaluation approach, combining AI-based and human-based methods.
- The previously discussed Google Form idea received positive feedback from Evan, who considered it a strong and valuable experimental component of the project. The form will also serve as a human-based evaluation tool.
- The initial structure of the Google Form was outlined:
- The first section will include a brief project introduction and general information.
- Subsequent sections will present texts generated by both open-source and commercial models, displayed in random order.
- Participants will be asked to guess the source of each text and rate it on several criteria, such as clarity, relevance, and accuracy, using a 1–5 scale.
- Following Evan’s suggestion, the target audience should be diverse and multicultural, so the form will be shared within the cohort group once finalized.
Action Items
- The team aims to meet again tomorrow, subject to everyone’s availability.
- The recursive model approach will be re-evaluated.
- README files will be prepared for all selected models in the repository, providing clear information about each model’s architecture, functionality, and key characteristics.
- Work on the Google Form will begin, with a target publication date of November 15.
Future Work
- Once published, the form will remain open for two weeks, during which responses will be collected and analyzed.
- In the first week of December (the final week of ELO2), the collected data will be manually analyzed, and the team will write the research article based on the findings.
Meeting 15
Date: November 9, 2025 (Sunday, 2:30 PM EST)
Attendees: Amro, Aseel, Banu, Caesar, Reem, Safia
Summary
- Due to members’ schedule constraints, the meeting originally planned for Saturday was held today instead.
- A brief recap of Meeting 14 was provided.
- The team reviewed the overall working plan and discussed the status of the three models currently under testing.
- Caesar reported that the CodeCarbon issue was resolved after applying Amro’s earlier suggestion. However, his distilled model was unable to generate meaningful outputs for the creative thinking task at first.
- Amro suggested that Caesar adjust and refine the prompts used with the model. After modifying the guidance prompt, temperature, and a few other parameters during the meeting, Caesar’s model showed improved generation results, producing two correct responses out of three creative questions in real time.
- Amro shared that his Mistral 7B model, which was already performing well overall, started producing strong results for the creative generation task after implementing the new guidance prompts.
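The kind of in-meeting fix described above amounts to tightening the guidance prompt and the sampling parameters. A sketch, assuming a Hugging Face-style generation API; the prompt wording and parameter values here are illustrative assumptions, not the ones the team used:

```python
# Hypothetical guidance (system) prompt for the creative-thinking task.
GUIDANCE_PROMPT = (
    "You are a creative writing assistant. Answer the question directly, "
    "in 3-5 sentences, without repeating the question."
)

# Illustrative sampling parameters of the kind adjusted during the meeting.
GENERATION_PARAMS = {
    "temperature": 0.8,     # higher -> more varied wording for creative tasks
    "top_p": 0.9,           # nucleus sampling keeps the output coherent
    "max_new_tokens": 256,  # enough room for a full creative answer
    "do_sample": True,      # required for temperature/top_p to take effect
}

def build_prompt(question: str) -> str:
    """Prepend the guidance prompt to the user question."""
    return f"{GUIDANCE_PROMPT}\n\nQuestion: {question}\nAnswer:"
```

With a Hugging Face model, `GENERATION_PARAMS` would be passed as keyword arguments to `model.generate(...)`.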
- Reem presented her proposal to modify the existing recursive reasoning method into a recursive editing approach. She summarized possible directions for its integration, explaining that the process would involve:
- One model generating an initial draft.
- A feedback provider model offering feedback on the draft.
- A refinement cycle combining the initial output and feedback to produce an improved version.
- The cycle repeating iteratively until a high-quality output is achieved or a set number of iterations is reached.
- Evaluation criteria varying depending on the specific task and its metrics.
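The cycle described above can be sketched as a simple loop. The callables `generate`, `give_feedback`, `refine`, and `is_high_quality` are hypothetical stand-ins for the drafting model, the feedback-provider model, the refinement step, and the task-specific evaluation criteria:

```python
def recursive_edit(prompt, generate, give_feedback, refine,
                   is_high_quality, max_iters=3):
    """Iteratively refine a draft until it passes evaluation or iterations run out."""
    draft = generate(prompt)              # one model produces the initial draft
    for _ in range(max_iters):
        if is_high_quality(draft):        # stop early once the draft passes
            break
        feedback = give_feedback(draft)   # feedback-provider model critiques it
        draft = refine(draft, feedback)   # combine draft + feedback into a new version
    return draft
```

The same loop works for any of the model setups; only the four callables change per task.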
- The group reviewed the current model lineup:
- Caesar: LaMini (open-source distilled model) + RAG
- Amro: Mistral 7B (open-source) + RAG
- Safia: TinyLlama (open-source SLM) + RAG
- The team agreed to apply the recursive editing framework to Safia’s model as the fourth setup for further testing and evaluation.
- The team also discussed the approach for human evaluation, focusing on assessing the accuracy and quality of outputs based on participants’ opinions. It was agreed to complete final testing during the week.
Action Items
- Reem, Aseel, and Banu will implement the recursive editing framework on the small language model (TinyLlama + RAG + Recursive Editing).
- Caesar will apply the same recursive editing methodology to the distilled model.
- Amro will test the approach on the Mistral model setup.
- All members will prepare model outputs and documentation by the next meeting on Friday, November 14.
- The agenda for Friday’s meeting will focus on reviewing results and preparing the Google Form, which will be finalized on Saturday, November 15, to collect structured human feedback on model outputs.
Meeting 16
Date: November 14, 2025 (Friday, 2:30 PM EST)
Attendees: Amro, Aseel, Banu, Caesar, Reem, Safia
Summary
- Following the last meeting, asynchronous work and discussion via WhatsApp determined that the TinyLlama model was not suitable for recursive editing. The team decided to switch to Microsoft's Phi-3 model, but shortly afterward the model was removed from Hugging Face and became unavailable. The search for suitable replacements led to two new SLMs being confirmed: Reem is working on Alibaba's Qwen model, while Aseel is implementing Google's Gemma model.
- Caesar reported facing token limitation challenges when implementing recursive editing on his model.
- Reem presented updates on her recursive cycle experimentation and discussed the technical problems she encountered, particularly regarding context window tuning and token allocation across generation, feedback, and refinement stages.
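One way to frame the token-allocation problem: the model's context window must be divided between the current draft, the feedback text, and headroom for the refined output, after reserving space for the fixed instructions. A sketch with illustrative fractions; the split ratios are assumptions to tune per model, not values from the experiments:

```python
def split_token_budget(context_window: int, overhead_tokens: int,
                       draft_frac: float = 0.45, feedback_frac: float = 0.2):
    """Split the usable context among draft, feedback, and refined output.

    `overhead_tokens` covers the fixed guidance/instruction prompt;
    the remainder after draft and feedback goes to the refinement stage,
    so no tokens are lost to rounding.
    """
    usable = context_window - overhead_tokens
    if usable <= 0:
        raise ValueError("overhead exceeds the context window")
    draft = int(usable * draft_frac)
    feedback = int(usable * feedback_frac)
    refine = usable - draft - feedback  # remainder for the refined output
    return {"draft": draft, "feedback": feedback, "refine": refine}
```

Making the refinement share the remainder (rather than a third fraction) guarantees the three stages never exceed the window.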
- Amro proposed a strategic approach: using different specialized small models for specific tasks (such as text criticism and refinement) rather than relying on large generalized models, in line with current AI development trends. He suggested offloading the retrieval and initial generation tasks to users (allowing them to copy and paste from documents), then using a small model specifically to criticize and refine the generated text, similar to writing an email and asking ChatGPT to review and fix it. This approach acknowledges that small models lack the capacity for full answer generation but could excel at focused refinement tasks.
- Additional technical discussions covered context window tuning, max token adjustments, and restoring model capacity by separating loops into different cells.
- The group reaffirmed that GPUs remain significantly more effective than CPUs for model execution and testing.
- Discussion of the Google Form methodology began: the team decided to use vague model descriptions in the evaluation form to prevent participant bias during assessment.
Action Items
- The team will reconvene tomorrow at 2:30 PM EST to finalize the work and clarify remaining tasks, specifically regarding the Google Form.
- Safia and Banu will work on the Google Form after tomorrow’s meeting. It must be finalized and sent to Evan for review by Monday, before posting in public groups.
- Reem and Aseel will continue working on their respective models.
- All members to finalize model implementations and documentation.
Meeting 17
Date: November 15, 2025 (Saturday, 2:30 PM EST)
Attendees: Amro, Aseel, Caesar, Reem, Banu
Summary
- Aseel confirmed that she pushed her model and related work to the repository earlier in the day.
- Reem reported that she is in the process of finalizing her model outputs and preparing them for upload.
- Amro noted that he made improvements to the documentation for his model, refining clarity and organization.
- The team discussed the structure of the Google Form. Key points included:
- Whether all texts in the form should come exclusively from the open-source models the team has been working on,
- or whether the form should also include commercial model outputs, presented in a mixed format.
- The group agreed to seek Evan’s guidance to determine the most appropriate design. The final form structure will be shaped according to his recommendation.
Action Items
- Create the first Google Form draft tomorrow and prepare to present it on Monday.
- Finalize and publish the form on Monday, after incorporating Evan’s feedback.
- Finalize the model documentation in the shortest time possible and initiate the overall repository restructuring.