From 5c4391308bd8aa10fe0beb3a7626484c0ce5cb38 Mon Sep 17 00:00:00 2001 From: CaesarGhazi Date: Thu, 4 Dec 2025 02:38:02 +0200 Subject: [PATCH 1/6] adding the readme for the experiment folder and the survey form --- 3_experiment/README.md | 77 +++++ 3_experiment/survey_form.md | 588 ++++++++++++++++++++++++++++++++++++ 2 files changed, 665 insertions(+) create mode 100644 3_experiment/survey_form.md diff --git a/3_experiment/README.md b/3_experiment/README.md index e69de29..5be675f 100644 --- a/3_experiment/README.md +++ b/3_experiment/README.md @@ -0,0 +1,77 @@ +# AI Model Comparison Experiment + +## Evaluating Open-Source vs. Commercial Language Models + +This folder contains the materials for our experiment comparing open-source and +commercial AI models through human evaluation. Participants were asked to read +pairs of AI-generated texts and judge their quality without knowing which model +produced which text. + +--- + +## What This Experiment Is + +We created a survey where each question includes two texts—**Text A** and +**Text B**—generated by different AI models. One text always comes from an +**open-source model**, and the other from a **commercial model**. Participants: + +* Choose which text they prefer +* Guess which model type generated each text +* Rate both texts (accuracy, clarity, relevance, faithfulness) + +All evaluations are blind to remove brand bias. + +--- + +## Why We Did This + +Open-source AI models are advancing quickly, and we wanted to understand +whether they are perceived as competitive alternatives to commercial systems. +While benchmarks can measure performance numerically, they don’t reflect how +humans actually experience AI-generated writing. + +This experiment aims to answer questions like: + +* Do people notice a consistent quality difference? +* Can users accurately identify commercial vs. open-source output? +* Are open-source models “good enough” for real-world tasks? 
+ +Understanding these perceptions is important for evaluating the viability of +sustainable, accessible, and transparent AI systems. + +--- + +## Why We Chose This Method + +We used a **paired, blind comparison** because it provides a clean way to +assess text quality without model reputation influencing the results. +Participants judge writing on its own merits, which helps us collect more +reliable data. + +We included multiple task types: summarization, paraphrasing, reasoning, and +creative writing, because each one tests a different aspect of model behavior. + +This variety gives us a broader picture of model strengths and weaknesses. + +--- + +## Why This Approach Works Well + +This survey-based structure is simple and easy for participants to +understand. It mirrors how people naturally interact with AI systems: reading +text and forming opinions about quality. By keeping the evaluation blind, we +minimize bias and generate more meaningful insights into real user perception. + +The method also helps determine whether open-source models, especially optimized +ones, can realistically serve as alternatives to commercial systems +in practical use. + +--- + +## Contents of This Folder + +```text +3_experiment/ +├── survey_form.md # The form text used in the study +└── README.md # Explanation of the experiment (this file) +``` diff --git a/3_experiment/survey_form.md b/3_experiment/survey_form.md new file mode 100644 index 0000000..916ae49 --- /dev/null +++ b/3_experiment/survey_form.md @@ -0,0 +1,588 @@ +# Open-Source vs. Commercial AI: Comparing Performance and Quality + +Welcome and thank you for participating in this research study! + +This study investigates a critical question: +"Can open-source language models, enhanced with optimization techniques such as +recursive editing and distillation, become environmentally and functionally viable +alternatives to commercial models?" 
+ +## YOUR TASK + +- Read each text carefully +- Guess whether it was generated by an open-source or commercial model +- Rate each text on clarity, relevance, accuracy, and trustworthiness + +## YOUR ROLE + +Your honest evaluations will help us understand: + +- Whether open-source models can achieve comparable quality to commercial alternatives +- How users perceive differences (or similarities) between the two types +- The viability of more sustainable and accessible AI solutions + +## PRIVACY & ETHICS + +All responses will be completely anonymous and used solely for academic research +purposes. There are no right or wrong answers. We value your genuine, unbiased +impressions and opinions. + +By proceeding, you consent to participate in this study. + +### Participant Background + +Please tell us a bit about yourself. + +Your background and experience with AI tools will help us understand how +different people with varying levels of AI familiarity and from diverse +cultural contexts perceive and evaluate AI-generated content. +This diversity of perspectives is essential to our research. + +What is your nationality? + +What is your native (first) language? + +English +Spanish +German +Arabic +Turkish +Chinese +Russian +Other: + +How would you rate your English proficiency? + +Native speaker +Fluent (C2 - can understand everything easily) +Advanced (C1 - can understand complex texts) +Intermediate (B1-B2 - can understand most everyday texts) +Basic (A1-A2 - still learning) + +What is your current occupation or field of study? +e.g., Computer Science Student, Software Engineer, Teacher, Researcher, Marketing + +How often do you use AI tools (e.g., ChatGPT, Claude, Gemini, Copilot, etc.)? + +Daily +Several times a week +Once a week +A few times a month +Rarely +Never + +What do you primarily use AI tools for? 
(select all that apply) + +Writing and editing assistance +Research and information gathering +Coding and programming +Creative work (stories, art, brainstorming) +Learning and education +Work or professional tasks +Translation or language learning +Personal use or entertainment +I don't use AI tools +Other: + +Are you familiar with the difference between open-source and commercial AI +models? + +Yes, very familiar +Somewhat familiar +I've heard of it but don't know details +No, not familiar + +--- + +### Text Evaluations + +Below you will see several pairs of texts about the Apollo 11 lunar landing and moonwalk +activities. +Each pair contains: + +- Text A and Text B - two AI-generated responses to the same question +- One text was generated by an open-source model (e.g., Mistral, Gemma, LaMini) +- One text was generated by a commercial model (e.g., GPT-4, Claude) +Your task: + +1. Identify which text is from which type of model +2. Evaluate the quality of BOTH texts +The questions cover different task types: paraphrasing, summarization, +creative writing, and reasoning. + +--- + +#### Source Material + +All texts were generated based on excerpts from the Apollo 11 mission transcript. +📎 You can find the source text here + +--- + +#### TASK 1.1- Task type: Summarization + +**Question:** Summarize the main events during the Apollo 11 lunar landing in 3 sentences. +**Text A:** +The main events of the Apollo 11 lunar landing included the lunar module Eagle separating +from the command module and descending to the Moon's surface. Upon landing, Neil +Armstrong famously declared, Houston, Tranquility Base here. The Eagle has +landed. Shortly after, Armstrong stepped onto the lunar surface, becoming the +first human to walk on the Moon, followed by Buzz Aldrin +**Text B:** +The computer prevented an abort during the Apollo 11 lunar landing, and a +complete set of recovery programs were incorporated into the software. + +**Question 1** +Which text do you prefer? 
Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +#### TASK 1.2 - Task type: Summarization + +Question: Explain what scientific equipment the astronauts deployed on the Moon. +**Text A:** +The astronauts deployed the following scientific equipment on the Moon: +Seismometer: To measure moonquakes and meteoroid impacts. +Laser Ranging Retroreflector (LRRR): To precisely measure the distance between the +Earth and the Moon using lasers. +Solar Wind Spectrometer: To analyze the composition of the solar wind. +Magnetometer: To measure the Moon's magnetic field. +**Text B:** +The astronauts deployed the EASEP, which included a Passive Seismic Experiment Package +used to measure moonquakes and a retroreflector array used for the lunar laser ranging +experiment. + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? 
+ +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +#### TASK 2.1 - Task type: Paraphrasing + +Question: Explain how Armstrong’s decisions, actions, and teamwork during the descent +contributed to the mission’s success. +**Text A:** +Neil Armstrong and Buzz Aldrin's Moon landing epitomized remarkable human courage. +Despite facing numerous challenges, including unexpected technical issues and +the inherent dangers of space travel, the astronauts persevered and achieved +their goal of landing on the Moon. Their determination and bravery in +overcoming these obstacles underscore humanity's capacity for extraordinary +courage and perseverance in the face of seemingly insurmountable challenges. +To further enrich this narrative, one could imagine the astronauts reflecting on +their experiences, perhaps contemplating the profound impact of their mission +on both individual and collective human spirit. Alternatively, one could +speculate on how the Moon landing might have illuminated deeper truths +about human resilience and the indomitable will to explore and conquer. +**Text B:** +Armstrong took semi-automatic control when he noticed the computer’s target was strewn +with boulders. 
Aldrin relayed navigation data while Armstrong carefully piloted +the spacecraft to a clear area, managing fuel and lunar dust interference. +His steady judgment, combined with Aldrin’s support, ensured a safe touchdown +with only seconds of fuel remaining, showcasing precise teamwork and decisive action. + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +#### TASK 2.2 - Task type: Paraphrasing + +Question: In your own words, describe what happened when the computer alarms appeared +during the landing. +**Text A:** +The LM guidance computer (LGC) exhibited unexpected "1201" and "1202" program alarms +five minutes into the descent, signaling an issue with executive overflows. +This indicated that the computer was unable to complete all its tasks in +real-time and had to defer some. Nonetheless, these alarms did not necessitate +an abort; instead, the computer automatically prioritized and resumed critical +functions, thereby averting a potential failure. 
+**Text B:** +During the landing, the lunar module’s guidance computer triggered unexpected +1201 and 1202 alarms, signaling it couldn’t process all tasks at once. Mission +Control quickly confirmed it was safe to continue, and the computer +automatically prioritized critical tasks over less important ones, +preventing a potential abort. This built-in recovery allowed the landing to +proceed safely. + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +#### TASK 3.1 - Task type: Reasoning + +Question: Why did the computer alarms (1201 and 1202) occur during the descent? +**Text A:** +The 1201 and 1202 program alarms indicated "executive overflows," meaning the guidance +computer could not complete all its tasks in real-time and had to postpone +some of them. Rather than forcing an abort, the computer's recovery programs +eliminated lower priority tasks and re-established the more important ones, +which prevented an abort and allowed the mission to continue successfully. 
+**Text B:** +The program alarms (1201 and 1202) occurred because the guidance computer (LGC) +experienced "executive overflows," indicating it was overloaded with tasks +and unable to complete all its real-time tasks. + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +#### TASK 3.2 - Task type: Reasoning + +Question: Based on the text, what does Margaret Hamilton's statement reveal +about the Apollo Guidance Computer's design philosophy? +**Text A:** +Margaret Hamilton's statement reveals that the Apollo Guidance Computer was +designed not merely to detect errors, but to actively recover from them through +built-in recovery programs. Rather than forcing an abort when problems occurred, +the software was programmed to prioritize critical tasks and postpone +lower-priority ones, allowing the mission to continue safely. This demonstrates +that the computer's design philosophy prioritized resilience and mission success +over simply alerting the crew to problems. 
+**Text B:** +Margaret Hamilton's statement emphasizes the Apollo Guidance Computer's design +philosophy, highlighting its ability to handle unexpected situations and ensure mission +success. The computer was not simply an error detector, but was programmed to handle +"executive overflows" and prioritize essential tasks over less critical ones. +This suggests the Apollo Guidance Computer was designed with a robust and +adaptable system in mind, capable of handling complex situations and ensuring +mission success. + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +#### TASK 4.1 - Task type: Creative Generation + +Question: Imagine being one of the people in Mission Control. How would you +feel while watching the landing? +**Text A:** +As a member of Mission Control, I would feel an intense mix of emotions while +watching the historic lunar landing. 
The tension would be palpable as I watched +Armstrong and Aldrin navigate the lunar surface, their reports of being +off-target and encountering unexpected alarms adding to the anxiety. I would be +constantly on edge, my heart racing as I listened to the crew's updates and the +computer's alarms. At the same time, I would also feel a deep sense of pride +and awe. I would be part of a team that was making history, pushing the +boundaries of human exploration and achievement. The knowledge that we were +safely guiding two astronauts to the Moon's surface, despite the challenges +they faced, would be an incredible feeling. I would also feel a great sense +of responsibility. The safety and success of the mission were in my hands, and +I would be working tirelessly to ensure that everything ran smoothly. The team's +quick response to the unexpected alarms and Jack Garman's reassurance to the +crew would be crucial in maintaining their confidence and focus. Overall, the +experience would be a rollercoaster of emotions, filled with tension, +excitement, pride, and responsibility. I would be constantly on the edge of my +seat, my heart racing as I watched the historic landing unfold. +**Text B:** +Standing in that room would feel like holding your breath for an eternity, +watching the tension mount as the computer screamed out alarms that threatened +to abort the mission just miles above the surface. You’d be gripped by a unique +kind of terror mixed with trust, relying on the split-second decisions of +colleagues clearing the descent despite the system overload. As the clock +ticked down and the fuel gauges dropped to their final, critical seconds, the +silence would be so heavy that faces would figuratively turn blue from the lack +of oxygen. 
The moment that "Eagle has landed" finally cut through the static, +the collective release of tension would be overwhelming, a room full of people +suddenly remembering how to breathe again after the longest few minutes of their +lives. + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +#### TASK 4.2 - Task type: Creative Generation + +Question: Write a short paragraph about what the Moon landing might have shown about +human courage +**Text A:** +The Moon landing was a testament to human courage and resilience. Despite encountering +unexpected challenges, such as passing landmarks earlier than anticipated and +the LM traveling too fast, the astronauts, Armstrong and Aldrin, remained +focused and determined. They reported their findings to Mission Control, +who reassured them that it was safe to continue the descent. 
The unexpected +1201 and 1202 program alarms, which indicated executive overflows, could have +caused panic, but the computer, programmed to do more than just recognize error +conditions, took recovery actions and prevented an abort. This allowed Armstrong +to take semi-automatic control and find a clear landing site, even as their +propellant supply dwindled. Throughout the descent, Aldrin called out +navigation data, and Armstrong skillfully piloted the spacecraft. Their courage +and determination ultimately led to a successful Moon landing, demonstrating the +human capacity to overcome adversity and push the boundaries of exploration. +**Text B:** +The Moon landing reveals a specific brand of courage defined by icy composure under +extreme pressure. It is best exemplified when Armstrong, faced with a computer screaming +"executive overflow" alarms and a landing site cluttered with dangerous +boulders, didn't panic but instead took manual control to steer the ship to +safety. There is something profoundly brave about pushing forward when you +are miles off course and your fuel gauge is ticking down to the final seconds, +yet still having the presence of mind to land gently enough to merely say, +"contact light". It proves that even amidst the "magnificent desolation" of an +alien world, the human spirit can remain steady enough to turn a near-crisis +into a historic triumph + +**Question 1** +Which text do you prefer? Which one is better in you opinion? + +Text A +Text B +I'm not sure / Can't tell the difference + +**Question 2** +Can you identify which model generated which text? + +Text A is open-source model, Text B is commercial model +Text A is commercial model, Text B is open-source model +Both are commercial models +Both are open-source models +I'm not sure + +**Question 3** +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 
1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +**Question 4** +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +Accuracy:How factually correct is the information? 1-5 +Clarity: How clear and easy to understand? 1-5 +Relevance:How well does it answer the question? 1-5 +Faithfulness:How true to the source material? 1-5 + +--- + +### Final Section - General Feedback + +Thank you for evaluating the texts! We'd love to hear your overall thoughts. + +**Question 1** +Did you notice any patterns or differences between the texts? + +Yes, clear patterns +Yes, minor differences +No, they seem similar +Not sure + +**Question 2** +If yes, please describe what you noticed (optional): + +**Question 3** +After completing this evaluation, do you think open-source AI models can +compete with commercial models in terms of quality? + +Yes, they seem comparable +Mostly yes, with minor differences +Not sure / It depends on the task +No, commercial models are clearly noticeable +I couldn't tell which was which + +**Question 4** +Any additional comments or feedback about this study? (optional) From 52569ad0949a45f850199cb71f5ec9de18808496 Mon Sep 17 00:00:00 2001 From: CaesarGhazi Date: Thu, 4 Dec 2025 02:50:24 +0200 Subject: [PATCH 2/6] updates to the md of the survey form --- 3_experiment/survey_form.md | 247 ++++++++++++++++++++++++++++++------ 1 file changed, 210 insertions(+), 37 deletions(-) diff --git a/3_experiment/survey_form.md b/3_experiment/survey_form.md index 916ae49..879b6a2 100644 --- a/3_experiment/survey_form.md +++ b/3_experiment/survey_form.md @@ -121,94 +121,136 @@ All texts were generated based on excerpts from the Apollo 11 mission transcript #### TASK 1.1- Task type: Summarization -**Question:** Summarize the main events during the Apollo 11 lunar landing in 3 sentences. +**Question:** + +Summarize the main events during the Apollo 11 lunar landing in 3 sentences. 
+ **Text A:** + The main events of the Apollo 11 lunar landing included the lunar module Eagle separating from the command module and descending to the Moon's surface. Upon landing, Neil Armstrong famously declared, Houston, Tranquility Base here. The Eagle has landed. Shortly after, Armstrong stepped onto the lunar surface, becoming the first human to walk on the Moon, followed by Buzz Aldrin + **Text B:** + The computer prevented an abort during the Apollo 11 lunar landing, and a complete set of recovery programs were incorporated into the software. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 #### TASK 1.2 - Task type: Summarization Question: Explain what scientific equipment the astronauts deployed on the Moon. + **Text A:** + The astronauts deployed the following scientific equipment on the Moon: Seismometer: To measure moonquakes and meteoroid impacts. 
Laser Ranging Retroreflector (LRRR): To precisely measure the distance between the Earth and the Moon using lasers. Solar Wind Spectrometer: To analyze the composition of the solar wind. Magnetometer: To measure the Moon's magnetic field. + **Text B:** + The astronauts deployed the EASEP, which included a Passive Seismic Experiment Package used to measure moonquakes and a retroreflector array used for the lunar laser ranging experiment. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 --- @@ -217,7 +259,9 @@ Faithfulness:How true to the source material? 1-5 Question: Explain how Armstrong’s decisions, actions, and teamwork during the descent contributed to the mission’s success. + **Text A:** + Neil Armstrong and Buzz Aldrin's Moon landing epitomized remarkable human courage. 
Despite facing numerous challenges, including unexpected technical issues and the inherent dangers of space travel, the astronauts persevered and achieved @@ -229,43 +273,61 @@ their experiences, perhaps contemplating the profound impact of their mission on both individual and collective human spirit. Alternatively, one could speculate on how the Moon landing might have illuminated deeper truths about human resilience and the indomitable will to explore and conquer. + **Text B:** + Armstrong took semi-automatic control when he noticed the computer’s target was strewn with boulders. Aldrin relayed navigation data while Armstrong carefully piloted the spacecraft to a clear area, managing fuel and lunar dust interference. His steady judgment, combined with Aldrin’s support, ensured a safe touchdown with only seconds of fuel remaining, showcasing precise teamwork and decisive action. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 
1-5 --- @@ -274,14 +336,18 @@ Faithfulness:How true to the source material? 1-5 Question: In your own words, describe what happened when the computer alarms appeared during the landing. + **Text A:** + The LM guidance computer (LGC) exhibited unexpected "1201" and "1202" program alarms five minutes into the descent, signaling an issue with executive overflows. This indicated that the computer was unable to complete all its tasks in real-time and had to defer some. Nonetheless, these alarms did not necessitate an abort; instead, the computer automatically prioritized and resumed critical functions, thereby averting a potential failure. + **Text B:** + During the landing, the lunar module’s guidance computer triggered unexpected 1201 and 1202 alarms, signaling it couldn’t process all tasks at once. Mission Control quickly confirmed it was safe to continue, and the computer @@ -289,36 +355,52 @@ automatically prioritized critical tasks over less important ones, preventing a potential abort. This built-in recovery allowed the landing to proceed safely. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 
1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 --- @@ -326,47 +408,67 @@ Faithfulness:How true to the source material? 1-5 #### TASK 3.1 - Task type: Reasoning Question: Why did the computer alarms (1201 and 1202) occur during the descent? + **Text A:** + The 1201 and 1202 program alarms indicated "executive overflows," meaning the guidance computer could not complete all its tasks in real-time and had to postpone some of them. Rather than forcing an abort, the computer's recovery programs eliminated lower priority tasks and re-established the more important ones, which prevented an abort and allowed the mission to continue successfully. + **Text B:** + The program alarms (1201 and 1202) occurred because the guidance computer (LGC) experienced "executive overflows," indicating it was overloaded with tasks and unable to complete all its real-time tasks. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 
1-5 + Faithfulness:How true to the source material? 1-5 --- @@ -375,7 +477,9 @@ Faithfulness:How true to the source material? 1-5 Question: Based on the text, what does Margaret Hamilton's statement reveal about the Apollo Guidance Computer's design philosophy? + **Text A:** + Margaret Hamilton's statement reveals that the Apollo Guidance Computer was designed not merely to detect errors, but to actively recover from them through built-in recovery programs. Rather than forcing an abort when problems occurred, @@ -383,7 +487,9 @@ the software was programmed to prioritize critical tasks and postpone lower-priority ones, allowing the mission to continue safely. This demonstrates that the computer's design philosophy prioritized resilience and mission success over simply alerting the crew to problems. + **Text B:** + Margaret Hamilton's statement emphasizes the Apollo Guidance Computer's design philosophy, highlighting its ability to handle unexpected situations and ensure mission success. The computer was not simply an error detector, but was programmed to handle @@ -392,36 +498,52 @@ This suggests the Apollo Guidance Computer was designed with a robust and adaptable system in mind, capable of handling complex situations and ensuring mission success. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 
1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 --- @@ -430,7 +552,9 @@ Faithfulness:How true to the source material? 1-5 Question: Imagine being one of the people in Mission Control. How would you feel while watching the landing? + **Text A:** + As a member of Mission Control, I would feel an intense mix of emotions while watching the historic lunar landing. The tension would be palpable as I watched Armstrong and Aldrin navigate the lunar surface, their reports of being @@ -448,7 +572,9 @@ crew would be crucial in maintaining their confidence and focus. Overall, the experience would be a rollercoaster of emotions, filled with tension, excitement, pride, and responsibility. I would be constantly on the edge of my seat, my heart racing as I watched the historic landing unfold. + **Text B:** + Standing in that room would feel like holding your breath for an eternity, watching the tension mount as the computer screamed out alarms that threatened to abort the mission just miles above the surface. You’d be gripped by a unique @@ -461,36 +587,52 @@ the collective release of tension would be overwhelming, a room full of people suddenly remembering how to breathe again after the longest few minutes of their lives. -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? 
Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 --- @@ -499,7 +641,9 @@ Faithfulness:How true to the source material? 1-5 Question: Write a short paragraph about what the Moon landing might have shown about human courage + **Text A:** + The Moon landing was a testament to human courage and resilience. Despite encountering unexpected challenges, such as passing landmarks earlier than anticipated and the LM traveling too fast, the astronauts, Armstrong and Aldrin, remained @@ -513,7 +657,9 @@ propellant supply dwindled. Throughout the descent, Aldrin called out navigation data, and Armstrong skillfully piloted the spacecraft. Their courage and determination ultimately led to a successful Moon landing, demonstrating the human capacity to overcome adversity and push the boundaries of exploration. + **Text B:** + The Moon landing reveals a specific brand of courage defined by icy composure under extreme pressure. 
It is best exemplified when Armstrong, faced with a computer screaming "executive overflow" alarms and a landing site cluttered with dangerous @@ -525,36 +671,52 @@ yet still having the presence of mind to land gently enough to merely say, alien world, the human spirit can remain steady enough to turn a near-crisis into a historic triumph -**Question 1** +**Question 1:** + Which text do you prefer? Which one is better in you opinion? Text A + Text B + I'm not sure / Can't tell the difference -**Question 2** +**Question 2:** + Can you identify which model generated which text? Text A is open-source model, Text B is commercial model + Text A is commercial model, Text B is open-source model + Both are commercial models + Both are open-source models + I'm not sure -**Question 3** +**Question 3:** + Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 -**Question 4** +**Question 4:** + Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) Accuracy:How factually correct is the information? 1-5 + Clarity: How clear and easy to understand? 1-5 + Relevance:How well does it answer the question? 1-5 + Faithfulness:How true to the source material? 1-5 --- @@ -563,26 +725,37 @@ Faithfulness:How true to the source material? 1-5 Thank you for evaluating the texts! We'd love to hear your overall thoughts. -**Question 1** +**Question 1:** + Did you notice any patterns or differences between the texts? Yes, clear patterns + Yes, minor differences + No, they seem similar + Not sure -**Question 2** +**Question 2:** + If yes, please describe what you noticed (optional): -**Question 3** +**Question 3:** + After completing this evaluation, do you think open-source AI models can compete with commercial models in terms of quality? 
Yes, they seem comparable + Mostly yes, with minor differences + Not sure / It depends on the task + No, commercial models are clearly noticeable + I couldn't tell which was which -**Question 4** +**Question 4:** + Any additional comments or feedback about this study? (optional) From 29efcbb383050ffe0e9e9bd62d5b6bfd57b3f3e4 Mon Sep 17 00:00:00 2001 From: CaesarGhazi Date: Thu, 4 Dec 2025 02:52:27 +0200 Subject: [PATCH 3/6] updates --- 3_experiment/survey_form.md | 30 +++++++++++++++++++++++++++++- 1 file changed, 29 insertions(+), 1 deletion(-) diff --git a/3_experiment/survey_form.md b/3_experiment/survey_form.md index 879b6a2..54c666b 100644 --- a/3_experiment/survey_form.md +++ b/3_experiment/survey_form.md @@ -43,20 +43,31 @@ What is your Nationality? What is your native (first) language? English + Spanish + German + Arabic + Turkish + Chinese + Russian + Other: How would you rate your English proficiency? Native speaker + Fluent (C2 - can understand everything easily) -Advanced (C1 - can understand complex texta) + +Advanced (C1 - can understand complex text) + Intermediate (B1-B2 - can understand most everyday texts) + Basic (A1-A2 - still learning) What is your current Occupation or field of study? @@ -65,31 +76,48 @@ e.g., Computer Science Student, Software Engineer, Teacher, Researcher, Marketin How often do you use AI tools (e.g., ChatGPT, Claude, Gemini, Copilot, etc.)? Daily + Several times a week + Once a week + A few times a month + Rarely + Never What do you primarily use AI tools for? (select all that apply) Writing and editing assistance + Research and information gathering + Coding and programming + Creative work (stories, art, brainstorming) + Learning and education + Work or professional tasks + Translation or language learning + Personal use or entertainment + I don't use AI tools + Other: Are you familiar with the difference between open-source and commercial AI models? 
Yes, very familiar + Somewhat familiar + I've heard of it but don't know details + No, not familiar --- From 7d7411318a72a3b83d98ec45de256670f5dae9a2 Mon Sep 17 00:00:00 2001 From: CaesarGhazi Date: Thu, 4 Dec 2025 22:17:47 +0200 Subject: [PATCH 4/6] adding bulletpoints to answers --- 3_experiment/survey_form.md | 348 ++++++++++++++++++------------------ 1 file changed, 178 insertions(+), 170 deletions(-) diff --git a/3_experiment/survey_form.md b/3_experiment/survey_form.md index 54c666b..83c3124 100644 --- a/3_experiment/survey_form.md +++ b/3_experiment/survey_form.md @@ -40,85 +40,89 @@ This diversity of perspectives is essential to our research. What is your Nationality? +- + What is your native (first) language? -English +- English -Spanish +- Spanish -German +- German -Arabic +- Arabic -Turkish +- Turkish -Chinese +- Chinese -Russian +- Russian -Other: +- Other: How would you rate your English proficiency? -Native speaker +- Native speaker -Fluent (C2 - can understand everything easily) +- Fluent (C2 - can understand everything easily) -Advanced (C1 - can understand complex text) +- Advanced (C1 - can understand complex text) -Intermediate (B1-B2 - can understand most everyday texts) +- Intermediate (B1-B2 - can understand most everyday texts) -Basic (A1-A2 - still learning) +- Basic (A1-A2 - still learning) What is your current Occupation or field of study? e.g., Computer Science Student, Software Engineer, Teacher, Researcher, Marketing +- + How often do you use AI tools (e.g., ChatGPT, Claude, Gemini, Copilot, etc.)? -Daily +- Daily -Several times a week +- Several times a week -Once a week +- Once a week -A few times a month +- A few times a month -Rarely +- Rarely -Never +- Never What do you primarily use AI tools for? 
(select all that apply) -Writing and editing assistance +- Writing and editing assistance -Research and information gathering +- Research and information gathering -Coding and programming +- Coding and programming -Creative work (stories, art, brainstorming) +- Creative work (stories, art, brainstorming) -Learning and education +- Learning and education -Work or professional tasks +- Work or professional tasks -Translation or language learning +- Translation or language learning -Personal use or entertainment +- Personal use or entertainment -I don't use AI tools +- I don't use AI tools -Other: +- Other: Are you familiar with the difference between open-source and commercial AI models? -Yes, very familiar +- Yes, very familiar -Somewhat familiar +- Somewhat familiar -I've heard of it but don't know details +- I've heard of it but don't know details -No, not familiar +- No, not familiar --- @@ -170,49 +174,49 @@ complete set of recovery programs were incorporated into the software. Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? -Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 
1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 #### TASK 1.2 - Task type: Summarization @@ -237,49 +241,49 @@ experiment. Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? -Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 
1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -314,49 +318,49 @@ with only seconds of fuel remaining, showcasing precise teamwork and decisive ac Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? -Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 
1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -387,49 +391,49 @@ proceed safely. Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? -Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 
1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -455,49 +459,49 @@ and unable to complete all its real-time tasks. Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? -Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -530,49 +534,49 @@ mission success. Which text do you prefer? Which one is better in you opinion? 
-Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? -Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -619,49 +623,49 @@ lives. Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? 
-Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -703,49 +707,49 @@ into a historic triumph Which text do you prefer? Which one is better in you opinion? -Text A +- Text A -Text B +- Text B -I'm not sure / Can't tell the difference +- I'm not sure / Can't tell the difference **Question 2:** Can you identify which model generated which text? 
-Text A is open-source model, Text B is commercial model +- Text A is open-source model, Text B is commercial model -Text A is commercial model, Text B is open-source model +- Text A is commercial model, Text B is open-source model -Both are commercial models +- Both are commercial models -Both are open-source models +- Both are open-source models -I'm not sure +- I'm not sure **Question 3:** Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 **Question 4:** Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) -Accuracy:How factually correct is the information? 1-5 +- Accuracy:How factually correct is the information? 1-5 -Clarity: How clear and easy to understand? 1-5 +- Clarity: How clear and easy to understand? 1-5 -Relevance:How well does it answer the question? 1-5 +- Relevance:How well does it answer the question? 1-5 -Faithfulness:How true to the source material? 1-5 +- Faithfulness:How true to the source material? 1-5 --- @@ -757,33 +761,37 @@ Thank you for evaluating the texts! We'd love to hear your overall thoughts. Did you notice any patterns or differences between the texts? -Yes, clear patterns +- Yes, clear patterns -Yes, minor differences +- Yes, minor differences -No, they seem similar +- No, they seem similar -Not sure +- Not sure **Question 2:** If yes, please describe what you noticed (optional): +- + **Question 3:** After completing this evaluation, do you think open-source AI models can compete with commercial models in terms of quality? 
-Yes, they seem comparable +- Yes, they seem comparable -Mostly yes, with minor differences +- Mostly yes, with minor differences -Not sure / It depends on the task +- Not sure / It depends on the task -No, commercial models are clearly noticeable +- No, commercial models are clearly noticeable -I couldn't tell which was which +- I couldn't tell which was which **Question 4:** Any additional comments or feedback about this study? (optional) + +- From b85ec8cdf03c402c914db7c6b965d441d1d7eaa8 Mon Sep 17 00:00:00 2001 From: CaesarGhazi Date: Thu, 4 Dec 2025 22:35:18 +0200 Subject: [PATCH 5/6] adding README to the open_source_models folder --- 2_open_source_models/README.md | 133 +++++++++++++++++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 2_open_source_models/README.md diff --git a/2_open_source_models/README.md b/2_open_source_models/README.md new file mode 100644 index 0000000..51d1d57 --- /dev/null +++ b/2_open_source_models/README.md @@ -0,0 +1,133 @@ +# **Open-Source Model Experiments** + +This directory contains four standalone experiments exploring +**local, open-source language models** for Retrieval-Augmented Generation +(RAG), model evaluation, recursive editing, and sustainability tracking +(energy & CO₂ emissions). +Each subfolder includes its own notebook, documentation, outputs, and +model-specific setup. + +--- + +## Directory Structure + +```text +2_open_source_models/ +│ +├── distilled_models/ +│ └── rag_and_distilled_model/ +│ +├── quantized_models/ +│ └── mistral7b/ +│ +└── slm/ + ├── google_gemm/ + └── qwen/ +``` + +Each subfolder contains a self-contained model with its own README, +notebook(s), generated outputs, and energy/emissions logs where applicable. + +--- + +## Project Summaries + +Below is a concise description of each model project, giving an +at-a-glance view of the purpose of the overall folder. + +--- + +### **1.
Distilled Models – RAG + Instruction-Tuned Distilled LMs**
+
+**Folder:** `distilled_models/rag_and_distilled_model/`
+**Notebook:** `Apollo11_rag&distilled.ipynb`
+
+This project uses a lightweight **LaMini-Flan-T5-248M** distilled model
+combined with a **MiniLM** embedding model to run a fully local
+Retrieval-Augmented Generation pipeline on the Apollo 11 dataset.
+It demonstrates:
+
+* Local embeddings and ChromaDB vector storage
+* RAG-based question answering
+* Evaluation across several prompt types
+* Emissions tracking and generated output logs
+
+Ideal for showing how **compact distilled models** can handle
+RAG efficiently on CPU or modest GPU hardware.
+
+---
+
+### **2. Quantized Models – Mistral 7B RAG Pipeline**
+
+**Folder:** `quantized_models/mistral7b/`
+
+This project evaluates a **quantized Mistral-7B (GGUF)** model running
+fully locally via `llama-cpp-python`.
+It focuses on:
+
+* Retrieval-Augmented Generation using LlamaIndex
+* Local inference using a 4-bit quantized LLM
+* Document processing, embedding (BGE-small), and top-k retrieval
+* Practical observations on feasibility and performance on a laptop
+
+A strong example of how quantization enables
+**large-model capability at small-device cost**.
+
+---
+
+### **3. Small Language Model (SLM): Google Gemma 2-2B**
+
+**Folder:** `slm/google_gemm/`
+
+This experiment implements a structured RAG workflow with Google’s lightweight
+**Gemma 2-2B** model and a fixed Apollo 11 source text.
+Key features include:
+
+* Standardized 21-prompt evaluation set
+* RAG pipeline with chunked retrieval
+* Draft to Critic to Refiner multi-step generation
+* Real-time emissions logging with CodeCarbon
+* Fully reproducible testing and reporting
+
+This project demonstrates how even very small open-weight models can
+perform multi-step reasoning when paired with thoughtful prompting and revision
+cycles.
+
+---
+
+### **4. Small Language Model (SLM): Qwen 2.5B + Recursive Editing**
+
+**Folder:** `slm/qwen/`
+
+This notebook experiments with **Qwen 2.5B**, integrating:
+
+* RAG retrieval
+* A recursive editing loop (Draft to Critic to Refine)
+* Context retrieval through Hugging Face embeddings
+* Energy + CO₂ logging for each query
+
+Outputs are saved in markdown form with all iterations and emissions data.
+
+---
+
+## Purpose of This Collection
+
+This folder exists to:
+
+* Compare how different **model sizes**, **architectures**, and
+**inference strategies** behave on the **same tasks**.
+* Demonstrate **fully local RAG pipelines** using only open-source components.
+* Document **energy and carbon trade-offs** in local LLM usage.
+* Provide reproducible examples that can be extended or rerun with other models.
+
+Each subfolder is designed as a standalone experiment, but together they
+form a cohesive study of open-source LLM efficiency and performance.
+
+---
+
+## Notes
+
+* All code is intended to run locally.
+* Each folder includes its own notebook and README with instructions.
+* Energy/emissions reporting is included where relevant (via CodeCarbon).
+* Datasets and prompts are standardized across projects for fairness and comparability.

From b6a395ef8421f9c14b866cd7b4d9b41377152a0b Mon Sep 17 00:00:00 2001
From: CaesarGhazi
Date: Thu, 4 Dec 2025 23:27:48 +0200
Subject: [PATCH 6/6] updates to the readme

---
 2_open_source_models/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/2_open_source_models/README.md b/2_open_source_models/README.md
index 51d1d57..c014a66 100644
--- a/2_open_source_models/README.md
+++ b/2_open_source_models/README.md
@@ -21,7 +21,7 @@ model-specific setup.
 │ └── mistral7b/
 │
 └── slm/
- ├── google_gemm/
+ ├── google_gemma/
  └── qwen/
 ```
 
@@ -77,7 +77,7 @@ A strong example of how quantization enables
 
 ### **3. Small Language Model (SLM): Google Gemma 2-2B**
 
-**Folder:** `slm/google_gemm/`
+**Folder:** `slm/google_gemma/`
 
 This experiment implements a structured RAG workflow with Google’s lightweight
 **Gemma 2-2B** model and a fixed Apollo 11 source text.