diff --git a/2_open_source_models/README.md b/2_open_source_models/README.md new file mode 100644 index 0000000..c014a66 --- /dev/null +++ b/2_open_source_models/README.md @@ -0,0 +1,133 @@ +# **Open-Source Model Experiments** + +This directory contains four standalone experiments exploring +**local, open-source language models** for Retrieval-Augmented Generation +(RAG), model evaluation, recursive editing, and sustainability tracking +(energy & CO₂ emissions). +Each subfolder includes its own notebook, documentation, outputs, and +model-specific setup. + +--- + +## Directory Structure + +```text +2_open_source_models/ +│ +├── distilled_models/ +│ └── rag_and_distilled_model/ +│ +├── quantized_models/ +│ └── mistral7b/ +│ +└── slm/ + ├── google_gemma/ + └── qwen/ +``` + +Each subfolder contains a self-contained experiment with its own README, +notebook(s), generated outputs, and energy/emissions logs where applicable. + +--- + +## Project Summaries + +Below is a concise description of each project, giving an at-a-glance view +of the folder's overall purpose. + +--- + +### **1. Distilled Models – RAG + Instruction-Tuned Distilled LMs** + +**Folder:** `distilled_models/rag_and_distilled_model/` +**Notebook:** `Apollo11_rag&distilled.ipynb` + +This project uses a lightweight **LaMini-Flan-T5-248M** distilled model +combined with a **MiniLM** embedding model to run a fully local +Retrieval-Augmented Generation pipeline on the Apollo 11 dataset. +It demonstrates: + +* Local embeddings and ChromaDB vector storage +* RAG-based question answering +* Evaluation across several prompt types +* Emissions tracking and generated output logs + +Ideal for showing how **compact distilled models** can handle +RAG efficiently on CPU or modest GPU hardware. + +--- + +### **2. Quantized Models – Mistral 7B RAG Pipeline** + +**Folder:** `quantized_models/mistral7b/` + +This project evaluates a **quantized Mistral-7B (GGUF)** model running +fully locally via `llama-cpp-python`. 
+It focuses on: + +* Retrieval-Augmented Generation using LlamaIndex +* Local inference using a 4-bit quantized LLM +* Document processing, embedding (BGE-small), and top-k retrieval +* Practical observations on feasibility and performance on a laptop + +A strong example of how quantization enables +**large-model capability at small-device cost**. + +--- + +### **3. Small Language Model (SLM): Google Gemma 2-2B** + +**Folder:** `slm/google_gemma/` + +This experiment implements a structured RAG workflow with Google’s lightweight +**Gemma 2-2B** model and a fixed Apollo 11 source text. +Key features include: + +* Standardized 21-prompt evaluation set +* RAG pipeline with chunked retrieval +* Draft to Critic to Refiner multi-step generation +* Real-time emissions logging with CodeCarbon +* Fully reproducible testing and reporting + +This project demonstrates how even very small open-weight models can +perform multi-step reasoning when paired with thoughtful prompting and revision +cycles. + +--- + +### **4. Small Language Model (SLM): Qwen 2.5B + Recursive Editing** + +**Folder:** `slm/qwen/` + +This notebook experiments with **Qwen 2.5B**, integrating: + +* RAG retrieval +* A recursive editing loop (Draft to Critic to Refine) +* Context retrieval through Hugging Face embeddings +* Energy + CO₂ logging for each query + +Outputs are saved in markdown form with all iterations and emissions data. + +--- + +## Purpose of This Collection + +This folder exists to: + +* Compare how different **model sizes**, **architectures**, and +**inference strategies** behave on the **same tasks**. +* Demonstrate **fully local RAG pipelines** using only open-source components. +* Document **energy and carbon trade-offs** in local LLM usage. +* Provide reproducible examples that can be extended or rerun with other models. + +Each subfolder is designed as a standalone experiment, but together they +form a cohesive study of open-source LLM efficiency and performance. 
+ +--- + +## Notes + +* All code is intended to run locally. +* Each folder includes its own notebook and README with instructions. +* Energy/emissions reporting is included where relevant (via CodeCarbon). +* Datasets and prompts are standardized across projects for fairness and comparability. diff --git a/3_experiment/README.md b/3_experiment/README.md index e69de29..5be675f 100644 --- a/3_experiment/README.md +++ b/3_experiment/README.md @@ -0,0 +1,77 @@ +# AI Model Comparison Experiment + +## Evaluating Open-Source vs. Commercial Language Models + +This folder contains the materials for our experiment comparing open-source and +commercial AI models through human evaluation. Participants were asked to read +pairs of AI-generated texts and judge their quality without knowing which model +produced which text. + +--- + +## What This Experiment Is + +We created a survey where each question includes two texts—**Text A** and +**Text B**—generated by different AI models. One text always comes from an +**open-source model**, and the other from a **commercial model**. Participants: + +* Choose which text they prefer +* Guess which model type generated each text +* Rate both texts (accuracy, clarity, relevance, faithfulness) + +All evaluations are blind to remove brand bias. + +--- + +## Why We Did This + +Open-source AI models are advancing quickly, and we wanted to understand +whether they are perceived as competitive alternatives to commercial systems. +While benchmarks can measure performance numerically, they don’t reflect how +humans actually experience AI-generated writing. + +This experiment aims to answer questions like: + +* Do people notice a consistent quality difference? +* Can users accurately identify commercial vs. open-source output? +* Are open-source models “good enough” for real-world tasks? + +Understanding these perceptions is important for evaluating the viability of +sustainable, accessible, and transparent AI systems. 
+ +--- + +## Why We Chose This Method + +We used a **paired, blind comparison** because it provides a clean way to +assess text quality without model reputation influencing the results. +Participants judge writing on its own merits, which helps us collect more +reliable data. + +We included multiple task types (summarization, paraphrasing, reasoning, and +creative writing) because each one tests a different aspect of model behavior. + +This variety gives us a broader picture of model strengths and weaknesses. + +--- + +## Why This Approach Works Well + +This survey-based structure is simple and easy for participants to +understand. It mirrors how people naturally interact with AI systems: reading +text and forming opinions about quality. By keeping the evaluation blind, we +minimize bias and generate more meaningful insights into real user perception. + +The method also helps determine whether open-source models, especially optimized +ones, can realistically serve as alternatives to commercial systems +in practical use. + +--- + +## Contents of This Folder + +```text +3_experiment/ +├── survey_form.md # The form text used in the study +└── README.md # Explanation of the experiment (this file) +``` diff --git a/3_experiment/survey_form.md b/3_experiment/survey_form.md new file mode 100644 index 0000000..83c3124 --- /dev/null +++ b/3_experiment/survey_form.md @@ -0,0 +1,797 @@ +# Open-Source vs. Commercial AI: Comparing Performance and Quality + +Welcome and thank you for participating in this research study! + +This study investigates a critical question: +"Can open-source language models, enhanced with optimization techniques such as +recursive editing and distillation, become environmentally and functionally viable +alternatives to commercial models?" 
+ +## YOUR TASK + +- Read each text carefully +- Guess whether it was generated by an open-source or commercial model +- Rate each text on clarity, relevance, accuracy, and faithfulness + +## YOUR ROLE + +Your honest evaluations will help us understand: + +- Whether open-source models can achieve comparable quality to commercial alternatives +- How users perceive differences (or similarities) between the two types +- The viability of more sustainable and accessible AI solutions + +## PRIVACY & ETHICS + +All responses will be completely anonymous and used solely for academic research +purposes. There are no right or wrong answers. We value your genuine, unbiased +impressions and opinions. + +By proceeding, you consent to participate in this study. + +### Participant Background + +Please tell us a bit about yourself. + +Your background and experience with AI tools will help us understand how +different people with varying levels of AI familiarity and from diverse +cultural contexts perceive and evaluate AI-generated content. +This diversity of perspectives is essential to our research. + +What is your nationality? + +- + +What is your native (first) language? + +- English + +- Spanish + +- German + +- Arabic + +- Turkish + +- Chinese + +- Russian + +- Other: + +How would you rate your English proficiency? + +- Native speaker + +- Fluent (C2 - can understand everything easily) + +- Advanced (C1 - can understand complex text) + +- Intermediate (B1-B2 - can understand most everyday texts) + +- Basic (A1-A2 - still learning) + +What is your current occupation or field of study? +e.g., Computer Science Student, Software Engineer, Teacher, Researcher, Marketing + +- + +How often do you use AI tools (e.g., ChatGPT, Claude, Gemini, Copilot)? + +- Daily + +- Several times a week + +- Once a week + +- A few times a month + +- Rarely + +- Never + +What do you primarily use AI tools for? 
(select all that apply) + +- Writing and editing assistance + +- Research and information gathering + +- Coding and programming + +- Creative work (stories, art, brainstorming) + +- Learning and education + +- Work or professional tasks + +- Translation or language learning + +- Personal use or entertainment + +- I don't use AI tools + +- Other: + +Are you familiar with the difference between open-source and commercial AI +models? + +- Yes, very familiar + +- Somewhat familiar + +- I've heard of it but don't know details + +- No, not familiar + +--- + +### Text Evaluations + +Below you will see several pairs of texts about the Apollo 11 lunar landing and moonwalk +activities. +Each pair contains: + +- Text A and Text B - two AI-generated responses to the same question +- One text was generated by an open-source model (e.g., Mistral, Gemma, LaMini) +- One text was generated by a commercial model (e.g., GPT-4, Claude) +Your task: + +1. Identify which text is from which type of model +2. Evaluate the quality of BOTH texts +The questions cover different task types: paraphrasing, summarization, +creative writing, and reasoning. + +--- + +#### Source Material + +All texts were generated based on excerpts from the Apollo 11 mission transcript. +📎 You can find the source text here + +--- + +#### TASK 1.1 - Task type: Summarization + +**Question:** + +Summarize the main events during the Apollo 11 lunar landing in 3 sentences. + +**Text A:** + +The main events of the Apollo 11 lunar landing included the lunar module Eagle separating +from the command module and descending to the Moon's surface. Upon landing, Neil +Armstrong famously declared, Houston, Tranquility Base here. The Eagle has +landed. 
Shortly after, Armstrong stepped onto the lunar surface, becoming the +first human to walk on the Moon, followed by Buzz Aldrin + +**Text B:** + +The computer prevented an abort during the Apollo 11 lunar landing, and a +complete set of recovery programs were incorporated into the software. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +#### TASK 1.2 - Task type: Summarization + +Question: Explain what scientific equipment the astronauts deployed on the Moon. + +**Text A:** + +The astronauts deployed the following scientific equipment on the Moon: +Seismometer: To measure moonquakes and meteoroid impacts. +Laser Ranging Retroreflector (LRRR): To precisely measure the distance between the +Earth and the Moon using lasers. +Solar Wind Spectrometer: To analyze the composition of the solar wind. +Magnetometer: To measure the Moon's magnetic field. 
+ +**Text B:** + +The astronauts deployed the EASEP, which included a Passive Seismic Experiment Package +used to measure moonquakes and a retroreflector array used for the lunar laser ranging +experiment. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +--- + +#### TASK 2.1 - Task type: Paraphrasing + +Question: Explain how Armstrong’s decisions, actions, and teamwork during the descent +contributed to the mission’s success. + +**Text A:** + +Neil Armstrong and Buzz Aldrin's Moon landing epitomized remarkable human courage. +Despite facing numerous challenges, including unexpected technical issues and +the inherent dangers of space travel, the astronauts persevered and achieved +their goal of landing on the Moon. Their determination and bravery in +overcoming these obstacles underscore humanity's capacity for extraordinary +courage and perseverance in the face of seemingly insurmountable challenges. 
+To further enrich this narrative, one could imagine the astronauts reflecting on +their experiences, perhaps contemplating the profound impact of their mission +on both individual and collective human spirit. Alternatively, one could +speculate on how the Moon landing might have illuminated deeper truths +about human resilience and the indomitable will to explore and conquer. + +**Text B:** + +Armstrong took semi-automatic control when he noticed the computer’s target was strewn +with boulders. Aldrin relayed navigation data while Armstrong carefully piloted +the spacecraft to a clear area, managing fuel and lunar dust interference. +His steady judgment, combined with Aldrin’s support, ensured a safe touchdown +with only seconds of fuel remaining, showcasing precise teamwork and decisive action. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 
1-5 + +--- + +#### TASK 2.2 - Task type: Paraphrasing + +Question: In your own words, describe what happened when the computer alarms appeared +during the landing. + +**Text A:** + +The LM guidance computer (LGC) exhibited unexpected "1201" and "1202" program alarms +five minutes into the descent, signaling an issue with executive overflows. +This indicated that the computer was unable to complete all its tasks in +real-time and had to defer some. Nonetheless, these alarms did not necessitate +an abort; instead, the computer automatically prioritized and resumed critical +functions, thereby averting a potential failure. + +**Text B:** + +During the landing, the lunar module’s guidance computer triggered unexpected +1201 and 1202 alarms, signaling it couldn’t process all tasks at once. Mission +Control quickly confirmed it was safe to continue, and the computer +automatically prioritized critical tasks over less important ones, +preventing a potential abort. This built-in recovery allowed the landing to +proceed safely. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 
1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +--- + +#### TASK 3.1 - Task type: Reasoning + +Question: Why did the computer alarms (1201 and 1202) occur during the descent? + +**Text A:** + +The 1201 and 1202 program alarms indicated "executive overflows," meaning the guidance +computer could not complete all its tasks in real-time and had to postpone +some of them. Rather than forcing an abort, the computer's recovery programs +eliminated lower priority tasks and re-established the more important ones, +which prevented an abort and allowed the mission to continue successfully. + +**Text B:** + +The program alarms (1201 and 1202) occurred because the guidance computer (LGC) +experienced "executive overflows," indicating it was overloaded with tasks +and unable to complete all its real-time tasks. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 
1-5 + +--- + +#### TASK 3.2 - Task type: Reasoning + +Question: Based on the text, what does Margaret Hamilton's statement reveal +about the Apollo Guidance Computer's design philosophy? + +**Text A:** + +Margaret Hamilton's statement reveals that the Apollo Guidance Computer was +designed not merely to detect errors, but to actively recover from them through +built-in recovery programs. Rather than forcing an abort when problems occurred, +the software was programmed to prioritize critical tasks and postpone +lower-priority ones, allowing the mission to continue safely. This demonstrates +that the computer's design philosophy prioritized resilience and mission success +over simply alerting the crew to problems. + +**Text B:** + +Margaret Hamilton's statement emphasizes the Apollo Guidance Computer's design +philosophy, highlighting its ability to handle unexpected situations and ensure mission +success. The computer was not simply an error detector, but was programmed to handle +"executive overflows" and prioritize essential tasks over less critical ones. +This suggests the Apollo Guidance Computer was designed with a robust and +adaptable system in mind, capable of handling complex situations and ensuring +mission success. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 
1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +--- + +#### TASK 4.1 - Task type: Creative Generation + +Question: Imagine being one of the people in Mission Control. How would you +feel while watching the landing? + +**Text A:** + +As a member of Mission Control, I would feel an intense mix of emotions while +watching the historic lunar landing. The tension would be palpable as I watched +Armstrong and Aldrin navigate the lunar surface, their reports of being +off-target and encountering unexpected alarms adding to the anxiety. I would be +constantly on edge, my heart racing as I listened to the crew's updates and the +computer's alarms. At the same time, I would also feel a deep sense of pride +and awe. I would be part of a team that was making history, pushing the +boundaries of human exploration and achievement. The knowledge that we were +safely guiding two astronauts to the Moon's surface, despite the challenges +they faced, would be an incredible feeling. I would also feel a great sense +of responsibility. The safety and success of the mission were in my hands, and +I would be working tirelessly to ensure that everything ran smoothly. The team's +quick response to the unexpected alarms and Jack Garman's reassurance to the +crew would be crucial in maintaining their confidence and focus. Overall, the +experience would be a rollercoaster of emotions, filled with tension, +excitement, pride, and responsibility. I would be constantly on the edge of my +seat, my heart racing as I watched the historic landing unfold. 
+ +**Text B:** + +Standing in that room would feel like holding your breath for an eternity, +watching the tension mount as the computer screamed out alarms that threatened +to abort the mission just miles above the surface. You’d be gripped by a unique +kind of terror mixed with trust, relying on the split-second decisions of +colleagues clearing the descent despite the system overload. As the clock +ticked down and the fuel gauges dropped to their final, critical seconds, the +silence would be so heavy that faces would figuratively turn blue from the lack +of oxygen. The moment that "Eagle has landed" finally cut through the static, +the collective release of tension would be overwhelming, a room full of people +suddenly remembering how to breathe again after the longest few minutes of their +lives. + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 
1-5 + +--- + +#### TASK 4.2 - Task type: Creative Generation + +Question: Write a short paragraph about what the Moon landing might have shown about +human courage + +**Text A:** + +The Moon landing was a testament to human courage and resilience. Despite encountering +unexpected challenges, such as passing landmarks earlier than anticipated and +the LM traveling too fast, the astronauts, Armstrong and Aldrin, remained +focused and determined. They reported their findings to Mission Control, +who reassured them that it was safe to continue the descent. The unexpected +1201 and 1202 program alarms, which indicated executive overflows, could have +caused panic, but the computer, programmed to do more than just recognize error +conditions, took recovery actions and prevented an abort. This allowed Armstrong +to take semi-automatic control and find a clear landing site, even as their +propellant supply dwindled. Throughout the descent, Aldrin called out +navigation data, and Armstrong skillfully piloted the spacecraft. Their courage +and determination ultimately led to a successful Moon landing, demonstrating the +human capacity to overcome adversity and push the boundaries of exploration. + +**Text B:** + +The Moon landing reveals a specific brand of courage defined by icy composure under +extreme pressure. It is best exemplified when Armstrong, faced with a computer screaming +"executive overflow" alarms and a landing site cluttered with dangerous +boulders, didn't panic but instead took manual control to steer the ship to +safety. There is something profoundly brave about pushing forward when you +are miles off course and your fuel gauge is ticking down to the final seconds, +yet still having the presence of mind to land gently enough to merely say, +"contact light". 
It proves that even amidst the "magnificent desolation" of an +alien world, the human spirit can remain steady enough to turn a near-crisis +into a historic triumph + +**Question 1:** + +Which text do you prefer? Which one is better in your opinion? + +- Text A + +- Text B + +- I'm not sure / Can't tell the difference + +**Question 2:** + +Can you identify which model generated which text? + +- Text A is open-source model, Text B is commercial model + +- Text A is commercial model, Text B is open-source model + +- Both are commercial models + +- Both are open-source models + +- I'm not sure + +**Question 3:** + +Please rate Text A on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +**Question 4:** + +Please rate Text B on the following criteria (1 = Poor, 5 = Excellent) + +- Accuracy: How factually correct is the information? 1-5 + +- Clarity: How clear and easy to understand? 1-5 + +- Relevance: How well does it answer the question? 1-5 + +- Faithfulness: How true to the source material? 1-5 + +--- + +### Final Section - General Feedback + +Thank you for evaluating the texts! We'd love to hear your overall thoughts. + +**Question 1:** + +Did you notice any patterns or differences between the texts? + +- Yes, clear patterns + +- Yes, minor differences + +- No, they seem similar + +- Not sure + +**Question 2:** + +If yes, please describe what you noticed (optional): + +- + +**Question 3:** + +After completing this evaluation, do you think open-source AI models can +compete with commercial models in terms of quality? 
+ +- Yes, they seem comparable + +- Mostly yes, with minor differences + +- Not sure / It depends on the task + +- No, commercial models are clearly noticeable + +- I couldn't tell which was which + +**Question 4:** + +Any additional comments or feedback about this study? (optional) + +-