# AI Model Comparison Experiment

## Evaluating Open-Source vs. Commercial Language Models

This folder contains the materials for our experiment comparing open-source and
commercial AI models through human evaluation. Participants were asked to read
pairs of AI-generated texts and judge their quality without knowing which model
produced which text.

---

## What This Experiment Is

We created a survey where each question includes two texts, **Text A** and
**Text B**, generated by different AI models. One text always comes from an
**open-source model**, and the other from a **commercial model**. Participants:

* Choose which text they prefer
* Guess which model type generated each text
* Rate both texts (accuracy, clarity, relevance, faithfulness)

All evaluations are blind to remove brand bias.
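
As a minimal sketch of how such paired, blind judgments could be aggregated, the snippet below computes a preference rate and an identification-accuracy rate. The record format and field names (`preferred`, `guess_a`, `source_a`, `source_b`) are illustrative assumptions, not the study's actual data schema:

```python
# Hypothetical response records: which text was preferred, what the
# participant guessed Text A's model type to be, and which model type
# actually produced each text. Field names are assumed for illustration.
responses = [
    {"preferred": "A", "guess_a": "open", "source_a": "open", "source_b": "commercial"},
    {"preferred": "B", "guess_a": "commercial", "source_a": "open", "source_b": "commercial"},
    {"preferred": "A", "guess_a": "open", "source_a": "commercial", "source_b": "open"},
]

def open_source_preference_rate(rows):
    """Fraction of judgments in which the open-source text was preferred."""
    preferred_open = sum(
        1 for r in rows
        if r["source_a" if r["preferred"] == "A" else "source_b"] == "open"
    )
    return preferred_open / len(rows)

def identification_accuracy(rows):
    """Fraction of judgments in which Text A's model type was guessed correctly."""
    correct = sum(1 for r in rows if r["guess_a"] == r["source_a"])
    return correct / len(rows)
```

With the three sample records above, both rates come out to 1/3: the open-source text was preferred once, and Text A's source was guessed correctly once.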

---

## Why We Did This

Open-source AI models are advancing quickly, and we wanted to understand
whether they are perceived as competitive alternatives to commercial systems.
While benchmarks can measure performance numerically, they don’t reflect how
humans actually experience AI-generated writing.

This experiment aims to answer questions like:

* Do people notice a consistent quality difference?
* Can users accurately identify commercial vs. open-source output?
* Are open-source models “good enough” for real-world tasks?

Understanding these perceptions is important for evaluating the viability of
sustainable, accessible, and transparent AI systems.

---

## Why We Chose This Method

We used a **paired, blind comparison** because it provides a clean way to
assess text quality without model reputation influencing the results.
Participants judge writing on its own merits, which helps us collect more
reliable data.

We included four task types (summarization, paraphrasing, reasoning, and
creative writing) because each tests a different aspect of model behavior.
This variety gives us a broader picture of model strengths and weaknesses.

---

## Why This Approach Works Well

This survey-based structure is simple and easy for participants to
understand. It mirrors how people naturally interact with AI systems: reading
text and forming opinions about quality. By keeping the evaluation blind, we
minimize bias and generate more meaningful insights into real user perception.

The method also helps determine whether open-source models, especially optimized
ones, can realistically serve as alternatives to commercial systems in
practical use.

---

## Contents of This Folder

```text
3_experiment/
├── survey_form.md   # The form text used in the study
└── README.md        # Explanation of the experiment (this file)
```