
Commit 5c43913 (1 parent: e6ecc5c)

adding the readme for the experiment folder and the survey form

2 files changed: 665 additions & 0 deletions

3_experiment/README.md: 77 additions & 0 deletions
# AI Model Comparison Experiment

## Evaluating Open-Source vs. Commercial Language Models

This folder contains the materials for our experiment comparing open-source and commercial AI models through human evaluation. Participants were asked to read pairs of AI-generated texts and judge their quality without knowing which model produced which text.

---
## What This Experiment Is

We created a survey where each question includes two texts—**Text A** and **Text B**—generated by different AI models. One text always comes from an **open-source model**, and the other from a **commercial model**. Participants:

* Choose which text they prefer
* Guess which model type generated each text
* Rate both texts (accuracy, clarity, relevance, faithfulness)

All evaluations are blind to remove brand bias.

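
The three participant tasks above could be captured in a record like the following (a hypothetical sketch: the class name, field names, and rating values are illustrative and are not taken from the actual survey form):

```python
from dataclasses import dataclass

# Hypothetical record for one paired question; field names are
# illustrative, not the actual survey schema.
@dataclass
class PairedResponse:
    question_id: int
    preferred: str               # "A" or "B"
    guess_a_is_commercial: bool  # participant's guess about Text A's source
    ratings_a: dict              # per-dimension scores for Text A
    ratings_b: dict              # per-dimension scores for Text B

# Example response covering all three tasks (scores are made up).
r = PairedResponse(
    question_id=1,
    preferred="A",
    guess_a_is_commercial=False,
    ratings_a={"accuracy": 4, "clarity": 5, "relevance": 4, "faithfulness": 3},
    ratings_b={"accuracy": 3, "clarity": 4, "relevance": 4, "faithfulness": 4},
)
```

Keeping the model identity out of the record entirely (only "A"/"B") is one way to enforce the blind setup at the data layer.
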

---

## Why We Did This

Open-source AI models are advancing quickly, and we wanted to understand whether they are perceived as competitive alternatives to commercial systems. While benchmarks can measure performance numerically, they don’t reflect how humans actually experience AI-generated writing.

This experiment aims to answer questions like:

* Do people notice a consistent quality difference?
* Can users accurately identify commercial vs. open-source output?
* Are open-source models “good enough” for real-world tasks?

Understanding these perceptions is important for evaluating the viability of sustainable, accessible, and transparent AI systems.

---

## Why We Chose This Method

We used a **paired, blind comparison** because it provides a clean way to assess text quality without model reputation influencing the results. Participants judge writing on its own merits, which helps us collect more reliable data.

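
The blind pairing might be implemented along these lines (a minimal sketch; the function name and sample texts are illustrative, not taken from the study materials):

```python
import random

def assign_blind_pair(open_source_text, commercial_text, rng):
    # Randomly decide which model's output appears as Text A, so
    # participants cannot infer the source from position alone.
    # The answer key stays private until analysis.
    if rng.random() < 0.5:
        pair = {"Text A": open_source_text, "Text B": commercial_text}
        key = {"Text A": "open_source", "Text B": "commercial"}
    else:
        pair = {"Text A": commercial_text, "Text B": open_source_text}
        key = {"Text A": "commercial", "Text B": "open_source"}
    return pair, key

pair, key = assign_blind_pair(
    "summary from the open-source model",
    "summary from the commercial model",
    random.Random(7),  # seeded so the assignment is reproducible
)
```

Seeding the random generator per question makes the hidden assignment reproducible when scoring participant guesses later.
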

We included multiple task types (summarization, paraphrasing, reasoning, and creative writing) because each one tests a different aspect of model behavior. This variety gives us a broader picture of model strengths and weaknesses.

---

## Why This Approach Works Well

This survey-based structure is simple and easy for participants to understand. It mirrors how people naturally interact with AI systems: reading text and forming opinions about quality. By keeping the evaluation blind, we minimize bias and generate more meaningful insights into real user perception.

The method also helps determine whether open-source models, especially optimized ones, can realistically serve as alternatives to commercial systems in practical use.

---

## Contents of This Folder

```text
3_experiment/
├── survey_form.md   # The form text used in the study
└── README.md        # Explanation of the experiment (this file)
```
