@@ -6,8 +6,8 @@ This is the unified test dataset for comparing different AI models (commercial,
 distilled, SLM, and RAG systems) in the ELO2 - Green AI project.
 
 The dataset consists of selected passages from Wikipedia's Apollo 11 article,
-accompanied by 15 standardized prompts testing summarization, reasoning, and
-retrieval-augmented generation capabilities.
+accompanied by 21 standardized prompts testing summarization, reasoning,
+retrieval, paraphrasing, and creative generation capabilities.
 
 ---
 
@@ -16,7 +16,7 @@ retrieval-augmented generation capabilities.
 - **[README.md][readme]** - This file (overview and instructions)
 - **[source_text.txt][source]** - Apollo 11 excerpted text
 (~1,400 words, plain text)
-- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
+- **[test_prompts.md][prompts]** - All test prompts (readable format)
 - **[test_data.json][json]** - Complete dataset (structured format for automated
  testing)
 - **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions
@@ -56,7 +56,7 @@ operations"
 
 - Individual sentences are unchanged; some paragraphs omitted for length management.
 - Complete original sections total ~3,800 words; excerpted to ~1,400 words for
-practical testing while maintaining all information necessary for the 15 test prompts.
+practical testing while maintaining all information necessary for the 21 test prompts.
 
 📌 See [source_text.txt][source] for the complete excerpted text.
 
@@ -72,63 +72,93 @@ technical terms)
 be tested
 - ✅ **Narrative structure** - Clear sequence from descent through surface
 activities
-- ✅ **All prompts answerable** - 15 test prompts verified to work with selected
+- ✅ **All prompts answerable** - 21 test prompts verified to work with selected
 passages
 
 The excerpts cover the dramatic descent and landing sequence, followed by
 moonwalk activities, ensuring comprehensive testing across summarization,
-reasoning, and RAG tasks.
+reasoning, RAG, paraphrasing, and creative generation tasks.
 
 📌 See [RATIONALE.md][rationale] for detailed selection methodology.
 
 ---
 
 ## 📝 Test Structure
 
-**15 Standardized Prompts** across three categories:
+The test includes **21 standardized prompts** distributed across **five categories**.
+In addition, a **Master Instruction** and **task-specific guidance prompts** are
+provided to ensure consistency and clarity across all tasks.
 
-### Summarization (5 prompts)
+### Prompt Delivery Overview
 
-Tests model's ability to condense and extract key information
+The test follows this sequence:
+
+**1.** The **Master Instruction** is used **once at the beginning** of the test.
+**2.** Before each category, a **task-specific guidance prompt** clarifies how
+the model should approach that task type (e.g., reasoning, summarization, retrieval).
+**3.** Then the **individual prompts for that category** are presented in order
+of increasing difficulty.
+
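The delivery sequence above can be scripted. A minimal sketch follows; the instruction texts, field names, and `difficulty_rank` key are placeholders for illustration, not the actual wording or schema from the dataset files:

```python
def build_session(master_instruction, categories):
    """Yield prompts in the order described above: the Master Instruction
    once, then each category's guidance prompt followed by that category's
    prompts sorted by increasing difficulty."""
    yield master_instruction
    for category in categories:
        yield category["guidance"]
        for prompt in sorted(category["prompts"],
                             key=lambda p: p["difficulty_rank"]):
            yield prompt["text"]

# Placeholder texts -- the real wording lives in test_prompts.md.
session = list(build_session(
    "MASTER: Answer using only the provided source text.",
    [{"guidance": "GUIDANCE: The next prompts are summarization tasks.",
      "prompts": [{"text": "Summarize the landing.", "difficulty_rank": 2},
                  {"text": "List the main events.", "difficulty_rank": 1}]}],
))
```

The generator keeps ordering logic in one place, so adding a category only means appending to the input list.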
+### Prompt Categories
+
+#### 1. Summarization (5 prompts)
+
+Tests the model's ability to condense and extract key information.
 
 **Difficulty:** Easy → Medium → Hard
-**Examples:** Main events, challenges faced, activities performed, equipment
-deployed
+**Examples:** Main events, challenges faced, activities performed, equipment deployed
 
-### Reasoning (5 prompts)
+#### 2. Reasoning (5 prompts)
 
-Tests model's ability to analyze, infer, and make connections
+Tests the model's ability to analyze, infer, and make connections.
 
-**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
-analysis
+**Types:** Causal reasoning, hypothetical scenarios, interpretation,
+deep analysis
 **Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
 manual control? What does Margaret Hamilton's statement reveal?
 
-### RAG - Retrieval (5 prompts)
+#### 3. RAG – Retrieval (5 prompts)
 
-Tests model's ability to retrieve specific information from source text
+Tests the model's ability to retrieve specific information from the source text.
 
 **Types:** Times, quotes, numbers, lists, complex multi-part facts
 **Examples:** Landing time? Material collected? Scientific instruments deployed?
 
-📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
-for its structured data version.
+#### 4. Paraphrasing (3 prompts)
+
+Tests the model's ability to restate information in its own words.
+
+**Difficulty:** Easy → Medium
+**Examples:** Describe computer alarms, Armstrong's teamwork, or sample collection.
+
+#### 5. Creative Generation (3 prompts)
+
+Tests the model's interpretive and imaginative capabilities.
+
+**Difficulty:** Easy → Medium
+**Examples:** Imagine being in Mission Control. What does the landing show about
+courage? How did it change Earth?
+
+📌 See [test_prompts.md][prompts] for the readable version with full prompt texts,
+or [test_data.json][json] for the structured version.
 
 ---
 
 ## 🔧 How to Use
 
 ### General Instructions
 
-- **All 15 prompts** should be tested across all models to ensure a fair comparison.
+- **All 21 prompts** should be tested across all models to ensure a fair comparison.
+- The **Master Instruction** and any **task-specific guidance prompts** should
+be applied as described in the Test Structure section.
 - Some prompts can be more challenging for smaller models,
 but attempting all prompts provides comprehensive evaluation data.
 
 **Testing Protocol:**
 
 **1.** Use the source text from **[source_text.txt][source]**
 exactly as provided
-**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without
+**2.** Use all prompts from **[test_prompts.md][prompts]** without
 modification
 **3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
  testing workflows
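For step 3, a scripted run over `test_data.json` might look like the following sketch. The schema shown (a top-level `prompts` array with `id`, `category`, and `text` fields) is an assumption for illustration and may differ from the actual file:

```python
import json

# Assumed excerpt of test_data.json -- the real schema may differ.
sample = """{
  "prompts": [
    {"id": 1, "category": "summarization", "text": "Summarize the main events."},
    {"id": 6, "category": "reasoning", "text": "Why did the computer alarms occur?"}
  ]
}"""

def run_model(prompt_text):
    """Placeholder for a call to the model under test."""
    return f"(model answer to: {prompt_text})"

data = json.loads(sample)
results = [
    {"id": p["id"], "category": p["category"], "answer": run_model(p["text"])}
    for p in data["prompts"]
]
```

In a real run, replace `run_model` with the actual model call and load the file with `json.load` instead of the inline sample.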
@@ -144,9 +174,15 @@ For each prompt, record:
 **1. Accuracy** - Is the answer factually correct?
 **2. Completeness** - Are all key points covered?
 **3. Specificity** - Are specific details included (times, names, numbers)?
-**4. Reasoning Quality** - For reasoning prompts, is the logic sound and
-well-supported?
-
+**4. Reasoning Quality** - Is the logic sound and well-supported?
+**5. Paraphrasing Quality** - Is information reworded (not copied)
+while maintaining accuracy?
+**6. Creative Generation Quality** - Is the response coherent, relevant, and text-inspired?
+**7. Instruction Following** - Does the model follow the master or task-specific
+instructions (no source mentions, concise, natural)?
+
+**Note:** Creative generation prompts have no single correct answer. Evaluate
+based on coherence, relevance to the text, and quality of reasoning.
 Maintain consistent evaluation criteria across all models for fair comparison.
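One way to keep scoring consistent across models is a fixed record per model/prompt pair. A sketch follows; the 1-5 scale and field names are illustrative assumptions, not part of the dataset specification:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evaluation:
    """Scores for one model/prompt pair; criteria 4-7 are left as None
    where the prompt's category does not call for them."""
    model: str
    prompt_id: int
    accuracy: int                             # criterion 1, scored 1-5
    completeness: int                         # criterion 2
    specificity: int                          # criterion 3
    reasoning_quality: Optional[int] = None   # criterion 4
    paraphrasing_quality: Optional[int] = None  # criterion 5
    creative_quality: Optional[int] = None    # criterion 6
    instruction_following: Optional[int] = None  # criterion 7
    notes: str = ""

row = Evaluation(model="model-a", prompt_id=3,
                 accuracy=5, completeness=4, specificity=4,
                 notes="omitted EVA duration")
```

A fixed record like this makes per-category averages and cross-model comparisons straightforward to compute later.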
 
 ---