
Commit c81f92d

docs(apollo11): add paraphrasing & creative generation categories and introduce instruction prompts
1 parent 7b7a41d commit c81f92d

1 file changed

Lines changed: 60 additions & 24 deletions

File tree

test_dataset_apollo11/README.md

@@ -6,8 +6,8 @@ This is the unified test dataset for comparing different AI models (commercial,
 distilled, SLM, and RAG systems) in the ELO2 - Green AI project.
 
 The dataset consists of selected passages from Wikipedia's Apollo 11 article,
-accompanied by 15 standardized prompts testing summarization, reasoning, and
-retrieval-augmented generation capabilities.
+accompanied by 21 standardized prompts testing summarization, reasoning,
+retrieval, paraphrasing, and creative generation capabilities.
 
 ---
 
@@ -16,7 +16,7 @@ retrieval-augmented generation capabilities.
 - **[README.md][readme]** - This file (overview and instructions)
 - **[source_text.txt][source]** - Apollo 11 excerpted text
 (~1,400 words, plain text)
-- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
+- **[test_prompts.md][prompts]** - All test prompts (readable format)
 - **[test_data.json][json]** - Complete dataset (structured format for automated
 testing)
 - **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions
@@ -56,7 +56,7 @@ operations"
 
 - Individual sentences are unchanged; some paragraphs omitted for length management.
 - Complete original sections total ~3,800 words; excerpted to ~1,400 words for
-practical testing while maintaining all information necessary for the 15 test prompts.
+practical testing while maintaining all information necessary for the 21 test prompts.
 
 📌 See [source_text.txt][source] for the complete excerpted text.
 
@@ -72,63 +72,93 @@ technical terms)
 be tested
 -**Narrative structure** - Clear sequence from descent through surface
 activities
--**All prompts answerable** - 15 test prompts verified to work with selected
+-**All prompts answerable** - 21 test prompts verified to work with selected
 passages
 
 The excerpts cover the dramatic descent and landing sequence, followed by
 moonwalk activities, ensuring comprehensive testing across summarization,
-reasoning, and RAG tasks.
+reasoning, RAG, paraphrasing, and creative generation tasks.
 
 📌 See [RATIONALE.md][rationale] for detailed selection methodology.
 
 ---
 
 ## 📝 Test Structure
 
-**15 Standardized Prompts** across three categories:
+The test includes **21 standardized prompts** distributed across **five categories**.
+In addition, a **Master Instruction** and **task-specific guidance prompts** are
+provided to ensure consistency and clarity across all tasks.
 
-### Summarization (5 prompts)
+### Prompt Delivery Overview
 
-Tests model's ability to condense and extract key information
+The test follows this sequence:
+
+**1.** The **Master Instruction** is used **once at the beginning** of the test.
+**2.** Before each category, a **task-specific guidance prompt** clarifies how
+the model should approach that task type (e.g., reasoning, summarization, retrieval).
+**3.** Then, the **individual prompts for that category** are presented in order
+of increasing difficulty.
+
+### Prompt Categories
+
+#### 1. Summarization (5 prompts)
+
+Tests model's ability to condense and extract key information.
 
 **Difficulty:** Easy → Medium → Hard
-**Examples:** Main events, challenges faced, activities performed, equipment
-deployed
+**Examples:** Main events, challenges faced, activities performed, equipment deployed
 
-### Reasoning (5 prompts)
+#### 2. Reasoning (5 prompts)
 
-Tests model's ability to analyze, infer, and make connections
+Tests model's ability to analyze, infer, and make connections.
 
-**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
-analysis
+**Types:** Causal reasoning, hypothetical scenarios, interpretation,
+deep analysis
 **Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
 manual control? What does Margaret Hamilton's statement reveal?
 
-### RAG - Retrieval (5 prompts)
+#### 3. RAG Retrieval (5 prompts)
 
-Tests model's ability to retrieve specific information from source text
+Tests model's ability to retrieve specific information from source text.
 
 **Types:** Times, quotes, numbers, lists, complex multi-part facts
 **Examples:** Landing time? Material collected? Scientific instruments deployed?
 
-📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
-for its structured data version.
+#### 4. Paraphrasing (3 prompts)
+
+Tests model's ability to restate information in its own words.
+
+**Difficulty:** Easy → Medium
+**Examples:** Describe computer alarms, Armstrong’s teamwork, or sample collection.
+
+#### 5. Creative Generation (3 prompts)
+
+Tests model's interpretive and imaginative capabilities.
+
+**Difficulty:** Easy → Medium
+**Examples:** Imagine being in Mission Control. What does the landing show about courage?
+How did it change Earth?
+
+📌 See [test_prompts.md][prompts] for the readable version with full prompt texts,
+or [test_data.json][json] for its structured data version.
 
 ---
 
 ## 🔧 How to Use
 
 ### General Instructions
 
-- **All 15 prompts** should be tested across all models to ensure a fair comparison.
+- **All 21 prompts** should be tested across all models to ensure a fair comparison.
+- The **Master Instruction** and any **task-specific guidance prompts** should
+be applied as described in the Test Structure section.
 - Some prompts can be more challenging for smaller models,
 but attempting all prompts provides comprehensive evaluation data.
 
 **Testing Protocol:**
 
 **1.** Use the source text from **[source_text.txt][source]**
 exactly as provided
-**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without
+**2.** Use all prompts from **[test_prompts.md][prompts]** without
 modification
 **3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
 testing workflows
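
The optional scripted workflow in step 3 can be driven from test_data.json,
following the delivery sequence described under Test Structure. Below is a minimal
sketch in Python; the field names used here (`master_instruction`, `categories`,
`guidance`, `prompts`, `id`, `text`) are illustrative assumptions, not the verified
schema of the file:

```python
# Minimal sketch of an automated run over test_data.json.
# Assumed (unverified) schema: master_instruction, categories[].guidance,
# and categories[].prompts[] entries with id and text fields.
import json

def query_model(conversation):
    """Placeholder: send the accumulated prompts to the model under test."""
    raise NotImplementedError

with open("test_dataset_apollo11/test_data.json", encoding="utf-8") as f:
    data = json.load(f)

with open("test_dataset_apollo11/source_text.txt", encoding="utf-8") as f:
    source_text = f.read()

# 1. The Master Instruction is used once, at the beginning of the test.
conversation = [data["master_instruction"], source_text]
results = []

for category in data["categories"]:
    # 2. A task-specific guidance prompt precedes each category.
    conversation.append(category["guidance"])
    # 3. Individual prompts follow, in order of increasing difficulty.
    for prompt in category["prompts"]:
        conversation.append(prompt["text"])
        results.append({"id": prompt["id"], "response": query_model(conversation)})
```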
@@ -144,9 +174,15 @@ For each prompt, record:
 **1. Accuracy** - Is the answer factually correct?
 **2. Completeness** - Are all key points covered?
 **3. Specificity** - Are specific details included (times, names, numbers)?
-**4. Reasoning Quality** - For reasoning prompts, is the logic sound and
-well-supported?
-
+**4. Reasoning Quality** - Is the logic sound and well-supported?
+**5. Paraphrasing Quality** - Is information reworded (not copied)
+while maintaining accuracy?
+**6. Creative Generation Quality** - Is the response coherent, relevant, and text-inspired?
+**7. Instruction Following** - Does the model follow the master or task-specific
+instructions (no source mentions, concise, natural)?
+
+**Note:** Creative generation prompts have no single correct answer. Evaluate
+based on coherence, relevance to text, and quality of reasoning.
 Maintain consistent evaluation criteria across all models for fair comparison.
 
 ---
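
For record-keeping, the seven criteria map naturally onto one structured record per
prompt. A minimal sketch in Python; the 0–5 integer scale is an assumption (the
README does not prescribe one), and the category-specific criteria are optional:

```python
# One evaluation record per prompt, covering criteria 1-7 above.
# The 0-5 scale is an assumption; adjust to your own rubric.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PromptEvaluation:
    prompt_id: str
    accuracy: int                        # 1. Factually correct?
    completeness: int                    # 2. All key points covered?
    specificity: int                     # 3. Times, names, numbers included?
    reasoning_quality: Optional[int]     # 4. Logic sound and well-supported?
    paraphrasing_quality: Optional[int]  # 5. Reworded (not copied), still accurate?
    creative_quality: Optional[int]      # 6. Coherent, relevant, text-inspired?
    instruction_following: int           # 7. Master/task-specific instructions followed?
```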
