Commit e35566a (parent 2c0ec37)

docs(apollo11/readme): add overview and instructions for Apollo 11 test dataset

1 file changed: test_dataset_apollo11/README.md (181 additions, 0 deletions)
# 🚀 Apollo 11 Test Dataset

## 🌕 Overview

This is the unified test dataset for comparing different AI models (commercial,
distilled, SLM, and RAG systems) in the ELO2 - Green AI project.

The dataset consists of selected passages from Wikipedia's Apollo 11 article,
accompanied by 15 standardized prompts testing summarization, reasoning, and
retrieval-augmented generation capabilities.

---
## 📂 Dataset Contents

- **[README.md][readme]** - This file (overview and instructions)
- **[source_text.txt][source]** - Excerpted Apollo 11 text (~1,400 words, plain text)
- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
- **[test_data.json][json]** - Complete dataset (structured format for automated testing)
- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions

📌 **Process documentation:** For background on dataset creation decisions and
team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)**.

[readme]: /test_dataset_apollo11/README.md
[source]: /test_dataset_apollo11/source_text.txt
[prompts]: /test_dataset_apollo11/test_prompts.md
[json]: /test_dataset_apollo11/test_data.json
[rationale]: /test_dataset_apollo11/RATIONALE.md

---
## 📄 Source & License

**Source:** Wikipedia - Apollo 11 article
**URL:** <https://en.wikipedia.org/wiki/Apollo_11>
**Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845>
**Revision ID:** 1252473845 (Wikipedia internal revision number)
**Date Accessed:** October 22, 2025
**Sections:** Excerpted passages from "Lunar landing" and "Lunar surface operations"
**Word Count:** ~1,400 words
**Language:** English

**License:** Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)

- ✅ Content can be used freely for research
- ✅ Wikipedia must be attributed as the source
- ✅ Derivative works must be shared under the same license

**Attribution:** "Apollo 11" by Wikipedia contributors, licensed under CC BY-SA 3.0

**Text Structure:** Selected passages from Wikipedia sections.

- Individual sentences are unchanged; some paragraphs were omitted for length.
- The complete original sections total ~3,800 words, excerpted to ~1,400 words for
  practical testing while preserving all information needed for the 15 test prompts.

📌 See [source_text.txt][source] for the complete excerpted text.

---
## 🎯 Selection Rationale

- **Practical length** - ~1,400 words is manageable for all model types, including
  distilled models with standard chunking
- **Rich in specific details** - Ideal for RAG testing (times, names, numbers,
  technical terms)
- **Multiple complexity levels** - Both simple recall and complex reasoning can
  be tested
- **Narrative structure** - Clear sequence from descent through surface activities
- **All prompts answerable** - All 15 test prompts verified to work with the
  selected passages

The excerpts cover the dramatic descent and landing sequence, followed by
moonwalk activities, ensuring comprehensive testing across summarization,
reasoning, and RAG tasks.

📌 See [RATIONALE.md][rationale] for the detailed selection methodology.

---
## 📝 Test Structure

**15 standardized prompts** across three categories:

### Summarization (5 prompts)

Tests the model's ability to condense and extract key information.

**Difficulty:** Easy → Medium → Hard
**Examples:** Main events, challenges faced, activities performed, equipment deployed

### Reasoning (5 prompts)

Tests the model's ability to analyze, infer, and make connections.

**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep analysis
**Examples:** Why did the computer alarms occur? What if Armstrong hadn't taken
manual control? What does Margaret Hamilton's statement reveal?

### RAG - Retrieval (5 prompts)

Tests the model's ability to retrieve specific information from the source text.

**Types:** Times, quotes, numbers, lists, complex multi-part facts
**Examples:** Landing time? Material collected? Scientific instruments deployed?

📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
for the structured version.

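For scripted runs, the structured file can be consumed along these lines. This is a minimal sketch: the field names (`prompts`, `id`, `category`, `difficulty`, `prompt`) are illustrative assumptions, so check the actual schema in test_data.json before relying on them.

```python
import json
from collections import Counter

# Hypothetical miniature of test_data.json -- the real file holds all 15
# prompts; these field names are assumptions, not the confirmed schema.
sample = """
{
  "prompts": [
    {"id": 1,  "category": "summarization", "difficulty": "easy",
     "prompt": "Summarize the main events of the lunar landing."},
    {"id": 6,  "category": "reasoning", "difficulty": "medium",
     "prompt": "Why did the computer alarms occur during descent?"},
    {"id": 11, "category": "rag", "difficulty": "easy",
     "prompt": "At what time did the lunar module land?"}
  ]
}
"""

dataset = json.loads(sample)
# Group prompts by category to verify the expected 5/5/5 split on real data.
by_category = Counter(p["category"] for p in dataset["prompts"])
print(dict(by_category))
```
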
---

## 🔧 How to Use

### General Instructions

- **All 15 prompts** should be tested across all models to ensure a fair comparison.
- Some prompts may be more challenging for smaller models, but attempting all
  prompts provides comprehensive evaluation data.

**Testing Protocol:**

1. Use the source text from **[source_text.txt][source]** exactly as provided.
2. Use all 15 prompts from **[test_prompts.md][prompts]** without modification.
3. *(Optional)* Use **[test_data.json][json]** for automated or scripted testing workflows.
4. Record responses for each prompt along with model configuration details.
5. Note any errors, failures, or unusual behaviors.

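A minimal harness for the protocol above might look like the following sketch. `run_model` is a placeholder for whatever model or API is under test, and the configuration fields (`example-model-v1`, hardware, timestamp) are examples of what to record, not required names.

```python
import json
import platform
from datetime import datetime, timezone

def run_model(prompt: str, source_text: str) -> str:
    """Placeholder -- replace with a call to the model under test."""
    return "(model response)"

source_text = "..."  # load source_text.txt here, exactly as provided
prompts = [
    "Summarize the main events of the lunar landing.",
    # ...all 15 prompts from test_prompts.md, unmodified
]

results = []
for i, prompt in enumerate(prompts, start=1):
    try:
        response = run_model(prompt, source_text)
    except Exception as exc:
        response = f"Error: {exc}"          # step 5: record failures, don't skip
    results.append({
        "prompt_id": i,
        "response": response,
        "model": "example-model-v1",        # hypothetical configuration details
        "hardware": platform.machine(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

print(json.dumps(results, indent=2))
```
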
---

## 📊 Evaluation

For each prompt, record:

1. **Accuracy** - Is the answer factually correct?
2. **Completeness** - Are all key points covered?
3. **Specificity** - Are specific details included (times, names, numbers)?
4. **Reasoning Quality** - For reasoning prompts, is the logic sound and well-supported?

Maintain consistent evaluation criteria across all models for fair comparison.

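One way to keep the criteria consistent across models is a fixed record per prompt, as in the sketch below. The 0-2 scale is an assumption for illustration, not a project standard; any agreed scale works as long as it is applied identically to every model.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Evaluation:
    """One record per prompt; the 0-2 scale is an illustrative assumption."""
    prompt_id: int
    accuracy: int                            # 0 = wrong, 1 = partial, 2 = correct
    completeness: int                        # all key points covered?
    specificity: int                         # times, names, numbers included?
    reasoning_quality: Optional[int] = None  # reasoning prompts only

record = Evaluation(prompt_id=6, accuracy=2, completeness=1, specificity=2,
                    reasoning_quality=2)
print(asdict(record))
```
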
---

## ⚠️ Guidelines

**Critical Rules:**

- **DO NOT modify** the source text
- **DO NOT modify** the prompts
- **DO record** all test configurations (model version, parameters, hardware)
- **DO note** any failures as "No response" or "Error", with details

**Technical Notes:**

- For RAG systems: load the source text into the database and verify indexing
  before testing
- For models with token limits: chunking may be required
- Environment: use consistent hardware and settings when possible
- Environmental measurements: use standardized protocols

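For token-limited models, an overlapping word-window split is one simple chunking option. The window and overlap sizes below are arbitrary assumptions, not project requirements; adjust them to each model's context limit.

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

# A 700-word stand-in for the source text; the real file is ~1,400 words.
chunks = chunk_words("word " * 700)
print(len(chunks))  # 3 overlapping chunks of at most 300 words each
```
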
---

## 📖 How to Cite This Dataset

When referencing this dataset in reports or publications:

> Apollo 11 Test Dataset: Excerpted passages from Wikipedia's "Apollo 11" article
> (Revision 1252473845, accessed October 22, 2025), licensed under CC BY-SA 3.0.
> Available at: <https://en.wikipedia.org/wiki/Apollo_11>

---

*For questions or issues, please contact the project team.
Good luck with testing!* 🚀
