Commit ce1e9ae

docs(apollo11/rationale): add detailed explanation of dataset selection decisions

1 parent f5cc035 commit ce1e9ae

1 file changed: test_dataset_apollo11/RATIONALE.md
Lines changed: 189 additions & 0 deletions

# Rationale for Text Selection

## Overview

Excerpted passages (~1,400 words) from Wikipedia’s Apollo 11 “Lunar landing”
and “Lunar surface operations” sections were selected as the unified test
dataset for the ELO2 - Green AI project.

---

## Why Apollo 11?

**Universal Knowledge:**

All major commercial models (GPT-4, Claude, Gemini) have Apollo 11 in their
training data, enabling fair comparison with and without RAG.

**Rich Factual Content:**

Dense with verifiable facts ideal for RAG testing: timestamps (20:17:40 UTC),
numbers (216 lbs fuel, 21.55 kg samples), names (Armstrong, Aldrin, Hamilton),
and technical terms (LGC, PLSS, EASEP).

**Accessibility:**

Wikipedia content is freely available, properly licensed (CC BY-SA 3.0), and stable
via permanent links.

**Appropriate Length:**

Complete sections total ~3,800 words; excerpted to ~1,400 words—substantial for
evaluation yet processable by smaller models on standard hardware. This length
aligns with standard benchmarks:

- Summarization tasks typically use 500-2,000 words,
- QA benchmarks 300-1,500 words,
- RAG evaluations 1,000-3,000 words (Rajpurkar et al., 2016; Hermann et al., 2015).

The excerpted length balances comprehensiveness with practical testability.

---

## Why These Excerpted Passages?

**Continuous Narrative:**

Selected passages flow from descent through surface activities, forming a natural
story arc ideal for summarization tasks requiring temporal understanding.

**Balanced Complexity:**

- Simple facts (times, names, quotes) suitable for smaller and distilled models
- Complex elements (technical problems, decision-making, procedures) challenging
  for all models

**Optimal for RAG:**

Dense with retrievable facts across categories—times, quantities, names,
equipment, quotes.
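
As a minimal illustration of why this fact density matters for retrieval, the
sketch below scores passage chunks by keyword overlap with a query and returns
the best match. The chunk texts and the scoring scheme are simplified stand-ins,
not the project’s actual RAG pipeline.

```python
# Minimal sketch of keyword-overlap retrieval over fact-dense excerpt chunks.
# Chunk texts and the scoring scheme are illustrative only.

def score(query: str, chunk: str) -> int:
    """Count query words that appear in the chunk (case-insensitive)."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for w in query.lower().split() if w in chunk_words)

def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk with the highest keyword overlap with the query."""
    return max(chunks, key=lambda c: score(query, c))

chunks = [
    "Eagle landed at 20:17:40 UTC with about 216 lbs of fuel remaining.",
    "The crew collected 21.55 kg of lunar samples during the EVA.",
]

best = retrieve("how much fuel remained at landing", chunks)
```

Because the excerpts pack distinct fact categories (times, quantities, names)
into short spans, even this naive scorer separates the relevant chunk from the
irrelevant one.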

**Reasoning Opportunities:**

Supports causal (Why?), hypothetical (What if?), interpretive (What does X reveal?),
and analytical reasoning.

**Verified Coverage:**

All 15 test prompts confirmed answerable with excerpted passages through
preliminary testing.
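
A preliminary answerability check of this kind can be sketched as a simple
substring test over the excerpt; the prompt/answer pairs below are hypothetical
examples, not the actual 15 test prompts.

```python
# Sketch of a coverage check: confirm each test prompt's expected answer
# string actually appears in the excerpted passages. The excerpt text and
# prompt/answer pairs are illustrative examples only.

excerpt = (
    "Eagle landed at 20:17:40 UTC with about 216 lbs of fuel remaining. "
    "Armstrong and Aldrin collected 21.55 kg of lunar samples."
)

expected_answers = {
    "When did Eagle land?": "20:17:40 UTC",
    "How much sample material was collected?": "21.55 kg",
}

# Prompts whose expected answer is absent from the excerpt.
uncovered = [
    prompt for prompt, answer in expected_answers.items()
    if answer not in excerpt
]
```

An empty `uncovered` list confirms every prompt remains answerable from the
excerpted text alone.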

**Length Management:**

Complete sections (~3,800 words) would require extensive chunking for distilled models
with limited token capacity. Excerpted passages (~1,400 words) are more manageable
while maintaining comprehensive content for all test scenarios.
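
The chunking burden mentioned above can be sketched as a fixed-size word window
with overlap; the `chunk_size` and `overlap` values here are illustrative
assumptions, not the project’s settings.

```python
# Sketch of fixed-size word-window chunking with overlap, as would be needed
# to fit the full ~3,800-word sections into a small model's context window.
# chunk_size/overlap values are illustrative assumptions.

def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A ~1,400-word excerpt needs far fewer chunks than the full ~3,800 words.
excerpt_chunks = chunk_words("word " * 1400)
full_chunks = chunk_words("word " * 3800)
```

With these example settings the excerpt fits in 9 chunks versus 25 for the
complete sections, which is the practical difference the excerpting is meant
to buy.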

---

## Alignment with Project Goals

**Fair Comparison:**

- Commercial models tested on likely training data
- RAG systems given the same information
- All models evaluated on identical input

**Reproducibility:**

Permanent Wikipedia link, documented excerpt selections, license documentation.

**Why Not Other Approaches?**

- Entire Wikipedia article (all sections)?
  Too long (~10,000+ words)—it exceeds the processing capacity of smaller models
  and is impractical for manual verification.
- Self-written summary?
  Custom summaries cannot be reproduced by others and raise objectivity concerns,
  as well as potential copyright issues.
- Multiple unrelated passages?
  Disconnected excerpts (e.g., Apollo 11 + climate change) break narrative flow
  and prevent reasoning questions that require connected context.
- Technical manuals or engineering documents?
  NASA reports are too specialized, likely absent from training data, and limit
  question diversity to technical retrieval.
- Complete sections without excerpting?
  While more comprehensive, ~3,800 words presents practical challenges for smaller
  models and extends testing time. Excerpting maintains essential information
  while improving testability across architectures.

---

## Excerpt Selection Methodology

**From “Lunar landing” section:**

- Descent problems and trajectory issues
- Computer alarms (1201, 1202) and Margaret Hamilton’s explanation
- Manual landing sequence with fuel concerns
- Landing confirmation moment

**From “Lunar surface operations” section:**

- EVA preparation and first step
- Armstrong’s famous quote and its controversy
- Surface activities and movement
- Flag planting and Nixon communication
- Scientific equipment deployment (EASEP)
- Sample collection activities
- Return to lunar module

**Omitted content:**

- Extended technical explanations of radar systems
- Detailed crew dialogue transcripts
- Some procedural minutiae

**Selection criteria:**

- Information density for prompts
- Narrative continuity
- Factual richness for RAG tasks
- Reasoning opportunities

---

## Limitations

**Excerpt nature:**

Using selected passages rather than complete sections reduces some contextual richness,
though all test prompts remain fully answerable.

**Single domain:**

Results may not generalize beyond this topic.

- *Acknowledgment:* This is a focused benchmark within a defined scope.

---

## Conclusion

The excerpted passages from *“Lunar landing”* and *“Lunar surface operations”*
sections provide:

✅ Practical content for all model types
✅ Reproducibility through permanent links and documented selections
✅ Balance of factual density and narrative coherence
✅ Support for diverse question types
✅ Academic integrity through proper licensing and attribution
✅ Alignment with Green AI benchmarking objectives

This selection enables fair, transparent comparison of AI model accuracy and environmental
efficiency while maintaining practical testability on available hardware.

---

## References

**Hermann, K. M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman,
M., & Blunsom, P. (2015).**
*Teaching Machines to Read and Comprehend.*
*Advances in Neural Information Processing Systems, 28.*
[arXiv:1506.03340](https://arxiv.org/abs/1506.03340)

**Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016).**
*SQuAD: 100,000+ Questions for Machine Comprehension of Text.*
*Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 2383–2392.*
[ACL Anthology D16-1264](https://aclanthology.org/D16-1264/)