# Rationale for Text Selection

## Overview

Excerpted passages (~1,400 words) from Wikipedia’s Apollo 11 “Lunar landing”
and “Lunar surface operations” sections were selected as the unified test
dataset for the ELO2 - Green AI project.

---

## Why Apollo 11?

**Universal Knowledge:**

All major commercial models (GPT-4, Claude, Gemini) have Apollo 11 in their
training data, enabling fair comparison with and without RAG.

**Rich Factual Content:**

Dense with verifiable facts ideal for RAG testing: timestamps (20:17:40 UTC),
numbers (216 lbs fuel, 21.55 kg samples), names (Armstrong, Aldrin, Hamilton),
and technical terms (LGC, PLSS, EASEP).

**Accessibility:**

Wikipedia content is freely available, properly licensed (CC BY-SA 3.0), and stable
via permanent links.

**Appropriate Length:**

Complete sections total ~3,800 words, excerpted to ~1,400 words: substantial for
evaluation yet processable by smaller models on standard hardware. This length
aligns with standard benchmarks:

- Summarization tasks typically use 500-2,000 words,
- QA benchmarks 300-1,500 words,
- RAG evaluations 1,000-3,000 words (Rajpurkar et al., 2016; Hermann et al., 2015).

The excerpted length balances comprehensiveness with practical testability.
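
As a rough sanity check on these word counts, the sketch below converts them to approximate token counts using an assumed average of ~1.3 subword tokens per English word — a common rule of thumb, not a measured figure for this text:

```python
# Rough token estimate for the word counts discussed above.
# The 1.3 tokens-per-word ratio is an assumed average for English
# prose under subword tokenizers, not an exact measurement.

TOKENS_PER_WORD = 1.3

def estimated_tokens(word_count: int) -> int:
    """Approximate subword-token count for a given word count."""
    return round(word_count * TOKENS_PER_WORD)

for label, words in [("complete sections", 3800), ("excerpt", 1400)]:
    print(f"{label}: ~{words} words, roughly {estimated_tokens(words)} tokens")
```

Under this assumption, the ~1,400-word excerpt comes to roughly 1,800 tokens, leaving room for the prompt and the model’s answer even in small context windows, while the complete sections approach 5,000 tokens.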

---

## Why These Excerpted Passages?

**Continuous Narrative:**

Selected passages flow from descent through surface activities, forming a natural
story arc ideal for summarization tasks requiring temporal understanding.

**Balanced Complexity:**

- Simple facts (times, names, quotes) suitable for smaller and distilled models
- Complex elements (technical problems, decision-making, procedures) challenging
  for all models

**Optimal for RAG:**

Dense with retrievable facts across categories: times, quantities, names,
equipment, quotes.

**Reasoning Opportunities:**

Supports causal (Why?), hypothetical (What if?), interpretive (What does X reveal?),
and analytical reasoning.

**Verified Coverage:**

All 15 test prompts were confirmed answerable from the excerpted passages through
preliminary testing.

**Length Management:**

Complete sections (~3,800 words) would require extensive chunking for distilled models
with limited token capacity. Excerpted passages (~1,400 words) are more manageable
while maintaining comprehensive content for all test scenarios.
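
A minimal sketch of the kind of word-window chunking the complete sections would require; the 250-word window and 50-word overlap are illustrative choices for the sketch, not the project’s actual settings:

```python
# Illustrative word-window chunking for models with small context
# windows; the 250-word size and 50-word overlap are assumptions
# for this sketch, not the project's actual settings.

def chunk_words(text: str, size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `size` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window reached the end
            break
    return chunks

# Under these settings, a ~3,800-word text splits into about 19
# chunks, while a ~1,400-word excerpt needs only about 7.
```

With the excerpt, most retrieval configurations can hold the whole text in a handful of chunks, which simplifies RAG evaluation considerably.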

---

## Alignment with Project Goals

**Fair Comparison:**

- Commercial models tested on likely training data
- RAG systems given the same information
- All models evaluated on identical input

**Reproducibility:**

Permanent Wikipedia link, documented excerpt selections, license documentation.
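
For illustration, a permanent Wikipedia link pins the dataset to one article revision via the `oldid` URL parameter; the revision ID below is a placeholder, not the project’s actual pinned revision:

```python
# Build an oldid-style permalink that always resolves to the same
# article revision. REVISION_ID is a placeholder value, not the
# revision actually used by the project.

BASE_URL = "https://en.wikipedia.org/w/index.php"
REVISION_ID = 1234567890  # placeholder, not the real pinned revision

def permanent_link(title: str, revision_id: int) -> str:
    """Return a permalink to a specific revision of a Wikipedia article."""
    return f"{BASE_URL}?title={title}&oldid={revision_id}"

print(permanent_link("Apollo_11", REVISION_ID))
```

Because the `oldid` targets one revision rather than the live page, later edits to the article cannot silently change the test dataset.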

**Why Not Other Approaches?**

- Entire Wikipedia article (all sections)?
  Too long (~10,000+ words): it exceeds the processing capacity of smaller models
  and is impractical for manual verification.
- Self-written summary?
  Custom summaries cannot be reproduced by others and raise objectivity concerns
  plus potential copyright issues.
- Multiple unrelated passages?
  Disconnected excerpts (e.g., Apollo 11 + climate change) break narrative flow
  and prevent reasoning questions that require connected context.
- Technical manuals or engineering documents?
  NASA reports are too specialized, likely absent from training data, and limit
  question diversity to technical retrieval.
- Complete sections without excerpting?
  While more comprehensive, ~3,800 words presents practical challenges for smaller
  models and extends testing time. Excerpting retains the essential information
  while improving testability across architectures.

---

## Excerpt Selection Methodology

**From “Lunar landing” section:**

- Descent problems and trajectory issues
- Computer alarms (1201, 1202) and Margaret Hamilton’s explanation
- Manual landing sequence with fuel concerns
- Landing confirmation moment

**From “Lunar surface operations” section:**

- EVA preparation and first step
- Armstrong’s famous quote and its controversy
- Surface activities and movement
- Flag planting and Nixon communication
- Scientific equipment deployment (EASEP)
- Sample collection activities
- Return to lunar module

**Omitted content:**

- Extended technical explanations of radar systems
- Detailed crew dialogue transcripts
- Some procedural minutiae

**Selection criteria:**

- Information density for prompts
- Narrative continuity
- Factual richness for RAG tasks
- Reasoning opportunities

---

## Limitations

**Excerpt nature:**

Using selected passages rather than complete sections reduces some contextual richness,
though all test prompts remain fully answerable.

**Single domain:**

Results may not generalize beyond this topic.

- *Acknowledgment:* This is a focused benchmark within a defined scope.

---

## Conclusion

The excerpted passages from the *“Lunar landing”* and *“Lunar surface operations”*
sections provide:

- ✅ Practical content for all model types
- ✅ Reproducibility through permanent links and documented selections
- ✅ A balance of factual density and narrative coherence
- ✅ Support for diverse question types
- ✅ Academic integrity through proper licensing and attribution
- ✅ Alignment with Green AI benchmarking objectives

This selection enables fair, transparent comparison of AI model accuracy and environmental
efficiency while maintaining practical testability on available hardware.

---

## References

**Hermann, K. M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman,
M., & Blunsom, P. (2015).**
*Teaching Machines to Read and Comprehend.*
*Advances in Neural Information Processing Systems, 28.*
[arXiv:1506.03340](https://arxiv.org/abs/1506.03340)

**Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016).**
*SQuAD: 100,000+ Questions for Machine Comprehension of Text.*
*Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 2383–2392.*
[ACL Anthology D16-1264](https://aclanthology.org/D16-1264/)