# 🚀 Apollo 11 Test Dataset

## 🌕 Overview

This is the unified test dataset for comparing different AI models (commercial,
distilled, SLM, and RAG systems) in the ELO2 - Green AI project.

The dataset consists of selected passages from Wikipedia's Apollo 11 article,
accompanied by 15 standardized prompts testing summarization, reasoning, and
retrieval-augmented generation capabilities.

---
| 13 | + |
| 14 | +## 📂 Dataset Contents |
| 15 | + |
| 16 | +- **[README.md][readme]** - This file (overview and instructions) |
| 17 | +- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text) |
| 18 | +- **[test_prompts.md][prompts]** - 15 test prompts (readable format) |
| 19 | +- **[test_data.json][json]** - Complete dataset (structured format for automated |
| 20 | + testing) |
| 21 | +- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions |
| 22 | + |
| 23 | +📌 **Process documentation:** For background on dataset creation decisions and |
| 24 | +team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)** |
| 25 | + |
| 26 | +[readme]: /test_dataset_apollo11/README.md |
| 27 | +[source]: /test_dataset_apollo11/source_text.txt |
| 28 | +[prompts]: /test_dataset_apollo11/test_prompts.md |
| 29 | +[json]: /test_dataset_apollo11/test_data.json |
| 30 | +[rationale]: /test_dataset_apollo11/RATIONALE.md |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## 📄 Source & License |
| 35 | + |
| 36 | +**Source:** Wikipedia - Apollo 11 article |
| 37 | +**URL:** <https://en.wikipedia.org/wiki/Apollo_11> |
| 38 | +**Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845> |
| 39 | +**Revision ID:** 1252473845 (Wikipedia internal revision number) |
| 40 | +**Date Accessed:** October 22, 2025 |
| 41 | +**Sections:** Excerpted passages from "Lunar landing" and "Lunar surface |
| 42 | +operations" |
| 43 | +**Word Count:** ~1,400 words |
| 44 | +**Language:** English |
| 45 | + |
| 46 | +**License:** Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) |
| 47 | + |
| 48 | +- ✅ Content can be used freely for research |
| 49 | +- ✅ Wikipedia must be attributed as the source |
| 50 | +- ✅ Derivative works must be shared under the same license |
| 51 | + |
| 52 | +**Attribution:** "Apollo 11" by Wikipedia contributors, licensed under CC BY-SA 3.0 |
| 53 | + |
| 54 | +**Text Structure:** Selected passages from Wikipedia sections. |
| 55 | + |
| 56 | +- Individual sentences are unchanged; some paragraphs omitted for length management. |
| 57 | +- Complete original sections total ~3,800 words; excerpted to ~1,400 words for |
| 58 | +practical testing while maintaining all information necessary for the 15 test prompts. |
| 59 | + |
| 60 | +📌 See [source_text.txt][source] for the complete excerpted text. |
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## 🎯 Selection Rationale |
| 65 | + |
| 66 | +✅ **Practical length** - ~1,400 words manageable for all model types including |
| 67 | +distilled models with standard chunking |
| 68 | +✅ **Rich in specific details** - Ideal for RAG testing (times, names, numbers, |
| 69 | +technical terms) |
| 70 | +✅ **Multiple complexity levels** - Both simple recall and complex reasoning can |
| 71 | +be tested |
| 72 | +✅ **Narrative structure** - Clear sequence from descent through surface |
| 73 | +activities |
| 74 | +✅ **All prompts answerable** - 15 test prompts verified to work with selected |
| 75 | +passages |
| 76 | + |
| 77 | +The excerpts cover the dramatic descent and landing sequence, followed by |
| 78 | +moonwalk activities, ensuring comprehensive testing across summarization, |
| 79 | +reasoning, and RAG tasks. |
| 80 | + |
| 81 | +📌 See [RATIONALE.md][rationale] for detailed selection methodology. |
| 82 | + |
| 83 | +--- |
| 84 | + |
## 📝 Test Structure

**15 Standardized Prompts** across three categories:

### Summarization (5 prompts)

Tests a model's ability to condense and extract key information.

**Difficulty:** Easy → Medium → Hard
**Examples:** Main events, challenges faced, activities performed, equipment
deployed

### Reasoning (5 prompts)

Tests a model's ability to analyze, infer, and make connections.

**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
analysis
**Examples:** Why did the computer alarms occur? What if Armstrong hadn't taken
manual control? What does Margaret Hamilton's statement reveal?

### RAG - Retrieval (5 prompts)

Tests a model's ability to retrieve specific information from the source text.

**Types:** Times, quotes, numbers, lists, complex multi-part facts
**Examples:** Landing time? Material collected? Scientific instruments deployed?

📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
for the structured version.

---
## 🔧 How to Use

### General Instructions

- **All 15 prompts** should be tested across all models to ensure a fair comparison.
- Some prompts may be more challenging for smaller models, but attempting all of
  them provides comprehensive evaluation data.

**Testing Protocol:**

**1.** Use the source text from **[source_text.txt][source]** exactly as provided
**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without modification
**3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
  testing workflows
**4.** Record responses for each prompt with model configuration details
**5.** Note any errors, failures, or unusual behaviors
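For scripted runs, the protocol above can be sketched roughly as follows. Note that the field names here (`prompts`, `id`, `category`, `prompt`) are assumptions about the schema; check test_data.json for the actual structure before using this.

```python
import json

# Hypothetical record shape -- the real schema lives in test_data.json,
# so adjust the field names below to match it.
sample = json.loads("""
{
  "prompts": [
    {"id": "S1", "category": "summarization",
     "prompt": "Summarize the main events of the lunar landing."}
  ]
}
""")

def run_all(dataset, model_fn):
    """Send every prompt to a model callable and collect the responses."""
    results = []
    for p in dataset["prompts"]:
        results.append({
            "id": p["id"],
            "category": p["category"],
            "response": model_fn(p["prompt"]),
        })
    return results

# Stub model for illustration; swap in a real API call when testing.
results = run_all(sample, lambda text: "(model output)")
```

Running all prompts through one function like this keeps the comparison fair: every model sees identical, unmodified prompt text.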

---

## 📊 Evaluation

For each prompt, record:

**1. Accuracy** - Is the answer factually correct?
**2. Completeness** - Are all key points covered?
**3. Specificity** - Are specific details included (times, names, numbers)?
**4. Reasoning Quality** - For reasoning prompts, is the logic sound and
well-supported?

Maintain consistent evaluation criteria across all models for fair comparison.
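One minimal way to keep the criteria consistent is a shared score sheet. The layout below is a sketch, not a project standard: one row per (model, prompt) pair, each criterion scored on an assumed 0-1 scale.

```python
import csv
from statistics import mean

# Hypothetical score sheet columns matching the four criteria above.
FIELDS = ["prompt_id", "model", "accuracy", "completeness",
          "specificity", "reasoning_quality"]

rows = [
    {"prompt_id": "S1", "model": "model-a", "accuracy": 1.0,
     "completeness": 0.5, "specificity": 1.0, "reasoning_quality": 1.0},
    {"prompt_id": "R1", "model": "model-a", "accuracy": 1.0,
     "completeness": 1.0, "specificity": 0.5, "reasoning_quality": 1.0},
]

def model_mean(rows, criterion):
    """Average one criterion across all of a model's recorded answers."""
    return mean(r[criterion] for r in rows)

# Persist the sheet so scores can be compared across models later.
with open("scores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Keeping scores in one flat table makes per-model and per-category averages a one-liner, which helps when the comparison spans many model configurations.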

---

## ⚠️ Guidelines

**Critical Rules:**

- **DO NOT modify** the source text
- **DO NOT modify** the prompts
- **DO record** all test configurations (model version, parameters, hardware)
- **DO note** any failures as "No response" or "Error" with details

**Technical Notes:**

- For RAG systems: Load the source text into the database and verify indexing
  before testing
- For models with token limits: Chunking may be required
- Environment: Use consistent hardware and settings when possible
- Environmental measurements: Use standardized protocols
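Where chunking is needed, a minimal word-window splitter could look like the sketch below. The window and overlap sizes are illustrative assumptions, not project settings; pick values that fit the target model's context window.

```python
def chunk_words(text, max_words=300, overlap=50):
    """Split text into overlapping word windows for small-context models.

    The defaults (300-word windows, 50-word overlap) are illustrative,
    not a project requirement.
    """
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

The overlap keeps facts that straddle a chunk boundary (a time in one sentence, its event in the next) retrievable from at least one chunk, which matters for the RAG prompts.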

---

## 📖 How to Cite This Dataset

When referencing this dataset in reports or publications:

> Apollo 11 Test Dataset: Excerpted passages from Wikipedia's "Apollo 11" article
> (Revision 1252473845, accessed October 22, 2025), licensed under CC BY-SA 3.0.
> Available at: <https://en.wikipedia.org/wiki/Apollo_11>

---

*For questions or issues, please contact the project team.
Good luck with testing!* 🚀