Commit 71b27d9

Merge pull request #9 from MIT-Emerging-Talent/test_prompts
Milestone 2: Apollo11 test prompts
2 parents b62efdd + 21e6af8, commit 71b27d9

5 files changed: 722 additions & 0 deletions

test_dataset_apollo11/RATIONALE.md
189 additions & 0 deletions
# Rationale for Text Selection

## Overview

Excerpted passages (~1,400 words) from Wikipedia’s Apollo 11 “Lunar landing”
and “Lunar surface operations” sections were selected as the unified test
dataset for the ELO2 - Green AI project.

---
## Why Apollo 11?

**Universal Knowledge:**

All major commercial models (GPT-4, Claude, Gemini) have Apollo 11 in their
training data, enabling fair comparison with and without RAG.

**Rich Factual Content:**

Dense with verifiable facts ideal for RAG testing—timestamps (20:17:40 UTC),
numbers (216 lbs fuel, 21.55 kg samples), names (Armstrong, Aldrin, Hamilton),
and technical terms (LGC, PLSS, EASEP).

**Accessibility:**

Wikipedia content is freely available, properly licensed (CC BY-SA 3.0), and
stable via permanent links.

**Appropriate Length:**

Complete sections total ~3,800 words; excerpted to ~1,400 words—substantial for
evaluation yet processable by smaller models on standard hardware. This length
aligns with standard benchmarks:

- Summarization tasks typically use 500-2,000 words,
- QA benchmarks 300-1,500 words,
- RAG evaluations 1,000-3,000 words (Rajpurkar et al., 2016; Hermann et al., 2015).

The excerpted length balances comprehensiveness with practical testability.

---
## Why These Excerpted Passages?

**Continuous Narrative:**

Selected passages flow from descent through surface activities, forming a natural
story arc ideal for summarization tasks requiring temporal understanding.

**Balanced Complexity:**

- Simple facts (times, names, quotes) suitable for smaller and distilled models
- Complex elements (technical problems, decision-making, procedures) challenging
  for all models

**Optimal for RAG:**

Dense with retrievable facts across categories—times, quantities, names,
equipment, quotes.

**Reasoning Opportunities:**

Supports causal (Why?), hypothetical (What if?), interpretive (What does X reveal?),
and analytical reasoning.

**Verified Coverage:**

All 15 test prompts were confirmed answerable with the excerpted passages through
preliminary testing.

**Length Management:**

Complete sections (~3,800 words) would require extensive chunking for distilled
models with limited token capacity. Excerpted passages (~1,400 words) are more
manageable while maintaining comprehensive content for all test scenarios.

---
## Alignment with Project Goals

**Fair Comparison:**

- Commercial models tested on likely training data
- RAG systems given the same information
- All models evaluated on identical input

**Reproducibility:**

Permanent Wikipedia link, documented excerpt selections, license documentation.

**Why Not Other Approaches?**

- Entire Wikipedia article (all sections)?
  Too long (~10,000+ words)—exceeds the processing capacity of smaller models and
  is impractical for manual verification.
- Self-written summary?
  Custom summaries cannot be reproduced by others and raise objectivity concerns
  plus potential copyright issues.
- Multiple unrelated passages?
  Disconnected excerpts (e.g., Apollo 11 + climate change) break narrative flow and
  prevent reasoning questions that require connected context.
- Technical manuals or engineering documents?
  NASA reports are too specialized, likely absent from training data, and limit
  question diversity to technical retrieval.
- Complete sections without excerpting?
  While more comprehensive, ~3,800 words presents practical challenges for smaller
  models and extends testing time. Excerpting maintains essential information
  while improving testability across architectures.

---
## Excerpt Selection Methodology

**From “Lunar landing” section:**

- Descent problems and trajectory issues
- Computer alarms (1201, 1202) and Margaret Hamilton’s explanation
- Manual landing sequence with fuel concerns
- Landing confirmation moment

**From “Lunar surface operations” section:**

- EVA preparation and first step
- Armstrong’s famous quote and its controversy
- Surface activities and movement
- Flag planting and Nixon communication
- Scientific equipment deployment (EASEP)
- Sample collection activities
- Return to lunar module

**Omitted content:**

- Extended technical explanations of radar systems
- Detailed crew dialogue transcripts
- Some procedural minutiae

**Selection criteria:**

- Information density for prompts
- Narrative continuity
- Factual richness for RAG tasks
- Reasoning opportunities

---
## Limitations

**Excerpt nature:**

Using selected passages rather than complete sections reduces some contextual
richness, though all test prompts remain fully answerable.

**Single domain:**

Results may not generalize beyond this topic.

- *Acknowledgment:* This is a focused benchmark within a defined scope.

---

## Conclusion

The excerpted passages from the *“Lunar landing”* and *“Lunar surface operations”*
sections provide:

✅ Practical content for all model types
✅ Reproducibility through permanent links and documented selections
✅ Balance of factual density and narrative coherence
✅ Support for diverse question types
✅ Academic integrity through proper licensing and attribution
✅ Alignment with Green AI benchmarking objectives

This selection enables fair, transparent comparison of AI model accuracy and
environmental efficiency while maintaining practical testability on available
hardware.

---
## References

**Hermann, K. M., Kociský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman,
M., & Blunsom, P. (2015).**
*Teaching Machines to Read and Comprehend.*
*Advances in Neural Information Processing Systems, 28.*
[arXiv:1506.03340](https://arxiv.org/abs/1506.03340)

**Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016).**
*SQuAD: 100,000+ Questions for Machine Comprehension of Text.*
*Proceedings of the 2016 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 2383–2392.*
[ACL Anthology D16-1264](https://aclanthology.org/D16-1264/)
test_dataset_apollo11/README.md
181 additions & 0 deletions
# 🚀 Apollo 11 Test Dataset

## 🌕 Overview

This is the unified test dataset for comparing different AI models (commercial,
distilled, SLM, and RAG systems) in the ELO2 - Green AI project.

The dataset consists of selected passages from Wikipedia's Apollo 11 article,
accompanied by 15 standardized prompts testing summarization, reasoning, and
retrieval-augmented generation capabilities.

---

## 📂 Dataset Contents

- **[README.md][readme]** - This file (overview and instructions)
- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text)
- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
- **[test_data.json][json]** - Complete dataset (structured format for automated
  testing)
- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions

📌 **Process documentation:** For background on dataset creation decisions and
team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)**.

[readme]: /test_dataset_apollo11/README.md
[source]: /test_dataset_apollo11/source_text.txt
[prompts]: /test_dataset_apollo11/test_prompts.md
[json]: /test_dataset_apollo11/test_data.json
[rationale]: /test_dataset_apollo11/RATIONALE.md

---
## 📄 Source & License

**Source:** Wikipedia - Apollo 11 article
**URL:** <https://en.wikipedia.org/wiki/Apollo_11>
**Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845>
**Revision ID:** 1252473845 (Wikipedia internal revision number)
**Date Accessed:** October 22, 2025
**Sections:** Excerpted passages from "Lunar landing" and "Lunar surface
operations"
**Word Count:** ~1,400 words
**Language:** English

**License:** Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)

- ✅ Content can be used freely for research
- ✅ Wikipedia must be attributed as the source
- ✅ Derivative works must be shared under the same license

**Attribution:** "Apollo 11" by Wikipedia contributors, licensed under CC BY-SA 3.0

**Text Structure:** Selected passages from Wikipedia sections.

- Individual sentences are unchanged; some paragraphs were omitted for length
  management.
- The complete original sections total ~3,800 words; they were excerpted to
  ~1,400 words for practical testing while maintaining all information necessary
  for the 15 test prompts.

📌 See [source_text.txt][source] for the complete excerpted text.

---
## 🎯 Selection Rationale

- **Practical length** - ~1,400 words, manageable for all model types including
  distilled models with standard chunking
- **Rich in specific details** - Ideal for RAG testing (times, names, numbers,
  technical terms)
- **Multiple complexity levels** - Both simple recall and complex reasoning can
  be tested
- **Narrative structure** - Clear sequence from descent through surface
  activities
- **All prompts answerable** - 15 test prompts verified to work with the
  selected passages

The excerpts cover the dramatic descent and landing sequence, followed by
moonwalk activities, ensuring comprehensive testing across summarization,
reasoning, and RAG tasks.

📌 See [RATIONALE.md][rationale] for detailed selection methodology.

---
## 📝 Test Structure

**15 Standardized Prompts** across three categories:

### Summarization (5 prompts)

Tests a model's ability to condense and extract key information.

**Difficulty:** Easy → Medium → Hard
**Examples:** Main events, challenges faced, activities performed, equipment
deployed

### Reasoning (5 prompts)

Tests a model's ability to analyze, infer, and make connections.

**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
analysis
**Examples:** Why did the computer alarms occur? What if Armstrong hadn't taken
manual control? What does Margaret Hamilton's statement reveal?

### RAG - Retrieval (5 prompts)

Tests a model's ability to retrieve specific information from the source text.

**Types:** Times, quotes, numbers, lists, complex multi-part facts
**Examples:** Landing time? Material collected? Scientific instruments deployed?

📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
for the structured version.

---
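For scripted workflows, the JSON file can be loaded and the prompts grouped by
category. A minimal sketch, assuming a hypothetical schema with `id`, `category`,
and `prompt` fields (check test_data.json for the actual field names); the inline
sample stands in for the real file:

```python
import json
from collections import defaultdict

# Hypothetical entries mirroring an assumed test_data.json schema;
# the real file's field names and values may differ.
sample_json = """[
  {"id": 1, "category": "summarization", "prompt": "Summarize the main events of the lunar landing."},
  {"id": 6, "category": "reasoning", "prompt": "Why did the computer alarms occur during descent?"},
  {"id": 11, "category": "rag", "prompt": "At what time did the lunar module land?"}
]"""

prompts = json.loads(sample_json)

# Group prompts by category so each test type (summarization,
# reasoning, RAG) can be run as a batch.
by_category = defaultdict(list)
for entry in prompts:
    by_category[entry["category"]].append(entry["prompt"])

for category, items in sorted(by_category.items()):
    print(f"{category}: {len(items)} prompt(s)")
```

With the real file, `json.loads` would be replaced by `json.load(open(path))`;
the grouping logic stays the same.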
## 🔧 How to Use

### General Instructions

- **All 15 prompts** should be tested across all models to ensure a fair comparison.
- Some prompts may be more challenging for smaller models,
  but attempting all prompts provides comprehensive evaluation data.

**Testing Protocol:**

**1.** Use the source text from **[source_text.txt][source]** exactly as provided
**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without modification
**3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
testing workflows
**4.** Record responses for each prompt with model configuration details
**5.** Note any errors, failures, or unusual behaviors

---
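The five-step protocol above can be sketched as a small harness. Here `run_model`
is a hypothetical stand-in for whatever inference call a team uses (API client,
local pipeline, or RAG query), not a real library API:

```python
# Minimal sketch of the testing protocol: run every prompt unmodified,
# record the response with configuration details, and note failures.
def run_all_prompts(prompts, run_model, model_config):
    records = []
    for i, prompt in enumerate(prompts, start=1):
        try:
            response = run_model(prompt)   # prompt passed through unmodified
        except Exception as exc:
            response = f"Error: {exc}"     # failures are recorded, not skipped
        records.append({
            "prompt_id": i,
            "prompt": prompt,
            "response": response,
            "config": model_config,        # model version, parameters, hardware
        })
    return records

# Example with a dummy model that echoes the prompt:
records = run_all_prompts(
    ["Landing time?", "Material collected?"],
    run_model=lambda p: f"(answer to: {p})",
    model_config={"model": "example-model", "hardware": "laptop CPU"},
)
print(len(records))  # one record per prompt
```

The same loop works for all 15 prompts once they are loaded from the dataset
files.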
## 📊 Evaluation

For each prompt, record:

**1. Accuracy** - Is the answer factually correct?
**2. Completeness** - Are all key points covered?
**3. Specificity** - Are specific details included (times, names, numbers)?
**4. Reasoning Quality** - For reasoning prompts, is the logic sound and
well-supported?

Maintain consistent evaluation criteria across all models for fair comparison.

---
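One way to keep these criteria consistent across models is a fixed record per
prompt. A minimal sketch, assuming a 1-5 score per criterion (the scale is an
assumption here, not a project requirement):

```python
from dataclasses import dataclass

# One evaluation record per prompt, covering the four criteria above.
# The 1-5 scale is illustrative; teams may prefer pass/fail or 0-10.
@dataclass
class PromptEvaluation:
    prompt_id: int
    accuracy: int           # is the answer factually correct?
    completeness: int       # are all key points covered?
    specificity: int        # times, names, numbers included?
    reasoning_quality: int  # sound, well-supported logic (reasoning prompts)

    def total(self) -> int:
        return (self.accuracy + self.completeness
                + self.specificity + self.reasoning_quality)

ev = PromptEvaluation(prompt_id=7, accuracy=5, completeness=4,
                      specificity=4, reasoning_quality=5)
print(ev.total())  # 18
```

Using the same record shape for every model makes the cross-model comparison
straightforward to tabulate.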
## ⚠️ Guidelines

**Critical Rules:**

- **DO NOT modify** the source text
- **DO NOT modify** the prompts
- **DO record** all test configurations (model version, parameters, hardware)
- **DO note** any failures as "No response" or "Error" with details

**Technical Notes:**

- For RAG systems: Load the source text into the database and verify indexing
  before testing
- For models with token limits: Chunking may be required
- Environment: Use consistent hardware and settings when possible
- Environmental measurements: Use standardized protocols

---
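For models that need chunking, a simple word-based splitter is usually enough
for a ~1,400-word text. A minimal sketch; the 250-word chunk size and 50-word
overlap are illustrative values, not project settings:

```python
# Word-based chunking sketch for models with small context windows.
# Overlapping chunks reduce the chance of splitting a fact across a boundary.
def chunk_words(text, chunk_size=250, overlap=50):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A ~1,400-word text at these settings yields 7 overlapping chunks.
demo_text = " ".join(["word"] * 1400)
print(len(chunk_words(demo_text)))  # 7
```

Chunk size should be tuned to each model's actual token limit; word counts are
only a rough proxy for tokens.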
## 📖 How to Cite This Dataset

When referencing this dataset in reports or publications:

> Apollo 11 Test Dataset: Excerpted passages from Wikipedia's "Apollo 11" article
> (Revision 1252473845, accessed October 22, 2025), licensed under CC BY-SA 3.0.
> Available at: <https://en.wikipedia.org/wiki/Apollo_11>

---

*For questions or issues, please contact the project team.
Good luck with testing!* 🚀
