Summary
Standardize our model testing on a single Apollo 11 source text and a fixed set of 15 prompts covering summarization, reasoning, and RAG tasks.
Proposal
Source Text: Excerpts from the Wikipedia article “Apollo 11” (the “Lunar Landing” and “Lunar Surface Operations” sections), pinned to a permanent link (~1,400 words, CC BY-SA 3.0)
15 Test Prompts:
- 5 Summarization (easy → hard)
- 5 Reasoning (causal, analytical, hypothetical)
- 5 RAG (fact retrieval with ground-truth answers)
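To make the 5/5/5 prompt mix concrete, here is a minimal sketch of how a prompt set could be stored and sanity-checked, assuming the team picks JSON as the prompt format. It is hypothetical: the field names (`id`, `category`, `prompt`, `ground_truth`) and the helpers `validate_prompt_set` and `make_placeholder_set` are illustrative assumptions, and the prompt texts are placeholders, not the proposed prompts.

```python
import json
from collections import Counter

# Target mix from the proposal: 5 prompts per category.
# Category names and all field names below are illustrative assumptions.
EXPECTED_COUNTS = {"summarization": 5, "reasoning": 5, "rag": 5}


def validate_prompt_set(prompts: list[dict]) -> list[str]:
    """Return a list of problems found in a prompt set (empty list means valid)."""
    problems: list[str] = []

    # 1. Category mix must match the 5 / 5 / 5 split.
    counts = Counter(p.get("category") for p in prompts)
    for category, expected in EXPECTED_COUNTS.items():
        if counts[category] != expected:
            problems.append(f"{category}: expected {expected}, found {counts[category]}")
    unknown = set(counts) - set(EXPECTED_COUNTS)
    if unknown:
        problems.append(f"unknown categories: {sorted(str(c) for c in unknown)}")

    # 2. IDs must be unique so results can be compared across runs and models.
    ids = [p.get("id") for p in prompts]
    duplicates = sorted(str(i) for i, n in Counter(ids).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate ids: {duplicates}")

    # 3. Every prompt needs text; RAG prompts also need a ground-truth answer.
    for p in prompts:
        if not p.get("prompt"):
            problems.append(f"{p.get('id')}: empty prompt")
        if p.get("category") == "rag" and not p.get("ground_truth"):
            problems.append(f"{p.get('id')}: missing ground_truth")
    return problems


def make_placeholder_set() -> list[dict]:
    """Build a 15-prompt set with placeholder text (hypothetical content)."""
    prompts = []
    for category, n in EXPECTED_COUNTS.items():
        for i in range(1, n + 1):
            record = {"id": f"{category}-{i}", "category": category,
                      "prompt": f"<{category} prompt {i}>"}
            if category == "rag":
                record["ground_truth"] = f"<answer {i}>"
            prompts.append(record)
    return prompts


if __name__ == "__main__":
    # Round-trip through JSON, as a shared prompts file would be.
    text = json.dumps(make_placeholder_set(), indent=2)
    print(validate_prompt_set(json.loads(text)))  # → []
```

Keeping the category counts and answer keys machine-checkable means a prompt set that drifts from the agreed split, or a RAG prompt without a ground-truth answer, is caught before any model is scored.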
Why Apollo 11?
- Works across all the model types we test (DistilBERT, SLMs, commercial models)
- Fact-dense, which gives RAG prompts verifiable answers
- Openly licensed (CC BY-SA) and reproducible via the permanent link
- Short enough to run on modest hardware
Open Questions for the Team
Prompt format: JSON for automation or plain text for simplicity?
Text length: Is ~1,400 words the right length, or should we use a shorter excerpt or the full sections?
Multiple sources: Start with one source text, or prepare several?
Next Steps
- Please review the detailed documentation here
- Discuss the open questions and share feedback
- Implement after approval