@@ -13,34 +13,34 @@ retrieval-augmented generation capabilities.
 
 ## 📂 Dataset Contents
 
-- **[README.md][readme]** - This file (overview and instructions)
-- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text)
-- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
+- **[README.md][readme]** - This file (overview and instructions)
+- **[source_text.txt][source]** - Apollo 11 excerpted text (~1,400 words, plain text)
+- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
 - **[test_data.json][json]** - Complete dataset (structured format for automated
-  testing)
+  testing)
 - **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions
 
 📌 **Process documentation:** For background on dataset creation decisions and
 team discussions, see the **[team briefing](https://docs.google.com/document/d/1jAE2Y2BJDx014MAXCxyH0-2EgieL_tCxCEeMK4VWBNQ/edit?usp=sharing)**
 
-[readme]: /test_dataset_apollo11/README.md
-[source]: /test_dataset_apollo11/source_text.txt
-[prompts]: /test_dataset_apollo11/test_prompts.md
-[json]: /test_dataset_apollo11/test_data.json
+[readme]: /test_dataset_apollo11/README.md
+[source]: /test_dataset_apollo11/source_text.txt
+[prompts]: /test_dataset_apollo11/test_prompts.md
+[json]: /test_dataset_apollo11/test_data.json
 [rationale]: /test_dataset_apollo11/RATIONALE.md
 
 ---
 
 ## 📄 Source & License
 
-**Source:** Wikipedia - Apollo 11 article
-**URL:** <https://en.wikipedia.org/wiki/Apollo_11>
-**Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845>
-**Revision ID:** 1252473845 (Wikipedia internal revision number)
-**Date Accessed:** October 22, 2025
+**Source:** Wikipedia - Apollo 11 article
+**URL:** <https://en.wikipedia.org/wiki/Apollo_11>
+**Permanent Link:** <https://en.wikipedia.org/w/index.php?title=Apollo_11&oldid=1252473845>
+**Revision ID:** 1252473845 (Wikipedia internal revision number)
+**Date Accessed:** October 22, 2025
 **Sections:** Excerpted passages from "Lunar landing" and "Lunar surface
-operations"
-**Word Count:** ~1,400 words
+operations"
+**Word Count:** ~1,400 words
 **Language:** English
 
 **License:** Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0)
@@ -63,15 +63,15 @@ practical testing while maintaining all information necessary for the 15 test pr
 
 ## 🎯 Selection Rationale
 
-✅ **Practical length** - ~1,400 words manageable for all model types including
+- ✅ **Practical length** - ~1,400 words manageable for all model types including
 distilled models with standard chunking
-✅ **Rich in specific details** - Ideal for RAG testing (times, names, numbers,
+- ✅ **Rich in specific details** - Ideal for RAG testing (times, names, numbers,
 technical terms)
-✅ **Multiple complexity levels** - Both simple recall and complex reasoning can
+- ✅ **Multiple complexity levels** - Both simple recall and complex reasoning can
 be tested
-✅ **Narrative structure** - Clear sequence from descent through surface
+- ✅ **Narrative structure** - Clear sequence from descent through surface
 activities
-✅ **All prompts answerable** - 15 test prompts verified to work with selected
+- ✅ **All prompts answerable** - 15 test prompts verified to work with selected
 passages
 
 The excerpts cover the dramatic descent and landing sequence, followed by
@@ -90,7 +90,7 @@ reasoning, and RAG tasks.
 
 Tests model's ability to condense and extract key information
 
-**Difficulty:** Easy → Medium → Hard
+**Difficulty:** Easy → Medium → Hard
 **Examples:** Main events, challenges faced, activities performed, equipment
 deployed
 
@@ -99,15 +99,15 @@ deployed
 Tests model's ability to analyze, infer, and make connections
 
 **Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
-analysis
+analysis
 **Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
 manual control? What does Margaret Hamilton's statement reveal?
 
 ### RAG - Retrieval (5 prompts)
 
 Tests model's ability to retrieve specific information from source text
 
-**Types:** Times, quotes, numbers, lists, complex multi-part facts
+**Types:** Times, quotes, numbers, lists, complex multi-part facts
 **Examples:** Landing time? Material collected? Scientific instruments deployed?
 
 📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
@@ -125,11 +125,11 @@ but attempting all prompts provides comprehensive evaluation data.
 
 **Testing Protocol:**
 
-**1.** Use the source text from **[source_text.txt][source]** exactly as provided
-**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without modification
+**1.** Use the source text from **[source_text.txt][source]** exactly as provided
+**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without modification
 **3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
-testing workflows
-**4.** Record responses for each prompt with model configuration details
+testing workflows
+**4.** Record responses for each prompt with model configuration details
 **5.** Note any errors, failures, or unusual behaviors
 
 ---
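
The testing protocol in this hunk can be sketched as a small Python harness. This is only an illustration under stated assumptions, not part of the dataset: the `ask_model` callable, the `model_config` dictionary, and the prompt keys `id`, `category`, and `prompt` are hypothetical, because the actual schema of `test_data.json` is not shown in this diff.

```python
from datetime import datetime, timezone
from typing import Callable


def run_protocol(
    source_text: str,
    prompts: list[dict],
    ask_model: Callable[[str, str], str],
    model_config: dict,
) -> list[dict]:
    """Run every prompt against the unmodified source text and record the outcome.

    Assumed (hypothetical) prompt shape: {"id": ..., "category": ..., "prompt": ...}.
    """
    records = []
    for prompt in prompts:
        record = {
            "prompt_id": prompt["id"],
            "category": prompt["category"],
            "prompt": prompt["prompt"],          # step 2: prompt text is passed through unchanged
            "model_config": model_config,        # step 4: configuration stored with each response
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "response": None,
            "error": None,
        }
        try:
            # step 1: the source text is handed over exactly as provided
            record["response"] = ask_model(source_text, prompt["prompt"])
        except Exception as exc:
            # step 5: failures are noted instead of aborting the whole run
            record["error"] = f"{type(exc).__name__}: {exc}"
        records.append(record)
    return records


if __name__ == "__main__":
    # Stub model used only for demonstration; replace with a real client.
    def echo_model(source: str, prompt: str) -> str:
        return f"(stub answer to: {prompt})"

    demo_prompts = [{"id": "rag-1", "category": "rag", "prompt": "Landing time?"}]
    for rec in run_protocol("source text here", demo_prompts, echo_model, {"model": "stub"}):
        print(rec["prompt_id"], rec["response"], rec["error"])
```

Catching exceptions per prompt keeps one failing prompt from hiding the results of the other fourteen, which matches the README's advice to attempt all prompts.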
@@ -138,9 +138,9 @@ but attempting all prompts provides comprehensive evaluation data.
 
 For each prompt, record:
 
-**1. Accuracy** - Is the answer factually correct?
-**2. Completeness** - Are all key points covered?
-**3. Specificity** - Are specific details included (times, names, numbers)?
+**1. Accuracy** - Is the answer factually correct?
+**2. Completeness** - Are all key points covered?
+**3. Specificity** - Are specific details included (times, names, numbers)?
 **4. Reasoning Quality** - For reasoning prompts, is the logic sound and
   well-supported?
 
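
The four evaluation criteria listed in this hunk could be captured in a small record type. The sketch below is hypothetical: the README does not define a scoring scale, so the 1 to 5 range, the `Evaluation` class, the `summarize` helper, and the category label `"reasoning"` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

CRITERIA = ("accuracy", "completeness", "specificity", "reasoning_quality")


@dataclass
class Evaluation:
    """One evaluator's scores for a single prompt (assumed scale: 1 to 5)."""

    prompt_id: str
    category: str
    accuracy: int
    completeness: int
    specificity: int
    reasoning_quality: Optional[int] = None  # only scored for reasoning prompts

    def __post_init__(self) -> None:
        for name in CRITERIA:
            value = getattr(self, name)
            if value is not None and not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")
        if self.category == "reasoning" and self.reasoning_quality is None:
            raise ValueError("reasoning prompts need a reasoning_quality score")


def summarize(evaluations: list[Evaluation]) -> dict:
    """Average each criterion over the evaluations that scored it."""
    summary = {}
    for name in CRITERIA:
        scores = [getattr(e, name) for e in evaluations if getattr(e, name) is not None]
        if scores:
            summary[name] = mean(scores)
    return summary


if __name__ == "__main__":
    evals = [
        Evaluation("rag-1", "rag", accuracy=5, completeness=4, specificity=5),
        Evaluation("reason-1", "reasoning", accuracy=3, completeness=4,
                   specificity=3, reasoning_quality=4),
    ]
    print(summarize(evals))
```

Leaving `reasoning_quality` optional mirrors the README's wording that the fourth criterion applies to reasoning prompts only.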