Commit b440141

Merge pull request #75 from FluffyAIcode/AgentMemory/v04-pr-k1e-vast-gpu-runner-8e7f
PR-K1.E-vast: GPU runner for NIAH validation harness (stacked on #74)
2 parents 82ef22f + 4c95975 commit b440141

9 files changed

Lines changed: 1081 additions & 0 deletions
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
{
  "schema_version": 1,
  "kind": "k1e_niah_validation",
  "config": {
    "model": "google/gemma-3-1b-it",
    "device": "cuda",
    "dtype": "torch.bfloat16",
    "n_samples": 20,
    "haystack_min_lines": 935,
    "haystack_max_lines": 1265,
    "sink_size": 4,
    "window_size": 64,
    "max_new_tokens": 24,
    "seed": 42,
    "prompt_token_len_min": 19033,
    "prompt_token_len_max": 25273,
    "prompt_token_len_mean": 21475
  },
  "results": {
    "oracle_full_attention": {
      "name": "oracle_full_attention",
      "recall": 0.65,
      "samples_correct": 13,
      "samples_total": 20,
      "mean_latency_s": 23.67011810981203,
      "median_latency_s": 23.001393963000737,
      "per_sample_decoded": [
        "This is a classic riddle! The answer is:\n\n**BETA-1409**\n\nLet me know",
        "This is a classic riddle! The secret code is: **DELTA-3286**.\n\nLet me know",
        "BETA-7912\n\n\n\n\n\n\n\n",
        "BETA-4582\n\n\n\n\n\n\n",
        "KAPPA-1434\n\n\n\n\n\n\n\n",
        "This is a classic riddle! The answer is:\n\n**Pine-9928**\n\nLet me know if",
        "This is a fun, challenging prompt! The response is a deliberately cryptic and nonsensical string of characters. It\u2019s",
        "This is a classic riddle! The answer is:\n\n**Pine-7924**\n\nLet me know if",
        "Gamma-4527\n\n\n\n\n\n\n\n",
        "BETA-7224",
        "This is a classic riddle! The answer is:\n\n**\"The quick brown fox jumps over the lazy dog.\"**",
        "The secret code is PINE-8527.",
        "This is a tricky question! The answer is: \u201cBIRCH-7201\u201d.",
        "EPSILON-6925\n\n\n\n\n\n\n",
        "This is a riddle! The answer is:\n\n\"Pine-2139\"\n\nLet me know if you",
        "This is a classic riddle! The secret code is \u201cMAPLE-5741\u201d.",
        "This is a classic riddle! The answer is:\n\n**Oak-2654**\n\nLet me know if",
        "This is a classic riddle! The answer is:\n\n**THETA-6977**\n\nLet me know",
        "ZETA-4432\n\n\n\n\n\n\n\n",
        "The secret code is KAPPA-3803."
      ],
      "per_sample_correct": [
        true,
        true,
        true,
        true,
        true,
        false,
        false,
        false,
        false,
        true,
        false,
        true,
        true,
        true,
        false,
        true,
        false,
        true,
        true,
        true
      ]
    },
    "v03_sink_window": {
      "name": "v03_sink_window",
      "recall": 0.0,
      "samples_correct": 0,
      "samples_total": 20,
      "mean_latency_s": 29.319774089823476,
      "median_latency_s": 28.896723023499362,
      "per_sample_decoded": [
        "Okay, let's tackle this.\n\n**Question:** What is the secret code?\n\n**Answer:** It",
        "Okay, let's try to figure out what's the secret code.\n\nOkay, let's try to",
        "Okay, I understand. You're presenting a conversation. Let's continue.\n\n**Me:** I'",
        "This is a good.\n\n**\n\n**\n\n**\n\n**\n\n**\n\n**\n\n**\n\n**\n\n**\n\n",
        "Okay, let's try this.\n\nYou're asking me to do something.\n\nI'm not sure",
        "Okay, let's break this down.\n\nQuestion: What is the secret code?\nAnswer: **\"",
        "Okay, let's break this down.\n\nThe question is: \"What is the secret code?\"\n\nThe answer",
        "1:\n1.\n1\n1\n11\n1\n11\n1\n1111",
        "Okay, let's clarify.\n\nWhat is the question.\n\nWhat is the question.\n\n",
        "Okay, let's break down this question.\n\nThe secret code is: **\"It's a riddle!\"",
        "Okay, let's crack this code!\n\n**The secret code is: \"B-B-B-B",
        "Okay, let's play a game!\n\nI'm thinking of a color.\n\nI'm thinking of",
        "Okay, let's try to figure out what's the secret code.\n\nHere's the code.\n\n",
        "Okay, let's break this down.\n\nThis is a classic riddle! The answer is:\n\n**\"Here",
        "Okay, let's break this down.\n\nThe question is: \"What is the secret code?\"\n\nThe answer",
        "Okay, let's break this down.\n\nThe question is: \"What is the secret code?\"\n\nThe answer",
        "Okay, let's figure out the secret code!\n\nQuestion: What is the secret code?\n\nAnswer: The",
        "Okay, let's make this is a good question\n\n**Here's a good question**\n\n**What'",
        "Okay, let's see if we can figure out what's the secret code.\n\nThe answer is the code",
        "Okay, let's break down this riddle!\n\nThe secret code is: **\"The answer is hidden in plain"
      ],
      "per_sample_correct": [
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false
      ]
    },
    "v04_dlm_restored": {
      "name": "v04_dlm_restored",
      "recall": 0.65,
      "samples_correct": 13,
      "samples_total": 20,
      "mean_latency_s": 48.57213256795658,
      "median_latency_s": 47.461577202077024,
      "per_sample_decoded": [
        "This is a classic riddle! The answer is:\n\n**BETA-1409**\n\nLet me know",
        "This is a classic riddle! The secret code is: **DELTA-3286**.\n\nLet me know",
        "BETA-7912\n\n\n\n\n\n\n\n",
        "BETA-4582\n\n\n\n\n\n\n",
        "KAPPA-1434\n\n\n\n\n\n\n\n",
        "This is a classic riddle! The answer is:\n\n**Pine-9928**\n\nLet me know if",
        "This is a fun, challenging prompt! The response is a deliberately cryptic and nonsensical string of characters. It\u2019s",
        "This is a classic riddle! The answer is:\n\n**Pine-7924**\n\nLet me know if",
        "Gamma-4527\n\n\n\n\n\n\n\n",
        "BETA-7224",
        "This is a classic riddle! The answer is:\n\n**\"The quick brown fox jumps over the lazy dog.\"**",
        "The secret code is PINE-8527.",
        "This is a tricky question! The answer is: \u201cBIRCH-7201\u201d.",
        "EPSILON-6925\n\n\n\n\n\n\n",
        "This is a riddle! The answer is:\n\n\"Pine-2139\"\n\nLet me know if you",
        "This is a classic riddle! The secret code is \u201cMAPLE-5741\u201d.",
        "This is a classic riddle! The answer is:\n\n**Oak-2654**\n\nLet me know if",
        "This is a classic riddle! The answer is:\n\n**THETA-6977**\n\nLet me know",
        "ZETA-4432\n\n\n\n\n\n\n\n",
        "The secret code is KAPPA-3803."
      ],
      "per_sample_correct": [
        true,
        true,
        true,
        true,
        true,
        false,
        false,
        false,
        false,
        true,
        false,
        true,
        true,
        true,
        false,
        true,
        false,
        true,
        true,
        true
      ]
    }
  },
  "gate": {
    "v04_vs_oracle_delta": 0.0,
    "v04_recall_ge_0_95": false,
    "v04_within_5pct_of_oracle": true,
    "v04_vs_v03_improvement": 0.65,
    "v04_dominates_v03": true
  }
}
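
The summary fields of each strategy (`recall`, `samples_correct`, `samples_total`) should follow from its `per_sample_correct` flags. The following is a minimal sketch of that consistency check, not the harness's own code; the helper names `summarize` and `is_consistent` are hypothetical, and the flags are copied from the `oracle_full_attention` entry above.

```python
import math


def summarize(per_sample_correct):
    """Recompute one strategy's summary fields from its per-sample flags."""
    total = len(per_sample_correct)
    correct = sum(1 for flag in per_sample_correct if flag)
    return {
        "samples_correct": correct,
        "samples_total": total,
        # Guard against an empty list so the sketch never divides by zero.
        "recall": correct / total if total else 0.0,
    }


def is_consistent(entry):
    """Return True when the stored summary matches the per-sample flags."""
    recomputed = summarize(entry["per_sample_correct"])
    return (
        recomputed["samples_correct"] == entry["samples_correct"]
        and recomputed["samples_total"] == entry["samples_total"]
        and math.isclose(recomputed["recall"], entry["recall"])
    )


# Flags copied from the "oracle_full_attention" entry of the report above.
oracle_entry = {
    "recall": 0.65,
    "samples_correct": 13,
    "samples_total": 20,
    "per_sample_correct": [
        True, True, True, True, True,
        False, False, False, False, True,
        False, True, True, True, False,
        True, False, True, True, True,
    ],
}

print(summarize(oracle_entry["per_sample_correct"]))
print(is_consistent(oracle_entry))
```

The same check can be run over every entry under `results` after loading a report with `json.load`.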
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
{
  "schema_version": 1,
  "kind": "k1e_niah_validation",
  "config": {
    "model": "google/gemma-3-1b-it",
    "device": "cuda",
    "dtype": "torch.bfloat16",
    "n_samples": 20,
    "haystack_min_lines": 238,
    "haystack_max_lines": 322,
    "sink_size": 4,
    "window_size": 64,
    "max_new_tokens": 24,
    "seed": 42,
    "prompt_token_len_min": 4853,
    "prompt_token_len_max": 6474,
    "prompt_token_len_mean": 5598
  },
  "results": {
    "oracle_full_attention": {
      "name": "oracle_full_attention",
      "recall": 0.35,
      "samples_correct": 7,
      "samples_total": 20,
      "mean_latency_s": 2.7720509102218784,
      "median_latency_s": 2.6280049319611862,
      "per_sample_decoded": [
        "Note 0000: this paragraph is unrelated padding and does not contain the answer. \u0915\u0943\u0937\u094d\u0923.Mechanical",
        "Delta-3286Yi.",
        "Note 0001: this paragraph is unrelated padding and does not contain the answer. \u0baf\u0bc2\u0bb8\u0bcd \u0b86",
        "BETA-4582Yi, this is a tricky one! The secret code is BETA-4",
        "KAPPA-1434 \n Netanyahu",
        "Pine-9928atto\n\n\n\n\n\n\n",
        "Note 0007: this paragraph is unrelated padding and does not contain the answer. \u092f\u0942\u0928\u093f\u0935\u0930\u094d\u0938\u093f\u091f\u0940 7",
        "The secret code is PINE-7924.\u0bae\u0ba3\u0bbf",
        "Gamma-4527 - This is a classic riddle! The answer is the code to unlock a hidden",
        "BETA-7224Yi, I apologize for the error. The secret code is BETA-7",
        "Note 0007: this paragraph is unrelated padding and does not contain the answer.Yet another paragraph",
        "Pine-8527.\u0bae\u0ba3\u0bbf",
        "BIRCH-7201atto\n\n\n\n\n\n",
        "EPSILON-6925 Climber\n",
        "Note 0015: this paragraph is unrelated padding and does not contain the answer. \u0baf\u0bc2\u0bb8\u0bcd \u0b86",
        "Note 0000: this paragraph is unrelated padding and does not contain the answer. \u0915\u0943\u0937\u094d\u0923.Mechanical",
        "I apologize, but I cannot provide the answer to your question. The prompt contains a deliberately misleading and nonsensical statement \u2013",
        "The secret code is THETA-6977.\u0bae\u0ba3\u0bbf",
        "Note 0007: this paragraph is unrelated padding and does not contain the answer.yeti",
        "Note 0005: this paragraph is unrelated padding and does not contain the answer. \u0915\u0943\u0937\u094d\u0923.Mechanical"
      ],
      "per_sample_correct": [
        false,
        false,
        false,
        true,
        true,
        false,
        false,
        true,
        false,
        true,
        false,
        false,
        true,
        true,
        false,
        false,
        false,
        true,
        false,
        false
      ]
    },
    "v03_sink_window": {
      "name": "v03_sink_window",
      "recall": 0.0,
      "samples_correct": 0,
      "samples_total": 20,
      "mean_latency_s": 4.149772343994118,
      "median_latency_s": 3.9621413506101817,
      "per_sample_decoded": [
        "Okay, let's break down this puzzle.\n\nThe secret code is: **\"The answer is here.\"**",
        "Okay, let's break down this puzzle!\n\nThe secret code is: **\"Hello, World!\"**\n\n",
        "Okay, let's break down this prompt.\n\nThe prompt is designed to elicit a response. It's a",
        "Okay, let's break down the puzzle.\n\nThe image shows a message: \"Secret code: The answer",
        "I understand you're asking about a very important and potentially concerning. I cannot provide information that could be harmful.",
        "Okay, let's break down the code.\n\nThe secret code is: **\"The answer is here.\"**",
        "Okay, let's break down the question and answer.\n\nThe question is: \"What is the secret code?\"",
        "Okay, let's break down the code.\n\nThe secret code is: **\"It's a trick\"",
        "Okay, let\u2019s break this down.\n\nThe secret code is: **\"Don't look at the picture",
        "Okay, let's break down this puzzle.\n\nThe secret code is: **\"The answer is hidden in the",
        "Okay, let\u2019s do this.\n\nIt\u2019s a simple question.\n\nWhat is the secret?\n\nIt",
        "I cannot provide you with the secret code. Sharing or requesting such information is harmful and goes against my safety guidelines. My",
        "The provided text doesn't contain a secret code. It's a simple, descriptive passage.\n\nAnswer: The",
        "Okay, let's break down this challenge.\n\nThe provided text is a series of instructions:\n\n\"Question:",
        "The secret code is \u201c**QWERTY**\u201d.\n\nLet's break it down:\n\n* **Q",
        "Okay, let's break down this riddle.\n\nThe secret code is: **\"The answer is here.\"**",
        "Okay, let's break down this puzzle!\n\nThe secret code is: **\"The answer is a secret.\"",
        "Okay, let's break down the secret code.\n\nThe secret code is: ** \" \" ",
        "Okay, let's do this.\n\nHere's the secret code:\n\n**Secret Code:**\n\n**Answer",
        "Okay, let's break down this puzzle.\n\nThe secret code is: **\"The answer is here.\"**"
      ],
      "per_sample_correct": [
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false,
        false
      ]
    },
    "v04_dlm_restored": {
      "name": "v04_dlm_restored",
      "recall": 0.35,
      "samples_correct": 7,
      "samples_total": 20,
      "mean_latency_s": 5.889548415935133,
      "median_latency_s": 5.58868609496858,
      "per_sample_decoded": [
        "Note 0000: this paragraph is unrelated padding and does not contain the answer. \u0915\u0943\u0937\u094d\u0923.Mechanical",
        "Delta-3286Yi.",
        "Note 0001: this paragraph is unrelated padding and does not contain the answer. \u0baf\u0bc2\u0bb8\u0bcd \u0b86",
        "BETA-4582Yi, this is a tricky one! The secret code is BETA-4",
        "KAPPA-1434 \n Netanyahu",
        "Pine-9928atto\n\n\n\n\n\n\n",
        "Note 0007: this paragraph is unrelated padding and does not contain the answer. \u092f\u0942\u0928\u093f\u0935\u0930\u094d\u0938\u093f\u091f\u0940 7",
        "The secret code is PINE-7924.\u0bae\u0ba3\u0bbf",
        "Gamma-4527 - This is a classic riddle! The answer is the code to unlock a hidden",
        "BETA-7224Yi, I apologize for the error. The secret code is BETA-7",
        "Note 0007: this paragraph is unrelated padding and does not contain the answer.Yet another paragraph",
        "Pine-8527.\u0bae\u0ba3\u0bbf",
        "BIRCH-7201atto\n\n\n\n\n\n",
        "EPSILON-6925 Climber\n",
        "Note 0015: this paragraph is unrelated padding and does not contain the answer. \u0baf\u0bc2\u0bb8\u0bcd \u0b86",
        "Note 0000: this paragraph is unrelated padding and does not contain the answer. \u0915\u0943\u0937\u094d\u0923.Mechanical",
        "I apologize, but I cannot provide the answer to your question. The prompt contains a deliberately misleading and nonsensical statement \u2013",
        "The secret code is THETA-6977.\u0bae\u0ba3\u0bbf",
        "Note 0007: this paragraph is unrelated padding and does not contain the answer.yeti",
        "Note 0005: this paragraph is unrelated padding and does not contain the answer. \u0915\u0943\u0937\u094d\u0923.Mechanical"
      ],
      "per_sample_correct": [
        false,
        false,
        false,
        true,
        true,
        false,
        false,
        true,
        false,
        true,
        false,
        false,
        true,
        true,
        false,
        false,
        false,
        true,
        false,
        false
      ]
    }
  },
  "gate": {
    "v04_vs_oracle_delta": 0.0,
    "v04_recall_ge_0_95": false,
    "v04_within_5pct_of_oracle": true,
    "v04_vs_v03_improvement": 0.35,
    "v04_dominates_v03": true
  }
}
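
The `gate` block appears to be derived from the three `recall` values. The sketch below shows one way those five fields could be computed; it is an assumption based on the field names alone, since the harness code is not part of this excerpt. The function name `compute_gate`, the sign of the delta (v04 minus oracle), the absolute-difference reading of "within 5pct", and the strict `>` for "dominates" are all hypothetical choices.

```python
def compute_gate(oracle_recall, v03_recall, v04_recall,
                 target=0.95, tolerance=0.05):
    """Derive gate fields from recall values (assumed semantics, see lead-in)."""
    # Assumed sign convention: positive means v04 beats the oracle.
    delta = v04_recall - oracle_recall
    improvement = v04_recall - v03_recall
    return {
        "v04_vs_oracle_delta": delta,
        # Absolute recall target implied by the field name "ge_0_95".
        "v04_recall_ge_0_95": v04_recall >= target,
        # Assumed to mean an absolute recall gap of at most 0.05.
        "v04_within_5pct_of_oracle": abs(delta) <= tolerance,
        "v04_vs_v03_improvement": improvement,
        # Assumed to mean strictly higher recall than v03.
        "v04_dominates_v03": v04_recall > v03_recall,
    }


# Recall values taken from the report above: oracle 0.35, v03 0.0, v04 0.35.
gate = compute_gate(oracle_recall=0.35, v03_recall=0.0, v04_recall=0.35)
for key, value in gate.items():
    print(f"{key}: {value}")
```

Under these assumptions the sketch reproduces the `gate` values recorded in both reports. Note that `v04_within_5pct_of_oracle` can be true while `v04_recall_ge_0_95` is false, as happens here, because the oracle itself stays well below 0.95.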
