Commit 5118c06

test(live): add opt-in provider eval suites for OpenAI/Ollama/Agent
Empirical-verification pass for the LLM-call metadata work (issue #907). Until now every assertion rested on mocked SSE streams; this adds live provider eval suites that hit real endpoints and verify three load-bearing claims against actual wire payloads:

- OpenAI: `completion_tokens` already includes `reasoning_tokens` (so `output = completion_tokens`, no carve-out)
- OpenAI: `prompt_tokens_details.cached_tokens` populates on ≥1024-token prefix matches and surfaces as `usage.cacheRead`
- Ollama: thinking content has no associated token count; `usage.reasoning` must stay 0 even on a thinking-on Qwen3 turn

The reasoning-subset claim is the one that drove removing `+ reasoningTokens` from the OpenAI output extractor; live verification against `gpt-5.4-nano` confirms the wire shape matches our assumption.

Infrastructure:

- `tools/test/load-env.js` walks up to find a workspace `.env`; silent if absent so CI is unaffected
- `tools/test/live.ts` provides `liveDescribe`/`requireEnv`/`suiteLevel` helpers (OpenAI/Ollama/Agent suites use the equivalent inline pattern for full TypeScript inference; the helpers are exported for future use)
- Each suite is gated by `<NAMESPACE>_LIVE_SUITE=smoke|extended`; runner scripts set `*_LIVE_READY=1`, which un-ignores the live test files and disables the global `fetch` mock in jest setup
- New root scripts: `test:live:openai{,:smoke,:extended}`, `test:live:agent{,:smoke,:extended}`
- Ollama smoke runner unchanged; `ollama.live.test.ts` extended with a new `Ollama live token-usage audit` describe block (4 tests)
- `.gitignore` updated to cover `.env`/`.env.local` (secrets-leak gap)

Suites are excluded from default `pnpm test` via `testPathIgnorePatterns`, require explicit env vars, and never run in CI. See `LLM_METADATA_DECISIONS.md` #19 for the rationale and #20 for the adapter→builtin→override precedence bug the live tests uncovered.

Verified locally:

- `pnpm test:live:openai:extended`: 5/5 pass
- `pnpm test:live:ollama:extended`: 12/12 pass
- `pnpm test:live:agent:extended`: 3/3 pass
- `pnpm test` (default): 109/109 pass with no live env vars set
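The two OpenAI claims in the commit message can be illustrated with a minimal sketch. Only the wire field names (`completion_tokens`, `reasoning_tokens`, `prompt_tokens_details.cached_tokens`) and the `usage.cacheRead` / `usage.reasoning` targets come from the commit message; the `OpenAIUsagePayload` and `Usage` types and the `extractUsage` function are hypothetical names for illustration, and mapping `input` straight to `prompt_tokens` is an assumption. This is not the adapter's actual extractor.

```typescript
// Hypothetical payload shape; only the field names are taken from the commit message.
interface OpenAIUsagePayload {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
  completion_tokens_details?: { reasoning_tokens?: number };
}

// Hypothetical normalized shape, reduced to the fields discussed here.
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
}

function extractUsage(payload: OpenAIUsagePayload): Usage {
  const reasoning = payload.completion_tokens_details?.reasoning_tokens ?? 0;
  return {
    input: payload.prompt_tokens, // assumption: no cache carve-out on input
    // completion_tokens already includes reasoning_tokens,
    // so adding `reasoning` here would double count.
    output: payload.completion_tokens,
    reasoning,
    cacheRead: payload.prompt_tokens_details?.cached_tokens ?? 0,
  };
}

// Example: 50 completion tokens, 30 of which were reasoning tokens.
const sampleUsage = extractUsage({
  prompt_tokens: 1200,
  completion_tokens: 50,
  prompt_tokens_details: { cached_tokens: 1024 },
  completion_tokens_details: { reasoning_tokens: 30 },
});
console.log(sampleUsage.output); // 50, not 80
```

Under this reading, `reasoning` is reported as a subset of `output` rather than an addition to it, which is the shape the live suite is described as confirming.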
1 parent ff81bc8 commit 5118c06

18 files changed

Lines changed: 6204 additions & 9730 deletions

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -4,4 +4,8 @@
 **/yarn-error.log
 lerna-debug.log
 **/src/*.js
-**/src/*.d.ts
+**/src/*.d.ts
+.env
+.env.local
+**/.env
+**/.env.local

LLM_METADATA_DECISIONS.md

Lines changed: 37 additions & 0 deletions
@@ -141,3 +141,40 @@ Tracking issue: [constructive-planning #907](https://github.com/constructive-io/
 unconditionally for every new request, including decision-resume
 requests via `respondWithDecision`. Mirrors the agent-side rule from
 decision #6 (reset on each new request, not on `continue()`).
+
+19. **Live provider eval suites are opt-in, `.env`-loaded, excluded from
+default `pnpm test` via `testPathIgnorePatterns`, and never run in CI.**
+Three suites land: `packages/openai/__tests__/openai.live.test.ts`,
+`packages/ollama/__tests__/ollama.live.test.ts` (extended with a new
+`Ollama live token-usage audit` block), and
+`packages/agent/__tests__/agent.live.test.ts`. Each suite is gated by
+`<NAMESPACE>_LIVE_SUITE=smoke|extended` (e.g. `OPENAI_LIVE_SUITE`); the
+`pnpm test:live:<provider>{,:smoke,:extended}` runners set
+`*_LIVE_READY=1` which both un-ignores the file in Jest config and
+disables the `global.fetch = jest.fn()` mock in `openai/jest.setup.js`.
+A shared `tools/test/load-env.js` walks up to find a workspace `.env`
+and is silent if absent, so CI is unaffected. Why: empirical wire-shape
+verification is the only way to confirm load-bearing claims like
+"`completion_tokens` already includes `reasoning_tokens`", but live
+suites are expensive (real tokens) and require secrets, so they must
+stay out of the default loop. How to apply: when changing usage
+extraction, header construction, or any wire-shape detail, run the
+matching `pnpm test:live:*:extended` locally before merging. The
+`.gitignore` was updated to cover `.env` / `.env.local` to close a
+secrets-leak gap.
+
+20. **Adapter-default `compat` must be the base layer of `createModel`'s
+merge, not the override layer.** The original spread order was
+`{ ...builtIn.compat, ...this.compat, ...overrides.compat }`, which
+silently clobbered model-specific settings (notably
+`maxTokensField: 'max_completion_tokens'` for reasoning-capable models)
+with the adapter's generic default (`'max_tokens'`). OpenAI returned
+400 (`Unsupported parameter: 'max_tokens'`) for `gpt-5.4-nano`. The
+mock-mode unit tests didn't catch it because the mocked `fetch` never
+validated the body. The live smoke test caught it on the very first
+real call. Why: model-specific knowledge in the built-in catalog is
+more authoritative than weak adapter defaults; user-provided overrides
+are most authoritative of all. How to apply: spread order is now
+`{ ...this.compat, ...builtIn.compat, ...overrides.compat }`; same
+rule for `headers`. Same precedence rule should be applied any time a
+new merge of compat-like fields is introduced.
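The precedence bug in decision #20 comes down to object-spread order, where later spreads win. The sketch below reproduces both orders side by side. The `Compat` type is a hypothetical, reduced stand-in (only `maxTokensField` and its two values come from the decision text), and the two function names are illustrative, not the adapter's real methods.

```typescript
// Hypothetical, reduced compat shape for illustration.
interface Compat {
  maxTokensField?: 'max_tokens' | 'max_completion_tokens';
}

// Original order: the adapter default is spread after the built-in catalog entry,
// so the generic default overwrites the model-specific value.
function mergeCompatBefore(adapterDefault: Compat, builtIn: Compat, overrides: Compat): Compat {
  return { ...builtIn, ...adapterDefault, ...overrides };
}

// Fixed order: adapter default is the base layer, the built-in catalog refines it,
// and user overrides win over both.
function mergeCompatAfter(adapterDefault: Compat, builtIn: Compat, overrides: Compat): Compat {
  return { ...adapterDefault, ...builtIn, ...overrides };
}

const adapterDefault: Compat = { maxTokensField: 'max_tokens' };
const builtInReasoningModel: Compat = { maxTokensField: 'max_completion_tokens' };

console.log(mergeCompatBefore(adapterDefault, builtInReasoningModel, {}).maxTokensField);
// max_tokens (the clobbered value that led to the 400)
console.log(mergeCompatAfter(adapterDefault, builtInReasoningModel, {}).maxTokensField);
// max_completion_tokens
```

One caveat with spread-based merging in general: a key that is present with an explicit `undefined` value still overwrites earlier layers, so override objects should omit keys they do not intend to set.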

package.json

Lines changed: 7 additions & 0 deletions
@@ -22,6 +22,12 @@
     "typecheck": "node ./scripts/typecheck.js",
     "test:live:ollama": "pnpm --filter @agentic-kit/ollama run test:live:smoke",
     "test:live:ollama:extended": "pnpm --filter @agentic-kit/ollama run test:live:extended",
+    "test:live:openai": "pnpm --filter @agentic-kit/openai run test:live:smoke",
+    "test:live:openai:smoke": "pnpm --filter @agentic-kit/openai run test:live:smoke",
+    "test:live:openai:extended": "pnpm --filter @agentic-kit/openai run test:live:extended",
+    "test:live:agent": "pnpm --filter @agentic-kit/agent run test:live:smoke",
+    "test:live:agent:smoke": "pnpm --filter @agentic-kit/agent run test:live:smoke",
+    "test:live:agent:extended": "pnpm --filter @agentic-kit/agent run test:live:extended",
     "lint": "pnpm -r run lint",
     "internal:deps": "makage update-workspace",
     "deps": "pnpm up -r -i -L"
@@ -32,6 +38,7 @@
     "@types/node": "^20.12.7",
     "@typescript-eslint/eslint-plugin": "^8.58.2",
     "@typescript-eslint/parser": "^8.58.2",
+    "dotenv": "^16.4.5",
     "eslint": "^9.39.2",
     "eslint-config-prettier": "^10.1.8",
     "eslint-plugin-simple-import-sort": "^12.1.0",
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
+import { OpenAIAdapter } from '@agentic-kit/openai';
+import { createUserMessage, type AssistantMessage } from 'agentic-kit';
+
+import { Agent } from '../src';
+
+const modelId = process.env.OPENAI_LIVE_MODEL ?? 'gpt-5.4-nano';
+const apiKey = process.env.OPENAI_API_KEY;
+
+if (!apiKey) {
+  throw new Error('Missing required env var: OPENAI_API_KEY');
+}
+
+const liveSuite = process.env.AGENT_LIVE_SUITE ?? 'smoke';
+const runSmoke = liveSuite === 'smoke' || liveSuite === 'extended';
+const runExtended = liveSuite === 'extended';
+const describeSmoke = runSmoke ? describe : describe.skip;
+const describeExtended = runExtended ? describe : describe.skip;
+
+describeSmoke('Agent live smoke', () => {
+  jest.setTimeout(60_000);
+
+  it('single turn populates state.totalUsage from the assistant message', async () => {
+    const adapter = new OpenAIAdapter({ apiKey });
+    const model = adapter.createModel(modelId);
+    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });
+
+    await agent.prompt('Reply with the single word PONG.');
+
+    expect(agent.state.totalUsage.input).toBeGreaterThan(0);
+    expect(agent.state.totalUsage.output).toBeGreaterThan(0);
+    expect(agent.state.totalUsage.totalTokens).toBeGreaterThan(0);
+    expect(agent.state.totalUsage.cost.total).toBeGreaterThan(0);
+
+    const lastAssistant = agent.state.messages
+      .filter((m): m is AssistantMessage => m.role === 'assistant')
+      .at(-1)!;
+
+    // Single turn: the per-message usage IS the cumulative total.
+    expect(agent.state.totalUsage.input).toBe(lastAssistant.usage.input);
+    expect(agent.state.totalUsage.output).toBe(lastAssistant.usage.output);
+    expect(agent.state.totalUsage.reasoning).toBe(lastAssistant.usage.reasoning);
+    expect(agent.state.totalUsage.cacheRead).toBe(lastAssistant.usage.cacheRead);
+    expect(agent.state.totalUsage.cacheWrite).toBe(lastAssistant.usage.cacheWrite);
+    expect(agent.state.totalUsage.totalTokens).toBe(lastAssistant.usage.totalTokens);
+  });
+});
+
+describeExtended('Agent live extended', () => {
+  jest.setTimeout(120_000);
+
+  it('state.totalUsage equals field-wise sum across two turns', async () => {
+    const adapter = new OpenAIAdapter({ apiKey });
+    const model = adapter.createModel(modelId);
+    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });
+
+    await agent.prompt('What is 2 + 2? Reply with just the number.');
+
+    const t1Usage = {
+      ...agent.state.totalUsage,
+      cost: { ...agent.state.totalUsage.cost },
+    };
+
+    // continue() does not accept text; append the follow-up user message first.
+    agent.appendMessage(createUserMessage('Now what is that doubled? Reply with just the number.'));
+    await agent.continue();
+
+    const lastAssistant = agent.state.messages
+      .filter((m): m is AssistantMessage => m.role === 'assistant')
+      .at(-1)!;
+
+    expect(agent.state.totalUsage.input).toBe(t1Usage.input + lastAssistant.usage.input);
+    expect(agent.state.totalUsage.output).toBe(t1Usage.output + lastAssistant.usage.output);
+    expect(agent.state.totalUsage.reasoning).toBe(t1Usage.reasoning + lastAssistant.usage.reasoning);
+    expect(agent.state.totalUsage.cacheRead).toBe(t1Usage.cacheRead + lastAssistant.usage.cacheRead);
+    expect(agent.state.totalUsage.cacheWrite).toBe(t1Usage.cacheWrite + lastAssistant.usage.cacheWrite);
+    expect(agent.state.totalUsage.totalTokens).toBe(t1Usage.totalTokens + lastAssistant.usage.totalTokens);
+    expect(agent.state.totalUsage.cost.input).toBeCloseTo(
+      t1Usage.cost.input + lastAssistant.usage.cost.input,
+      10
+    );
+    expect(agent.state.totalUsage.cost.output).toBeCloseTo(
+      t1Usage.cost.output + lastAssistant.usage.cost.output,
+      10
+    );
+    expect(agent.state.totalUsage.cost.total).toBeCloseTo(
+      t1Usage.cost.total + lastAssistant.usage.cost.total,
+      10
+    );
+  });
+
+  it('prompt() resets totalUsage; continue() preserves it', async () => {
+    const adapter = new OpenAIAdapter({ apiKey });
+    const model = adapter.createModel(modelId);
+    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });
+
+    await agent.prompt('Reply with the single word A.');
+    const firstTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };
+
+    agent.appendMessage(createUserMessage('Reply with the single word B.'));
+    await agent.continue();
+    const secondTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };
+
+    // continue() must not reset; totals should have grown.
+    expect(secondTotals.input).toBeGreaterThanOrEqual(firstTotals.input);
+    expect(secondTotals.totalTokens).toBeGreaterThanOrEqual(firstTotals.totalTokens);
+    expect(agent.state.totalUsage.input).toBeGreaterThanOrEqual(firstTotals.input);
+
+    await agent.prompt('Reply with the single word C.');
+
+    const thirdAssistant = agent.state.messages
+      .filter((m): m is AssistantMessage => m.role === 'assistant')
+      .at(-1)!;
+
+    // prompt() resets: the new total should be one turn's worth, not cumulative
+    // across all three. We use < rather than === because token counts vary and
+    // we cannot pin the exact value, only that it did not carry over the prior
+    // two turns' worth of input tokens.
+    expect(agent.state.totalUsage.input).toBeLessThan(secondTotals.input + 100);
+    expect(agent.state.totalUsage.totalTokens).toBe(thirdAssistant.usage.totalTokens);
+    expect(agent.state.totalUsage.input).toBe(thirdAssistant.usage.input);
+    expect(agent.state.totalUsage.output).toBe(thirdAssistant.usage.output);
+  });
+});
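The behaviour these tests pin down (field-wise accumulation across `continue()`, reset on each new `prompt()`) can be sketched in isolation without any network calls. The `Usage` field names mirror the ones asserted in the test file; the `UsageTracker` class, `emptyUsage`, and `addUsage` are hypothetical names for illustration, and the real `Agent` may implement this differently.

```typescript
// Field names mirror those asserted in the live test; the shape itself is an assumption.
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: { input: number; output: number; total: number };
}

const emptyUsage = (): Usage => ({
  input: 0,
  output: 0,
  reasoning: 0,
  cacheRead: 0,
  cacheWrite: 0,
  totalTokens: 0,
  cost: { input: 0, output: 0, total: 0 },
});

// Field-wise sum that returns a new object, so earlier snapshots stay untouched.
function addUsage(a: Usage, b: Usage): Usage {
  return {
    input: a.input + b.input,
    output: a.output + b.output,
    reasoning: a.reasoning + b.reasoning,
    cacheRead: a.cacheRead + b.cacheRead,
    cacheWrite: a.cacheWrite + b.cacheWrite,
    totalTokens: a.totalTokens + b.totalTokens,
    cost: {
      input: a.cost.input + b.cost.input,
      output: a.cost.output + b.cost.output,
      total: a.cost.total + b.cost.total,
    },
  };
}

// Hypothetical tracker: startRequest() models prompt(), record() models each assistant turn.
class UsageTracker {
  total: Usage = emptyUsage();

  startRequest(): void {
    this.total = emptyUsage(); // reset on each new request
  }

  record(turn: Usage): void {
    this.total = addUsage(this.total, turn); // continue() keeps accumulating
  }
}
```

Because `addUsage` builds a fresh object on every call, a snapshot such as `t1Usage` in the test above would not be mutated by later turns even without the defensive copy of `cost`; the test copies anyway because it cannot assume how the agent updates its state.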

packages/agent/jest.config.js

Lines changed: 2 additions & 0 deletions
@@ -15,10 +15,12 @@ module.exports = {
   testRegex: '(/__tests__/.*|(\\.|/)(test|spec))\\.(jsx?|tsx?)$',
   moduleFileExtensions: ['ts', 'tsx', 'js', 'jsx', 'json', 'node'],
   modulePathIgnorePatterns: ['dist/*'],
+  testPathIgnorePatterns: process.env.AGENT_LIVE_READY === '1' ? [] : ['\\.live\\.test\\.ts$'],
   moduleNameMapper: {
     '^(\\.{1,2}/.*)\\.js$': '$1',
     '^@test/(.*)$': '<rootDir>/../../tools/test/$1',
     '^agentic-kit$': '<rootDir>/../agentic-kit/src',
     '^@agentic-kit/(.*)$': '<rootDir>/../$1/src',
   },
+  setupFiles: ['<rootDir>/../../tools/test/load-env.js'],
 };
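The `testPathIgnorePatterns` line above is the gate that keeps live files out of the default run. Jest treats each entry as a regular-expression string matched against test paths, so the effect can be sketched with plain `RegExp` checks. The helper names `liveIgnorePatterns` and `isIgnored` are hypothetical and exist only for this illustration; the pattern string and the `AGENT_LIVE_READY` variable come from the config.

```typescript
// Mirrors the ternary in jest.config.js: ignore nothing when the runner has
// set <NAMESPACE>_LIVE_READY=1, otherwise ignore every *.live.test.ts file.
function liveIgnorePatterns(
  env: Record<string, string | undefined>,
  readyVar: string
): string[] {
  return env[readyVar] === '1' ? [] : ['\\.live\\.test\\.ts$'];
}

// Approximates how a pattern list filters a candidate test path.
function isIgnored(testPath: string, patterns: string[]): boolean {
  return patterns.some((pattern) => new RegExp(pattern).test(testPath));
}

const defaultRun = liveIgnorePatterns({}, 'AGENT_LIVE_READY');
const liveRun = liveIgnorePatterns({ AGENT_LIVE_READY: '1' }, 'AGENT_LIVE_READY');

console.log(isIgnored('__tests__/agent.live.test.ts', defaultRun)); // true
console.log(isIgnored('__tests__/agent.live.test.ts', liveRun)); // false
console.log(isIgnored('__tests__/agent.test.ts', defaultRun)); // false
```

Since the variable is compared to the exact string `'1'`, values such as `true` or `yes` leave the live files ignored, which keeps accidental activation unlikely.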

packages/agent/package.json

Lines changed: 7 additions & 1 deletion
@@ -34,10 +34,16 @@
     "build:dev": "makage build --dev",
     "lint": "eslint . --fix",
     "test": "jest",
-    "test:watch": "jest --watch"
+    "test:watch": "jest --watch",
+    "test:live": "node ./scripts/run-live-tests.js smoke",
+    "test:live:smoke": "node ./scripts/run-live-tests.js smoke",
+    "test:live:extended": "node ./scripts/run-live-tests.js extended"
   },
   "dependencies": {
     "agentic-kit": "workspace:*"
   },
+  "devDependencies": {
+    "@agentic-kit/openai": "workspace:*"
+  },
   "keywords": []
 }
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+#!/usr/bin/env node
+
+const { spawnSync } = require('node:child_process');
+const { existsSync } = require('node:fs');
+const { dirname, join } = require('node:path');
+
+function findEnvFile(start) {
+  let dir = start;
+  while (true) {
+    const candidate = join(dir, '.env');
+    if (existsSync(candidate)) return candidate;
+    if (existsSync(join(dir, 'pnpm-workspace.yaml'))) return null;
+    const parent = dirname(dir);
+    if (parent === dir) return null;
+    dir = parent;
+  }
+}
+
+const envPath = findEnvFile(__dirname);
+if (envPath) {
+  require('dotenv').config({ path: envPath });
+}
+
+const requestedSuite = process.argv[2] || process.env.AGENT_LIVE_SUITE || 'smoke';
+const validSuites = new Set(['smoke', 'extended']);
+
+if (!validSuites.has(requestedSuite)) {
+  console.error(
+    `[agent-live] invalid suite '${requestedSuite}'. Use one of: ${Array.from(validSuites).join(', ')}`
+  );
+  process.exit(1);
+}
+
+if (!process.env.OPENAI_API_KEY) {
+  console.log('[agent-live] skipping live tests: OPENAI_API_KEY is not set');
+  process.exit(0);
+}
+
+console.log(`[agent-live] running ${requestedSuite} live tests against the OpenAI API`);
+
+const pnpmCommand = process.platform === 'win32' ? 'pnpm.cmd' : 'pnpm';
+const result = spawnSync(
+  pnpmCommand,
+  ['exec', 'jest', '--runInBand', '--runTestsByPath', '__tests__/agent.live.test.ts', '--verbose', '--forceExit'],
+  {
+    stdio: 'inherit',
+    env: {
+      ...process.env,
+      AGENT_LIVE_READY: '1',
+      AGENT_LIVE_SUITE: requestedSuite,
+    },
+  }
+);
+
+if (result.error) {
+  throw result.error;
+}
+
+process.exit(result.status ?? 1);
