Skip to content

Commit 144653d

Browse files
Brooooooklynclaude
andauthored
fix: make serve runnable (mlx-node#51)
<!-- CURSOR_SUMMARY --> > [!NOTE] > **Medium Risk** > Touches core memory/cache controls, session reuse semantics, and request-lifecycle bookkeeping; regressions could impact cache safety or reuse correctness, though changes are heavily test-covered. > > **Overview** > Makes `@mlx-node/server` runnable and safer under partial native upgrades by **moving Metal cache draining to a gated `@mlx-node/core.__internal__.clearCache` export** and resolving it lazily (warn-once + no-op fallback when missing). > > Adds a new Rust `cache_limit` module that coordinates a process-wide MLX allocator cache limit across concurrently loaded models (weight-bytes based) and exposes read-only `memoryStats`; updates Gemma4 internals with an additional quantized router-projection setter. > > Extends the server session/cache behavior around tier-2 `prompt_cache_key` reuse (HMAC-scoped keys, opt-out, min-length gate, cached-token observability via `X-Cached-Tokens`/SSE usage fields, and stricter precedence with `previous_response_id`), plus an `IdleSweeper` suspend API (`withSuspendedDrains`/`suspendDrains`) to prevent drains during hot model loads and to correctly decrement in-flight counts on abort paths. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a7c7a3e. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 268d046 commit 144653d

60 files changed

Lines changed: 12618 additions & 1093 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

Cargo.toml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -16,14 +16,14 @@ libc = "0.2"
1616
mimalloc-safe = { version = "0.1" }
1717
parquet = { version = "58", features = ["arrow"] }
1818
rand = "0.10"
19-
serde_json = "1"
19+
serde_json = { version = "1", features = ["preserve_order"] }
2020
serde = { version = "1", features = ["derive"] }
2121
thiserror = "2"
2222
tokio = { version = "1.48", features = ["rt-multi-thread", "macros", "sync"] }
2323
tracing = "0.1"
2424
tracing-subscriber = { version = "0.3.19", default-features = false, features = ["std", "fmt", "json"] }
2525
tokenizers = "0.22"
26-
minijinja = { version = "2.5", features = ["json"] }
26+
minijinja = { version = "2.5", features = ["json", "preserve_order"] }
2727
uuid = { version = "1.11", features = ["v4"] }
2828
regex = "1.11"
2929
sqlx = { version = "0.8", features = ["runtime-tokio", "sqlite"] }

__test__/core/core-exports.test.ts

Lines changed: 65 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,65 @@
1+
/**
2+
* Export-gate assertions for `@mlx-node/core`.
3+
*
4+
* Round-5 Codex follow-up: the Metal cache-pool drain function
5+
* (`clear_cache` in `crates/mlx-core/src/cache_limit.rs`) is annotated
6+
* `#[napi(namespace = "__internal__")]` on purpose — the drain is a
7+
* process-wide `mlx_synchronize` routed through the default stream,
8+
* which does NOT wait on the custom generation streams that per-model
9+
* threads run on. Calling it while a decode is in flight risks racing
10+
* live Metal command buffers.
11+
*
12+
* The namespace prefix is a deliberate speed-bump: the ONLY safe
13+
* caller today is `@mlx-node/server`'s idle sweeper (which only
14+
* triggers after the in-flight counter returns to zero). Every other
15+
* caller has to opt in by acknowledging the `__internal__` prefix and
16+
* reading the `@internal` caveat.
17+
*
18+
* These tests guard that gate:
19+
*
20+
* 1. The root module must NOT expose `clearCache` directly — a
21+
* regression where someone drops the `namespace = "__internal__"`
22+
* attribute would silently surface the footgun on the public API.
23+
* 2. The namespace'd form must BE present and callable — a broken
24+
* native build (e.g. a `#[napi]` attribute error that drops the
25+
* export) would leave `@mlx-node/server`'s sweeper with no drain
26+
* path.
27+
* 3. `memoryStats` stays exposed on the root as a cheap smoke test
28+
* that the broader binding is loading at all — if the native
29+
* addon is mis-linked both this AND the `__internal__` probe
30+
* fail, which localizes the problem.
31+
*/
32+
33+
import { describe, expect, it } from 'vite-plus/test';
34+
35+
// Intentionally read through `require` as well as ESM `import` so the
36+
// test catches both consumption patterns. The `vite.config.ts` alias
37+
// points `@mlx-node/core` at `packages/core/index.cjs`, so both
38+
// resolutions land on the same module instance in the test runtime.
39+
// eslint-disable-next-line @typescript-eslint/no-require-imports
40+
const coreRequire: Record<string, unknown> = require('@mlx-node/core');
41+
42+
describe('@mlx-node/core public export surface', () => {
43+
it('does NOT expose `clearCache` on the root namespace', () => {
44+
// A regression here means someone dropped the
45+
// `#[napi(namespace = "__internal__")]` attribute from
46+
// `clear_cache` in `crates/mlx-core/src/cache_limit.rs`. That
47+
// would make the process-wide Metal drain callable from user
48+
// code without the `__internal__.` speed-bump — exactly the
49+
// footgun the gate was designed to prevent.
50+
expect(coreRequire.clearCache).toBeUndefined();
51+
});
52+
53+
it('exposes `__internal__.clearCache` as a callable function', () => {
54+
const internal = coreRequire.__internal__ as { clearCache?: unknown } | undefined;
55+
expect(internal).toBeDefined();
56+
expect(typeof internal?.clearCache).toBe('function');
57+
});
58+
59+
it('exposes `memoryStats` on the root namespace as a callable function', () => {
60+
// Acts as a "did the binding load at all" smoke test alongside
61+
// the `__internal__` check above — a broken build drops BOTH,
62+
// which is easier to diagnose than one failing in isolation.
63+
expect(typeof coreRequire.memoryStats).toBe('function');
64+
});
65+
});

__test__/models/chat-session.test.ts

Lines changed: 82 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,10 +20,11 @@
2020
* export is added in T5.
2121
*/
2222
import type { ChatConfig, ChatMessage, ChatResult } from '@mlx-node/core';
23+
import { ChatSession, type SessionCapableModel } from '@mlx-node/lm';
24+
import type { ChatStreamEvent, ChatStreamFinal } from '@mlx-node/lm';
2325
import { describe, expect, it, vi } from 'vite-plus/test';
2426

25-
import { ChatSession, type SessionCapableModel } from '../../packages/lm/src/chat-session.js';
26-
import type { ChatStreamEvent, ChatStreamFinal } from '../../packages/lm/src/stream.js';
27+
import { resetPreservingNativeCacheForWarmReuse } from '../../packages/server/src/chat-session-warm-reuse.js';
2728

2829
/** Build a minimal `ChatResult` sufficient for the session layer. */
2930
function makeChatResult(text: string): ChatResult {
@@ -76,6 +77,7 @@ function finalChunk(text: string, finishReason: string = 'stop'): ChatStreamFina
7677
promptTokens: 1,
7778
reasoningTokens: 0,
7879
rawText: text,
80+
cachedTokens: 0,
7981
} satisfies ChatStreamFinal;
8082
}
8183

@@ -530,6 +532,84 @@ describe('ChatSession', () => {
530532
expect(messages).toEqual([{ role: 'user', content: 'fresh' }]);
531533
});
532534

535+
it('public reset() always full-wipes — the keepNativeCache option is not accepted (Round 5 Fix #1)', async () => {
536+
// Public contract: `ChatSession.reset()` is always a full wipe.
537+
// Round 4 accidentally exposed a `{ keepNativeCache: true }`
538+
// option that downstream consumers could call in a context
539+
// where the shared native model still held an unrelated
540+
// request's cache, reintroducing the cross-request cache-
541+
// affinity leak that Round 3 closed. Round 5 removed the
542+
// option from the public surface — the preserved-cache path
543+
// is now behind the helper
544+
// `resetPreservingNativeCacheForWarmReuse(session)`, which
545+
// lives inside `@mlx-node/server` and is called exclusively by
546+
// `SessionRegistry`-gated server endpoints. (Round 6 Fix #1
547+
// refactored this from a class method into a module-level
548+
// function; Round 7 Fix #2 relocated the module into the
549+
// server package itself so there is no `@mlx-node/lm` export
550+
// surface the helper could leak through.)
551+
const { model, resetCaches } = makeMockModel();
552+
const session = new ChatSession(model);
553+
554+
await session.send('one');
555+
expect(session.turns).toBe(1);
556+
557+
// No-argument form always wipes.
558+
await session.reset();
559+
expect(resetCaches).toHaveBeenCalledTimes(1);
560+
expect(session.turns).toBe(0);
561+
});
562+
563+
it('resetPreservingNativeCacheForWarmReuse() wipes JS state only — no resetCaches call (Round 5 Fix #1 / Round 6 Fix #1 internal helper)', async () => {
564+
// The module-level helper that replaced Round 4's
565+
// `reset({ keepNativeCache: true })` public option, then Round 5's
566+
// `_resetPreservingNativeCacheForWarmReuse()` class method. Used
567+
// only by the server-side SessionRegistry warm-replay path on a
568+
// tier-1 / tier-2 HIT, where the registry authoritatively vouches
569+
// for the native cache belonging to this chain. Verify:
570+
// (1) no resetCaches call, (2) turns/history zeroed so
571+
// primeHistory() will accept the session.
572+
const { model, resetCaches } = makeMockModel();
573+
const session = new ChatSession(model);
574+
575+
await session.send('one');
576+
await session.send('two');
577+
expect(session.turns).toBe(2);
578+
579+
await resetPreservingNativeCacheForWarmReuse(session);
580+
expect(resetCaches).not.toHaveBeenCalled();
581+
expect(session.turns).toBe(0);
582+
expect(session.hasImages).toBe(false);
583+
});
584+
585+
it('resetPreservingNativeCacheForWarmReuse is NOT exported from @mlx-node/lm public surface (Round 6 Fix #1 / Round 7 Fix #2)', async () => {
586+
// Structural guard: the helper lives inside `@mlx-node/server`
587+
// (server-private module) and MUST NOT appear in either the
588+
// `@mlx-node/lm` main export surface or the `@mlx-node/server`
589+
// main export surface. Round 6 Fix #1 enforced the lm-public
590+
// absence; Round 7 Fix #2 deleted the `@mlx-node/lm/internal`
591+
// subpath export entirely and relocated the helper into the
592+
// server package so the only reachable call site is
593+
// `endpoints/responses.ts`, which already holds the
594+
// `SessionRegistry` HIT gate. Downstream consumers doing a plain
595+
// `import { ... } from '@mlx-node/lm'` or `from '@mlx-node/server'`
596+
// must therefore not be able to discover or invoke the helper.
597+
const lmPublicModule = (await import('../../packages/lm/src/index.js')) as Record<string, unknown>;
598+
expect('resetPreservingNativeCacheForWarmReuse' in lmPublicModule).toBe(false);
599+
600+
const serverPublicModule = (await import('../../packages/server/src/index.js')) as Record<string, unknown>;
601+
expect('resetPreservingNativeCacheForWarmReuse' in serverPublicModule).toBe(false);
602+
603+
// Sanity-check: the helper is still reachable from the
604+
// server-private module the endpoint code imports. Tests can
605+
// reach it via the relative path; downstream consumers cannot.
606+
const serverPrivateModule = (await import('../../packages/server/src/chat-session-warm-reuse.js')) as Record<
607+
string,
608+
unknown
609+
>;
610+
expect(typeof serverPrivateModule.resetPreservingNativeCacheForWarmReuse).toBe('function');
611+
});
612+
533613
it('rejects reset() while a send() is in flight', async () => {
534614
let resolveFirst: (r: ChatResult) => void = () => {
535615
/* overwritten below */

__test__/models/model-loader-gemma4.test.ts

Lines changed: 163 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
1+
import type { Gemma4Config } from '@mlx-node/core';
12
import { Gemma4Model as Gemma4ModelNative } from '@mlx-node/core';
23
import { Gemma4Model } from '@mlx-node/lm';
3-
import { describe, it, expect } from 'vite-plus/test';
4+
import { describe, expect, it } from 'vite-plus/test';
45

56
/**
67
* Regression guard for the bug where `packages/lm/src/models/model-loader.ts`
@@ -56,3 +57,164 @@ describe('Gemma4Model re-export from @mlx-node/lm', () => {
5657
expect(fn.constructor.name).toBe('AsyncGeneratorFunction');
5758
});
5859
});
60+
61+
/**
62+
* A minimal `Gemma4Config` that satisfies the NAPI-derived struct so
63+
* `new Gemma4ModelNative(cfg)` accepts it. Values are NOT meaningful —
64+
* the stub constructor never materializes weights or a tokenizer, so
65+
* only the shape matters. Kept inside the test file rather than a
66+
* shared helper because the fields are documented fully by
67+
* `packages/core/index.d.cts` and any drift is caught by the existing
68+
* typecheck pass on the test suite.
69+
*/
70+
function stubConfig(overrides: Partial<Gemma4Config> = {}): Gemma4Config {
71+
return {
72+
vocabSize: 256,
73+
hiddenSize: 8,
74+
numHiddenLayers: 1,
75+
numAttentionHeads: 1,
76+
numKeyValueHeads: 1,
77+
headDim: 8,
78+
intermediateSize: 16,
79+
rmsNormEps: 1e-6,
80+
tieWordEmbeddings: false,
81+
maxPositionEmbeddings: 128,
82+
slidingWindow: 64,
83+
layerTypes: ['full_attention'],
84+
ropeTheta: 1_000_000,
85+
ropeLocalBaseFreq: 10_000,
86+
partialRotaryFactor: 0.25,
87+
attentionKEqV: false,
88+
perLayerInputEmbeds: false,
89+
padTokenId: 0,
90+
eosTokenIds: [1],
91+
bosTokenId: 2,
92+
attentionBias: false,
93+
useDoubleWideMlp: false,
94+
enableMoeBlock: false,
95+
...overrides,
96+
};
97+
}
98+
99+
/**
100+
* Round-5 Finding B regression coverage.
101+
*
102+
* `new Gemma4Model(config)` was a runnable entry point before the
103+
* cache-limit coordinator work landed. It is now a deliberate
104+
* config-only stub (matches `VLModel::new(config)` /
105+
* `QianfanOCRModel::new(config)`) because a no-op `new(config)` would
106+
* have registered an empty coordinator delta and broken the
107+
* deterministic-weight-bytes baseline.
108+
*
109+
* The Rust side uses an exact error message (see
110+
* `crates/mlx-core/src/models/gemma4/model.rs`:
111+
* `"Model not initialized. Call Gemma4Model.load() first."`) so this
112+
* test asserts on a fragment (`"not initialized"`) to keep the probe
113+
* robust against punctuation tweaks without letting the wrong error
114+
* pass.
115+
*
116+
* We exercise the NATIVE class directly, not the `@mlx-node/lm`
117+
* wrapper, because the wrapper adds async-generator streaming that
118+
* masks the native rejection behind the generator protocol — the
119+
* native surface is the one the handler dispatches on.
120+
*/
121+
describe('Gemma4Model(config) stub (round-5 Finding B)', () => {
122+
it('returns an object with isInitialized=false and a numeric modelId', () => {
123+
const stub = new Gemma4ModelNative(stubConfig());
124+
expect(stub.isInitialized).toBe(false);
125+
expect(typeof stub.modelId()).toBe('number');
126+
});
127+
128+
it('rejects chatSessionStart with a "not initialized" error', async () => {
129+
const stub = new Gemma4ModelNative(stubConfig());
130+
await expect(stub.chatSessionStart([{ role: 'user', content: 'hi' }])).rejects.toThrow(/not initialized/i);
131+
});
132+
133+
it('rejects chatSessionContinue with a "not initialized" error', async () => {
134+
const stub = new Gemma4ModelNative(stubConfig());
135+
await expect(stub.chatSessionContinue('hi', null, null)).rejects.toThrow(/not initialized/i);
136+
});
137+
138+
it('rejects chatSessionContinueTool with a "not initialized" error', async () => {
139+
const stub = new Gemma4ModelNative(stubConfig());
140+
await expect(stub.chatSessionContinueTool('tool_123', '{"ok":true}')).rejects.toThrow(/not initialized/i);
141+
});
142+
143+
it('rejects chatStreamSessionStart with a "not initialized" error', async () => {
144+
const stub = new Gemma4ModelNative(stubConfig());
145+
// Streaming methods still resolve on the NAPI boundary — they
146+
// hand back a `ChatStreamHandle` — but the precondition check
147+
// runs BEFORE the callback is ever invoked, so the promise must
148+
// reject synchronously with the not-initialized message.
149+
await expect(stub.chatStreamSessionStart([{ role: 'user', content: 'hi' }], null, () => {})).rejects.toThrow(
150+
/not initialized/i,
151+
);
152+
});
153+
154+
it('rejects chatStreamSessionContinue with a "not initialized" error', async () => {
155+
const stub = new Gemma4ModelNative(stubConfig());
156+
await expect(stub.chatStreamSessionContinue('hi', null, null, () => {})).rejects.toThrow(/not initialized/i);
157+
});
158+
159+
it('rejects chatStreamSessionContinueTool with a "not initialized" error', async () => {
160+
const stub = new Gemma4ModelNative(stubConfig());
161+
await expect(stub.chatStreamSessionContinueTool('tool_123', '{"ok":true}', null, () => {})).rejects.toThrow(
162+
/not initialized/i,
163+
);
164+
});
165+
166+
it('resetCaches is a silent no-op on the stub', () => {
167+
const stub = new Gemma4ModelNative(stubConfig());
168+
// Matches the documented contract on the Rust impl: uninitialized
169+
// stub returns Ok(()) so `ChatSession.reset()` is idempotent
170+
// across stub + loaded instances.
171+
expect(() => stub.resetCaches()).not.toThrow();
172+
});
173+
});
174+
175+
/**
176+
* Round-5 Finding B also asks for a positive-path assertion that
177+
* `Gemma4Model.load(validPath)` returns a runnable model. Real weights
178+
* are not available in CI — they are multi-gigabyte HuggingFace
179+
* downloads — so we assert the SHAPE of the class instead: `load` is a
180+
* static async function, the stub produced by `new(config)` is an
181+
* instance of the same class a `load()` call would return, and the
182+
* runnable surface (`chatSessionStart` et al) is present on the
183+
* prototype so a loaded instance would dispatch correctly. A
184+
* full-weight end-to-end load is covered by the integration runs in
185+
* `examples/` when real weights are available locally.
186+
*/
187+
describe('Gemma4Model.load() shape (round-5 Finding B)', () => {
188+
it('exposes load as a static promise-returning function on the class', () => {
189+
expect(typeof Gemma4ModelNative.load).toBe('function');
190+
// NAPI-RS emits a plain function whose body dispatches to a
191+
// native tokio task and returns a thenable — it is NOT a native
192+
// JS `async function` (constructor name would be
193+
// `AsyncFunction`), so we verify the return shape instead. A
194+
// refactor that accidentally made `load()` sync would return an
195+
// instance of `Gemma4ModelNative`, not a thenable, and this
196+
// probe would catch it. We pass an intentionally-invalid path so
197+
// no real disk I/O happens — the returned promise rejects, and
198+
// we only care about the `then` shape on the returned value.
199+
const ret = Gemma4ModelNative.load('/dev/null/__does_not_exist__');
200+
expect(ret).toBeDefined();
201+
expect(typeof (ret as { then?: unknown }).then).toBe('function');
202+
// Swallow the eventual rejection so vitest does not flag an
203+
// unhandled rejection on shutdown.
204+
ret.then(
205+
() => undefined,
206+
() => undefined,
207+
);
208+
});
209+
210+
it('exposes the full session surface on the prototype', () => {
211+
const proto = Gemma4ModelNative.prototype as unknown as Record<string, unknown>;
212+
expect(typeof proto.chatSessionStart).toBe('function');
213+
expect(typeof proto.chatSessionContinue).toBe('function');
214+
expect(typeof proto.chatSessionContinueTool).toBe('function');
215+
expect(typeof proto.chatStreamSessionStart).toBe('function');
216+
expect(typeof proto.chatStreamSessionContinue).toBe('function');
217+
expect(typeof proto.chatStreamSessionContinueTool).toBe('function');
218+
expect(typeof proto.resetCaches).toBe('function');
219+
});
220+
});

0 commit comments

Comments
 (0)