OAI-compat endpoint returns 404 "Invalid session" for Qwen3.5-35B-A3B (MoE) checkpoints #117

@yych42

Summary

A Qwen/Qwen3.5-35B-A3B-Base LoRA checkpoint saved via save_weights_for_sampler(...) cannot be sampled through the OpenAI-compat endpoint: both /oai/api/v1/completions and /oai/api/v1/chat/completions return HTTP 404 {"detail": "Invalid session"}. The same checkpoint samples cleanly through the Python SDK (ServiceClient.create_sampling_client(model_path=...).sample(...)), and a Llama-3.1-70B checkpoint owned by the same org and accessed with the same API key works through the same OAI-compat endpoint. This points to a Tinker-side issue in the OAI-compat router rather than a weights or save-path problem.

Affected checkpoint

  • Path: tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280
  • Base model: Qwen/Qwen3.5-35B-A3B-Base (VL MoE)
  • Created: 2026-05-20
  • LoRA rank: 64
  • Saved via training_client.save_weights_for_sampler(name=...) from the Tinker SDK

Working control checkpoint (same org, same API key, same endpoint)

  • Path: tinker://47f6bf3f-1319-56ec-91c8-4b5f4860f652:train:0/sampler_weights/step_000280
  • Base model: meta-llama/Llama-3.1-70B
  • Created: ~one month earlier than the Qwen checkpoint above
  • Saved via the same code path (save_weights_for_sampler)

Reproduction

The four HTTP calls in steps 1-4 were made from the same machine, within seconds of each other, using the same API key.

1. GET /v1/models — Qwen checkpoint is listed correctly

curl -s "$BASE/v1/models" -H "Authorization: Bearer $TINKER_API_KEY"

Response includes:

{
  "id": "tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280",
  "object": "model",
  "created": 1779276501,
  "owned_by": "tml:organization_user:0fd650fc-****-****-****-cdbde33f1d5b",
  "base_model": "Qwen/Qwen3.5-35B-A3B-Base"
}
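As a quick cross-check of the listing, the sketch below looks up the checkpoint by id and converts its created field to a calendar date. It assumes created is a Unix timestamp in seconds (UTC), as in the OpenAI models API, and that the entries are available as a plain list; find_model and created_date are hypothetical helper names, not part of the Tinker SDK.

```python
from datetime import datetime, timezone

QWEN_PATH = "tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280"

# Entry copied from the /v1/models response above (owned_by omitted).
entry = {
    "id": QWEN_PATH,
    "object": "model",
    "created": 1779276501,
    "base_model": "Qwen/Qwen3.5-35B-A3B-Base",
}


def find_model(models, model_id):
    """Return the listing entry whose id equals model_id, or None if absent."""
    return next((m for m in models if m.get("id") == model_id), None)


def created_date(model):
    """Treat `created` as Unix seconds (assumption) and return the UTC date."""
    stamp = datetime.fromtimestamp(model["created"], tz=timezone.utc)
    return stamp.date().isoformat()


found = find_model([entry], QWEN_PATH)
print(found["base_model"], created_date(found))
# -> Qwen/Qwen3.5-35B-A3B-Base 2026-05-20
```

Under that assumption the timestamp falls on 2026-05-20, which matches the creation date given under "Affected checkpoint".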

2. POST /v1/completions against the Qwen checkpoint → 404

curl -X POST "$BASE/v1/completions" \
  -H "Authorization: Bearer $TINKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280",
    "prompt": "ping",
    "max_tokens": 16,
    "temperature": 0.7
  }'

Response:

{"detail": "Invalid session"}

Status: HTTP 404.
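For scripting the same request, here is a minimal Python sketch using only the standard library. It builds the POST from step 2 without sending it and classifies a status/body pair. The helper names build_completion_request and classify, the category strings, and the example.invalid base URL are hypothetical; the real base URL and key would come from the environment, as with $BASE and $TINKER_API_KEY in the curl calls.

```python
import json
import urllib.request

QWEN_PATH = "tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280"


def build_completion_request(base_url, api_key, model, prompt,
                             max_tokens=16, temperature=0.7):
    """Build (but do not send) the completions POST shown in step 2."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def classify(status, body):
    """Map an HTTP status and response body (str) to a coarse outcome label."""
    if status == 200:
        return "ok"
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        detail = None  # body was not a JSON object
    if status == 404 and detail == "Invalid session":
        return "invalid_session"
    return "other_error"


# Placeholder base URL and key; nothing is sent over the network here.
req = build_completion_request("https://example.invalid/base", "KEY", QWEN_PATH, "ping")
print(req.get_method(), req.full_url)
print(classify(404, '{"detail": "Invalid session"}'))
```

Sending the request would be a matter of passing req to urllib.request.urlopen and catching urllib.error.HTTPError to read the 404 body.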

3. POST /v1/chat/completions against the Qwen checkpoint → 404 (same response)

curl -X POST "$BASE/v1/chat/completions" \
  -H "Authorization: Bearer $TINKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280",
    "messages": [{"role":"user","content":"ping"}],
    "max_tokens": 16
  }'

Response: {"detail": "Invalid session"}, HTTP 404.

4. POST /v1/completions against the Llama control checkpoint → 200 OK

Same key, same endpoint, same request shape; only the model field changed:

curl -X POST "$BASE/v1/completions" \
  -H "Authorization: Bearer $TINKER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tinker://47f6bf3f-1319-56ec-91c8-4b5f4860f652:train:0/sampler_weights/step_000280",
    "prompt": "ping",
    "max_tokens": 16,
    "temperature": 0.7
  }'

Response (truncated for readability):

{
  "id": "c0d1eb73890acb858a9f94da00a16f4c:sample:...",
  "object": "text_completion",
  "model": "tinker://47f6bf3f-1319-56ec-91c8-4b5f4860f652:train:0/sampler_weights/step_000280",
  "choices": [{"index": 0, "text": "ouin.circ_r - Pingouin\nChi-Square Goodness of", "finish_reason": "length", "logprobs": null}],
  "usage": {"prompt_tokens": 2, "completion_tokens": 16, "total_tokens": 18}
}

Status: HTTP 200.
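The control response can also be sanity-checked mechanically, which helps confirm that the 200 is a real completion and not an empty success. The sketch below is illustrative only: check_completion is a hypothetical helper, and it assumes a single choice per response so that a "length" finish means exactly max_tokens completion tokens were generated.

```python
# Fields copied from the control response above (id and model omitted).
response = {
    "object": "text_completion",
    "choices": [{
        "index": 0,
        "text": "ouin.circ_r - Pingouin\nChi-Square Goodness of",
        "finish_reason": "length",
        "logprobs": None,
    }],
    "usage": {"prompt_tokens": 2, "completion_tokens": 16, "total_tokens": 18},
}


def check_completion(resp, max_tokens):
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    usage = resp["usage"]
    if usage["prompt_tokens"] + usage["completion_tokens"] != usage["total_tokens"]:
        problems.append("usage totals do not add up")
    if usage["completion_tokens"] > max_tokens:
        problems.append("more completion tokens than max_tokens")
    for choice in resp["choices"]:
        # Single-choice assumption: 'length' implies the token budget was used up.
        if choice["finish_reason"] == "length" and usage["completion_tokens"] != max_tokens:
            problems.append("finish_reason is 'length' but budget was not exhausted")
        if not choice["text"]:
            problems.append("empty completion text")
    return problems


print(check_completion(response, max_tokens=16))
# -> []
```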

5. Python SDK against the same Qwen checkpoint → works

import tinker
from tinker import types
from tinker_cookbook.tokenizer_utils import get_tokenizer
from tinker_cookbook import renderers

service_client = tinker.ServiceClient()
sampling_client = service_client.create_sampling_client(
    model_path="tinker://60e46cd5-35d0-586f-8e68-53fafb9163d8:train:0/sampler_weights/step_000280"
)

tokenizer = get_tokenizer("Qwen/Qwen3.5-35B-A3B-Base")
renderer = renderers.get_renderer("qwen3", tokenizer)
model_input = renderer.build_generation_prompt([{"role": "user", "content": "ping"}])
stop = renderer.get_stop_sequences()

result = sampling_client.sample(
    model_input,
    sampling_params=types.SamplingParams(max_tokens=16, temperature=0.7, top_p=0.9, stop=stop),
    num_samples=1,
).result()

print(tokenizer.decode(result.sequences[0].tokens, skip_special_tokens=True))
# -> "🏓 Ping!"  (6 tokens, stop_reason=stop, EOT at position 5)

The SDK call was made ~30 seconds before reproduction step 2 above, so the OAI-compat 404 cannot be explained by "SDK session not yet warm": even with an active SDK session against the same checkpoint, the OAI router returns Invalid session. SDK and OAI sessions appear to be independent.
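One way to separate a warm-up delay from a persistent failure is to repeat the OAI-compat call a few times with a pause in between. The sketch below is a hypothetical probe, not something run for this report: call stands for any zero-argument function that performs the request and returns the HTTP status, and the attempt count and delay are arbitrary.

```python
import time


def probe(call, attempts=3, delay_s=10.0, sleep=time.sleep):
    """Invoke call() up to `attempts` times and return the statuses observed.

    Stops early on the first 200. `sleep` is injectable so the probe can be
    exercised without real waiting.
    """
    statuses = []
    for i in range(attempts):
        status = call()
        statuses.append(status)
        if status == 200:
            break
        if i < attempts - 1:
            sleep(delay_s)
    return statuses


def is_persistent_404(statuses):
    """True when at least one attempt was made and every attempt returned 404."""
    return bool(statuses) and all(s == 404 for s in statuses)


# Stand-in for the real request: always returns 404, as observed for the Qwen checkpoint.
observed = probe(lambda: 404, attempts=3, sleep=lambda _s: None)
print(observed, is_persistent_404(observed))
# -> [404, 404, 404] True
```

A warm-up problem would show up as one or more 404s followed by a 200; a result that is 404 on every attempt is consistent with the behavior described here.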

Expected vs actual

  • Expected: POST /v1/completions and /v1/chat/completions against a checkpoint that appears in /v1/models return a normal OpenAI-compat completion, as they do for the Llama-3.1-70B checkpoint.
  • Actual: both return HTTP 404 {"detail": "Invalid session"} consistently for the Qwen3.5-35B-A3B-Base (MoE) checkpoint.

Environment

  • Tinker SDK: latest installed (Python 3.14 venv, with the pydantic.v1 compatibility warning)
  • Endpoint base: https://tinker.thinkingmachines.dev/services/tinker-prod/oai/api/v1
  • Date of repro: 2026-05-21
