Skip to content

Add JANG model loader integration#212

Open
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2
Open

Add JANG model loader integration#212
samuelfaj wants to merge 27 commits into
raullenchai:mainfrom
samuelfaj:add-jangtq-loader-v2

Conversation

@samuelfaj
Copy link
Copy Markdown
Contributor

Summary

  • Detect local or Hugging Face models with jang_config.json before the vendored architecture fallback.
  • Route JANGTQ/MXTQ models through jang_tools.load_jangtq.load_jangtq_model and standard JANG models through jang_tools.loader.load_jang_model.
  • Add optional rapid-mlx[jang] dependency extra and regression tests for JANGTQ, JANG v2, and normal DeepSeek V4 fallback behavior.
  • Patch DeepSeek V4 JANGTQ tokenizer loading so jang-tools does not fall through Transformers AutoConfig for the vendored deepseek_v4 architecture.

Root cause

DeepSeek V4 JANGTQ bundles declare weight_format: mxtq and store routed experts as tq_packed/tq_norms tensors. The existing loader treated them like normal DeepSeek V4 MLX weights, so mlx_lm.load_model rejected thousands of unexpected JANGTQ parameters. During live validation, jang-tools also hit a DSV4 tokenizer/EOS expansion path that calls Transformers AutoConfig; the wrapper now patches that call for DSV4 JANGTQ to load tokenizer.json directly.

Validation

  • uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q
  • uv run --extra dev ruff check pyproject.toml vllm_mlx/utils/tokenizer.py tests/test_jangtq_loader.py
  • uv run --extra jang python - <<'PY' ... import jang_tools ... PY
  • Local model detection: DeepSeek-V4-Flash-JANGTQ detected as weight_format=mxtq, profile=JANGTQ2.
  • Live serve validation reached DSV4 streaming hydrate, replaced 129 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, then exposed a tokenizer path bug that this branch patches.

@samuelfaj
Copy link
Copy Markdown
Contributor Author

Validation update:

  • Full JANGTQ serve startup completed locally for .
  • Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
  • OpenAI-compatible request returned HTTP 200 with , , , .
  • Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

@samuelfaj
Copy link
Copy Markdown
Contributor Author

Validation update:

  • Full JANGTQ serve startup completed locally for /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
  • Hydration replaced 129 DSV4 routed TQ modules, loaded 85 regular shards, patched 43 SwitchGLU instances, completed warmup, and served on port 8011.
  • OpenAI-compatible /v1/chat/completions request returned HTTP 200 with model=local, prompt_tokens=9, completion_tokens=8, total_tokens=17.
  • Additional compatibility fixes landed in the branch for DSV4 tokenizer metadata and MLX scalar RoPE offsets under rapid-mlx batching.

@samuelfaj
Copy link
Copy Markdown
Contributor Author

Final validation update:

  • Fixed quality issue by routing DSV4 JANGTQ through direct mlx_lm.generate on the model-owning MLX worker instead of the continuous batching generator path, which produced corrupted/repetitive tokens for this runtime.
  • Server validation command completed on port 8013 with /Users/samuelfajreldines/dev/models/DeepSeek-V4-Flash-JANGTQ.
  • /v1/chat/completions simple math request returned HTTP 200 with content exactly 4, prompt_tokens=17, completion_tokens=1, total_tokens=18.
  • /v1/chat/completions exact-ok request returned HTTP 200 with content exactly ok, prompt_tokens=9, completion_tokens=1, total_tokens=10.
  • Regression tests: uv run --extra dev --extra jang python -m pytest tests/test_jangtq_loader.py tests/test_deepseek_v4_vendored.py -q passed, 12 tests.
  • Ruff passed for changed files.

@samuelfaj
Copy link
Copy Markdown
Contributor Author

Performance/streaming update:

  • The DeepSeek V4 JANGTQ direct fallback now uses mlx_lm.stream_generate for streaming requests, so tokens are delivered as they are produced instead of waiting for full completion.
  • Non-streaming requests keep the safe direct mlx_lm.generate path.
  • Added an explicit TODO in the direct fallback explaining the future real batching fix: compare BatchGenerator logits/output against mlx_lm.generate, then fix cache offset handling, prompt-cache merge/extract, and RoPE position state until batching is bit-consistent with the direct path.
  • Live streaming validation returned SSE chunks with content exactly ok and final usage prompt_tokens=9, completion_tokens=2, total_tokens=11.
  • Focused tests passed: 17 tests.
  • Ruff passed.

@samuelfaj samuelfaj force-pushed the add-jangtq-loader-v2 branch from 2f48ce6 to 0ee615b Compare May 5, 2026 03:31
@samuelfaj samuelfaj marked this pull request as draft May 5, 2026 14:23
@samuelfaj samuelfaj marked this pull request as ready for review May 5, 2026 15:48
@samuelfaj samuelfaj force-pushed the add-jangtq-loader-v2 branch from ea128df to 9b0bb10 Compare May 5, 2026 15:58
@raullenchai
Copy link
Copy Markdown
Owner

Hi @samuelfaj — thanks for the work. Applying our new SOP §0 necessity gate (see docs/development/pr_merge_sop.md) I need a demand signal before merging.

Holding for clarification, not closing yet.

Reasoning:

To unlock merge, I need one or more of:

  1. User demand: a GitHub issue from a user (you or someone else) saying "I want to serve JANG model X with rapid-mlx and it doesn't work". Even one is enough.
  2. JANG popularity signal: pointer to a HuggingFace model page using JANGTQ/MXTQ format with non-trivial download counts, or a community discussion (Reddit/Discord/X) showing people are trying to run JANG locally.
  3. Scope split: separate the JANG-specific changes (vllm_mlx/jang_tools/*, tests/test_jangtq_loader.py, jang detection in loader, [jang] extras) from the unrelated infra changes (anthropic auth, completions, health, request_metrics, etc.). The current diff makes it impossible to review JANG support on its own merits.

For now please rebase on top of latest main (which now has #260, #262, #258 merged) and drop the parts that came from #205/#212-stack-overlap. After that I can give the JANG-specific surface the focused review it deserves.

Apologies for the friction — the necessity gate is new this week and I'm working through the backlog. Your #204 (Qwen tool-call fix) is being reviewed now since it has clear user value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants