Add OpenAI privacy-filter token classification model#60
Merged
Port openai/privacy-filter to mlx-embeddings: a bidirectional 1.5B/50M-active MoE token classifier for PII detection (8 BIOES span labels, 33 classes).
The architecture is a bidirectional GPT-OSS variant with GQA + attention sinks, YARN RoPE (interleaved layout), 128-expert top-4 MoE, and ±128 sliding-window attention. Sanitize splits the fused concat-layout gate_up_proj into separate gate/up projections and transposes expert weights for mlx SwitchLinear.
Numerical parity with the HF reference in fp32: max logit diff < 0.004, 100% prediction agreement across PII test strings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
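The sanitize step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code from this PR: the function name `split_and_transpose` is hypothetical, and the assumption that the gate half comes first in the concatenated layout is mine (the commit only says the layout is concatenated rather than interleaved).

```python
import numpy as np


def split_and_transpose(gate_up_proj: np.ndarray):
    """Split a fused expert weight of shape (E, in, 2*out) into gate and up
    weights, each transposed to (E, out, in).

    Assumption: concat layout with the gate half first, then the up half.
    """
    num_experts, in_dim, fused_out = gate_up_proj.shape
    out_dim = fused_out // 2

    # Concat layout: slice the last axis into two contiguous halves.
    gate = gate_up_proj[..., :out_dim]
    up = gate_up_proj[..., out_dim:]

    # (E, in, out) -> (E, out, in), the per-expert layout the commit
    # says SwitchLinear expects.
    return np.swapaxes(gate, 1, 2), np.swapaxes(up, 1, 2)


# Tiny example: 2 experts, in=3, out=2 (so the fused last axis is 4).
fused = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
gate_proj, up_proj = split_and_transpose(fused)
```

An interleaved layout would instead need strided slices (`[..., ::2]` and `[..., 1::2]`), which is why the layout matters when converting checkpoints.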
Add the model to the supported architectures list and a Token Classification (PII detection) usage section with a working example.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use itertools.groupby to collapse consecutive BIOES tokens into clean decoded spans rather than dumping per-token BPE fragments.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
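A minimal sketch of the groupby idea, assuming per-token string fragments and BIOES labels of the form `B-X` / `I-X` / `E-X` / `S-X` / `O`. The helper names and the `PERSON` / `EMAIL` label names are hypothetical, not taken from the model's label set.

```python
from itertools import groupby


def entity_of(label: str) -> str:
    """Strip the B-/I-/E-/S- prefix; the outside label "O" is returned as-is."""
    return label.split("-", 1)[1] if "-" in label else label


def collapse_spans(tokens, labels):
    """Group consecutive tokens that share an entity type into one span."""
    spans = []
    pairs = zip(tokens, labels)
    for entity, group in groupby(pairs, key=lambda pair: entity_of(pair[1])):
        if entity == "O":
            continue  # skip non-PII runs
        # Join the decoded fragments and trim the leading BPE-style space.
        text = "".join(token for token, _ in group).strip()
        spans.append((entity, text))
    return spans


tokens = ["My", " name", " is", " Alice", " Smith", " at",
          " alice", "@example", ".com"]
labels = ["O", "O", "O", "B-PERSON", "E-PERSON", "O",
          "B-EMAIL", "I-EMAIL", "E-EMAIL"]
spans = collapse_spans(tokens, labels)
```

Grouping only by entity type is the simplest version: two back-to-back spans of the same type would be merged into one. Splitting on `B-` / `S-` boundaries would avoid that at the cost of a slightly longer loop.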
Call the router module directly (rather than a manual matmul on
.weight/.bias) so the MoE forward works with both dense nn.Linear
and QuantizedLinear weights. The softmax still runs in fp32 for
numerical parity with the reference.
Verified against /tmp/privacy-filter-{q4,mxfp4}: both quantizations
extract the same PII spans as the bf16 checkpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
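The routing change can be illustrated with a small NumPy sketch. It assumes the router is any callable module and that the softmax is taken over the selected top-k logits (a common choice in gpt-oss-style models, but an assumption here). `DenseRouter` and `route_top_k` are hypothetical names for illustration.

```python
import numpy as np


class DenseRouter:
    """Stand-in for a dense linear layer; a quantized layer would expose
    the same call interface but different internal parameters."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, x):
        return x @ self.weight.T + self.bias


def route_top_k(router, x, k=4):
    # Call the module itself instead of reading .weight/.bias, so any
    # layer type with a compatible __call__ works.
    logits = router(x).astype(np.float32)  # softmax runs in fp32

    # Pick the k largest logits per token.
    top_idx = np.argsort(-logits, axis=-1)[..., :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)

    # Numerically stable softmax over the selected experts only.
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights


# 6 experts, hidden size 3, half-precision inputs and parameters.
weight = (np.arange(18).reshape(6, 3) / 10).astype(np.float16)
bias = np.zeros(6, dtype=np.float16)
x = np.ones((2, 3), dtype=np.float16)
expert_idx, expert_weights = route_top_k(DenseRouter(weight, bias), x, k=4)
```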
Add a quant_predicate on the privacy-filter Model that keeps the MoE router at 8 bits while the rest of the weights quantize to the user's chosen bit width. The router is a small but routing-sensitive linear layer; uniform 4-bit quantization of the router measurably degraded accuracy in gpt-oss-style models, and the same applies here.
Follow mlx-vlm's pattern in convert.py: delegate quantization to mlx_lm.utils.quantize_model, passing a wrapper that composes mlx-embeddings' skip-vision / group-size checks with the model's quant_predicate. mlx_lm handles recording per-layer overrides into config["quantization"][path], and the existing load path in utils.py already respects those.
Verified: bf16 and q4 (uniform) both still extract the same PII spans; mixed-precision q4-experts + q8-router saves to disk at 4.52 bits/weight, loads correctly, and extracts the same spans.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
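A per-model predicate of this kind might look like the sketch below. The class name, the `router` path suffix, and the group size of 64 are assumptions for illustration; only the "router stays at 8 bits" rule comes from the commit message. The convention of returning either a bool or a dict of per-layer parameters is the one the commit describes.

```python
class PrivacyFilterModelSketch:
    """Hypothetical stand-in for the model class; only the predicate matters."""

    @property
    def quant_predicate(self):
        def predicate(path, module):
            # Assumed naming: router layers have a path ending in "router".
            if path.endswith("router"):
                # Per-layer override: keep the router at 8 bits.
                return {"group_size": 64, "bits": 8}
            # Everything else uses the user's chosen bit width.
            return True

        return predicate


predicate = PrivacyFilterModelSketch().quant_predicate
router_choice = predicate("model.layers.0.mlp.router", None)
expert_choice = predicate("model.layers.0.mlp.experts.gate_proj", None)
```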
Keep the local nn.quantize call but switch the class_predicate to the compose-with-model.quant_predicate pattern from mlx-vlm: chain the default skip-vision / group-size predicate with the model's own predicate, and record any per-layer dict results so the load path re-quantizes the same way.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
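The composition described here could be sketched as below. All names are hypothetical, the base predicate is a simplified stand-in for the skip-vision / group-size check, and the stub modules only carry a `weight.shape` so the example runs without MLX.

```python
from types import SimpleNamespace


def base_predicate(path, module, group_size=64):
    """Simplified default check: skip vision layers and layers whose input
    dimension is not divisible by the group size."""
    if "vision" in path:
        return False
    return module.weight.shape[-1] % group_size == 0


def model_predicate(path, module):
    """Model-specific rule (assumed): keep the router at 8 bits."""
    if path.endswith("router"):
        return {"group_size": 64, "bits": 8}
    return True


def compose(base, model, overrides):
    """Chain the two predicates and record per-layer dict results."""

    def predicate(path, module):
        if not base(path, module):
            return False  # base check wins: the layer is not quantized
        result = model(path, module)
        if isinstance(result, dict):
            overrides[path] = result  # saved so loading can re-quantize alike
        return result

    return predicate


overrides = {}
predicate = compose(base_predicate, model_predicate, overrides)

ok_layer = SimpleNamespace(weight=SimpleNamespace(shape=(128, 640)))
odd_layer = SimpleNamespace(weight=SimpleNamespace(shape=(128, 100)))

router_result = predicate("layers.0.mlp.router", ok_layer)
expert_result = predicate("layers.0.mlp.experts.gate_proj", ok_layer)
skipped_result = predicate("layers.1.mlp.router", odd_layer)
```

In a real conversion the `overrides` dict would be merged into the saved quantization config keyed by layer path, which is what lets the load path rebuild the same mixed-precision layout.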
…based on mode
Refactor the quantize_model function so that model_quant_predicate is set only when the quantization mode is "affine".
…itions
Refactor the console-script entry points to specify the module paths directly, and add a new entry point for 'mlx_embeddings.convert'.
Summary
- Adds `openai/privacy-filter` (bidirectional GPT-OSS variant, 1.5B total / 50M active MoE token classifier for PII detection).
- GQA with attention sinks via `mx.fast.scaled_dot_product_attention(sinks=…)`, YARN RoPE with interleaved layout, 128-expert top-4 MoE using `SwitchGLU` + a custom `PrivacyFilterSwiGLU` activation `(up+1)·gate·σ(α·gate)`, bidirectional ±128 sliding-window mask, and a 33-class BIOES token-classification head (`score`).
- `sanitize()` drops the parallel `original/` OpenAI-format checkpoint, splits the fused concat-layout `gate_up_proj` into separate `gate_proj` / `up_proj`, and transposes expert weight matrices from `(E, in, out)` to `(E, out, in)` to match `SwitchLinear`.
- Adds a `logits` field to `BaseModelOutput` so token-classification outputs have a natural home.
Numerical parity
fp32 vs HF reference on four PII prompts:
- `My name is Alice Smith and I live at 123 Main Street.`
- `Email alice@example.com or phone 555-123-4567.`
- `Visit https://example.com. My SSN is 123-45-6789.`
- `Hello, my account number is 9876543210 on 2024-03-15.`
Test plan
- `pytest mlx_embeddings/tests/test_models.py`: all 16 tests pass, including the new `test_openai_privacy_filter_model`
- `mlx_embeddings.utils.load("openai/privacy-filter")` + forward pass
- Parity against the `transformers` reference (`attn_implementation="eager"`)
- `mlx_embeddings.convert`
🤖 Generated with Claude Code
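Two of the architecture pieces named in the Summary are small enough to sketch in NumPy: the `(up+1)·gate·σ(α·gate)` activation and the bidirectional ±128 sliding-window mask. This is an illustration only. The function names are hypothetical, the default `alpha=1.702` is a placeholder assumption (the PR text does not state the value), and any clamping the real activation may apply is omitted.

```python
import numpy as np


def privacy_filter_swiglu(gate, up, alpha=1.702):
    """(up + 1) * gate * sigmoid(alpha * gate), as written in the Summary.

    alpha=1.702 is an assumed placeholder, not a value confirmed by the PR.
    """
    sigmoid = 1.0 / (1.0 + np.exp(-alpha * gate))
    return (up + 1.0) * gate * sigmoid


def bidirectional_window_mask(seq_len, window=128):
    """Boolean mask where True means query i may attend to key j.

    Bidirectional: a token sees up to `window` positions on either side,
    unlike a causal window that only looks backwards.
    """
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return distance <= window


activation = privacy_filter_swiglu(np.array([0.0, 1.0]), np.array([3.0, 0.0]), alpha=0.0)
small_mask = bidirectional_window_mask(5, window=1)
full_mask = bidirectional_window_mask(300)
```

With `alpha=0.0` the sigmoid is a constant 0.5, which makes the output easy to check by hand: `(3+1)·0·0.5` and `(0+1)·1·0.5`.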