Commit 41b2e83
feat: Add 4-bit quantization support for LLM inference on Apple Silicon (#96)
* feat: Add 4-bit quantization support for LLM inference on Apple Silicon
This PR adds quantized tensor operations to EMLX, enabling efficient
large language model inference on Apple Silicon GPUs. It powers a pure
Elixir LLM inference stack achieving 135 tok/s on Qwen3-8B-4bit.
## Motivation
Running 8B parameter models requires 16GB+ at fp16. With 4-bit
quantization, the same model fits in ~5GB, enabling inference on
consumer Macs. This work is part of a broader effort to bring
production LLM inference to the Elixir ecosystem:
- bobby_posts: Pure Elixir Qwen3-8B inference (135 tok/s)
- bobby_posts_adapters: LoRA fine-tuning for personalized generation
- bumblebee_quantized: Quantized model loading for Bumblebee
- safetensors_ex: MLX 4-bit safetensors format support
## Implementation
### NIFs (c_src/emlx_nif.cpp)
Three new NIFs wrapping MLX's quantization functions:
- quantized_matmul(x, w, scales, biases, transpose, group_size, bits)
- dequantize(w, scales, biases, group_size, bits)
- quantize(w, group_size, bits)
### Backend Integration (lib/emlx/backend.ex)
Per Paulo's feedback, quantization metadata is stored directly on the
Backend struct (not a nested map):
defstruct [:ref, :shape, :type, :data, :scales, :biases, :group_size]
When Nx.dot detects a quantized tensor (scales != nil), it automatically
dispatches to quantized_matmul. The tensor type {:s, 4} carries the bit
width, so bits is not stored separately.
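The dispatch rule above can be sketched in plain Elixir (a toy module for illustration, not the actual EMLX source; `Sketch.Backend` and its `dot/2` are hypothetical names):

```elixir
# Toy backend struct showing how `scales != nil` acts as the quantization
# marker that routes a dot product to the quantized kernel.
defmodule Sketch.Backend do
  defstruct [:ref, :shape, :type, :data, :scales, :biases, :group_size]

  # Quantized tensors carry scales, so a guard on them is enough to dispatch.
  def dot(x, %__MODULE__{scales: scales} = w) when not is_nil(scales) do
    {:quantized_matmul, x, w}
  end

  # Regular tensors have scales: nil and fall through to the plain matmul.
  def dot(x, %__MODULE__{} = w), do: {:matmul, x, w}
end
```

Matching on `scales` keeps the hot path free of extra metadata lookups: the flat struct fields (per the feedback above) make the guard a single map access.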
### User API (lib/emlx/quantization.ex)
Clean user-facing module with comprehensive documentation:
# Quantize weights
{q_weight, scales, biases} = EMLX.Quantization.quantize(weight)
# Create tensor for Nx operations
qt = EMLX.Quantization.tensor(q_weight, scales, biases, shape)
# Nx.dot automatically uses quantized_matmul
result = Nx.dot(input, qt)
### Elixir API (lib/emlx.ex)
Low-level functions for direct NIF access:
- EMLX.quantized_matmul/7
- EMLX.dequantize/5
- EMLX.quantize/3
- EMLX.quantized_tensor/5
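To make the semantics concrete, here is a pure-Elixir toy of the 4-bit affine scheme these functions wrap (a sketch on flat lists of floats; the real EMLX.quantize/3 and EMLX.dequantize/5 operate on MLX arrays, and `Sketch.Quant4` is a made-up module name):

```elixir
defmodule Sketch.Quant4 do
  # Affine 4-bit quantization of a single group: map [min, max] onto the
  # integer range 0..15, then recover values as scale * (q - bias).
  def quantize(vals) do
    {mn, mx} = Enum.min_max(vals)
    scale = max(mx - mn, 1.0e-8) / 15
    bias = -mn / scale
    q = Enum.map(vals, &round(&1 / scale + bias))
    {q, scale, bias}
  end

  def dequantize(q, scale, bias) do
    Enum.map(q, fn v -> scale * (v - bias) end)
  end
end
```

Round-tripping a group through quantize then dequantize recovers the inputs up to the step size `scale`, which is the quantization error the 4-bit format trades for a ~4x memory reduction versus fp16.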
## MLX 4-bit Format
MLX uses group-wise affine quantization:
dequantized[i] = scales[i/group_size] * (packed_int4[i] - biases[i/group_size])
Weights are packed as uint32 (8 int4 values per uint32). With group_size=64:
- Weight [out, in] becomes [out, in/8] as uint32
- Scales: [out, in/group_size] as bfloat16
- Biases: [out, in/group_size] as bfloat16
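A minimal pure-Elixir illustration of the group-wise formula above, applied to a flat list (`Sketch.Groupwise` is a hypothetical name; the real weights are packed uint32 MLX arrays, not lists):

```elixir
defmodule Sketch.Groupwise do
  # dequantized[i] = scales[div(i, group_size)] * (q[i] - biases[div(i, group_size)])
  # Each run of `group_size` consecutive values shares one scale and one bias.
  def dequantize(q, scales, biases, group_size) do
    q
    |> Enum.with_index()
    |> Enum.map(fn {v, i} ->
      g = div(i, group_size)
      Enum.at(scales, g) * (v - Enum.at(biases, g))
    end)
  end
end
```

With the real group_size=64, a [4096, 4096] weight stores 4096 x 512 uint32 words (8 int4 values per word) plus 4096 x 64 scales and 4096 x 64 biases, matching the shapes listed above.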
## Tests
33 tests covering:
- Low-level NIF operations (6 tests)
- Backend integration with Nx.dot (9 tests)
- EMLX.Quantization module API (18 tests)
- End-to-end LLM inference patterns
## Performance
On Apple M-series with Qwen3-8B-4bit:
- Generation throughput: ~135 tok/s
- Memory: 4-5GB vs 16GB for fp16
- 14x faster than Python mlx_lm (9.5 tok/s)
## Bumblebee Integration Path
With this merged, quantized models can use EMLX as a pure backend:
1. Model loader detects quantized safetensors
2. Creates EMLX.Quantization.tensor for each quantized weight
3. Model definition unchanged - Nx.dot works transparently
4. EMLX backend handles all dispatch
This enables upstreaming quantized model support to Bumblebee without
changing the serving interface.
## References
- Use case: https://github.com/notactuallytreyanastasio/bobby_posts
- PR discussion: #96
- MLX quantization: https://ml-explore.github.io/mlx/build/html/python/nn.html
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: PR feedback suggestions
Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>
* Apply suggestion from @polvalente
Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>
* green out after merge
* refactor: enable quantized ops to run in defn
* feat: fused quantized matmul in backend
* fix: tag metal tests
* refactor default device
* fix: from pointer use default device
* fix: propagate env
* fix ensure_all_started
* fix ensure all started call
* fix again
---------
Co-authored-by: Trey Anastasio <notactuallytreyanastasio@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Paulo Valente <16843419+polvalente@users.noreply.github.com>
1 parent 3ceaf34 · commit 41b2e83
11 files changed
Lines changed: 1242 additions & 47 deletions