Showcase / question: a board-proven offline language runtime on ESP32-C3, and whether some language capability may eventually move beyond general dense model deployment #3272
Replies: 1 comment
Interesting project! Running language models on ESP32-C3 is pushing the boundaries of edge inference.

This connects to a broader trend: the shift from "bigger is better" to "smaller but smarter." Qwen just dropped a 27B dense model claiming flagship-level coding performance (233 upvotes on HN: https://news.ycombinator.com/item?id=47863217). The argument: a dense 27B where every parameter is active can compete with MoE models at 10x the params for inference tasks.

For edge deployment specifically, the bottleneck isn't just model size; it's KV cache memory. Google's TurboQuant paper just showed extreme KV cache quantization to <3 bits with near-zero accuracy loss, which could be a game-changer for memory-constrained devices: https://arxiv.org/pdf/2504.19874

Would love to see benchmarks on how your runtime handles dense vs. MoE architectures on constrained hardware. The small-model trend is real; we wrote about it here: https://miaoquai.com/stories/qwen3-27b-small-model-big-problem.html
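To make the memory argument concrete, here is a rough sketch of low-bit KV-cache quantization and the footprint arithmetic behind it. This is not the TurboQuant method itself, just plain per-group absmax quantization to signed 3-bit codes; the group size, array sizes, and function name are illustrative assumptions.

```c
/*
 * Minimal sketch (NOT the TurboQuant algorithm): per-group absmax
 * quantization of a KV-cache buffer to signed 3-bit codes.
 * Group size, buffer size, and names are illustrative assumptions.
 */
#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define GROUP 32  /* values per quantization group (assumed) */

/* Quantize n floats (n % GROUP == 0) to codes in [-4, 3], with one
 * float scale per group. Codes are kept one-per-byte for clarity; a
 * real kernel would bit-pack roughly 2.67 values per byte. */
static void quantize_kv_3bit(const float *x, int n, int8_t *codes, float *scales)
{
    for (int g = 0; g < n / GROUP; ++g) {
        float amax = 1e-8f;
        for (int i = 0; i < GROUP; ++i) {
            float a = fabsf(x[g * GROUP + i]);
            if (a > amax) amax = a;
        }
        /* scale chosen so |x| <= amax fits the signed 3-bit range */
        float scale = amax / 4.0f;
        scales[g] = scale;
        for (int i = 0; i < GROUP; ++i) {
            int q = (int)lroundf(x[g * GROUP + i] / scale);
            if (q < -4) q = -4;
            if (q >  3) q =  3;
            codes[g * GROUP + i] = (int8_t)q;
        }
    }
}

int main(void)
{
    /* Toy KV cache slice: seq_len * head_dim values. */
    enum { N = 1024 };
    static float  kv[N];
    static int8_t codes[N];
    static float  scales[N / GROUP];

    for (int i = 0; i < N; ++i) kv[i] = sinf(0.01f * i);
    quantize_kv_3bit(kv, N, codes, scales);

    /* fp16 cache: 2 bytes/value; 3-bit packed codes + fp32 group scales: */
    double fp16_bytes   = N * 2.0;
    double packed_bytes = N * 3.0 / 8.0 + (N / GROUP) * 4.0;
    printf("fp16: %.0f B, 3-bit packed: %.0f B (%.1fx smaller)\n",
           fp16_bytes, packed_bytes, fp16_bytes / packed_bytes);
    return 0;
}
```

Even this naive scheme shrinks an fp16 cache by roughly 4x; the hard part, which the linked paper targets, is doing it with near-zero accuracy loss.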
Hi MLX folks,
I wanted to share a small but unusual language-runtime project that may still be relevant to the broader question of how language capability is represented, trained, and deployed on-device, even though it sits far outside the usual Apple Silicon path.
We built a public demo line called Engram and deployed it on a commodity ESP32-C3.
Current public numbers:
Host-side benchmark capability:
- LogiQA = 0.392523
- IFEval = 0.780037

Published board proof:
- LogiQA 642 = 249 / 642 = 0.3878504672897196
- host_full_match = 642 / 642
- 1,380,771 bytes
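For context, here is a minimal sketch of the kind of comparison behind these two numbers (illustrative only, not the exact published harness): host_full_match counts items where the board output is identical to the host runtime's output, and the LogiQA score counts items that match the reference label. The toy arrays and names below are placeholders.

```c
/*
 * Illustrative board-proof check (hypothetical harness, not the
 * published tooling): compare board answers against host answers
 * (host_full_match) and against gold labels (LogiQA score).
 */
#include <stdio.h>
#include <string.h>

/* Count positions where two answer lists agree exactly. */
static int count_matches(const char **a, const char **b, int n)
{
    int m = 0;
    for (int i = 0; i < n; ++i)
        if (strcmp(a[i], b[i]) == 0)
            m++;
    return m;
}

int main(void)
{
    /* Toy stand-ins; the published proof covers all 642 LogiQA items. */
    const char *board_out[] = { "B", "A", "D" };  /* captured from the board  */
    const char *host_out[]  = { "B", "A", "D" };  /* from the host runtime    */
    const char *gold[]      = { "B", "C", "D" };  /* LogiQA reference labels  */
    const int n = 3;

    int full_match = count_matches(board_out, host_out, n);
    int correct    = count_matches(board_out, gold, n);

    printf("host_full_match = %d / %d\n", full_match, n);
    printf("LogiQA          = %d / %d = %.4f\n", correct, n, (double)correct / n);
    return 0;
}
```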
Important scope note: this is not presented as unrestricted open-input native LLM generation on MCU.
The board-side path is closer to a flash-resident, table-driven runtime than to open-ended generation.
So this is not a standard dense model deployed to an efficient local inference runtime. It is closer to a task-specialized language runtime whose behavior has been crystallized into a compact executable form under severe physical constraints.
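To make "flash-resident, table-driven runtime" concrete, here is a minimal sketch of the general shape such a path can take: a const table compiled into flash maps a hashed, normalized prompt to a precomputed response. This is illustrative only, not Engram's actual code; the table entries, hash, and API name are placeholders.

```c
/*
 * Illustrative sketch only, not Engram's implementation. It shows the
 * general shape of a flash-resident, table-driven language runtime:
 * the input is normalized, hashed, and looked up in a const table that
 * offline tooling has crystallized from model behavior.
 */
#include <stdint.h>
#include <ctype.h>
#include <stdio.h>

typedef struct {
    uint32_t    key_hash;   /* hash of the normalized prompt        */
    const char *response;   /* precomputed answer, stored in flash  */
} entry_t;

/* On ESP32-C3, const data like this table stays in flash, not RAM.
 * Keys below are placeholders, so the lookup in main() will miss. */
static const entry_t TABLE[] = {
    { 0x9c7f2a11u, "Paris" },
    { 0x1b44e0d3u, "4" },
};

/* FNV-1a over a lowercased, whitespace-collapsed view of the prompt. */
static uint32_t normalize_and_hash(const char *s)
{
    uint32_t h = 2166136261u;
    int prev_space = 1;
    for (; *s; ++s) {
        unsigned char c = (unsigned char)tolower((unsigned char)*s);
        if (isspace(c)) {
            if (prev_space) continue;
            c = ' ';
            prev_space = 1;
        } else {
            prev_space = 0;
        }
        h = (h ^ c) * 16777619u;
    }
    return h;
}

/* Hypothetical API name. */
const char *engram_lookup(const char *prompt)
{
    uint32_t h = normalize_and_hash(prompt);
    for (size_t i = 0; i < sizeof(TABLE) / sizeof(TABLE[0]); ++i)
        if (TABLE[i].key_hash == h)
            return TABLE[i].response;
    return NULL;  /* outside the crystallized task set */
}

int main(void)
{
    const char *out = engram_lookup("What is the capital of France?");
    printf("%s\n", out ? out : "<no table entry>");
    return 0;
}
```

The point of the sketch is the trade-off it makes visible: behavior is fixed at build time, lookups are cheap and deterministic, and anything outside the crystallized task set simply misses, which is why I frame this as a task-specialized runtime rather than on-device generation.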
Repo:
https://github.com/Alpha-Guardian/Engram
I’m posting here because MLX sits at an interesting intersection of language-model research, local implementation, and efficient deployment.
What I’d be curious about is whether systems like this should be thought of as a distinct kind of deployable language artifact, or as something else entirely.
If this direction is relevant to your team, I’d be glad to compare notes.