Replies: 1 comment
Is this spam or what? None of the links are working.
Hi, we would like to share our project on deploying an 8B Llama model on the Jetson Orin Nano using only 2.5GB of GPU shared memory at peak, with a comparison against a llama.cpp INT4 baseline.
Baseline (llama.cpp INT4)
In our baseline setup, Llama-3.1-8B INT4 reached:
On Jetson Orin Nano, this uses most of the available memory budget and leaves limited headroom for other edge workloads.
Our result
Using our own extreme low-bit (1.58-bit) deployment pipeline, we ran an 8B-class Llama model with:
This makes the deployment more practical on Orin Nano when the LLM needs to coexist with other components on the device.
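As a rough illustration of why the bit width matters here, the weight-only storage for an 8B-parameter model can be estimated with simple arithmetic. This is a minimal sketch, not the pipeline described in this post: the parameter count is taken as a nominal 8e9, every weight is assumed to be stored at the stated bit width, and KV cache, activations, runtime buffers, and any layers kept at higher precision are ignored. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight-only storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal 8B parameters; the exact count depends on the model.
NUM_PARAMS = 8e9

int4_gb = weight_footprint_gb(NUM_PARAMS, 4.0)      # 4-bit weights
ternary_gb = weight_footprint_gb(NUM_PARAMS, 1.58)  # ~log2(3) bits per ternary weight

print(f"INT4 weights:     ~{int4_gb:.2f} GB")
print(f"1.58-bit weights: ~{ternary_gb:.2f} GB")
print(f"Weight-only saving: ~{int4_gb - ternary_gb:.2f} GB")
```

Under these assumptions the weights alone drop from about 4 GB to about 1.6 GB. Actual peak memory will be higher than the weight-only figure, since the runtime also needs space for the KV cache and intermediate buffers.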
Main Techniques
Demo Video
https://youtu.be/yVZSksaqf08?si=mld2zZgHXssOIOvy
Notes
Why this may be useful
For edge deployments, memory headroom matters because the LLM often needs to run alongside other components such as:
Reducing the model footprint makes combined on-device LLM deployment more realistic even on Nano-class edge SoCs.
We are sharing more details at GTC 2026.
If you are blocked by memory footprint or latency while building Llama or other LLMs on Jetson or other SoC platforms, please leave us a message. Let's connect at GTC!
Contact: https://enerzai.com/contact