Running 8B Llama on Jetson Orin Nano (vs llama.cpp INT4 baseline) #20383

EZ-sm · 2026-03-11T07:07:08Z

Hi, we would like to share our project on deploying an 8B Llama model on Jetson Orin Nano using only 2.5GB of GPU shared memory (peak), along with a comparison against a llama.cpp INT4 baseline.

Baseline (llama.cpp INT4)

In our baseline setup, Llama-3.1-8B INT4 reached:

5.2GB GPU shared memory (peak)
6.8GB total RAM (peak)

On Jetson Orin Nano, this uses most of the available memory budget and leaves limited headroom for other edge workloads.

Our result

Using our own extreme low-bit (1.58-bit) deployment pipeline, we ran an 8B-class Llama model with:

2.5GB GPU shared memory (peak)
4.1GB total RAM (peak)

This makes the deployment more practical on Orin Nano when the LLM needs to coexist with other components on the device.
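
As a quick sanity check on the figures above, here is a minimal sketch that computes the absolute and relative savings from the four peak numbers reported in this post. The dictionary keys and the `savings` helper are illustrative names, not part of any tool used in the project.

```python
# Peak memory figures reported in this post, in GB.
baseline = {"gpu_shared": 5.2, "total_ram": 6.8}  # llama.cpp INT4
low_bit = {"gpu_shared": 2.5, "total_ram": 4.1}   # 1.58-bit pipeline


def savings(before, after):
    """Return (GB saved, percent reduction), each rounded to one decimal."""
    saved = before - after
    return round(saved, 1), round(100 * saved / before, 1)


for key in baseline:
    gb, pct = savings(baseline[key], low_bit[key])
    print(f"{key}: {gb} GB saved ({pct}% lower)")
```

With the reported numbers, both metrics drop by 2.7GB, which is roughly a 52% reduction in peak GPU shared memory and roughly a 40% reduction in peak total RAM.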

Main Techniques

1.58-bit quantization (Mixed-precision QAT)
Kernel-level optimizations (Custom kernel for embedding access and layer fusion)
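
For readers unfamiliar with the term, "1.58-bit" usually refers to ternary weights in {-1, 0, +1}, since log2(3) is about 1.585 bits. The sketch below shows one common way to produce such weights (absmean scaling followed by rounding and clamping) and a rough weight-only size estimate. It is an illustration under assumptions, not the pipeline described in this post: the absmean scheme, the function names, and the round figure of 8e9 parameters are all assumed, and the estimate ignores layers kept at higher precision, scale factors, the KV cache, and activations.

```python
import math


def ternary_quantize(weights):
    """Quantize a list of floats to {-1, 0, +1} with a single absmean scale.

    Assumed scheme for illustration; the post does not state which
    quantizer its QAT pipeline uses.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    # Scale, round to the nearest integer, then clamp to the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def weight_gb(n_params, bits_per_weight):
    """Weight-only storage in GB (1 GB = 1e9 bytes), ignoring all overheads."""
    return n_params * bits_per_weight / 8 / 1e9


BITS_TERNARY = math.log2(3)  # about 1.585 bits per ternary weight
N_PARAMS = 8e9               # assumed round figure for an 8B-class model

q, scale = ternary_quantize([0.9, -0.05, -1.2, 0.4])
print(q, scale)
print(f"INT4 weights:    ~{weight_gb(N_PARAMS, 4):.2f} GB")
print(f"ternary weights: ~{weight_gb(N_PARAMS, BITS_TERNARY):.2f} GB")
```

The gap between these idealized weight sizes and the measured peaks above is expected, because real deployments also hold embeddings, runtime buffers, and the KV cache in memory.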

Demo Video

https://youtu.be/yVZSksaqf08?si=mld2zZgHXssOIOvy

Notes

Our 1.58-bit Llama model has had only limited instruction tuning to date, and we expect further improvements with additional tuning.

Why this may be useful

For edge deployments, memory headroom matters because the LLM often needs to run alongside other components such as:

Other AI models including STT, TTS, and more
System workloads including perception, logging, control, networking, and more

Reducing the model footprint makes combined on-device LLM deployment more realistic even on Nano-class edge SoCs.

We are sharing more details at GTC 2026.

If you are blocked by memory footprint or latency while building Llama or other LLMs on Jetson or other SoC platforms, please leave us a message. Let's connect at GTC!

Contact: https://enerzai.com/contact

Reslan-Tinawi · 2026-06-05T11:17:13Z

Is this spam or what? None of the links are working.
