Environment
Verified on the latest tooling supported by optimum-intel main (which
declares transformers>=4.45,<5.1):
- optimum-intel: 1.27.0.dev0+190d59f (main HEAD)
- transformers: 5.0.0 (latest within the <5.1 constraint)
- openvino: 2026.1.0
- nncf: 3.1.0
- torch: 2.11.0
- Python: 3.12.3, Linux x86_64, 133 GB RAM + 8 GB swap
Description
optimum-cli export openvino --model mistralai/Mixtral-8x22B-v0.1 --weight-format int4 ...
is a non-starter on any machine with less than ~400 GB RAM. The export
pipeline calls from_pretrained and then traces the entire model through
TorchScript in a single pass, holding the full BF16 weight set plus the
traced graph plus the OV constants in RAM simultaneously.
Mixtral 8x22B is 282 GB BF16 on disk; peak resident set during export is
roughly double, depending on transformers' state-dict load behaviour. On a
133 GB RAM + 8 GB swap host the export is OOM-killed within minutes.
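The figures above can be checked with back-of-the-envelope arithmetic. The sketch below takes the 282 GB checkpoint size and the host's 133 GB RAM + 8 GB swap from this report as inputs. The 2.0 multiplier stands in for "roughly double" and is an estimate, not a measurement; the function name is made up for illustration.

```python
# Rough peak-RAM estimate for the monolithic export path.
# Inputs are the figures quoted in this issue; the 2.0x multiplier is an
# estimate ("roughly double"), not a measured value.
BYTES_PER_BF16 = 2


def estimate_export_peak_gb(weights_gb: float, peak_multiplier: float = 2.0) -> float:
    """Estimated peak resident set (GB) while weights and traced graph coexist."""
    return weights_gb * peak_multiplier


weights_gb = 282.0                             # Mixtral 8x22B BF16 size on disk
params_billion = weights_gb / BYTES_PER_BF16   # implied parameter count, in billions
peak_gb = estimate_export_peak_gb(weights_gb)
available_gb = 133.0 + 8.0                     # RAM + swap on the test host

print(f"implied parameters: ~{params_billion:.0f}B")
print(f"estimated peak:     ~{peak_gb:.0f} GB")
print(f"shortfall:          ~{peak_gb - available_gb:.0f} GB")
```

Lowering peak_multiplier models a more efficient load path; even at 1.0 the weights alone exceed the RAM and swap available on this host.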
This makes the standard CLI workflow unusable for frontier-scale MoE on
commodity hardware. We worked around it by writing a per-stage exporter
that loads layer-by-layer from safetensors via safe_open(), but the
user-facing path needs a streaming/sharded variant.
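For context, a minimal sketch of what layer-by-layer loading via safe_open() can look like. This is not the exporter we actually used. It assumes a sharded checkpoint with the usual model.safetensors.index.json file (a "weight_map" from tensor name to shard file) and assumes decoder tensors are named with a "model.layers.<i>." prefix; the helper names are invented for illustration.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Matches decoder-layer tensors such as "model.layers.12.self_attn.q_proj.weight".
# The "model.layers.<i>." prefix is an assumption about the checkpoint's naming.
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def read_weight_map(checkpoint_dir: Path) -> dict:
    """Load the "weight_map" section of model.safetensors.index.json."""
    index = json.loads((checkpoint_dir / "model.safetensors.index.json").read_text())
    return index["weight_map"]


def group_tensors_by_layer(weight_map: dict) -> dict:
    """Map layer index -> {tensor name: shard file}; non-layer tensors are skipped."""
    layers = defaultdict(dict)
    for name, shard in weight_map.items():
        match = _LAYER_RE.match(name)
        if match:
            layers[int(match.group(1))][name] = shard
    return dict(layers)


def load_layer(checkpoint_dir: Path, tensors: dict) -> dict:
    """Read only one layer's tensors; every other layer stays on disk."""
    from safetensors import safe_open  # lazy import: needs safetensors and torch

    by_shard = defaultdict(list)
    for name, shard in tensors.items():
        by_shard[shard].append(name)

    state = {}
    for shard, names in by_shard.items():
        with safe_open(str(checkpoint_dir / shard), framework="pt", device="cpu") as f:
            for name in names:
                state[name] = f.get_tensor(name)
    return state
```

A caller would iterate sorted(group_tensors_by_layer(read_weight_map(path)).items()) and pass each layer's tensor map to load_layer, so that only one layer's weights are resident at a time.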
Note on transformers 5.x: low_cpu_mem_usage is silently dropped as a
kwarg in transformers 5.0.0+ (verified in
PreTrainedModel.from_pretrained source — the kwarg is in the
"Not used anymore -- remove them from the kwargs" pop list). So passing
it from optimum-intel is harmless but a no-op on 5.x. transformers 5.x
appears to handle memory-efficient init by default; that helps the
load-time peak but does nothing for the trace-time peak (see Tier 2 below).
Steps to reproduce
optimum-cli export openvino \
  --model mistralai/Mixtral-8x22B-v0.1 \
  --weight-format int4 \
  --task text-generation-with-past \
  /path/to/output
On any machine with <400 GB RAM the process is OOM-killed during the
from_pretrained → convert_model step. The behaviour is also visible
without running anything: loading_kwargs is built in
optimum/exporters/openvino/__main__.py:298-406 (lines against upstream
main HEAD 190d59f) and never includes low_cpu_mem_usage=True or
device_map.
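To put a number on the peak when reproducing, the export can be launched from a small wrapper that reports the child's peak resident set. This is a sketch using only the standard library and assuming Linux, where ru_maxrss is reported in KiB; the function name is made up for illustration.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd to completion; return (exit code, peak RSS in GiB).

    RUSAGE_CHILDREN reports the peak of the largest waited-for child, not the
    sum over the process tree. ru_maxrss is in KiB on Linux (assumed here).
    """
    result = subprocess.run(cmd, check=False)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return result.returncode, peak_kib / (1024 ** 2)


# Example call (not executed here):
#   code, peak = run_and_report_peak_rss(
#       ["optimum-cli", "export", "openvino", "--model", "...", "/path/to/output"])
# A return code of -9 means the child was terminated by SIGKILL, which is the
# signal the kernel OOM killer sends.
```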
Expected behavior
The CLI either (a) exports successfully on commodity hardware (the
"streaming" path), or (b) at minimum reduces peak RAM during the
from_pretrained → tracing handoff enough to make Mixtral 8x7B / Llama 70B
exportable on a 133 GB RAM box.
Where this happens in the source
Lines below are against optimum-intel main HEAD 190d59f and
transformers==5.0.0.
optimum/exporters/openvino/__main__.py:490-504:
model = TasksManager.get_model_from_task(
    task_model_loading,
    model_name_or_path,
    subfolder=subfolder, revision=revision, cache_dir=cache_dir,
    token=token, local_files_only=local_files_only,
    force_download=force_download, trust_remote_code=trust_remote_code,
    framework=framework, device=device,
    library_name=library_name,
    **loading_kwargs,
)
loading_kwargs is built earlier (__main__.py:298-406) and contains at
most torch_dtype, variant, quantization_config, _attn_implementation,
config. There is no low_cpu_mem_usage and no device_map.
The export then traces the live nn.Module via TorchScript at
optimum/exporters/openvino/convert.py:442-448:
ts_decoder = TorchScriptPythonDecoder(model, example_input=dummy_inputs, **ts_decoder_kwargs)
ov_model = convert_model(
    ts_decoder, example_input=dummy_inputs,
    input=[(item.shape, item.type) for item in input_info],
    extension=conversion_extensions,
)
Tracing requires the full module to be live at once, so even when transformers
loads weights efficiently, peak RAM during conversion stays >2× model size.
_apply_model_size_based_quantization (__main__.py:755-817) reads the
already-fully-exported OV IR back from disk submodel-by-submodel, but a
"submodel" here is a top-level component (text encoder, language model,
vision encoder), not a transformer layer. The Mixtral 8x22B language model
is a single submodel.
Proposed Solution
Two tiers, with clearly different scopes:
Tier 1 — small change, partial mitigation (helps Mixtral 8x7B and
Llama 70B; does not solve 8x22B):
For library_name == "transformers", add device_map="cpu" (and
low_cpu_mem_usage=True for back-compat with users still on transformers
4.x where the kwarg is honoured) to loading_kwargs in main_export. The
low_cpu_mem_usage=True kwarg is silently dropped on 5.x, so passing it is
harmless there; transformers 5.x already defaults to memory-efficient
loading.
This is roughly 5–10 lines. It would help users on transformers 4.x today;
on 5.x it would either be a no-op (loading is already memory-efficient) or
surface clearer behaviour for device_map-based loading.
I am happy to open a PR for tier 1, gated on a free-RAM heuristic if the
maintainers prefer that conservative form.
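A minimal sketch of the gated form of tier 1, to make the proposal concrete. The helper names and the 0.5 threshold are hypothetical and not part of optimum-intel; the free-RAM reading assumes Linux and /proc/meminfo.

```python
def mem_available_bytes(meminfo_text: str) -> int:
    """Parse MemAvailable out of the contents of /proc/meminfo (value is in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found")


def tier1_loading_kwargs(library_name: str, checkpoint_bytes: int,
                         free_bytes: int, threshold: float = 0.5) -> dict:
    """Extra from_pretrained kwargs when the checkpoint is large relative to free RAM.

    threshold is a hypothetical tuning knob: below threshold * free_bytes the
    existing load path is left untouched.
    """
    if library_name != "transformers":
        return {}
    if checkpoint_bytes < threshold * free_bytes:
        return {}
    # low_cpu_mem_usage is honoured on transformers 4.x (which needs the
    # accelerate package for it) and silently dropped on 5.x, per this issue.
    return {"device_map": "cpu", "low_cpu_mem_usage": True}
```

In main_export the result would be merged into loading_kwargs before the call to TasksManager.get_model_from_task, with free_bytes taken from mem_available_bytes(open("/proc/meminfo").read()).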
Tier 2 — architectural change, the real fix (required for 8x22B):
Replace the monolithic TorchScriptPythonDecoder(model, ...) path with a
per-decoder-layer trace-and-serialise loop:
- Iterate model.model.layers (or the architecture-specific equivalent
via a small _MODEL_STRUCTURE mapping).
- For each layer: load weights from safetensors with safe_open() →
trace the layer → save partial IR → free the layer's parameters.
- After all layers are done, stitch the per-layer IRs back together with
a top-level OV model graph that wires
inputs → embed → layer_0 → ... → layer_N → head using OV Core
graph composition.
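The control flow of that loop can be sketched as follows. This shows only the load → trace → save → free skeleton: the three callables are hypothetical stand-ins for the parts that would need real design work, and the stitching step is not shown.

```python
import gc
from pathlib import Path
from typing import Any, Callable, Iterable, List


def export_per_layer(
    layer_ids: Iterable[int],
    load_layer: Callable[[int], Any],               # hypothetical: one layer from safetensors
    trace_layer: Callable[[Any], Any],              # hypothetical: layer -> partial OV model
    save_partial_ir: Callable[[Any, Path], None],   # hypothetical: writes one partial IR
    out_dir: Path,
) -> List[Path]:
    """Trace and serialise one decoder layer at a time; return the partial IR paths."""
    paths: List[Path] = []
    for idx in layer_ids:
        layer = load_layer(idx)
        partial = trace_layer(layer)
        path = out_dir / f"layer_{idx}.xml"
        save_partial_ir(partial, path)
        paths.append(path)
        # Drop this layer's objects before loading the next one, so that at
        # most one layer's weights are referenced at any time.
        del layer, partial
        gc.collect()
    return paths
```

With OpenVINO, trace_layer and save_partial_ir would presumably wrap openvino.convert_model and openvino.save_model. Whether the resulting per-layer IRs can be composed cleanly into a single model is the open question of this tier.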
This needs new public APIs in optimum.exporters.openvino.convert
(approximately 500–1000 LOC across convert_model, _save_model, and
the export config classes). Too deep for a drive-by outside contribution
without prior maintainer alignment, so this part of the issue is largely
a request for direction:
- Is this an architectural direction the maintainers want?
- Is there an existing API surface I should target rather than designing
a new one?
- Would a --per-layer-export CLI flag be acceptable as the entry point,
with the monolithic path remaining the default?
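To illustrate the last question, the opt-in shape I have in mind looks like the toy parser below. It is a standalone argparse sketch, not the real optimum-cli parser, and the --per-layer-export flag does not exist today.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy stand-in for the export command's parser (hypothetical flag included)."""
    parser = argparse.ArgumentParser(prog="optimum-cli export openvino")
    parser.add_argument("--model", required=True)
    parser.add_argument("output")
    parser.add_argument(
        "--per-layer-export",
        action="store_true",  # off by default: the monolithic path is unchanged
        help="Trace and serialise decoder layers one at a time (proposed).",
    )
    return parser
```

Because the flag is a store_true switch, existing invocations keep their current behaviour and only users who pass it reach the per-layer path.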
Related
I did not find a prior issue specific to "Mixtral 8x22B export blows past
133 GB RAM" via GitHub search. If one exists, happy to consolidate.