[Bug] Mixtral 8x22B (and other large MoE) cannot be exported on a sub-400 GB RAM machine — full model is loaded in one shot #1708

@t8

Environment

Verified on the latest tooling supported by optimum-intel main (which
declares transformers>=4.45,<5.1):

  • optimum-intel: 1.27.0.dev0+190d59f (main HEAD)
  • transformers: 5.0.0 (latest within the <5.1 constraint)
  • openvino: 2026.1.0
  • nncf: 3.1.0
  • torch: 2.11.0
  • Python: 3.12.3, Linux x86_64, 133 GB RAM + 8 GB swap

Description

optimum-cli export openvino --model mistralai/Mixtral-8x22B-v0.1 --weight-format int4 ...
is a non-starter on any machine with less than ~400 GB RAM. The export
pipeline calls from_pretrained and then traces the entire model through
TorchScript in a single pass, holding the full BF16 weight set plus the
traced graph plus the OV constants in RAM simultaneously.

Mixtral 8x22B is 282 GB BF16 on disk; peak resident set during export is
roughly double, depending on transformers' state-dict load behaviour. On a
133 GB RAM + 8 GB swap host the export is OOM-killed within minutes.
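
The arithmetic behind that claim can be sketched as follows. The peak factor of 2.0 is an assumption taken from "roughly double" above; the ~400 GB threshold quoted in this issue would correspond to a factor nearer 1.4, so treat it as a tunable. The helper names are hypothetical.

```python
def estimate_peak_gb(checkpoint_gb: float, peak_factor: float = 2.0) -> float:
    """Rough peak resident set, assuming peak ~= peak_factor x checkpoint size."""
    return checkpoint_gb * peak_factor


def fits(checkpoint_gb: float, ram_gb: float, swap_gb: float = 0.0,
         peak_factor: float = 2.0) -> bool:
    """True if the estimated peak fits in RAM plus swap."""
    return estimate_peak_gb(checkpoint_gb, peak_factor) <= ram_gb + swap_gb


# Figures from this issue: 282 GB BF16 checkpoint, 133 GB RAM + 8 GB swap.
print(estimate_peak_gb(282.0))                 # estimated peak in GB
print(fits(282.0, ram_gb=133, swap_gb=8))      # whether the export could fit
# Even with no overhead at all (factor 1.0) the weights alone exceed the host.
print(fits(282.0, ram_gb=133, swap_gb=8, peak_factor=1.0))
```

Under any factor of 1.0 or more the 133 GB host cannot hold the model, which is why a load-time optimisation alone cannot make 8x22B exportable there.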

This makes the standard CLI workflow unusable for frontier-scale MoE on
commodity hardware. We worked around it by writing a per-stage exporter
that loads layer-by-layer from safetensors via safe_open(), but the
user-facing path needs a streaming/sharded variant.
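
For illustration, a minimal sketch of the layer-by-layer loading idea (not the actual workaround code). It assumes the checkpoint ships a `model.safetensors.index.json` with a `weight_map` and that decoder weights are named `model.layers.<N>.…`; both function names are hypothetical. The `safetensors` import is deferred so the planning step runs with the standard library alone.

```python
import json
import re
from collections import defaultdict

# Assumed key convention: "model.layers.<N>.<rest>"
LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def shards_per_layer(index_json: str) -> dict:
    """Map each decoder layer index to the set of shard files holding its tensors."""
    weight_map = json.loads(index_json)["weight_map"]
    plan = defaultdict(set)
    for tensor_name, shard_file in weight_map.items():
        match = LAYER_RE.match(tensor_name)
        if match:  # embeddings, final norm and lm_head are skipped here
            plan[int(match.group(1))].add(shard_file)
    return dict(plan)


def load_layer(shard_paths, layer_idx: int) -> dict:
    """Read only one layer's tensors; shard_paths must be full file paths."""
    from safetensors import safe_open  # third-party, imported lazily

    prefix = f"model.layers.{layer_idx}."
    tensors = {}
    for path in shard_paths:
        with safe_open(path, framework="pt", device="cpu") as f:
            for key in f.keys():
                if key.startswith(prefix):
                    tensors[key] = f.get_tensor(key)
    return tensors
```

Because `safe_open()` reads tensors on demand, only the requested layer's weights are materialised; the caller is responsible for dropping the returned dict before loading the next layer.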

Note on transformers 5.x: low_cpu_mem_usage is silently dropped as a
kwarg in transformers 5.0.0+ (verified in
PreTrainedModel.from_pretrained source — the kwarg is in the
"Not used anymore -- remove them from the kwargs" pop list). So passing
it from optimum-intel is harmless but a no-op on 5.x. transformers 5.x
appears to handle memory-efficient init by default; that helps the
load-time peak but does nothing for the trace-time peak (see Tier 2 below).

Steps to reproduce

optimum-cli export openvino \
    --model mistralai/Mixtral-8x22B-v0.1 \
    --weight-format int4 \
    --task text-generation-with-past \
    /path/to/output

On any machine with <400 GB RAM the process is OOM-killed during the
from_pretrained → convert_model step. The behaviour is also visible
without running anything: loading_kwargs is built in
optimum/exporters/openvino/__main__.py:298-406 (lines against upstream
main HEAD 190d59f) and never includes low_cpu_mem_usage=True or
device_map.

Expected behavior

The CLI either (a) exports successfully on commodity hardware (the
"streaming" path), or (b) at minimum reduces peak RAM during the
from_pretrained → tracing handoff enough to make Mixtral 8x7B / Llama 70B
exportable on a 133 GB RAM box.

Where this happens in the source

Lines below are against optimum-intel main HEAD 190d59f and
transformers==5.0.0.

optimum/exporters/openvino/__main__.py:490-504:

model = TasksManager.get_model_from_task(
    task_model_loading,
    model_name_or_path,
    subfolder=subfolder, revision=revision, cache_dir=cache_dir,
    token=token, local_files_only=local_files_only,
    force_download=force_download, trust_remote_code=trust_remote_code,
    framework=framework, device=device,
    library_name=library_name,
    **loading_kwargs,
)

loading_kwargs is built earlier (__main__.py:298-406) and contains at
most torch_dtype, variant, quantization_config, _attn_implementation,
config. There is no low_cpu_mem_usage and no device_map.

The export then traces the live nn.Module via TorchScript at
optimum/exporters/openvino/convert.py:442-448:

ts_decoder = TorchScriptPythonDecoder(model, example_input=dummy_inputs, **ts_decoder_kwargs)
ov_model = convert_model(
    ts_decoder, example_input=dummy_inputs,
    input=[(item.shape, item.type) for item in input_info],
    extension=conversion_extensions,
)

Tracing requires the full module to be live at once, so even when transformers
loads weights efficiently, peak RAM during conversion stays >2× model size.

_apply_model_size_based_quantization (__main__.py:755-817) reads the
already-fully-exported OV IR back from disk submodel-by-submodel, but a
"submodel" here is a top-level component (text encoder, language model,
vision encoder), not a transformer layer. The Mixtral 8x22B language model
is a single submodel.

Proposed Solution

Two tiers, with clearly different scopes:

Tier 1 — small change, partial mitigation (helps Mixtral 8x7B and
Llama 70B; does not solve 8x22B):

For library_name == "transformers", add device_map="cpu" (and
low_cpu_mem_usage=True for back-compat with users still on transformers
4.x where the kwarg is honoured) to loading_kwargs in main_export. The
low_cpu_mem_usage=True kwarg is silently dropped on 5.x, so passing it is
harmless there; transformers 5.x already defaults to memory-efficient
loading.
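
A minimal sketch of what that change might look like, written as a standalone helper so it can be read in isolation. The function name and where it would be called inside main_export are assumptions; only the two kwargs come from the proposal above. `setdefault` is used so that values already supplied by the caller win.

```python
def tier1_loading_kwargs(library_name: str, loading_kwargs: dict = None) -> dict:
    """Return a copy of loading_kwargs with the Tier 1 memory hints added.

    Hypothetical helper: only applies to library_name == "transformers".
    """
    kwargs = dict(loading_kwargs or {})
    if library_name != "transformers":
        return kwargs
    # Load straight onto CPU rather than materialising a second copy.
    kwargs.setdefault("device_map", "cpu")
    # Honoured on transformers 4.x; per this issue, silently dropped on 5.x.
    kwargs.setdefault("low_cpu_mem_usage", True)
    return kwargs


print(tier1_loading_kwargs("transformers", {"torch_dtype": "bfloat16"}))
```

One caveat to check before merging something like this: device_map-based loading may require the accelerate package to be installed, depending on the transformers version.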

This is roughly 5–10 lines. It would help users on transformers 4.x today;
on 5.x it would either be a no-op (loading is already efficient) or make
the behaviour of device_map-based loading more explicit.

I am happy to open a PR for tier 1, gated on a free-RAM heuristic if the
maintainers prefer that conservative form.
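
One possible shape for such a heuristic, as a sketch only. It assumes a Linux host where /proc/meminfo exposes MemAvailable in kB, and reuses the "roughly double" peak factor from the description; both function names are hypothetical.

```python
from typing import Optional


def parse_mem_available_bytes(meminfo_text: str) -> Optional[int]:
    """Extract MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) * 1024
    return None


def needs_low_memory_path(checkpoint_bytes: int,
                          available_bytes: Optional[int],
                          peak_factor: float = 2.0) -> bool:
    """True when the estimated export peak would exceed available RAM."""
    if available_bytes is None:
        return True  # unknown platform: take the conservative path
    return checkpoint_bytes * peak_factor > available_bytes


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the raw meminfo text (Linux only)."""
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()
```

A caller would feed `read_meminfo()` into `parse_mem_available_bytes` and apply the Tier 1 kwargs only when `needs_low_memory_path` returns True.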

Tier 2 — architectural change, the real fix (required for 8x22B):

Replace the monolithic TorchScriptPythonDecoder(model, ...) path with a
per-decoder-layer trace-and-serialise loop:

  1. Iterate model.model.layers (or the architecture-specific equivalent
    via a small _MODEL_STRUCTURE mapping).
  2. For each layer: load weights from safetensors with safe_open() →
    trace the layer → save partial IR → free the layer's parameters.
  3. After all layers are done, stitch the per-layer IRs back together with
    a top-level OV model graph that wires
    inputs → embed → layer_0 → ... → layer_N → head using OV Core
    graph composition.
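
The control flow of steps 1 and 2 can be sketched independently of any OpenVINO or torch API. In the sketch below the three heavy operations are injected as callables, and all names are hypothetical; the stitching in step 3 is deliberately left out because it depends on API choices the maintainers would need to make.

```python
import gc
from typing import Any, Callable, Iterable, List


def export_per_layer(
    layer_ids: Iterable[int],
    load_layer: Callable[[int], Any],          # e.g. reads one layer via safe_open()
    trace_layer: Callable[[Any], Any],         # e.g. traces a single decoder layer
    save_partial: Callable[[int, Any], str],   # writes partial IR, returns its path
) -> List[str]:
    """Load, trace and serialise one layer at a time; return the saved paths."""
    saved_paths = []
    for idx in layer_ids:
        module = load_layer(idx)
        partial_ir = trace_layer(module)
        saved_paths.append(save_partial(idx, partial_ir))
        # Drop references before the next layer is loaded so that at most
        # one layer's weights and trace are live at any point.
        del module, partial_ir
        gc.collect()
    return saved_paths
```

With this shape, peak memory is bounded by the largest single layer plus its trace, provided the injected callables do not retain references themselves.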

This needs new public APIs in optimum.exporters.openvino.convert
(approximately 500–1000 LOC across convert_model, _save_model, and
the export config classes). Too deep for a drive-by outside contribution
without prior maintainer alignment, so this part of the issue is largely
a request for direction:

  • Is this an architectural direction the maintainers want?
  • Is there an existing API surface I should target rather than designing
    a new one?
  • Would a --per-layer-export CLI flag be acceptable as the entry point,
    with the monolithic path remaining the default?

Related

I did not find a prior issue specific to "Mixtral 8x22B export blows past
133 GB RAM" via GitHub search. If one exists, happy to consolidate.
