[Bug] Mixtral 8x22B (and other large MoE) cannot be exported on a sub-400 GB RAM machine — full model is loaded in one shot

### Environment

Verified on the latest tooling supported by optimum-intel `main` (which
declares `transformers>=4.45,<5.1`):

- `optimum-intel`: 1.27.0.dev0+190d59f (main HEAD)
- `transformers`: 5.0.0 (latest within the `<5.1` constraint)
- `openvino`: 2026.1.0
- `nncf`: 3.1.0
- `torch`: 2.11.0
- Python: 3.12.3, Linux x86_64, 133 GB RAM + 8 GB swap

### Description

`optimum-cli export openvino --model mistralai/Mixtral-8x22B-v0.1 --weight-format int4 ...`
is a non-starter on any machine with less than ~400 GB RAM. The export
pipeline calls `from_pretrained` and then traces the entire model through
TorchScript in a single pass, holding the full BF16 weight set plus the
traced graph plus the OV constants in RAM simultaneously.

Mixtral 8x22B is 282 GB BF16 on disk; peak resident set during export is
roughly double, depending on transformers' state-dict load behaviour. On a
133 GB RAM + 8 GB swap host the export is OOM-killed within minutes.

This makes the standard CLI workflow unusable for frontier-scale MoE on
commodity hardware. We worked around it by writing a per-stage exporter
that loads layer-by-layer from safetensors via `safe_open()`, but the
user-facing path needs a streaming/sharded variant.
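
A minimal sketch of how a layer-by-layer load can be organised, for illustration only (this is not the exact workaround code). It assumes a sharded checkpoint with the usual `model.safetensors.index.json` and tensor names prefixed `model.layers.<i>.`, as in Mixtral/Llama-style checkpoints. `read_weight_map`, `group_by_layer` and `load_layer` are hypothetical helper names.

```python
import json
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional

# Assumption: decoder-layer tensors are named "model.layers.<i>.<...>".
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def read_weight_map(checkpoint_dir: str) -> Dict[str, str]:
    """Return the tensor-name -> shard-file mapping from the checkpoint index."""
    index_path = Path(checkpoint_dir) / "model.safetensors.index.json"
    return json.loads(index_path.read_text())["weight_map"]


def group_by_layer(weight_map: Dict[str, str]) -> Dict[Optional[int], Dict[str, List[str]]]:
    """Group tensor names as layer index -> {shard file -> [tensor names]}.

    Tensors outside any decoder layer (embeddings, final norm, head) go under None.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for name, shard in weight_map.items():
        match = _LAYER_RE.match(name)
        key = int(match.group(1)) if match else None
        groups[key][shard].append(name)
    return {key: dict(shards) for key, shards in groups.items()}


def load_layer(checkpoint_dir: str, layer_idx: int, groups) -> dict:
    """Load only the tensors of one decoder layer; other shards stay on disk."""
    from safetensors import safe_open  # imported lazily so grouping works without it

    tensors = {}
    for shard, names in groups[layer_idx].items():
        with safe_open(str(Path(checkpoint_dir) / shard), framework="pt") as f:
            for name in names:
                tensors[name] = f.get_tensor(name)
    return tensors


# Tiny illustrative index (file names are made up).
example_map = {
    "model.embed_tokens.weight": "shard-1.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "shard-1.safetensors",
    "model.layers.0.block_sparse_moe.experts.0.w1.weight": "shard-2.safetensors",
    "lm_head.weight": "shard-3.safetensors",
}
example_groups = group_by_layer(example_map)
print(sorted(k for k in example_groups if k is not None))
```

With this grouping, at most one layer's tensors need to be resident at a time; `load_layer(checkpoint_dir, i, groups)` touches only the shards that hold layer `i`.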

Note on transformers 5.x: `low_cpu_mem_usage` is **silently dropped** as a
kwarg in transformers 5.0.0+ (verified in
`PreTrainedModel.from_pretrained` source — the kwarg is in the
"Not used anymore -- remove them from the kwargs" pop list). So passing
it from optimum-intel is harmless but a no-op on 5.x. transformers 5.x
appears to handle memory-efficient init by default; that helps the
load-time peak but does nothing for the trace-time peak (see Tier 2 below).

### Steps to reproduce

```bash
optimum-cli export openvino \
    --model mistralai/Mixtral-8x22B-v0.1 \
    --weight-format int4 \
    --task text-generation-with-past \
    /path/to/output
```

On any machine with <400 GB RAM the process is OOM-killed during the
`from_pretrained` → `convert_model` step. The behaviour is also visible
without running anything: `loading_kwargs` is built in
`optimum/exporters/openvino/__main__.py:298-406` (lines against upstream
`main` HEAD `190d59f`) and never includes `low_cpu_mem_usage=True` or
`device_map`.
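
For anyone who wants to put a number on the peak before the OOM killer fires, here is a small sketch that runs a command and reports the peak resident set of its child process. It relies only on the standard library; `run_and_report_peak_rss` is a hypothetical helper name, and the KiB unit of `ru_maxrss` is Linux-specific.

```python
import resource
import subprocess
import sys
from typing import List, Tuple


def run_and_report_peak_rss(cmd: List[str]) -> Tuple[int, float]:
    """Run cmd to completion and return (exit code, peak child RSS in GiB).

    RUSAGE_CHILDREN covers children that have terminated and been waited for,
    so the value is the largest peak among all children run so far.
    On Linux ru_maxrss is reported in KiB.
    """
    proc = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib / (1024 ** 2)


# Cheap stand-in command; swap in the export command to measure the real thing:
# ["optimum-cli", "export", "openvino", "--model", "mistralai/Mixtral-8x22B-v0.1",
#  "--weight-format", "int4", "--task", "text-generation-with-past", "/path/to/output"]
code, peak_gib = run_and_report_peak_rss(
    [sys.executable, "-c", "x = bytearray(50 * 1024 * 1024)"]
)
print(f"exit code {code}, peak RSS {peak_gib:.2f} GiB")
```

A negative exit code means the child was terminated by a signal; an OOM kill shows up as `-9` (SIGKILL).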

### Expected behavior

The CLI either (a) exports successfully on commodity hardware (the
"streaming" path), or (b) at minimum reduces peak RAM during the
`from_pretrained` → tracing handoff enough to make Mixtral 8x7B / Llama 70B
exportable on a 133 GB RAM box.

### Where this happens in the source

Lines below are against optimum-intel `main` HEAD `190d59f` and
`transformers==5.0.0`.

`optimum/exporters/openvino/__main__.py:490-504`:

```python
model = TasksManager.get_model_from_task(
    task_model_loading,
    model_name_or_path,
    subfolder=subfolder, revision=revision, cache_dir=cache_dir,
    token=token, local_files_only=local_files_only,
    force_download=force_download, trust_remote_code=trust_remote_code,
    framework=framework, device=device,
    library_name=library_name,
    **loading_kwargs,
)
```

`loading_kwargs` is built earlier (`__main__.py:298-406`) and contains at
most `torch_dtype`, `variant`, `quantization_config`, `_attn_implementation`,
`config`. There is no `low_cpu_mem_usage` and no `device_map`.

The export then traces the live `nn.Module` via TorchScript at
`optimum/exporters/openvino/convert.py:442-448`:

```python
ts_decoder = TorchScriptPythonDecoder(model, example_input=dummy_inputs, **ts_decoder_kwargs)
ov_model = convert_model(
    ts_decoder, example_input=dummy_inputs,
    input=[(item.shape, item.type) for item in input_info],
    extension=conversion_extensions,
)
```

Tracing requires the full module to be live at once, so even when transformers
loads weights efficiently, peak RAM during conversion stays >2× model size.

`_apply_model_size_based_quantization` (`__main__.py:755-817`) reads the
already-fully-exported OV IR back from disk submodel-by-submodel, but a
"submodel" here is a top-level component (text encoder, language model,
vision encoder), not a transformer layer. The Mixtral 8x22B language model
is a single submodel.

### Proposed Solution

Two tiers, with clearly different scopes:

**Tier 1 — small change, partial mitigation (helps Mixtral 8x7B and
Llama 70B; does not solve 8x22B):**

For `library_name == "transformers"`, add `device_map="cpu"` (and
`low_cpu_mem_usage=True` for back-compat with users still on transformers
4.x where the kwarg is honoured) to `loading_kwargs` in `main_export`. The
`low_cpu_mem_usage=True` kwarg is silently dropped on 5.x, so passing it is
harmless there; transformers 5.x already defaults to memory-efficient
loading.

This is roughly 5–10 lines. It would help users on transformers 4.x today;
on 5.x it would either be a no-op (loading is already memory-efficient) or
make the behaviour of `device_map`-based loading explicit.

I am happy to open a PR for tier 1, gated on a free-RAM heuristic if the
maintainers prefer that conservative form.
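
A rough sketch of what the gated form could look like, to make the proposal concrete. The function names, the `headroom` factor of 2.0 and the use of `MemAvailable` from `/proc/meminfo` are all assumptions for illustration, not existing optimum-intel code.

```python
from typing import Optional


def read_mem_available_bytes(meminfo_path: str = "/proc/meminfo") -> Optional[int]:
    """Return MemAvailable in bytes, or None if it cannot be read (e.g. non-Linux)."""
    try:
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # field is reported in kB
    except OSError:
        return None
    return None


def tier1_loading_kwargs(
    library_name: str,
    checkpoint_bytes: int,
    available_bytes: Optional[int],
    headroom: float = 2.0,  # hypothetical threshold, would need tuning
) -> dict:
    """Extra from_pretrained kwargs to merge into loading_kwargs.

    Returns an empty dict (current behaviour) when the library is not
    transformers, when free RAM is unknown, or when RAM is plentiful.
    """
    if library_name != "transformers" or available_bytes is None:
        return {}
    if available_bytes >= headroom * checkpoint_bytes:
        return {}
    # low_cpu_mem_usage is honoured on transformers 4.x and dropped on 5.x.
    return {"device_map": "cpu", "low_cpu_mem_usage": True}


# Illustrative numbers only: a 100 GB checkpoint on a host with 133 GB available.
print(tier1_loading_kwargs("transformers", 100 * 10**9, 133 * 10**9))
```

In `main_export` the result would be merged into `loading_kwargs` before the `TasksManager.get_model_from_task` call, leaving any value the user passed explicitly untouched.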

**Tier 2 — architectural change, the real fix (required for 8x22B):**

Replace the monolithic `TorchScriptPythonDecoder(model, ...)` path with a
per-decoder-layer trace-and-serialise loop:

1. Iterate `model.model.layers` (or the architecture-specific equivalent
   via a small `_MODEL_STRUCTURE` mapping).
2. For each layer: load weights from safetensors with `safe_open()` →
   trace the layer → save partial IR → free the layer's parameters.
3. After all layers are done, stitch the per-layer IRs back together with
   a top-level OV model graph that wires
   `inputs → embed → layer_0 → ... → layer_N → head` using OV `Core`
   graph composition.
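
Steps 1 and 2 can be sketched as the control-flow skeleton below. The load, trace and save operations are injected as callables because their real implementations are exactly what needs designing; `export_per_layer`, `resolve_layers` and the contents of `_MODEL_STRUCTURE` are hypothetical. Step 3 (stitching) is not covered here.

```python
import gc
from functools import reduce
from types import SimpleNamespace
from typing import Any, Callable, List

# Hypothetical mapping: model_type -> dotted attribute path of the decoder layers.
_MODEL_STRUCTURE = {"mixtral": "model.layers", "llama": "model.layers"}


def resolve_layers(model: Any, model_type: str) -> Any:
    """Follow the dotted path for this architecture, e.g. model.model.layers."""
    return reduce(getattr, _MODEL_STRUCTURE[model_type].split("."), model)


def export_per_layer(
    num_layers: int,
    load_layer: Callable[[int], Any],
    trace_layer: Callable[[int, Any], Any],
    save_partial_ir: Callable[[int, Any], str],
) -> List[str]:
    """Load, trace, save and free one layer at a time; return the partial IR paths."""
    paths = []
    for idx in range(num_layers):
        layer = load_layer(idx)             # e.g. weights read via safe_open()
        ir = trace_layer(idx, layer)        # trace just this layer
        paths.append(save_partial_ir(idx, ir))
        del layer, ir                       # drop references before the next layer
        gc.collect()
    return paths


# Demo with stand-in objects and callables that only record the call order.
skeleton = SimpleNamespace(model=SimpleNamespace(layers=["layer0", "layer1"]))
layers = resolve_layers(skeleton, "mixtral")

events = []
paths = export_per_layer(
    num_layers=len(layers),
    load_layer=lambda i: events.append(("load", i)) or {"idx": i},
    trace_layer=lambda i, layer: events.append(("trace", i)) or f"ir-{i}",
    save_partial_ir=lambda i, ir: events.append(("save", i)) or f"layer_{i}.xml",
)
print(paths)
```

The point of the shape is that peak memory is bounded by one layer's weights plus its traced graph, rather than by the whole model.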

This needs new public APIs in `optimum.exporters.openvino.convert`
(approximately 500–1000 LOC across `convert_model`, `_save_model`, and
the export config classes). That is too deep for a drive-by outside
contribution without prior maintainer alignment, so this part of the issue
is mainly a request for direction:

- Is this an architectural direction the maintainers want?
- Is there an existing API surface I should target rather than designing
  a new one?
- Would a `--per-layer-export` CLI flag be acceptable as the entry point,
  with the monolithic path remaining the default?

### Related

- huggingface/transformers#28476 — peak RAM memory usage when loading model
- huggingface/optimum-intel#217 — runtime OOM on `from_pretrained` (closed; concerns runtime, not export)

I did not find a prior issue specific to "Mixtral 8x22B export blows past
133 GB RAM" via GitHub search. If one exists, happy to consolidate.

