Granite MoE uses a 3D tensor to hold the expert weights, so GPTQModel does not work out of the box.
There are two options:
- Module swap: rewrite GraniteMoeParallelExperts to hold a ModuleList of Linears; AutoGPTQ will then be able to detect them and replace them with QuantLinears.
- Write a custom GPTQ module that handles the GraniteMoeParallelExperts case directly.
Either approach solves both the quantization and inference paths. Option 1 should be easier to implement than Option 2, but in some sense Option 2 is the more proper solution.
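Option 1 could look roughly like the following sketch. `ParallelExperts3D` is a stand-in for GraniteMoeParallelExperts here; the attribute names and the `(num_experts, out_features, in_features)` weight layout are assumptions for illustration, not the exact transformers implementation.

```python
import torch
import torch.nn as nn

class ParallelExperts3D(nn.Module):
    """Stand-in for GraniteMoeParallelExperts: one 3D parameter of shape
    (num_experts, out_features, in_features). Layout is an assumption."""
    def __init__(self, num_experts, in_features, out_features):
        super().__init__()
        self.num_experts = num_experts
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(
            torch.randn(num_experts, out_features, in_features)
        )

class SequentialExperts(nn.Module):
    """Holds one nn.Linear per expert, so each 2D weight is visible to
    the quantizer's module scan and can be swapped for a QuantLinear."""
    def __init__(self, num_experts, in_features, out_features):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(in_features, out_features, bias=False)
            for _ in range(num_experts)
        )

def swap_parallel_experts(module: ParallelExperts3D) -> SequentialExperts:
    """Slice the 3D expert weight into per-expert 2D Linear weights."""
    new = SequentialExperts(
        module.num_experts, module.in_features, module.out_features
    )
    with torch.no_grad():
        for i, linear in enumerate(new.experts):
            linear.weight.copy_(module.weight[i])
    return new
```

A real swap would also have to reroute the MoE forward pass through the ModuleList (the fused kernel expects the 3D layout), which is the main cost of this option.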
When implementing Option 2 we should reuse code from the original GPTQ implementation.
- It should also be written generically, so it handles not just this particular GraniteMoeParallelExperts case but all modules with 3D weight tensors.
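The generic 3D handling reduces to running the existing 2D quantization routine slice-by-slice over the leading (expert) dimension. A minimal sketch, assuming any callable 2D quantizer (e.g. a GPTQ pass over one expert's weight):

```python
import torch

def quantize_3d_weight(weight: torch.Tensor, quantize_2d) -> torch.Tensor:
    """Apply a 2D quantization routine to each slice along the leading
    dimension of a 3D weight, then restack. `quantize_2d` is any function
    mapping a 2D tensor to its quantized counterpart; here it stands in
    for the per-layer GPTQ pass."""
    assert weight.dim() == 3, "expected (num_experts, out, in)"
    return torch.stack(
        [quantize_2d(weight[i]) for i in range(weight.shape[0])]
    )
```

With this shape-agnostic loop the same code path covers GraniteMoeParallelExperts and any other module that packs expert weights into a single 3D tensor.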