Granite MoE uses a 3D tensor to hold the expert weights, so GPTQModel does not work out of the box.
There are two options:
- Module swap: rewrite GraniteMoeParallelExperts to hold a ModuleList of Linears; AutoGPTQ will then be able to detect them and replace them with QuantLinears.
- Write a custom GPTQ module that handles the GraniteMoeParallelExperts case directly.
Either approach solves both the quantization and inference paths. Option 1 should be easier to implement than Option 2, but in some sense Option 2 is the more proper solution.
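Option 1 could look roughly like the following sketch. `ParallelExperts3D` is a stand-in for GraniteMoeParallelExperts here; the attribute names and the `(num_experts, out_features, in_features)` weight layout are assumptions for illustration, not the exact transformers implementation.

```python
import torch
import torch.nn as nn

class ParallelExperts3D(nn.Module):
    """Stand-in for GraniteMoeParallelExperts: one 3D parameter of shape
    (num_experts, out_features, in_features). Layout is an assumption."""
    def __init__(self, num_experts, in_features, out_features):
        super().__init__()
        self.num_experts = num_experts
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(
            torch.randn(num_experts, out_features, in_features)
        )

class SequentialExperts(nn.Module):
    """Holds one nn.Linear per expert, so each 2D weight is visible to
    the quantizer's module scan and can be swapped for a QuantLinear."""
    def __init__(self, num_experts, in_features, out_features):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(in_features, out_features, bias=False)
            for _ in range(num_experts)
        )

def swap_parallel_experts(module: ParallelExperts3D) -> SequentialExperts:
    """Slice the 3D expert weight into per-expert 2D Linear weights."""
    new = SequentialExperts(
        module.num_experts, module.in_features, module.out_features
    )
    with torch.no_grad():
        for i, linear in enumerate(new.experts):
            linear.weight.copy_(module.weight[i])
    return new
```

A real swap would also have to reroute the MoE forward pass through the ModuleList (the fused kernel expects the 3D layout), which is the main cost of this option.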
When implementing Option 2 we should reuse code from the original GPTQ implementation.
- It should also be written generically, so it handles not just this particular GraniteMoeParallelExperts case but all modules with 3D weight tensors.
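The generic 3D handling reduces to running the existing 2D quantization routine slice-by-slice over the leading (expert) dimension. A minimal sketch, assuming any callable 2D quantizer (e.g. a GPTQ pass over one expert's weight):

```python
import torch

def quantize_3d_weight(weight: torch.Tensor, quantize_2d) -> torch.Tensor:
    """Apply a 2D quantization routine to each slice along the leading
    dimension of a 3D weight, then restack. `quantize_2d` is any function
    mapping a 2D tensor to its quantized counterpart; here it stands in
    for the per-layer GPTQ pass."""
    assert weight.dim() == 3, "expected (num_experts, out, in)"
    return torch.stack(
        [quantize_2d(weight[i]) for i in range(weight.shape[0])]
    )
```

With this shape-agnostic loop the same code path covers GraniteMoeParallelExperts and any other module that packs expert weights into a single 3D tensor.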