The model extraction appears to produce multiple distinct software_normalized values that all refer to the same software. For example, in the 5% dataset, I found several case variants of MATLAB.
These all clearly refer to the same software. A quick fix could be lower-casing the software_normalized values during extraction, which would collapse many of these variants into a single name and make downstream grouping/filtering more reliable.
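A minimal sketch of the lower-casing fix, assuming each extracted record is a dict with a `software_normalized` string field. The helper name `normalize_software_name` and the sample records are hypothetical and only for illustration; the values are not copied from the 5% dataset. Stripping surrounding whitespace is an extra assumption beyond plain lower-casing.

```python
from collections import Counter


def normalize_software_name(name: str) -> str:
    """Collapse case (and surrounding-whitespace) variants into one key."""
    # strip() is an assumption on top of the proposed lower-casing.
    return name.strip().lower()


# Illustrative records only; the real extraction output may look different.
records = [
    {"software_normalized": "MATLAB"},
    {"software_normalized": "Matlab"},
    {"software_normalized": "matlab"},
    {"software_normalized": "Python"},
]

# Count mentions per software after normalization.
counts = Counter(
    normalize_software_name(r["software_normalized"]) for r in records
)
```

With this in place, the three MATLAB variants group under the single key `matlab`. Lower-casing only handles case differences; variants that differ in spelling or punctuation would still need a separate alias mapping.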