ScmRun.groupby fails on Python 3.12 / numpy 2.x  #318

@benmsanderson

Summary

Working towards Python 3.12 support in openscm-runner runs into an issue in scmdata.

ScmRun.groupby raises TypeError: Cannot interpret '<StringDtype(...)>' as a data type on Python 3.12 with pandas 3.x and numpy 2.x. This blocks any caller that goes through ScmRun.convert_unit (which internally calls groupby), so it breaks downstream code that worked on Python 3.11 / numpy 1.x.

Repro

import scmdata
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0]],
    index=pd.MultiIndex.from_tuples(
        [("FaIR", "ssp245", "m", "World", "Emissions|CO2", "GtC/yr", 0)],
        names=[
            "climate_model", "scenario", "model", "region",
            "variable", "unit", "run_id",
        ],
    ),
    columns=[2020, 2021],
)
run = scmdata.ScmRun(df)
run.convert_unit("PgC/yr", variable="Emissions|CO2")

Traceback:

File ".../scmdata/groupby.py", line 61, in __init__
    if any([np.issubdtype(m[c].dtype, np.number) for c in m]):
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../numpy/_core/numerictypes.py", line 534, in issubdtype
    arg1 = dtype(arg1).type
           ^^^^^^^^^^^
TypeError: Cannot interpret '<StringDtype(storage='python', na_value=nan)>' as a data type

Environment

  • Python 3.12.12
  • scmdata 0.18.0
  • numpy 2.4.6
  • pandas 3.0.3
  • macOS (also reproduces on Linux per CI of a downstream project)

Root cause

scmdata/groupby.py:61 calls np.issubdtype(m[c].dtype, np.number) for each meta column. Under pandas 3.x, string-valued meta columns default to StringDtype rather than object, and numpy 2.x rejects StringDtype as an argument to np.issubdtype because it cannot be coerced via dtype(). On the Python 3.11 / numpy 1.x stack the same columns were object dtype, for which the call silently returns False, so the bug only surfaces on the newer stack.
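
The failing check can be isolated from scmdata. The sketch below is only an illustration, not scmdata code: the `meta` frame and the `check` helper are hypothetical stand-ins for the meta table and the per-column test, and the string column requests StringDtype explicitly so the outcome does not depend on the installed pandas default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the meta table inspected in groupby.py.
meta = pd.DataFrame(
    {
        "scenario": pd.array(["ssp245"], dtype="string"),  # pandas StringDtype
        "region": pd.Series(["World"], dtype=object),      # plain object dtype
        "run_id": [0],                                      # integer dtype
    }
)


def check(dtype):
    """Run the same test as groupby.py and report the outcome per dtype."""
    try:
        return bool(np.issubdtype(dtype, np.number))
    except TypeError as exc:
        # numpy cannot turn a pandas extension dtype into a numpy dtype.
        return f"TypeError: {exc}"


results = {column: check(meta[column].dtype) for column in meta}
for column, outcome in results.items():
    print(f"{column}: {outcome}")
```

Only the StringDtype column should hit the TypeError branch; the object column is reported as non-numeric and the integer column as numeric.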

Suggested fix

Guard the issubdtype call against non-numpy-coercible dtypes. Two options:

# (a) Use the dtype.kind shortcut, which is defined for numpy and pandas dtypes alike.
#     "b" (bool) is left out because np.issubdtype(bool, np.number) is False:
if any(getattr(m[c].dtype, "kind", "O") in "iufc" for c in m):

# (b) Or wrap in try/except and treat unknown dtypes as non-numeric (semantically
#     correct: a StringDtype is not numeric):
def _is_numeric(dtype):
    try:
        return np.issubdtype(dtype, np.number)
    except TypeError:
        return False
if any(_is_numeric(m[c].dtype) for c in m):

(a) is cheaper and more idiomatic. Happy to PR whichever you'd prefer.
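
To make the choice easier, here is a small sketch comparing how the two options classify a few dtypes. The helper names `is_numeric_by_kind` and `is_numeric_by_issubdtype` are made up for this comparison. The kind string used here is "iufc" (no "b"), on the assumption that bool columns should stay non-numeric, as they are under np.issubdtype(..., np.number).

```python
import numpy as np
import pandas as pd


def is_numeric_by_kind(dtype):
    """Option (a): classify by dtype.kind, defaulting to object ('O')."""
    return getattr(dtype, "kind", "O") in "iufc"


def is_numeric_by_issubdtype(dtype):
    """Option (b): keep np.issubdtype, treat non-coercible dtypes as non-numeric."""
    try:
        return bool(np.issubdtype(dtype, np.number))
    except TypeError:
        return False


samples = {
    "int64": np.dtype("int64"),
    "float64": np.dtype("float64"),
    "bool": np.dtype("bool"),
    "object": np.dtype("object"),
    "StringDtype": pd.StringDtype(),
    "Int64Dtype (nullable)": pd.Int64Dtype(),
}

for name, dtype in samples.items():
    print(f"{name}: kind={is_numeric_by_kind(dtype)}, "
          f"issubdtype={is_numeric_by_issubdtype(dtype)}")
```

The two options agree on numpy dtypes and on StringDtype. They differ on pandas nullable numeric dtypes such as Int64Dtype: (a) treats them as numeric because their kind is "i", while (b) treats them as non-numeric because np.issubdtype cannot coerce them. Which behaviour is wanted for such columns is a design decision for the maintainers.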

Downstream impact

This blocks Python 3.12 support in openscm/openscm-runner (any path that round-trips through convert_unit). Filed from work at github.com/benmsanderson/openscm-runner (AR7 modernisation fork).
