Overview
When to_dataframe() is called on an xarray Dataset that has a multi-index along a given dimension, the index coordinates are translated both:
- into levels of a pandas MultiIndex for the dataframe
- into individual columns of the dataframe.
Is this expected and intended behavior?
Main reprex
import numpy as np
import pandas as pd
import xarray as xr
data_dict = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
data_dict_w_dims = {k: ("my_dim", v) for k, v in data_dict.items()}
# create a dataset multi-indexed along "my_dim" by "x" and "y"
xr_dat = xr.Dataset(data_dict_w_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])
print(xr_dat)
# <xarray.Dataset> Size: 140B
# Dimensions:  (my_dim: 5)
# Coordinates:
#   * my_dim   (my_dim) object 40B MultiIndex
#   * x        (my_dim) int64 40B 1 2 1 2 1
#   * y        (my_dim) <U1 20B 'a' 'a' 'b' 'b' 'b'
# Data variables:
#     z        (my_dim) int64 40B 5 10 15 20 25
print(xr_dat.to_dataframe()) # x and y present both as columns and as multi-index
#       z  x  y
# x y
# 1 a   5  1  a
# 2 a  10  2  a
# 1 b  15  1  b
# 2 b  20  2  b
# 1 b  25  1  b
Cause
I believe the key line is here in the _to_dataframe() internal method:
def _to_dataframe(self, ordered_dims: Mapping[Any, int]):
    from xarray.core.extension_array import PandasExtensionArray

    columns_in_order = [k for k in self.variables if k not in self.dims]
The constituent IndexVariables of the multi-index are present in self.variables (and not in self.dims), so they become columns:
"x" in xr_dat.dims
# False
"x" in xr_dat.variables
# True
xr_dat.variables["x"]
# <xarray.IndexVariable 'my_dim' (my_dim: 5)> Size: 40B
# [5 values with dtype=int64]
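To see the selection in isolation, the expression from the flagged line can be evaluated directly against the example dataset. This is a minimal sketch that rebuilds the dataset from the main reprex; the names `ds` and `candidate_columns` are introduced here for illustration only, and the order of the resulting list may differ between xarray versions.

```python
import xarray as xr

data = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
ds = (
    xr.Dataset({k: ("my_dim", v) for k, v in data.items()})
    .set_coords(["x", "y"])
    .set_xindex(["x", "y"])
)

# Same expression as in _to_dataframe(), applied to the public Dataset object.
candidate_columns = [k for k in ds.variables if k not in ds.dims]

# "my_dim" is a dimension name and is filtered out; the level coordinates
# "x" and "y" are variables but not dimension names, so they are kept.
print(candidate_columns)
```

The filter only removes names that are dimensions, so it has no way to tell a multi-index level apart from an ordinary non-dimension coordinate.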
This has consequences for pandas -> xarray -> pandas conversion
Because of this, converting a MultiIndex-ed pandas dataframe to an xarray Dataset via the xr.Dataset() constructor and then converting back to pandas via .to_dataframe() will not give back the original dataframe.
Reprex
# create a multi-indexed pandas dataframe
pd_df = pd.DataFrame(
data_dict
).set_index(["x", "y"])
print(pd_df) # multi-indexed-df with one column
#       z
# x y
# 1 a   5
# 2 a  10
# 1 b  15
# 2 b  20
# 1 b  25
# Conversion to xarray is as expected:
xr_from_pd = xr.Dataset(pd_df)
print(xr_from_pd)
# <xarray.Dataset> Size: 160B
# Dimensions:  (dim_0: 5)
# Coordinates:
#   * dim_0    (dim_0) object 40B MultiIndex
#   * x        (dim_0) int64 40B 1 2 1 2 1
#   * y        (dim_0) object 40B 'a' 'a' 'b' 'b' 'b'
# Data variables:
#     z        (dim_0) int64 40B 5 10 15 20 25
# Converting back to pandas df via `to_dataframe()` yields a df multi-indexed by
# x and y that also contains `x` and `y` as columns:
print(xr_from_pd.to_dataframe()) # x and y as multi-index and as columns
#      x  y   z
# x y
# 1 a  1  a   5
# 2 a  2  a  10
# 1 b  1  b  15
# 2 b  2  b  20
# 1 b  1  b  25
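As a possible user-side workaround while the intended behavior is unclear, the columns that duplicate index levels can be dropped after the conversion. This is only a sketch: `original`, `roundtrip`, and `duplicated` are illustrative names, and it assumes that no data variable legitimately shares a name with an index level.

```python
import pandas as pd
import xarray as xr

original = pd.DataFrame(
    dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
).set_index(["x", "y"])

# pandas -> xarray -> pandas
roundtrip = xr.Dataset(original).to_dataframe()

# Drop every column whose name is also a level of the resulting index.
# If the level coordinates are not emitted as columns, this list is empty
# and the drop is a no-op.
duplicated = [c for c in roundtrip.columns if c in roundtrip.index.names]
roundtrip = roundtrip.drop(columns=duplicated)

print(roundtrip)
```

Filtering on the names actually present in the result, rather than hard-coding "x" and "y", keeps the snippet harmless if to_dataframe() stops emitting the duplicated columns.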
Thoughts
- If this behavior is not intended, the flagged line in _to_dataframe() should be changed to determine column names in a way that ignores IndexVariables that form part of a multi-index.
- It might be important not to simply restrict the columns to data variables, because one might want coordinates to become columns when they are not going to be part of the pandas MultiIndex, e.g.
# similar dataset with x and y as coordinates but not as a multi-index
dat_no_multiindex = xr.Dataset(
data_dict_w_dims
).set_coords(["x", "y"])
# potentially intended behavior?
print(dat_no_multiindex.to_dataframe())
#         x  y   z
# my_dim
# 0       1  a   5
# 1       2  a  10
# 2       1  b  15
# 3       2  b  20
# 4       1  b  25
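One way the two points above could be reconciled is sketched below. It is not a proposed patch to xarray: `dataframe_column_names` is a hypothetical helper written against the public Dataset API, and it assumes that multi-index levels can be recognised by checking `Dataset.indexes` for pandas MultiIndex objects.

```python
import pandas as pd
import xarray as xr


def dataframe_column_names(ds: xr.Dataset) -> list:
    """Hypothetical helper: non-dimension variables, minus multi-index levels."""
    level_names = set()
    for index in ds.indexes.values():
        if isinstance(index, pd.MultiIndex):
            level_names.update(name for name in index.names if name is not None)
    return [k for k in ds.variables if k not in ds.dims and k not in level_names]


data = dict(x=[1, 2, 1, 2, 1], y=["a", "a", "b", "b", "b"], z=[5, 10, 15, 20, 25])
data_w_dims = {k: ("my_dim", v) for k, v in data.items()}

with_multiindex = xr.Dataset(data_w_dims).set_coords(["x", "y"]).set_xindex(["x", "y"])
without_multiindex = xr.Dataset(data_w_dims).set_coords(["x", "y"])

print(dataframe_column_names(with_multiindex))     # level coordinates excluded
print(dataframe_column_names(without_multiindex))  # plain coordinates kept
```

Excluding only the names that are levels of a MultiIndex leaves ordinary coordinates available as columns, which matches the "potentially intended" output for the dataset without a multi-index.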
Source location of the flagged line in _to_dataframe(): xarray/xarray/core/dataset.py, lines 7092 to 7095 in 699d895.