Commit 21e230e

[df] Allow reading a char column into a numpy array
In the AsNumpy operation, values of the dataset are read into a ROOT::RVec collection of the corresponding column type. Subsequently, the raw data is accessed from the RVec and used to generate the array interface for a numpy array view on the collected data.

When the column is of type char, and thus RDF would read values into a ROOT::RVec<char>, the raw data is accessed as a 'char *'. The Python bindings automatically convert 'char *' and 'const char *' to Python strings for full compatibility with existing functions (e.g. otherwise TObject::GetName would not return a string in Python). Thus, the array interface cannot be generated.

This commit proposes to introduce a special behaviour in AsNumpy to automatically view the char column as an 'unsigned char' column. This in turn avoids the automatic conversion on the Python side. An array of 'unsigned char' is interpreted as a numpy array with dtype uint8. Since this decision might be unexpected to some users, the commit also proposes to let the user know about it via a warning.
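As an illustration of the end result described above, the following sketch uses plain numpy (no ROOT required) to show how a buffer of single-byte values is viewed as dtype uint8, and how the original characters can be recovered afterwards. The buffer contents and variable names are assumptions chosen to mirror the test in this commit; they are not part of the change itself.

```python
import numpy as np

# Hypothetical stand-in for the raw bytes behind an RVec<unsigned char>:
# ten values cycling through the ASCII codes of 'A', 'B', 'C', ...
raw = bytes(65 + (i % 26) for i in range(10))

# View the bytes as unsigned 8-bit integers, the dtype AsNumpy is said to produce.
codes = np.frombuffer(raw, dtype=np.uint8)

# Recover the characters on the Python side when text is what is wanted.
text = codes.tobytes().decode("ascii")
```

A user who actually wants characters rather than integers could apply the same `tobytes().decode(...)` step to the array returned for the column, assuming the column holds ASCII data.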
1 parent 8519c46 commit 21e230e

3 files changed

Lines changed: 42 additions & 3 deletions

bindings/pyroot/pythonizations/python/ROOT/_pythonization/_rdataframe.py

Lines changed: 8 additions & 0 deletions
@@ -234,6 +234,8 @@ def pypowarray(numpyvec, pow):
 from . import pythonization
 from ._pyz_utils import MethodTemplateGetter, MethodTemplateWrapper
 
+import warnings
+
 
 def RDataFrameAsNumpy(
     df: ROOT.RDataFrame,  # noqa: F821
@@ -296,6 +298,12 @@ def RDataFrameAsNumpy(
     result_ptrs = {}
     for column in columns:
         column_type = df.GetColumnType(column)
+        if column_type == "char":
+            column_type = "unsigned char"
+            warnings.warn(
+                f"RDataFrame.AsNumpy: column '{column}' has type 'char', which would be automatically converted to a "
+                "Python string. Interpreting as 'unsigned char' instead, which results in a numpy array of dtype uint8."
+            )
 
         # If the column type is a class, make sure cling knows about it
         tclass = ROOT.TClass.GetClass(column_type)
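The hunk above emits a plain `warnings.warn(...)`, which defaults to the `UserWarning` category. A minimal sketch of how calling code might capture (or, with a different filter, silence) such a warning is shown below. The helper `remap_column_type` is hypothetical: it only mimics the remapping in the diff so that the example runs without ROOT.

```python
import warnings


def remap_column_type(column, column_type):
    """Hypothetical stand-in for the remapping done inside AsNumpy."""
    if column_type == "char":
        warnings.warn(
            f"RDataFrame.AsNumpy: column '{column}' has type 'char', "
            "interpreting as 'unsigned char' instead."
        )
        return "unsigned char"
    return column_type


# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    remapped = remap_column_type("mycol", "char")
    untouched = remap_column_type("other", "int")
```

Replacing `warnings.simplefilter("always")` with `warnings.simplefilter("ignore", UserWarning)` inside the same context manager would suppress the message for that block only.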

bindings/pyroot/pythonizations/test/rdataframe_asnumpy.py

Lines changed: 21 additions & 0 deletions
@@ -9,6 +9,7 @@
 from ROOT._pythonization._rdataframe import _clone_asnumpyresult
 import os
 
+
 def make_tree(*dtypes):
     """
     Make a tree with branches of different data-types
@@ -400,6 +401,26 @@ def test_rdataframe_as_numpy_array_jagged(self):
         self.assertTrue(all(isinstance(x, np.ndarray) for x in array))
         self.assertTrue(all(len(x) == i for i, x in enumerate(array)))
 
+    def test_rdataframe_as_numpy_char_col_as_uint8(self):
+        ROOT.gInterpreter.Declare(
+            r"""
+            #ifndef ROOT_TEST_RDataFrameAsNumpy_GH_22554
+            #define ROOT_TEST_RDataFrameAsNumpy_GH_22554
+            char make_char(ULong64_t i) {
+                return static_cast<char>(65 + (i % 26)); // A, B, C, ...
+            }
+            #endif
+            """
+        )
+
+        rdf = ROOT.RDataFrame(10).Define("mycol", "make_char(rdfentry_)")
+        with self.assertWarns(
+            UserWarning, msg="column 'mycol' has type 'char', which would be automatically converted"
+        ):
+            npy = rdf.AsNumpy(["mycol"])["mycol"]
+        self.assertEqual(npy.dtype, np.uint8)
+        self.assertTrue(all(npy == np.array([65 + (i % 26) for i in range(10)], dtype=np.uint8)))
+
 
 if __name__ == "__main__":
     unittest.main()
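One detail worth noting about the test above: in `unittest`, the `msg` keyword of `assertWarns` is the message reported when the assertion fails, not a pattern matched against the warning text. If the intent is to check the wording of the warning, `assertWarnsRegex` is the standard-library tool for that. The sketch below shows the idea in isolation; `emit_char_warning`, `CharWarningTest`, and `run_suite` are hypothetical names used only to keep the example runnable without ROOT.

```python
import unittest
import warnings


def emit_char_warning(column):
    """Hypothetical stand-in that raises a warning shaped like the one in AsNumpy."""
    warnings.warn(
        f"RDataFrame.AsNumpy: column '{column}' has type 'char', which would be "
        "automatically converted to a Python string."
    )


class CharWarningTest(unittest.TestCase):
    def test_warning_text_is_matched(self):
        # The regex is searched in the warning message, so a substring suffices.
        with self.assertWarnsRegex(UserWarning, r"column 'mycol' has type 'char'"):
            emit_char_warning("mycol")


def run_suite():
    """Run the test case programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CharWarningTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Whether the committed test should be changed along these lines is a judgment call for the maintainers; the sketch only illustrates the difference between the two assertions.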

tree/dataframe/src/RDFUtils.cxx

Lines changed: 13 additions & 3 deletions
@@ -513,9 +513,19 @@ unsigned int GetColumnWidth(const std::vector<std::string>& names, const unsigne
 void CheckReaderTypeMatches(const std::type_info &colType, const std::type_info &requestedType,
                             const std::string &colName)
 {
-   // We want to explicitly support the reading of bools as unsigned char, as
-   // this is quite common to circumvent the std::vector<bool> specialization.
-   const bool explicitlySupported = (colType == typeid(bool) && requestedType == typeid(unsigned char)) ? true : false;
+   // We explicitly support certain type conversions
+   const bool explicitlySupported = [&colType, &requestedType]() {
+      // bool as unsigned char is common to circumvent the std::vector<bool> specialization.
+      if (colType == typeid(bool) && requestedType == typeid(unsigned char))
+         return true;
+      // char as unsigned char allows reading a vector of char as a Python numpy array of integers, avoiding the
+      // automatic conversion of 'char *' to string in Python. For more info, see
+      // https://github.com/root-project/root/issues/22554
+      if (colType == typeid(char) && requestedType == typeid(unsigned char))
+         return true;
+
+      return false;
+   }();
 
    // Here we compare names and not typeinfos since they may come from two different contexts: a compiled
    // and a jitted one.
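Reading a `char` column as `unsigned char` reinterprets each byte rather than converting its value. Whether plain `char` is signed is implementation-defined in C++, so on platforms where it is signed, any negative value would presumably show up in the range 128 to 255 in the resulting uint8 array. The numpy sketch below illustrates that reinterpretation, using `int8` as an assumed stand-in for a signed `char`; it does not call into ROOT.

```python
import numpy as np

# Assumed stand-in for a signed 'char' buffer: one negative value, one ASCII 'A'.
signed_chars = np.array([-1, 65], dtype=np.int8)

# Reinterpret the same bytes as unsigned, without copying or converting values.
as_unsigned = signed_chars.view(np.uint8)

print(as_unsigned.tolist())  # [255, 65]
```

For ASCII text (codes 0 to 127) the two interpretations agree, which is the case exercised by the test in this commit.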
