Data corruption when replacing values in DataFrame #296

@hypsakata

I've encountered a problem where replacing values in a DataFrame corrupts the original column data.

irb> df = RedAmber::DataFrame.new(val: [352, 256, 4, 0]);
irb> df.assign(val: df[:val].replace(df[:val] == 0, 1))
=>
#<RedAmber::DataFrame : 4 x 1 Vector, 0x000000000003bf60>
      val
  <uint8>
0      96
1       0
2       4
3       1

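This happens because the result column takes the data type of the replacement value rather than that of the original column. As the output shows, the result here is uint8, so 352 and 256 no longer fit and wrap around to 96 and 0. While this behavior is consistent, it is not intuitive; I think the result should keep the original column's data type. How about changing this behavior?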
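
The corrupted values are consistent with wraparound to 8 bits. Here is a minimal plain-Ruby sketch of that arithmetic. It does not use RedAmber, and `to_uint8` is a hypothetical helper that only mimics keeping the low 8 bits of each value:

```ruby
# Hypothetical helper (not a RedAmber API): keep only the low 8 bits of
# each value, which is what an unchecked narrowing to uint8 would do.
def to_uint8(values)
  values.map { |v| v & 0xFF }
end

original = [352, 256, 4, 0]

# Replace zeros with 1, then narrow the whole column to uint8.
replaced = original.map { |v| v.zero? ? 1 : v }
narrowed = to_uint8(replaced)

p narrowed # => [96, 0, 4, 1]
```

This reproduces the values in the output above: 352 - 256 = 96 and 256 - 256 = 0, while 4 and 1 already fit in 8 bits.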

However, changing the behavior that way introduces the opposite problem: if the column type is uint8 and the replacement value is a double or a large integer, the replacement value itself will be corrupted. It would be useful to have a method for easier data type casting, or to allow specifying the data type as a keyword argument in the replace method, e.g., .replace(…, data_type: :double).
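
One way to think about avoiding corruption in both directions is to pick a result type wide enough for the existing values and the replacement together. The following is only a plain-Ruby sketch of that idea under simplifying assumptions (non-negative integers only, no doubles); `UINT_LIMITS` and `narrowest_uint_type` are hypothetical names, not part of RedAmber:

```ruby
# Hypothetical lookup table: largest value each unsigned type can hold.
UINT_LIMITS = {
  uint8: 2**8 - 1,
  uint16: 2**16 - 1,
  uint32: 2**32 - 1,
  uint64: 2**64 - 1
}.freeze

# Hypothetical helper: return the narrowest unsigned integer type that can
# hold every value without wrapping. Assumes all values are non-negative.
def narrowest_uint_type(values)
  max = values.max
  type, = UINT_LIMITS.find { |_, limit| max <= limit }
  type or raise ArgumentError, "value #{max} does not fit in uint64"
end

column = [352, 256, 4, 0]
replacement = 1
p narrowest_uint_type(column + [replacement]) # => :uint16

# The opposite case: a column that fits in uint8, replaced with 300.
p narrowest_uint_type([4, 0] + [300]) # => :uint16
```

An explicit keyword such as the proposed data_type: would let the caller override whatever type is chosen automatically.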

I'd appreciate any comments or ideas on this matter.
Thanks.
