Where appropriate, consider exchanging .withColumn() for .select() or .selectExpr() #34

@chrimaho

Description

As noted here:

Optimizing PySpark Performance: Using select() Over withColumn()

Calling .withColumn() repeatedly can become computationally expensive, particularly when the dimensionality of the data is large (by column or by row).

One suggestion is to use the .select() or .selectExpr() syntax instead.
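As a rough illustration of that suggestion, the sketch below contrasts a loop of .withColumn() calls with a single .selectExpr() call that adds the same columns. The helper names (`build_select_exprs`, `add_columns_chained`, `add_columns_single_select`, `demo`) and the `price`/`qty` columns are hypothetical and not part of this package; the pyspark imports are deferred so the pure-Python helper can be used without a Spark session.

```python
from typing import Dict, List


def build_select_exprs(new_columns: Dict[str, str]) -> List[str]:
    """Return selectExpr arguments: every existing column ("*") plus the new ones."""
    return ["*"] + [f"{expr} AS {name}" for name, expr in new_columns.items()]


def add_columns_chained(df, new_columns: Dict[str, str]):
    """Baseline: one .withColumn() call (one extra projection) per new column."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    for name, expr in new_columns.items():
        df = df.withColumn(name, F.expr(expr))
    return df


def add_columns_single_select(df, new_columns: Dict[str, str]):
    """Alternative: a single .selectExpr() call that adds all columns at once."""
    return df.selectExpr(*build_select_exprs(new_columns))


def demo() -> None:
    """Hypothetical usage; assumes pyspark is installed and a local session can start."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(2.0, 3)], ["price", "qty"])
    new_columns = {"total": "price * qty", "price_with_tax": "price * 1.1"}
    add_columns_chained(df, new_columns).show()
    add_columns_single_select(df, new_columns).show()
    spark.stop()
```

Two differences are worth keeping in mind. In the chained version a later expression can refer to a column added by an earlier call, which the single-select version cannot do. Also, .withColumn() replaces an existing column of the same name, whereas `"*"` plus an alias of the same name would produce a duplicate column.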

While this may be computationally quicker, it's not always the best choice. Sometimes the .withColumns() method is better (for multiple columns). Sometimes you just want to adjust one column in place, which the .select*() methods do not do directly (they only extract explicitly named columns, or add new columns as required).
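To make the trade-off concrete, here is a minimal sketch of both options for adjusting a column in place. The first uses .withColumns() (available from PySpark 3.3), which replaces same-named columns in one call. The second is a workaround for .selectExpr() that relists every column and swaps in an expression for the ones being replaced. All helper names are hypothetical, and the backtick quoting assumes column names that do not themselves contain backticks.

```python
from typing import Dict, List


def replace_in_place_exprs(columns: List[str], replacements: Dict[str, str]) -> List[str]:
    """Build selectExpr arguments that keep column order and swap in replacements."""
    return [
        f"{replacements[c]} AS `{c}`" if c in replacements else f"`{c}`"
        for c in columns
    ]


def replace_with_select(df, replacements: Dict[str, str]):
    """Workaround: relist every column so the replaced ones stay in position."""
    return df.selectExpr(*replace_in_place_exprs(df.columns, replacements))


def replace_with_columns(df, replacements: Dict[str, str]):
    """Option using .withColumns(): replaces same-named columns in one call."""
    from pyspark.sql import functions as F  # imported lazily; requires pyspark >= 3.3

    return df.withColumns({name: F.expr(expr) for name, expr in replacements.items()})
```

For example, `replace_with_select(df, {"name": "upper(name)"})` on a DataFrame with columns `id`, `name`, `qty` would select `id`, the upper-cased `name`, and `qty`, in that order. Whether this is clearer than a single .withColumn() call is a judgment call.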

Nonetheless, it's a good suggestion, and something that we should consider.

Metadata

Labels: enhancement, help wanted
