As noted here:
Optimizing PySpark Performance: Using select() Over withColumn()
Using .withColumn() can become computationally expensive, particularly when the data becomes large (by column or by row). Each call introduces another projection into the query plan, so chaining many calls can produce a very large plan.
One suggestion is to use the .select() or .selectExpr() syntax instead.
While this may be computationally quicker, it's not always the best choice. Sometimes the .withColumns() method is a better fit (for multiple columns). Sometimes you just want to adjust one column in place, which the .select*() methods cannot do directly (they only extract explicit columns, or add new columns as required).
Nonetheless, it's a good suggestion, and something that we should consider.