Where appropriate, consider exchanging `.withColumn()` for `.select()` or `.selectExpr()`

As noted here:

> [Optimizing PySpark Performance: Using `select()` Over `withColumn()`][medium-article]

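Using [`.withColumn()`][withColumn] repeatedly can become computationally expensive, particularly when the data is wide or many columns are added one call at a time. Each call introduces a projection internally, so calling it many times (for example, in a loop) can generate large query plans.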

One suggestion is to use the [`.select()`][select] or [`.selectExpr()`][selectExpr] syntax instead.

While this may be computationally quicker, it's not always the best choice. Sometimes the [`.withColumns()`][withColumns] method is the better fit (for multiple columns). Sometimes you just want to adjust one column _in place_, which the `.select*()` methods cannot do directly (they only extract explicit columns, or add new columns as required).

Nonetheless, it's a good suggestion, and something that we should consider.

[medium-article]: https://medium.com/@mukovhe.justice/optimizing-pyspark-performance-using-select-over-withcolumn-1e1c71c041bb
[withColumn]: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html
[withColumns]: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html
[select]: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.select.html
[selectExpr]: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.selectExpr.html
