[Python] PySpark-API: struct(List[Column]) not working #17189

@keen85

What happens?

According to the PySpark docs for pyspark.sql.functions.struct, struct() should accept a list of Columns as input.

In DuckDB's Spark API this does not work. The call fails with:
InvalidInputException: Invalid Input Error: Expected argument of type Expression, received '<class 'list'>' instead
DuckDB currently accepts only columns passed as separate positional arguments (that is, an unpacked list).

To Reproduce

input (from docs)

from duckdb.experimental.spark.sql import SparkSession as session
from duckdb.experimental.spark.sql import functions as F

spark = session.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age"))
df.select(F.struct([df.age, df.name]).alias("struct")).collect()

output

---------------------------------------------------------------------------
InvalidInputException                     Traceback (most recent call last)
Cell In[30], line 1
----> 1 df.select(F.struct([df.age, df.name]).alias("struct")).collect()

File ...\.venv\lib\site-packages\duckdb\experimental\spark\sql\functions.py:108, in struct(*cols)
    106 def struct(*cols: Column) -> Column:
    107     return Column(
--> 108         FunctionExpression("struct_pack", *[_inner_expr_or_val(x) for x in cols])
    109     )

InvalidInputException: Invalid Input Error: Expected argument of type Expression, received '<class 'list'>' instead

workaround: unpack the list with the * operator

from duckdb.experimental.spark.sql import SparkSession as session
from duckdb.experimental.spark.sql import functions as F

spark = session.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age"))
df.select(F.struct(*[df.age, df.name]).alias("struct")).collect()

output

[Row(struct={'age': 2, 'name': 'Alice'}),
 Row(struct={'age': 5, 'name': 'Bob'})]
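
Until list arguments are supported, the unpacking can be wrapped in a small helper so that call sites keep the PySpark-style signature. This is only a minimal sketch, not DuckDB's actual fix: `unpack_single_list` and `struct_compat` are hypothetical names that exist in neither DuckDB nor PySpark, and the sketch assumes that a single list or tuple argument should be treated as the full argument list.

```python
from typing import Any, Callable, Tuple


def unpack_single_list(cols: Tuple[Any, ...]) -> Tuple[Any, ...]:
    """Hypothetical helper: if exactly one list or tuple was passed,
    return its elements as the arguments; otherwise return cols as-is."""
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        return tuple(cols[0])
    return tuple(cols)


def struct_compat(struct_fn: Callable[..., Any], *cols: Any) -> Any:
    """Hypothetical wrapper: call struct_fn with the normalized arguments,
    so both struct_compat(f, [a, b]) and struct_compat(f, a, b) work."""
    return struct_fn(*unpack_single_list(cols))
```

With the session from the reproduction above, `struct_compat(F.struct, [df.age, df.name])` would forward to `F.struct(df.age, df.name)`, which is the call form that already works.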

OS:

win_amd64

DuckDB Version:

1.2.2 (duckdb-1.2.2-cp310-cp310-win_amd64.whl)

DuckDB Client:

Python

Hardware:

No response

Full Name:

Martin Bode

Affiliation:

N/A

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
