[SPARK-55934][PYTHON][TEST][FOLLOWUP] Fix MAP_ARROW_ITER bench UDF return type

viirya · viirya · commit 70469a28fe99 · 2026-05-27T23:51:13.000-07:00
### What changes were proposed in this pull request? Fix the default `ret_type` resolution in `_MapArrowIterBenchMixin._write_scenario` in `python/benchmarks/bench_eval_type.py`. The three benchmark UDFs (`identity_udf`, `sort_udf`, `filter_udf`) all return whole `pa.RecordBatch`es with the input row schema, so their declared return type should be the inner row `StructType`, not the first nested field's data type. Before: ```python ret_type = schema.fields[0].dataType.fields[0].dataType # first column's type ``` After: ```python ret_type = schema.fields[0].dataType # the row's StructType ``` ### Why are the changes needed? `mapInArrow` UDFs are contractually `(Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]`, with the user-supplied `schema` describing the output rows as a whole. The benchmark UDFs return full batches but were declaring just one column's type, which is semantically inconsistent with the API. This did not surface as an error because `worker.py` discards `return_type` for `SQL_MAP_ARROW_ITER_UDF` in `read_single_udf` (returns `None`) and only checks `Iterator[pa.RecordBatch]` structurally via `verify_return_type`. Schema-level validation is currently absent, so the mismatched type was tolerated. If the worker ever adds schema validation for this eval type, the previous declaration would break the benchmark. ### Does this PR introduce _any_ user-facing change? No. Test-only change in the benchmark module. ### How was this patch tested? - Confirmed the new default resolves to a 5-field `StructType` matching the input row schema for `sm_batch_few_col`. - Ran `MapArrowIterUDFTimeBench.setup` + `time_worker` for `(sm_batch_few_col, pure_ints) x (identity_udf, sort_udf, filter_udf)`. - Ran `MapArrowIterUDFPeakmemBench.setup` + `peakmem_worker` for `sm_batch_few_col/identity_udf`. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: Claude Code (claude-opus-4-7) Closes apache#56168 from viirya/SPARK-55934-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
diff --git a/python/benchmarks/bench_eval_type.py b/python/benchmarks/bench_eval_type.py
@@ -1457,7 +1457,10 @@ def _write_scenario(self, scenario, udf_name, buf):
         batches, schema = self._build_scenario(scenario)
         udf_func, ret_type, arg_offsets = self._udfs[udf_name]
         if ret_type is None:
-            ret_type = schema.fields[0].dataType.fields[0].dataType
+            # mapInArrow UDFs return an Iterator[pa.RecordBatch] with the same
+            # schema as the input row (the inner struct, since make_batches
+            # wraps the row schema in a single struct column for the wire).
+            ret_type = schema.fields[0].dataType
         MockProtocolWriter.write_worker_input(
             PythonEvalType.SQL_MAP_ARROW_ITER_UDF,
             lambda b: MockProtocolWriter.write_udf_payload(udf_func, ret_type, arg_offsets, b),