Commit fc5abd6

[SPARK-55754][PYTHON][TEST][FOLLOWUP] Fix pure_ints type mismatch in bench
### What changes were proposed in this pull request?

Refactor `MockDataFactory.NAMED_TYPE_POOLS` in `python/benchmarks/bench_eval_type.py` so the `pure_ints`, `pure_floats`, and `pure_strings` entries reuse the corresponding `TYPE_REGISTRY` entries instead of duplicating their factory lambdas.

### Why are the changes needed?

`NAMED_TYPE_POOLS["pure_ints"]` declared the column as `IntegerType()` (32-bit) but generated data with `np.int64`. Because every benchmark that uses this pool runs through serializers with `arrow_cast=True`, the mismatch was silently corrected by a 64-to-32 narrowing cast inside the pandas/arrow conversion path -- meaning the `pure_ints` scenario in seven mixins (`ArrowBatchedUDF`, `ArrowUDTF`, `ArrowTableUDF`, `MapArrowIterUDF`, `MapPandasIterUDF`, `ScalarArrowUDF`, `ScalarPandasUDF`) was measuring an extra narrowing step rather than a pure int32 baseline.

`pure_floats` and `pure_strings` had no such mismatch but duplicated the same lambdas as `TYPE_REGISTRY["double"]` / `TYPE_REGISTRY["string"]`, risking drift in future edits. Reusing the registry entries eliminates the duplication.

`pure_ts` is left as-is because no matching `TYPE_REGISTRY` entry exists.

### Does this PR introduce _any_ user-facing change?

No. Test-only change in the benchmark module.

### How was this patch tested?

- Confirmed `NAMED_TYPE_POOLS["pure_ints"][0]` now produces a `pa.int32()` array matching its `IntegerType()` declaration (was `pa.int64()`).
- Confirmed `pure_floats` and `pure_strings` still produce `pa.float64()` and `pa.string()` arrays after the refactor.
- Ran `setup` + `time_worker` for the `pure_ints` scenario across all seven affected `*TimeBench` classes; all passed.

### Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: Claude Code (claude-opus-4-7)

Closes apache#56169 from viirya/SPARK-55724-pure-ints-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
1 parent 70469a2 commit fc5abd6

1 file changed

Lines changed: 3 additions & 5 deletions

python/benchmarks/bench_eval_type.py

@@ -261,11 +261,9 @@ class MockDataFactory:
 
     NAMED_TYPE_POOLS: dict[str, list[tuple[Callable, Any]]] = {
         "mixed": MIXED_TYPES,
-        "pure_ints": [
-            (lambda r: pa.array(np.random.randint(0, 1000, r, dtype=np.int64)), IntegerType())
-        ],
-        "pure_floats": [(lambda r: pa.array(np.random.rand(r)), DoubleType())],
-        "pure_strings": [(lambda r: pa.array([f"s{j}" for j in range(r)]), StringType())],
+        "pure_ints": [TYPE_REGISTRY["int"]],
+        "pure_floats": [TYPE_REGISTRY["double"]],
+        "pure_strings": [TYPE_REGISTRY["string"]],
         "pure_ts": [
             (
                 lambda r: pa.array(
