Improve hash partition for columns contains Struct/list/map#63152
Improve hash partition for columns contains Struct/list/map#63152laysfire wants to merge 4 commits into
Conversation
Signed-off-by: yifan.xie <xyfabcd@163.com>
There was a problem hiding this comment.
Code Review
This pull request optimizes the row-by-row hashing logic in _hash_partition by replacing manual column indexing with a more efficient enumerate(zip(*table.columns)) approach. This change improves code readability and potentially performance when processing unhashable PyArrow columns. I have no feedback to provide as there are no review comments.
|
@owenowenisme hi, could you help review? |
owenowenisme
left a comment
There was a problem hiding this comment.
Great improvement! Could you add comments explaining this improvement in the code?
Signed-off-by: yifan.xie <xyfabcd@163.com>
## Description Optimize `_convert_arrow_table_to_examples` in `tfrecords_datasink.py` by hoisting per-column lookups and schema validation out of the row loop, and iterating columns in lockstep with `zip(*columns)`. The old code called `arrow_table[name][i]` per row × column, paying Python-level `__getitem__` dispatch on every element. The new code uses `ChunkedArray.__iter__` (a C-level loop) for row-by-row scalars, and moves the `name in schema_dict` validation to a one-shot up-front check. Same optimization pattern as #63152. ## Benchmark End-to-end TFRecord write (convert → SerializeToString → CRC32C → write-to-buffer), 5 runs each, best time reported. macOS, Apple Silicon, pyarrow 16. | Case | Before | After | Speedup | |---|---|---|---| | Mixed scalars (200k × 3) | 4.55s | 3.71s | **1.22x** | | Wide table (50k × 20) | 5.55s | 4.34s | **1.28x** | | List-of-int (200k × 2) | 3.39s | 3.00s | **1.13x** | | Long strings (50k × 2) | 0.82s | 0.73s | **1.12x** | | With tf_schema (200k × 2) | 4.79s | 4.16s | **1.15x** | **12–28% faster end-to-end**, with wider tables seeing the largest wins (hoisting `arrow_table[name]` per column matters more when there are more columns). ## Correctness Output is byte-identical to the previous implementation. Verified by comparing `SerializeToString()` bytes across scalar columns, list-of-int columns, multi-chunk inputs, empty tables, null-containing columns, and `tf_schema` validation. The existing tests in `python/ray/data/tests/datasource/test_tfrecords.py` exercise these cases and should continue to pass unchanged. --------- Signed-off-by: You-Cheng Lin <mses010108@gmail.com> Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Description
This pr is to improve hash partition performance when table contains pandas can't handle types by moving retrieve table columns operations out of loop.
Use the following script to verify:
The test result is:
Related issues