Commit 73ac00a
[fix] handle parquet schema mismatch in dataset concatenation (#146)
* [fix] handle parquet schema mismatch in dataset concatenation
When loading multiple parquet files via YAML config, concatenate_datasets()
fails if columns have different inferred Arrow types. Add a safe wrapper
that falls back to row-wise concatenation on schema mismatch.
* refactor: use cast() instead of to_list() for schema alignment
Replace the to_list() + from_list() fallback with cast()-based schema
alignment. This avoids materializing the entire dataset into memory,
making it safe for large-scale data.
Strategy:
1. Try cast() to the first dataset's features (fast, zero-copy)
2. If that fails, build a merged feature set preferring non-null types
and cast all datasets to the merged schema
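
The two-step strategy above can be sketched in plain Python. This is an illustration only, not the code from this commit: schemas are modeled as dicts mapping column names to type-name strings, "null" stands in for the Arrow null type inferred for all-empty columns, and the names `can_cast`, `merge_schemas`, and `align_schemas` are hypothetical. In the real change, the resulting schema would presumably be applied to each dataset with cast() before calling concatenate_datasets().

```python
from typing import Dict, List

NULL = "null"  # stand-in for the type inferred when a column holds only None

Schema = Dict[str, str]  # column name -> type name (simplified model)


def can_cast(src: Schema, target: Schema) -> bool:
    """True if every column already matches the target or is all-null."""
    if src.keys() != target.keys():
        return False
    return all(src[col] == target[col] or src[col] == NULL for col in src)


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Build one schema per column, preferring any non-null type over null."""
    columns = set(schemas[0])
    merged: Schema = {}
    for schema in schemas:
        if set(schema) != columns:
            raise ValueError("datasets do not share the same columns")
        for col, typ in schema.items():
            current = merged.get(col, NULL)
            if current == NULL:
                merged[col] = typ  # first concrete type wins over null
            elif typ != NULL and typ != current:
                raise TypeError(f"conflicting types for {col!r}: {current} vs {typ}")
    return merged


def align_schemas(schemas: List[Schema]) -> Schema:
    """Step 1: try the first schema as the target. Step 2: fall back to a merge."""
    first = schemas[0]
    if all(can_cast(s, first) for s in schemas):
        return dict(first)
    return merge_schemas(schemas)


# Usage: the first file has an all-null "note" column, the second has strings.
a = {"id": "int64", "note": NULL}
b = {"id": "int64", "note": "string"}
target = align_schemas([a, b])  # step 1 fails (string cannot become null), so merge
```

In this sketch, casting to the first schema fails only when a later file has a concrete type where the first file saw nothing but nulls, which is the case the merged schema is meant to cover. Genuinely conflicting concrete types still raise an error instead of being silently coerced.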
---------
Co-authored-by: mwxely <mwxely@users.noreply.github.com>
1 parent 87a1f86 commit 73ac00a
1 file changed
Lines changed: 47 additions & 2 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 21-23 | 21-23 | unchanged context |
| | 24-68 | + 45 lines added |
| 24-26 | 69-71 | unchanged context |
| 104-106 | 149-151 | unchanged context |
| 107 | | - 1 line removed |
| | 152 | + 1 line added |
| 108-110 | 153-155 | unchanged context |
| 222-224 | 267-269 | unchanged context |
| 225 | | - 1 line removed |
| | 270 | + 1 line added |
| 226-228 | 271-273 | unchanged context |