Commit 6da06ad
feat: add dictionary_columns to Arrow scans (#3461)
Closes #3170
## Rationale
Columns that contain large or frequently repeated string values (e.g.
JSON blobs, low-cardinality categoricals) can exhaust memory when
PyArrow loads them as plain string arrays. PyArrow's Parquet reader
can return such columns dictionary-encoded via its `dictionary_columns`
kwarg. Each distinct value is then stored once and referenced by index,
which can substantially reduce peak memory usage for repetitive data.
This was previously discussed in #3168; a prior implementation
(#3234) was closed as stale.
## Changes
- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`,
`TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded through `DataScan.to_arrow()` and `to_arrow_batch_reader()`
→ `ArrowScan.__init__` → `_task_to_record_batches` →
`_get_file_format()`.
- Only applied when `task.file.file_format == FileFormat.PARQUET`;
silently ignored for ORC (which does not support this kwarg).
## Usage
```python
# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()
```
## Verification
- Added `test_dictionary_columns_produces_dict_encoded_output` —
confirms the requested column is dict-encoded, non-requested columns are
plain, and values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓
---------
Co-authored-by: Gayathri Srividya Rajavarapu <gayathrir@Gayathris-MacBook-Air.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent a1e12ad commit 6da06ad
3 files changed
Lines changed: 118 additions & 12 deletions