Commit 897a4d7
Phase DS-1: TPC-DS Dictionary Encoding — 50 columns across 14 tables
Apply Arrow dictionary(int8(), utf8()) encoding to low- and medium-cardinality
TPC-DS columns, following the pattern from TPC-H Phase 3.3 (which gave a +57%
throughput improvement).
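
For readers unfamiliar with the technique, the idea behind dictionary(int8(), utf8()) can be sketched in plain C++ without Arrow. This is only an illustration under stated assumptions: `DictColumn` and `encode_column` are hypothetical names, not part of this commit or of the Arrow API. Each distinct string is stored once and every row keeps a one-byte index, which is why the encoding only suits columns with at most 128 distinct values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an Arrow dictionary(int8(), utf8()) column:
// distinct strings are stored once, rows hold a one-byte index.
struct DictColumn {
    std::vector<std::string> dictionary;  // distinct values, first-seen order
    std::vector<std::int8_t> indices;     // one index per row
    std::unordered_map<std::string, std::int8_t> lookup;

    void append(const std::string& value) {
        auto it = lookup.find(value);
        if (it == lookup.end()) {
            // int8 indices run 0..127, so at most 128 distinct values fit.
            if (dictionary.size() >= 128) {
                throw std::overflow_error("too many distinct values for int8");
            }
            auto idx = static_cast<std::int8_t>(dictionary.size());
            dictionary.push_back(value);
            it = lookup.emplace(value, idx).first;
        }
        indices.push_back(it->second);
    }

    // Decode a row back to its string value.
    const std::string& at(std::size_t row) const {
        assert(row < indices.size());
        return dictionary[static_cast<std::size_t>(indices[row])];
    }
};

// Encode a whole column of strings in one pass.
DictColumn encode_column(const std::vector<std::string>& values) {
    DictColumn col;
    for (const auto& v : values) col.append(v);
    return col;
}
```

A column such as cd_gender would then cost one byte per row plus a dictionary of a few short strings, instead of a full string per row.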
Encoded columns (50 total):
- customer_demographics: cd_gender, cd_marital_status, cd_education_status, cd_credit_rating
- customer_address: ca_location_type, ca_state, ca_country, ca_street_type
- time_dim: t_am_pm, t_shift, t_sub_shift, t_meal_time
- date_dim: d_day_name
- item: i_category, i_size, i_color, i_units, i_container
- call_center: cc_class, cc_hours, cc_name + address fields (cc_state, cc_country, cc_street_type)
- catalog_page: cp_department, cp_type
- web_page: wp_type
- web_site: web_class + address fields (web_state, web_country, web_street_type)
- warehouse: address fields (w_state, w_country, w_street_type)
- ship_mode: sm_type, sm_code, sm_carrier
- store: s_hours, s_geography_class, s_division_name, s_company_name + address fields (s_state, s_country, s_street_type)
- customer: c_salutation
- promotion: p_purpose
Implementation:
- src/tpcds_main.cpp: DICTIONARY type handling in create_builders() and finish_batch()
- include/tpch/dsdgen_converter.hpp: get_dict_for_field() declaration
- src/dsdgen/dsdgen_converter.cpp: 25+ encode functions + dictionary registry (41 entries)
- src/dsdgen/dsdgen_wrapper.cpp: 9 table schemas updated to dict8
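
The registry and lookup named above could look roughly like the following sketch. The commit only names get_dict_for_field(); its signature, the `Dict` alias, `dict_registry()`, `encode_value()`, and the two example entries are all assumptions made for illustration, and no Arrow types are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Dict = std::vector<std::string>;

// Illustrative registry with two entries; the commit's registry has 41.
// The field values listed here are assumptions for the example.
const std::unordered_map<std::string, Dict>& dict_registry() {
    static const std::unordered_map<std::string, Dict> registry = {
        {"cd_gender", Dict{"M", "F"}},
        {"t_am_pm", Dict{"AM", "PM"}},
    };
    return registry;
}

// Hypothetical signature: returns the fixed dictionary for a field,
// or nullptr when the field is not dictionary-encoded.
const Dict* get_dict_for_field(const std::string& field) {
    const auto& reg = dict_registry();
    auto it = reg.find(field);
    return it == reg.end() ? nullptr : &it->second;
}

// Map a string to its int8 index; a linear scan is adequate for
// dictionaries holding only a handful of values.
std::int8_t encode_value(const Dict& dict, const std::string& value) {
    assert(dict.size() <= 128);  // must fit in int8 indices 0..127
    for (std::size_t i = 0; i < dict.size(); ++i) {
        if (dict[i] == value) return static_cast<std::int8_t>(i);
    }
    return -1;  // value not present in the dictionary
}
```

Keeping the dictionaries fixed per field, rather than building them per batch, means every batch of a table shares the same index-to-value mapping.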
All 24 TPC-DS tables validated (SF=1). Expected: +50-60% performance gain at scale, based on the TPC-H Phase 3.3 precedent.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
1 parent 1015c1c commit 897a4d7
4 files changed
Lines changed: 392 additions & 115 deletions