Commit 897a4d7

tsafin and claude committed
Phase DS-1: TPC-DS Dictionary Encoding — 50 columns across 14 tables
Apply Arrow dictionary(int8(), utf8()) encoding to low/medium-cardinality TPC-DS columns, following the TPC-H Phase 3.3 pattern (+57% throughput improvement).

Encoded columns (50 total):
- customer_demographics: cd_gender, cd_marital_status, cd_education_status, cd_credit_rating
- customer_address: ca_location_type, ca_state, ca_country, ca_street_type
- time_dim: t_am_pm, t_shift, t_sub_shift, t_meal_time
- date_dim: d_day_name
- item: i_category, i_size, i_color, i_units, i_container
- call_center: cc_class, cc_hours, cc_name + address fields (cc_state, cc_country, cc_street_type)
- catalog_page: cp_department, cp_type
- web_page: wp_type
- web_site: web_class + address fields (web_state, web_country, web_street_type)
- warehouse: address fields (w_state, w_country, w_street_type)
- ship_mode: sm_type, sm_code, sm_carrier
- store: s_hours, s_geography_class, s_division_name, s_company_name + address fields (s_state, s_country, s_street_type)
- customer: c_salutation
- promotion: p_purpose

Implementation:
- src/tpcds_main.cpp: DICTIONARY type handling in create_builders() and finish_batch()
- include/tpch/dsdgen_converter.hpp: get_dict_for_field() declaration
- src/dsdgen/dsdgen_converter.cpp: 25+ encode functions + dictionary registry (41 entries)
- src/dsdgen/dsdgen_wrapper.cpp: 9 table schemas updated to dict8

All 24 TPC-DS tables validated (SF=1). Expected: +50-60% performance gain at scale (from Phase 3.3 precedent).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
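For readers unfamiliar with the dict8 idea, here is a minimal sketch of what dictionary(int8(), utf8()) encoding amounts to: each row stores a one-byte index into a small table of distinct strings. It deliberately uses only the standard library rather than Arrow, and `Dict8Column`, `encode_value`, `append_row`, and `decode_row` are hypothetical names for illustration, not functions from this commit. The `{"M", "F"}` dictionary is likewise an assumed example for `cd_gender`.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a dict8 column: int8 indices into a fixed dictionary.
struct Dict8Column {
    std::vector<std::string> dictionary;  // distinct values; int8 allows at most 128
    std::vector<std::int8_t> indices;     // one index per row
};

// Returns the index of `value` in `dict`, or std::nullopt if it is absent
// (or sits beyond the range an int8 index can address).
std::optional<std::int8_t> encode_value(const std::vector<std::string>& dict,
                                        std::string_view value) {
    for (std::size_t i = 0; i < dict.size() && i <= 127; ++i) {
        if (dict[i] == value) {
            return static_cast<std::int8_t>(i);
        }
    }
    return std::nullopt;
}

// Appends one row as an index; returns false if the value is not in the dictionary.
bool append_row(Dict8Column& col, std::string_view value) {
    const auto idx = encode_value(col.dictionary, value);
    if (!idx) {
        return false;
    }
    col.indices.push_back(*idx);
    return true;
}

// Recovers the original string for a row by looking its index up in the dictionary.
const std::string& decode_row(const Dict8Column& col, std::size_t row) {
    return col.dictionary[static_cast<std::size_t>(col.indices[row])];
}
```

With a fixed dictionary, each row costs one byte instead of a full string copy, which is the property the commit relies on for low-cardinality columns.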
1 parent 1015c1c commit 897a4d7

4 files changed

Lines changed: 392 additions & 115 deletions

File tree

include/tpch/dsdgen_converter.hpp

Lines changed: 5 additions & 0 deletions
@@ -192,4 +192,9 @@ void append_dsdgen_row_to_builders(
     const void* row,
     std::map<std::string, std::shared_ptr<arrow::ArrayBuilder>>& builders);
 
+/**
+ * Returns static dictionary Arrow array for dict8-encoded columns, or nullptr.
+ */
+std::shared_ptr<arrow::Array> get_dict_for_field(const std::string& field_name);
+
 } // namespace tpcds
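
The declaration above only fixes the contract: a field name goes in, and a shared static dictionary or nullptr comes out. The following is a rough sketch of how such a registry lookup could be structured, not the code in src/dsdgen/dsdgen_converter.cpp (which this page does not show). To stay self-contained it uses `std::vector<std::string>` as a stand-in for `arrow::Array`. The names `DictArray` and `lookup_dict` and the two registry entries are assumptions made for illustration.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for arrow::Array so the sketch compiles without Arrow (assumption).
using DictArray = std::vector<std::string>;

// Hypothetical analogue of get_dict_for_field(): returns the shared static
// dictionary for a dict8-encoded column, or nullptr for any other column.
std::shared_ptr<const DictArray> lookup_dict(const std::string& field_name) {
    // Built once on first call; the entries here are illustrative only.
    static const std::unordered_map<std::string, std::shared_ptr<const DictArray>>
        registry = {
            {"cd_gender", std::make_shared<const DictArray>(DictArray{"M", "F"})},
            {"t_am_pm", std::make_shared<const DictArray>(DictArray{"AM", "PM"})},
        };

    const auto it = registry.find(field_name);
    if (it == registry.end()) {
        return nullptr;  // not a dictionary-encoded column
    }
    return it->second;
}
```

Returning nullptr for unknown names lets a caller such as finish_batch() branch on whether a column is dictionary-encoded without a separate lookup table.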