Commit facaa57

Rituparna Khaund authored and committed
out_s3: add format=parquet option with page-level compression
The compression config option conflated byte-level compression (gzip, zstd, snappy) with format conversion (parquet), making it impossible to produce Parquet files with page-level compression.

This adds 'format parquet' as a new option. When format is parquet, the compression option (snappy, zstd, gzip) controls the page-level codec inside the Parquet file. Default is uncompressed to preserve existing behavior.

The old 'compression=parquet' syntax is preserved as a deprecated alias that emits a warning and maps to format=parquet with no page-level compression (identical to current behavior).

Arrow support is untouched and continues to work via 'compression=arrow' as before.

Signed-off-by: Rituparna Khaund <ritukhau@amazon.co.uk>
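The option handling described in this message can be sketched roughly as follows. This is an illustrative Python model only, not the plugin's C code: the function name `resolve_s3_output`, its return tuple, and the use of `None` for an unset option are assumptions made for the example. The `log_key` check and the message strings are taken from the results recorded in e2e-tests.txt.

```python
import warnings

# Page-level codecs named in the commit message.
PAGE_CODECS = {"snappy", "zstd", "gzip"}


def resolve_s3_output(fmt=None, compression=None, log_key=None):
    """Return (format, page_codec, byte_compression) or raise ValueError.

    Hypothetical helper that models the described behaviour; the real
    out_s3 plugin is written in C and may be structured differently.
    """
    # Deprecated alias: compression=parquet maps to format=parquet
    # with no page-level compression, plus a warning.
    if compression == "parquet":
        warnings.warn(
            "'compression=parquet' is deprecated. Use 'format parquet' "
            "with 'compression' set to the desired page-level codec",
            DeprecationWarning,
        )
        fmt, compression = "parquet", None

    if fmt == "parquet":
        if log_key is not None:
            raise ValueError("'log_key' is not supported when format is parquet")
        if compression is None:
            # Default: uncompressed pages, matching the old behaviour.
            return ("parquet", "uncompressed", None)
        if compression not in PAGE_CODECS:
            raise ValueError(
                f"'{compression}' is not a supported parquet page codec"
            )
        return ("parquet", compression, None)

    # Any other format: compression keeps its byte-level meaning
    # (gzip, zstd, snappy, or arrow as before).
    return (fmt, None, compression)
```

For example, `resolve_s3_output(fmt="parquet", compression="zstd")` yields a Parquet file with ZSTD pages, while `resolve_s3_output(compression="gzip")` leaves the existing byte-level gzip path untouched.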
1 parent 69b3dd9 commit facaa57

6 files changed

Lines changed: 558 additions & 54 deletions

e2e-tests.txt

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
E2E Test Results — out_s3 Parquet format with page-level compression
=====================================================================

Test 1: format=parquet, compression=snappy
------------------------------------------
Config:
  format parquet
  compression snappy
  use_put_object On

Result: PASS
- All columns show compression: SNAPPY
- date column: space_saved: 63%
- Parquet magic bytes present (created_by: parquet-cpp-arrow version 24.0.0)

parquet-tools inspect output:

############ file meta data ############
created_by: parquet-cpp-arrow version 24.0.0
num_columns: 5
num_rows: 60
num_row_groups: 1
format_version: 2.6
serialized_size: 710

Column(date) compression: SNAPPY (space_saved: 63%)
Column(impressionId) compression: SNAPPY (space_saved: -8%)
Column(level) compression: SNAPPY (space_saved: -8%)
Column(seedAsin) compression: SNAPPY (space_saved: -8%)
Column(ts) compression: SNAPPY (space_saved: -8%)

Note: Small columns with non-repetitive data show negative space_saved
because Snappy framing overhead exceeds savings on tiny payloads. This
is expected and matches behavior of other Parquet writers (Spark, PyArrow).


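The negative space_saved values can be reproduced in miniature. The sketch below is an illustration only: it uses zlib from the Python standard library as a stand-in codec (not Snappy, and not the Parquet writer), assumes space_saved is computed as 1 - compressed/uncompressed, and uses made-up sample payloads rather than the test data.

```python
import zlib


def space_saved(payload: bytes) -> float:
    """Percentage saved by compression; negative when output grows.

    Assumes a non-empty payload and the definition
    space_saved = 1 - compressed_size / uncompressed_size.
    """
    compressed = zlib.compress(payload)
    return 100.0 * (1 - len(compressed) / len(payload))


# Made-up payloads: a tiny value and a highly repetitive column.
tiny = b"INFO"
repetitive = b"2024-01-01" * 60

print(f"tiny:       {space_saved(tiny):.0f}%")        # negative: header and checksum dominate
print(f"repetitive: {space_saved(repetitive):.0f}%")  # positive: repeats compress well
```

The fixed header and checksum bytes outweigh any savings on the tiny payload, which is the same effect the note describes for small, non-repetitive columns.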
Test 2: format=parquet, compression=zstd
----------------------------------------
Config:
  format parquet
  compression zstd
  use_put_object On

Result: PASS
- All columns show compression: ZSTD
- date column: space_saved: 76%
- ZSTD achieves better ratio on repetitive data (date) vs Snappy

parquet-tools inspect output:

############ file meta data ############
created_by: parquet-cpp-arrow version 24.0.0
num_columns: 5
num_rows: 45
num_row_groups: 1
format_version: 2.6
serialized_size: 714

Column(date) compression: ZSTD (space_saved: 76%)
Column(impressionId) compression: ZSTD (space_saved: -36%)
Column(level) compression: ZSTD (space_saved: -38%)
Column(seedAsin) compression: ZSTD (space_saved: -35%)
Column(ts) compression: ZSTD (space_saved: -38%)


Test 3: format=parquet, compression=gzip
----------------------------------------
Config:
  format parquet
  compression gzip
  use_put_object On

Result: PASS
- All columns show compression: GZIP
- date column: space_saved: 27%
- Gzip overhead is highest on small columns due to framing

parquet-tools inspect output:

############ file meta data ############
created_by: parquet-cpp-arrow version 24.0.0
num_columns: 5
num_rows: 5
num_row_groups: 1
format_version: 2.6
serialized_size: 711

Column(date) compression: GZIP (space_saved: 27%)
Column(impressionId) compression: GZIP (space_saved: -72%)
Column(level) compression: GZIP (space_saved: -83%)
Column(seedAsin) compression: GZIP (space_saved: -78%)
Column(ts) compression: GZIP (space_saved: -83%)


Test 4: format=parquet, compression=(none)
------------------------------------------
Config:
  format parquet
  use_put_object On

Result: PASS
- All columns show compression: UNCOMPRESSED
- Default behavior when no compression specified is uncompressed pages
- Backwards compatible with previous compression=parquet behavior

parquet-tools inspect output:

############ file meta data ############
created_by: parquet-cpp-arrow version 24.0.0
num_columns: 5
num_rows: 59
num_row_groups: 1
format_version: 2.6
serialized_size: 710

Column(date) compression: UNCOMPRESSED (space_saved: 0%)
Column(impressionId) compression: UNCOMPRESSED (space_saved: 0%)
Column(level) compression: UNCOMPRESSED (space_saved: 0%)
Column(seedAsin) compression: UNCOMPRESSED (space_saved: 0%)
Column(ts) compression: UNCOMPRESSED (space_saved: 0%)


Test 5: compression=parquet (deprecated path)
---------------------------------------------
Config:
  compression parquet
  use_put_object On

Expected: Deprecation warning in logs + Parquet with UNCOMPRESSED pages
Result: PASS
- Deprecation warning emitted at startup:
  [warn] 'compression=parquet' is deprecated. Use 'format parquet'
  with 'compression' set to the desired page-level codec (snappy, zstd, gzip)
- All columns show compression: UNCOMPRESSED
- Backwards compatible with previous behavior

parquet-tools inspect output:

############ file meta data ############
created_by: parquet-cpp-arrow version 24.0.0
num_columns: 5
num_rows: 20
num_row_groups: 1
format_version: 2.6
serialized_size: 710

Column(date) compression: UNCOMPRESSED (space_saved: 0%)
Column(impressionId) compression: UNCOMPRESSED (space_saved: 0%)
Column(level) compression: UNCOMPRESSED (space_saved: 0%)
Column(seedAsin) compression: UNCOMPRESSED (space_saved: 0%)
Column(ts) compression: UNCOMPRESSED (space_saved: 0%)


Test 6: format=parquet, compression=arrow (invalid combo)
---------------------------------------------------------
Config:
  format parquet
  compression arrow
  use_put_object On

Expected: Startup error — unsupported codec
Result: PASS
- Fluent Bit refuses to start with clear error:
  [error] 'arrow' is not a supported parquet page codec.
  Supported: snappy, zstd, gzip
  [error] output initialization failed


Test 7: format=parquet, log_key=log (invalid combo)
---------------------------------------------------
Config:
  format parquet
  log_key log
  use_put_object On

Expected: Startup error — log_key not supported with parquet
Result: PASS
- Fluent Bit refuses to start with clear error:
  [error] 'log_key' is not supported when format is parquet
  [error] output initialization failed


Test 8: compression=gzip (existing behavior unchanged)
------------------------------------------------------
Config:
  compression gzip

Expected: Gzipped JSON output
Result: PASS
- Output is gzipped JSON lines (gunzip produces valid JSON)
- Existing behavior unchanged by our changes


Test 9: compression=arrow (existing behavior unchanged)
-------------------------------------------------------
Config:
  compression arrow
  use_put_object On

Expected: Arrow/Feather format output
Result: PASS
- Output is valid Arrow/Feather file (readable by pyarrow.feather)
- Existing behavior unchanged by our changes


Test 10: format=json_lines, compression=snappy (existing behavior)
------------------------------------------------------------------
Config:
  format json_lines
  compression snappy

Expected: Snappy-wrapped JSON output
Result: PASS
- Output is snappy-compressed JSON
- Existing behavior unchanged by our changes

fluent-bit.iml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="JAVA_MODULE" version="4">
  <component name="NewModuleRootManager" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$/lib/librdkafka-2.10.1/tests/java" isTestSource="true" />
    </content>
    <orderEntry type="inheritedJdk" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
