Skip to content

fix(parquet/metadata): do not pre-set hasDistinctCount in stat factories#807

Open
SAY-5 wants to merge 1 commit into
apache:mainfrom
SAY-5:say5-stats-distinct-806
Open

fix(parquet/metadata): do not pre-set hasDistinctCount in stat factories#807
SAY-5 wants to merge 1 commit into
apache:mainfrom
SAY-5:say5-stats-distinct-806

Conversation

@SAY-5
Copy link
Copy Markdown
Contributor

@SAY-5 SAY-5 commented May 10, 2026

Rationale for this change

Fixes #806. The New*Statistics factories in parquet/metadata/statistics_types.gen.go initialize hasDistinctCount: true at construction time, so every fresh stat object reports a distinct count of 0. Encode() then unconditionally calls enc.SetDistinctCount(0), which makes the writer emit distinct_count: 0 in both ColumnChunk.metadata.statistics and DataPageHeader.statistics for every column, even when no distinct count has been computed.

The Parquet thrift Statistics.distinct_count field is OPTIONAL. Per spec semantics, it must be absent when unknown. Emitting 0 conflates "no distinct values" with "we don't know" and causes byte-asymmetry against parquet-cpp (pyarrow), which leaves the field unset for the same input.

IncDistinct() is the only call site that should flip hasDistinctCount to true, and grepping the repo confirms it is never invoked from the writer paths, so flipping the default is safe.

What changes are included in this PR?

  • Remove hasDistinctCount: true from the factory initializer in statistics_types.gen.go.tmpl and regenerate statistics_types.gen.go. hasNullCount: true is kept as-is (its semantics differ: IncNulls is called per page and null_count: 0 is meaningful for a writer that observes zero nulls).
  • Add TestNewStatisticsDistinctCountUnset covering Int32/Int64/Float32/Float64/Boolean/ByteArray, asserting both HasDistinctCount() and Encode().HasDistinctCount start false and flip to true after IncDistinct.

Are these changes tested?

Yes. The new test fails on master (HasDistinctCount returns true on a fresh stat object) and passes after the fix. go test ./parquet/metadata/... is green.

Are there any user-facing changes?

Yes, writers that previously emitted distinct_count: 0 for every column will now omit the field, matching the spec and parquet-cpp. Readers already tolerate both, so this is a wire-format normalization, not a breaking change for downstream consumers.

Signed-off-by: SAY-5 <saiasish.cnp@gmail.com>
@SAY-5 SAY-5 requested a review from zeroshade as a code owner May 10, 2026 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] NewIntXStatistics factories unconditionally set hasDistinctCount=true, causing distinct_count=0 to always appear in Parquet output

1 participant