Fix/delta composite partition null value #828
Open
alealandreev wants to merge 3 commits into
DeltaPartitionExtractor#getSerializedPartitionValue joined the values of
a composite (generated column) partition with Collectors.joining("-").
When one of the component values was absent from the partition values
map, the missing value was rendered as the literal string "null",
producing a corrupted partition value such as "2013-null-20" that was
then fed to the date parser.
Return null when any component value is missing so the partition value
resolves to null, consistent with the single-field branch that uses
getOrDefault(name, null). Added a regression test covering a composite
generated-column partition with a missing component.
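The behaviour described in this commit message can be reproduced outside the extractor. The following is a minimal sketch, not the project's code: the class name CompositePartitionSketch and the methods joinNaive and joinOrNull are hypothetical, and the partition values map is assumed to be a plain Map<String, String>. It only illustrates that Collectors.joining renders a null element as the text "null", and what returning null for a missing component looks like.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of the pull request.
class CompositePartitionSketch {

  // Pre-fix style: look up every component and join with "-".
  // A missing key yields null, which the joiner renders as the text "null".
  static String joinNaive(List<String> fieldNames, Map<String, String> partitionValues) {
    return fieldNames.stream()
        .map(partitionValues::get)
        .collect(Collectors.joining("-"));
  }

  // Post-fix style: if any component is absent (or mapped to null),
  // the whole composite value resolves to null.
  static String joinOrNull(List<String> fieldNames, Map<String, String> partitionValues) {
    for (String name : fieldNames) {
      if (partitionValues.get(name) == null) {
        return null;
      }
    }
    return joinNaive(fieldNames, partitionValues);
  }

  public static void main(String[] args) {
    List<String> fields = Arrays.asList("year", "month", "day");
    Map<String, String> values = new HashMap<>();
    values.put("year", "2013");
    values.put("day", "20"); // "month" is deliberately missing

    System.out.println(joinNaive(fields, values));  // prints 2013-null-20
    System.out.println(joinOrNull(fields, values)); // prints null (a null reference, not a joined string)
  }
}
```

Checking for the missing component before joining keeps the composite branch aligned with the single-field branch, which already yields null through getOrDefault(name, null).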
DeltaKernelPartitionExtractor#getSerializedPartitionValue contained the
same composite (generated column) partition handling as the Spark-based
DeltaPartitionExtractor: joining component values with
Collectors.joining("-") rendered a missing component as the literal
string "null", corrupting the partition value.
Apply the same fix here - return null when any component value is
missing - and add a regression test mirroring the one added for
DeltaPartitionExtractor.
Contributor
@alealandreev are you running into these cases in datasets you are working with? I am curious how the dataset can get into a state where one of the fields is missing.
The composite (generated column) partition columns are all derived from a single source column, so the realistic trigger for a missing component is a null source value, which makes every derived partition column null.

- Reframe the DeltaPartitionExtractor and DeltaKernelPartitionExtractor unit tests around that null-source case (all components null) instead of an artificial single-missing-component map.
- Add an integration test (ITDeltaConversionSource) that creates a Delta table partitioned by year/month/day generated columns derived from a nullable timestamp, inserts a row with a null timestamp, and verifies the snapshot resolves the partition value to null. Without the fix this reproduces the failure (ParseException on "null-null-null").
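To make the failure mode concrete, here is a small sketch of why a joined "null-null-null" value cannot be parsed as a date. It assumes a java.text.SimpleDateFormat with the pattern "yyyy-MM-dd"; the parser and pattern the extractor actually uses are not shown in this conversation, and the class and method names below are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical illustration class; the real parsing code may differ.
class NullSourcePartitionSketch {

  // A null serialized value is passed through as null instead of being parsed.
  static Date parseOrNull(String serialized, String pattern) throws ParseException {
    if (serialized == null) {
      return null;
    }
    SimpleDateFormat format = new SimpleDateFormat(pattern);
    format.setLenient(false);
    return format.parse(serialized);
  }

  // Reports whether parsing the serialized value throws ParseException.
  static boolean failsToParse(String serialized, String pattern) {
    try {
      parseOrNull(serialized, pattern);
      return false;
    } catch (ParseException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    String pattern = "yyyy-MM-dd"; // assumed pattern for year/month/day components
    System.out.println(failsToParse("null-null-null", pattern)); // prints true
    System.out.println(failsToParse("2013-08-20", pattern));     // prints false
    System.out.println(failsToParse(null, pattern));             // prints false
  }
}
```

With the fix, a null source value produces a null serialized partition value, so the parse step is never handed the string "null-null-null".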
What is the purpose of the pull request

When a Delta table is partitioned by a composite (generated column) partition, e.g. year/month/day columns derived from a single source column, the partition extractor joins the component values with Collectors.joining("-"). If one of the component values is missing from the partition values map, the missing value is rendered as the literal string "null", producing a corrupted partition value such as "2013-null-20" that is then fed to the date parser.

This affects both the Spark-based DeltaPartitionExtractor and the DeltaKernelPartitionExtractor, which contained identical logic.

Brief change log

- DeltaPartitionExtractor#getSerializedPartitionValue: return null when any component value of a composite partition is missing, consistent with the single-field branch that uses getOrDefault(name, null).
- DeltaKernelPartitionExtractor#getSerializedPartitionValue: apply the same fix.
- Add regression tests to TestDeltaPartitionExtractor and TestDeltaKernelPartitionExtractor covering a composite generated-column partition with a missing component.

Verify this pull request

This change added tests and can be verified as follows:

- Added testGeneratedPartitionValueExtractionWithMissingComponent to both TestDeltaPartitionExtractor and TestDeltaKernelPartitionExtractor, asserting the partition value resolves to null instead of a value containing the literal "null".
- Tests run: 13, Failures: 0, Errors: 0 for both test classes.