You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Addresses review comment #9 on PR #4402: a downstream mapping that drops
schema_metadata between the schema-registry decoder and the iceberg sink
would silently reintroduce the year-50000 corruption that the rest of
the PR closes — the type-resolver picks TIMESTAMPTZ for the column based
on metadata seen at table-creation time, but per-message metadata is
what the shredder needs to interpret each numeric value's unit.
When schema_evolution.require_schema_metadata is true (default false),
the shredder rejects numeric inputs into time-typed columns when no
schema.Common has been registered for that field. time.Time /
time.Duration native inputs are unaffected — they carry their own unit
unambiguously. Non-time columns are unaffected.
The flag is gated to require schema_metadata also be set; setting strict
mode without configuring metadata at all is a config error caught at
startup.
Plumbing:
- config.go: new field with operator-facing description.
- output_iceberg.go: parse and validate the require/has-metadata pair.
- router.go: add to SchemaEvolutionConfig and pass to writer.
- writer.go: extend NewWriter signature; flip shredder strict mode when
the flag is set.
- shredder/shredder.go: new SetStrictTemporalMode(bool) and a
strictTemporal field; thread through convertLeafValue.
- shredder/temporal.go: convertDate / convertTime / convertTimestamp
return field-naming errors instead of falling through when strict
mode is on and metadata is absent.
Tests cover (a) numeric without metadata under strict mode is rejected
with a require_schema_metadata=true error message, (b) native time.Time
and time.Duration inputs are unaffected by strict mode, (c) numeric with
metadata under strict mode succeeds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: CHANGELOG.md
+4Lines changed: 4 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -16,6 +16,10 @@ All notable changes to this project will be documented in this file.
16
16
-**Pipelines whose own code reads the `schema_metadata` bytes via `meta()`** and pattern-matches the historical INT64 shape: schemas now contain `TIMESTAMP` / `DATE` / `TIME_OF_DAY` / `UUID` along with new `unit` and `adjust_to_utc` fields. Update the pattern.
17
17
-**iceberg shredder** is now schema-aware for numeric inputs: a numeric millisecond value declared by the schema as `timestamp-millis` is correctly interpreted as milliseconds rather than as Unix seconds. This closes a previously-silent corruption case where an int64 millis input into a TIMESTAMPTZ column would land ~50,000 years in the future.
18
18
19
+
### Added
20
+
21
+
- iceberg: `schema_evolution.require_schema_metadata` (default `false`). When enabled along with `schema_evolution.schema_metadata`, numeric values shredded into a `timestamp`, `timestamptz`, `date`, or `time` column without registered schema metadata are rejected loudly instead of falling through to the bloblang Unix-seconds default. Use this when you cannot guarantee the upstream attaches schema metadata and prefer a hard error to silently corrupting dates by ~50,000 years. No effect on time-typed columns receiving native `time.Time` / `time.Duration` Go values.
22
+
19
23
### Changed
20
24
21
25
- iceberg: `NewWriter` now takes a `*typeResolver` argument so the writer can use schema metadata to interpret numeric inputs into time-typed columns at shredding time. Internal API change only.
Description("An optional Bloblang mapping to customize column types during schema evolution. This mapping is executed for each new column and can override the inferred or schema-metadata-derived type. The mapping receives an object with fields `name` (column name), `path` (dot-separated path), `value` (sample value), `inferred_type` (the type that would be used without this mapping), `message` (the full message body), `namespace`, and `table`. It must return a string with a valid Iceberg type name: `boolean`, `int`, `long`, `float`, `double`, `string`, `binary`, `date`, `time`, `timestamp`, `timestamptz`, `uuid`, `decimal(p,s)`, or `fixed[n]`.").
Description("When `true`, writing a numeric value into a `timestamp`, `timestamptz`, `date`, or `time` column without `schema_metadata` registered for that column is a hard error. The default `false` permits a fallback path that interprets bare numeric timestamps as Unix seconds and bare numeric times as already-microseconds — convenient, but silently wrong if upstream produced milliseconds. Enable this when you cannot guarantee the upstream attaches schema metadata and want to fail loudly rather than corrupt dates by ~50,000 years. No effect on time-typed columns receiving `time.Time`/`time.Duration` Go values, which carry their own unit unambiguously, and no effect on non-time columns. Requires `schema_metadata` to be set.").
0 commit comments