diff --git a/.claude/skills/common-schema-audit/SKILL.md b/.claude/skills/common-schema-audit/SKILL.md new file mode 100644 index 0000000000..ba34525b34 --- /dev/null +++ b/.claude/skills/common-schema-audit/SKILL.md @@ -0,0 +1,124 @@ +--- +name: common-schema-audit +description: Audit every consumer of the schema.Common metadata format (the format produced by schema_registry_decode's store_schema_metadata, the parquet_decode processor, and CDC sources) for type-coverage drift and value-coercion gaps. Run this whenever a new component starts consuming schema.Common, when a new schema.CommonType variant is added upstream in benthos, or as a periodic maintenance check. +argument-hint: "[--format=md|json] [--component=]" +disable-model-invocation: true +allowed-tools: Bash(go *), Bash(grep *), Bash(find *), Read, Glob, Grep, Task +--- + +# Common-schema consumer drift audit + +`schema.Common` (from `github.com/redpanda-data/benthos/v4/public/schema`) is the canonical type metadata that flows through `meta(schema)` between Avro / Parquet / CDC sources and downstream sinks. Every consumer of this metadata must: + +1. **Handle every variant of `schema.CommonType`** — or fail loudly with a useful error that names the missing case, not a generic "unsupported". +2. **Coerce values when the Go type of the message body doesn't match the schema-declared type** — specifically the temporal-to-numeric and numeric-to-temporal bridges that the iceberg shredder implements via `coerceTemporalToNumeric` and the metadata-aware path in `internal/impl/iceberg/shredder/temporal.go:208`. + +This skill produces a per-consumer report so reviewers can catch drift before it ships. + +## Why this matters + +The "GF iceberg issue" was a value-vs-metadata mismatch class. Fixes closed the gap in each consumer: + +- `iceberg` output → temporal coerce + numeric metadata-aware scaling +- `parquet_encode` → type coverage for Date/TimeOfDay/UUID/Map, temporal coerce bridges +- `confluent` decoder / metadata parser → field-level logicalType, Debezium connect.name, duration +- `confluent` JSON-Schema encoder → Date/TimeOfDay/UUID + +A new consumer of `schema.Common`, or a new `schema.CommonType` variant added upstream in benthos, can re-introduce the same bug class without anyone noticing until a customer pipeline breaks. The audit catches the drift mechanically. + +## Workflow + +1. **Enumerate the type universe.** Read every `schema.CommonType` constant from the benthos source — the authoritative list of variants every consumer must consider. + + ```bash + gopath=$(go env GOMODCACHE) + benthos_dir=$(ls -d $gopath/github.com/redpanda-data/benthos/v4@*/ | tail -1) + grep -E '^\s*(Boolean|Int32|Int64|Float32|Float64|String|ByteArray|Object|Map|Array|Null|Union|Timestamp|Date|TimeOfDay|UUID|Decimal|BigDecimal|Any)\s+CommonType' "$benthos_dir/public/schema/common.go" + ``` + + Cross-check against the current set (as of the GF issue): + `Boolean, Int32, Int64, Float32, Float64, String, ByteArray, Object, Map, Array, Null, Union, Timestamp, Date, TimeOfDay, UUID, Decimal, BigDecimal, Any`. + + If new variants appear in benthos that aren't in this list, every consumer below will silently need an additional case — flag it loudly and update the skill's audit list. + +2. **Find every consumer.** A "consumer" of `schema.Common` is a code path that reads parsed schema metadata and uses it to drive downstream type decisions. The reliable signal is a `schema.ParseFromAny(...)` call, plus any direct `schema.Common` type switches in encoding/coercion paths. + + ```bash + grep -rln 'schema\.ParseFromAny\|case schema\.\(Boolean\|Int32\|Int64\|Float32\|Float64\|String\|ByteArray\|Object\|Map\|Array\|Null\|Union\|Timestamp\|Date\|TimeOfDay\|UUID\|Decimal\|BigDecimal\|Any\)\b' internal/impl/ | grep -v _test + ``` + + Producers (CDC schema builders in `mysql/`, `oracledb/`, `postgresql/`, `mongodb/cdc/`, `mssqlserver/`) are *not* consumers in this sense — they construct `schema.Common` from a source database's metadata; the type-coverage question doesn't apply. Filter those out. + +3. **Per-consumer audit.** For each consumer, delegate to the Explore agent with the brief below. Run consumers in parallel. + + ```text + Working dir: + + Audit the consumer at : against the full schema.CommonType variant set: + Boolean, Int32, Int64, Float32, Float64, String, ByteArray, Object, Map, Array, + Null, Union, Timestamp, Date, TimeOfDay, UUID, Decimal, BigDecimal, Any. + + Report: + (a) Type-coverage table: for each variant, which target type the consumer maps to (or whether it errors). Cite file:line. + (b) Value-coercion handling: when a message value's Go type doesn't match the + schema-declared type, does the consumer coerce or fail loudly? Specifically + check these cross-type cases: + - time.Time value + schema-declared Timestamp + integer-typed target column + - time.Duration value + schema-declared TimeOfDay + integer-typed target column + - Numeric int64 value + schema-declared Timestamp + integer-typed target column (unit-aware scaling) + - Numeric int32 value + schema-declared Date + integer-typed target column + Cite the coercion function and its location. + (c) Verdict: COVERED | PARTIAL | GAP, with one-line justification. + + Reference implementations to compare against: + - iceberg shredder's coerceTemporalToNumeric in internal/impl/iceberg/shredder/temporal.go + - iceberg shredder's metadata-aware numeric scaling at temporal.go:208 onwards + - iceberg type_resolver's commonTypeToIcebergTypeRec in internal/impl/iceberg/type_resolver.go + + Under 300 words per consumer. + ``` + +4. **Aggregate.** Combine the per-consumer reports into a single matrix: + + ``` + | Consumer | Missing types | Missing coercions | Verdict | + |---|---|---|---| + | iceberg | (none) | (none) | COVERED | + | parquet_encode | … | … | … | + … + ``` + +5. **Recommend.** For each GAP / PARTIAL row, propose the fix shape (port from iceberg, add cases to switch, etc.). Reference implementations to mirror, by file path so the pointers stay valid as the codebase evolves: + + - Type coverage extension pattern: `internal/impl/parquet/processor_encode.go::parquetNodeFromCommonField` and `internal/impl/iceberg/type_resolver.go::commonTypeToIcebergTypeRec` — both have a case for every `schema.CommonType` variant with explicit loud-error arms for shapes the sink cannot express. + - Temporal-to-numeric coerce: `internal/impl/iceberg/shredder/temporal.go::coerceTemporalToNumeric` — the `time.Time → unit-scaled int64` helper used when the iceberg column is integer-typed but the schema metadata says Timestamp. + - Numeric-to-temporal scaling: `internal/impl/iceberg/shredder/temporal.go::convertTimestamp` (the `if n, ok := numericInt64(value); ok && common != nil && common.Type == schema.Timestamp` branch) — the metadata-aware unit interpretation for numeric values flowing into time-typed columns. + - JSON Schema format mapping: `internal/impl/confluent/common_to_json_schema.go::commonToJSONSchemaNode` — the `schema.Date → {format:"date"}` / `TimeOfDay → time` / `UUID → uuid` cases. + +## Output format + +By default, produce a Markdown report on stdout with these sections, in order: + +1. **Variant universe** — the full list of `schema.CommonType` values found, plus a delta vs the canonical list (above) so reviewers spot when benthos adds new variants. +2. **Consumer matrix** — one row per consumer, columns as above. +3. **Detailed findings** — per-consumer block with the Explore agent's report verbatim. +4. **Recommendations** — ranked by impact (a sink that customers actually use comes ahead of an internal-only path). + +If `--format=json` is passed, emit a structured JSON document with the same sections; useful for CI. + +If `--component=` is passed, audit only that one consumer (matched by directory name under `internal/impl/`). + +## Adding new consumers + +When adding a new consumer of `schema.Common`: + +1. Either add a case for every variant in your type switch, OR explicitly error on unsupported with a message that names which variant and points at the upstream coercion that would close the gap. +2. If your consumer accepts user-provided values, implement the temporal-to-numeric coercion bridge analogous to `coerceTemporalToNumeric`. The customer is going to flip `preserve_logical_types: true` and start sending `time.Time` values; without the bridge you'll crash on shred/encode time. +3. Add an integration test analogous to `internal/impl/iceberg/integration/schema_metadata_timestamp_test.go::TestIntegrationCoerceTemporalIntoExistingBigintColumn` that pre-creates the target with a numeric column type, sends a typed value through, and asserts the coerce path fires correctly. + +## Notes + +- Producers of `schema.Common` (CDC schema builders) are intentionally out of scope. Their type-mapping coverage is a separate question and varies per source database. +- This skill is read-only. It must not write code or commit changes — its job is to produce the report so a human can prioritise fixes. +- If a consumer's type switch is implemented across multiple files (e.g. iceberg has the switch in `type_resolver.go` plus value handling in `shredder/`), evaluate the consumer as a whole. +- When in doubt, run the existing test suites for the suspected consumer (`go test ./internal/impl//...`) to see what's actually exercised. Coverage gaps in production code rarely have corresponding test coverage. diff --git a/docs/modules/components/pages/outputs/iceberg.adoc b/docs/modules/components/pages/outputs/iceberg.adoc index afbdec1bb4..f26ae88d52 100644 --- a/docs/modules/components/pages/outputs/iceberg.adoc +++ b/docs/modules/components/pages/outputs/iceberg.adoc @@ -149,6 +149,7 @@ output: table_location: s3://my-iceberg-bucket/ # No default (optional) schema_metadata: "" new_column_type_mapping: "" # No default (optional) + require_schema_metadata: false commit: manifest_merge_enabled: true max_snapshot_age: 24h @@ -892,6 +893,15 @@ An optional Bloblang mapping to customize column types during schema evolution. *Type*: `string` +=== `schema_evolution.require_schema_metadata` + +When `true`, writing a numeric value into a `timestamp`, `timestamptz`, `date`, or `time` column without `schema_metadata` registered for that column is a hard error. The default `false` permits a fallback path that interprets bare numeric timestamps as Unix seconds and bare numeric times as already-microseconds — convenient, but silently wrong if upstream produced milliseconds. Enable this when you cannot guarantee the upstream attaches schema metadata and want to fail loudly rather than corrupt dates by ~50,000 years. No effect on time-typed columns receiving `time.Time`/`time.Duration` Go values, which carry their own unit unambiguously, and no effect on non-time columns. Requires `schema_metadata` to be set. + + +*Type*: `bool` + +*Default*: `false` + === `commit` Commit behavior configuration. diff --git a/go.mod b/go.mod index e8284d1c47..4b6ea7f23c 100644 --- a/go.mod +++ b/go.mod @@ -170,7 +170,7 @@ require ( github.com/timeplus-io/proton-go-driver/v2 v2.1.4 github.com/tmc/langchaingo v0.1.14 github.com/trinodb/trino-go-client v0.333.0 - github.com/twmb/avro v1.7.2 + github.com/twmb/avro v1.7.3-0.20260513193503-1e5c2a3fc070 github.com/twmb/franz-go v1.20.7 github.com/twmb/franz-go/pkg/kadm v1.17.2 github.com/twmb/franz-go/pkg/kmsg v1.12.0 diff --git a/go.sum b/go.sum index 826cadd456..6a6af88fca 100644 --- a/go.sum +++ b/go.sum @@ -1748,8 +1748,8 @@ github.com/trivago/grok v1.0.0/go.mod h1:9t59xLInhrncYq9a3J7488NgiBZi5y5yC7bss+w github.com/trivago/tgo v1.0.7 h1:uaWH/XIy9aWYWpjm2CU3RpcqZXmX2ysQ9/Go+d9gyrM= github.com/trivago/tgo v1.0.7/go.mod h1:w4dpD+3tzNIIiIfkWWa85w5/B77tlvdZckQ+6PkFnhc= github.com/tv42/httpunix v0.0.0-20150427012821-b75d8614f926/go.mod h1:9ESjWnEqriFuLhtthL60Sar/7RFoluCcXsuvEwTV5KM= -github.com/twmb/avro v1.7.2 h1:cmrEBRSbELRqsg/dRkQvVWuOaR2EfGifHIt/2iJ9lfI= -github.com/twmb/avro v1.7.2/go.mod h1:X0fT1dY2xcbV4YuCE4mYro+qljHl4kUF5uA/2z1rgSk= +github.com/twmb/avro v1.7.3-0.20260513193503-1e5c2a3fc070 h1:gYa95NoqeXYOMIVIe2/YcbC/l0s5q+wQ70Zo+TYQU4k= +github.com/twmb/avro v1.7.3-0.20260513193503-1e5c2a3fc070/go.mod h1:X0fT1dY2xcbV4YuCE4mYro+qljHl4kUF5uA/2z1rgSk= github.com/twmb/franz-go v1.20.7 h1:P4MGSXJjjAPP3NRGPCks/Lrq+j+twWMVl1qYCVgNmWY= github.com/twmb/franz-go v1.20.7/go.mod h1:0bRX9HZVaoueqFWhPZNi2ODnJL7DNa6mK0HeCrC2bNU= github.com/twmb/franz-go/pkg/kadm v1.17.2 h1:g5f1sAxnTkYC6G96pV5u715HWhxd66hWaDZUAQ8xHY8= diff --git a/internal/impl/confluent/avro_walker.go b/internal/impl/confluent/avro_walker.go index e1bad36110..30911d96d9 100644 --- a/internal/impl/confluent/avro_walker.go +++ b/internal/impl/confluent/avro_walker.go @@ -28,6 +28,10 @@ import ( // decoder path: // - time-millis/time-micros: time.Duration → time.Time (time-of-day) // - duration: avro.Duration → ISO 8601 duration string +// - decimal: raw []byte → json.Number for the SetStructuredMut path +// (Go's json.Marshal cannot natively format *big.Rat as a JSON +// number — it emits "33/100"-style fractional strings — so we +// pre-convert to json.Number for downstream bloblang/SQL use). func preserveLogicalTypeOpts() []avro.SchemaOpt { return []avro.SchemaOpt{ avro.NewCustomType[time.Time, int32]( @@ -55,8 +59,6 @@ func preserveLogicalTypeOpts() []avro.SchemaOpt { return avro.DurationFromBytes(b).String(), nil }, }, - // Decimal: raw type is []byte. Convert to json.Number for the - // SetStructuredMut path (json.Marshal can't handle *big.Rat). avro.CustomType{ LogicalType: "decimal", Decode: func(v any, node *avro.SchemaNode) (any, error) { diff --git a/internal/impl/confluent/common_to_avro.go b/internal/impl/confluent/common_to_avro.go index 560fde3ae1..fbb3b956ea 100644 --- a/internal/impl/confluent/common_to_avro.go +++ b/internal/impl/confluent/common_to_avro.go @@ -72,9 +72,48 @@ func commonToAvroInner(c schema.Common, recordName, namespace string, isRoot boo case schema.Any: return "bytes", nil case schema.Timestamp: + // Honour Logical.Timestamp params if present; legacy nil-Logical + // schemas fall through EffectiveTimestamp() to {Millis, UTC}, + // preserving the pre-PR encoder output exactly. + p := c.EffectiveTimestamp() + base := "long" + logicalName, err := avroTimestampLogicalName(p) + if err != nil { + return nil, fmt.Errorf("timestamp field %q: %w", c.Name, err) + } + return map[string]any{ + "type": base, + "logicalType": logicalName, + }, nil + case schema.Date: return map[string]any{ - "type": "long", - "logicalType": "timestamp-millis", + "type": "int", + "logicalType": "date", + }, nil + case schema.TimeOfDay: + if c.Logical == nil || c.Logical.TimeOfDay == nil { + return nil, fmt.Errorf("time-of-day field %q missing Logical.TimeOfDay parameters", c.Name) + } + // Avro time-{millis,micros} carry no zone semantics, so a + // TimeOfDay{AdjustToUTC=true} cannot be expressed faithfully. + // Reject loudly rather than silently dropping that bit. + if c.Logical.TimeOfDay.AdjustToUTC { + return nil, fmt.Errorf("time-of-day field %q has AdjustToUTC=true; Avro time-millis/time-micros have no UTC-adjust variant — coerce upstream before encoding", c.Name) + } + // Avro defines only time-millis (int) and time-micros (long); reject + // anything else loudly rather than silently downcasting. + switch c.Logical.TimeOfDay.Unit { + case schema.TimeUnitMillis: + return map[string]any{"type": "int", "logicalType": "time-millis"}, nil + case schema.TimeUnitMicros: + return map[string]any{"type": "long", "logicalType": "time-micros"}, nil + default: + return nil, fmt.Errorf("time-of-day field %q has unit %v which Avro cannot express; only MILLIS and MICROS are supported", c.Name, c.Logical.TimeOfDay.Unit) + } + case schema.UUID: + return map[string]any{ + "type": "string", + "logicalType": "uuid", }, nil case schema.Decimal: if c.Logical == nil || c.Logical.Decimal == nil { @@ -172,6 +211,28 @@ func commonToAvroUnion(c schema.Common) (any, error) { return variants, nil } +// avroTimestampLogicalName picks the Avro logical-type name for a +// TimestampParams value. The mapping mirrors the reverse path in +// applyAvroLogicalType: AdjustToUTC=true picks `timestamp-*`, false picks +// `local-timestamp-*`. Avro 1.10+ supports nanos; older brokers may reject it. +func avroTimestampLogicalName(p schema.TimestampParams) (string, error) { + var unit string + switch p.Unit { + case schema.TimeUnitMillis: + unit = "millis" + case schema.TimeUnitMicros: + unit = "micros" + case schema.TimeUnitNanos: + unit = "nanos" + default: + return "", fmt.Errorf("unsupported timestamp unit %v (Avro supports only MILLIS, MICROS, NANOS)", p.Unit) + } + if p.AdjustToUTC { + return "timestamp-" + unit, nil + } + return "local-timestamp-" + unit, nil +} + // sanitizeAvroName derives a valid Avro name from an arbitrary subject string. // Avro names must match [A-Za-z_][A-Za-z0-9_]*. Invalid characters are replaced // with underscores and a leading digit is prefixed with an underscore. diff --git a/internal/impl/confluent/common_to_avro_test.go b/internal/impl/confluent/common_to_avro_test.go index 8e857050bc..a9a985c452 100644 --- a/internal/impl/confluent/common_to_avro_test.go +++ b/internal/impl/confluent/common_to_avro_test.go @@ -219,3 +219,186 @@ func TestSanitizeAvroName(t *testing.T) { }) } } + +// TestCommonToAvroLogicalTypeRoundTrip drives both halves of the Avro +// adapter to confirm encode/decode are symmetric for every logical type +// the connector now preserves. A schema.Common produced by a synthetic +// decoder run, re-encoded to Avro JSON, must yield byte-equivalent output +// to the original Avro spec for the same field. +func TestCommonToAvroLogicalTypeRoundTrip(t *testing.T) { + cases := []struct { + name string + input schema.Common + wantOuter map[string]any + }{ + { + name: "Timestamp default (millis, UTC)", + input: schema.Common{Type: schema.Timestamp}, + wantOuter: map[string]any{"type": "long", "logicalType": "timestamp-millis"}, + }, + { + name: "Timestamp millis explicit", + input: schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMillis, AdjustToUTC: true}}, + }, + wantOuter: map[string]any{"type": "long", "logicalType": "timestamp-millis"}, + }, + { + name: "Timestamp micros UTC", + input: schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMicros, AdjustToUTC: true}}, + }, + wantOuter: map[string]any{"type": "long", "logicalType": "timestamp-micros"}, + }, + { + name: "Timestamp nanos UTC", + input: schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitNanos, AdjustToUTC: true}}, + }, + wantOuter: map[string]any{"type": "long", "logicalType": "timestamp-nanos"}, + }, + { + name: "Timestamp local micros", + input: schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMicros, AdjustToUTC: false}}, + }, + wantOuter: map[string]any{"type": "long", "logicalType": "local-timestamp-micros"}, + }, + { + name: "Date", + input: schema.Common{Type: schema.Date}, + wantOuter: map[string]any{"type": "int", "logicalType": "date"}, + }, + { + name: "TimeOfDay millis", + input: schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMillis}}, + }, + wantOuter: map[string]any{"type": "int", "logicalType": "time-millis"}, + }, + { + name: "TimeOfDay micros", + input: schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMicros}}, + }, + wantOuter: map[string]any{"type": "long", "logicalType": "time-micros"}, + }, + { + name: "UUID", + input: schema.Common{Type: schema.UUID}, + wantOuter: map[string]any{"type": "string", "logicalType": "uuid"}, + }, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + got := avroUnmarshal(t, tc.input, "", "") + m, ok := got.(map[string]any) + require.True(t, ok, "expected object output, got %T", got) + assert.Equal(t, tc.wantOuter["type"], m["type"]) + assert.Equal(t, tc.wantOuter["logicalType"], m["logicalType"]) + }) + } +} + +// TestCommonToAvroTimeOfDayRejectsAdjustToUTC verifies that the Avro +// encoder refuses to emit a TimeOfDay schema with AdjustToUTC=true. +// Avro's time-millis / time-micros logical types carry no UTC-adjust bit, +// so silently dropping it would lose the metadata. Fail loud instead. +func TestCommonToAvroTimeOfDayRejectsAdjustToUTC(t *testing.T) { + c := schema.Common{ + Name: "shift_start", + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMicros, AdjustToUTC: true}}, + } + _, err := commonToAvroSchema(c, "Row", "") + require.Error(t, err) + assert.Contains(t, err.Error(), "AdjustToUTC=true") + assert.Contains(t, err.Error(), "shift_start") +} + +// TestCommonToAvroTimeOfDayRejectsUnsupportedUnit verifies that the +// encoder refuses TimeOfDay schemas with units Avro doesn't define +// (Seconds, Nanos), with a field-naming error mentioning the supported +// units. Without this guard, an encoder downcast would silently change +// resolution. +func TestCommonToAvroTimeOfDayRejectsUnsupportedUnit(t *testing.T) { + for _, u := range []schema.TimeUnit{schema.TimeUnitSeconds, schema.TimeUnitNanos} { + t.Run(u.String(), func(t *testing.T) { + c := schema.Common{ + Name: "open_at", + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: u}}, + } + _, err := commonToAvroSchema(c, "Row", "") + require.Error(t, err) + assert.Contains(t, err.Error(), "MILLIS and MICROS") + assert.Contains(t, err.Error(), "open_at") + }) + } +} + +// TestCommonToAvroTimestampRejectsUnsupportedUnit verifies that the +// avroTimestampLogicalName helper rejects timestamp units Avro doesn't +// define (Seconds, and any other invalid TimeUnit). Avro 1.10+ supports +// millis/micros/nanos. +func TestCommonToAvroTimestampRejectsUnsupportedUnit(t *testing.T) { + c := schema.Common{ + Name: "event_time", + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitSeconds, AdjustToUTC: true}}, + } + _, err := commonToAvroSchema(c, "Row", "") + require.Error(t, err) + assert.Contains(t, err.Error(), "event_time") + assert.Contains(t, err.Error(), "MILLIS, MICROS, NANOS") +} + +// TestCommonToAvroDecodeEncodeRoundTrip ensures that decoding an Avro spec +// via ecsAvroParseFromBytes and re-encoding via commonToAvroSchema yields a +// schema with the same logicalType annotation. This is the symmetry contract +// future format adapters should maintain. +func TestCommonToAvroDecodeEncodeRoundTrip(t *testing.T) { + cases := []struct { + name string + fieldType string + }{ + {"timestamp-millis", `{"type":"long","logicalType":"timestamp-millis"}`}, + {"timestamp-micros", `{"type":"long","logicalType":"timestamp-micros"}`}, + {"local-timestamp-millis", `{"type":"long","logicalType":"local-timestamp-millis"}`}, + {"date", `{"type":"int","logicalType":"date"}`}, + {"time-millis", `{"type":"int","logicalType":"time-millis"}`}, + {"time-micros", `{"type":"long","logicalType":"time-micros"}`}, + {"uuid", `{"type":"string","logicalType":"uuid"}`}, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + spec := []byte(`{ + "type":"record","name":"Row", + "fields":[{"name":"f","type":` + tc.fieldType + `}] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{preserveLogicalTypes: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + + // Re-encode the decoded child and verify the logicalType is + // the same as what we started with. + f := c.Children[0] + out, err := commonToAvroSchema(f, "Inner", "") + require.NoError(t, err) + var rt map[string]any + require.NoError(t, json.Unmarshal([]byte(out), &rt)) + + var want map[string]any + require.NoError(t, json.Unmarshal([]byte(tc.fieldType), &want)) + assert.Equal(t, want["logicalType"], rt["logicalType"]) + }) + } +} diff --git a/internal/impl/confluent/common_to_json_schema.go b/internal/impl/confluent/common_to_json_schema.go index 6787cb3393..75b85f2932 100644 --- a/internal/impl/confluent/common_to_json_schema.go +++ b/internal/impl/confluent/common_to_json_schema.go @@ -59,6 +59,12 @@ func commonToJSONSchemaNode(c schema.Common) (map[string]any, error) { return commonToJSONSchemaUnion(c) case schema.Timestamp: return map[string]any{"type": "string", "format": "date-time"}, nil + case schema.Date: + return map[string]any{"type": "string", "format": "date"}, nil + case schema.TimeOfDay: + return map[string]any{"type": "string", "format": "time"}, nil + case schema.UUID: + return map[string]any{"type": "string", "format": "uuid"}, nil case schema.Decimal: if c.Logical == nil || c.Logical.Decimal == nil { return nil, fmt.Errorf("decimal field %q missing precision/scale", c.Name) diff --git a/internal/impl/confluent/common_to_json_schema_test.go b/internal/impl/confluent/common_to_json_schema_test.go index f4ca312b33..72870a0cf5 100644 --- a/internal/impl/confluent/common_to_json_schema_test.go +++ b/internal/impl/confluent/common_to_json_schema_test.go @@ -61,6 +61,29 @@ func TestCommonToJSONSchemaTimestamp(t *testing.T) { assert.Equal(t, "date-time", got["format"]) } +// TestCommonToJSONSchemaDateTimeUUID covers the three semantic types +// the schema-coverage audit flagged as missing. JSON Schema draft 2019-09 +// defines `date`, `time`, and `uuid` format keywords; mapping the +// schema.Common variants to those keeps the encoder symmetric with the +// rest of the type matrix. +func TestCommonToJSONSchemaDateTimeUUID(t *testing.T) { + t.Run("Date", func(t *testing.T) { + got := jsonSchemaUnmarshal(t, schema.Common{Type: schema.Date}) + assert.Equal(t, "string", got["type"]) + assert.Equal(t, "date", got["format"]) + }) + t.Run("TimeOfDay", func(t *testing.T) { + got := jsonSchemaUnmarshal(t, schema.Common{Type: schema.TimeOfDay}) + assert.Equal(t, "string", got["type"]) + assert.Equal(t, "time", got["format"]) + }) + t.Run("UUID", func(t *testing.T) { + got := jsonSchemaUnmarshal(t, schema.Common{Type: schema.UUID}) + assert.Equal(t, "string", got["type"]) + assert.Equal(t, "uuid", got["format"]) + }) +} + func TestCommonToJSONSchemaByteArray(t *testing.T) { got := jsonSchemaUnmarshal(t, schema.Common{Type: schema.ByteArray}) assert.Equal(t, "string", got["type"]) diff --git a/internal/impl/confluent/ecs_avro.go b/internal/impl/confluent/ecs_avro.go index 38187febc5..36470b51a9 100644 --- a/internal/impl/confluent/ecs_avro.go +++ b/internal/impl/confluent/ecs_avro.go @@ -1,4 +1,4 @@ -// Copyright 2024 Redpanda Data, Inc. +// Copyright 2026 Redpanda Data, Inc. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. @@ -19,6 +19,7 @@ import ( "errors" "fmt" "math/big" + "strings" "github.com/redpanda-data/benthos/v4/public/schema" ) @@ -44,6 +45,49 @@ func avroSpecInt32(v any) (int32, error) { type ecsAvroConfig struct { rawUnion bool // Whether unions are going to be serialized as raw JSON + + // preserveLogicalTypes controls whether Avro logical-type annotations + // (timestamp-{millis,micros,nanos}, local-timestamp-*, date, + // time-{millis,micros}, uuid) are promoted to their semantic + // schema.Common type in the parsed metadata. When false the metadata + // keeps the base primitive (Int64/Int32/String) — matching the value + // side under the same flag, where twmb/avro returns raw long/int/ + // string for these types unless preserve_logical_types asks for the + // richer Go time/uuid values. + // + // The decimal logical type is intentionally excluded from this gate: + // it pre-dates the preserve_logical_types contract and is load-bearing + // for the value-side normaliseAvroDecimals path, which keys off + // schema.Common.Type == Decimal. Treating decimal as always-on keeps + // existing pipelines that rely on schema metadata for decimal values + // (without enabling preserve_logical_types) working unchanged. + preserveLogicalTypes bool + + // translateKafkaConnectTypes mirrors the schema_registry_decode + // processor's flag of the same name. When true, the metadata parser + // also recognises Kafka Connect / Debezium `connect.name` + // annotations and maps them to their corresponding schema.Common + // types (Date, Timestamp(Unit), etc.), keeping the metadata side in + // lock-step with the value side (kafkaConnectTypeOpt in + // avro_walker.go). Without this, a Debezium-sourced pipeline using + // `translate_kafka_connect_types: true` produces time.Time values + // but the metadata still claims Int64 — the exact mismatch fixed + // for sibling-form Avro by [applyAvroLogicalType]. + translateKafkaConnectTypes bool + + // names is the lexical-scope registry of resolved record/enum/fixed + // definitions, keyed by Avro fullname (and short name for + // convenience). Stored values must be treated as immutable — every + // retrieval clones via cloneCommon so callers can mutate freely + // without corrupting later look-ups. + names map[string]schema.Common + + // namespace is the enclosing Avro namespace, threaded through the + // recursion by value. It is updated when entering a named-type + // declaration that introduces a new namespace, per the Avro spec's + // inheritance rule (a name with no dots and no `namespace` field + // inherits the most tightly enclosing namespace). + namespace string } // ecsAvroParseFromBytes parses an Avro JSON spec into a schema.Common. The @@ -51,6 +95,9 @@ type ecsAvroConfig struct { // decimal field paths during value normalisation; callers that just want // the metadata copy can call ToAny() on the result. func ecsAvroParseFromBytes(cfg ecsAvroConfig, specBytes []byte) (schema.Common, error) { + if cfg.names == nil { + cfg.names = map[string]schema.Common{} + } var as any if err := json.Unmarshal(specBytes, &as); err != nil { return schema.Common{}, err @@ -79,51 +126,409 @@ func ecsAvroParseFromBytes(cfg ecsAvroConfig, specBytes []byte) (schema.Common, return schema.Common{}, fmt.Errorf("expected either an array or object at root of schema, got %T", as) } -// If the union is actually just a verbose way of defining an optional field -// then we return the real type and true. E.g. if we see: +// ecsAvroResolveOptionalUnion checks whether a 2-element union is just a +// nullable wrapper (one branch is "null") and returns the resolved non-null +// branch as a Common with Optional=true. +// +// Handles both orderings ([null, X] and [X, null]) and resolves the non-null +// branch in three forms: // -// `"type": [ "null", "string" ]` +// - a primitive type name string (e.g. "string"), +// - a previously-defined named-type reference string (e.g. "Fee"), +// resolved via cfg.names per the Avro lexical-scope rule, +// - an inline type definition object (e.g. {"type":"record",...}). // -// Then we return string and true. -func ecsAvroIsUnionJustOptional(types []any) (schema.CommonType, bool) { +// The matched bool reports whether the union has the [null, X] / [X, null] +// shape; the error is non-nil only when the shape matched but resolving the +// non-null branch failed (e.g. a malformed inline decimal). Callers must +// surface the error rather than falling through to the general-union path — +// the fall-through would also fail, with a less informative message. +func ecsAvroResolveOptionalUnion(cfg ecsAvroConfig, types []any) (resolved schema.Common, matched bool, err error) { if len(types) != 2 { - return schema.CommonType(-1), false + return schema.Common{}, false, nil } + var other any + for _, t := range types { + if s, ok := t.(string); ok && s == "null" { + continue + } + if other != nil { + // Two non-null branches — not a nullable wrapper. + return schema.Common{}, false, nil + } + other = t + } + if other == nil { + return schema.Common{}, false, nil + } + inner, err := ecsAvroResolveTypeRef(cfg, other) + if err != nil { + return schema.Common{}, true, err + } + inner.Optional = true + return inner, true, nil +} - firstTypeStr, ok := types[0].(string) - if !ok || firstTypeStr != "null" { - return schema.CommonType(-1), false +// ecsAvroResolveTypeRef resolves a single Avro type reference — the value +// of a "type" field, or one branch of a union — to a Common. The reference +// may be a primitive type name string, a previously-defined named-type +// reference string (resolved via cfg.names), or an inline type definition +// object. +// +// Unknown string names fall back to schema.Any so downstream sinks see a +// sensible (if structureless) column. An error is returned only when an +// inline object fails to parse (the wrapped cause flows back to the caller) +// or when ref is neither a string nor a map (a malformed Avro JSON shape +// the upstream parser couldn't reject). +func ecsAvroResolveTypeRef(cfg ecsAvroConfig, ref any) (schema.Common, error) { + switch b := ref.(type) { + case string: + // Try the names map first so a name reference takes priority over + // the schema.Any fallback in ecsAvroTypeToCommon. Primitive names + // are never registered in the map, so primitives reach the + // fallback unchanged. + if resolved, ok := ecsAvroLookupName(cfg, b); ok { + return resolved, nil + } + return schema.Common{Type: ecsAvroTypeToCommon(b)}, nil + case map[string]any: + return ecsAvroFromAnyMap(cfg, b) } + return schema.Common{}, fmt.Errorf("expected type reference to be a string or object, got %T", ref) +} - secondTypeStr, ok := types[1].(string) +// applyAvroLogicalType reads the optional "logicalType" annotation from an +// Avro schema node and refines c (whose Type was already set from the base +// primitive) with the matching schema.Common type and Logical parameters. +// +// Returns (true, nil) when a logical type was recognised and fully populated +// c; the caller should treat c as final and skip the container-handling +// switch in [ecsAvroFromAnyMap]. Returns (false, nil) when no logicalType is +// present or it isn't recognised — per the Avro 1.10 spec, readers must +// silently fall back to the base type for unknown logical types. Returns an +// error only for malformed annotations on a known logical type (e.g. a +// decimal missing precision, a primitive mismatch we want to flag). +func applyAvroLogicalType(cfg ecsAvroConfig, c *schema.Common, as map[string]any) (bool, error) { + logical, ok := as["logicalType"].(string) if !ok { - return schema.CommonType(-1), false + return false, nil + } + + // Gate non-decimal logical types on preserve_logical_types so that the + // schema metadata stays coherent with the value side. With + // preserve_logical_types=false the value side emits raw long/int/string + // for timestamps/dates/times/uuids; surfacing TIMESTAMP/DATE/TIME_OF_DAY/ + // UUID in metadata under that flag would re-introduce the metadata-vs- + // value divergence #4399 was meant to close. Decimal is exempt — see + // the field comment on ecsAvroConfig.preserveLogicalTypes. + if !cfg.preserveLogicalTypes && logical != "decimal" { + return false, nil } - return ecsAvroTypeToCommon(secondTypeStr), true + switch logical { + case "decimal": + // Per the Avro spec, decimal sits on top of bytes or fixed only. + // Other base primitives are malformed — fall back to the base + // type and ignore the annotation, matching how we treat every + // other primitive/logical-type mismatch below. Note that + // ecsAvroTypeToCommon maps `fixed` to schema.Any (it's not in + // the named primitive set), so we allow Any through too. + // + // Caveat: promotion is one-way. After this branch fires the + // resulting schema.Common is just `Decimal(P,S)` — the fact that + // the Avro source was `fixed[N]` rather than `bytes` is gone. + // Every downstream consumer of schema.Common (iceberg, parquet, + // JSON Schema, value-side normaliseAvroDecimals) treats decimal + // purely by precision/scale and picks its own physical encoding, + // so this is lossless for them. The single leak is + // common_to_avro.go's re-emit path, which will always produce a + // `bytes`-backed decimal even when the source schema was + // fixed-backed — semantically equivalent but a different Avro + // schema shape, which can matter for Schema Registry + // compatibility checks that compare by equality. + if c.Type != schema.ByteArray && c.Type != schema.Any { + return false, nil + } + p, err := avroSpecInt32(as["precision"]) + if err != nil { + return false, fmt.Errorf("decimal precision: %w", err) + } + // Per the Avro spec scale is optional and defaults to 0 when absent; + // only an unparseable scale value (wrong type, non-integer) is an error. + var s int32 + if _, present := as["scale"]; present { + s, err = avroSpecInt32(as["scale"]) + if err != nil { + return false, fmt.Errorf("decimal scale: %w", err) + } + } + c.Type = schema.Decimal + c.Logical = &schema.LogicalParams{ + Decimal: &schema.DecimalParams{Precision: p, Scale: s}, + } + if err := c.Validate(); err != nil { + return false, fmt.Errorf("decimal field: %w", err) + } + return true, nil + + case "uuid": + // Avro UUID logical type sits on top of `string`. If the base + // primitive doesn't match, treat as unknown and fall back per spec. + if c.Type != schema.String { + return false, nil + } + c.Type = schema.UUID + return true, nil + + case "date": + if c.Type != schema.Int32 { + return false, nil + } + c.Type = schema.Date + return true, nil + + case "time-millis": + if c.Type != schema.Int32 { + return false, nil + } + c.Type = schema.TimeOfDay + c.Logical = &schema.LogicalParams{ + TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMillis}, + } + return true, nil + + case "time-micros": + if c.Type != schema.Int64 { + return false, nil + } + c.Type = schema.TimeOfDay + c.Logical = &schema.LogicalParams{ + TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMicros}, + } + return true, nil + + case "duration": + // Avro `duration` logical type sits on top of `fixed` (12 bytes). + // Our value side (preserveLogicalTypeOpts in avro_walker.go) + // decodes it to an ISO 8601 duration string. schema.Common has + // no dedicated Duration type, so we map metadata to schema.String + // — matching what the value side produces and giving iceberg a + // VARCHAR column the operator can query as text. + if c.Type != schema.ByteArray && c.Type != schema.Any { + return false, nil + } + c.Type = schema.String + return true, nil + + case "timestamp-millis", "timestamp-micros", "timestamp-nanos", + "local-timestamp-millis", "local-timestamp-micros", "local-timestamp-nanos": + // All six timestamp logical types use long as their base primitive. + if c.Type != schema.Int64 { + return false, nil + } + unit := avroLogicalTimestampUnit(logical) + adjust := !strings.HasPrefix(logical, "local-") + c.Type = schema.Timestamp + c.Logical = &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{Unit: unit, AdjustToUTC: adjust}, + } + return true, nil + + default: + // Unknown logicalType. Per Avro 1.10 spec readers must ignore. + return false, nil + } } -// ecsAvroIsUnionJustOptionalObject mirrors ecsAvroIsUnionJustOptional but -// for the [null, {object}] shape — Avro's idiom for a nullable named or -// logically-typed field. Returns the resolved Common (with any -// LogicalParams populated) and true on match. -func ecsAvroIsUnionJustOptionalObject(cfg ecsAvroConfig, types []any) (schema.Common, bool) { - if len(types) != 2 { - return schema.Common{}, false +// avroLogicalTimestampUnit maps the suffix of an Avro timestamp logical-type +// name to the corresponding [schema.TimeUnit]. Both `timestamp-*` and +// `local-timestamp-*` variants share the suffix → unit mapping. +// +// The caller in [applyAvroLogicalType] has already filtered to one of the +// six supported timestamp logical-type names, so the default branch is +// unreachable. We panic rather than return an invalid TimeUnit(0) +// sentinel that would propagate a malformed Logical past Validate(). +func avroLogicalTimestampUnit(logical string) schema.TimeUnit { + switch { + case strings.HasSuffix(logical, "-millis"): + return schema.TimeUnitMillis + case strings.HasSuffix(logical, "-micros"): + return schema.TimeUnitMicros + case strings.HasSuffix(logical, "-nanos"): + return schema.TimeUnitNanos + default: + panic(fmt.Sprintf("unreachable: avroLogicalTimestampUnit called with non-timestamp logical type %q", logical)) } - firstStr, ok := types[0].(string) - if !ok || firstStr != "null" { - return schema.Common{}, false +} + +// applyKafkaConnectType reads the optional "connect.name" annotation +// from an Avro schema node (the Kafka Connect / Debezium convention for +// expressing extended semantic types alongside the Avro base primitive) +// and refines c with the matching schema.Common type and Logical params. +// +// This mirrors [kafkaConnectTypeOpt] in avro_walker.go which handles the +// same annotations on the value side: with translate_kafka_connect_types +// enabled the value becomes a time.Time for the temporal entries. Without +// the matching metadata-side mapping the @schema would still claim the +// raw base primitive (Int64/Int32/String) and downstream sinks (notably +// iceberg) would create columns of the wrong type — the same value-vs- +// metadata mismatch [applyAvroLogicalType] fixes for sibling-form +// logicalType. So we gate this on translateKafkaConnectTypes and apply +// it from the same call sites. +// +// Year is mapped to Date rather than a dedicated type: the value side +// returns time.Time at January 1 of the year, which round-trips through +// an iceberg DATE column as days-since-epoch without losing the year +// component (and DATE columns are widely supported by query engines). +// +// Time and Timestamp both map to schema.Timestamp despite Time being +// semantically a time-of-day. The value side returns time.Time for +// both (kafkaConnectTypeOpt's case "io.debezium.time.Timestamp", +// "io.debezium.time.Time"), so the metadata side matches that shape to +// keep the value/metadata contract symmetrical. Operators who need a +// distinct TIME column for time-of-day fields can override via +// new_column_type_mapping. +// +// Returns (false, nil) when the annotation is absent, when it isn't +// recognised (per the Avro convention of silently ignoring unknown +// properties), or when translateKafkaConnectTypes is disabled. +func applyKafkaConnectType(cfg ecsAvroConfig, c *schema.Common, as map[string]any) (bool, error) { + if !cfg.translateKafkaConnectTypes { + return false, nil } - secondMap, ok := types[1].(map[string]any) - if !ok { - return schema.Common{}, false + name, ok := as["connect.name"].(string) + if !ok || name == "" { + return false, nil } - c, err := ecsAvroFromAnyMap(cfg, secondMap) - if err != nil { - return schema.Common{}, false + + switch name { + case "io.debezium.time.Date": + // Wire is int32 days-since-epoch; map straight to Date. + if c.Type != schema.Int32 { + return false, nil + } + c.Type = schema.Date + return true, nil + + case "io.debezium.time.Year": + // Wire is int32 year. The value side returns time.Time at + // Jan 1 of the year; mapping to Date keeps that representation + // faithful end-to-end (days-since-epoch is well-defined for + // any Jan 1 value). + if c.Type != schema.Int32 { + return false, nil + } + c.Type = schema.Date + return true, nil + + case "io.debezium.time.Timestamp", "io.debezium.time.Time": + // Both wire as int64 milliseconds. See the doc comment for + // the rationale on conflating Time with Timestamp. + if c.Type != schema.Int64 { + return false, nil + } + c.Type = schema.Timestamp + c.Logical = &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitMillis, AdjustToUTC: true, + }, + } + return true, nil + + case "io.debezium.time.MicroTimestamp", "io.debezium.time.MicroTime": + if c.Type != schema.Int64 { + return false, nil + } + c.Type = schema.Timestamp + c.Logical = &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitMicros, AdjustToUTC: true, + }, + } + return true, nil + + case "io.debezium.time.NanoTimestamp", "io.debezium.time.NanoTime": + if c.Type != schema.Int64 { + return false, nil + } + c.Type = schema.Timestamp + c.Logical = &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitNanos, AdjustToUTC: true, + }, + } + return true, nil + + case "io.debezium.time.ZonedTimestamp": + // Wire is string (RFC3339). The value side parses it to a + // time.Time and the AdjustToUTC=true semantics match the + // time-zone-aware nature of the type. + if c.Type != schema.String { + return false, nil + } + c.Type = schema.Timestamp + c.Logical = &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitMillis, AdjustToUTC: true, + }, + } + return true, nil + + default: + // Unknown connect.name. Per Avro convention readers must + // silently ignore unrecognised annotations. + return false, nil } - return c, true +} + +// ecsAvroLookupName resolves a string reference to a previously-registered +// named type, applying Avro's name-resolution rules: a reference that +// contains a dot is treated as a fullname; an unqualified reference is +// looked up first against the enclosing namespace, then against the bare +// name as a fallback for root-scope references. +// +// The returned Common is cloned so callers can mutate it freely without +// corrupting the registered entry. +func ecsAvroLookupName(cfg ecsAvroConfig, ref string) (schema.Common, bool) { + if !strings.ContainsRune(ref, '.') && cfg.namespace != "" { + if resolved, ok := cfg.names[cfg.namespace+"."+ref]; ok { + return cloneCommon(resolved), true + } + } + if resolved, ok := cfg.names[ref]; ok { + return cloneCommon(resolved), true + } + return schema.Common{}, false +} + +// cloneCommon deep-copies a schema.Common, allocating fresh slice and +// pointer storage for Children and Logical so the result aliases nothing +// with the source. +func cloneCommon(c schema.Common) schema.Common { + if c.Children != nil { + children := make([]schema.Common, len(c.Children)) + for i := range c.Children { + children[i] = cloneCommon(c.Children[i]) + } + c.Children = children + } + if c.Logical != nil { + l := *c.Logical + if l.Decimal != nil { + d := *l.Decimal + l.Decimal = &d + } + if l.Timestamp != nil { + ts := *l.Timestamp + l.Timestamp = &ts + } + if l.TimeOfDay != nil { + tod := *l.TimeOfDay + l.TimeOfDay = &tod + } + c.Logical = &l + } + return c } func ecsAvroTypeToCommon(t string) schema.CommonType { @@ -157,38 +562,28 @@ func ecsAvroTypeToCommon(t string) schema.CommonType { } func ecsAvroHydrateRawUnion(cfg ecsAvroConfig, c *schema.Common, types []any) error { - // [null, primitive-name] → Optional . - if t, optional := ecsAvroIsUnionJustOptional(types); optional { - c.Type, c.Optional = t, true - return nil - } - // [null, {object}] → Optional , propagating logical params and - // nested children. This catches the common Avro idiom for nullable - // decimal/timestamp/etc. logical types. - if inner, ok := ecsAvroIsUnionJustOptionalObject(cfg, types); ok { + // [null, X] or [X, null] → Optional X. ecsAvroResolveOptionalUnion + // handles primitive names, named-type references, and inline objects + // in either ordering. + if inner, matched, err := ecsAvroResolveOptionalUnion(cfg, types); matched { + if err != nil { + return fmt.Errorf("union `%v`: %w", c.Name, err) + } name := c.Name *c = inner if name != "" { c.Name = name } - c.Optional = true return nil } c.Type = schema.Union for i, uObj := range types { - switch ut := uObj.(type) { - case string: - c.Children = append(c.Children, schema.Common{ - Type: ecsAvroTypeToCommon(ut), - }) - case map[string]any: - tmpC, err := ecsAvroFromAnyMap(cfg, ut) - if err != nil { - return fmt.Errorf("union `%v` child '%v': %w", c.Name, i, err) - } - c.Children = append(c.Children, tmpC) + child, err := ecsAvroResolveTypeRef(cfg, uObj) + if err != nil { + return fmt.Errorf("union `%v` child '%v': %w", c.Name, i, err) } + c.Children = append(c.Children, child) } return nil } @@ -196,19 +591,14 @@ func ecsAvroHydrateRawUnion(cfg ecsAvroConfig, c *schema.Common, types []any) er func ecsAvroHydrateLameUnion(cfg ecsAvroConfig, c *schema.Common, types []any) error { c.Type = schema.Union for i, uObj := range types { - var childT schema.Common - - switch ut := uObj.(type) { - case string: - childT = schema.Common{ - Name: ut, - Type: ecsAvroTypeToCommon(ut), - } - case map[string]any: - var err error - if childT, err = ecsAvroFromAnyMap(cfg, ut); err != nil { - return fmt.Errorf("union `%v` child '%v': %w", c.Name, i, err) - } + childT, err := ecsAvroResolveTypeRef(cfg, uObj) + if err != nil { + return fmt.Errorf("union `%v` child '%v': %w", c.Name, i, err) + } + if s, isStr := uObj.(string); isStr { + // Lame-union children keep the type-name as the Common.Name so + // the tagged-JSON envelope key matches the Avro wire form. + childT.Name = s } if childT.Type == schema.Null { @@ -230,6 +620,87 @@ func ecsAvroHydrateLameUnion(cfg ecsAvroConfig, c *schema.Common, types []any) e } func ecsAvroFromAnyMap(cfg ecsAvroConfig, as map[string]any) (schema.Common, error) { + // Pre-register a structural placeholder before walking children so a + // self-reference (e.g. a linked-list record with a `next` field of its + // own type) resolves to a one-level stub rather than collapsing to + // schema.Any. The placeholder is overwritten with the fully-resolved + // Common once the walk completes. Mutual recursion across distinct + // records is still not supported — the second record's placeholder + // does not exist while the first is being walked. + typeName, _ := as["type"].(string) + fullname, shortName, childNamespace := ecsAvroAssignFullname(cfg.namespace, typeName, as) + if fullname != "" { + placeholder := ecsAvroPlaceholder(typeName, shortName) + cfg.names[fullname] = placeholder + if shortName != "" && shortName != fullname { + cfg.names[shortName] = placeholder + } + // Inheritable namespace propagates into the child walk; sibling + // scopes are unaffected because cfg is passed by value. + cfg.namespace = childNamespace + } + + c, err := ecsAvroFromAnyMapImpl(cfg, as) + if err == nil && fullname != "" { + cfg.names[fullname] = c + if shortName != "" && shortName != fullname { + cfg.names[shortName] = c + } + } + return c, err +} + +// ecsAvroAssignFullname computes the Avro fullname of a named-type +// declaration ([record, enum, fixed]) from its declaration map and the +// enclosing namespace, alongside the short name and the namespace that +// should be threaded into the child walk. Returns empty fullname when the +// node is not a named-type declaration or lacks a `name` field. +// +// Per the Avro spec (`Names` section): +// 1. If `name` contains a dot, it IS the fullname and any `namespace` +// field is ignored. +// 2. Else if `namespace` is set, the fullname is `namespace.name`. +// 3. Else the fullname inherits the enclosing namespace. +func ecsAvroAssignFullname(enclosing, typeName string, as map[string]any) (fullname, shortName, childNamespace string) { + switch typeName { + case "record", "enum", "fixed": + default: + return "", "", enclosing + } + name, _ := as["name"].(string) + if name == "" { + return "", "", enclosing + } + if idx := strings.LastIndex(name, "."); idx >= 0 { + return name, name[idx+1:], name[:idx] + } + if ns, _ := as["namespace"].(string); ns != "" { + return ns + "." + name, name, ns + } + if enclosing != "" { + return enclosing + "." + name, name, enclosing + } + return name, name, "" +} + +// ecsAvroPlaceholder returns the structural stub that stands in for a +// self-referencing named type while its definition is being walked. The +// placeholder uses the short name (matching what ecsAvroFromAnyMapImpl +// would set) and the closest leaf type so downstream sinks see a coherent +// shape rather than schema.Any. +func ecsAvroPlaceholder(typeName, shortName string) schema.Common { + switch typeName { + case "record": + return schema.Common{Name: shortName, Type: schema.Object} + case "enum": + return schema.Common{Name: shortName, Type: schema.String} + case "fixed": + return schema.Common{Name: shortName, Type: schema.ByteArray} + } + return schema.Common{Name: shortName} +} + +func ecsAvroFromAnyMapImpl(cfg ecsAvroConfig, as map[string]any) (schema.Common, error) { var c schema.Common c.Name, _ = as["name"].(string) @@ -244,11 +715,38 @@ func ecsAvroFromAnyMap(cfg ecsAvroConfig, as map[string]any) (schema.Common, err return c, err } } - // Return early: the union hydrators fully populate c. The bottom - // switch must not read fields/values/items from as, which is the outer - // field object, not the resolved inner type. + // The union hydrators fully populate c (including any nested + // children and logical params), so the bottom switch must not run - + // it would try to read fields/values/items from `as`, which is the + // outer field object, not the resolved inner type. + // + // Before returning, apply any `logicalType` annotation that sits as + // a sibling of `type` on the field-level object. This is the + // Java/JDBC Avro idiom, where a nullable timestamp field is written + // as `{"type": ["null", "long"], "logicalType": "timestamp-millis"}` + // rather than nesting the annotation inside the union element. The + // value-side decoder honours both shapes, so we must too - otherwise + // the resulting Common reports the base primitive and consumers + // (e.g. the iceberg output) end up with a column type that mismatches + // the decoded values. + if _, err := applyAvroLogicalType(cfg, &c, as); err != nil { + return c, err + } + if _, err := applyKafkaConnectType(cfg, &c, as); err != nil { + return c, err + } return c, nil case string: + // String form may be a primitive type name OR a name reference to + // a previously-defined record/enum/fixed in lexical scope. + if resolved, ok := ecsAvroLookupName(cfg, t); ok { + fieldName := c.Name + c = resolved + if fieldName != "" { + c.Name = fieldName + } + return c, nil + } c.Type = ecsAvroTypeToCommon(t) case map[string]any: // The type field is an object (e.g. {"type":"map","values":"long"}). @@ -270,27 +768,15 @@ func ecsAvroFromAnyMap(cfg ecsAvroConfig, as map[string]any) (schema.Common, err return schema.Common{}, fmt.Errorf("expected `type` field of type string or array, got %T", t) } - if logical, ok := as["logicalType"].(string); ok && logical == "decimal" { - p, err := avroSpecInt32(as["precision"]) - if err != nil { - return schema.Common{}, fmt.Errorf("decimal precision: %w", err) - } - // Per the Avro spec scale is optional and defaults to 0 when absent; - // only an unparseable scale value (wrong type, non-integer) is an error. - var s int32 - if _, present := as["scale"]; present { - s, err = avroSpecInt32(as["scale"]) - if err != nil { - return schema.Common{}, fmt.Errorf("decimal scale: %w", err) - } - } - c.Type = schema.Decimal - c.Logical = &schema.LogicalParams{ - Decimal: &schema.DecimalParams{Precision: p, Scale: s}, - } - if err := c.Validate(); err != nil { - return schema.Common{}, fmt.Errorf("decimal field: %w", err) - } + if applied, err := applyAvroLogicalType(cfg, &c, as); err != nil { + return schema.Common{}, err + } else if applied { + return c, nil + } + + if applied, err := applyKafkaConnectType(cfg, &c, as); err != nil { + return schema.Common{}, err + } else if applied { return c, nil } diff --git a/internal/impl/confluent/ecs_avro_field_level_logical_type_test.go b/internal/impl/confluent/ecs_avro_field_level_logical_type_test.go new file mode 100644 index 0000000000..6788e37b6f --- /dev/null +++ b/internal/impl/confluent/ecs_avro_field_level_logical_type_test.go @@ -0,0 +1,493 @@ +// Copyright 2026 Redpanda Data, Inc. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package confluent + +import ( + "encoding/json" + "fmt" + "testing" + "time" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + "github.com/twmb/avro" + + "github.com/redpanda-data/benthos/v4/public/schema" +) + +// TestEcsAvroFieldLevelLogicalType verifies that ecsAvroParseFromBytes +// honours a `logicalType` annotation regardless of where it sits relative +// to its `type` declaration. +// +// Two idiomatic shapes appear in the wild for a nullable logically-typed +// field: +// +// 1. logicalType nested inside the union's object element (spec-blessed): +// +// { +// "name": "ts", +// "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}] +// } +// +// 2. logicalType as a sibling of `type` at the field level (Java/JDBC idiom): +// +// { +// "name": "ts", +// "type": ["null", "long"], +// "logicalType": "timestamp-millis" +// } +// +// Both must resolve to the same Common. The value-side decoder honours +// both, so the schema-metadata side must agree — otherwise downstream +// consumers (e.g. the iceberg output) pick column types from the metadata +// that mismatch what the decoded values actually are. +func TestEcsAvroFieldLevelLogicalType(t *testing.T) { + type expect struct { + typ schema.CommonType + unit schema.TimeUnit // for timestamp/time-of-day + adjustUTC bool // for timestamp/time-of-day + precision int32 // for decimal + scale int32 // for decimal + } + + cases := []struct { + name string + baseType string // the second element of the union when in sibling form + logical string + precision int32 + scale int32 + want expect + }{ + { + name: "timestamp-millis", baseType: "long", logical: "timestamp-millis", + want: expect{typ: schema.Timestamp, unit: schema.TimeUnitMillis, adjustUTC: true}, + }, + { + name: "timestamp-micros", baseType: "long", logical: "timestamp-micros", + want: expect{typ: schema.Timestamp, unit: schema.TimeUnitMicros, adjustUTC: true}, + }, + { + name: "local-timestamp-millis", baseType: "long", logical: "local-timestamp-millis", + want: expect{typ: schema.Timestamp, unit: schema.TimeUnitMillis, adjustUTC: false}, + }, + { + name: "date", baseType: "int", logical: "date", + want: expect{typ: schema.Date}, + }, + { + name: "time-millis", baseType: "int", logical: "time-millis", + want: expect{typ: schema.TimeOfDay, unit: schema.TimeUnitMillis}, + }, + { + name: "time-micros", baseType: "long", logical: "time-micros", + want: expect{typ: schema.TimeOfDay, unit: schema.TimeUnitMicros}, + }, + { + name: "uuid", baseType: "string", logical: "uuid", + want: expect{typ: schema.UUID}, + }, + { + name: "decimal", baseType: "bytes", logical: "decimal", precision: 38, scale: 2, + want: expect{typ: schema.Decimal, precision: 38, scale: 2}, + }, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + t.Run("nested-in-type", func(t *testing.T) { + inner := fmt.Sprintf(`{"type":"%s","logicalType":"%s"`, tc.baseType, tc.logical) + if tc.logical == "decimal" { + inner += fmt.Sprintf(`,"precision":%d,"scale":%d`, tc.precision, tc.scale) + } + inner += "}" + spec := fmt.Appendf(nil, `{ + "type": "record", "name": "Row", + "fields": [{"name": "v", "type": ["null", %s]}] + }`, inner) + assertLogicalField(t, spec, tc.want) + }) + + t.Run("sibling-of-type", func(t *testing.T) { + extras := "" + if tc.logical == "decimal" { + extras = fmt.Sprintf(`,"precision":%d,"scale":%d`, tc.precision, tc.scale) + } + spec := fmt.Appendf(nil, `{ + "type": "record", "name": "Row", + "fields": [{"name": "v", "type": ["null", "%s"], "logicalType": "%s"%s}] + }`, tc.baseType, tc.logical, extras) + assertLogicalField(t, spec, tc.want) + }) + }) + } +} + +// TestEcsAvroFieldLevelLogicalType_LegacyShapeWithoutFlag verifies that when +// preserve_logical_types is false, the metadata parser keeps the base +// primitive — preserving the pre-#4399 shape byte-for-byte and avoiding any +// observable change for pipelines that have not opted into logical-type +// preservation. Decimal stays preserved because it pre-dates the flag and +// is load-bearing for normaliseAvroDecimals. +func TestEcsAvroFieldLevelLogicalType_LegacyShapeWithoutFlag(t *testing.T) { + type expect struct { + typ schema.CommonType + decimal bool // expect Decimal type with precision/scale + precision int32 + scale int32 + } + + cases := []struct { + name string + baseType string + logical string + want expect + }{ + {name: "timestamp-millis", baseType: "long", logical: "timestamp-millis", want: expect{typ: schema.Int64}}, + {name: "timestamp-micros", baseType: "long", logical: "timestamp-micros", want: expect{typ: schema.Int64}}, + {name: "local-timestamp-millis", baseType: "long", logical: "local-timestamp-millis", want: expect{typ: schema.Int64}}, + {name: "date", baseType: "int", logical: "date", want: expect{typ: schema.Int32}}, + {name: "time-millis", baseType: "int", logical: "time-millis", want: expect{typ: schema.Int32}}, + {name: "time-micros", baseType: "long", logical: "time-micros", want: expect{typ: schema.Int64}}, + {name: "uuid", baseType: "string", logical: "uuid", want: expect{typ: schema.String}}, + // Decimal is intentionally still preserved (always-on) — see the + // preserveLogicalTypes field comment on ecsAvroConfig. + {name: "decimal", baseType: "bytes", logical: "decimal", want: expect{typ: schema.Decimal, decimal: true, precision: 38, scale: 2}}, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + t.Run("nested-in-type", func(t *testing.T) { + inner := fmt.Sprintf(`{"type":"%s","logicalType":"%s"`, tc.baseType, tc.logical) + if tc.want.decimal { + inner += fmt.Sprintf(`,"precision":%d,"scale":%d`, tc.want.precision, tc.want.scale) + } + inner += "}" + spec := fmt.Appendf(nil, `{ + "type": "record", "name": "Row", + "fields": [{"name": "v", "type": ["null", %s]}] + }`, inner) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true /* preserveLogicalTypes: false */}, spec) + require.NoError(t, err) + f := c.Children[0] + assert.Equal(t, tc.want.typ, f.Type) + assert.True(t, f.Optional) + if tc.want.decimal { + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.Decimal) + assert.Equal(t, tc.want.precision, f.Logical.Decimal.Precision) + assert.Equal(t, tc.want.scale, f.Logical.Decimal.Scale) + } else { + assert.Nil(t, f.Logical, "Logical params should not be populated under preserveLogicalTypes=false") + } + }) + + t.Run("sibling-of-type", func(t *testing.T) { + extras := "" + if tc.want.decimal { + extras = fmt.Sprintf(`,"precision":%d,"scale":%d`, tc.want.precision, tc.want.scale) + } + spec := fmt.Appendf(nil, `{ + "type": "record", "name": "Row", + "fields": [{"name": "v", "type": ["null", "%s"], "logicalType": "%s"%s}] + }`, tc.baseType, tc.logical, extras) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true /* preserveLogicalTypes: false */}, spec) + require.NoError(t, err) + f := c.Children[0] + assert.Equal(t, tc.want.typ, f.Type) + assert.True(t, f.Optional) + if tc.want.decimal { + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.Decimal) + } else { + assert.Nil(t, f.Logical, "Logical params should not be populated under preserveLogicalTypes=false") + } + }) + }) + } +} + +// TestEcsAvroEncoderDecoderRoundTrip verifies that the schema we emit via +// commonToAvroNode is correctly parsed back by ecsAvroParseFromBytes — i.e. +// the encoder and decoder agree on at least one canonical wire form. +func TestEcsAvroEncoderDecoderRoundTrip(t *testing.T) { + original := schema.Common{ + Type: schema.Object, Name: "Row", + Children: []schema.Common{ + { + Name: "ts", Optional: true, Type: schema.Timestamp, + Logical: &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMillis, AdjustToUTC: true}, + }, + }, + { + Name: "amount", Optional: true, Type: schema.Decimal, + Logical: &schema.LogicalParams{ + Decimal: &schema.DecimalParams{Precision: 38, Scale: 4}, + }, + }, + }, + } + + avroNode, err := commonToAvroNode(original, "Row", "", true) + require.NoError(t, err) + avroJSON, err := json.Marshal(avroNode) + require.NoError(t, err) + t.Logf("encoder emitted: %s", string(avroJSON)) + + roundTripped, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true, preserveLogicalTypes: true}, avroJSON) + require.NoError(t, err) + + require.Len(t, roundTripped.Children, 2) + + ts := roundTripped.Children[0] + assert.Equal(t, "ts", ts.Name) + assert.Equal(t, schema.Timestamp, ts.Type) + assert.True(t, ts.Optional) + require.NotNil(t, ts.Logical) + require.NotNil(t, ts.Logical.Timestamp) + assert.Equal(t, schema.TimeUnitMillis, ts.Logical.Timestamp.Unit) + assert.True(t, ts.Logical.Timestamp.AdjustToUTC) + + amt := roundTripped.Children[1] + assert.Equal(t, "amount", amt.Name) + assert.Equal(t, schema.Decimal, amt.Type) + assert.True(t, amt.Optional) + require.NotNil(t, amt.Logical) + require.NotNil(t, amt.Logical.Decimal) + assert.EqualValues(t, 38, amt.Logical.Decimal.Precision) + assert.EqualValues(t, 4, amt.Logical.Decimal.Scale) +} + +func assertLogicalField(t *testing.T, spec []byte, want struct { + typ schema.CommonType + unit schema.TimeUnit + adjustUTC bool + precision int32 + scale int32 +}, +) { + t.Helper() + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true, preserveLogicalTypes: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + f := c.Children[0] + t.Logf("result: %s", mustJSON(c.ToAny())) + assert.Equal(t, want.typ, f.Type) + assert.True(t, f.Optional, "optional should be propagated") + + switch want.typ { + case schema.Timestamp: + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.Timestamp) + assert.Equal(t, want.unit, f.Logical.Timestamp.Unit) + assert.Equal(t, want.adjustUTC, f.Logical.Timestamp.AdjustToUTC) + case schema.TimeOfDay: + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.TimeOfDay) + assert.Equal(t, want.unit, f.Logical.TimeOfDay.Unit) + case schema.Decimal: + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.Decimal) + assert.Equal(t, want.precision, f.Logical.Decimal.Precision) + assert.Equal(t, want.scale, f.Logical.Decimal.Scale) + } +} + +func mustJSON(v any) string { + b, _ := json.MarshalIndent(v, "", " ") + return string(b) +} + +// TestUpstreamTwmbHonoursSiblingFormLogicalType is a regression guard +// for the upstream twmb/avro dependency. After the bump that landed our +// own field-level logicalType fix upstream (twmb/avro PR #38), the +// value-side decoder now produces time.Time for sibling-form schemas +// natively — previously it returned int64 because the parser silently +// dropped the field-level annotation. +// +// If twmb ever regresses on this, the shredder's metadata-driven +// numeric scaling path in iceberg/shredder/temporal.go would become +// load-bearing again. Pinning the upstream behaviour here makes any +// such regression surface immediately, in the package that depends on +// it, rather than as a customer report. +func TestUpstreamTwmbHonoursSiblingFormLogicalType(t *testing.T) { + cases := []struct { + name string + schema string + }{ + { + "primitive timestamp-millis", + `{"type":"record","name":"R","fields":[ + {"name":"ts","type":"long","logicalType":"timestamp-millis"} + ]}`, + }, + { + "union timestamp-millis (null first)", + `{"type":"record","name":"R","fields":[ + {"name":"ts","type":["null","long"],"logicalType":"timestamp-millis"} + ]}`, + }, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + s, err := avro.Parse(tc.schema) + require.NoError(t, err) + + // Encode a value via the binary-shape side schema to avoid + // requiring time.Time on the encode path; this is the same + // trick our integration test uses, and what a Java/JDBC + // producer would do — write raw long bytes on the wire and + // rely on the schema's logicalType for interpretation. + bin := avro.MustParse(`{"type":"record","name":"R","fields":[ + {"name":"ts","type":["null","long"]} + ]}`) + tsMillis := int64(1700000000000) + payload, err := bin.Encode(&struct { + TS *int64 `avro:"ts"` + }{TS: &tsMillis}) + require.NoError(t, err) + + var native any + _, err = s.Decode(payload, &native) + require.NoError(t, err) + + row, ok := native.(map[string]any) + require.True(t, ok, "expected map, got %T", native) + _, isTime := row["ts"].(time.Time) + assert.True(t, isTime, + "upstream twmb must decode sibling-form timestamp-millis as time.Time; got %T (regression in twmb/avro PR #38)", row["ts"]) + }) + } +} + +// TestEcsAvroKafkaConnectTypes asserts that with translate_kafka_connect_types +// enabled, the metadata parser maps Debezium `connect.name` annotations +// to the same schema.Common shape the value side would produce (via +// kafkaConnectTypeOpt in avro_walker.go). This is the direct analogue of +// the field-level logicalType fix for the Debezium-CDC pipeline shape. +// +// Without translate_kafka_connect_types the parser stays agnostic — the +// connect.name annotation is ignored and the base primitive flows +// through, matching what the value side does in the same mode. +func TestEcsAvroKafkaConnectTypes(t *testing.T) { + type want struct { + typ schema.CommonType + unit schema.TimeUnit + adjustUTC bool + } + + cases := []struct { + name string + baseType string + connectName string + want want + }{ + {name: "Date", baseType: "int", connectName: "io.debezium.time.Date", want: want{typ: schema.Date}}, + {name: "Year", baseType: "int", connectName: "io.debezium.time.Year", want: want{typ: schema.Date}}, + {name: "Timestamp", baseType: "long", connectName: "io.debezium.time.Timestamp", want: want{typ: schema.Timestamp, unit: schema.TimeUnitMillis, adjustUTC: true}}, + {name: "Time", baseType: "long", connectName: "io.debezium.time.Time", want: want{typ: schema.Timestamp, unit: schema.TimeUnitMillis, adjustUTC: true}}, + {name: "MicroTimestamp", baseType: "long", connectName: "io.debezium.time.MicroTimestamp", want: want{typ: schema.Timestamp, unit: schema.TimeUnitMicros, adjustUTC: true}}, + {name: "MicroTime", baseType: "long", connectName: "io.debezium.time.MicroTime", want: want{typ: schema.Timestamp, unit: schema.TimeUnitMicros, adjustUTC: true}}, + {name: "NanoTimestamp", baseType: "long", connectName: "io.debezium.time.NanoTimestamp", want: want{typ: schema.Timestamp, unit: schema.TimeUnitNanos, adjustUTC: true}}, + {name: "NanoTime", baseType: "long", connectName: "io.debezium.time.NanoTime", want: want{typ: schema.Timestamp, unit: schema.TimeUnitNanos, adjustUTC: true}}, + {name: "ZonedTimestamp", baseType: "string", connectName: "io.debezium.time.ZonedTimestamp", want: want{typ: schema.Timestamp, unit: schema.TimeUnitMillis, adjustUTC: true}}, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + // Debezium emits the annotation nested inside the union's + // type object — this is the on-wire shape we need to recognise. + spec := fmt.Appendf(nil, `{ + "type":"record","name":"R", + "fields":[{"name":"v","type":["null",{"type":"%s","connect.name":"%s","connect.version":1}]}] + }`, tc.baseType, tc.connectName) + + t.Run("translate-on", func(t *testing.T) { + c, err := ecsAvroParseFromBytes(ecsAvroConfig{ + rawUnion: true, + translateKafkaConnectTypes: true, + }, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + f := c.Children[0] + assert.Equal(t, tc.want.typ, f.Type, "translate-on must promote connect.name to the matching schema.Common type") + assert.True(t, f.Optional) + if tc.want.typ == schema.Timestamp { + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.Timestamp) + assert.Equal(t, tc.want.unit, f.Logical.Timestamp.Unit) + assert.Equal(t, tc.want.adjustUTC, f.Logical.Timestamp.AdjustToUTC) + } + }) + + t.Run("translate-off", func(t *testing.T) { + c, err := ecsAvroParseFromBytes(ecsAvroConfig{ + rawUnion: true, + }, spec) + require.NoError(t, err) + f := c.Children[0] + // With translate off, the parser stays agnostic: the + // base primitive flows through, matching the value side + // (which leaves the long/int/string untouched without + // translate_kafka_connect_types). + var wantBase schema.CommonType + switch tc.baseType { + case "int": + wantBase = schema.Int32 + case "long": + wantBase = schema.Int64 + case "string": + wantBase = schema.String + } + assert.Equal(t, wantBase, f.Type, "translate-off must leave the connect.name annotation untouched") + assert.Nil(t, f.Logical, "no Logical params when translate is off") + }) + }) + } +} + +// TestEcsAvroDurationLogicalType pins coverage for the Avro `duration` +// logical type. The value side (preserveLogicalTypeOpts in +// avro_walker.go) decodes the 12-byte fixed payload to an ISO 8601 +// string; the metadata side maps to schema.String so an iceberg sink +// auto-creates a VARCHAR column rather than the BINARY column the raw +// fixed-base would imply. Without preserve_logical_types the value +// side leaves the bytes untranslated and the metadata side +// correspondingly stays at ByteArray — the consistent default. +func TestEcsAvroDurationLogicalType(t *testing.T) { + spec := []byte(`{"type":"record","name":"R","fields":[ + {"name":"d","type":{"type":"fixed","name":"Duration","size":12,"logicalType":"duration"}} + ]}`) + + t.Run("preserve-on", func(t *testing.T) { + c, err := ecsAvroParseFromBytes(ecsAvroConfig{preserveLogicalTypes: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + assert.Equal(t, schema.String, c.Children[0].Type, + "duration metadata should resolve to String to match the value side's ISO 8601 output") + }) + + t.Run("preserve-off", func(t *testing.T) { + c, err := ecsAvroParseFromBytes(ecsAvroConfig{}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + // Fixed maps to Any in ecsAvroTypeToCommon (it's not a named + // primitive). The exact base type matters less than the + // invariant: when preserve is off, metadata stays untranslated. + assert.NotEqual(t, schema.String, c.Children[0].Type, + "without preserve_logical_types the duration metadata must not be promoted") + }) +} diff --git a/internal/impl/confluent/ecs_avro_test.go b/internal/impl/confluent/ecs_avro_test.go index 90c4c42c69..d337a29aca 100644 --- a/internal/impl/confluent/ecs_avro_test.go +++ b/internal/impl/confluent/ecs_avro_test.go @@ -269,6 +269,521 @@ func TestEcsAvroRawUnionNestedRecord(t *testing.T) { assert.Equal(t, schema.String, party.Children[0].Type) } +// TestEcsAvroRawUnionNullableRecordByName is the CON-468 regression for +// nullable record unions where the non-null branch is a string name +// reference to a previously-defined record (rather than an inline +// definition). The Avro JSON spec requires named types to be defined once +// then referenced by name, so any non-trivial customer schema with reused +// records will exercise this path. +func TestEcsAvroRawUnionNullableRecordByName(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "Fee", + "fields": [ + {"name": "amount", "type": "long"}, + {"name": "currency", "type": "string"} + ] + } + }, + {"name": "secondary_fee", "type": ["null", "Fee"], "default": null} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + require.Equal(t, schema.Object, c.Type) + require.Len(t, c.Children, 2) + + primary := c.Children[0] + assert.Equal(t, "primary_fee", primary.Name) + require.Equal(t, schema.Object, primary.Type) + require.Len(t, primary.Children, 2) + + secondary := c.Children[1] + assert.Equal(t, "secondary_fee", secondary.Name) + require.Equal(t, schema.Object, secondary.Type, "name reference to Fee should resolve to the same record structure, not VARCHAR") + assert.True(t, secondary.Optional) + require.Len(t, secondary.Children, 2) + assert.Equal(t, "amount", secondary.Children[0].Name) + assert.Equal(t, schema.Int64, secondary.Children[0].Type) + assert.Equal(t, "currency", secondary.Children[1].Name) + assert.Equal(t, schema.String, secondary.Children[1].Type) +} + +// TestEcsAvroRawUnionNullableOrderIndependence covers CON-468 acceptance +// criterion 2: the [, "null"] ordering (null second) must resolve +// identically to ["null", ] (null first), across inline objects, +// primitives, and name references. +func TestEcsAvroRawUnionNullableOrderIndependence(t *testing.T) { + t.Run("inline record, null second", func(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [{ + "name": "fee", + "type": [{ + "type": "record", + "name": "Fee", + "fields": [{"name": "amount", "type": "long"}] + }, "null"] + }] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + fee := c.Children[0] + assert.Equal(t, schema.Object, fee.Type) + assert.True(t, fee.Optional) + require.Len(t, fee.Children, 1) + assert.Equal(t, "amount", fee.Children[0].Name) + }) + + t.Run("primitive, null second", func(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [{"name": "ref", "type": ["string", "null"]}] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + ref := c.Children[0] + assert.Equal(t, schema.String, ref.Type) + assert.True(t, ref.Optional) + }) + + t.Run("name reference, null second", func(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "Fee", + "fields": [{"name": "amount", "type": "long"}] + } + }, + {"name": "secondary_fee", "type": ["Fee", "null"]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + secondary := c.Children[1] + assert.Equal(t, schema.Object, secondary.Type) + assert.True(t, secondary.Optional) + require.Len(t, secondary.Children, 1) + assert.Equal(t, "amount", secondary.Children[0].Name) + }) +} + +// TestEcsAvroRawUnionNullableRecordNamespaced verifies that namespaced +// record names can be referenced either by short name (Fee) or by fully- +// qualified name (com.example.Fee), matching the Avro spec's name +// resolution rules. +func TestEcsAvroRawUnionNullableRecordNamespaced(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "namespace": "com.example", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "Fee", + "namespace": "com.example", + "fields": [{"name": "amount", "type": "long"}] + } + }, + {"name": "by_short_name", "type": ["null", "Fee"]}, + {"name": "by_full_name", "type": ["null", "com.example.Fee"]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 3) + + short := c.Children[1] + assert.Equal(t, "by_short_name", short.Name) + assert.Equal(t, schema.Object, short.Type) + require.Len(t, short.Children, 1) + + full := c.Children[2] + assert.Equal(t, "by_full_name", full.Name) + assert.Equal(t, schema.Object, full.Type) + require.Len(t, full.Children, 1) +} + +// TestEcsAvroRawUnionNullableRecordNested covers CON-468 acceptance +// criterion 2's "record-with-nested-record" case: a nullable record union +// whose record contains its own nullable record union, both as inline +// definitions and as name references at the inner level. +func TestEcsAvroRawUnionNullableRecordNested(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [ + { + "name": "inner_template", + "type": { + "type": "record", + "name": "Inner", + "fields": [{"name": "code", "type": "string"}] + } + }, + { + "name": "outer", + "type": ["null", { + "type": "record", + "name": "Outer", + "fields": [ + {"name": "label", "type": "string"}, + {"name": "inner", "type": ["null", "Inner"]} + ] + }] + } + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 2) + + outer := c.Children[1] + assert.Equal(t, "outer", outer.Name) + assert.Equal(t, schema.Object, outer.Type) + assert.True(t, outer.Optional) + require.Len(t, outer.Children, 2) + + inner := outer.Children[1] + assert.Equal(t, "inner", inner.Name) + assert.Equal(t, schema.Object, inner.Type, "nested name reference must resolve, not collapse to VARCHAR") + assert.True(t, inner.Optional) + require.Len(t, inner.Children, 1) + assert.Equal(t, "code", inner.Children[0].Name) +} + +// TestEcsAvroLameUnionNameResolution covers the same CON-468 bug class in +// the lame-union (rawUnion=false) path that the schema_registry_decode +// processor takes by default. The lame envelope wraps each branch in a +// tagged Object so the wire shape stays the same, but the inner Common +// must still expand previously-defined named-type references rather than +// collapsing them to schema.Any. +func TestEcsAvroLameUnionNameResolution(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "Fee", + "fields": [{"name": "amount", "type": "long"}] + } + }, + {"name": "secondary_fee", "type": ["null", "Fee"]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{}, spec) + require.NoError(t, err) + + secondary := c.Children[1] + assert.Equal(t, "secondary_fee", secondary.Name) + assert.Equal(t, schema.Union, secondary.Type) + require.Len(t, secondary.Children, 2) + + // One child is the null branch (Common.Type=Null, Name=""). + // The other is the tagged-Object envelope wrapping the resolved Fee. + var feeEnvelope schema.Common + for _, ch := range secondary.Children { + if ch.Type == schema.Object { + feeEnvelope = ch + break + } + } + require.Equal(t, schema.Object, feeEnvelope.Type, "Fee branch should be tagged-Object envelope") + require.Len(t, feeEnvelope.Children, 1) + feeInner := feeEnvelope.Children[0] + assert.Equal(t, "Fee", feeInner.Name) + assert.Equal(t, schema.Object, feeInner.Type, "name reference to Fee should resolve to its record structure, not schema.Any") + require.Len(t, feeInner.Children, 1) + assert.Equal(t, "amount", feeInner.Children[0].Name) + assert.Equal(t, schema.Int64, feeInner.Children[0].Type) +} + +// TestEcsAvroRawUnionAnnotatedInlinePrimitive verifies that the +// optional-union collapse handles the [{primitive-with-annotations}, null] +// shape — the form Kafka Connect / Debezium emit for nullable string fields, +// where the non-null branch is an inline object carrying `connect.default` +// (or similar extension properties) alongside `type: "string"`. The +// resolver should ignore the unknown annotations (per Avro spec) and +// collapse the union to Optional STRING the same way it collapses +// `["null", "string"]`. +// +// Both branch orderings are covered: null-second is the shape the bug +// report came in with, null-first is the canonical Avro spelling. +func TestEcsAvroRawUnionAnnotatedInlinePrimitive(t *testing.T) { + tests := []struct { + name string + spec string + }{ + { + name: "null second", + spec: `{ + "type": "record", + "name": "Rec", + "fields": [{ + "name": "category", + "type": [ + {"type": "string", "connect.default": ""}, + "null" + ], + "default": "" + }] + }`, + }, + { + name: "null first", + spec: `{ + "type": "record", + "name": "Rec", + "fields": [{ + "name": "category", + "type": [ + "null", + {"type": "string", "connect.default": ""} + ], + "default": null + }] + }`, + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, []byte(tt.spec)) + require.NoError(t, err) + require.Equal(t, schema.Object, c.Type) + require.Len(t, c.Children, 1) + category := c.Children[0] + assert.Equal(t, "category", category.Name, "outer field name must be preserved across the union collapse") + assert.Equal(t, schema.String, category.Type, "annotated inline primitive should collapse to STRING, not stay a Union") + assert.True(t, category.Optional, "the null branch must drive Optional=true") + assert.Empty(t, category.Children, "collapsed primitive has no children") + }) + } +} + +// TestEcsAvroUnionInlineErrorPropagation pins down the %w-wrapping +// contract through the union resolvers. A malformed inline decimal sitting +// inside a nullable union must surface its root-cause error (the precision +// parse failure), not collapse to a generic "could not resolve type +// map[string]interface{}". Covers both the optional-union fast path and the +// general union fall-through, and both raw- and lame-union flavours. +func TestEcsAvroUnionInlineErrorPropagation(t *testing.T) { + tests := []struct { + name string + cfg ecsAvroConfig + spec string + wantMsg string + }{ + { + name: "raw union, nullable shape, malformed decimal precision", + cfg: ecsAvroConfig{rawUnion: true}, + spec: `{ + "type": "record", + "name": "Tx", + "fields": [{ + "name": "amount", + "type": ["null", {"type": "bytes", "logicalType": "decimal", "precision": "not-a-number", "scale": 2}] + }] + }`, + wantMsg: "decimal precision", + }, + { + name: "raw union, three-branch shape, malformed decimal scale", + cfg: ecsAvroConfig{rawUnion: true}, + spec: `{ + "type": "record", + "name": "Tx", + "fields": [{ + "name": "amount", + "type": ["null", "string", {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": "nope"}] + }] + }`, + wantMsg: "decimal scale", + }, + { + name: "lame union, nullable shape, malformed decimal precision", + cfg: ecsAvroConfig{}, + spec: `{ + "type": "record", + "name": "Tx", + "fields": [{ + "name": "amount", + "type": ["null", {"type": "bytes", "logicalType": "decimal", "precision": "not-a-number", "scale": 2}] + }] + }`, + wantMsg: "decimal precision", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + _, err := ecsAvroParseFromBytes(tt.cfg, []byte(tt.spec)) + require.Error(t, err) + assert.Contains(t, err.Error(), tt.wantMsg, "inner error context must propagate via %%w wrapping, got %q", err.Error()) + assert.NotContains(t, err.Error(), "could not resolve type", "must not collapse to the type-stringifier fallback when an inner error exists") + }) + } +} + +// TestEcsAvroNamespaceInheritance verifies that named types defined inside +// a namespaced record inherit the enclosing namespace when they omit their +// own `namespace` field, per the Avro spec's name-resolution rules. Both +// short-name and fully-qualified references to the inherited fullname must +// resolve, including across deeply nested scopes. +func TestEcsAvroNamespaceInheritance(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "namespace": "com.example", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "Fee", + "fields": [{"name": "amount", "type": "long"}] + } + }, + {"name": "by_short_name", "type": ["null", "Fee"]}, + {"name": "by_inherited_full_name", "type": ["null", "com.example.Fee"]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 3) + + short := c.Children[1] + assert.Equal(t, "by_short_name", short.Name) + assert.Equal(t, schema.Object, short.Type, "unqualified Fee should resolve via inherited com.example namespace") + require.Len(t, short.Children, 1) + assert.Equal(t, "amount", short.Children[0].Name) + + full := c.Children[2] + assert.Equal(t, "by_inherited_full_name", full.Name) + assert.Equal(t, schema.Object, full.Type, "Fee implicitly belongs to com.example and must be reachable by FQN") + require.Len(t, full.Children, 1) +} + +// TestEcsAvroNameWithEmbeddedDot exercises the third form of Avro's named- +// type spelling: the `name` field itself contains a dot, in which case the +// dot-bearing value IS the fullname and any sibling `namespace` field is +// ignored. References must resolve under the embedded fullname. +func TestEcsAvroNameWithEmbeddedDot(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "namespace": "ignored.outer", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "com.example.Fee", + "namespace": "this.is.ignored", + "fields": [{"name": "amount", "type": "long"}] + } + }, + {"name": "by_fqn", "type": ["null", "com.example.Fee"]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 2) + + fqn := c.Children[1] + assert.Equal(t, "by_fqn", fqn.Name) + assert.Equal(t, schema.Object, fqn.Type) + require.Len(t, fqn.Children, 1) +} + +// TestEcsAvroSelfReferentialRecord verifies that a record whose own field +// references it by name (e.g. a linked-list `next` pointer) resolves to a +// structural stub — a one-level Object carrying the record's short name — +// rather than collapsing to schema.Any (VARCHAR downstream). +// +// True recursive resolution is out of scope: schema.Common is a value type +// with no back-reference machinery, so the self-reference is necessarily a +// stub. The fix exists so downstream sinks see "this is some kind of +// record" rather than an opaque blob. +func TestEcsAvroSelfReferentialRecord(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Node", + "fields": [ + {"name": "value", "type": "long"}, + {"name": "next", "type": ["null", "Node"]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true}, spec) + require.NoError(t, err) + require.Equal(t, schema.Object, c.Type) + require.Len(t, c.Children, 2) + assert.Equal(t, "value", c.Children[0].Name) + assert.Equal(t, schema.Int64, c.Children[0].Type) + + next := c.Children[1] + assert.Equal(t, "next", next.Name) + assert.Equal(t, schema.Object, next.Type, "self-reference should resolve to Object stub, not schema.Any") + assert.True(t, next.Optional) + assert.Empty(t, next.Children, "self-reference is a one-level stub — recursive resolution is out of scope") +} + +// TestEcsAvroNamesMapIsolation verifies the clone-on-retrieval contract for +// the names map: mutating a resolved Common (e.g. appending to its Children) +// must not propagate to subsequent lookups of the same named type. Without +// the clone, the second `secondary_fee` resolution would see the mutated +// shape of the first. +func TestEcsAvroNamesMapIsolation(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Transfer", + "fields": [ + { + "name": "primary_fee", + "type": { + "type": "record", + "name": "Fee", + "fields": [{"name": "amount", "type": "long"}] + } + }, + {"name": "first_ref", "type": "Fee"}, + {"name": "second_ref", "type": "Fee"} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 3) + + first := c.Children[1] + require.Equal(t, schema.Object, first.Type) + require.Len(t, first.Children, 1) + // Mutate the first resolved copy. If the names map were aliased, the + // next retrieval would see a record with two children. + first.Children = append(first.Children, schema.Common{Name: "poison", Type: schema.String}) + assert.Len(t, first.Children, 2) + + second := c.Children[2] + require.Equal(t, schema.Object, second.Type) + assert.Len(t, second.Children, 1, "later retrieval of Fee must not see mutations applied to earlier resolutions") +} + // TestEcsAvroRecordWithNilFields is a regression test for schemas where a // field's type is a record object without a "fields" key (e.g. back-reference // form {"type":"record","name":"Party"} or "fields":null from some generators). @@ -293,3 +808,154 @@ func TestEcsAvroRecordWithNilFields(t *testing.T) { assert.Equal(t, schema.Object, party.Type) assert.Empty(t, party.Children) } + +// TestEcsAvroLogicalTypeDispatcher exercises the full dispatcher in +// applyAvroLogicalType for every Avro logical type the connector now +// preserves. Each case drives the original-issue-#4399 path: the schema's +// logicalType annotation is honoured rather than dropped. +func TestEcsAvroLogicalTypeDispatcher(t *testing.T) { + type expect struct { + ctype schema.CommonType + hasLogical bool + unit schema.TimeUnit + adjustToUTC bool + } + cases := []struct { + name string + field string // raw JSON for the field's "type" entry + expect expect + }{ + { + name: "timestamp-millis", + field: `{"type":"long","logicalType":"timestamp-millis"}`, + expect: expect{ctype: schema.Timestamp, hasLogical: true, unit: schema.TimeUnitMillis, adjustToUTC: true}, + }, + { + name: "timestamp-micros", + field: `{"type":"long","logicalType":"timestamp-micros"}`, + expect: expect{ctype: schema.Timestamp, hasLogical: true, unit: schema.TimeUnitMicros, adjustToUTC: true}, + }, + { + name: "timestamp-nanos", + field: `{"type":"long","logicalType":"timestamp-nanos"}`, + expect: expect{ctype: schema.Timestamp, hasLogical: true, unit: schema.TimeUnitNanos, adjustToUTC: true}, + }, + { + name: "local-timestamp-millis", + field: `{"type":"long","logicalType":"local-timestamp-millis"}`, + expect: expect{ctype: schema.Timestamp, hasLogical: true, unit: schema.TimeUnitMillis, adjustToUTC: false}, + }, + { + name: "local-timestamp-micros", + field: `{"type":"long","logicalType":"local-timestamp-micros"}`, + expect: expect{ctype: schema.Timestamp, hasLogical: true, unit: schema.TimeUnitMicros, adjustToUTC: false}, + }, + { + name: "date", + field: `{"type":"int","logicalType":"date"}`, + expect: expect{ctype: schema.Date}, + }, + { + name: "time-millis", + field: `{"type":"int","logicalType":"time-millis"}`, + expect: expect{ctype: schema.TimeOfDay, hasLogical: true, unit: schema.TimeUnitMillis}, + }, + { + name: "time-micros", + field: `{"type":"long","logicalType":"time-micros"}`, + expect: expect{ctype: schema.TimeOfDay, hasLogical: true, unit: schema.TimeUnitMicros}, + }, + { + name: "uuid", + field: `{"type":"string","logicalType":"uuid"}`, + expect: expect{ctype: schema.UUID}, + }, + { + name: "unknown logicalType falls back to base primitive", + // Per Avro 1.10 spec readers must ignore unknown logicalType + // values. The base primitive (long) survives. + field: `{"type":"long","logicalType":"frobnicate-millis"}`, + expect: expect{ctype: schema.Int64}, + }, + { + name: "logicalType on mismatched primitive falls back", + // timestamp-millis declared on `string` is malformed; reader + // silently uses the base primitive instead of failing the schema. + field: `{"type":"string","logicalType":"timestamp-millis"}`, + expect: expect{ctype: schema.String}, + }, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Row", + "fields": [ + {"name": "field", "type": ` + tc.field + `} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{preserveLogicalTypes: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + f := c.Children[0] + assert.Equal(t, tc.expect.ctype, f.Type, "type mapping") + if tc.expect.hasLogical { + require.NotNil(t, f.Logical, "Logical should be populated") + switch tc.expect.ctype { + case schema.Timestamp: + require.NotNil(t, f.Logical.Timestamp) + assert.Equal(t, tc.expect.unit, f.Logical.Timestamp.Unit) + assert.Equal(t, tc.expect.adjustToUTC, f.Logical.Timestamp.AdjustToUTC) + case schema.TimeOfDay: + require.NotNil(t, f.Logical.TimeOfDay) + assert.Equal(t, tc.expect.unit, f.Logical.TimeOfDay.Unit) + assert.Equal(t, tc.expect.adjustToUTC, f.Logical.TimeOfDay.AdjustToUTC) + } + } + }) + } +} + +// TestEcsAvroDecimalOnWrongPrimitiveFallsBack verifies that a `decimal` +// logical-type annotation on something other than `bytes` or `fixed` is +// silently dropped per the spec, rather than producing a malformed +// schema.Decimal whose precision/scale claims to govern an int64 value. +func TestEcsAvroDecimalOnWrongPrimitiveFallsBack(t *testing.T) { + spec := []byte(`{ + "type":"record","name":"Row", + "fields":[ + {"name":"amount","type":{"type":"long","logicalType":"decimal","precision":18,"scale":4}} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + amount := c.Children[0] + assert.Equal(t, schema.Int64, amount.Type, "decimal on long should fall back to base type") + assert.Nil(t, amount.Logical, "no Logical params should be set when annotation falls back") +} + +// TestEcsAvroLogicalTypeOptionalUnion verifies that the [null, {logical}] +// idiom under rawUnion=true unwraps to an Optional node with the logical +// type fully preserved. This is the shape the original issue #4399 reported. +func TestEcsAvroLogicalTypeOptionalUnion(t *testing.T) { + spec := []byte(`{ + "type": "record", + "name": "Row", + "fields": [ + {"name": "event_time", "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}]} + ] + }`) + c, err := ecsAvroParseFromBytes(ecsAvroConfig{rawUnion: true, preserveLogicalTypes: true}, spec) + require.NoError(t, err) + require.Len(t, c.Children, 1) + f := c.Children[0] + assert.Equal(t, "event_time", f.Name) + assert.Equal(t, schema.Timestamp, f.Type) + assert.True(t, f.Optional, "[null, timestamp-millis] should produce Optional Timestamp") + require.NotNil(t, f.Logical) + require.NotNil(t, f.Logical.Timestamp) + assert.Equal(t, schema.TimeUnitMillis, f.Logical.Timestamp.Unit) + assert.True(t, f.Logical.Timestamp.AdjustToUTC) +} diff --git a/internal/impl/confluent/processor_schema_registry_decode_test.go b/internal/impl/confluent/processor_schema_registry_decode_test.go index c863cf17b6..19b2067bc4 100644 --- a/internal/impl/confluent/processor_schema_registry_decode_test.go +++ b/internal/impl/confluent/processor_schema_registry_decode_test.go @@ -290,10 +290,16 @@ func TestSchemaRegistryDecodeAvro(t *testing.T) { output: `{"Name":"foo","MaybeHobby":null,"Address": null}`, }, { - schemaID: 4, - name: "successful message with logical types", - input: "\x00\x00\x00\x00\x04\x02\x90\xaf\xce!\x02\x80\x80揪\x97\t\x02\x80\x80\xde\xf2\xdf\xff\xdf\xdc\x01\x02\x02!", - output: `{"int_time_millis":{"int.time-millis":35245000},"long_time_micros":{"long.time-micros":20192000000000},"long_timestamp_micros":{"long.timestamp-micros":62135596800000000},"pos_0_33333333":{"bytes.decimal":0.33}}`, + schemaID: 4, + name: "successful message with logical types", + input: "\x00\x00\x00\x00\x04\x02\x90\xaf\xce!\x02\x80\x80揪\x97\t\x02\x80\x80\xde\xf2\xdf\xff\xdf\xdc\x01\x02\x02!", + // pos_0_33333333 emits "!" — the spec-blessed codepoint-mapped + // string form for decimal under Avro JSON encoding (the single + // byte 0x21 = "!"). twmb/avro v1.7.3+ matches Java's + // JsonEncoder here; users who want a numeric form should set + // preserve_logical_types: true, which routes through our + // json.Number CustomType registration instead. + output: `{"int_time_millis":{"int.time-millis":35245000},"long_time_micros":{"long.time-micros":20192000000000},"long_timestamp_micros":{"long.timestamp-micros":62135596800000000},"pos_0_33333333":{"bytes.decimal":"!"}}`, preservedOutput: `{"int_time_millis":{"int.time-millis":"0001-01-01T09:47:25Z"},"long_time_micros":{"long.time-micros":"0001-08-22T16:53:20Z"},"long_timestamp_micros":{"long.timestamp-micros":"3939-01-01T00:00:00Z"},"pos_0_33333333":{"bytes.decimal":0.33}}`, }, { @@ -503,10 +509,12 @@ func TestSchemaRegistryDecodeAvroRawJson(t *testing.T) { output: `{"Name":"foo","MaybeHobby":null,"Address": null}`, }, { - schemaID: 4, - name: "successful message with logical types", - input: "\x00\x00\x00\x00\x04\x02\x90\xaf\xce!\x02\x80\x80揪\x97\t\x02\x80\x80\xde\xf2\xdf\xff\xdf\xdc\x01\x02\x02!", - output: `{"int_time_millis":35245000,"long_time_micros":20192000000000,"long_timestamp_micros":62135596800000000,"pos_0_33333333":0.33}`, + schemaID: 4, + name: "successful message with logical types", + input: "\x00\x00\x00\x00\x04\x02\x90\xaf\xce!\x02\x80\x80揪\x97\t\x02\x80\x80\xde\xf2\xdf\xff\xdf\xdc\x01\x02\x02!", + // pos_0_33333333 emits "!" in default mode — see the matching + // note on TestSchemaRegistryDecodeAvro for why. + output: `{"int_time_millis":35245000,"long_time_micros":20192000000000,"long_timestamp_micros":62135596800000000,"pos_0_33333333":"!"}`, preservedOutput: `{"int_time_millis":"0001-01-01T09:47:25Z","long_time_micros":"0001-08-22T16:53:20Z","long_timestamp_micros":"3939-01-01T00:00:00Z","pos_0_33333333":0.33}`, }, { diff --git a/internal/impl/confluent/serde_avro.go b/internal/impl/confluent/serde_avro.go index 4af6192c7b..ae869daec7 100644 --- a/internal/impl/confluent/serde_avro.go +++ b/internal/impl/confluent/serde_avro.go @@ -201,7 +201,8 @@ func (s *schemaRegistryDecoder) getAvroDecoder(ctx context.Context, aschema fran // Build parse options for preserve_logical_types: register custom // types that convert time.Duration→time.Time for time-of-day fields, - // avro.Duration→string, and optionally Kafka Connect types. + // avro.Duration→string, decimal bytes→json.Number, and optionally + // Kafka Connect (Debezium) types. var parseOpts []avro.SchemaOpt if s.cfg.avro.preserveLogicalTypes { parseOpts = append(parseOpts, preserveLogicalTypeOpts()...) @@ -222,7 +223,9 @@ func (s *schemaRegistryDecoder) getAvroDecoder(ctx context.Context, aschema fran ) if s.cfg.avro.storeSchemaMeta != "" { c, parseErr := ecsAvroParseFromBytes(ecsAvroConfig{ - rawUnion: s.cfg.avro.rawUnions, + rawUnion: s.cfg.avro.rawUnions, + preserveLogicalTypes: s.cfg.avro.preserveLogicalTypes, + translateKafkaConnectTypes: s.cfg.avro.translateKafkaConnectTypes, }, []byte(schemaSpec)) if parseErr != nil { s.logger.With("error", parseErr).Error("Failed to extract common schema for meta storage") diff --git a/internal/impl/iceberg/config.go b/internal/impl/iceberg/config.go index ae6a1c749d..1300150be7 100644 --- a/internal/impl/iceberg/config.go +++ b/internal/impl/iceberg/config.go @@ -69,12 +69,13 @@ const ( ioFieldAzureAccessKey = "storage_access_key" // Schema evolution fields - ioFieldSchemaEvolution = "schema_evolution" - ioFieldSchemaEvolutionEnabled = "enabled" - ioFieldSchemaEvolutionPartitionSpec = "partition_spec" - ioFieldSchemaEvolutionTableLoc = "table_location" - ioFieldSchemaEvolutionSchemaMetadata = "schema_metadata" - ioFieldSchemaEvolutionNewColumnTypeMapping = "new_column_type_mapping" + ioFieldSchemaEvolution = "schema_evolution" + ioFieldSchemaEvolutionEnabled = "enabled" + ioFieldSchemaEvolutionPartitionSpec = "partition_spec" + ioFieldSchemaEvolutionTableLoc = "table_location" + ioFieldSchemaEvolutionSchemaMetadata = "schema_metadata" + ioFieldSchemaEvolutionNewColumnTypeMapping = "new_column_type_mapping" + ioFieldSchemaEvolutionRequireSchemaMetadata = "require_schema_metadata" // Commit fields ioFieldCommit = "commit" @@ -332,6 +333,10 @@ array:list Description("An optional Bloblang mapping to customize column types during schema evolution. This mapping is executed for each new column and can override the inferred or schema-metadata-derived type. The mapping receives an object with fields `name` (column name), `path` (dot-separated path), `value` (sample value), `inferred_type` (the type that would be used without this mapping), `message` (the full message body), `namespace`, and `table`. It must return a string with a valid Iceberg type name: `boolean`, `int`, `long`, `float`, `double`, `string`, `binary`, `date`, `time`, `timestamp`, `timestamptz`, `uuid`, `decimal(p,s)`, or `fixed[n]`."). Optional(). Advanced(), + service.NewBoolField(ioFieldSchemaEvolutionRequireSchemaMetadata). + Description("When `true`, writing a numeric value into a `timestamp`, `timestamptz`, `date`, or `time` column without `schema_metadata` registered for that column is a hard error. The default `false` permits a fallback path that interprets bare numeric timestamps as Unix seconds and bare numeric times as already-microseconds — convenient, but silently wrong if upstream produced milliseconds. Enable this when you cannot guarantee the upstream attaches schema metadata and want to fail loudly rather than corrupt dates by ~50,000 years. No effect on time-typed columns receiving `time.Time`/`time.Duration` Go values, which carry their own unit unambiguously, and no effect on non-time columns. Requires `schema_metadata` to be set."). + Default(false). + Advanced(), ).Description("Schema evolution configuration."). Optional(). Advanced(), diff --git a/internal/impl/iceberg/integration/schema_metadata_timestamp_test.go b/internal/impl/iceberg/integration/schema_metadata_timestamp_test.go new file mode 100644 index 0000000000..dc3e56c50d --- /dev/null +++ b/internal/impl/iceberg/integration/schema_metadata_timestamp_test.go @@ -0,0 +1,297 @@ +// Copyright 2026 Redpanda Data, Inc. +// +// Licensed as a Redpanda Enterprise file under the Redpanda Community +// License (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// https://github.com/redpanda-data/redpanda/blob/master/licenses/rcl.md + +package iceberg + +import ( + "context" + "fmt" + "testing" + "time" + + "github.com/apache/iceberg-go" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/redpanda-data/benthos/v4/public/schema" + "github.com/redpanda-data/benthos/v4/public/service" + "github.com/redpanda-data/benthos/v4/public/service/integration" + + icebergimpl "github.com/redpanda-data/connect/v4/internal/impl/iceberg" +) + +// TestIntegrationSchemaMetadataDrivesTimestampColumn feeds the iceberg Router +// with the exact shape of input that schema_registry_decode produces after +// the parser fix in confluent/ecs_avro.go: a structured body plus a "schema" +// metadata entry holding the ToAny() form of a schema.Common with +// Type=Timestamp on the time-typed field. +// +// The test is deliberately isolated from confluent/avro so it exercises only +// the iceberg side of the boundary: given a correctly-shaped @schema, the +// iceberg output must auto-create a timestamp-typed column and scale the +// numeric or time.Time value into a correct parquet timestamp. +// +// Two value-side variants are covered, mirroring what twmb/avro emits for +// different idiomatic Avro shapes once schema_registry_decode finishes: +// +// - time.Time values — produced when the Avro schema nests logicalType +// inside the union object; twmb honours it natively. +// - int64 values — produced when the Avro schema places logicalType +// as a sibling of `type` (Java/JDBC idiom); twmb does not honour the +// annotation and emits a raw long. The iceberg shredder must consult +// the @schema metadata's declared Unit to scale it correctly. +// +// In both cases the resulting iceberg table must hold a timestamp column +// (not BIGINT) and the value must round-trip to the correct calendar date. +func TestIntegrationSchemaMetadataDrivesTimestampColumn(t *testing.T) { + integration.CheckSkip(t) + + ctx := context.Background() + infra := setupTestInfra(t, ctx) + + const namespace = "schema_meta_timestamp" + infra.CreateNamespace(t, namespace) + + // Construct the schema.Common that the schema_registry_decode parser + // must produce after our fix: record { id: string, ts: Optional }. + commonSchema := schema.Common{ + Type: schema.Object, Name: "Event", + Children: []schema.Common{ + {Name: "id", Type: schema.String}, + { + Name: "ts", Optional: true, Type: schema.Timestamp, + Logical: &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitMillis, AdjustToUTC: true, + }, + }, + }, + }, + } + schemaMeta := commonSchema.ToAny() + + const tsMillis = int64(1700000000000) // 2023-11-14T22:13:20Z + expectedTimestampDate := "2023-11-14" + + cases := []struct { + name string + tableName string + tsValue any + }{ + { + name: "value-is-time.Time", + tableName: "events_time_value", + tsValue: time.UnixMilli(tsMillis).UTC(), + }, + { + name: "value-is-int64", + tableName: "events_int_value", + tsValue: tsMillis, + }, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + // Build the message exactly as schema_registry_decode would: a + // structured map for the body plus the schema.Common (in ToAny() + // form) under the configured metadata key. + msg := service.NewMessage(nil) + msg.SetStructured(map[string]any{ + "id": "evt-1", + "ts": tc.tsValue, + }) + msg.MetaSetMut("schema", schemaMeta) + + router := infra.NewRouter(t, namespace, tc.tableName, + WithSchemaEvolution(icebergimpl.SchemaEvolutionConfig{ + Enabled: true, + SchemaMetadata: "schema", + })) + produceMessages(t, ctx, router, service.MessageBatch{msg}) + + cols := querySQL[ColumnInfo](t, ctx, infra, + fmt.Sprintf(`DESCRIBE iceberg_cat."%s"."%s";`, namespace, tc.tableName)) + require.Len(t, cols, 2, "expected exactly two columns") + typeOf := map[string]string{} + for _, c := range cols { + typeOf[c.ColumnName] = c.ColumnType + } + assert.Equal(t, "VARCHAR", typeOf["id"]) + assert.Contains(t, typeOf["ts"], "TIMESTAMP", + "ts column must be a timestamp type; BIGINT means @schema metadata was ignored or didn't carry Type=Timestamp") + + type row struct { + ID string `json:"id"` + TS string `json:"ts"` + } + rows := querySQL[row](t, ctx, infra, + fmt.Sprintf(`SELECT id, CAST(ts AS VARCHAR) AS ts FROM iceberg_cat."%s"."%s";`, namespace, tc.tableName)) + require.Len(t, rows, 1) + assert.Equal(t, "evt-1", rows[0].ID) + assert.Contains(t, rows[0].TS, expectedTimestampDate, + "timestamp must round-trip as %s; a year ~55,000 result means the int64 millis was treated as seconds via the fallback path", expectedTimestampDate) + }) + } +} + +// TestIntegrationCoerceTemporalIntoExistingBigintColumn pins down the +// rolling-upgrade behaviour for operators whose iceberg table was created +// with BIGINT columns under the pre-fix metadata bug. After the fix, +// schema_registry_decode emits time.Time values together with @schema +// metadata that declares Type=Timestamp(Millis). The iceberg shredder must +// coerce the time.Time into the wire-equivalent int64 millis on the way in +// — preserving the existing column shape — so the upgrade does not require +// dropping every affected table. +// +// The test pre-creates the table with the legacy column shape (id: STRING, +// ts: BIGINT), then sends a message whose body holds a time.Time and whose +// metadata says Timestamp(Millis). The post-write state must be: +// - The BIGINT column is unchanged (no auto-evolution to TIMESTAMP). +// - The stored value is the wire-equivalent millisecond integer. +// +// Operators who want the native TIMESTAMP shape can subsequently drop and +// recreate the table; the auto-create path then lands a TIMESTAMP column +// directly (covered by the test above). +func TestIntegrationCoerceTemporalIntoExistingBigintColumn(t *testing.T) { + integration.CheckSkip(t) + + ctx := context.Background() + infra := setupTestInfra(t, ctx) + + const namespace = "coerce_existing_bigint" + const tableName = "events_existing_bigint" + infra.CreateNamespace(t, namespace) + + // Pre-create the table with the legacy column shape an affected + // operator would have today: ts is BIGINT, not a timestamp type. + client := infra.NewCatalogClient(t, namespace) + _, err := client.CreateTable(ctx, tableName, iceberg.NewSchemaWithIdentifiers( + 1, nil, + iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.StringType{}, Required: false}, + iceberg.NestedField{ID: 2, Name: "ts", Type: iceberg.Int64Type{}, Required: false}, + )) + require.NoError(t, err) + + // Same metadata as the auto-create test: Timestamp(Millis,UTC). + commonSchema := schema.Common{ + Type: schema.Object, Name: "Event", + Children: []schema.Common{ + {Name: "id", Type: schema.String}, + { + Name: "ts", Optional: true, Type: schema.Timestamp, + Logical: &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitMillis, AdjustToUTC: true, + }, + }, + }, + }, + } + schemaMeta := commonSchema.ToAny() + + const tsMillis = int64(1700000000000) + msg := service.NewMessage(nil) + msg.SetStructured(map[string]any{ + "id": "evt-1", + "ts": time.UnixMilli(tsMillis).UTC(), + }) + msg.MetaSetMut("schema", schemaMeta) + + router := infra.NewRouter(t, namespace, tableName, + WithSchemaEvolution(icebergimpl.SchemaEvolutionConfig{ + Enabled: true, + SchemaMetadata: "schema", + })) + produceMessages(t, ctx, router, service.MessageBatch{msg}) + + cols := querySQL[ColumnInfo](t, ctx, infra, + fmt.Sprintf(`DESCRIBE iceberg_cat."%s"."%s";`, namespace, tableName)) + require.Len(t, cols, 2) + typeOf := map[string]string{} + for _, c := range cols { + typeOf[c.ColumnName] = c.ColumnType + } + assert.Equal(t, "VARCHAR", typeOf["id"]) + assert.Equal(t, "BIGINT", typeOf["ts"], + "existing BIGINT column must remain BIGINT — schema evolution does not auto-promote BIGINT to TIMESTAMP") + + type row struct { + ID string `json:"id"` + TS int64 `json:"ts"` + } + rows := querySQL[row](t, ctx, infra, + fmt.Sprintf(`SELECT id, ts FROM iceberg_cat."%s"."%s";`, namespace, tableName)) + require.Len(t, rows, 1) + assert.Equal(t, "evt-1", rows[0].ID) + assert.Equal(t, tsMillis, rows[0].TS, + "time.Time value must be coerced to its wire-equivalent millisecond integer when the existing column is BIGINT") +} + +// TestIntegrationStrictModeRejectsCoerceOnExistingBigintColumn confirms the +// strict-mode escape hatch: when require_schema_metadata=true, the +// coerce-on-write bridge is disabled and a type disagreement between the +// existing column and the schema metadata becomes a hard write error +// instead of a silent conversion. Operators who want loud failure on +// schema drift (e.g. so they can rebuild affected tables) opt in via this +// flag. +func TestIntegrationStrictModeRejectsCoerceOnExistingBigintColumn(t *testing.T) { + integration.CheckSkip(t) + + ctx := context.Background() + infra := setupTestInfra(t, ctx) + + const namespace = "coerce_strict_reject" + const tableName = "events_strict_bigint" + infra.CreateNamespace(t, namespace) + + // Same pre-existing-BIGINT setup as the previous test. + client := infra.NewCatalogClient(t, namespace) + _, err := client.CreateTable(ctx, tableName, iceberg.NewSchemaWithIdentifiers( + 1, nil, + iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.StringType{}, Required: false}, + iceberg.NestedField{ID: 2, Name: "ts", Type: iceberg.Int64Type{}, Required: false}, + )) + require.NoError(t, err) + + commonSchema := schema.Common{ + Type: schema.Object, Name: "Event", + Children: []schema.Common{ + {Name: "id", Type: schema.String}, + { + Name: "ts", Optional: true, Type: schema.Timestamp, + Logical: &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{ + Unit: schema.TimeUnitMillis, AdjustToUTC: true, + }, + }, + }, + }, + } + schemaMeta := commonSchema.ToAny() + + const tsMillis = int64(1700000000000) + msg := service.NewMessage(nil) + msg.SetStructured(map[string]any{ + "id": "evt-1", + "ts": time.UnixMilli(tsMillis).UTC(), + }) + msg.MetaSetMut("schema", schemaMeta) + + // Strict mode on. + router := infra.NewRouter(t, namespace, tableName, + WithSchemaEvolution(icebergimpl.SchemaEvolutionConfig{ + Enabled: true, + SchemaMetadata: "schema", + RequireSchemaMetadata: true, + })) + + err = router.Route(ctx, service.MessageBatch{msg}) + require.Error(t, err, "strict mode must reject the coerce situation rather than silently converting") + assert.Contains(t, err.Error(), "require_schema_metadata=true", + "the error must name the strict-mode flag so the operator knows which knob is rejecting the write") +} diff --git a/internal/impl/iceberg/output_iceberg.go b/internal/impl/iceberg/output_iceberg.go index 10ad4706ff..991e947f9b 100644 --- a/internal/impl/iceberg/output_iceberg.go +++ b/internal/impl/iceberg/output_iceberg.go @@ -484,6 +484,17 @@ func parseSchemaEvolutionConfig(conf *service.ParsedConfig) (SchemaEvolutionConf } } + // Parse require_schema_metadata + if conf.Contains(ioFieldSchemaEvolution, ioFieldSchemaEvolutionRequireSchemaMetadata) { + cfg.RequireSchemaMetadata, err = conf.FieldBool(ioFieldSchemaEvolution, ioFieldSchemaEvolutionRequireSchemaMetadata) + if err != nil { + return cfg, err + } + if cfg.RequireSchemaMetadata && cfg.SchemaMetadata == "" { + return cfg, fmt.Errorf("%s.%s requires %s.%s to be set", ioFieldSchemaEvolution, ioFieldSchemaEvolutionRequireSchemaMetadata, ioFieldSchemaEvolution, ioFieldSchemaEvolutionSchemaMetadata) + } + } + return cfg, nil } diff --git a/internal/impl/iceberg/router.go b/internal/impl/iceberg/router.go index d315cba2d4..c2420c0c69 100644 --- a/internal/impl/iceberg/router.go +++ b/internal/impl/iceberg/router.go @@ -52,6 +52,11 @@ type SchemaEvolutionConfig struct { // NewColumnTypeMapping is an optional Bloblang mapping that can override inferred // or schema-metadata-derived column types during schema evolution. NewColumnTypeMapping *bloblang.Executor + // RequireSchemaMetadata enables strict mode: when true, writing a numeric + // value into a time-typed column without registered schema metadata is a + // hard error rather than a silent fallback to bloblang's seconds-default. + // Only meaningful when SchemaMetadata is also set. + RequireSchemaMetadata bool } const maxSchemaEvolutionRetries = 10 @@ -663,8 +668,10 @@ func (r *Router) createWriter(ctx context.Context, key tableKey) (*writer, error return nil, fmt.Errorf("creating committer: %w", err) } - // Create writer with its own table reference and the committer - w := NewWriter(writerTbl, comm, r.caseSensitive, r.writerOpts, r.logger) + // Create writer with its own table reference and the committer. + // The resolver is passed so the writer can use schema metadata to + // interpret numeric inputs into time-typed columns at shredding time. + w := NewWriter(writerTbl, comm, r.caseSensitive, r.writerOpts, r.resolver, r.schemaEvoCfg.RequireSchemaMetadata, r.logger) r.logger.Debugf("Created writer for table %s.%s", key.namespace, key.table) return w, nil diff --git a/internal/impl/iceberg/shredder/shredder.go b/internal/impl/iceberg/shredder/shredder.go index 90f79216d8..d91e7e6f5a 100644 --- a/internal/impl/iceberg/shredder/shredder.go +++ b/internal/impl/iceberg/shredder/shredder.go @@ -80,6 +80,20 @@ type RecordShredder struct { // iceberg's recommended convention and with engines like Spark and Trino // in their default configurations. caseSensitive bool + // fieldCommons optionally maps an iceberg field ID to the upstream + // schema.Common describing the same field. When present, the leaf + // value-conversion step uses Logical params (timestamp unit, + // AdjustToUTC, time-of-day unit) to interpret numeric inputs into + // typed columns instead of guessing. nil entries fall back to the + // pre-schema-metadata behavior — see [convertLeafValue]. + fieldCommons map[int]*schema.Common + // strictTemporal causes [convertLeafValue] to refuse numeric inputs + // into time-typed columns when no schema metadata has been + // registered for that column. When false (the default), the value + // converter falls back to [bloblang.ValueAsTimestamp]'s seconds + // default — convenient but silently wrong if the upstream produced + // a different unit. When true, the writer fails the batch loudly. + strictTemporal bool } // NewRecordShredder creates a new shredder for the given schema. @@ -93,6 +107,37 @@ func NewRecordShredder(schema *iceberg.Schema, caseSensitive bool) *RecordShredd } } +// SetStrictTemporalMode toggles whether numeric inputs into time-typed +// columns require registered schema metadata. With strict mode on, a bare +// int64 / float64 value reaching a TIMESTAMP / TIMESTAMPTZ / DATE / TIME +// column with no [schema.Common] in the field map is rejected with a +// per-field error rather than guessed-as-Unix-seconds. +// +// Strict mode has no effect on time.Time / time.Duration values, which +// carry their own unit unambiguously, and no effect on non-time columns. +// +// Defaults to off (back-compat). Operators that cannot guarantee schema +// metadata flows end-to-end can flip this on to fail loudly instead of +// silently corrupting dates by ~50,000 years. +func (rs *RecordShredder) SetStrictTemporalMode(on bool) { + rs.strictTemporal = on +} + +// SetFieldSchemaMetadata supplies a field-ID → schema.Common map that the +// leaf value converter consults when it sees a numeric input destined for a +// time-typed Iceberg column (TIMESTAMP, TIMESTAMPTZ, TIMESTAMP_NS, +// TIMESTAMP_TZ_NS, DATE, TIME). Without this map, the converter accepts +// time.Time / time.Duration / int64 directly but cannot disambiguate the +// unit of a bare numeric millis vs. micros vs. seconds — and historically +// defaulted to interpreting numbers as Unix seconds. Supplying this map +// lets the connector tell the shredder "this column was sourced from an +// Avro timestamp-millis", which the converter then honours. +// +// Passing nil clears any previously-set metadata. +func (rs *RecordShredder) SetFieldSchemaMetadata(byID map[int]*schema.Common) { + rs.fieldCommons = byID +} + // Shred converts a nested record into a sequence of shredded values. // The record should be a map[string]any matching the schema structure. // The sink receives each leaf value and notifications of unknown fields. @@ -219,7 +264,7 @@ func (rs *RecordShredder) shredValue( default: // Leaf/primitive type. - pqVal, err := convertLeafValue(value, typ) + pqVal, err := convertLeafValue(value, typ, rs.commonForField(fieldID), rs.strictTemporal) if err != nil { return err } @@ -331,8 +376,9 @@ func (rs *RecordShredder) shredMap( } first = false - // Shred the key. - keyVal, err := convertLeafValue(k, mapType.KeyType) + // Shred the key. Map keys are always leaf primitives, never time + // types, so the schema-metadata lookup never fires for them. + keyVal, err := convertLeafValue(k, mapType.KeyType, nil, false) if err != nil { return fmt.Errorf("map key: %w", err) } @@ -399,9 +445,15 @@ func (rs *RecordShredder) shredNull( } } -// convertLeafValue converts a Go value to a parquet.Value based on the Iceberg type. -// This is a stub - full implementation would handle all type conversions. -func convertLeafValue(value any, typ iceberg.Type) (parquet.Value, error) { +// convertLeafValue converts a Go value to a parquet.Value based on the +// Iceberg type. common is the upstream schema.Common describing the same +// column (or nil); when present it carries logical params consulted by the +// time/date arms to interpret numeric inputs using the declared unit +// instead of guessing. Without metadata, the converter accepts time.Time / +// time.Duration directly and treats bare numerics as already-in-the-target +// unit (microseconds for time, seconds for timestamp via bloblang +// ValueAsTimestamp's default — preserves the pre-PR behavior). +func convertLeafValue(value any, typ iceberg.Type, common *schema.Common, strictTemporal bool) (parquet.Value, error) { if value == nil { return parquet.NullValue(), nil } @@ -416,10 +468,34 @@ func convertLeafValue(value any, typ iceberg.Type) (parquet.Value, error) { } case iceberg.Int32Type: + // Coerce-to-existing-column-type: when the iceberg column is INT + // but the upstream schema declares a temporal logical type (Date, + // TimeOfDay) the upstream value will be time.Time / time.Duration. + // Honour the schema's unit and emit the wire-equivalent integer. + // Strict mode (require_schema_metadata=true) disables this bridge + // and refuses the type disagreement loudly instead. + if n, ok := coerceTemporalToNumeric(value, common); ok { + if strictTemporal { + return parquet.NullValue(), fmt.Errorf("int column received %T while schema metadata declares type %v; require_schema_metadata=true demands the existing column type match the schema metadata — recreate the table to migrate", value, common.Type) + } + return parquet.Int32Value(int32(n)), nil + } i, err := bloblang.ValueAsInt64(value) return parquet.Int32Value(int32(i)), err case iceberg.Int64Type: + // Coerce-to-existing-column-type: see the Int32Type arm above. For + // BIGINT columns this is the load-bearing case — it is what lets a + // rolling upgrade of schema_registry_decode (which now emits + // time.Time for timestamp-millis with preserve_logical_types=true) + // keep writing to a pre-existing BIGINT column that pre-dates the + // metadata fix. + if n, ok := coerceTemporalToNumeric(value, common); ok { + if strictTemporal { + return parquet.NullValue(), fmt.Errorf("bigint column received %T while schema metadata declares type %v; require_schema_metadata=true demands the existing column type match the schema metadata — recreate the table to migrate", value, common.Type) + } + return parquet.Int64Value(n), nil + } if v, ok := value.(uint64); ok && v > math.MaxInt64 { return parquet.NullValue(), fmt.Errorf("uint64 value %d exceeds int64 range", v) } @@ -443,36 +519,25 @@ func convertLeafValue(value any, typ iceberg.Type) (parquet.Value, error) { return parquet.ByteArrayValue(v), err case iceberg.DateType: - // Date is days since epoch as int32. - // TODO: Handle time.Time conversion. - switch v := value.(type) { - case int32: - return parquet.Int32Value(v), nil - case int: - return parquet.Int32Value(int32(v)), nil - case float64: - return parquet.Int32Value(int32(v)), nil - default: - return parquet.NullValue(), fmt.Errorf("cannot convert %T to date", value) - } + return convertDate(value, common, strictTemporal) case iceberg.TimeType: - // Time is microseconds since midnight as int64. - switch v := value.(type) { - case int64: - return parquet.Int64Value(v), nil - case int: - return parquet.Int64Value(int64(v)), nil - case float64: - return parquet.Int64Value(int64(v)), nil - default: - return parquet.NullValue(), fmt.Errorf("cannot convert %T to time", value) - } + // Iceberg TIME is microseconds since midnight. Accept time.Duration + // directly (the twmb/avro decode of time-millis/time-micros), and + // fall back to numeric input interpreted via the schema-declared + // unit when available. + return convertTime(value, common, strictTemporal) case iceberg.TimestampType, iceberg.TimestampTzType: - // Timestamp is microseconds since epoch as int64. - v, err := bloblang.ValueAsTimestamp(value) - return parquet.Int64Value(v.UnixMicro()), err + // Iceberg TIMESTAMP / TIMESTAMPTZ are microseconds since epoch. + // time.Time inputs are unambiguous and used directly. For numeric + // inputs, prefer the schema's declared unit so that millis stays + // millis (instead of being interpreted as seconds and landing in + // year 56755). + return convertTimestamp(value, common, false, strictTemporal) + + case iceberg.TimestampNsType, iceberg.TimestampTzNsType: + return convertTimestamp(value, common, true, strictTemporal) case iceberg.UUIDType: switch v := value.(type) { diff --git a/internal/impl/iceberg/shredder/shredder_test.go b/internal/impl/iceberg/shredder/shredder_test.go index 10f0a11958..2bda478dbb 100644 --- a/internal/impl/iceberg/shredder/shredder_test.go +++ b/internal/impl/iceberg/shredder/shredder_test.go @@ -1055,7 +1055,7 @@ func TestConvertLeafValueDecimal(t *testing.T) { for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { - result, err := convertLeafValue(tt.value, dt) + result, err := convertLeafValue(tt.value, dt, nil, false) if tt.wantErr { require.Error(t, err) return @@ -1069,7 +1069,7 @@ func TestConvertLeafValueDecimal(t *testing.T) { func TestConvertLeafValueDecimalPrecision(t *testing.T) { dt := iceberg.DecimalTypeOf(10, 2) - result, err := convertLeafValue(float64(123.45), dt) + result, err := convertLeafValue(float64(123.45), dt, nil, false) require.NoError(t, err) b := result.ByteArray() @@ -1085,7 +1085,7 @@ func TestConvertLeafValueDecimalPrecision(t *testing.T) { func TestConvertLeafValueDecimalNegative(t *testing.T) { dt := iceberg.DecimalTypeOf(10, 2) - result, err := convertLeafValue(float64(-123.45), dt) + result, err := convertLeafValue(float64(-123.45), dt, nil, false) require.NoError(t, err) b := result.ByteArray() @@ -1131,7 +1131,7 @@ func TestConvertLeafValueDecimalExactValues(t *testing.T) { for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { - result, err := convertLeafValue(tt.value, dt) + result, err := convertLeafValue(tt.value, dt, nil, false) require.NoError(t, err) assert.Equal(t, tt.wantUnscaled, decodeUnscaled(t, result)) }) @@ -1143,16 +1143,16 @@ func TestConvertLeafValueDecimalOverflow(t *testing.T) { dt := iceberg.DecimalTypeOf(5, 2) // 999.99 should succeed - _, err := convertLeafValue(float64(999.99), dt) + _, err := convertLeafValue(float64(999.99), dt, nil, false) require.NoError(t, err) // 1000.00 exceeds precision — unscaled 100000 >= 10^5 - _, err = convertLeafValue(float64(1000.00), dt) + _, err = convertLeafValue(float64(1000.00), dt, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "exceeds decimal(5, 2) precision") // Large negative should also fail - _, err = convertLeafValue(float64(-1000.00), dt) + _, err = convertLeafValue(float64(-1000.00), dt, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "exceeds decimal(5, 2) precision") } @@ -1160,7 +1160,7 @@ func TestConvertLeafValueDecimalOverflow(t *testing.T) { func TestConvertLeafValueDecimalStringError(t *testing.T) { dt := iceberg.DecimalTypeOf(10, 2) - _, err := convertLeafValue("not_a_number", dt) + _, err := convertLeafValue("not_a_number", dt, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "cannot parse") } @@ -1168,32 +1168,32 @@ func TestConvertLeafValueDecimalStringError(t *testing.T) { func TestConvertLeafValueDecimalNaNInf(t *testing.T) { dt := iceberg.DecimalTypeOf(10, 2) - _, err := convertLeafValue(math.NaN(), dt) + _, err := convertLeafValue(math.NaN(), dt, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "cannot convert") - _, err = convertLeafValue(math.Inf(1), dt) + _, err = convertLeafValue(math.Inf(1), dt, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "cannot convert") - _, err = convertLeafValue(math.Inf(-1), dt) + _, err = convertLeafValue(math.Inf(-1), dt, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "cannot convert") - _, err = convertLeafValue(float32(math.NaN()), dt) + _, err = convertLeafValue(float32(math.NaN()), dt, nil, false) require.Error(t, err) - _, err = convertLeafValue(float32(math.Inf(1)), dt) + _, err = convertLeafValue(float32(math.Inf(1)), dt, nil, false) require.Error(t, err) } func TestConvertLeafValueUint64Overflow(t *testing.T) { - _, err := convertLeafValue(uint64(math.MaxInt64+1), iceberg.Int64Type{}) + _, err := convertLeafValue(uint64(math.MaxInt64+1), iceberg.Int64Type{}, nil, false) require.Error(t, err) assert.Contains(t, err.Error(), "exceeds int64 range") // Value within range should succeed - _, err = convertLeafValue(uint64(math.MaxInt64), iceberg.Int64Type{}) + _, err = convertLeafValue(uint64(math.MaxInt64), iceberg.Int64Type{}, nil, false) require.NoError(t, err) } diff --git a/internal/impl/iceberg/shredder/temporal.go b/internal/impl/iceberg/shredder/temporal.go new file mode 100644 index 0000000000..76dc373f9f --- /dev/null +++ b/internal/impl/iceberg/shredder/temporal.go @@ -0,0 +1,382 @@ +// Copyright 2026 Redpanda Data, Inc. +// +// Licensed as a Redpanda Enterprise file under the Redpanda Community +// License (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// https://github.com/redpanda-data/redpanda/blob/master/licenses/rcl.md + +package shredder + +import ( + "errors" + "fmt" + "math" + "time" + + "github.com/parquet-go/parquet-go" + + "github.com/redpanda-data/benthos/v4/public/bloblang" + "github.com/redpanda-data/benthos/v4/public/schema" +) + +// commonForField returns the upstream schema.Common registered for the given +// iceberg field ID, or nil when no metadata has been registered for that +// field. Returns nil when the shredder itself has no metadata at all. +func (rs *RecordShredder) commonForField(fieldID int) *schema.Common { + if rs.fieldCommons == nil { + return nil + } + return rs.fieldCommons[fieldID] +} + +// convertDate converts a Go value into the Parquet representation of an +// Iceberg DATE column (int32 days since the unix epoch). Accepts time.Time +// (UTC midnight idiom from twmb/avro `date` decode), or a numeric value +// already expressing days since epoch. +// +// strictTemporal causes numeric inputs without registered schema metadata +// to be rejected — the column is DATE (days since epoch) but a bare +// numeric could be days, milliseconds, or seconds depending on upstream; +// strict mode forces the operator to attach metadata or convert upstream. +func convertDate(value any, common *schema.Common, strictTemporal bool) (parquet.Value, error) { + switch v := value.(type) { + case time.Time: + // Iceberg / Avro DATE is days since 1970-01-01, including negative + // values for pre-epoch dates. Go integer division truncates toward + // zero, so we floor explicitly: 1969-12-31 23:59:59 UTC has + // Unix() == -1, and floor(-1/86400) == -1, not 0. + secs := v.UTC().Unix() + days := secs / 86400 + if secs < 0 && secs%86400 != 0 { + days-- + } + return parquet.Int32Value(int32(days)), nil + case int32, int, int64, float64: + // In strict mode, a numeric value into a DATE column must come + // with metadata whose Type is also Date — anything else either + // has no shape information at all or has the wrong shape, both + // of which the operator asked us to fail on. + if strictTemporal && (common == nil || common.Type != schema.Date) { + return parquet.NullValue(), errors.New("date column received numeric value without matching schema.Common (Type=Date); require_schema_metadata=true demands a Date metadata entry to disambiguate the unit") + } + switch v := v.(type) { + case int32: + return parquet.Int32Value(v), nil + case int: + return parquet.Int32Value(int32(v)), nil + case int64: + return parquet.Int32Value(int32(v)), nil + case float64: + // NaN and ±Inf cast to int32 as implementation-defined garbage; + // silent corruption (year 1970) is worse than a hard error. + if math.IsNaN(v) || math.IsInf(v, 0) { + return parquet.NullValue(), fmt.Errorf("cannot convert %v to date", v) + } + return parquet.Int32Value(int32(v)), nil + } + return parquet.NullValue(), fmt.Errorf("cannot convert %T to date", value) + default: + return parquet.NullValue(), fmt.Errorf("cannot convert %T to date", value) + } +} + +// convertTime converts a Go value into the Parquet representation of an +// Iceberg TIME column (int64 microseconds since midnight). Accepts +// time.Duration directly (the twmb/avro decode of time-millis/time-micros), +// or a numeric value whose unit is consulted via common.Logical.TimeOfDay +// when present. Without schema metadata, numeric inputs are treated as +// already in microseconds — matching the historical behavior for numeric +// values landing in this column. +// +// time.Time inputs are extracted in the value's own [time.Location], not +// converted to UTC first. This treats time-of-day as zoneless wall-clock +// (matching Java's LocalTime, Postgres TIME, and Iceberg TIME): a +// time.Time of 14:30 EST yields 14:30 in the column, not 19:30 UTC. +// If the upstream wants UTC-shifted, it should v.UTC() before passing in. +func convertTime(value any, common *schema.Common, strictTemporal bool) (parquet.Value, error) { + switch v := value.(type) { + case time.Duration: + return parquet.Int64Value(v.Microseconds()), nil + case time.Time: + dur := time.Duration(v.Hour())*time.Hour + + time.Duration(v.Minute())*time.Minute + + time.Duration(v.Second())*time.Second + + time.Duration(v.Nanosecond())*time.Nanosecond + return parquet.Int64Value(dur.Microseconds()), nil + case int64, int32, int, float64: + // In strict mode, a numeric value into a TIME column must come + // with TimeOfDay-typed metadata that names the unit. Anything + // else (no metadata, wrong-typed metadata, or a TimeOfDay common + // missing its Logical params) means the unit is undeclared and + // the operator asked us to fail rather than guess. + if strictTemporal && (common == nil || common.Type != schema.TimeOfDay || common.Logical == nil || common.Logical.TimeOfDay == nil) { + return parquet.NullValue(), errors.New("time column received numeric value without matching schema.Common (Type=TimeOfDay with Logical.TimeOfDay); require_schema_metadata=true demands a TimeOfDay metadata entry with declared unit") + } + switch v := v.(type) { + case int64: + return parquet.Int64Value(numericToTimeMicros(v, common)), nil + case int32: + // Avro time-millis is int32; accept it directly as well. + return parquet.Int64Value(numericToTimeMicros(int64(v), common)), nil + case int: + return parquet.Int64Value(numericToTimeMicros(int64(v), common)), nil + case float64: + // NaN/±Inf cast is implementation-defined; reject explicitly. + if math.IsNaN(v) || math.IsInf(v, 0) { + return parquet.NullValue(), fmt.Errorf("cannot convert %v to time", v) + } + return parquet.Int64Value(numericToTimeMicros(int64(v), common)), nil + } + return parquet.NullValue(), fmt.Errorf("cannot convert %T to time", value) + default: + return parquet.NullValue(), fmt.Errorf("cannot convert %T to time", value) + } +} + +// numericToTimeMicros scales a numeric time-of-day value to microseconds +// using the schema's declared unit. Without schema metadata, the input is +// assumed to already be microseconds (Iceberg's internal representation), +// which preserves the pre-PR behavior for numeric inputs. +// +// An unrecognised TimeUnit reaching this function indicates a benthos +// schema-package contract violation — Validate() guards against it +// upstream — and we panic rather than silently misinterpret. Returning n +// unchanged would conflate a malformed schema with the no-metadata case. +func numericToTimeMicros(n int64, common *schema.Common) int64 { + if common == nil || common.Logical == nil || common.Logical.TimeOfDay == nil { + return n + } + switch common.Logical.TimeOfDay.Unit { + case schema.TimeUnitSeconds: + return n * 1_000_000 + case schema.TimeUnitMillis: + return n * 1_000 + case schema.TimeUnitMicros: + return n + case schema.TimeUnitNanos: + return n / 1_000 + default: + panic(fmt.Sprintf("unreachable: invalid TimeUnit %v escaped Validate()", common.Logical.TimeOfDay.Unit)) + } +} + +// convertTimestamp converts a Go value into the Parquet representation of an +// Iceberg TIMESTAMP / TIMESTAMPTZ column (int64 microseconds since epoch by +// default; nanoseconds when nanos is true, for the V3 *NsType variants). +// Accepts time.Time directly. For numeric inputs, prefers +// common.Logical.Timestamp.Unit to interpret the unit. When common is a +// Timestamp schema with no Logical (legacy / hand-constructed), falls back +// to EffectiveTimestamp's millis/UTC default rather than to +// bloblang.ValueAsTimestamp's seconds default — so a numeric millis value +// declared by the schema as Timestamp lands at the correct microsecond +// offset, not 1000× too far in the future. +// +// Only when no schema metadata is available at all do we fall through to +// bloblang.ValueAsTimestamp — which preserves the pre-PR behavior for +// callers that genuinely store unix-seconds in a numeric column without +// any schema annotation. +func convertTimestamp(value any, common *schema.Common, nanos, strictTemporal bool) (parquet.Value, error) { + if v, ok := value.(time.Time); ok { + if nanos { + return parquet.Int64Value(v.UnixNano()), nil + } + return parquet.Int64Value(v.UnixMicro()), nil + } + + // Reject NaN/±Inf up front. int64(NaN) is implementation-defined + // garbage; left to fall through, this would silently produce year 1970 + // (or worse) timestamps via either the metadata path or the bloblang + // fallback. Mirrors the guard the decimal converter already enforces. + switch v := value.(type) { + case float64: + if math.IsNaN(v) || math.IsInf(v, 0) { + return parquet.NullValue(), fmt.Errorf("cannot convert %v to timestamp", v) + } + case float32: + if math.IsNaN(float64(v)) || math.IsInf(float64(v), 0) { + return parquet.NullValue(), fmt.Errorf("cannot convert %v to timestamp", v) + } + } + + // Numeric path. Honour the schema's declared unit when we have one. + // A Timestamp schema with nil Logical falls through EffectiveTimestamp + // to the legacy millis/UTC default, which is also the right + // interpretation for the population that produced this PR: an + // upstream Avro timestamp-millis value flowing through the schema + // metadata path. + if n, ok := numericInt64(value); ok && common != nil && common.Type == schema.Timestamp { + p := common.EffectiveTimestamp() + out := scaleTimestampNumeric(n, p.Unit, nanos) + return parquet.Int64Value(out), nil + } + + // Strict mode: refuse the bloblang fallback for numeric values when + // the operator has asked us to fail loudly on missing metadata. + if strictTemporal { + if _, ok := numericInt64(value); ok { + return parquet.NullValue(), errors.New("timestamp column received numeric value with no Timestamp schema metadata; require_schema_metadata=true demands a schema.Common with Type=Timestamp to disambiguate the unit") + } + } + + // No applicable schema metadata: keep the historical bloblang path. + // This preserves behavior for callers that pass unix-seconds directly + // into a numeric column without schema annotation. + t, err := bloblang.ValueAsTimestamp(value) + if err != nil { + return parquet.NullValue(), err + } + if nanos { + return parquet.Int64Value(t.UnixNano()), nil + } + return parquet.Int64Value(t.UnixMicro()), nil +} + +// scaleTimestampNumeric scales a numeric timestamp from its declared unit +// into either microseconds (the standard Iceberg TIMESTAMP/TIMESTAMPTZ +// internal representation) or nanoseconds (the V3 *NsType internal +// representation). +// +// An unrecognised TimeUnit reaching this function indicates a benthos +// schema-package contract violation — Validate() guards against it +// upstream — and we panic rather than silently misinterpret. Returning n +// unchanged would conflate "unit=micros (no scaling needed)" with +// "malformed schema (no scaling possible)". +func scaleTimestampNumeric(n int64, unit schema.TimeUnit, nanos bool) int64 { + if nanos { + switch unit { + case schema.TimeUnitSeconds: + return n * 1_000_000_000 + case schema.TimeUnitMillis: + return n * 1_000_000 + case schema.TimeUnitMicros: + return n * 1_000 + case schema.TimeUnitNanos: + return n + default: + panic(fmt.Sprintf("unreachable: invalid TimeUnit %v escaped Validate()", unit)) + } + } + switch unit { + case schema.TimeUnitSeconds: + return n * 1_000_000 + case schema.TimeUnitMillis: + return n * 1_000 + case schema.TimeUnitMicros: + return n + case schema.TimeUnitNanos: + return n / 1_000 + default: + panic(fmt.Sprintf("unreachable: invalid TimeUnit %v escaped Validate()", unit)) + } +} + +// coerceTemporalToNumeric handles the rolling-upgrade case where the iceberg +// table holds a numeric column (BIGINT / INT) that pre-dates the metadata +// fix for issue #4399, but the upgraded upstream now emits a temporal Go +// value (`time.Time` for Timestamp / Date, `time.Duration` for TimeOfDay) +// together with correct schema.Common metadata. +// +// Without this conversion the shredder's Int64Type / Int32Type arms would +// hand the temporal value to bloblang.ValueAsInt64, which fails with +// "expected number value, got timestamp" — the exact shape of the customer +// regression that surfaced this work. +// +// Returns (0, false) when no coercion applies (value is not temporal, common +// is absent, or the common's declared type doesn't match the value's Go +// type). Callers fall back to their existing numeric coercion path in that +// case. +// +// The conversion is unit-aware: a `time.Time` against a Timestamp(Millis) +// common produces UnixMilli, Micros produces UnixMicro, etc. — preserving +// the wire-equivalent representation the operator's pre-fix pipeline had +// been storing in the same column. +func coerceTemporalToNumeric(value any, common *schema.Common) (int64, bool) { + if common == nil { + return 0, false + } + switch v := value.(type) { + case time.Time: + switch common.Type { + case schema.Timestamp: + p := common.EffectiveTimestamp() + switch p.Unit { + case schema.TimeUnitSeconds: + return v.Unix(), true + case schema.TimeUnitMillis: + return v.UnixMilli(), true + case schema.TimeUnitMicros: + return v.UnixMicro(), true + case schema.TimeUnitNanos: + return v.UnixNano(), true + } + case schema.Date: + // Days since epoch, floored toward negative infinity to match + // convertDate's pre-epoch handling. + secs := v.UTC().Unix() + days := secs / 86400 + if secs < 0 && secs%86400 != 0 { + days-- + } + return days, true + } + case time.Duration: + if common.Type == schema.TimeOfDay && common.Logical != nil && common.Logical.TimeOfDay != nil { + switch common.Logical.TimeOfDay.Unit { + case schema.TimeUnitSeconds: + return int64(v / time.Second), true + case schema.TimeUnitMillis: + return v.Milliseconds(), true + case schema.TimeUnitMicros: + return v.Microseconds(), true + case schema.TimeUnitNanos: + return v.Nanoseconds(), true + } + } + } + return 0, false +} + +// numericInt64 extracts an int64 from an arbitrary numeric Go value. Returns +// (0, false) for non-numeric inputs and for NaN/±Inf floats — so callers +// can either error explicitly on those cases or fall through to a more +// permissive parsing path. +// +// Float inputs are truncated toward zero — sub-unit precision is lost +// silently. Floats outside the int64 representable range overflow to +// undefined int64 values per Go conversion semantics; in practice this +// is unreachable for any sane timestamp/date/time-of-day value, but +// callers passing arbitrary numeric values should prefer integer types +// to avoid the precision loss. NaN and ±Inf are filtered to (0, false) +// because their int64 conversion is implementation-defined garbage; the +// callers add explicit error paths so these can't silently corrupt. +// +// uint64 values above MaxInt64 overflow to negative int64 values; this +// matches what the existing iceberg.Int64Type arm rejects via an +// explicit range check, but here we accept the wrap because timestamp +// values that high are nonsensical regardless. +func numericInt64(value any) (int64, bool) { + switch v := value.(type) { + case int64: + return v, true + case int: + return int64(v), true + case int32: + return int64(v), true + case uint64: + return int64(v), true + case uint32: + return int64(v), true + case float64: + if math.IsNaN(v) || math.IsInf(v, 0) { + return 0, false + } + return int64(v), true + case float32: + if math.IsNaN(float64(v)) || math.IsInf(float64(v), 0) { + return 0, false + } + return int64(v), true + } + return 0, false +} diff --git a/internal/impl/iceberg/shredder/temporal_test.go b/internal/impl/iceberg/shredder/temporal_test.go new file mode 100644 index 0000000000..f5ddc1117c --- /dev/null +++ b/internal/impl/iceberg/shredder/temporal_test.go @@ -0,0 +1,557 @@ +// Copyright 2026 Redpanda Data, Inc. +// +// Licensed as a Redpanda Enterprise file under the Redpanda Community +// License (the "License"); you may not use this file except in compliance with +// the License. You may obtain a copy of the License at +// +// https://github.com/redpanda-data/redpanda/blob/master/licenses/rcl.md + +package shredder + +import ( + "math" + "testing" + "time" + + "github.com/apache/iceberg-go" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/redpanda-data/benthos/v4/public/schema" +) + +// TestConvertTimestampSchemaAware exercises the unit-aware numeric path that +// closes the silent-corruption case from issue #4399. Without schema +// metadata, bloblang.ValueAsTimestamp interprets a numeric input as Unix +// seconds — a millisecond input therefore lands ~50,000 years in the +// future. With schema metadata declaring `timestamp-millis`, the same +// numeric input lands at the correct microsecond offset. +func TestConvertTimestampSchemaAware(t *testing.T) { + const epochMillis = int64(1_730_000_000_000) // 2024-10-27 03:33:20 UTC + const wantMicros = epochMillis * 1000 + + cases := []struct { + name string + value any + common *schema.Common + typ iceberg.Type + wantOK bool + wantVal int64 + exactly bool // assert exact equality on wantVal + }{ + { + name: "millis with schema metadata → correct micros", + value: epochMillis, + common: tsCommon(schema.TimeUnitMillis, true), + typ: iceberg.TimestampTzType{}, + wantOK: true, + wantVal: wantMicros, + exactly: true, + }, + { + name: "micros with schema metadata → identity micros", + value: wantMicros, + common: tsCommon(schema.TimeUnitMicros, true), + typ: iceberg.TimestampTzType{}, + wantOK: true, + wantVal: wantMicros, + exactly: true, + }, + { + name: "nanos with schema metadata → micros (truncated)", + value: wantMicros * 1000, + common: tsCommon(schema.TimeUnitNanos, true), + typ: iceberg.TimestampTzType{}, + wantOK: true, + wantVal: wantMicros, + exactly: true, + }, + { + name: "time.Time always wins over numeric path", + value: time.UnixMilli(epochMillis).UTC(), + common: nil, // even without schema metadata, time.Time is unambiguous + typ: iceberg.TimestampTzType{}, + wantOK: true, + wantVal: wantMicros, + exactly: true, + }, + { + name: "numeric without schema falls back to bloblang seconds default", + value: epochMillis, // user intended millis + common: nil, // but no schema to disambiguate + typ: iceberg.TimestampTzType{}, // → bloblang treats as seconds + wantOK: true, + wantVal: epochMillis * 1_000_000, // year 56755, but loud at least via comparison + exactly: false, // we don't assert exact correctness here, just existence + }, + { + name: "TIMESTAMP_NS with millis metadata scales to nanos", + value: epochMillis, + common: tsCommon(schema.TimeUnitMillis, true), + typ: iceberg.TimestampTzNsType{}, + wantOK: true, + wantVal: epochMillis * 1_000_000, + exactly: true, + }, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + pq, err := convertLeafValue(tc.value, tc.typ, tc.common, false) + if !tc.wantOK { + require.Error(t, err) + return + } + require.NoError(t, err) + if tc.exactly { + assert.Equal(t, tc.wantVal, pq.Int64()) + } + }) + } +} + +func TestConvertDateSchemaAware(t *testing.T) { + cases := []struct { + name string + value any + want int32 + wantErr bool + }{ + { + name: "time.Time at UTC midnight", + value: time.Date(2024, 1, 15, 0, 0, 0, 0, time.UTC), + want: 19737, // days since epoch + }, + { + name: "int32 days passes through", + value: int32(19737), + want: 19737, + }, + { + name: "int days passes through", + value: 19737, + want: 19737, + }, + { + name: "string rejected", + value: "2024-01-15", + wantErr: true, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + pq, err := convertLeafValue(tc.value, iceberg.DateType{}, nil, false) + if tc.wantErr { + require.Error(t, err) + return + } + require.NoError(t, err) + assert.Equal(t, tc.want, pq.Int32()) + }) + } +} + +func TestConvertTimeOfDaySchemaAware(t *testing.T) { + const eightThirtyMicros = int64(8*3600+30*60) * 1_000_000 + cases := []struct { + name string + value any + common *schema.Common + want int64 + }{ + { + name: "time.Duration passes through directly", + value: 8*time.Hour + 30*time.Minute, + common: nil, + want: eightThirtyMicros, + }, + { + name: "millis numeric with schema metadata scales correctly", + value: int64(8*3600+30*60) * 1_000, + common: todCommon(schema.TimeUnitMillis), + want: eightThirtyMicros, + }, + { + name: "micros numeric with schema metadata is identity", + value: eightThirtyMicros, + common: todCommon(schema.TimeUnitMicros), + want: eightThirtyMicros, + }, + { + name: "nanos numeric with schema metadata divides correctly", + value: eightThirtyMicros * 1_000, + common: todCommon(schema.TimeUnitNanos), + want: eightThirtyMicros, + }, + { + name: "numeric without schema treated as already-micros (legacy)", + value: eightThirtyMicros, + common: nil, + want: eightThirtyMicros, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + pq, err := convertLeafValue(tc.value, iceberg.TimeType{}, tc.common, false) + require.NoError(t, err) + assert.Equal(t, tc.want, pq.Int64()) + }) + } +} + +// TestConvertDatePreEpoch covers the floor-vs-truncate bug for negative +// Unix timestamps. 1969-12-31 23:59:59 UTC has Unix() == -1; truncating +// integer division gives 0 (Jan 1 1970) instead of the correct -1 +// (Dec 31 1969). Iceberg DATE supports negative day indices. +func TestConvertDatePreEpoch(t *testing.T) { + cases := []struct { + name string + t time.Time + want int32 + }{ + {"1969-12-31 23:59:59 UTC", time.Date(1969, 12, 31, 23, 59, 59, 0, time.UTC), -1}, + {"1969-12-31 00:00:00 UTC", time.Date(1969, 12, 31, 0, 0, 0, 0, time.UTC), -1}, + {"1969-01-01 UTC", time.Date(1969, 1, 1, 0, 0, 0, 0, time.UTC), -365}, + {"1900-01-01 UTC", time.Date(1900, 1, 1, 0, 0, 0, 0, time.UTC), -25567}, + {"1970-01-01 00:00:00 UTC (boundary)", time.Date(1970, 1, 1, 0, 0, 0, 0, time.UTC), 0}, + {"1970-01-01 23:59:59 UTC", time.Date(1970, 1, 1, 23, 59, 59, 0, time.UTC), 0}, + {"1970-01-02 UTC", time.Date(1970, 1, 2, 0, 0, 0, 0, time.UTC), 1}, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + pq, err := convertLeafValue(tc.t, iceberg.DateType{}, nil, false) + require.NoError(t, err) + assert.Equal(t, tc.want, pq.Int32()) + }) + } +} + +// TestConvertTimeUsesValueLocation verifies that time.Time inputs are +// extracted in the value's own location (zoneless wall-clock semantics) +// rather than UTC-shifted. A 14:30 EST time.Time should produce the +// 14:30 microsecond offset, not 19:30 UTC. +func TestConvertTimeUsesValueLocation(t *testing.T) { + est, err := time.LoadLocation("America/New_York") + require.NoError(t, err) + v := time.Date(2024, 1, 15, 14, 30, 0, 0, est) + pq, err := convertLeafValue(v, iceberg.TimeType{}, nil, false) + require.NoError(t, err) + want := int64(14*3600+30*60) * 1_000_000 + assert.Equal(t, want, pq.Int64()) +} + +// TestConvertTimestampLegacyNilLogical verifies that a Timestamp schema +// with no Logical params (legacy / hand-constructed) treats numeric inputs +// as millis (the EffectiveTimestamp default) rather than seconds (the +// bloblang fallback). Without this guard, a numeric millis value declared +// by the schema as Timestamp would land 1000× too far in the future. +func TestConvertTimestampLegacyNilLogical(t *testing.T) { + const epochMillis = int64(1_730_000_000_000) // 2024-10-27 03:33:20 UTC + const wantMicros = epochMillis * 1000 + + common := &schema.Common{Type: schema.Timestamp} // nil Logical + pq, err := convertLeafValue(epochMillis, iceberg.TimestampTzType{}, common, false) + require.NoError(t, err) + assert.Equal(t, wantMicros, pq.Int64()) +} + +// TestScaleTimestampNumericPanicsOnInvalid asserts the defense-in-depth +// behavior: an invalid TimeUnit slipping past Validate() must not silently +// degrade to a wrong unit interpretation. +func TestScaleTimestampNumericPanicsOnInvalid(t *testing.T) { + defer func() { + if r := recover(); r == nil { + t.Errorf("expected panic on invalid TimeUnit") + } + }() + scaleTimestampNumeric(123, schema.TimeUnit(99), false) +} + +// TestTemporalRejectsNaNInf locks in a defense-in-depth guard against +// silent corruption: a NaN or ±Inf float reaching a time-typed column +// must fail loudly rather than be cast to implementation-defined +// int32/int64 garbage (typically 0, i.e. year 1970). +func TestTemporalRejectsNaNInf(t *testing.T) { + cases := []struct { + name string + typ iceberg.Type + val any + }{ + {"date NaN", iceberg.DateType{}, math.NaN()}, + {"date +Inf", iceberg.DateType{}, math.Inf(1)}, + {"date -Inf", iceberg.DateType{}, math.Inf(-1)}, + {"time NaN", iceberg.TimeType{}, math.NaN()}, + {"time +Inf", iceberg.TimeType{}, math.Inf(1)}, + {"timestamp NaN", iceberg.TimestampTzType{}, math.NaN()}, + {"timestamp +Inf", iceberg.TimestampTzType{}, math.Inf(1)}, + {"timestamp -Inf", iceberg.TimestampTzType{}, math.Inf(-1)}, + {"timestamp_ns NaN", iceberg.TimestampNsType{}, math.NaN()}, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + _, err := convertLeafValue(tc.val, tc.typ, nil, false) + require.Error(t, err) + assert.Contains(t, err.Error(), "cannot convert") + }) + } +} + +// TestNumericInt64FiltersNaNInf verifies the helper returns (0, false) +// for non-finite floats so callers can route them to explicit error paths +// instead of letting the int64 cast produce garbage. +func TestNumericInt64FiltersNaNInf(t *testing.T) { + for _, v := range []any{math.NaN(), math.Inf(1), math.Inf(-1), float32(math.NaN()), float32(math.Inf(1))} { + _, ok := numericInt64(v) + assert.False(t, ok, "%v should not extract as int64", v) + } +} + +// TestStrictTemporalModeRejectsNumericWithoutMetadata exercises the +// require_schema_metadata=true path: numeric inputs into time-typed +// columns must fail loudly when no schema.Common has been registered for +// the field. +func TestStrictTemporalModeRejectsNumericWithoutMetadata(t *testing.T) { + cases := []struct { + name string + typ iceberg.Type + val any + }{ + {"date numeric strict", iceberg.DateType{}, int64(19737)}, + {"time numeric strict", iceberg.TimeType{}, int64(123456)}, + {"timestamp numeric strict", iceberg.TimestampTzType{}, int64(1_730_000_000_000)}, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + _, err := convertLeafValue(tc.val, tc.typ, nil, true) + require.Error(t, err) + assert.Contains(t, err.Error(), "require_schema_metadata=true") + }) + } +} + +// TestStrictTemporalModeRejectsWrongTypeMetadata verifies that strict +// mode is not satisfied by *any* schema.Common — the registered Type +// must match the iceberg-side column type. Otherwise a hand-crafted +// metadata mismatch (e.g. claiming a column is `string` while iceberg +// actually has it as DATE) would silently pass the strict check and +// the value-converter would interpret the numeric using the wrong +// shape. +func TestStrictTemporalModeRejectsWrongTypeMetadata(t *testing.T) { + cases := []struct { + name string + typ iceberg.Type + val any + common *schema.Common + }{ + { + name: "DATE column with String-typed metadata", + typ: iceberg.DateType{}, + val: int64(19737), + common: &schema.Common{Type: schema.String}, + }, + { + name: "DATE column with Int64-typed metadata", + typ: iceberg.DateType{}, + val: int64(19737), + common: &schema.Common{Type: schema.Int64}, + }, + { + name: "TIME column with Timestamp-typed metadata", + typ: iceberg.TimeType{}, + val: int64(8 * 3600 * 1_000_000), + common: &schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMillis, AdjustToUTC: true}}, + }, + }, + { + name: "TIMESTAMPTZ column with Date-typed metadata", + typ: iceberg.TimestampTzType{}, + val: int64(1_730_000_000_000), + common: &schema.Common{Type: schema.Date}, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + _, err := convertLeafValue(tc.val, tc.typ, tc.common, true) + require.Error(t, err) + assert.Contains(t, err.Error(), "require_schema_metadata=true") + }) + } +} + +// TestStrictTemporalModeAcceptsTimeTypeNatives confirms that strict mode +// has no effect when the input is already an unambiguous Go time type. +func TestStrictTemporalModeAcceptsTimeTypeNatives(t *testing.T) { + const epochMillis = int64(1_730_000_000_000) + t.Run("time.Time into TIMESTAMPTZ without metadata", func(t *testing.T) { + v := time.UnixMilli(epochMillis).UTC() + pq, err := convertLeafValue(v, iceberg.TimestampTzType{}, nil, true) + require.NoError(t, err) + assert.Equal(t, epochMillis*1000, pq.Int64()) + }) + t.Run("time.Duration into TIME without metadata", func(t *testing.T) { + d := 8*time.Hour + 30*time.Minute + pq, err := convertLeafValue(d, iceberg.TimeType{}, nil, true) + require.NoError(t, err) + assert.Equal(t, int64(8*3600+30*60)*1_000_000, pq.Int64()) + }) + t.Run("time.Time into DATE without metadata", func(t *testing.T) { + v := time.Date(2024, 1, 15, 0, 0, 0, 0, time.UTC) + pq, err := convertLeafValue(v, iceberg.DateType{}, nil, true) + require.NoError(t, err) + assert.Equal(t, int32(19737), pq.Int32()) + }) +} + +// TestStrictTemporalModeAcceptsNumericWithMetadata confirms that strict +// mode is inert once schema metadata is registered for the field. +func TestStrictTemporalModeAcceptsNumericWithMetadata(t *testing.T) { + const epochMillis = int64(1_730_000_000_000) + common := tsCommon(schema.TimeUnitMillis, true) + pq, err := convertLeafValue(epochMillis, iceberg.TimestampTzType{}, common, true) + require.NoError(t, err) + assert.Equal(t, epochMillis*1000, pq.Int64()) +} + +// TestCoerceTemporalIntoNumericColumn covers the rolling-upgrade case where +// an existing iceberg table has BIGINT / INT columns (from the pre-fix +// metadata bug for #4399) but the upgraded upstream now emits time.Time / +// time.Duration values together with correct schema.Common metadata. +// +// Without this path the shredder hands the temporal value to +// bloblang.ValueAsInt64 and fails with "expected number value, got +// timestamp" — the exact symptom that surfaced this work. The conversion +// must honour the schema's declared unit so the integer that lands in the +// column matches what the operator's pre-fix pipeline had been storing. +func TestCoerceTemporalIntoNumericColumn(t *testing.T) { + const tsMillis = int64(1_700_000_000_000) // 2023-11-14 22:13:20 UTC + + t.Run("time.Time into Int64 with Timestamp(Millis)", func(t *testing.T) { + v := time.UnixMilli(tsMillis).UTC() + pq, err := convertLeafValue(v, iceberg.Int64Type{}, tsCommon(schema.TimeUnitMillis, true), false) + require.NoError(t, err) + assert.Equal(t, tsMillis, pq.Int64()) + }) + + t.Run("time.Time into Int64 with Timestamp(Micros)", func(t *testing.T) { + v := time.UnixMilli(tsMillis).UTC() + pq, err := convertLeafValue(v, iceberg.Int64Type{}, tsCommon(schema.TimeUnitMicros, true), false) + require.NoError(t, err) + assert.Equal(t, tsMillis*1000, pq.Int64()) + }) + + t.Run("time.Time into Int64 with Timestamp(Nanos)", func(t *testing.T) { + v := time.UnixMilli(tsMillis).UTC() + pq, err := convertLeafValue(v, iceberg.Int64Type{}, tsCommon(schema.TimeUnitNanos, true), false) + require.NoError(t, err) + assert.Equal(t, tsMillis*1_000_000, pq.Int64()) + }) + + t.Run("time.Time into Int32 with Date", func(t *testing.T) { + v := time.Date(2024, 1, 15, 0, 0, 0, 0, time.UTC) + pq, err := convertLeafValue(v, iceberg.Int32Type{}, &schema.Common{Type: schema.Date}, false) + require.NoError(t, err) + assert.Equal(t, int32(19737), pq.Int32()) + }) + + t.Run("time.Duration into Int64 with TimeOfDay(Millis)", func(t *testing.T) { + d := 8*time.Hour + 30*time.Minute + pq, err := convertLeafValue(d, iceberg.Int64Type{}, todCommon(schema.TimeUnitMillis), false) + require.NoError(t, err) + assert.Equal(t, int64(8*3600+30*60)*1000, pq.Int64()) + }) + + t.Run("time.Duration into Int32 with TimeOfDay(Millis)", func(t *testing.T) { + d := 8*time.Hour + 30*time.Minute + pq, err := convertLeafValue(d, iceberg.Int32Type{}, todCommon(schema.TimeUnitMillis), false) + require.NoError(t, err) + assert.Equal(t, int32(8*3600+30*60)*1000, pq.Int32()) + }) + + // Non-temporal values into numeric columns continue to flow through the + // existing bloblang.ValueAsInt64 path unchanged — coerce is opt-in via + // the temporal-value-type predicate. + t.Run("plain int64 into Int64 with Timestamp metadata is unchanged", func(t *testing.T) { + pq, err := convertLeafValue(tsMillis, iceberg.Int64Type{}, tsCommon(schema.TimeUnitMillis, true), false) + require.NoError(t, err) + assert.Equal(t, tsMillis, pq.Int64()) + }) + + t.Run("string into Int64 with Timestamp metadata still errors", func(t *testing.T) { + _, err := convertLeafValue("not a number", iceberg.Int64Type{}, tsCommon(schema.TimeUnitMillis, true), false) + require.Error(t, err) + }) + + // Metadata-mismatched cases fall through to bloblang, which fails loudly + // rather than silently misinterpreting the value. + t.Run("time.Time into Int64 without metadata falls through to bloblang and errors", func(t *testing.T) { + v := time.UnixMilli(tsMillis).UTC() + _, err := convertLeafValue(v, iceberg.Int64Type{}, nil, false) + require.Error(t, err) + }) + + t.Run("time.Duration into Int64 with Timestamp metadata (mismatch) falls through and errors", func(t *testing.T) { + d := 8 * time.Hour + _, err := convertLeafValue(d, iceberg.Int64Type{}, tsCommon(schema.TimeUnitMillis, true), false) + require.Error(t, err) + }) +} + +// TestCoerceTemporalRejectedInStrictMode confirms that the temporal->numeric +// coerce path is disabled when require_schema_metadata=true. In strict mode +// a type disagreement between the existing column and the schema metadata +// is a hard error rather than a silent conversion — the operator has +// explicitly asked us not to bridge across stale tables. +func TestCoerceTemporalRejectedInStrictMode(t *testing.T) { + const tsMillis = int64(1_700_000_000_000) + + t.Run("time.Time into Int64 with Timestamp(Millis) errors", func(t *testing.T) { + v := time.UnixMilli(tsMillis).UTC() + _, err := convertLeafValue(v, iceberg.Int64Type{}, tsCommon(schema.TimeUnitMillis, true), true) + require.Error(t, err) + assert.Contains(t, err.Error(), "require_schema_metadata=true") + }) + + t.Run("time.Time into Int32 with Date errors", func(t *testing.T) { + v := time.Date(2024, 1, 15, 0, 0, 0, 0, time.UTC) + _, err := convertLeafValue(v, iceberg.Int32Type{}, &schema.Common{Type: schema.Date}, true) + require.Error(t, err) + assert.Contains(t, err.Error(), "require_schema_metadata=true") + }) + + t.Run("time.Duration into Int64 with TimeOfDay errors", func(t *testing.T) { + d := 8*time.Hour + 30*time.Minute + _, err := convertLeafValue(d, iceberg.Int64Type{}, todCommon(schema.TimeUnitMillis), true) + require.Error(t, err) + assert.Contains(t, err.Error(), "require_schema_metadata=true") + }) + + // Strict mode is inert when the value/column types already agree — it + // only rejects coerce situations, not the happy path. + t.Run("plain int64 into Int64 with Timestamp metadata still succeeds in strict", func(t *testing.T) { + pq, err := convertLeafValue(tsMillis, iceberg.Int64Type{}, tsCommon(schema.TimeUnitMillis, true), true) + require.NoError(t, err) + assert.Equal(t, tsMillis, pq.Int64()) + }) + + t.Run("plain int64 into Int64 without metadata still succeeds in strict", func(t *testing.T) { + pq, err := convertLeafValue(tsMillis, iceberg.Int64Type{}, nil, true) + require.NoError(t, err) + assert.Equal(t, tsMillis, pq.Int64()) + }) +} + +func tsCommon(u schema.TimeUnit, utc bool) *schema.Common { + return &schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: u, AdjustToUTC: utc}}, + } +} + +func todCommon(u schema.TimeUnit) *schema.Common { + return &schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: u}}, + } +} diff --git a/internal/impl/iceberg/type_inference.go b/internal/impl/iceberg/type_inference.go index f14420a067..dd5ceadd8d 100644 --- a/internal/impl/iceberg/type_inference.go +++ b/internal/impl/iceberg/type_inference.go @@ -120,6 +120,12 @@ func (ti *typeInferrer) inferType(value any) (iceberg.Type, error) { case time.Time: return iceberg.TimestampTzType{}, nil + case time.Duration: + // time.Duration is the natural Go representation for time-of-day + // values produced by twmb/avro decoding `time-millis`/`time-micros`. + // Iceberg TIME stores microseconds since midnight, no timezone. + return iceberg.TimeType{}, nil + case []byte: return iceberg.BinaryType{}, nil diff --git a/internal/impl/iceberg/type_resolver.go b/internal/impl/iceberg/type_resolver.go index 365cb935ca..5e26612368 100644 --- a/internal/impl/iceberg/type_resolver.go +++ b/internal/impl/iceberg/type_resolver.go @@ -238,7 +238,45 @@ func commonTypeToIcebergTypeRec(c *schema.Common, ti *typeInferrer) (iceberg.Typ case schema.ByteArray: return iceberg.BinaryType{}, nil case schema.Timestamp: - return iceberg.TimestampTzType{}, nil + // Pick an Iceberg variant per the schema's declared unit/UTC-adjust. + // Legacy Timestamps (nil Logical) fall through to the millis/UTC + // default via EffectiveTimestamp(), preserving today's behavior of + // "always TimestampTzType". Schemas that explicitly say nanos use + // the V3 *NsType variants (catalog must support spec V3 to read + // these — that's surfaced as a write-time error from iceberg-go, + // not silently downcast here). + p := c.EffectiveTimestamp() + switch { + case p.Unit == schema.TimeUnitNanos && p.AdjustToUTC: + return iceberg.TimestampTzNsType{}, nil + case p.Unit == schema.TimeUnitNanos: + return iceberg.TimestampNsType{}, nil + case p.AdjustToUTC: + return iceberg.TimestampTzType{}, nil + default: + return iceberg.TimestampType{}, nil + } + case schema.Date: + return iceberg.DateType{}, nil + case schema.TimeOfDay: + // Iceberg's TIME is microseconds since midnight, no timezone. Reject + // shapes Iceberg can't faithfully represent rather than silently + // downcast — see sink-design-guidance.md. + if c.Logical == nil || c.Logical.TimeOfDay == nil { + return nil, fmt.Errorf("time-of-day field %q missing Logical.TimeOfDay parameters", c.Name) + } + p := c.Logical.TimeOfDay + if p.AdjustToUTC { + return nil, fmt.Errorf("time-of-day field %q has AdjustToUTC=true; Iceberg has no time-with-tz column type, downcast or restructure upstream", c.Name) + } + switch p.Unit { + case schema.TimeUnitMillis, schema.TimeUnitMicros: + return iceberg.TimeType{}, nil + default: + return nil, fmt.Errorf("time-of-day field %q has unit %v; Iceberg supports only MILLIS and MICROS for TIME columns", c.Name, p.Unit) + } + case schema.UUID: + return iceberg.UUIDType{}, nil case schema.Decimal: if c.Logical == nil || c.Logical.Decimal == nil { return nil, fmt.Errorf("decimal field %q is missing precision/scale", c.Name) @@ -250,6 +288,13 @@ func commonTypeToIcebergTypeRec(c *schema.Common, ti *typeInferrer) (iceberg.Typ return commonObjectToIcebergStruct(c, ti) case schema.Array: return commonArrayToIcebergList(c, ti) + case schema.Map: + return commonMapToIcebergMap(c, ti) + case schema.Union: + // Iceberg has no native union type. Surface the constraint + // loudly with a remediation pointer rather than the generic + // "unsupported" error — same pattern as parquet_encode. + return nil, fmt.Errorf("field %q is Union which iceberg cannot express; flatten to a single branch upstream (typically by projecting to the non-null variant) before iceberg", c.Name) case schema.Any, schema.Null: return iceberg.StringType{}, nil default: @@ -304,6 +349,26 @@ func commonArrayToIcebergList(c *schema.Common, ti *typeInferrer) (*iceberg.List }, nil } +// commonMapToIcebergMap converts a schema.Common Map (which always keys on +// string by convention, with the single child describing the value type) +// to an iceberg.MapType. Mirrors the Avro / parquet conventions. +func commonMapToIcebergMap(c *schema.Common, ti *typeInferrer) (*iceberg.MapType, error) { + if len(c.Children) != 1 { + return nil, fmt.Errorf("map type must have exactly one value-type child, got %d", len(c.Children)) + } + valType, err := commonTypeToIcebergTypeRec(&c.Children[0], ti) + if err != nil { + return nil, fmt.Errorf("map value: %w", err) + } + return &iceberg.MapType{ + KeyID: ti.allocateFieldID(), + KeyType: iceberg.StringType{}, + ValueID: ti.allocateFieldID(), + ValueType: valType, + ValueRequired: false, + }, nil +} + // applyTypeMapping runs the Bloblang new_column_type_mapping. func (r *typeResolver) applyTypeMapping( field *UnknownFieldError, diff --git a/internal/impl/iceberg/type_resolver_test.go b/internal/impl/iceberg/type_resolver_test.go index 4870e37218..7dc59f860e 100644 --- a/internal/impl/iceberg/type_resolver_test.go +++ b/internal/impl/iceberg/type_resolver_test.go @@ -116,6 +116,44 @@ func TestCommonTypeToIcebergType(t *testing.T) { Name: "amount", Type: schema.BigDecimal, }, "", true}, + // Logical-type cases the original issue #4399 stopped at INT/BIGINT — + // these now produce the right Iceberg column types. + {"Timestamp millis UTC default (legacy nil Logical)", schema.Common{Type: schema.Timestamp}, "timestamptz", false}, + {"Timestamp micros UTC", schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMicros, AdjustToUTC: true}}, + }, "timestamptz", false}, + {"Timestamp local micros (no tz)", schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMicros, AdjustToUTC: false}}, + }, "timestamp", false}, + {"Timestamp nanos UTC (V3)", schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitNanos, AdjustToUTC: true}}, + }, "timestamptz_ns", false}, + {"Timestamp local nanos (V3)", schema.Common{ + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitNanos, AdjustToUTC: false}}, + }, "timestamp_ns", false}, + {"Date", schema.Common{Type: schema.Date}, "date", false}, + {"TimeOfDay millis (no tz)", schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMillis}}, + }, "time", false}, + {"TimeOfDay micros (no tz)", schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMicros}}, + }, "time", false}, + {"TimeOfDay AdjustToUTC rejected (Iceberg has no time-with-tz)", schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMicros, AdjustToUTC: true}}, + }, "", true}, + {"TimeOfDay missing Logical rejected", schema.Common{Type: schema.TimeOfDay}, "", true}, + {"TimeOfDay nanos rejected (Iceberg TIME is micros)", schema.Common{ + Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitNanos}}, + }, "", true}, + {"UUID", schema.Common{Type: schema.UUID}, "uuid", false}, } for _, tt := range tests { @@ -131,6 +169,57 @@ func TestCommonTypeToIcebergType(t *testing.T) { } } +// TestCommonTypeToIcebergType_MapAndUnion closes the coverage gap the +// common-schema-audit skill flagged: Map and Union previously fell into +// the type switch's `default` arm with a generic "unsupported" error. +// Map now maps to MapType; Union still errors but with a +// remediation-pointer message rather than the generic one. +func TestCommonTypeToIcebergType_MapAndUnion(t *testing.T) { + t.Run("Map of String->Int64", func(t *testing.T) { + root := schema.Common{ + Type: schema.Map, + Children: []schema.Common{{Type: schema.Int64}}, + } + got, err := commonTypeToIcebergType(&root, newTypeInferrer(true)) + require.NoError(t, err) + mt, ok := got.(*iceberg.MapType) + require.True(t, ok, "want *iceberg.MapType, got %T", got) + assert.Equal(t, "string", mt.KeyType.Type()) + assert.Equal(t, "long", mt.ValueType.Type()) + }) + + t.Run("Map of String->Timestamp", func(t *testing.T) { + root := schema.Common{ + Type: schema.Map, + Children: []schema.Common{ + {Type: schema.Timestamp, Logical: &schema.LogicalParams{ + Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMillis, AdjustToUTC: true}, + }}, + }, + } + got, err := commonTypeToIcebergType(&root, newTypeInferrer(true)) + require.NoError(t, err) + mt, ok := got.(*iceberg.MapType) + require.True(t, ok) + assert.Equal(t, "timestamptz", mt.ValueType.Type()) + }) + + t.Run("Map with wrong child arity errors clearly", func(t *testing.T) { + root := schema.Common{Type: schema.Map} + _, err := commonTypeToIcebergType(&root, newTypeInferrer(true)) + require.Error(t, err) + assert.Contains(t, err.Error(), "exactly one value-type child") + }) + + t.Run("Union refused loudly", func(t *testing.T) { + root := schema.Common{Name: "u", Type: schema.Union} + _, err := commonTypeToIcebergType(&root, newTypeInferrer(true)) + require.Error(t, err) + assert.Contains(t, err.Error(), "Union") + assert.Contains(t, err.Error(), "flatten", "error must point at the upstream remediation") + }) +} + func TestFindCommonField(t *testing.T) { root := schema.Common{ Type: schema.Object, diff --git a/internal/impl/iceberg/writer.go b/internal/impl/iceberg/writer.go index 43a62cfb4d..4ba895f8aa 100644 --- a/internal/impl/iceberg/writer.go +++ b/internal/impl/iceberg/writer.go @@ -1,4 +1,4 @@ -// Copyright 2025 Redpanda Data, Inc. +// Copyright 2026 Redpanda Data, Inc. // // Licensed as a Redpanda Enterprise file under the Redpanda Community // License (the "License"); you may not use this file except in compliance with @@ -13,6 +13,7 @@ import ( "context" "fmt" "path" + "reflect" "slices" "strings" @@ -23,6 +24,7 @@ import ( "github.com/parquet-go/parquet-go" "github.com/parquet-go/parquet-go/format" + "github.com/redpanda-data/benthos/v4/public/schema" "github.com/redpanda-data/benthos/v4/public/service" "github.com/redpanda-data/connect/v4/internal/impl/iceberg/icebergx" "github.com/redpanda-data/connect/v4/internal/impl/iceberg/shredder" @@ -30,24 +32,40 @@ import ( // writer handles writing batches of messages to a single Iceberg table. type writer struct { - table *table.Table - committer *committer - caseSensitive bool - writerOpts []parquet.WriterOption - logger *service.Logger + table *table.Table + committer *committer + caseSensitive bool + writerOpts []parquet.WriterOption + resolver *typeResolver + requireSchemaMetadata bool + logger *service.Logger + + // coerceLoggedFieldIDs tracks the iceberg field IDs we have already + // logged a coerce-on-write notice for, so that a long-running writer + // emits the divergence between schema metadata and existing column + // type once per column rather than per batch. Single-goroutine access + // per writer instance so no synchronisation is needed. + coerceLoggedFieldIDs map[int]struct{} } // NewWriter creates a new writer for a specific table. // The table and committer should use separate table references since they // operate in different goroutines and the table object is mutable. // caseSensitive controls how message keys are matched against the schema. -func NewWriter(tbl *table.Table, comm *committer, caseSensitive bool, writerOpts []parquet.WriterOption, logger *service.Logger) *writer { +// resolver supplies optional per-message schema metadata used by the shredder +// to interpret numeric inputs into time-typed columns; pass nil to disable. +// requireSchemaMetadata enables shredder strict mode — see +// [shredder.RecordShredder.SetStrictTemporalMode]. +func NewWriter(tbl *table.Table, comm *committer, caseSensitive bool, writerOpts []parquet.WriterOption, resolver *typeResolver, requireSchemaMetadata bool, logger *service.Logger) *writer { return &writer{ - table: tbl, - committer: comm, - caseSensitive: caseSensitive, - writerOpts: writerOpts, - logger: logger, + table: tbl, + committer: comm, + caseSensitive: caseSensitive, + writerOpts: writerOpts, + resolver: resolver, + requireSchemaMetadata: requireSchemaMetadata, + logger: logger, + coerceLoggedFieldIDs: map[int]struct{}{}, } } @@ -186,8 +204,36 @@ func (w *writer) messagesToParquet(batch service.MessageBatch) ([]partitionFile, } numPartitionFields := spec.NumFields() - // Create shredder for the schema + // Create shredder for the schema. When schema metadata is configured + // and the first message in the batch carries it, use it to inform the + // shredder's numeric-to-temporal conversion. Schema metadata is the + // authoritative source for "this BIGINT-shaped value is actually + // timestamp-millis" and prevents the year-50000 silent corruption + // that bloblang.ValueAsTimestamp's seconds-default would otherwise + // produce. + // + // We sample the schema metadata from batch[0] only and apply it to + // every message in the batch. Connect's iceberg router groups by + // (namespace, table) before reaching this method, so messages in a + // batch share a destination table and — in every supported upstream + // (schema-registry decode, parquet decode, single-source streams) — + // a single schema. If a future upstream genuinely interleaves + // different schema metadata into one batch, this assumption breaks + // silently for messages 1..N; in that case the writer must be + // extended to per-message metadata lookup with a small cache. rs := shredder.NewRecordShredder(schema, w.caseSensitive) + if w.requireSchemaMetadata { + rs.SetStrictTemporalMode(true) + } + if w.resolver != nil && len(batch) > 0 { + if common, err := w.resolver.parseSchemaMetadata(batch[0]); err != nil { + w.logger.Warnf("parsing schema metadata for shredder: %v (falling back to schema-agnostic conversion)", err) + } else if common != nil { + fieldCommons := buildShredderFieldCommons(schema, common, w.caseSensitive) + w.logCoerceDecisions(schema, fieldCommons) + rs.SetFieldSchemaMetadata(fieldCommons) + } + } // For unpartitioned tables, use a single writer if spec.IsUnpartitioned() { @@ -305,6 +351,185 @@ func (w *writer) Close() { w.committer.Close() } +// logCoerceDecisions inspects the resolved fieldID → schema.Common map +// against the live iceberg schema and emits one INFO-level log per field +// whose declared upstream type would have produced a different iceberg +// column type than the one the table currently holds. +// +// The intended audience is operators rolling out the #4399 metadata fix +// over existing tables whose columns were created under the pre-fix +// (degraded) metadata shape: the log surface tells them which columns the +// shredder is silently coerce-converting on write, so they can choose to +// rebuild the affected tables when they want native temporal columns. +// +// Per-field dedup is via w.coerceLoggedFieldIDs so a long-running writer +// emits each notice once, not per-batch. +func (w *writer) logCoerceDecisions(s *iceberg.Schema, fieldCommons map[int]*schema.Common) { + if w.logger == nil || len(fieldCommons) == 0 { + return + } + icebergTypeByID := map[int]iceberg.Type{} + collectLeafIcebergTypes(s.AsStruct().FieldList, icebergTypeByID) + + for fieldID, common := range fieldCommons { + if _, already := w.coerceLoggedFieldIDs[fieldID]; already { + continue + } + existingType, ok := icebergTypeByID[fieldID] + if !ok { + continue + } + // The type inferrer is used by commonTypeToIcebergType only for + // nested struct/list ID allocation, which we don't care about + // when comparing leaf primitives — a throwaway inferrer is fine. + impliedType, err := commonTypeToIcebergType(common, newTypeInferrer(w.caseSensitive)) + if err != nil || impliedType == nil { + continue + } + if reflect.DeepEqual(existingType, impliedType) { + continue + } + fieldName := lookupFieldName(s, fieldID) + if w.requireSchemaMetadata { + w.logger.Infof( + "iceberg: field %q has existing column type %s but schema metadata declares %s; require_schema_metadata=true will reject writes for this column. Recreate the table to migrate to %s.", + fieldName, existingType.String(), impliedType.String(), impliedType.String(), + ) + } else { + w.logger.Infof( + "iceberg: coercing field %q on write: existing column type %s does not match the type implied by schema metadata (%s). "+ + "Values will be written using the existing column type. Recreate the table to migrate to %s.", + fieldName, existingType.String(), impliedType.String(), impliedType.String(), + ) + } + w.coerceLoggedFieldIDs[fieldID] = struct{}{} + } +} + +// collectLeafIcebergTypes recursively walks the iceberg schema, populating +// out with fieldID → leaf type for every primitive-typed field. +func collectLeafIcebergTypes(fields []iceberg.NestedField, out map[int]iceberg.Type) { + for _, f := range fields { + switch t := f.Type.(type) { + case *iceberg.StructType: + collectLeafIcebergTypes(t.FieldList, out) + case *iceberg.ListType: + // Lists wrap a single element; treat the element as a leaf + // candidate for completeness, even though primitive-element + // lists are the common case. + out[t.ElementID] = t.Element + if st, ok := t.Element.(*iceberg.StructType); ok { + collectLeafIcebergTypes(st.FieldList, out) + } + case *iceberg.MapType: + out[t.ValueID] = t.ValueType + if st, ok := t.ValueType.(*iceberg.StructType); ok { + collectLeafIcebergTypes(st.FieldList, out) + } + default: + out[f.ID] = f.Type + } + } +} + +// lookupFieldName returns the human-readable name for an iceberg fieldID, +// or a synthesized "field_" if the schema doesn't expose it directly. +func lookupFieldName(s *iceberg.Schema, fieldID int) string { + if name, ok := s.FindColumnName(fieldID); ok { + return name + } + return fmt.Sprintf("field_%d", fieldID) +} + +// buildShredderFieldCommons walks an iceberg schema and the parallel +// schema.Common metadata that produced it, returning a fieldID → *schema.Common +// map keyed by the iceberg field IDs of every leaf column. The shredder +// consults this to interpret numeric inputs into time-typed columns using +// the unit/AdjustToUTC semantics declared by the upstream schema. +// +// Only leaf-level entries are emitted; struct/list/map containers are +// recursed into. Field matching uses the same case-sensitivity rule as +// shredding so the lookup behaves consistently with how values are routed. +// Fields present in the iceberg schema but absent from the metadata are +// skipped — those columns either pre-date the metadata or were added by +// schema evolution from a different source. +func buildShredderFieldCommons(s *iceberg.Schema, root *schema.Common, caseSensitive bool) map[int]*schema.Common { + if root == nil { + return nil + } + out := make(map[int]*schema.Common) + visitIcebergSchemaFields(s.AsStruct().FieldList, root, caseSensitive, out) + if len(out) == 0 { + return nil + } + return out +} + +func visitIcebergSchemaFields(fields []iceberg.NestedField, parent *schema.Common, caseSensitive bool, out map[int]*schema.Common) { + if parent == nil || parent.Type != schema.Object { + return + } + for _, f := range fields { + child := lookupCommonChild(parent, f.Name, caseSensitive) + if child == nil { + continue + } + recordOrRecurseIcebergField(f.ID, f.Type, child, caseSensitive, out) + } +} + +// recordOrRecurseIcebergField is the leaf-vs-recurse decision for a single +// iceberg type/common pair. Leaves are registered in out; container types +// (struct, list, map) are recursed into so their leaf descendants pick up +// metadata too. +// +// When the iceberg-side container shape and the common-side type don't +// agree (e.g. iceberg has ListType but the common says Object), we skip +// rather than blindly consume children of the wrong shape. The shredder +// then falls back to the historical schema-agnostic conversion for those +// fields, which is the safe loss-of-precision rather than misinterpreting. +func recordOrRecurseIcebergField(fieldID int, typ iceberg.Type, common *schema.Common, caseSensitive bool, out map[int]*schema.Common) { + switch t := typ.(type) { + case *iceberg.StructType: + if common.Type != schema.Object { + return + } + visitIcebergSchemaFields(t.FieldList, common, caseSensitive, out) + case *iceberg.ListType: + // A schema.Common array carries the element schema as its single + // child; skip if the shape doesn't match. + if common.Type != schema.Array || len(common.Children) != 1 { + return + } + recordOrRecurseIcebergField(t.ElementID, t.Element, &common.Children[0], caseSensitive, out) + case *iceberg.MapType: + // schema.Common maps are encoded as type Map with a single child + // representing the value schema. Keys are always primitives in + // our model, so they don't need metadata. Recurse into the value. + if common.Type != schema.Map || len(common.Children) != 1 { + return + } + recordOrRecurseIcebergField(t.ValueID, t.ValueType, &common.Children[0], caseSensitive, out) + default: + // Leaf — register the metadata so the shredder can consult it. + out[fieldID] = common + } +} + +func lookupCommonChild(parent *schema.Common, name string, caseSensitive bool) *schema.Common { + for i := range parent.Children { + ch := &parent.Children[i] + if caseSensitive { + if ch.Name == name { + return ch + } + } else if strings.EqualFold(ch.Name, name) { + return ch + } + } + return nil +} + // parquetColumn holds state for writing to a single parquet column. type parquetColumn struct { writer *parquet.ColumnWriter diff --git a/internal/impl/iceberg/writer_test.go b/internal/impl/iceberg/writer_test.go index af7ade85c0..cd6819b7c1 100644 --- a/internal/impl/iceberg/writer_test.go +++ b/internal/impl/iceberg/writer_test.go @@ -11,9 +11,11 @@ package iceberg import ( "testing" + "github.com/apache/iceberg-go" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" + "github.com/redpanda-data/benthos/v4/public/schema" "github.com/redpanda-data/connect/v4/internal/impl/iceberg/icebergx" ) @@ -110,3 +112,137 @@ func TestParquetSinkNewFieldDedup(t *testing.T) { assert.Len(t, sink.newFieldErrors(), 2) }) } + +// TestBuildShredderFieldCommons verifies that the field-id → schema.Common +// map produced for the shredder correctly traverses iceberg containers +// (struct, list, map) so descendant leaf columns pick up unit metadata +// instead of being treated as the container's leaf. +func TestBuildShredderFieldCommons(t *testing.T) { + tsCommon := schema.Common{ + Name: "ts", + Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMillis, AdjustToUTC: true}}, + } + + t.Run("nil root returns nil", func(t *testing.T) { + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "x", Type: iceberg.PrimitiveTypes.Int64}, + ) + assert.Nil(t, buildShredderFieldCommons(s, nil, false)) + }) + + t.Run("flat schema registers leaf metadata", func(t *testing.T) { + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "id", Type: iceberg.PrimitiveTypes.Int64}, + iceberg.NestedField{ID: 2, Name: "ts", Type: iceberg.TimestampTzType{}}, + ) + root := schema.Common{ + Type: schema.Object, + Children: []schema.Common{ + {Name: "id", Type: schema.Int64}, + tsCommon, + }, + } + got := buildShredderFieldCommons(s, &root, false) + require.NotNil(t, got) + require.Contains(t, got, 2) + assert.Equal(t, schema.Timestamp, got[2].Type) + require.NotNil(t, got[2].Logical) + require.NotNil(t, got[2].Logical.Timestamp) + assert.Equal(t, schema.TimeUnitMillis, got[2].Logical.Timestamp.Unit) + }) + + t.Run("nested struct registers descendant leaf", func(t *testing.T) { + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "outer", Type: &iceberg.StructType{ + FieldList: []iceberg.NestedField{ + {ID: 2, Name: "inner_ts", Type: iceberg.TimestampTzType{}}, + }, + }}, + ) + root := schema.Common{ + Type: schema.Object, + Children: []schema.Common{ + {Name: "outer", Type: schema.Object, Children: []schema.Common{ + { + Name: "inner_ts", Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMicros, AdjustToUTC: true}}, + }, + }}, + }, + } + got := buildShredderFieldCommons(s, &root, false) + require.Contains(t, got, 2, "inner_ts should be registered by element ID") + assert.Equal(t, schema.TimeUnitMicros, got[2].Logical.Timestamp.Unit) + }) + + t.Run("list of timestamps registers element metadata", func(t *testing.T) { + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "events", Type: &iceberg.ListType{ + ElementID: 100, Element: iceberg.TimestampTzType{}, ElementRequired: false, + }}, + ) + root := schema.Common{ + Type: schema.Object, + Children: []schema.Common{ + {Name: "events", Type: schema.Array, Children: []schema.Common{tsCommon}}, + }, + } + got := buildShredderFieldCommons(s, &root, false) + require.Contains(t, got, 100) + assert.Equal(t, schema.TimeUnitMillis, got[100].Logical.Timestamp.Unit) + }) + + t.Run("map of string to timestamp registers value metadata", func(t *testing.T) { + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "by_event", Type: &iceberg.MapType{ + KeyID: 100, KeyType: iceberg.PrimitiveTypes.String, + ValueID: 101, ValueType: iceberg.TimestampTzType{}, ValueRequired: false, + }}, + ) + root := schema.Common{ + Type: schema.Object, + Children: []schema.Common{ + {Name: "by_event", Type: schema.Map, Children: []schema.Common{tsCommon}}, + }, + } + got := buildShredderFieldCommons(s, &root, false) + require.Contains(t, got, 101, "map value should be registered by ValueID") + assert.NotContains(t, got, 100, "map key should not be registered") + assert.Equal(t, schema.TimeUnitMillis, got[101].Logical.Timestamp.Unit) + }) + + t.Run("shape mismatch skips silently", func(t *testing.T) { + // iceberg has a list, but metadata has scalar Int64 — skip the field. + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "events", Type: &iceberg.ListType{ + ElementID: 100, Element: iceberg.TimestampTzType{}, + }}, + ) + root := schema.Common{ + Type: schema.Object, + Children: []schema.Common{ + {Name: "events", Type: schema.Int64}, + }, + } + got := buildShredderFieldCommons(s, &root, false) + assert.Empty(t, got, "wrong-shape metadata should not register anything") + }) + + t.Run("case-insensitive matching", func(t *testing.T) { + s := iceberg.NewSchema(1, + iceberg.NestedField{ID: 1, Name: "ts", Type: iceberg.TimestampTzType{}}, + ) + root := schema.Common{ + Type: schema.Object, + Children: []schema.Common{ + { + Name: "TS", Type: schema.Timestamp, + Logical: &schema.LogicalParams{Timestamp: &schema.TimestampParams{Unit: schema.TimeUnitMicros, AdjustToUTC: true}}, + }, + }, + } + got := buildShredderFieldCommons(s, &root, false) + require.Contains(t, got, 1, "case-insensitive match should still register the metadata") + }) +} diff --git a/internal/impl/parquet/processor_encode.go b/internal/impl/parquet/processor_encode.go index 30b39a795c..81ad0dc616 100644 --- a/internal/impl/parquet/processor_encode.go +++ b/internal/impl/parquet/processor_encode.go @@ -414,6 +414,55 @@ func parquetNodeFromCommonField(field schema.Common, tsUnit parquet.TimeUnit) (p return nil, fmt.Errorf("field %q is BigDecimal which has no fixed precision/scale; cast or coerce upstream before parquet_encode", field.Name) case schema.ByteArray: n = parquet.Leaf(parquet.ByteArrayType) + case schema.Date: + n = parquet.Date() + case schema.TimeOfDay: + // Iceberg, Avro, and the Common schema model all treat + // time-of-day as a unit-tagged integer. parquet has a parallel + // representation via parquet.Time(unit). When the Common schema + // omits the Logical params we default to millis to match the + // Avro spec's preferred wire form for time-of-day. + unit := parquet.Millisecond + if field.Logical != nil && field.Logical.TimeOfDay != nil { + switch field.Logical.TimeOfDay.Unit { + case schema.TimeUnitMicros: + unit = parquet.Microsecond + case schema.TimeUnitNanos: + unit = parquet.Nanosecond + case schema.TimeUnitSeconds: + return nil, fmt.Errorf("time-of-day field %q has unit SECONDS which parquet cannot express; coerce upstream to MILLIS, MICROS, or NANOS", field.Name) + } + } + n = parquet.Time(unit) + case schema.UUID: + // parquet.UUID() is a 16-byte fixed-length byte array with the + // UUID logical-type annotation. The encoding coercion visitor + // handles string→[]byte conversion at encode time. + n = parquet.UUID() + case schema.Map: + // Common-schema maps key on string by convention; the single + // child describes the value type. + if len(field.Children) != 1 { + return nil, fmt.Errorf("source schema contains map '%v' that does not define a single value type child", field.Name) + } + valueNode, err := parquetNodeFromCommonField(field.Children[0], tsUnit) + if err != nil { + return nil, err + } + n = parquet.Map(parquet.String(), valueNode) + case schema.Union: + // Parquet has no native union; we'd need to flatten to a single + // branch and lose the type information for the others. Refuse + // loudly rather than silently picking the wrong branch — the + // operator can collapse the union upstream (typically by + // projecting to the single non-null variant) before encoding. + return nil, fmt.Errorf("source schema contains union '%v' that parquet cannot express; flatten to a single branch upstream before parquet_encode", field.Name) + case schema.Null: + // A schema that declares a column as always-null is degenerate + // for parquet, which requires every column to have a physical + // type. Match the Avro spec convention that null can be the + // type of a union branch but not the type of a record field. + return nil, fmt.Errorf("source schema contains field %q with type NULL; parquet requires a concrete physical type, drop the field or wrap it in a union upstream", field.Name) case schema.Array: if len(field.Children) != 1 { return nil, fmt.Errorf("source schema contains array '%v' that does not define a child type", field.Name) diff --git a/internal/impl/parquet/schema_coercion.go b/internal/impl/parquet/schema_coercion.go index ac03c158ab..0b511088aa 100644 --- a/internal/impl/parquet/schema_coercion.go +++ b/internal/impl/parquet/schema_coercion.go @@ -1,4 +1,4 @@ -// Copyright 2024 Redpanda Data, Inc. +// Copyright 2026 Redpanda Data, Inc. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. @@ -18,6 +18,7 @@ import ( "encoding/json" "errors" "fmt" + "math" "time" "github.com/gofrs/uuid/v5" @@ -81,26 +82,11 @@ func (encodingCoercionVisitor) visitLeaf(value any, schemaNode parquet.Node) (an return value, nil } if logicalType.Timestamp != nil { - switch v := value.(type) { - case string: - ts, err := time.Parse(time.RFC3339, v) - if err != nil { - return nil, fmt.Errorf("parsing string RFC3339 timestamp: %w", err) - } - unit := logicalType.Timestamp.Unit - switch { - case unit.Millis != nil: - return ts.UnixMilli(), nil - case unit.Micros != nil: - return ts.UnixMicro(), nil - case unit.Nanos != nil: - return ts.UnixNano(), nil - default: - return nil, errors.New("unreachable branch while processing parquet timestamp") - } - default: - return nil, errors.New("TIMESTAMP values must be RFC3339-formatted strings") - } + return coerceTimestampForEncode(value, logicalType.Timestamp) + } else if logicalType.Date != nil { + return coerceDateForEncode(value) + } else if logicalType.Time != nil { + return coerceTimeForEncode(value, logicalType.Time) } else if logicalType.Json != nil { jsonBytes, err := json.Marshal(value) if err != nil { @@ -125,6 +111,166 @@ func (encodingCoercionVisitor) visitLeaf(value any, schemaNode parquet.Node) (an return value, nil } +// coerceTimestampForEncode converts a value into the parquet physical +// representation for a TIMESTAMP-typed column (int64 in the column's +// declared unit). Accepts: +// +// - time.Time — produced by the value-side decoder when +// preserve_logical_types is enabled. Scaled to the column's unit +// via UnixMilli/UnixMicro/UnixNano. +// - numeric (int64 / int32 / int / float64) — assumed to be already +// in the column's declared unit. This matches the column-honouring +// path the iceberg shredder uses for numeric inputs into time-typed +// columns; the schema's Unit is the authoritative source of truth. +// - RFC3339 string — historical path for callers that pre-format +// timestamps as ISO 8601 strings. Parsed via time.Parse then scaled. +func coerceTimestampForEncode(value any, ts *format.TimestampType) (any, error) { + scale := func(t time.Time) (int64, error) { + switch unit := ts.Unit; { + case unit.Millis != nil: + return t.UnixMilli(), nil + case unit.Micros != nil: + return t.UnixMicro(), nil + case unit.Nanos != nil: + return t.UnixNano(), nil + default: + return 0, errors.New("unreachable branch while processing parquet timestamp") + } + } + switch v := value.(type) { + case time.Time: + return scale(v) + case string: + t, err := time.Parse(time.RFC3339, v) + if err != nil { + return nil, fmt.Errorf("parsing string RFC3339 timestamp: %w", err) + } + return scale(t) + case int64: + return v, nil + case int32: + return int64(v), nil + case int: + return int64(v), nil + case float64: + // int64(NaN) / int64(±Inf) is implementation-defined garbage — + // reject explicitly so silent corruption (year 1970 or worse) + // cannot reach the column. Mirrors the iceberg shredder's guard. + if math.IsNaN(v) || math.IsInf(v, 0) { + return nil, fmt.Errorf("cannot convert %v to TIMESTAMP", v) + } + return int64(v), nil + default: + return nil, fmt.Errorf("TIMESTAMP values must be time.Time, RFC3339 string, or numeric; got %T", value) + } +} + +// coerceDateForEncode converts a value into the parquet physical +// representation for a DATE column (int32 days-since-epoch). Accepts +// time.Time (UTC-floored to midnight), numeric (assumed already +// days-since-epoch), and ISO 8601 date strings. +func coerceDateForEncode(value any) (any, error) { + switch v := value.(type) { + case time.Time: + return int32(unixDaysFloor(v)), nil + case string: + t, errRFC := time.Parse(time.RFC3339, v) + if errRFC != nil { + t2, errDate := time.Parse("2006-01-02", v) + if errDate != nil { + // Surface both attempts — a malformed bare date like + // "2024-13-99" would otherwise yield only the RFC3339 + // error, which misleadingly suggests the user must add + // a time component. + return nil, fmt.Errorf("parsing DATE string %q: tried RFC3339 (%v) and YYYY-MM-DD (%v)", v, errRFC, errDate) + } + t = t2 + } + return int32(unixDaysFloor(t)), nil + case int32: + return v, nil + case int64: + return int32(v), nil + case int: + return int32(v), nil + case float64: + // int32(NaN) / int32(±Inf) is implementation-defined garbage — + // reject explicitly so silent corruption cannot reach the column. + if math.IsNaN(v) || math.IsInf(v, 0) { + return nil, fmt.Errorf("cannot convert %v to DATE", v) + } + return int32(v), nil + default: + return nil, fmt.Errorf("DATE values must be time.Time, date string, or numeric days-since-epoch; got %T", value) + } +} + +// unixDaysFloor returns days-since-epoch for t with floor-toward-negative- +// infinity rounding. Go's integer division truncates toward zero, which +// would map a pre-epoch wall-clock like 1969-12-31T23:59:59Z (Unix = -1) to +// day 0 (1970-01-01) instead of the correct day -1 (1969-12-31). Times at +// or after the epoch are unaffected. +func unixDaysFloor(t time.Time) int64 { + secs := t.UTC().Unix() + days := secs / 86400 + if secs < 0 && secs%86400 != 0 { + days-- + } + return days +} + +// coerceTimeForEncode converts a value into the parquet physical +// representation for a TIME column (int32 millis for millis unit, int64 +// otherwise). Accepts time.Duration, time.Time (wall-clock portion), +// and numeric inputs assumed to already be in the column's declared +// unit. +func coerceTimeForEncode(value any, tt *format.TimeType) (any, error) { + isMillis := tt.Unit.Millis != nil + durationToUnit := func(d time.Duration) int64 { + switch unit := tt.Unit; { + case unit.Millis != nil: + return d.Milliseconds() + case unit.Micros != nil: + return d.Microseconds() + case unit.Nanos != nil: + return d.Nanoseconds() + default: + return int64(d) + } + } + wrap := func(n int64) any { + if isMillis { + return int32(n) + } + return n + } + switch v := value.(type) { + case time.Duration: + return wrap(durationToUnit(v)), nil + case time.Time: + d := time.Duration(v.Hour())*time.Hour + + time.Duration(v.Minute())*time.Minute + + time.Duration(v.Second())*time.Second + + time.Duration(v.Nanosecond())*time.Nanosecond + return wrap(durationToUnit(d)), nil + case int64: + return wrap(v), nil + case int32: + return wrap(int64(v)), nil + case int: + return wrap(int64(v)), nil + case float64: + // int64(NaN) / int64(±Inf) is implementation-defined garbage — + // reject explicitly so silent corruption cannot reach the column. + if math.IsNaN(v) || math.IsInf(v, 0) { + return nil, fmt.Errorf("cannot convert %v to TIME", v) + } + return wrap(int64(v)), nil + default: + return nil, fmt.Errorf("TIME values must be time.Duration, time.Time, or numeric; got %T", value) + } +} + // coerceDecimalForEncode converts a canonical decimal string (the value // contract for schema.Decimal fields) into the parquet physical form // matching the column's declared precision: int32 for p<=9, int64 for p<=18, diff --git a/internal/impl/parquet/schema_coercion_coverage_test.go b/internal/impl/parquet/schema_coercion_coverage_test.go new file mode 100644 index 0000000000..de2b8f836b --- /dev/null +++ b/internal/impl/parquet/schema_coercion_coverage_test.go @@ -0,0 +1,239 @@ +// Copyright 2026 Redpanda Data, Inc. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package parquet + +import ( + "math" + "strings" + "testing" + "time" + + "github.com/parquet-go/parquet-go" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + + "github.com/redpanda-data/benthos/v4/public/schema" +) + +// TestParquetNodeFromCommonField_NewTypes pins coverage for the +// schema.CommonType variants the encoder previously rejected with +// "unsupported by this processor". These are the analogues of the +// type-coverage holes the iceberg sink closed earlier in this branch; +// pipelines flowing CDC schemas → parquet need them just as much. +func TestParquetNodeFromCommonField_NewTypes(t *testing.T) { + t.Run("Date", func(t *testing.T) { + n, err := parquetNodeFromCommonField(schema.Common{Name: "d", Type: schema.Date}, parquet.Millisecond) + require.NoError(t, err) + require.NotNil(t, n.Type().LogicalType()) + assert.NotNil(t, n.Type().LogicalType().Date, "DATE column must carry the Date logical-type annotation") + }) + + t.Run("TimeOfDay millis", func(t *testing.T) { + n, err := parquetNodeFromCommonField(schema.Common{ + Name: "t", Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMillis}}, + }, parquet.Millisecond) + require.NoError(t, err) + require.NotNil(t, n.Type().LogicalType()) + require.NotNil(t, n.Type().LogicalType().Time) + assert.NotNil(t, n.Type().LogicalType().Time.Unit.Millis) + }) + + t.Run("TimeOfDay micros", func(t *testing.T) { + n, err := parquetNodeFromCommonField(schema.Common{ + Name: "t", Type: schema.TimeOfDay, + Logical: &schema.LogicalParams{TimeOfDay: &schema.TimeOfDayParams{Unit: schema.TimeUnitMicros}}, + }, parquet.Millisecond) + require.NoError(t, err) + require.NotNil(t, n.Type().LogicalType().Time) + assert.NotNil(t, n.Type().LogicalType().Time.Unit.Micros) + }) + + t.Run("UUID", func(t *testing.T) { + n, err := parquetNodeFromCommonField(schema.Common{Name: "u", Type: schema.UUID}, parquet.Millisecond) + require.NoError(t, err) + require.NotNil(t, n.Type().LogicalType()) + assert.NotNil(t, n.Type().LogicalType().UUID, "UUID column must carry the UUID logical-type annotation") + }) + + t.Run("Map of String->Int64", func(t *testing.T) { + n, err := parquetNodeFromCommonField(schema.Common{ + Name: "m", Type: schema.Map, + Children: []schema.Common{{Type: schema.Int64}}, + }, parquet.Millisecond) + require.NoError(t, err) + require.NotNil(t, n.Type().LogicalType()) + assert.NotNil(t, n.Type().LogicalType().Map, "Map column must carry the Map logical-type annotation") + }) + + t.Run("Union refused loudly", func(t *testing.T) { + _, err := parquetNodeFromCommonField(schema.Common{Name: "u", Type: schema.Union}, parquet.Millisecond) + require.Error(t, err) + assert.Contains(t, err.Error(), "union", "error message should explain the constraint") + }) + + t.Run("Null refused loudly", func(t *testing.T) { + _, err := parquetNodeFromCommonField(schema.Common{Name: "n", Type: schema.Null}, parquet.Millisecond) + require.Error(t, err) + assert.Contains(t, err.Error(), "NULL") + }) +} + +// TestEncodingCoercionVisitor_Temporal pins the cross-type bridges the +// encoder now applies. The bug class is the same one the iceberg +// shredder closes via coerceTemporalToNumeric: a pipeline using +// preserve_logical_types: true emits time.Time / time.Duration values, +// but the parquet encoder was previously rejecting anything but RFC3339 +// strings for TIMESTAMP columns. +func TestEncodingCoercionVisitor_Temporal(t *testing.T) { + const tsMillis = int64(1_700_000_000_000) + visitor := encodingCoercionVisitor{} + + t.Run("time.Time into TIMESTAMP(millis)", func(t *testing.T) { + node := parquet.Timestamp(parquet.Millisecond) + out, err := visitor.visitLeaf(time.UnixMilli(tsMillis).UTC(), node) + require.NoError(t, err) + assert.Equal(t, tsMillis, out) + }) + + t.Run("time.Time into TIMESTAMP(micros)", func(t *testing.T) { + node := parquet.Timestamp(parquet.Microsecond) + out, err := visitor.visitLeaf(time.UnixMilli(tsMillis).UTC(), node) + require.NoError(t, err) + assert.Equal(t, tsMillis*1000, out) + }) + + t.Run("int64 millis into TIMESTAMP(millis) — pre-scaled passthrough", func(t *testing.T) { + node := parquet.Timestamp(parquet.Millisecond) + out, err := visitor.visitLeaf(tsMillis, node) + require.NoError(t, err) + assert.Equal(t, tsMillis, out) + }) + + t.Run("RFC3339 string into TIMESTAMP(millis) — legacy path", func(t *testing.T) { + node := parquet.Timestamp(parquet.Millisecond) + out, err := visitor.visitLeaf("2023-11-14T22:13:20Z", node) + require.NoError(t, err) + assert.Equal(t, tsMillis, out) + }) + + t.Run("time.Time into DATE", func(t *testing.T) { + out, err := visitor.visitLeaf(time.Date(2024, 1, 15, 0, 0, 0, 0, time.UTC), parquet.Date()) + require.NoError(t, err) + assert.Equal(t, int32(19737), out) + }) + + t.Run("date string into DATE", func(t *testing.T) { + out, err := visitor.visitLeaf("2024-01-15", parquet.Date()) + require.NoError(t, err) + assert.Equal(t, int32(19737), out) + }) + + t.Run("time.Duration into TIME(millis)", func(t *testing.T) { + d := 8*time.Hour + 30*time.Minute + out, err := visitor.visitLeaf(d, parquet.Time(parquet.Millisecond)) + require.NoError(t, err) + assert.Equal(t, int32(8*3600+30*60)*1000, out) + }) + + t.Run("time.Duration into TIME(micros)", func(t *testing.T) { + d := 8*time.Hour + 30*time.Minute + out, err := visitor.visitLeaf(d, parquet.Time(parquet.Microsecond)) + require.NoError(t, err) + assert.Equal(t, int64(8*3600+30*60)*1_000_000, out) + }) + + t.Run("unsupported type into TIMESTAMP errors clearly", func(t *testing.T) { + _, err := visitor.visitLeaf(struct{}{}, parquet.Timestamp(parquet.Millisecond)) + require.Error(t, err) + assert.True(t, + strings.Contains(err.Error(), "time.Time") || strings.Contains(err.Error(), "RFC3339") || strings.Contains(err.Error(), "numeric"), + "error message should name an accepted type, got %q", err.Error()) + }) +} + +// TestEncodingCoercionVisitor_RejectsNaNInf locks in defense-in-depth +// against silent corruption: a NaN or ±Inf float reaching a time-typed +// column must error, not int64-cast to implementation-defined garbage. +// Mirrors the iceberg shredder's TestTemporalRejectsNaNInf so the guard +// is symmetric across the two sinks. +func TestEncodingCoercionVisitor_RejectsNaNInf(t *testing.T) { + visitor := encodingCoercionVisitor{} + for _, tc := range []struct { + name string + node parquet.Node + }{ + {"TIMESTAMP(millis)", parquet.Timestamp(parquet.Millisecond)}, + {"TIMESTAMP(micros)", parquet.Timestamp(parquet.Microsecond)}, + {"TIMESTAMP(nanos)", parquet.Timestamp(parquet.Nanosecond)}, + {"DATE", parquet.Date()}, + {"TIME(millis)", parquet.Time(parquet.Millisecond)}, + {"TIME(micros)", parquet.Time(parquet.Microsecond)}, + } { + for _, v := range []float64{math.NaN(), math.Inf(1), math.Inf(-1)} { + t.Run(tc.name, func(t *testing.T) { + _, err := visitor.visitLeaf(v, tc.node) + require.Errorf(t, err, "must reject %v into %s, not silently cast", v, tc.name) + }) + } + } +} + +// TestCoerceDateForEncode_StringErrorSurfacesBothAttempts verifies that +// when a string fails both the RFC3339 and YYYY-MM-DD parses, the error +// surfaces both attempts rather than only the RFC3339 one — which would +// mislead a user who passed a clearly date-shaped (but invalid) string. +func TestCoerceDateForEncode_StringErrorSurfacesBothAttempts(t *testing.T) { + _, err := coerceDateForEncode("2024-13-99") + require.Error(t, err) + msg := err.Error() + assert.Contains(t, msg, "RFC3339", "error should mention the RFC3339 attempt") + assert.Contains(t, msg, "YYYY-MM-DD", "error should mention the bare-date attempt") +} + +// TestCoerceDateForEncode_FloorsTowardNegativeInfinity pins down the +// pre-epoch rounding contract: a time.Time and an RFC3339 string at the +// same instant must produce the same days-since-epoch, and a pre-epoch +// wall clock with a non-midnight component must round down (1969-12-31) +// rather than truncate toward zero (1970-01-01). +// +// Go's integer division truncates, which silently mapped pre-epoch +// RFC3339 strings to the wrong day before the floor adjustment landed. +// The bare-date YYYY-MM-DD form happens to land on midnight UTC, so its +// Unix() is a multiple of 86400 and truncate/floor agree — but it is +// included here to confirm the shared helper hasn't perturbed it. +func TestCoerceDateForEncode_FloorsTowardNegativeInfinity(t *testing.T) { + tests := []struct { + name string + value any + wantDays int32 + }{ + {"time.Time pre-epoch non-midnight", time.Date(1969, 12, 31, 23, 59, 59, 0, time.UTC), -1}, + {"RFC3339 pre-epoch non-midnight", "1969-12-31T23:59:59Z", -1}, + {"RFC3339 pre-epoch midnight", "1969-12-31T00:00:00Z", -1}, + {"bare date pre-epoch", "1969-12-31", -1}, + {"time.Time epoch", time.Unix(0, 0).UTC(), 0}, + {"RFC3339 epoch", "1970-01-01T00:00:00Z", 0}, + {"RFC3339 post-epoch non-midnight", "1970-01-01T12:34:56Z", 0}, + {"bare date post-epoch", "2024-03-14", 19796}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := coerceDateForEncode(tt.value) + require.NoError(t, err) + assert.Equal(t, tt.wantDays, got, "days-since-epoch must be floor-rounded, not truncated") + }) + } +}