[SPARK-57570][SQL] Support TimeType in vectorized-reader column population

MaxGekk · MaxGekk · commit d635a1f07fbf · 2026-06-30T08:47:14.000+02:00
### What changes were proposed in this pull request? Make the vectorized reader handle `TimeType` when it populates partition, missing, and constant columns. The only production change is one branch in `ColumnVectorUtils.appendValue` (the `toBatch` path): a TIME value is now appended as its nanos-of-day `long` via `DateTimeUtils.localTimeToNanos`. Before, there was no `TimeType` case and it failed with "Datatype not supported". The `populate` path already worked, since `TimeType` is physically a `long`. This builds on SPARK-54203, which added the underlying `RowToColumnConverter` and column-vector allocation support. The unsupported-type branch still throws `_LEGACY_ERROR_TEMP_3192` for other types; renaming that legacy error is tracked separately in SPARK-57745. ### Why are the changes needed? This makes `TimeType` a first-class citizen in the vectorized (columnar) read path instead of a type that fails depending on where a column comes from. Concretely, it enables: - **TIME partition columns in vectorized reads.** A table partitioned by a TIME column previously errored when the partition value was materialized into the columnar batch. With this change such reads succeed with the vectorized reader engaged, so partition pruning and the columnar scan both work. - **Schema-evolution / "missing" TIME columns.** When a Parquet/ORC file predates a TIME column added to the table schema, the reader fills that column via the same population path; those reads now succeed instead of failing. - **Constant-folded TIME columns** injected into a scan populate correctly. - **`toBatch` round-trips with TIME**, e.g. row-to-columnar conversions that carry `java.time.LocalTime` values. Without this, queries touching TIME columns in these scenarios either fail with an unsupported-datatype error or fall back to the slower row-based reader. After the change, TIME behaves consistently with `DATE`, `TIMESTAMP`, and interval types in this layer, and downstream code built on `ColumnarBatch` can carry TIME columns through the population path without special-casing. It also clears a blocker in the SPARK-54203 umbrella and keeps the vectorized layer's hand-maintained type dispatch in step with the other datetime types. Note: physically reading TIME data columns stored inside Parquet/ORC files (as opposed to populated partition / missing / constant columns) is a separate concern and is out of scope here. ### Does this PR introduce _any_ user-facing change? No. TIME is an in-progress, not-yet-released data type; this only widens internal vectorized support so that previously-failing TIME column population now succeeds. ### How was this patch tested? New unit tests: - `ColumnVectorUtilsSuite` (the `populate` / constant-column path): TIME across precisions 0/6/7/9, boundary values (`00:00:00`, `23:59:59.999999999`), null (missing column), and TIME nested in struct / array / map. - `ColumnarBatchSuite` (the `toBatch` path): `TimeType` added to the random-schema test and `compareStruct` (top-level and array element), a per-precision `testVector` (0/6/7/9 + boundaries), a nested struct/array `toBatch` test, and a negative unsupported-type case. Ran `build/sbt 'sql/testOnly *ColumnVectorUtilsSuite *ColumnarBatchSuite'` (93 tests pass). Scalastyle and Java checkstyle pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor Closes #56858 from MaxGekk/time-vec-column-pop. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java
@@ -22,6 +22,7 @@
 import java.sql.Date;
 import java.sql.Timestamp;
 import java.time.LocalDateTime;
+import java.time.LocalTime;
 import java.util.HashMap;
 import java.util.Iterator;
 import java.util.List;
@@ -236,6 +237,8 @@ private static void appendValue(WritableColumnVector dst, DataType t, Object o)
         dst.appendLong(DateTimeUtils.fromJavaTimestamp((Timestamp) o));
       } else if (t instanceof TimestampNTZType) {
         dst.appendLong(DateTimeUtils.localDateTimeToMicros((LocalDateTime) o));
+      } else if (t instanceof TimeType) {
+        dst.appendLong(DateTimeUtils.localTimeToNanos((LocalTime) o));
       } else {
         throw new SparkUnsupportedOperationException(
           "UNSUPPORTED_DATATYPE", Map.of("typeName", QueryExecutionErrors.toSQLType(t)));
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorUtilsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorUtilsSuite.scala
@@ -17,9 +17,11 @@
 
 package org.apache.spark.sql.execution.vectorized
 
+import java.time.LocalTime
+
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData}
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.{CalendarInterval, TimestampNanosVal}
 import org.apache.spark.unsafe.types.UTF8String
@@ -257,4 +259,98 @@ class ColumnVectorUtilsSuite extends SparkFunSuite {
     ColumnVectorUtils.populate(vector, InternalRow(null), 0)
     assert(vector.hasNull)
   }
+
+  private def timeNanos(s: String): Long = DateTimeUtils.localTimeToNanos(LocalTime.parse(s))
+
+  // TimeType is physically a long (nanoseconds since midnight). Precision affects display only,
+  // not storage, so every TimeType(p) is filled through the same PhysicalLongType code path.
+  Seq(
+    0 -> "12:30:45",
+    6 -> "12:30:45.123456",
+    7 -> "12:30:45.1234567",
+    9 -> "12:30:45.123456789").foreach { case (p, s) =>
+    testConstantColumnVector(s"fill time p=$p", 10, TimeType(p)) { vector =>
+      val nanos = timeNanos(s)
+      ColumnVectorUtils.populate(vector, InternalRow(nanos), 0)
+      (0 until 10).foreach { i =>
+        assert(vector.getLong(i) == nanos)
+      }
+    }
+  }
+
+  testConstantColumnVector("fill time boundaries", 10, TimeType(9)) { vector =>
+    Seq(0L, 86399999999999L).foreach { nanos =>
+      ColumnVectorUtils.populate(vector, InternalRow(nanos), 0)
+      (0 until 10).foreach { i =>
+        assert(vector.getLong(i) == nanos)
+      }
+    }
+  }
+
+  testConstantColumnVector("fill time null", 10, TimeType(6)) { vector =>
+    ColumnVectorUtils.populate(vector, InternalRow(null), 0)
+    assert(vector.hasNull)
+    assert(vector.numNulls() == 10)
+    (0 until 10).foreach { i =>
+      assert(vector.isNullAt(i))
+    }
+  }
+
+  testConstantColumnVector("fill struct with time field", 10,
+    new StructType().add("t", TimeType(6)).add("flag", BooleanType)) { vector =>
+    val nanos = timeNanos("01:02:03.456789")
+    ColumnVectorUtils.populate(vector, InternalRow(InternalRow(nanos, true)), 0)
+    (0 until 10).foreach { i =>
+      assert(vector.getChild(0).getLong(i) == nanos)
+      assert(vector.getChild(1).getBoolean(i))
+    }
+  }
+
+  testConstantColumnVector("fill struct with null time field", 10,
+    new StructType().add("t", TimeType(6), nullable = true).add("flag", BooleanType)) { vector =>
+    ColumnVectorUtils.populate(vector, InternalRow(InternalRow(null, true)), 0)
+    (0 until 10).foreach { i =>
+      assert(vector.getChild(0).isNullAt(i))
+      assert(vector.getChild(1).getBoolean(i))
+    }
+  }
+
+  testConstantColumnVector("fill array of time", 10, ArrayType(TimeType(9))) { vector =>
+    val n0 = timeNanos("00:00:01")
+    val n1 = timeNanos("12:00:00.123456789")
+    val n2 = 86399999999999L
+    val arr = new GenericArrayData(Array[Any](n0, n1, n2))
+    ColumnVectorUtils.populate(vector, InternalRow(arr), 0)
+    (0 until 10).foreach { i =>
+      val a = vector.getArray(i)
+      assert(a.numElements() == 3)
+      assert(a.getLong(0) == n0)
+      assert(a.getLong(1) == n1)
+      assert(a.getLong(2) == n2)
+    }
+  }
+
+  testConstantColumnVector("fill null array of time", 10, ArrayType(TimeType(6))) { vector =>
+    ColumnVectorUtils.populate(vector, InternalRow(null), 0)
+    assert(vector.hasNull)
+  }
+
+  testConstantColumnVector("fill map of int -> time", 10,
+    MapType(IntegerType, TimeType(6))) { vector =>
+    val keys = new GenericArrayData(Array[Any](1, 2, 3))
+    val v0 = timeNanos("00:00:00")
+    val v1 = timeNanos("06:30:15.123456")
+    val v2 = 86399999999999L
+    val values = new GenericArrayData(Array[Any](v0, v1, v2))
+    val map = new ArrayBasedMapData(keys, values)
+    ColumnVectorUtils.populate(vector, InternalRow(map), 0)
+    (0 until 10).foreach { i =>
+      val m = vector.getMap(i)
+      assert(m.numElements() == 3)
+      assert(m.keyArray().toIntArray === Array(1, 2, 3))
+      assert(m.valueArray().getLong(0) == v0)
+      assert(m.valueArray().getLong(1) == v1)
+      assert(m.valueArray().getLong(2) == v2)
+    }
+  }
 }
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
@@ -21,7 +21,7 @@ import java.nio.ByteBuffer
 import java.nio.ByteOrder
 import java.nio.charset.StandardCharsets
 import java.sql.{Date, Timestamp}
-import java.time.LocalDateTime
+import java.time.{LocalDateTime, LocalTime}
 import java.util
 
 import scala.collection.mutable
@@ -32,7 +32,7 @@ import scala.util.Random
 import org.apache.arrow.vector.IntVector
 import org.apache.parquet.bytes.ByteBufferInputStream
 
-import org.apache.spark.SparkFunSuite
+import org.apache.spark.{SparkFunSuite, SparkUnsupportedOperationException}
 import org.apache.spark.memory.MemoryMode
 import org.apache.spark.sql.{RandomDataGenerator, Row}
 import org.apache.spark.sql.catalyst.InternalRow
@@ -1432,6 +1432,10 @@ class ColumnarBatchSuite extends SparkFunSuite {
             assert(r1.getLong(ordinal) ==
               DateTimeUtils.localDateTimeToMicros(r2.getAs[LocalDateTime](ordinal)),
               "Seed = " + seed)
+          case _: TimeType =>
+            assert(r1.getLong(ordinal) ==
+              DateTimeUtils.localTimeToNanos(r2.getAs[LocalTime](ordinal)),
+              "Seed = " + seed)
           case t: DecimalType =>
             val d1 = r1.getDecimal(ordinal, t.precision, t.scale).toBigDecimal
             val d2 = r2.getDecimal(ordinal)
@@ -1506,6 +1510,17 @@ class ColumnarBatchSuite extends SparkFunSuite {
                   }
                   i += 1
                 }
+              case _: TimeType =>
+                var i = 0
+                while (i < a1.length) {
+                  assert((a1(i) == null) == (a2(i) == null), "Seed = " + seed)
+                  if (a1(i) != null) {
+                    val i1 = a1(i).asInstanceOf[Long]
+                    val i2 = DateTimeUtils.localTimeToNanos(a2(i).asInstanceOf[LocalTime])
+                    assert(i1 === i2, "Seed = " + seed)
+                  }
+                  i += 1
+                }
               case t: DecimalType =>
                 var i = 0
                 while (i < a1.length) {
@@ -1562,7 +1577,8 @@ class ColumnarBatchSuite extends SparkFunSuite {
       DecimalType.ShortDecimal, DecimalType.IntDecimal, DecimalType.ByteDecimal,
       DecimalType.FloatDecimal, DecimalType.LongDecimal, new DecimalType(5, 2),
       new DecimalType(12, 2), new DecimalType(30, 10), CalendarIntervalType,
-      DateType, StringType, BinaryType, TimestampType, TimestampNTZType)
+      DateType, StringType, BinaryType, TimestampType, TimestampNTZType,
+      TimeType(0), TimeType(3), TimeType(), TimeType(TimeType.MAX_PRECISION))
     val seed = System.nanoTime()
     val NUM_ROWS = 200
     val NUM_ITERS = 1000
@@ -2126,6 +2142,71 @@ class ColumnarBatchSuite extends SparkFunSuite {
     }
   }
 
+  // TimeType is physically a long (nanoseconds since midnight); precision affects display only.
+  // The generic `get(int, DataType)` accessor is intentionally not extended for TimeType in this
+  // change (tracked separately), so values are read back via the typed `getLong` accessor.
+  Seq(0, 6, 7, 9).foreach { p =>
+    val dt = TimeType(p)
+    testVector(s"TIME(precision=$p)", 10, dt) {
+      column =>
+        val values = Array(0L, 86399999999999L) ++ (2 until 10).map(_.toLong * 1000000000L)
+        (0 until 10).foreach { i =>
+          column.putLong(i, values(i))
+        }
+        val batchRow = new ColumnarBatchRow(Array(column))
+        (0 until 10).foreach { i =>
+          batchRow.rowId = i
+          assert(batchRow.getLong(0) == values(i))
+          val batchRowCopy = batchRow.copy()
+          assert(batchRowCopy.getLong(0) == values(i))
+        }
+    }
+  }
+
+  test("SPARK-57570: toBatch with TIME nested in struct and array") {
+    val schema = new StructType()
+      .add("s", new StructType().add("t", TimeType(6)).add("flag", BooleanType))
+      .add("a", ArrayType(TimeType(9)))
+    val t1 = LocalTime.parse("01:02:03.123456")
+    val t2 = LocalTime.parse("23:59:59.999999999")
+    val t3 = LocalTime.parse("00:00:00")
+    val n1 = DateTimeUtils.localTimeToNanos(t1)
+    val n2 = DateTimeUtils.localTimeToNanos(t2)
+    val n3 = DateTimeUtils.localTimeToNanos(t3)
+    val rows = Seq(
+      Row(Row(t1, true), Seq(t1, t2)),
+      Row(Row(t3, false), Seq(t3)))
+    Seq(MemoryMode.ON_HEAP, MemoryMode.OFF_HEAP).foreach { memMode =>
+      val batch = ColumnVectorUtils.toBatch(schema, memMode, rows.iterator.asJava)
+      try {
+        assert(batch.numRows() == 2)
+        val structCol = batch.column(0)
+        assert(structCol.getChild(0).getLong(0) == n1)
+        assert(structCol.getChild(1).getBoolean(0))
+        assert(structCol.getChild(0).getLong(1) == n3)
+        assert(!structCol.getChild(1).getBoolean(1))
+        val arrCol = batch.column(1)
+        val a0 = arrCol.getArray(0)
+        assert(a0.numElements() == 2)
+        assert(a0.getLong(0) == n1)
+        assert(a0.getLong(1) == n2)
+        val a1 = arrCol.getArray(1)
+        assert(a1.numElements() == 1)
+        assert(a1.getLong(0) == n3)
+      } finally {
+        batch.close()
+      }
+    }
+  }
+
+  test("SPARK-57570: toBatch throws on unsupported data type") {
+    val schema = new StructType().add("m", MapType(IntegerType, IntegerType))
+    intercept[SparkUnsupportedOperationException] {
+      ColumnVectorUtils.toBatch(
+        schema, MemoryMode.ON_HEAP, Seq(Row(Map(1 -> 2))).iterator.asJava)
+    }
+  }
+
   testVector("[SPARK-55552] Variant", 3, VariantType) {
     column =>
       val valueChild = column.getChild(0)