@@ -120,21 +120,21 @@ Before this serde, any `ScalaUDF` in a plan forced Comet to fall back to Spark i
120120
121121### What's covered
122122
123- | What users write | Spark expression class | Route through codegen |
124- | ---| ---| ---|
125- | ` udf((x: T) => ...) ` or ` spark.udf.register ` (Scala) | ` ScalaUDF ` | yes |
126- | ` spark.udf.register("f", new UDF1[...]{...}) ` (Java) | ` ScalaUDF ` (Spark wraps the Java functional interface) | yes, transparently |
127- | ` CREATE FUNCTION foo AS 'com.example.MyUDF' ` (SQL registration) | ` ScalaUDF ` | yes, if the user class is reachable on the executor classpath |
123+ | What users write | Spark expression class | Route through codegen |
124+ | --------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------- |
125+ | ` udf((x: T) => ...) ` or ` spark.udf.register ` (Scala) | ` ScalaUDF ` | yes |
126+ | ` spark.udf.register("f", new UDF1[...]{...}) ` (Java) | ` ScalaUDF ` (Spark wraps the Java functional interface) | yes, transparently |
127+ | ` CREATE FUNCTION foo AS 'com.example.MyUDF' ` (SQL registration) | ` ScalaUDF ` | yes, if the user class is reachable on the executor classpath |
128128
129129### What's not covered
130130
131- | What users write | Spark expression class | Why not |
132- | ---| ---| ---|
133- | Aggregate UDF | ` ScalaAggregator ` , ` TypedImperativeAggregate ` , old ` UserDefinedAggregateFunction ` | accumulator-based; needs a different bridge contract (accumulate + merge + finalize) |
134- | Table UDF / generator | ` UserDefinedTableFunction ` | 1 row → N rows; ` canHandle ` rejects ` Generator ` |
135- | Python ` @udf ` | ` PythonUDF ` | subprocess runtime, not JVM |
136- | Pandas ` @pandas_udf ` | ` PandasUDF ` | Arrow-via-subprocess runtime |
137- | Hive ` GenericUDF ` / ` SimpleUDF ` | ` HiveGenericUDF ` / ` HiveSimpleUDF ` | separate expression classes; would need their own serde |
131+ | What users write | Spark expression class | Why not |
132+ | ------------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
133+ | Aggregate UDF | ` ScalaAggregator ` , ` TypedImperativeAggregate ` , old ` UserDefinedAggregateFunction ` | accumulator-based; needs a different bridge contract (accumulate + merge + finalize) |
134+ | Table UDF / generator | ` UserDefinedTableFunction ` | 1 row → N rows; ` canHandle ` rejects ` Generator ` |
135+ | Python ` @udf ` | ` PythonUDF ` | subprocess runtime, not JVM |
136+ | Pandas ` @pandas_udf ` | ` PandasUDF ` | Arrow-via-subprocess runtime |
137+ | Hive ` GenericUDF ` / ` SimpleUDF ` | ` HiveGenericUDF ` / ` HiveSimpleUDF ` | separate expression classes; would need their own serde |
138138
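The two coverage tables above can be read as a lookup keyed by Spark expression class. The sketch below is illustrative only: `UdfCoverageSketch`, `routesThroughCodegen`, and `whyNot` are hypothetical names, not Comet APIs, and the class names are plain strings copied from the tables rather than real Spark types.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Comet: encodes the coverage tables as a lookup.
final class UdfCoverageSketch {
    // Expression classes listed under "What's not covered", with the stated reason.
    private static final Map<String, String> NOT_COVERED = Map.of(
        "ScalaAggregator", "accumulator-based; needs accumulate + merge + finalize",
        "TypedImperativeAggregate", "accumulator-based; needs accumulate + merge + finalize",
        "UserDefinedAggregateFunction", "accumulator-based; needs accumulate + merge + finalize",
        "UserDefinedTableFunction", "1 row -> N rows; canHandle rejects Generator",
        "PythonUDF", "subprocess runtime, not JVM",
        "PandasUDF", "Arrow-via-subprocess runtime",
        "HiveGenericUDF", "separate expression class; would need its own serde",
        "HiveSimpleUDF", "separate expression class; would need its own serde");

    // Per the "What's covered" table, every covered registration style arrives as ScalaUDF.
    static boolean routesThroughCodegen(String exprClass) {
        return "ScalaUDF".equals(exprClass);
    }

    // Returns the documented reason when the class is in the "not covered" table.
    static Optional<String> whyNot(String exprClass) {
        return Optional.ofNullable(NOT_COVERED.get(exprClass));
    }
}
```

A string lookup is used here only to keep the sketch self-contained; a real check would match on the expression classes themselves.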
139139### Constraints within the ScalaUDF path
140140
@@ -158,18 +158,18 @@ There is no native or hand-coded fallback for arbitrary user functions; codegen
158158
159159All scalar Spark types that map to a single Arrow vector:
160160
161- | Spark type | Arrow vector class | ` InternalRow ` getter |
162- | ---| ---| ---|
163- | BooleanType | BitVector | ` getBoolean ` |
164- | ByteType | TinyIntVector | ` getByte ` |
165- | ShortType | SmallIntVector | ` getShort ` |
166- | IntegerType, DateType | IntVector, DateDayVector | ` getInt ` |
167- | LongType, TimestampType, TimestampNTZType | BigIntVector, TimeStampMicroVector, TimeStampMicroTZVector | ` getLong ` |
168- | FloatType | Float4Vector | ` getFloat ` |
169- | DoubleType | Float8Vector | ` getDouble ` |
170- | DecimalType | DecimalVector | ` getDecimal(ord, precision, scale) ` |
171- | StringType | VarCharVector, ViewVarCharVector | ` getUTF8String ` (zero-copy via ` UTF8String.fromAddress ` ) |
172- | BinaryType | VarBinaryVector, ViewVarBinaryVector | ` getBinary ` (allocates ` byte[] ` ) |
161+ | Spark type | Arrow vector class | ` InternalRow ` getter |
162+ | ----------------------------------------- | ---------------------------------------------------------- | -------------------------------------------------------- |
163+ | BooleanType | BitVector | ` getBoolean ` |
164+ | ByteType | TinyIntVector | ` getByte ` |
165+ | ShortType | SmallIntVector | ` getShort ` |
166+ | IntegerType, DateType | IntVector, DateDayVector | ` getInt ` |
167+ | LongType, TimestampType, TimestampNTZType | BigIntVector, TimeStampMicroVector, TimeStampMicroTZVector | ` getLong ` |
168+ | FloatType | Float4Vector | ` getFloat ` |
169+ | DoubleType | Float8Vector | ` getDouble ` |
170+ | DecimalType | DecimalVector | ` getDecimal(ord, precision, scale) ` |
171+ | StringType | VarCharVector, ViewVarCharVector | ` getUTF8String ` (zero-copy via ` UTF8String.fromAddress ` ) |
172+ | BinaryType | VarBinaryVector, ViewVarBinaryVector | ` getBinary ` (allocates ` byte[] ` ) |
173173
174174Widening: add cases to ` CometBatchKernelCodegen.typedInputAccessors ` and accept the new vector classes in ` CometCodegenDispatchUDF.evaluate ` 's input pattern match.
175175
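As a rough illustration of the type table above, the mapping from Spark type to `InternalRow` getter can be written as a plain lookup. This is a sketch under stated assumptions: `GetterTableSketch` and `getterFor` are hypothetical names, types are represented as strings, and it does not reproduce how `CometBatchKernelCodegen.typedInputAccessors` is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup mirroring the "Spark type -> InternalRow getter" table.
final class GetterTableSketch {
    private static final Map<String, String> GETTER = new LinkedHashMap<>();

    static {
        GETTER.put("BooleanType", "getBoolean");
        GETTER.put("ByteType", "getByte");
        GETTER.put("ShortType", "getShort");
        GETTER.put("IntegerType", "getInt");
        GETTER.put("DateType", "getInt");           // shares the int getter
        GETTER.put("LongType", "getLong");
        GETTER.put("TimestampType", "getLong");     // shares the long getter
        GETTER.put("TimestampNTZType", "getLong");
        GETTER.put("FloatType", "getFloat");
        GETTER.put("DoubleType", "getDouble");
        GETTER.put("DecimalType", "getDecimal");    // called with (ord, precision, scale)
        GETTER.put("StringType", "getUTF8String");
        GETTER.put("BinaryType", "getBinary");
    }

    // Empty result means the type is not in the table; widening would add an entry.
    static Optional<String> getterFor(String sparkType) {
        return Optional.ofNullable(GETTER.get(sparkType));
    }
}
```

In this sketch, widening the supported surface corresponds to adding one more entry; the real change also has to accept the new vector classes on the input side, as the note above says.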
@@ -185,14 +185,14 @@ All scalar Spark types that map to a single Arrow vector: `Boolean`, `Byte`, `Sh
185185
186186## Choosing between approaches
187187
188- | Criterion | Hand-coded | Codegen dispatch |
189- | ---| ---| ---|
190- | Classes per expression | one | zero |
191- | Per-row loop | hand-written Scala | compiled Java |
192- | Arrow read / write | hand-written | compiled Java |
193- | Expression evaluation | hand-written | compiled via Spark ` doGenCode ` , inlined into the fused loop |
194- | Composed expression trees | no (without native support for children) | yes |
195- | Adding a new expression | new UDF class + serde branch | free within the supported type surface |
188+ | Criterion | Hand-coded | Codegen dispatch |
189+ | ------------------------- | ---------------------------------------- | ----------------------------------------------------------- |
190+ | Classes per expression | one | zero |
191+ | Per-row loop | hand-written Scala | compiled Java |
192+ | Arrow read / write | hand-written | compiled Java |
193+ | Expression evaluation | hand-written | compiled via Spark ` doGenCode ` , inlined into the fused loop |
194+ | Composed expression trees | no (without native support for children) | yes |
195+ | Adding a new expression | new UDF class + serde branch | free within the supported type surface |
196196
197197Rule of thumb: pick hand-coded when the expression is hot enough to justify per-expression maintenance or has specialization the generic path cannot match; pick codegen dispatch when you would otherwise fall back to Spark, or when the expression composes naturally with others and you want the free composition.
198198