docs: clean up custom JVM UDF guide for new registration API

andygrove · andygrove · commit 22439e38c0b4 · 2026-05-06T06:45:56.000-06:00
diff --git a/docs/source/user-guide/latest/custom-jvm-udfs.md b/docs/source/user-guide/latest/custom-jvm-udfs.md
@@ -31,10 +31,12 @@ row-at-a-time execution while keeping the implementation in Java/Scala.
 
 The framework consists of:
 
-- **`CometUDF`** — a trait your UDF class must implement, declaring its name, return type, optional input types,
-  and the vectorized `evaluate` method
-- **`CometUdfRegistry`** — a registry that introspects your `CometUDF` class to record metadata for the serde layer
-- **`CometUdfBridge`** — the JNI bridge that native execution uses to invoke your UDF (no user interaction needed)
+- **`CometUDF`**: a trait your UDF class must implement, declaring its name, return type, optional input
+  types, and the vectorized `evaluate` method.
+- **`CometUdfRegistry`**: a registry that introspects your `CometUDF` class to record metadata for the serde
+  layer.
+- **`CometUdfBridge`**: the JNI bridge that native execution uses to invoke your UDF (no user interaction
+  needed).
 
 ## Writing a CometUDF
 
@@ -43,15 +45,15 @@ Implement the `org.apache.comet.udf.CometUDF` trait. Comet relocates Apache Arro
 so your implementation must import Arrow types from the shaded package. This is the
 same package that the published `comet-spark` JAR exposes on your classpath.
 
+### Java
+
 ```java
 import org.apache.comet.shaded.arrow.vector.IntVector;
 import org.apache.comet.shaded.arrow.vector.BitVector;
 import org.apache.comet.shaded.arrow.vector.ValueVector;
 import org.apache.comet.udf.CometUDF;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.DataTypes;
-import scala.collection.JavaConverters;
-import java.util.Arrays;
 
 public class IsPositiveUdf implements CometUDF {
 
@@ -64,12 +66,6 @@ public class IsPositiveUdf implements CometUDF {
     @Override
     public boolean nullable() { return true; }
 
-    @Override
-    public scala.collection.Seq<DataType> inputTypes() {
-        return JavaConverters.asScalaBuffer(
-            Arrays.<DataType>asList(DataTypes.IntegerType)).toSeq();
-    }
-
     @Override
     public ValueVector evaluate(ValueVector[] inputs) {
         IntVector input = (IntVector) inputs[0];
@@ -92,18 +88,54 @@ public class IsPositiveUdf implements CometUDF {
 }
 ```
 
+### Scala
+
+```scala
+import org.apache.comet.shaded.arrow.vector.{BitVector, IntVector, ValueVector}
+import org.apache.comet.CometArrowAllocator
+import org.apache.comet.udf.CometUDF
+import org.apache.spark.sql.types.{BooleanType, DataType, IntegerType}
+
+class IsPositiveUdf extends CometUDF {
+  override def name: String = "is_positive"
+  override def returnType: DataType = BooleanType
+  override def nullable: Boolean = true
+
+  // Optional: declare only if you plan to use registerColumnarOnly.
+  override def inputTypes: Seq[DataType] = Seq(IntegerType)
+
+  override def evaluate(inputs: Array[ValueVector]): ValueVector = {
+    val input = inputs(0).asInstanceOf[IntVector]
+    val rowCount = input.getValueCount
+    val result = new BitVector("result", CometArrowAllocator)
+    result.allocateNew(rowCount)
+    var i = 0
+    while (i < rowCount) {
+      if (input.isNull(i)) result.setNull(i)
+      else result.set(i, if (input.get(i) > 0) 1 else 0)
+      i += 1
+    }
+    result.setValueCount(rowCount)
+    result
+  }
+}
+```
+
 Key requirements:
 
-- The class must have a **public no-arg constructor**
-- Arrow types must be imported from `org.apache.comet.shaded.arrow.*` (the relocated package)
-- Input vectors arrive at the row count of the current batch
-- Scalar (literal) arguments arrive as length-1 vectors — read at index 0
-- The returned vector's length **must match** the longest input vector
-- Instances are cached per executor thread, so implementations should be **stateless**
-- `inputTypes` is required only for columnar-only registration (see below)
+- The class must have a **public no-arg constructor**.
+- Arrow types must be imported from `org.apache.comet.shaded.arrow.*` (the relocated package).
+- Input vectors arrive at the row count of the current batch.
+- Scalar (literal) arguments arrive as length-1 vectors: read at index 0.
+- The returned vector's length **must match** the longest input vector.
+- Instances are cached per executor thread, so implementations should be **stateless**.
+- `inputTypes` is only required for columnar-only registration (see Option 3 below).
 
 ## Registering a CometUDF
 
+There are three ways to register a `CometUDF` with Comet, depending on whether you also want a
+row-based Spark fallback.
+
 ### Option 1: Comet UDF only (existing Spark UDF)
 
 If you already have a Spark UDF registered, just tell Comet about the accelerated implementation:
@@ -126,10 +158,13 @@ import org.apache.comet.udf.CometUdfRegistry
 CometUdfRegistry.register(spark, classOf[IsPositiveUdf], (x: Int) => x > 0)
 ```
 
+Convenience overloads exist for arities 1, 2, and 3. For higher arities, use Option 1 and call
+`spark.udf.register` separately.
+
 ### Option 3: Columnar-only (no row-based equivalent)
 
 If you do not want to write a row-based fallback, Comet can synthesize a stub Spark UDF that
-throws `UnsupportedOperationException` if invoked row-at-a-time. The CometUDF must declare
+throws `UnsupportedOperationException` if invoked row-at-a-time. The `CometUDF` must declare
 `inputTypes` so the stub has the correct arity.
 
 ```scala
@@ -140,7 +175,7 @@ CometUdfRegistry.registerColumnarOnly(spark, classOf[IsPositiveUdf])
 
 When Comet is enabled and the query is supported, the vectorized implementation runs natively.
 If Comet falls back (e.g. an unsupported expression elsewhere in the plan), the stub is invoked
-and the query fails with a clear error rather than silently slow row-at-a-time execution.
+and the query fails with a clear error rather than silently degrading to row-at-a-time execution.
 
 ## How It Works
 
@@ -158,16 +193,16 @@ and the query fails with a clear error rather than silently slow row-at-a-time e
 
 ## Packaging and Deployment
 
-1. Package your `CometUDF` implementation in a JAR
-2. Include it on the Spark classpath via `--jars` or `spark.jars`
-3. Register the UDF as shown above (in your application code or via a Spark session extension)
+1. Package your `CometUDF` implementation in a JAR.
+2. Include it on the Spark classpath via `--jars` or `spark.jars`.
+3. Register the UDF as shown above (in your application code or via a Spark session extension).
 
-The CometUDF class is resolved using the executor's context classloader, so user-supplied JARs added via
+The `CometUDF` class is resolved using the executor's context classloader, so user-supplied JARs added via
 `spark.jars` or `--jars` are automatically visible.
 
 ## Limitations
 
-- Only scalar UDFs are supported (not aggregate or table UDFs)
-- The UDF must be registered by name — anonymous lambdas without a name cannot be intercepted
-- All input and output types must be representable as Arrow vectors
-- Columnar-only registration currently supports arities 1 through 5
+- Only scalar UDFs are supported (not aggregate or table UDFs).
+- The UDF must be registered by name: anonymous lambdas without a name cannot be intercepted.
+- All input and output types must be representable as Arrow vectors.
+- `registerColumnarOnly` currently supports arities 1 through 5.
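
Note on the "Key requirements" list in the diff above: the two rules about scalar arguments (length-1 inputs read at index 0, and an output as long as the longest input) may be easier to follow with a small example. The following is a minimal sketch, not Comet code. It assumes plain `Integer[]` and `Boolean[]` arrays as stand-ins for Arrow vectors, with `null` marking a SQL null. The class `ScalarBroadcastSketch` and its methods `valueAt` and `greaterThan` are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;

public class ScalarBroadcastSketch {

    // Hypothetical helper: a length-1 input is treated as a scalar (literal),
    // so every row reads index 0. Longer inputs are read at the row index.
    static Integer valueAt(Integer[] input, int row) {
        return input.length == 1 ? input[0] : input[row];
    }

    // Compares two inputs row by row. The result has the length of the
    // longest input, and a null on either side yields a null result.
    static Boolean[] greaterThan(Integer[] left, Integer[] right) {
        int rowCount = Math.max(left.length, right.length);
        Boolean[] result = new Boolean[rowCount];
        for (int i = 0; i < rowCount; i++) {
            Integer l = valueAt(left, i);
            Integer r = valueAt(right, i);
            result[i] = (l == null || r == null) ? null : Boolean.valueOf(l > r);
        }
        return result;
    }

    public static void main(String[] args) {
        Integer[] column = {3, null, -1, 7}; // a full batch column
        Integer[] literal = {2};             // a scalar argument, length 1
        // prints [true, null, false, true]
        System.out.println(Arrays.toString(greaterThan(column, literal)));
    }
}
```

In a real `CometUDF`, the same pattern would apply to the shaded Arrow vectors inside `evaluate`: take the row count from the longest input, and read any length-1 input at index 0 for every row.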