Commit 08d6b78

prettier, add new suites to CI checks.
1 parent 1746bcc commit 08d6b78

4 files changed

Lines changed: 45 additions & 37 deletions

.github/workflows/pr_build_linux.yml

Lines changed: 4 additions & 0 deletions
@@ -318,6 +318,7 @@ jobs:
 org.apache.comet.CometFuzzAggregateSuite
 org.apache.comet.CometFuzzIcebergSuite
 org.apache.comet.CometFuzzMathSuite
+org.apache.comet.CometCodegenDispatchFuzzSuite
 org.apache.comet.DataGeneratorSuite
 - name: "shuffle"
 value: |
@@ -394,6 +395,9 @@ jobs:
 org.apache.comet.expressions.conditional.CometIfSuite
 org.apache.comet.expressions.conditional.CometCoalesceSuite
 org.apache.comet.expressions.conditional.CometCaseWhenSuite
+org.apache.comet.CometCodegenDispatchSmokeSuite
+org.apache.comet.CometCodegenSourceSuite
+org.apache.comet.CometRegExpJvmSuite
 - name: "sql"
 value: |
 org.apache.spark.sql.CometToPrettyStringSuite

.github/workflows/pr_build_macos.yml

Lines changed: 4 additions & 0 deletions
@@ -157,6 +157,7 @@ jobs:
 org.apache.comet.CometFuzzAggregateSuite
 org.apache.comet.CometFuzzIcebergSuite
 org.apache.comet.CometFuzzMathSuite
+org.apache.comet.CometCodegenDispatchFuzzSuite
 org.apache.comet.DataGeneratorSuite
 - name: "shuffle"
 value: |
@@ -232,6 +233,9 @@ jobs:
 org.apache.comet.expressions.conditional.CometIfSuite
 org.apache.comet.expressions.conditional.CometCoalesceSuite
 org.apache.comet.expressions.conditional.CometCaseWhenSuite
+org.apache.comet.CometCodegenDispatchSmokeSuite
+org.apache.comet.CometCodegenSourceSuite
+org.apache.comet.CometRegExpJvmSuite
 - name: "sql"
 value: |
 org.apache.spark.sql.CometToPrettyStringSuite

docs/source/contributor-guide/jvm_udf_dispatch.md

Lines changed: 32 additions & 32 deletions
@@ -120,21 +120,21 @@ Before this serde, any `ScalaUDF` in a plan forced Comet to fall back to Spark i
 
 ### What's covered
 
-| What users write | Spark expression class | Route through codegen |
-|---|---|---|
-| `udf((x: T) => ...)` or `spark.udf.register` (Scala) | `ScalaUDF` | yes |
-| `spark.udf.register("f", new UDF1[...]{...})` (Java) | `ScalaUDF` (Spark wraps the Java functional interface) | yes, transparently |
-| `CREATE FUNCTION foo AS 'com.example.MyUDF'` (SQL registration) | `ScalaUDF` | yes, if the user class is reachable on the executor classpath |
+| What users write | Spark expression class | Route through codegen |
+| --------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------------- |
+| `udf((x: T) => ...)` or `spark.udf.register` (Scala) | `ScalaUDF` | yes |
+| `spark.udf.register("f", new UDF1[...]{...})` (Java) | `ScalaUDF` (Spark wraps the Java functional interface) | yes, transparently |
+| `CREATE FUNCTION foo AS 'com.example.MyUDF'` (SQL registration) | `ScalaUDF` | yes, if the user class is reachable on the executor classpath |
 
 ### What's not covered
 
-| What users write | Spark expression class | Why not |
-|---|---|---|
-| Aggregate UDF | `ScalaAggregator`, `TypedImperativeAggregate`, old `UserDefinedAggregateFunction` | accumulator-based; needs a different bridge contract (accumulate + merge + finalize) |
-| Table UDF / generator | `UserDefinedTableFunction` | 1 row → N rows; `canHandle` rejects `Generator` |
-| Python `@udf` | `PythonUDF` | subprocess runtime, not JVM |
-| Pandas `@pandas_udf` | `PandasUDF` | Arrow-via-subprocess runtime |
-| Hive `GenericUDF` / `SimpleUDF` | `HiveGenericUDF` / `HiveSimpleUDF` | separate expression classes; would need their own serde |
+| What users write | Spark expression class | Why not |
+| ------------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| Aggregate UDF | `ScalaAggregator`, `TypedImperativeAggregate`, old `UserDefinedAggregateFunction` | accumulator-based; needs a different bridge contract (accumulate + merge + finalize) |
+| Table UDF / generator | `UserDefinedTableFunction` | 1 row → N rows; `canHandle` rejects `Generator` |
+| Python `@udf` | `PythonUDF` | subprocess runtime, not JVM |
+| Pandas `@pandas_udf` | `PandasUDF` | Arrow-via-subprocess runtime |
+| Hive `GenericUDF` / `SimpleUDF` | `HiveGenericUDF` / `HiveSimpleUDF` | separate expression classes; would need their own serde |
 
 ### Constraints within the ScalaUDF path
 
@@ -158,18 +158,18 @@ There is no native or hand-coded fallback for arbitrary user functions; codegen
 
 All scalar Spark types that map to a single Arrow vector:
 
-| Spark type | Arrow vector class | `InternalRow` getter |
-|---|---|---|
-| BooleanType | BitVector | `getBoolean` |
-| ByteType | TinyIntVector | `getByte` |
-| ShortType | SmallIntVector | `getShort` |
-| IntegerType, DateType | IntVector, DateDayVector | `getInt` |
-| LongType, TimestampType, TimestampNTZType | BigIntVector, TimeStampMicroVector, TimeStampMicroTZVector | `getLong` |
-| FloatType | Float4Vector | `getFloat` |
-| DoubleType | Float8Vector | `getDouble` |
-| DecimalType | DecimalVector | `getDecimal(ord, precision, scale)` |
-| StringType | VarCharVector, ViewVarCharVector | `getUTF8String` (zero-copy via `UTF8String.fromAddress`) |
-| BinaryType | VarBinaryVector, ViewVarBinaryVector | `getBinary` (allocates `byte[]`) |
+| Spark type | Arrow vector class | `InternalRow` getter |
+| ----------------------------------------- | ---------------------------------------------------------- | -------------------------------------------------------- |
+| BooleanType | BitVector | `getBoolean` |
+| ByteType | TinyIntVector | `getByte` |
+| ShortType | SmallIntVector | `getShort` |
+| IntegerType, DateType | IntVector, DateDayVector | `getInt` |
+| LongType, TimestampType, TimestampNTZType | BigIntVector, TimeStampMicroVector, TimeStampMicroTZVector | `getLong` |
+| FloatType | Float4Vector | `getFloat` |
+| DoubleType | Float8Vector | `getDouble` |
+| DecimalType | DecimalVector | `getDecimal(ord, precision, scale)` |
+| StringType | VarCharVector, ViewVarCharVector | `getUTF8String` (zero-copy via `UTF8String.fromAddress`) |
+| BinaryType | VarBinaryVector, ViewVarBinaryVector | `getBinary` (allocates `byte[]`) |
 
 Widening: add cases to `CometBatchKernelCodegen.typedInputAccessors` and accept the new vector classes in `CometCodegenDispatchUDF.evaluate`'s input pattern match.
 

@@ -185,14 +185,14 @@ All scalar Spark types that map to a single Arrow vector: `Boolean`, `Byte`, `Sh
185185

186186
## Choosing between approaches
187187

188-
| Criterion | Hand-coded | Codegen dispatch |
189-
|---|---|---|
190-
| Classes per expression | one | zero |
191-
| Per-row loop | hand-written Scala | compiled Java |
192-
| Arrow read / write | hand-written | compiled Java |
193-
| Expression evaluation | hand-written | compiled via Spark `doGenCode`, inlined into the fused loop |
194-
| Composed expression trees | no (without native support for children) | yes |
195-
| Adding a new expression | new UDF class + serde branch | free within the supported type surface |
188+
| Criterion | Hand-coded | Codegen dispatch |
189+
| ------------------------- | ---------------------------------------- | ----------------------------------------------------------- |
190+
| Classes per expression | one | zero |
191+
| Per-row loop | hand-written Scala | compiled Java |
192+
| Arrow read / write | hand-written | compiled Java |
193+
| Expression evaluation | hand-written | compiled via Spark `doGenCode`, inlined into the fused loop |
194+
| Composed expression trees | no (without native support for children) | yes |
195+
| Adding a new expression | new UDF class + serde branch | free within the supported type surface |
196196

197197
Rule of thumb: pick hand-coded when the expression is hot enough to justify per-expression maintenance or has specialization the generic path cannot match; pick codegen dispatch when you would otherwise fall back to Spark, or when the expression composes naturally with others and you want the free composition.
198198

docs/source/user-guide/latest/compatibility/regex.md

Lines changed: 5 additions & 5 deletions
@@ -29,12 +29,12 @@ spark.comet.exec.regexp.engine=rust
 
 ## Choosing an engine
 
-| | Java engine | Rust engine |
-|---|---|---|
-| **Compatibility** | 100% compatible with Spark | Pattern-dependent differences |
+| | Java engine | Rust engine |
+| -------------------- | ------------------------------------------------------------------------------------------------------------------- | --------------------------------------- |
+| **Compatibility** | 100% compatible with Spark | Pattern-dependent differences |
 | **Feature coverage** | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) | `rlike`, `regexp_replace`, `split` only |
-| **Performance** | One JNI round-trip per batch (Arrow vectors stay columnar) | Fully native, no JNI overhead |
-| **Pattern support** | All Java regex features (backreferences, lookaround, etc.) | Linear-time subset only |
+| **Performance** | One JNI round-trip per batch (Arrow vectors stay columnar) | Fully native, no JNI overhead |
+| **Pattern support** | All Java regex features (backreferences, lookaround, etc.) | Linear-time subset only |
 
 The **Java engine** (default) is recommended for correctness-sensitive workloads. It evaluates expressions by
 passing Arrow vectors to a JVM-side UDF that uses `java.util.regex`, producing identical results to Spark for
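
Reviewer note: the "Pattern support" row in `regex.md` names backreferences and lookaround as Java-engine features. The sketch below is not part of this commit. It uses plain `java.util.regex` to show what those two features look like, and the class and method names (`RegexFeatureDemo`, `hasRepeatedWord`, `firstPixelValue`) are hypothetical, chosen only for illustration. It does not call any Comet API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Comet.
public class RegexFeatureDemo {
    // Backreference: \1 must match the same text that group 1 captured.
    private static final Pattern REPEATED_WORD =
        Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Lookahead: match digits only when "px" follows, without consuming "px".
    private static final Pattern DIGITS_BEFORE_PX =
        Pattern.compile("\\d+(?=px)");

    // Returns true when the input contains the same word twice in a row.
    public static boolean hasRepeatedWord(String s) {
        return REPEATED_WORD.matcher(s).find();
    }

    // Returns the first run of digits directly followed by "px", or null.
    public static String firstPixelValue(String s) {
        Matcher m = DIGITS_BEFORE_PX.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedWord("the the cat"));
        System.out.println(hasRepeatedWord("the cat"));
        System.out.println(firstPixelValue("width: 42px"));
    }
}
```

Going by the table, patterns of this kind fall outside the Rust engine's linear-time subset, so queries that depend on them would presumably need the Java engine.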
