Skip to content

Commit 6fcd81c

Browse files
committed
Merge remote-tracking branch 'apache/main' into codegen_scala_udf
# Conflicts: # common/src/main/java/org/apache/comet/udf/CometUdfBridge.java # common/src/main/scala/org/apache/comet/udf/CometUDF.scala # docs/source/contributor-guide/index.md # native/core/src/execution/jni_api.rs # native/core/src/execution/planner.rs # native/spark-expr/src/jvm_udf/mod.rs # spark/src/main/scala/org/apache/comet/CometExecIterator.scala # spark/src/main/scala/org/apache/comet/serde/strings.scala
2 parents 4be8144 + c12b8e3 commit 6fcd81c

63 files changed

Lines changed: 2495 additions & 304 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/workflows/pr_build_linux.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -329,6 +329,7 @@ jobs:
329329
org.apache.comet.parquet.ParquetReadV1Suite
330330
org.apache.comet.parquet.ParquetReadV2Suite
331331
org.apache.comet.parquet.ParquetReadFromFakeHadoopFsSuite
332+
org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite
332333
org.apache.spark.sql.comet.ParquetDatetimeRebaseV1Suite
333334
org.apache.spark.sql.comet.ParquetDatetimeRebaseV2Suite
334335
org.apache.spark.sql.comet.ParquetEncryptionITCase

.github/workflows/pr_build_macos.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -177,6 +177,7 @@ jobs:
177177
org.apache.comet.parquet.ParquetReadV1Suite
178178
org.apache.comet.parquet.ParquetReadV2Suite
179179
org.apache.comet.parquet.ParquetReadFromFakeHadoopFsSuite
180+
org.apache.comet.parquet.ParquetTimestampLtzAsNtzSuite
180181
org.apache.spark.sql.comet.ParquetDatetimeRebaseV1Suite
181182
org.apache.spark.sql.comet.ParquetDatetimeRebaseV2Suite
182183
org.apache.spark.sql.comet.ParquetEncryptionITCase

AGENTS.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
<!--
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
# Agent Guidelines for Apache DataFusion Comet
21+
22+
Read the [Contributor Guide](docs/source/contributor-guide/index.md) before making changes.
23+
Relevant entry points:
24+
25+
- [Development Guide](docs/source/contributor-guide/development.md): build, test, and common
26+
pitfalls (including why `-pl` must not be used and the JVM/native build order).
27+
- [Spark SQL Tests](docs/source/contributor-guide/spark-sql-tests.md): the only supported way
28+
to modify files under `dev/diffs/`. **Never hand-edit a diff file.** Clone Spark, apply the
29+
existing diff, modify the Spark source, then regenerate the diff as documented there.
30+
- [Adding a New Expression](docs/source/contributor-guide/adding_a_new_expression.md) /
31+
[Adding a New Operator](docs/source/contributor-guide/adding_a_new_operator.md).
32+
- [Debugging Guide](docs/source/contributor-guide/debugging.md).
33+
34+
When opening a pull request, use the [PR template](.github/pull_request_template.md) and fill
35+
in every section.

README.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -40,17 +40,17 @@ Apache DataFusion Comet is a high-performance accelerator for Apache Spark, buil
4040
performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
4141
Spark ecosystem without requiring any code changes.
4242

43-
**Comet provides a 2x speedup for TPC-H @ 1TB, resulting in 50% cost savings.**
43+
**Comet provides a ~2x speedup for TPC-DS @ SF 1000 (1TB), resulting in ~50% cost savings.**
4444

4545
That 2x speedup gives you a choice: finish the same Spark workload in half the time on the cluster you already have,
4646
or match your current Spark performance on roughly half the resources. Either way, the gain translates directly into
4747
lower cloud bills, reduced on-prem capacity, and lower energy usage, with no changes to your existing Spark SQL,
4848
DataFrame, or PySpark code. Comet runs on commodity hardware: no GPUs, FPGAs, or other specialized accelerators are
4949
required, so the savings come from better utilization of the infrastructure you already run on.
5050

51-
![](docs/source/_static/images/benchmark-results/0.15.0/tpch_allqueries.png)
51+
![](docs/source/_static/images/benchmark-results/0.16.0/tpcds_allqueries.png)
5252

53-
![](docs/source/_static/images/benchmark-results/0.15.0/tpch_queries_compare.png)
53+
![](docs/source/_static/images/benchmark-results/0.16.0/tpcds_queries_speedup_rel.png)
5454

5555
See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html) for more details.
5656

@@ -81,7 +81,7 @@ benefits of Comet's acceleration capabilities without disrupting your Spark appl
8181

8282
## Getting Started
8383

84-
Comet supports Apache Spark 3.4, 3.5, and 4.0, and provides experimental support for Spark 4.1 and 4.2. See the
84+
Comet supports Apache Spark 3.4, 3.5, 4.0, and 4.1, and provides experimental support for Spark 4.2. See the
8585
[installation guide](https://datafusion.apache.org/comet/user-guide/installation.html) for the detailed
8686
version, Java, and Scala compatibility matrix.
8787

common/src/main/java/org/apache/comet/udf/CometUdfBridge.java

Lines changed: 42 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -19,8 +19,7 @@
1919

2020
package org.apache.comet.udf;
2121

22-
import java.util.LinkedHashMap;
23-
import java.util.Map;
22+
import java.util.concurrent.ConcurrentHashMap;
2423

2524
import org.apache.arrow.c.ArrowArray;
2625
import org.apache.arrow.c.ArrowSchema;
@@ -38,23 +37,10 @@
3837
*/
3938
public class CometUdfBridge {
4039

41-
// Per-thread, bounded LRU of UDF instances keyed by class name. Comet
42-
// native execution threads (Tokio/DataFusion worker pool) are reused
43-
// across tasks within an executor, so the effective lifetime of cached
44-
// entries is the worker thread (i.e. the executor JVM). Fine for
45-
// stateless UDFs; future stateful UDFs would need explicit per-task
46-
// isolation.
47-
private static final int CACHE_CAPACITY = 64;
48-
49-
private static final ThreadLocal<LinkedHashMap<String, CometUDF>> INSTANCES =
50-
ThreadLocal.withInitial(
51-
() ->
52-
new LinkedHashMap<String, CometUDF>(CACHE_CAPACITY, 0.75f, true) {
53-
@Override
54-
protected boolean removeEldestEntry(Map.Entry<String, CometUDF> eldest) {
55-
return size() > CACHE_CAPACITY;
56-
}
57-
});
40+
// Process-wide cache of UDF instances keyed by class name. CometUDF
41+
// implementations are required to be stateless (see CometUDF), so a
42+
// single shared instance per class is safe across native worker threads.
43+
private static final ConcurrentHashMap<String, CometUDF> INSTANCES = new ConcurrentHashMap<>();
5844

5945
/**
6046
* Called from native via JNI.
@@ -64,19 +50,15 @@ protected boolean removeEldestEntry(Map.Entry<String, CometUDF> eldest) {
6450
* @param inputSchemaPtrs addresses of pre-allocated FFI_ArrowSchema structs (one per input)
6551
* @param outArrayPtr address of pre-allocated FFI_ArrowArray for the result
6652
* @param outSchemaPtr address of pre-allocated FFI_ArrowSchema for the result
67-
* @param numRows number of rows in the current batch. Mirrors DataFusion's {@code
68-
* ScalarFunctionArgs.number_rows} and gives UDFs an explicit batch-size signal for cases
69-
* where no input arg is a batch-length array (e.g. a zero-arg non-deterministic ScalaUDF).
70-
* UDFs that already read size from their input vectors can ignore it.
71-
* @param taskContext Spark {@link TaskContext} captured on the driving Spark task thread and
72-
* passed through from native. May be {@code null} when the bridge is invoked outside a Spark
73-
* task (unit tests, direct native driver runs). When non-null and the current thread has no
74-
* {@code TaskContext} of its own, the bridge installs it as the thread-local for the duration
75-
* of the UDF call so the UDF body (including partition-sensitive built-ins like {@code Rand}
76-
* / {@code Uuid} / {@code MonotonicallyIncreasingID} that read the partition index via {@code
77-
* TaskContext.get().partitionId()}) sees the real context rather than null. The thread-local
78-
* is cleared in a {@code finally} so Tokio workers don't leak a stale TaskContext across
79-
* invocations.
53+
* @param numRows row count of the current batch. Mirrors DataFusion's {@code
54+
* ScalarFunctionArgs.number_rows}; the only batch-size signal a zero-input UDF (e.g. a
55+
* zero-arg non-deterministic ScalaUDF) ever sees.
56+
* @param taskContext propagated Spark {@link TaskContext} from the driving Spark task thread, or
57+
* {@code null} outside a Spark task. Treated as ground truth for the call: installed as the
58+
* thread-local on entry, with the prior value (if any) saved and restored in {@code finally}.
59+
* Lets partition-sensitive built-ins ({@code Rand}, {@code Uuid}, {@code
60+
* MonotonicallyIncreasingID}) work from Tokio workers and avoids reusing a stale TaskContext
61+
* left on a worker by a previous task.
8062
*/
8163
public static void evaluate(
8264
String udfClassName,
@@ -86,17 +68,23 @@ public static void evaluate(
8668
long outSchemaPtr,
8769
int numRows,
8870
TaskContext taskContext) {
89-
boolean installedTaskContext = false;
90-
if (taskContext != null && TaskContext.get() == null) {
71+
// Save-and-restore rather than only-install-if-null: the propagated context is the ground
72+
// truth for this call. Any value already on the thread is either (a) the same object on a
73+
// Spark task thread, or (b) stale from a prior task on a reused Tokio worker.
74+
TaskContext prior = TaskContext.get();
75+
if (taskContext != null) {
9176
CometTaskContextShim.set(taskContext);
92-
installedTaskContext = true;
9377
}
9478
try {
9579
evaluateInternal(
9680
udfClassName, inputArrayPtrs, inputSchemaPtrs, outArrayPtr, outSchemaPtr, numRows);
9781
} finally {
98-
if (installedTaskContext) {
99-
CometTaskContextShim.unset();
82+
if (taskContext != null) {
83+
if (prior != null) {
84+
CometTaskContextShim.set(prior);
85+
} else {
86+
CometTaskContextShim.unset();
87+
}
10088
}
10189
}
10290
}
@@ -108,23 +96,23 @@ private static void evaluateInternal(
10896
long outArrayPtr,
10997
long outSchemaPtr,
11098
int numRows) {
111-
LinkedHashMap<String, CometUDF> cache = INSTANCES.get();
112-
CometUDF udf = cache.get(udfClassName);
113-
if (udf == null) {
114-
try {
115-
// Resolve via the executor's context classloader so user-supplied UDF jars
116-
// (added via spark.jars / --jars) are visible.
117-
ClassLoader cl = Thread.currentThread().getContextClassLoader();
118-
if (cl == null) {
119-
cl = CometUdfBridge.class.getClassLoader();
120-
}
121-
udf =
122-
(CometUDF) Class.forName(udfClassName, true, cl).getDeclaredConstructor().newInstance();
123-
} catch (ReflectiveOperationException e) {
124-
throw new RuntimeException("Failed to instantiate CometUDF: " + udfClassName, e);
125-
}
126-
cache.put(udfClassName, udf);
127-
}
99+
CometUDF udf =
100+
INSTANCES.computeIfAbsent(
101+
udfClassName,
102+
name -> {
103+
try {
104+
// Resolve via the executor's context classloader so user-supplied UDF jars
105+
// (added via spark.jars / --jars) are visible.
106+
ClassLoader cl = Thread.currentThread().getContextClassLoader();
107+
if (cl == null) {
108+
cl = CometUdfBridge.class.getClassLoader();
109+
}
110+
return (CometUDF)
111+
Class.forName(name, true, cl).getDeclaredConstructor().newInstance();
112+
} catch (ReflectiveOperationException e) {
113+
throw new RuntimeException("Failed to instantiate CometUDF: " + name, e);
114+
}
115+
});
128116

129117
BufferAllocator allocator = org.apache.comet.package$.MODULE$.CometArrowAllocator();
130118

dev/diffs/3.4.3.diff

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2170,7 +2170,7 @@ index 29cb224c878..ee5a87fa200 100644
21702170

21712171
- test("SPARK-36182: can't read TimestampLTZ as TimestampNTZ") {
21722172
+ test("SPARK-36182: can't read TimestampLTZ as TimestampNTZ",
2173-
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
2173+
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/4219")) {
21742174
val data = (1 to 1000).map { i =>
21752175
val ts = new java.sql.Timestamp(i)
21762176
Row(ts)

dev/diffs/3.5.8.diff

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2137,7 +2137,7 @@ index f6472ba3d9d..5ea2d938664 100644
21372137

21382138
- test("SPARK-36182: can't read TimestampLTZ as TimestampNTZ") {
21392139
+ test("SPARK-36182: can't read TimestampLTZ as TimestampNTZ",
2140-
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
2140+
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/4219")) {
21412141
val data = (1 to 1000).map { i =>
21422142
val ts = new java.sql.Timestamp(i)
21432143
Row(ts)

dev/diffs/4.0.2.diff

Lines changed: 6 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -2729,7 +2729,7 @@ index 4474ec1fd42..05fa0257c82 100644
27292729
checkAnswer(
27302730
// "fruit" column in this file is encoded using DELTA_LENGTH_BYTE_ARRAY.
27312731
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
2732-
index bba71f1c48d..35247c13ad9 100644
2732+
index bba71f1c48d..5a111a937a9 100644
27332733
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
27342734
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
27352735
@@ -27,6 +27,7 @@ import org.apache.parquet.hadoop.ParquetOutputFormat
@@ -2740,17 +2740,7 @@ index bba71f1c48d..35247c13ad9 100644
27402740
import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
27412741
import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
27422742
import org.apache.spark.sql.catalyst.util.ArrayData
2743-
@@ -185,7 +186,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
2744-
}
2745-
}
2746-
2747-
- test("SPARK-47447: read TimestampLTZ as TimestampNTZ") {
2748-
+ test("SPARK-47447: read TimestampLTZ as TimestampNTZ",
2749-
+ IgnoreCometNativeDataFusion("https://github.com/apache/datafusion-comet/issues/3720")) {
2750-
val providedSchema = StructType(Seq(StructField("time", TimestampNTZType, false)))
2751-
2752-
Seq("INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS").foreach { tsType =>
2753-
@@ -996,7 +998,11 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
2743+
@@ -996,7 +997,11 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
27542744
Seq(Some("A"), Some("A"), None).toDF().repartition(1)
27552745
.write.parquet(path.getAbsolutePath)
27562746
val df = spark.read.parquet(path.getAbsolutePath)
@@ -2763,7 +2753,7 @@ index bba71f1c48d..35247c13ad9 100644
27632753
}
27642754
}
27652755
}
2766-
@@ -1042,7 +1048,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
2756+
@@ -1042,7 +1047,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
27672757
testMigration(fromTsType = "TIMESTAMP_MICROS", toTsType = "INT96")
27682758
}
27692759

@@ -2773,7 +2763,7 @@ index bba71f1c48d..35247c13ad9 100644
27732763
def readParquet(schema: String, path: File): DataFrame = {
27742764
spark.read.schema(schema).parquet(path.toString)
27752765
}
2776-
@@ -1060,7 +1067,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
2766+
@@ -1060,7 +1066,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
27772767
checkAnswer(readParquet(schema2, path), df)
27782768
}
27792769

@@ -2783,7 +2773,7 @@ index bba71f1c48d..35247c13ad9 100644
27832773
val schema1 = "a DECIMAL(3, 2), b DECIMAL(18, 3), c DECIMAL(37, 3)"
27842774
checkAnswer(readParquet(schema1, path), df)
27852775
val schema2 = "a DECIMAL(3, 0), b DECIMAL(18, 1), c DECIMAL(37, 1)"
2786-
@@ -1084,7 +1092,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
2776+
@@ -1084,7 +1091,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
27872777
val df = sql(s"SELECT 1 a, 123456 b, ${Int.MaxValue.toLong * 10} c, CAST('1.2' AS BINARY) d")
27882778
df.write.parquet(path.toString)
27892779

@@ -2793,7 +2783,7 @@ index bba71f1c48d..35247c13ad9 100644
27932783
checkAnswer(readParquet("a DECIMAL(3, 2)", path), sql("SELECT 1.00"))
27942784
checkAnswer(readParquet("a DECIMAL(11, 2)", path), sql("SELECT 1.00"))
27952785
checkAnswer(readParquet("b DECIMAL(3, 2)", path), Row(null))
2796-
@@ -1131,7 +1140,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
2786+
@@ -1131,7 +1139,8 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS
27972787
}
27982788
}
27992789

0 commit comments

Comments
 (0)