Skip to content

Commit d9652a5

Browse files
committed
Merge remote-tracking branch 'apache/main' into direct-native-shuffle-execution
2 parents f8fd7d9 + d3b2007 commit d9652a5

889 files changed

Lines changed: 16204 additions & 8544 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
Lines changed: 297 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,297 @@
1+
---
2+
name: review-comet-pr
3+
description: Review a DataFusion Comet pull request for Spark compatibility and implementation correctness. Provides guidance to a reviewer rather than posting comments directly.
4+
argument-hint: <pr-number>
5+
---
6+
7+
Review Comet PR #$ARGUMENTS
8+
9+
## Before You Start
10+
11+
### Gather PR Metadata
12+
13+
Fetch the PR details to understand the scope:
14+
15+
```bash
16+
gh pr view $ARGUMENTS --repo apache/datafusion-comet --json title,body,author,isDraft,state,files
17+
```
18+
19+
### Review Existing Comments First
20+
21+
Before forming your review:
22+
23+
1. **Read all existing review comments** on the PR
24+
2. **Check the conversation tab** for any discussion
25+
3. **Avoid duplicating feedback** that others have already provided
26+
4. **Build on existing discussions** rather than starting new threads on the same topic
27+
5. **If you have no additional concerns beyond what's already discussed, say so**
28+
6. **Ignore Copilot reviews** - do not reference or build upon comments from GitHub Copilot
29+
30+
```bash
31+
# View existing comments on a PR
32+
gh pr view $ARGUMENTS --repo apache/datafusion-comet --comments
33+
```
34+
35+
---
36+
37+
## Review Workflow
38+
39+
### 1. Gather Context
40+
41+
Read the changed files and understand the area of the codebase being modified:
42+
43+
```bash
44+
# View the diff
45+
gh pr diff $ARGUMENTS --repo apache/datafusion-comet
46+
```
47+
48+
For expression PRs, check how similar expressions are implemented in the codebase. Look at the serde files in `spark/src/main/scala/org/apache/comet/serde/` and Rust implementations in `native/spark-expr/src/`.
49+
50+
### 2. Read Spark Source (Expression PRs)
51+
52+
**For any PR that adds or modifies an expression, you must read the Spark source code to understand the canonical behavior.** This is the authoritative reference for what Comet must match.
53+
54+
1. **Clone or update the Spark repo:**
55+
56+
```bash
57+
# Clone if not already present (use /tmp to avoid polluting the workspace)
58+
if [ ! -d /tmp/spark ]; then
59+
git clone --depth 1 https://github.com/apache/spark.git /tmp/spark
60+
fi
61+
```
62+
63+
2. **Find the expression implementation in Spark:**
64+
65+
```bash
66+
# Search for the expression class (e.g., for "Conv", "Hex", "Substring")
67+
find /tmp/spark/sql/catalyst/src/main/scala -name "*.scala" | xargs grep -l "case class <ExpressionName>"
68+
```
69+
70+
3. **Read the Spark implementation carefully.** Pay attention to:
71+
- The `eval` and `doGenEval`/`nullSafeEval` methods. These define the exact behavior.
72+
- The `inputTypes` and `dataType` fields. These define which types Spark accepts and what it returns.
73+
- Null handling. Does it use `nullable = true`? Does `nullSafeEval` handle nulls implicitly?
74+
- Special cases, guards, and `require` assertions.
75+
- ANSI mode branches (look for `SQLConf.get.ansiEnabled` or `failOnError`).
76+
77+
4. **Read the Spark tests for the expression:**
78+
79+
```bash
80+
# Find test files
81+
find /tmp/spark/sql -name "*.scala" -path "*/test/*" | xargs grep -l "<ExpressionName>"
82+
```
83+
84+
5. **Compare the Spark behavior against the Comet implementation in the PR.** Identify:
85+
- Edge cases tested in Spark but not in the PR
86+
- Data types supported in Spark but not handled in the PR
87+
- Behavioral differences that should be marked `Incompatible`
88+
89+
6. **Suggest additional tests** for any edge cases or type combinations covered in Spark's tests that are missing from the PR's tests.
90+
91+
### 3. Spark Compatibility Check
92+
93+
**This is the most critical aspect of Comet reviews.** Comet must produce identical results to Spark.
94+
95+
For expression PRs, verify against the Spark source you read in step 2:
96+
97+
1. **Check edge cases**
98+
- Null handling
99+
- Overflow behavior
100+
- Empty input behavior
101+
- Type-specific behavior
102+
103+
2. **Verify all data types are handled**
104+
- Does Spark support this type? (Check `inputTypes` in Spark source)
105+
- Does the PR handle all Spark-supported types?
106+
107+
3. **Check for ANSI mode differences**
108+
- Spark behavior may differ between legacy and ANSI modes
109+
- PR should handle both or mark as `Incompatible`
110+
111+
### 4. Check Against Implementation Guidelines
112+
113+
**Always verify PRs follow the implementation guidelines.**
114+
115+
#### Scala Serde (`spark/src/main/scala/org/apache/comet/serde/`)
116+
117+
- [ ] Expression class correctly identified
118+
- [ ] All child expressions converted via `exprToProtoInternal`
119+
- [ ] Return type correctly serialized
120+
- [ ] `getSupportLevel` reflects true compatibility:
121+
- `Compatible()` - matches Spark exactly
122+
- `Incompatible(Some("reason"))` - differs in documented ways
123+
- `Unsupported(Some("reason"))` - cannot be implemented
124+
- [ ] Serde in appropriate file (`datetime.scala`, `strings.scala`, `arithmetic.scala`, etc.)
125+
126+
#### Registration (`QueryPlanSerde.scala`)
127+
128+
- [ ] Added to correct map (temporal, string, arithmetic, etc.)
129+
- [ ] No duplicate registrations
130+
- [ ] Import statement added
131+
132+
#### Rust Implementation (if applicable)
133+
134+
Location: `native/spark-expr/src/`
135+
136+
- [ ] Matches DataFusion and Arrow conventions
137+
- [ ] Null handling is correct
138+
- [ ] No panics. Use `Result` types.
139+
- [ ] Efficient array operations (avoid row-by-row)
140+
141+
#### Tests - Prefer SQL File-Based Framework
142+
143+
**Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) where possible.** This framework automatically runs each query through both Spark and Comet and compares results. No Scala code is needed. Only fall back to Scala tests in `CometExpressionSuite` when the SQL framework cannot express the test. Examples include complex `DataFrame` setup, programmatic data generation, or non-expression tests.
144+
145+
**SQL file test location:** `spark/src/test/resources/sql-tests/expressions/<category>/`
146+
147+
Categories include: `aggregate/`, `array/`, `string/`, `math/`, `struct/`, `map/`, `datetime/`, `hash/`, etc.
148+
149+
**SQL file structure:**
150+
151+
```sql
152+
-- ConfigMatrix: parquet.enable.dictionary=false,true
153+
154+
-- Create test data
155+
statement
156+
CREATE TABLE test_crc32(col string, a int, b float) USING parquet
157+
158+
statement
159+
INSERT INTO test_crc32 VALUES ('Spark', 10, 1.5), (NULL, NULL, NULL), ('', 0, 0.0)
160+
161+
-- Default mode: verifies native Comet execution + result matches Spark
162+
query
163+
SELECT crc32(col) FROM test_crc32
164+
165+
-- spark_answer_only: compares results without requiring native execution
166+
query spark_answer_only
167+
SELECT crc32(cast(a as string)) FROM test_crc32
168+
169+
-- tolerance: allows numeric variance for floating-point results
170+
query tolerance=0.0001
171+
SELECT cos(v) FROM test_trig
172+
173+
-- expect_fallback: asserts fallback to Spark occurs
174+
query expect_fallback(unsupported expression)
175+
SELECT unsupported_func(v) FROM test_table
176+
177+
-- expect_error: verifies both engines throw matching exceptions
178+
query expect_error(ARITHMETIC_OVERFLOW)
179+
SELECT 2147483647 + 1
180+
181+
-- ignore: skip queries with known bugs (include GitHub issue link)
182+
query ignore(https://github.com/apache/datafusion-comet/issues/NNNN)
183+
SELECT known_buggy_expr(v) FROM test_table
184+
```
185+
186+
**Running SQL file tests:**
187+
188+
```bash
189+
# All SQL file tests
190+
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none
191+
192+
# Specific test file (substring match)
193+
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Dtest=none
194+
```
195+
196+
**CRITICAL: Verify all test requirements (regardless of framework):**
197+
198+
- [ ] Basic functionality tested (column data, not just literals)
199+
- [ ] Null handling tested (`SELECT expression(NULL)`)
200+
- [ ] Edge cases tested (empty input, overflow, boundary values)
201+
- [ ] Both literal values and column references tested (they use different code paths)
202+
- [ ] For timestamp/datetime expressions, timezone handling is tested (e.g., UTC, non-UTC session timezone, timestamps with and without timezone)
203+
- [ ] One expression per SQL file for easier debugging
204+
- [ ] If using Scala tests instead, literal tests MUST disable constant folding:
205+
```scala
206+
withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
207+
"org.apache.spark.sql.catalyst.optimizer.ConstantFolding") {
208+
checkSparkAnswerAndOperator("SELECT func(literal)")
209+
}
210+
```
211+
212+
### 5. Performance Review (Expression PRs)
213+
214+
**For PRs that add new expressions, performance is not optional.** The whole point of Comet is to be faster than Spark. If a new expression is not faster, it may not be worth adding.
215+
216+
1. **Check that the PR includes microbenchmark results.** The PR description should contain benchmark numbers comparing Comet vs Spark for the new expression. If benchmark results are missing, flag this as a required addition.
217+
218+
2. **Look for a microbenchmark implementation.** Expression benchmarks live in `spark/src/test/scala/org/apache/spark/sql/benchmark/`. Check whether the PR adds a benchmark for the new expression.
219+
220+
3. **Review the benchmark results if provided:**
221+
- Is Comet actually faster than Spark for this expression?
222+
- Are the benchmarks representative? They should test with realistic data sizes, not just trivial inputs.
223+
- Are different data types benchmarked if the expression supports multiple types?
224+
225+
4. **Review the Rust implementation for performance concerns:**
226+
- Unnecessary allocations or copies
227+
- Row-by-row processing where batch/array operations are possible
228+
- Redundant type conversions
229+
- Inefficient string handling (e.g., repeated UTF-8 validation)
230+
- Missing use of Arrow compute kernels where they exist
231+
232+
5. **If benchmark results show Comet is slower than Spark**, flag this clearly. The PR should explain why the regression is acceptable or include a plan to optimize.
233+
234+
### 6. Check CI Test Failures
235+
236+
**Always check the CI status and summarize any test failures in your review.**
237+
238+
```bash
239+
# View CI check status
240+
gh pr checks $ARGUMENTS --repo apache/datafusion-comet
241+
242+
# View failed check details
243+
gh pr checks $ARGUMENTS --repo apache/datafusion-comet --failed
244+
```
245+
246+
### 7. Documentation Check
247+
248+
Check whether the PR requires updates to user-facing documentation in `docs/`:
249+
250+
- **Compatibility guide** (`docs/source/user-guide/compatibility.md`): New expressions or operators should be listed. Incompatible behaviors should be documented.
251+
- **Configuration guide** (`docs/source/user-guide/configs.md`): New config options should be documented.
252+
- **Expressions list** (`docs/source/user-guide/expressions.md`): New expressions should be added.
253+
254+
If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed.
255+
256+
### 8. Common Comet Review Issues
257+
258+
1. **Incomplete type support**: Spark expression supports types not handled in PR
259+
2. **Missing edge cases**: Null, overflow, empty string, negative values
260+
3. **Wrong return type**: Return type must match Spark exactly
261+
4. **Tests in wrong framework**: Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) rather than adding to Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead.
262+
5. **Stale native code**: PR might need `./mvnw install -pl common -DskipTests`
263+
6. **Missing `getSupportLevel`**: Edge cases should be marked as `Incompatible`
264+
265+
---
266+
267+
## Output Format
268+
269+
Present your review as guidance for the reviewer. Structure your output as:
270+
271+
1. **PR Summary** - Brief description of what the PR does
272+
2. **CI Status** - Summary of CI check results
273+
3. **Findings** - Your analysis organized by area (Spark compatibility, implementation, tests, etc.)
274+
4. **Suggested Review Comments** - Specific comments the reviewer could leave on the PR, with file and line references where applicable
275+
276+
## Review Tone and Style
277+
278+
Write reviews that sound human and conversational. Avoid:
279+
280+
- Robotic or formulaic language
281+
- Em dashes. Use separate sentences instead.
282+
- Semicolons. Use separate sentences instead.
283+
284+
Instead:
285+
286+
- Write in flowing paragraphs using simple grammar
287+
- Keep sentences short and separate rather than joining them with punctuation
288+
- Be kind and constructive, even when raising concerns
289+
- Use backticks around any code references (function names, file paths, class names, types, config keys, etc.)
290+
- **Suggest** adding tests rather than stating tests are missing (e.g., "It might be worth adding a test for X" not "Tests are missing for X")
291+
- **Ask questions** about edge cases rather than asserting they aren't handled (e.g., "Does this handle the case where X is null?" not "This doesn't handle null")
292+
- Frame concerns as questions or suggestions when possible
293+
- Acknowledge what the PR does well before raising concerns
294+
295+
## Do Not Post Comments
296+
297+
**IMPORTANT: Never post comments or reviews on the PR directly.** This skill is for providing guidance to a human reviewer. Present all findings and suggested comments to the user. The user will decide what to post.

.github/actions/java-test/action.yaml

Lines changed: 3 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -59,7 +59,7 @@ runs:
5959
# temporarily work around https://github.com/actions/runner-images/issues/13341
6060
# by disabling caching for macOS
6161
if: ${{ runner.os != 'macOS' }}
62-
uses: actions/cache@v4
62+
uses: actions/cache@v5
6363
with:
6464
path: |
6565
~/.m2/repository
@@ -68,36 +68,6 @@ runs:
6868
restore-keys: |
6969
${{ runner.os }}-java-maven-
7070
71-
- name: Run Maven compile
72-
shell: bash
73-
run: |
74-
./mvnw -B package -DskipTests scalafix:scalafix -Dscalafix.mode=CHECK -Psemanticdb ${{ inputs.maven_opts }}
75-
76-
- name: Setup Node.js
77-
uses: actions/setup-node@v6
78-
with:
79-
node-version: '24'
80-
81-
- name: Install prettier
82-
shell: bash
83-
run: |
84-
npm install -g prettier
85-
86-
- name: Run prettier
87-
shell: bash
88-
run: |
89-
npx prettier "**/*.md" --write
90-
91-
- name: Mark workspace as safe for git
92-
shell: bash
93-
run: |
94-
git config --global --add safe.directory "$GITHUB_WORKSPACE"
95-
96-
- name: Check for any local git changes (such as generated docs)
97-
shell: bash
98-
run: |
99-
./dev/ci/check-working-tree-clean.sh
100-
10171
- name: Run all tests
10272
shell: bash
10373
if: ${{ inputs.suites == '' }}
@@ -106,7 +76,7 @@ runs:
10676
SPARK_LOCAL_HOSTNAME: "localhost"
10777
SPARK_LOCAL_IP: "127.0.0.1"
10878
run: |
109-
MAVEN_OPTS="-Xmx4G -Xms2G -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B -Prelease clean install ${{ inputs.maven_opts }}
79+
MAVEN_OPTS="-Xmx4G -Xms2G -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B -Prelease install ${{ inputs.maven_opts }}
11080
- name: Run specified tests
11181
shell: bash
11282
if: ${{ inputs.suites != '' }}
@@ -117,7 +87,7 @@ runs:
11787
run: |
11888
MAVEN_SUITES="$(echo "${{ inputs.suites }}" | paste -sd, -)"
11989
echo "Running with MAVEN_SUITES=$MAVEN_SUITES"
120-
MAVEN_OPTS="-Xmx4G -Xms2G -DwildcardSuites=$MAVEN_SUITES -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B -Prelease clean install ${{ inputs.maven_opts }}
90+
MAVEN_OPTS="-Xmx4G -Xms2G -DwildcardSuites=$MAVEN_SUITES -XX:+UnlockDiagnosticVMOptions -XX:+ShowMessageBoxOnError -XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=./hs_err_pid%p.log" SPARK_HOME=`pwd` ./mvnw -B -Prelease install ${{ inputs.maven_opts }}
12191
- name: Upload crash logs
12292
if: failure()
12393
uses: actions/upload-artifact@v6

0 commit comments

Comments
 (0)