Skip to content

Commit 8204171

Browse files
Merge remote-tracking branch 'upstream/main' into fix-flaot-round
2 parents f88a04c + e1ff769 commit 8204171

851 files changed

Lines changed: 28086 additions & 56399 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.asf.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@ github:
4141
features:
4242
issues: true
4343
discussions: true
44+
projects: true
4445
protected_branches:
4546
main:
4647
required_pull_request_reviews:

.claude/skills/audit-comet-expression/SKILL.md

Lines changed: 21 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@ This audit covers:
1313
1. Spark implementation across versions 3.4.3, 3.5.8, and 4.0.1
1414
2. Comet Scala serde implementation
1515
3. Comet Rust / DataFusion implementation
16-
4. Existing test coverage (SQL file tests and Scala tests)
16+
4. Existing test coverage (Comet SQL Tests and Comet Scala Tests)
1717
5. Gap analysis and test recommendations
1818

1919
---
@@ -122,6 +122,16 @@ Read the serde implementation and check:
122122
- Whether `getSupportLevel` is implemented and accurate
123123
- Whether all input types are handled
124124
- Whether any types are explicitly marked `Unsupported`
125+
- Whether `getIncompatibleReasons()` and `getUnsupportedReasons()` are overridden.
126+
`getSupportLevel` controls runtime fallback, but `GenerateDocs` reads these two
127+
methods to build the Compatibility Guide. If `getSupportLevel` returns
128+
`Incompatible(Some(...))` or `Unsupported(Some(...))` but the corresponding
129+
`get*Reasons()` method is not overridden, the reason will not appear in the
130+
published docs. Expect both to return a `Seq[String]` containing the same
131+
reason text used in `getSupportLevel`. Follow the pattern in
132+
`spark/src/main/scala/org/apache/comet/serde/structs.scala::CometStructsToJson`
133+
or `spark/src/main/scala/org/apache/comet/serde/datetime.scala::CometHour`:
134+
extract the reason as a `private val` and reference it from both places.
125135

126136
### Shims
127137

@@ -161,7 +171,7 @@ Read the Rust implementation and check:
161171

162172
## Step 4: Locate Existing Comet Tests
163173

164-
### SQL file tests
174+
### Comet SQL Tests
165175

166176
```bash
167177
# Find SQL test files for this expression
@@ -179,13 +189,13 @@ Read every SQL test file found and list:
179189
- Query modes used (`query`, `spark_answer_only`, `tolerance`, `ignore`, `expect_error`)
180190
- Any ConfigMatrix directives
181191

182-
### Scala tests
192+
### Comet Scala Tests
183193

184194
```bash
185195
grep -r "$ARGUMENTS" spark/src/test/scala/ --include="*.scala" -l
186196
```
187197

188-
Read the relevant Scala test files and list:
198+
Read the relevant Comet Scala Tests and list:
189199

190200
- Input types covered
191201
- Edge cases exercised
@@ -201,7 +211,7 @@ Compare the Spark test coverage (Step 2) against the Comet test coverage (Step 4
201211

202212
For each of the following dimensions, note whether it is covered in Comet tests or missing:
203213

204-
| Dimension | Spark tests it | Comet SQL test | Comet Scala test | Gap? |
214+
| Dimension | Spark tests it | Comet SQL Test | Comet Scala Test | Gap? |
205215
| ------------------------------------------------------------------------------------------------------ | -------------- | -------------- | ---------------- | ---- |
206216
| Column reference argument(s) | | | | |
207217
| Literal argument(s) | | | | |
@@ -227,6 +237,7 @@ Also review the Comet implementation (Step 3) against the Spark behavior (Step 1
227237
- Are there behavioral differences that are NOT marked `Incompatible` but should be?
228238
- Are there behavioral differences between Spark versions that the Comet implementation does not account for (missing shim)?
229239
- Does the Rust implementation match the Spark behavior for all edge cases?
240+
- If `getSupportLevel` returns `Incompatible` or `Unsupported` with a reason, are `getIncompatibleReasons()` / `getUnsupportedReasons()` also overridden so the reason is picked up by `GenerateDocs` and appears in the Compatibility Guide?
230241

231242
---
232243

@@ -256,13 +267,13 @@ After presenting the gap analysis, ask the user:
256267
>
257268
> - [list each missing test case]
258269
>
259-
> I can add them as SQL file tests in `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql`
260-
> (or as Scala tests in `CometExpressionSuite` for cases that require programmatic setup).
270+
> I can add them as Comet SQL Tests in `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql`
271+
> (or as Comet Scala Tests in `CometExpressionSuite` for cases that require programmatic setup).
261272
262-
If the user says yes, implement the missing tests following the SQL file test format described in
263-
`docs/source/contributor-guide/sql-file-tests.md`. Prefer SQL file tests over Scala tests.
273+
If the user says yes, implement the missing tests following the Comet SQL Tests format described in
274+
`docs/source/contributor-guide/sql-file-tests.md`. Prefer Comet SQL Tests over Comet Scala Tests.
264275

265-
### SQL file test template
276+
### Comet SQL Tests template
266277

267278
```sql
268279
-- Licensed to the Apache Software Foundation (ASF) under one

.claude/skills/bug-triage/SKILL.md

Lines changed: 191 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,191 @@
1+
---
2+
name: bug-triage
3+
description: Triage open Comet issues marked `requires-triage` per the project bug triage guide. Applies the recommended priority and area labels, removes `requires-triage`, and files a dated summary issue listing what was done. A human reviews the summary issue and closes it when satisfied.
4+
---
5+
6+
Run a bug triage pass for the `apache/datafusion-comet` repository.
7+
8+
## Overview
9+
10+
This skill triages every open issue carrying the `requires-triage` label. For
11+
each one it:
12+
13+
1. Decides a priority and area labels using the project's triage guide.
14+
2. Applies those labels via `gh`.
15+
3. Removes the `requires-triage` label.
16+
4. Records the decision (with rationale) in a single dated summary issue.
17+
18+
A human reviewer reads the summary issue, sanity-checks the calls, and closes
19+
it when satisfied. Any label correction is done by the reviewer directly on the
20+
affected issue.
21+
22+
The triage criteria come from the project's own guide. Read it before doing any
23+
classification work; do not rely on memory.
24+
25+
## Step 1: Read the Triage Guide
26+
27+
Read the canonical guide in this repository:
28+
29+
```
30+
docs/source/contributor-guide/bug_triage.md
31+
```
32+
33+
Use the priority decision tree, escalation triggers, area labels, and
34+
prioritization principles from that guide. If the guide and this skill ever
35+
disagree, the guide wins. Do not paraphrase the guide; quote the labels and
36+
criteria verbatim when classifying.
37+
38+
## Step 2: Gather Issues That Need Triage
39+
40+
Fetch all open issues labeled `requires-triage`:
41+
42+
```bash
43+
gh issue list \
44+
--repo apache/datafusion-comet \
45+
--label requires-triage \
46+
--state open \
47+
--limit 200 \
48+
--json number,title,author,createdAt,labels,body,url
49+
```
50+
51+
If the list is empty, stop and tell the user there is nothing to triage. Do not
52+
file an empty summary issue and do not modify any labels.
53+
54+
## Step 3: Classify Each Issue
55+
56+
For each issue, review the title and body and determine:
57+
58+
1. **Priority label** (exactly one): apply the decision tree from the guide.
59+
- `priority:critical` for correctness issues (silent wrong results, data
60+
corruption) and security vulnerabilities
61+
- `priority:high` for crashes, panics, segfaults, NPEs on supported paths
62+
- `priority:medium` for functional bugs / performance regressions with
63+
workarounds
64+
- `priority:low` for test-only, CI flakes, tooling, cosmetic
65+
2. **Area labels** (zero or more): from the area table in the guide
66+
(`area:writer`, `area:shuffle`, `area:aggregation`, `area:scan`,
67+
`area:expressions`, `area:ffi`, `area:ci`) plus the pre-existing area
68+
indicators (`native_datafusion`, `native_iceberg_compat`, `spark 4`,
69+
`spark sql tests`).
70+
3. **Escalation note**: if the issue matches an escalation trigger from the
71+
guide (e.g., a `priority:high` crash that may also produce wrong results),
72+
note it in the summary.
73+
74+
## Step 4: Skip Issues You Cannot Confidently Classify
75+
76+
If an issue lacks reproduction steps or is otherwise too ambiguous to classify
77+
with confidence:
78+
79+
- **Do not** apply a priority label.
80+
- **Do not** remove `requires-triage`.
81+
- **Do not** comment on the issue or ask the reporter for more info from this
82+
skill (that is the human reviewer's call).
83+
- Record it in the summary under a "Skipped — needs more info" section so the
84+
reviewer can follow up.
85+
86+
Guessing is worse than skipping.
87+
88+
## Step 5: Apply Labels
89+
90+
For each issue you classified in Step 3, apply the labels and remove
91+
`requires-triage` in a single `gh` call:
92+
93+
```bash
94+
gh issue edit <NUMBER> \
95+
--repo apache/datafusion-comet \
96+
--add-label "priority:high,area:expressions" \
97+
--remove-label "requires-triage"
98+
```
99+
100+
Notes:
101+
102+
- Pass the labels as a single comma-separated string (no spaces around commas).
103+
- Quote labels that contain spaces (e.g., `"spark 4"`).
104+
- Only add labels that already exist in the repo. If a label from the guide is
105+
missing in the repo, skip it for that issue and record a note in the summary
106+
rather than creating new labels.
107+
- Do not comment on the issue.
108+
109+
If `gh issue edit` fails for any issue, leave that issue's `requires-triage`
110+
label intact and record the failure in the summary under a "Failed to label"
111+
section.
112+
113+
## Step 6: File the Summary Issue
114+
115+
Compute today's date in `YYYY-MM-DD` form (use the system date, not memory):
116+
117+
```bash
118+
TRIAGE_DATE=$(date -u +%Y-%m-%d)
119+
```
120+
121+
Title: `Bug triage results: ${TRIAGE_DATE}`
122+
123+
Body: a markdown report with these sections, in this order:
124+
125+
1. **Header**
126+
- Date, total issues processed, and counts per priority
127+
- Link to `docs/source/contributor-guide/bug_triage.md`
128+
- Note that labels have already been applied; the reviewer should spot-check
129+
and close this issue when satisfied
130+
2. **Triaged** — one subsection per priority, ordered highest priority first
131+
(`priority:critical`, then `priority:high`, then `priority:medium`, then
132+
`priority:low`). Omit any subsection whose count is zero. Do **not** use a
133+
markdown table anywhere in this section; use nested bullet lists only.
134+
135+
Within each subsection, one top-level bullet per issue:
136+
137+
```
138+
### priority:critical
139+
140+
- <issue title> ([#1234](https://github.com/apache/datafusion-comet/issues/1234))
141+
- Area labels: `area:expressions`, `area:scan`
142+
- Rationale: one sentence tying the call to the guide
143+
```
144+
145+
The issue number (not the title) is the link target. The title is plain
146+
text. If there are no area labels, write `Area labels: none`.
147+
148+
3. **Escalations to consider** (omit section if empty) — bullet per issue with
149+
the same `<title> ([#N](url))` form, plus a sub-bullet explaining the
150+
trigger from the guide.
151+
4. **Skipped — needs more info** (omit if empty) — bullet per issue with the
152+
same `<title> ([#N](url))` form, plus a sub-bullet explaining what is
153+
missing.
154+
5. **Failed to label** (omit if empty) — bullet per issue with the same
155+
`<title> ([#N](url))` form, plus a sub-bullet quoting the `gh` error.
156+
157+
File the issue with `gh`. Use a temp file for the body to keep quoting sane:
158+
159+
```bash
160+
gh issue create \
161+
--repo apache/datafusion-comet \
162+
--title "Bug triage results: ${TRIAGE_DATE}" \
163+
--body-file /tmp/triage-summary-${TRIAGE_DATE}.md
164+
```
165+
166+
Do not add labels to the summary issue itself.
167+
168+
After creating the issue, print its URL.
169+
170+
## Output to the User
171+
172+
Report back:
173+
174+
1. Number of `requires-triage` issues processed
175+
2. Counts per priority that were applied
176+
3. Number skipped (needs more info) and number failed
177+
4. URL of the new summary issue
178+
179+
Do not paste the full per-issue listing back into the chat; it is in the
180+
summary issue.
181+
182+
## What This Skill Must Not Do
183+
184+
- Do not invent priority or area labels that are not in the guide
185+
- Do not create new labels in the repo
186+
- Do not comment on the triaged issues
187+
- Do not close any triaged issue
188+
- Do not file the summary issue if there were zero `requires-triage` issues
189+
- Do not re-label issues that were skipped or failed (leave `requires-triage`
190+
in place so they show up in the next pass)
191+
- Do not include AI/Claude attribution in the summary issue
Lines changed: 97 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,97 @@
1+
---
2+
name: implement-comet-expression
3+
description: Use when implementing a new Spark expression in DataFusion Comet. Walks through cloning latest Spark master to study the canonical implementation, checking the upstream datafusion-spark crate before writing native code, building the Comet serde and Rust wire-up from the contributor guide, then running audit-comet-expression to drive a test-coverage iteration loop.
4+
argument-hint: <expression-name>
5+
---
6+
7+
Implement Comet support for the `$ARGUMENTS` Spark expression.
8+
9+
## Background reading
10+
11+
The contributor guide is the canonical reference. Read these before writing code:
12+
13+
- `docs/source/contributor-guide/adding_a_new_expression.md` covers the Scala serde, protobuf, Rust scalar function flow, support levels, shims, and tests.
14+
- `docs/source/contributor-guide/sql-file-tests.md` describes the Comet SQL Tests format.
15+
- `docs/source/contributor-guide/spark_expressions_support.md` lists the coverage status for every expression.
16+
17+
## Workflow
18+
19+
### 1. Study the Spark master implementation first
20+
21+
Always start from the latest Spark `master`. Shallow clone if not already present:
22+
23+
```bash
24+
if [ ! -d /tmp/spark-master ]; then
25+
git clone --depth 1 https://github.com/apache/spark.git /tmp/spark-master
26+
fi
27+
```
28+
29+
Find the expression class and tests:
30+
31+
```bash
32+
find /tmp/spark-master/sql -name "*.scala" | \
33+
xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null
34+
35+
find /tmp/spark-master/sql -name "*.scala" -path "*/test/*" | \
36+
xargs grep -l "$ARGUMENTS" 2>/dev/null
37+
```
38+
39+
Read the source. Note `inputTypes`, `dataType`, `eval` / `nullSafeEval`, ANSI mode branches, and any `require` guards. These define the contract Comet must match.
40+
41+
### 2. Check for an upstream `datafusion-spark` implementation
42+
43+
Before writing a Comet-specific native function, check whether the expression is already available in the upstream `datafusion-spark` crate. It is a Spark-compatible function library maintained alongside DataFusion, so its semantics are usually a closer match to Spark than a generic `datafusion-functions` built-in.
44+
45+
```bash
46+
grep -rn "fn name\|SparkFunctionName" ~/.cargo/registry/src/*/datafusion-spark-*/src/function/ 2>/dev/null | grep -i "$ARGUMENTS"
47+
```
48+
49+
Functions are organized as `datafusion_spark::function::<category>::<name>::Spark<Name>`. Existing wire-ups can be found in `native/core/src/execution/planner.rs` (e.g. `SparkDateAdd`, `SparkDateSub`, `SparkCollectSet`).
50+
51+
When the upstream implementation matches Spark's semantics, prefer it: register the `ScalarUDF` from `datafusion-spark` rather than re-implementing. This keeps the maintenance burden upstream. If the upstream version is missing, incomplete, or diverges from Spark, fall through to step 3 and write the function locally.
52+
53+
### 3. Implement the initial version
54+
55+
Follow `adding_a_new_expression.md`:
56+
57+
1. Add a `CometExpressionSerde[T]` in the appropriate file under `spark/src/main/scala/org/apache/comet/serde/`.
58+
2. Register it in the matching map in `QueryPlanSerde.scala`.
59+
3. If the function name collides with a DataFusion built-in that has a different signature, use `scalarFunctionExprToProtoWithReturnType` (see "When to set the return type explicitly").
60+
4. For a new scalar function, add a match case in `native/spark-expr/src/comet_scalar_funcs.rs::create_comet_physical_fun`. If step 2 found an upstream implementation, wire that in. Otherwise implement the function under `native/spark-expr/src/`.
61+
5. Add at least one Comet SQL Test at `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql` exercising column references, literals, and `NULL`.
62+
63+
Build and smoke-test:
64+
65+
```bash
66+
make
67+
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
68+
```
69+
70+
### 4. Run the audit skill
71+
72+
Once the initial implementation passes its smoke test, run the `audit-comet-expression` skill on `$ARGUMENTS`. It compares the implementation and tests against Spark 3.4.3, 3.5.8, and 4.0.1 and produces a prioritized list of gaps.
73+
74+
### 5. Implement audit-recommended tests and iterate
75+
76+
Add the missing test cases the audit recommends, then re-run the targeted suite:
77+
78+
```bash
79+
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
80+
```
81+
82+
Surface findings to the user and ask whether the coverage is sufficient. Continue iterating (adding tests, fixing bugs, refining `getSupportLevel` / `getIncompatibleReasons` / `getUnsupportedReasons`) until the user confirms they are happy.
83+
84+
### 6. Final checks
85+
86+
Before opening a PR:
87+
88+
```bash
89+
make format
90+
cd native && cargo clippy --all-targets --workspace -- -D warnings
91+
```
92+
93+
### 7. Open the PR
94+
95+
Use the repo's PR template at `.github/pull_request_template.md` and fill in every section: "Which issue does this PR close?", "Rationale for this change", "What changes are included in this PR?", and "How are these changes tested?". Do not add a separate test plan section.
96+
97+
In the "What changes are included in this PR?" section, add a brief note that the `implement-comet-expression` skill was used to scaffold the implementation, so reviewers know which workflow produced the change.

0 commit comments

Comments
 (0)