diff --git a/.claude/skills/bug-triage/SKILL.md b/.claude/skills/bug-triage/SKILL.md
Lines changed: 191 additions & 0 deletions
diff --git a/.claude/skills/implement-comet-expression/SKILL.md b/.claude/skills/implement-comet-expression/SKILL.md
Lines changed: 97 additions & 0 deletions
diff --git a/.github/actions/setup-spark-builder/action.yaml b/.github/actions/setup-spark-builder/action.yaml
Lines changed: 20 additions & 0 deletions
diff --git a/.github/workflows/iceberg_spark_test.yml b/.github/workflows/iceberg_spark_test.yml
Lines changed: 15 additions & 3 deletions
diff --git a/.github/workflows/pr_build_linux.yml b/.github/workflows/pr_build_linux.yml
Lines changed: 10 additions & 3 deletions
@@ -0,0 +1,191 @@
+---
+name: bug-triage
+description: Triage open Comet issues marked `requires-triage` per the project bug triage guide. Applies the recommended priority and area labels, removes `requires-triage`, and files a dated summary issue listing what was done. A human reviews the summary issue and closes it when satisfied.
+---
+
+Run a bug triage pass for the `apache/datafusion-comet` repository.
+
+## Overview
+
+This skill triages every open issue carrying the `requires-triage` label. For
+each one it:
+
+1. Decides a priority and area labels using the project's triage guide.
+2. Applies those labels via `gh`.
+3. Removes the `requires-triage` label.
+4. Records the decision (with rationale) in a single dated summary issue.
+
+A human reviewer reads the summary issue, sanity-checks the calls, and closes
+it when satisfied. Any label correction is done by the reviewer directly on the
+affected issue.
+
+The triage criteria come from the project's own guide. Read it before doing any
+classification work; do not rely on memory.
+
+## Step 1: Read the Triage Guide
+
+Read the canonical guide in this repository:
+
+```
+docs/source/contributor-guide/bug_triage.md
+```
+
+Use the priority decision tree, escalation triggers, area labels, and
+prioritization principles from that guide. If the guide and this skill ever
+disagree, the guide wins. Do not paraphrase the guide; quote the labels and
+criteria verbatim when classifying.
+
+## Step 2: Gather Issues That Need Triage
+
+Fetch all open issues labeled `requires-triage`:
+
+```bash
+gh issue list \
+  --repo apache/datafusion-comet \
+  --label requires-triage \
+  --state open \
+  --limit 200 \
+  --json number,title,author,createdAt,labels,body,url
+```
+
+If the list is empty, stop and tell the user there is nothing to triage. Do not
+file an empty summary issue and do not modify any labels.
+
+## Step 3: Classify Each Issue
+
+For each issue, review the title and body and determine:
+
+1. **Priority label** (exactly one): apply the decision tree from the guide.
+   - `priority:critical` for correctness issues (silent wrong results, data
+     corruption) and security vulnerabilities
+   - `priority:high` for crashes, panics, segfaults, NPEs on supported paths
+   - `priority:medium` for functional bugs / performance regressions with
+     workarounds
+   - `priority:low` for test-only, CI flakes, tooling, cosmetic
+2. **Area labels** (zero or more): from the area table in the guide
+   (`area:writer`, `area:shuffle`, `area:aggregation`, `area:scan`,
+   `area:expressions`, `area:ffi`, `area:ci`) plus the pre-existing area
+   indicators (`native_datafusion`, `native_iceberg_compat`, `spark 4`,
+   `spark sql tests`).
+3. **Escalation note**: if the issue matches an escalation trigger from the
+   guide (e.g., a `priority:high` crash that may also produce wrong results),
+   note it in the summary.
+
+## Step 4: Skip Issues You Cannot Confidently Classify
+
+If an issue lacks reproduction steps or is otherwise too ambiguous to classify
+with confidence:
+
+- **Do not** apply a priority label.
+- **Do not** remove `requires-triage`.
+- **Do not** comment on the issue or ask the reporter for more info from this
+  skill (that is the human reviewer's call).
+- Record it in the summary under a "Skipped — needs more info" section so the
+  reviewer can follow up.
+
+Guessing is worse than skipping.
+
+## Step 5: Apply Labels
+
+For each issue you classified in Step 3, apply the labels and remove
+`requires-triage` in a single `gh` call:
+
+```bash
+gh issue edit <NUMBER> \
+  --repo apache/datafusion-comet \
+  --add-label "priority:high,area:expressions" \
+  --remove-label "requires-triage"
+```
+
+Notes:
+
+- Pass the labels as a single comma-separated string (no spaces around commas).
+- Quote labels that contain spaces (e.g., `"spark 4"`).
+- Only add labels that already exist in the repo. If a label from the guide is
+  missing in the repo, skip it for that issue and record a note in the summary
+  rather than creating new labels.
+- Do not comment on the issue.
+
+If `gh issue edit` fails for any issue, leave that issue's `requires-triage`
+label intact and record the failure in the summary under a "Failed to label"
+section.
+
+## Step 6: File the Summary Issue
+
+Compute today's date in `YYYY-MM-DD` form (use the system date, not memory):
+
+```bash
+TRIAGE_DATE=$(date -u +%Y-%m-%d)
+```
+
+Title: `Bug triage results: ${TRIAGE_DATE}`
+
+Body: a markdown report with these sections, in this order:
+
+1. **Header**
+   - Date, total issues processed, and counts per priority
+   - Link to `docs/source/contributor-guide/bug_triage.md`
+   - Note that labels have already been applied; the reviewer should spot-check
+     and close this issue when satisfied
+2. **Triaged** — one subsection per priority, ordered highest priority first
+   (`priority:critical`, then `priority:high`, then `priority:medium`, then
+   `priority:low`). Omit any subsection whose count is zero. Do **not** use a
+   markdown table anywhere in this section; use nested bullet lists only.
+
+   Within each subsection, one top-level bullet per issue:
+
+   ```
+   ### priority:critical
+
+   - <issue title> ([#1234](https://github.com/apache/datafusion-comet/issues/1234))
+     - Area labels: `area:expressions`, `area:scan`
+     - Rationale: one sentence tying the call to the guide
+   ```
+
+   The issue number (not the title) is the link target. The title is plain
+   text. If there are no area labels, write `Area labels: none`.
+
+3. **Escalations to consider** (omit section if empty) — bullet per issue with
+   the same `<title> ([#N](url))` form, plus a sub-bullet explaining the
+   trigger from the guide.
+4. **Skipped — needs more info** (omit if empty) — bullet per issue with the
+   same `<title> ([#N](url))` form, plus a sub-bullet explaining what is
+   missing.
+5. **Failed to label** (omit if empty) — bullet per issue with the same
+   `<title> ([#N](url))` form, plus a sub-bullet quoting the `gh` error.
+
+File the issue with `gh`. Use a temp file for the body to keep quoting sane:
+
+```bash
+gh issue create \
+  --repo apache/datafusion-comet \
+  --title "Bug triage results: ${TRIAGE_DATE}" \
+  --body-file /tmp/triage-summary-${TRIAGE_DATE}.md
+```
+
+Do not add labels to the summary issue itself.
+
+After creating the issue, print its URL.
+
+## Output to the User
+
+Report back:
+
+1. Number of `requires-triage` issues processed
+2. Counts per priority that were applied
+3. Number skipped (needs more info) and number failed
+4. URL of the new summary issue
+
+Do not paste the full per-issue listing back into the chat; it is in the
+summary issue.
+
+## What This Skill Must Not Do
+
+- Do not invent priority or area labels that are not in the guide
+- Do not create new labels in the repo
+- Do not comment on the triaged issues
+- Do not close any triaged issue
+- Do not file the summary issue if there were zero `requires-triage` issues
+- Do not re-label issues that were skipped or failed (leave `requires-triage`
+  in place so they show up in the next pass)
+- Do not include AI/Claude attribution in the summary issue
@@ -0,0 +1,97 @@
+---
+name: implement-comet-expression
+description: Use when implementing a new Spark expression in DataFusion Comet. Walks through cloning latest Spark master to study the canonical implementation, checking the upstream datafusion-spark crate before writing native code, building the Comet serde and Rust wire-up from the contributor guide, then running audit-comet-expression to drive a test-coverage iteration loop.
+argument-hint: <expression-name>
+---
+
+Implement Comet support for the `$ARGUMENTS` Spark expression.
+
+## Background reading
+
+The contributor guide is the canonical reference. Read these before writing code:
+
+- `docs/source/contributor-guide/adding_a_new_expression.md` covers the Scala serde, protobuf, Rust scalar function flow, support levels, shims, and tests.
+- `docs/source/contributor-guide/sql-file-tests.md` describes the Comet SQL Tests format.
+- `docs/source/contributor-guide/spark_expressions_support.md` lists the coverage status for every expression.
+
+## Workflow
+
+### 1. Study the Spark master implementation first
+
+Always start from the latest Spark `master`. Shallow clone if not already present:
+
+```bash
+if [ ! -d /tmp/spark-master ]; then
+  git clone --depth 1 https://github.com/apache/spark.git /tmp/spark-master
+fi
+```
+
+Find the expression class and tests:
+
+```bash
+find /tmp/spark-master/sql -name "*.scala" | \
+  xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null
+
+find /tmp/spark-master/sql -name "*.scala" -path "*/test/*" | \
+  xargs grep -l "$ARGUMENTS" 2>/dev/null
+```
+
+Read the source. Note `inputTypes`, `dataType`, `eval` / `nullSafeEval`, ANSI mode branches, and any `require` guards. These define the contract Comet must match.
+
+### 2. Check for an upstream `datafusion-spark` implementation
+
+Before writing a Comet-specific native function, check whether the expression is already available in the upstream `datafusion-spark` crate. It is a Spark-compatible function library maintained alongside DataFusion, so its semantics are usually a closer match to Spark than a generic `datafusion-functions` built-in.
+
+```bash
+grep -rn "fn name\|SparkFunctionName" ~/.cargo/registry/src/*/datafusion-spark-*/src/function/ 2>/dev/null | grep -i "$ARGUMENTS"
+```
+
+Functions are organized as `datafusion_spark::function::<category>::<name>::Spark<Name>`. Existing wire-ups can be found in `native/core/src/execution/planner.rs` (e.g. `SparkDateAdd`, `SparkDateSub`, `SparkCollectSet`).
+
+When the upstream implementation matches Spark's semantics, prefer it: register the `ScalarUDF` from `datafusion-spark` rather than re-implementing. This keeps the maintenance burden upstream. If the upstream version is missing, incomplete, or diverges from Spark, fall through to step 3 and write the function locally.
+
+### 3. Implement the initial version
+
+Follow `adding_a_new_expression.md`:
+
+1. Add a `CometExpressionSerde[T]` in the appropriate file under `spark/src/main/scala/org/apache/comet/serde/`.
+2. Register it in the matching map in `QueryPlanSerde.scala`.
+3. If the function name collides with a DataFusion built-in that has a different signature, use `scalarFunctionExprToProtoWithReturnType` (see "When to set the return type explicitly").
+4. For a new scalar function, add a match case in `native/spark-expr/src/comet_scalar_funcs.rs::create_comet_physical_fun`. If step 2 found an upstream implementation, wire that in. Otherwise implement the function under `native/spark-expr/src/`.
+5. Add at least one Comet SQL Test at `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql` exercising column references, literals, and `NULL`.
+
+Build and smoke-test:
+
+```bash
+make
+./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
+```
+
+### 4. Run the audit skill
+
+Once the initial implementation passes its smoke test, run the `audit-comet-expression` skill on `$ARGUMENTS`. It compares the implementation and tests against Spark 3.4.3, 3.5.8, and 4.0.1 and produces a prioritized list of gaps.
+
+### 5. Implement audit-recommended tests and iterate
+
+Add the missing test cases the audit recommends, then re-run the targeted suite:
+
+```bash
+./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
+```
+
+Surface findings to the user and ask whether the coverage is sufficient. Continue iterating (adding tests, fixing bugs, refining `getSupportLevel` / `getIncompatibleReasons` / `getUnsupportedReasons`) until the user confirms they are happy.
+
+### 6. Final checks
+
+Before opening a PR:
+
+```bash
+make format
+cd native && cargo clippy --all-targets --workspace -- -D warnings
+```
+
+### 7. Open the PR
+
+Use the repo's PR template at `.github/pull_request_template.md` and fill in every section: "Which issue does this PR close?", "Rationale for this change", "What changes are included in this PR?", and "How are these changes tested?". Do not add a separate test plan section.
+
+In the "What changes are included in this PR?" section, add a brief note that the `implement-comet-expression` skill was used to scaffold the implementation, so reviewers know which workflow produced the change.
@@ -67,3 +67,23 @@ runs:
       run: |
         # Native library should already be in native/target/release/
         ./mvnw install -Prelease -DskipTests -Pspark-${{inputs.spark-short-version}}
+
+    - name: Purge partial Maven cache entries
+      shell: bash
+      run: |
+        # Comet's Maven phase resolves the dependency graph and downloads POMs
+        # for transitive artifacts whose JARs it never actually needs. When sbt
+        # then resolves Spark's deps, Coursier sees the POM in mavenLocal,
+        # declares the artifact "found locally", and fails on the missing JAR
+        # without falling back to Maven Central. Delete those partial entries
+        # so sbt re-fetches the full artifact remotely.
+        for repo in "$HOME/.m2/repository" /root/.m2/repository; do
+          [ -d "$repo" ] || continue
+          find "$repo" -name '*.pom' | while read -r pom; do
+            jar="${pom%.pom}.jar"
+            [ -f "$jar" ] && continue
+            grep -q '<packaging>jar</packaging>\|<packaging>bundle</packaging>' "$pom" 2>/dev/null || continue
+            rm -f "$pom" "${pom}.sha1" "${pom%.pom}.pom.lastUpdated" \
+              "$(dirname "$pom")/_remote.repositories"
+          done
+        done
@@ -120,10 +120,14 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-24.04]
-        java-version: [11, 17]
         iceberg-version: [{short: '1.8', full: '1.8.1'}, {short: '1.9', full: '1.9.1'}, {short: '1.10', full: '1.10.0'}]
         spark-version: [{short: '3.4', full: '3.4.3'}, {short: '3.5', full: '3.5.8'}]
         scala-version: ['2.13']
+        include:
+          - spark-version: {short: '3.4', full: '3.4.3'}
+            java-version: 11
+          - spark-version: {short: '3.5', full: '3.5.8'}
+            java-version: 17
       fail-fast: false
     name: iceberg-spark/${{ matrix.os }}/iceberg-${{ matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{ matrix.scala-version }}/java-${{ matrix.java-version }}
     runs-on: ${{ matrix.os }}
@@ -163,10 +167,14 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-24.04]
-        java-version: [11, 17]
         iceberg-version: [{short: '1.8', full: '1.8.1'}, {short: '1.9', full: '1.9.1'}, {short: '1.10', full: '1.10.0'}]
         spark-version: [{short: '3.4', full: '3.4.3'}, {short: '3.5', full: '3.5.8'}]
         scala-version: ['2.13']
+        include:
+          - spark-version: {short: '3.4', full: '3.4.3'}
+            java-version: 11
+          - spark-version: {short: '3.5', full: '3.5.8'}
+            java-version: 17
       fail-fast: false
     name: iceberg-spark-extensions/${{ matrix.os }}/iceberg-${{ matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{ matrix.scala-version }}/java-${{ matrix.java-version }}
     runs-on: ${{ matrix.os }}
@@ -206,10 +214,14 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-24.04]
-        java-version: [11, 17]
         iceberg-version: [{short: '1.8', full: '1.8.1'}, {short: '1.9', full: '1.9.1'}, {short: '1.10', full: '1.10.0'}]
         spark-version: [{short: '3.4', full: '3.4.3'}, {short: '3.5', full: '3.5.8'}]
         scala-version: ['2.13']
+        include:
+          - spark-version: {short: '3.4', full: '3.4.3'}
+            java-version: 11
+          - spark-version: {short: '3.5', full: '3.5.8'}
+            java-version: 17
       fail-fast: false
     name: iceberg-spark-runtime/${{ matrix.os }}/iceberg-${{ matrix.iceberg-version.full }}/spark-${{ matrix.spark-version.full }}/scala-${{ matrix.scala-version }}/java-${{ matrix.java-version }}
     runs-on: ${{ matrix.os }}
 
@@ -97,9 +97,10 @@ jobs:
           - name: "Spark 4.0, JDK 21"
             java_version: "21"
             maven_opts: "-Pspark-4.0"
-          # Spark 4.1 is intentionally absent: the lint job invokes -Psemanticdb,
-          # but semanticdb-scalac_2.13.17 is not yet published, so we cannot
-          # currently run scalafix against the spark-4.1 profile.
+          # Spark 4.1 and 4.2 are intentionally absent: the lint job invokes -Psemanticdb,
+          # but semanticdb-scalac for those Scala patch versions (2.13.17 / 2.13.18) is not
+          # yet published, so we cannot currently run scalafix against the spark-4.1 or
+          # spark-4.2 profiles.
       fail-fast: false
     steps:
       - uses: runs-on/action@742bf56072eb4845a0f94b3394673e4903c90ff0  # v2.1.0
@@ -305,6 +306,11 @@ jobs:
             java_version: "17"
             maven_opts: "-Pspark-4.1"
             scan_impl: "auto"
+
+          - name: "Spark 4.2, JDK 17"
+            java_version: "17"
+            maven_opts: "-Pspark-4.2"
+            scan_impl: "auto"
         suite:
           - name: "fuzz"
             value: |
@@ -364,6 +370,7 @@ jobs:
               org.apache.spark.sql.comet.CometTaskMetricsSuite
               org.apache.spark.sql.comet.CometDppFallbackRepro3949Suite
               org.apache.spark.sql.comet.CometShuffleFallbackStickinessSuite
+              org.apache.spark.sql.comet.CometDecimalArithmeticViewSuite
               org.apache.comet.objectstore.NativeConfigSuite
           - name: "expressions"
             value: |