Skip to content

Commit d275007

Browse files
authored
docs: add contributor guide for bringing up a new Spark version (#4133)
1 parent e1ff769 commit d275007

3 files changed

Lines changed: 362 additions & 0 deletions

File tree

Lines changed: 357 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,357 @@
1+
<!---
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
# Adding Support for a New Spark Version
21+
22+
This guide describes how to bring up support for a new Apache Spark release in
23+
Comet. Past examples include the work to add Spark 4.0, Spark 4.1, and the
24+
Spark 4.2 preview profile. The goal is a repeatable recipe that keeps each
25+
pull request small, reviewable, and easy to revert if a problem is discovered
26+
later.
27+
28+
## Why Stage the Work
29+
30+
Adding a new Spark version touches the build, the shim layer, CI, and three
31+
different test suites (Comet's own JVM tests, Spark's SQL tests, and the
32+
plan stability golden files). Bundling everything into one pull request
33+
produces a diff that is hard to review and almost impossible to bisect. A
34+
staged approach instead introduces one capability at a time, with CI proving
35+
each stage green before the next one lands.
36+
37+
A typical bring-up uses several focused PRs:
38+
39+
1. **Stage 1**: Maven profile, shims, and a compile-only CI job.
40+
2. **Stage 2**: Enable Comet's own JVM test suite under the new profile.
41+
3. **Stage 3**: Enable Spark's SQL tests under the new profile, skipping
42+
the failing ones with linked issues.
43+
4. **Stage 4**: Add the version to the experimental tier in the user
44+
guide.
45+
5. **Stage 5**: Follow-up PRs that fix one skipped test (or one related
46+
cluster of tests) per PR, removing the skip as part of the fix.
47+
6. **Stage 6**: Eventual promotion PR that moves the version from
48+
experimental to supported in the user guide.
49+
50+
The sections below describe each stage in detail.
51+
52+
## Stage 1: Maven Profile, Shims, and Compile-Only CI
53+
54+
The first PR should produce a configuration where `./mvnw -Pspark-X.Y compile`
55+
succeeds, but no tests are required to pass yet. Keeping this PR
56+
compilation-only avoids mixing build issues with test failures.
57+
58+
### Add the Maven Profile
59+
60+
Add a new `<profile>` block to the top-level `pom.xml`. Copy the most recent
61+
existing profile (for example `spark-4.1`) and update the version
62+
properties:
63+
64+
- `spark.version`: the full upstream version, including any qualifier
65+
(for example `4.2.0-preview4`).
66+
- `spark.version.short`: the major.minor (for example `4.2`).
67+
- `parquet.version`, `slf4j.version`, `scala.version`,
68+
`scala.binary.version`, `java.version`: align with what the new Spark
69+
release actually publishes. Use the exact Scala patch version Spark
70+
publishes, not a looser pin; a mismatch causes `NoSuchMethodError` at
71+
runtime.
72+
- `shims.majorVerSrc` and `shims.minorVerSrc`: the directory names the
73+
build helper plugin will add to the source path. By convention the
74+
major-version directory groups shims that are identical across the family
75+
(for example `spark-4.x`), and the minor-version directory holds
76+
per-release overrides (for example `spark-4.2`).
77+
78+
### Lay Out the Shim Directories
79+
80+
The build helper plugin in `spark/pom.xml` adds `src/main/${shims.majorVerSrc}`
81+
and `src/main/${shims.minorVerSrc}` to the compile source roots. Files in the
82+
minor directory shadow files in the major directory, so the typical pattern
83+
is:
84+
85+
- `spark/src/main/spark-4.x/org/apache/comet/shims/` for shims that work for
86+
the whole 4.x family.
87+
- `spark/src/main/spark-X.Y/org/apache/comet/shims/` for files that have to
88+
diverge for one specific release.
89+
90+
When Spark X.Y is brand new and you do not yet know which shims will need to
91+
diverge, start by setting `shims.majorVerSrc` to an existing major directory
92+
(for example `spark-4.x`) and `shims.minorVerSrc` to a new empty
93+
`spark-X.Y` directory. Compile the project; the compiler will tell you which
94+
shims need a per-version override. Add only those, and leave the rest in the
95+
shared major directory. Commit `13e5f8cf5` ("refactor: consolidate identical
96+
spark-4.0 and spark-4.1 shims into spark-4.x") shows the cleanup that
97+
follows when shims that previously diverged turn out to be identical.
98+
99+
The same layering applies to `spark/src/test/spark-X.Y/` and
100+
`common/src/main/spark-X.Y/`.
101+
102+
### Add a Version-Detection Helper
103+
104+
`CometSparkSessionExtensions.scala` exposes helpers like `isSpark40Plus` and
105+
`isSpark41Plus` that the rest of the codebase uses to gate version-specific
106+
logic and to skip tests. Add the matching helper for the new version
107+
(`isSpark42Plus`, etc.) in this PR so that later stages can use it.
108+
109+
### Add a Compile-Only CI Job
110+
111+
Edit `.github/workflows/pr_build_linux.yml` and `pr_build_macos.yml` to add
112+
the new Spark version to the `build-spark` (or equivalent compile-only) job
113+
matrix. Do not add it to the heavier test matrices yet. A compile-only job
114+
keeps the CI cost of stage 1 small and prevents test failures on the new
115+
version from blocking unrelated PRs.
116+
117+
When CI capacity is constrained (the macOS runners in particular), it is
118+
acceptable to drop an older minor version from the macOS PR matrix while a
119+
preview version is being stabilized. PR #4104 ("ci: reduce macOS PR matrix
120+
to single Spark 4.0 profile") is a precedent for this kind of trim.
121+
122+
### What to Avoid in Stage 1
123+
124+
- Do not enable any test job for the new version yet.
125+
- Do not regenerate golden files yet. Plan stability output is sensitive to
126+
shim correctness, and regenerating before the shims are stable produces
127+
noisy diffs that get overwritten in stage 2.
128+
- Do not modify `dev/generate-versions.py` or other release-doc scripts in
129+
this PR. Those are owned by the release process and have their own update
130+
cadence.
131+
132+
## Stage 2: Enable Comet's JVM Tests
133+
134+
Once the new profile compiles cleanly in CI, the next PR turns on Comet's
135+
own test suites under the new profile.
136+
137+
### Add the New Profile to the Test Matrix
138+
139+
Promote the new Spark version from the compile-only job to the main test
140+
jobs in `.github/workflows/pr_build_linux.yml` (and `pr_build_macos.yml` if
141+
capacity allows). Use `scan_impl: "auto"` so both `native_datafusion` and
142+
`native_iceberg_compat` get exercised, matching how earlier versions are
143+
configured.
144+
145+
### Run the Suite Locally First
146+
147+
Run the JVM test suite locally against the new profile before pushing, since
148+
CI iterations are slow:
149+
150+
```sh
151+
make
152+
./mvnw -Pspark-X.Y test
153+
```
154+
155+
Expect failures. Triage them into three buckets:
156+
157+
1. **Real shim gaps**: a Spark API changed and the shim still calls the old
158+
signature. Fix these in this PR. Per-version overrides go under
159+
`spark/src/main/spark-X.Y/`; if the change applies to the whole family,
160+
put the fix in the shared major-version directory.
161+
2. **Behavioral differences that need a code change**: for example a new
162+
Spark error class, a renamed config, or a new `OneRowRelation` planning
163+
path. Fix the small ones in this PR. Larger ones should be split into
164+
their own PRs and the affected tests skipped.
165+
3. **Things that are clearly broken and need real investigation**: skip with
166+
a linked issue (see the next section) and fix in a follow-up.
167+
168+
### Skip Failing Tests with Linked Issues
169+
170+
For tests that cannot be fixed in this PR, use ScalaTest's `assume()` with a
171+
GitHub issue link as the message:
172+
173+
```scala
174+
assume(!isSpark42Plus, "https://github.com/apache/datafusion-comet/issues/NNNN")
175+
```
176+
177+
Open one issue per test or per cluster of related tests, and reference the
178+
issue from the `assume()` call. The link is what makes the skip
179+
recoverable: a contributor can grep for it later when the underlying problem
180+
is fixed.
181+
182+
Resist the temptation to disable a whole test class. Per-test skips keep the
183+
coverage loss visible and minimize the risk of silently dropping a real
184+
regression.
185+
186+
### Update Fallback Reason Strings If Needed
187+
188+
Some Comet rules (notably in `CometScanRule.scala`) match on Spark error
189+
messages or class names. Spark releases occasionally rename these. Update
190+
matchers to use a common substring that works across all supported versions
191+
rather than branching on `isSparkXYPlus`, so the matcher stays compact.
192+
193+
### What to Avoid in Stage 2
194+
195+
- Do not regenerate plan stability golden files yet. That belongs to
196+
stage 3 once Comet's own suite is green.
197+
- Do not enable Spark's SQL tests yet. They are larger, noisier, and
198+
benefit from landing on a known-good Comet test baseline.
199+
200+
## Stage 3: Enable Spark SQL Tests and Plan Stability
201+
202+
The third PR turns on the externally-driven test suites: Spark's own SQL
203+
tests run through Comet, and the TPC-DS plan stability golden files.
204+
205+
### Plan Stability Golden Files
206+
207+
Plan stability tests live under
208+
`spark/src/test/resources/tpcds-plan-stability/approved-plans-{v1_4,v2_7}-sparkX_Y/`.
209+
The suite (`CometPlanStabilitySuite`) falls back through earlier versions
210+
when no version-specific approved plan exists, so most queries do not need
211+
their own copy. Wire the new version into the fallback chain:
212+
213+
```scala
214+
private val planName = if (isSpark42Plus) {
215+
"approved-plans-v1_4-spark4_2"
216+
} else if (isSpark41Plus) {
217+
"approved-plans-v1_4-spark4_1"
218+
} else if (isSpark40Plus) {
219+
"approved-plans-v1_4-spark4_0"
220+
} else {
221+
...
222+
}
223+
```
224+
225+
Update the version regex in `dev/regenerate-golden-files.sh` to allow the
226+
new version, then regenerate:
227+
228+
```sh
229+
./dev/regenerate-golden-files.sh --spark-version X.Y
230+
```
231+
232+
The script automatically deduplicates: if a regenerated plan matches the
233+
fallback chain it is removed from the version-specific directory. Only the
234+
queries whose plans actually differ on the new version end up under
235+
`approved-plans-*-sparkX_Y/`. Inspect each surviving diff: a small,
236+
explainable difference is fine, but a large or mysterious diff is usually a
237+
sign of a shim bug worth investigating before approving the plan.
238+
239+
### Spark SQL Test Overrides
240+
241+
Spark SQL tests run against patched Spark sources under `dev/diffs/`. Each
242+
supported Spark version has its own diff file. The mechanics of starting
243+
from the closest existing diff, applying it with `--reject`, resolving the
244+
rejects (often with `wiggle`), and regenerating the new diff file are
245+
described in detail in [Spark SQL Tests](spark-sql-tests.md). Follow that
246+
page for the diff workflow itself; the additional points specific to a
247+
new-version bring-up are:
248+
249+
- When Spark introduces new error classes (Spark 4.1 changed
250+
`DIVIDE_BY_ZERO` to `REMAINDER_BY_ZERO` for modulo, for example), prefer
251+
matchers that work across versions, like matching on the substring
252+
`BY_ZERO`, rather than branching by version.
253+
- The same skip-with-linked-issue rule applies as in stage 2: one issue per
254+
test or cluster, and do not disable whole suites.
255+
256+
### CI for the Spark SQL Tests
257+
258+
Spark SQL tests do not run from the main PR build workflows. They have
259+
their own dedicated workflow files:
260+
261+
- `.github/workflows/spark_sql_test.yml`
262+
- `.github/workflows/spark_sql_test_native_iceberg_compat.yml`
263+
264+
Add the new version to the matrix in each of these files (`spark-short`,
265+
`spark-full`, `java`, `scan-impl`). Use the closest existing entry as a
266+
template.
267+
268+
Before merging, run `make format`, run clippy
269+
(`cd native && cargo clippy --all-targets --workspace -- -D warnings`), and
270+
confirm every skip introduced in this PR has a linked GitHub issue.
271+
272+
## Stage 4: Announce the Version in the User Guide
273+
274+
Once stage 3 is merged and CI is green, advertise the version to users.
275+
276+
The single source of truth for which Spark versions Comet works with is the
277+
`### Supported Spark Versions` section in
278+
`docs/source/user-guide/latest/installation.md`. It contains two tables and a
279+
list of per-version jar download links. Update each:
280+
281+
- Add a row to the **experimental** table (the one introduced by the
282+
sentence "Experimental support is provided for the following versions
283+
..."). Include the Java version, Scala version, and the `Yes`/`No`
284+
values for "Comet Tests in CI" and "Spark SQL Tests in CI" that match
285+
what stage 2 and stage 3 actually enabled.
286+
- Add a `(Experimental)` jar download link below the existing entries.
287+
288+
Do not add the new version to the main "Supported Spark Versions" table
289+
yet. That table is reserved for versions that have completed the promotion
290+
criteria described in the next section.
291+
292+
Other user-guide pages (`operators.md`, `datatypes.md`,
293+
`understanding-comet-plans.md`, etc.) generally do not mention specific
294+
Spark versions and do not need editing for a new bring-up. The exception is
295+
text that calls out a specific version's behavior, for example
296+
`understanding-comet-plans.md` mentions `Spark 4.0 and newer`. Search the
297+
user guide for the previous version string when adding a new one and
298+
extend any such phrases that should now apply.
299+
300+
`docs/generate-versions.py` is about Comet release branches, not Spark
301+
versions, and does not need editing.
302+
303+
## Stage 5: Fix the Skipped Tests (Follow-Up PRs)
304+
305+
Each follow-up PR should target one issue (or one cluster of related issues)
306+
opened during stages 2 and 3. The pattern is:
307+
308+
1. Reproduce the failure under `-Pspark-X.Y`.
309+
2. Identify the root cause: shim gap, expression behavior change, planner
310+
change, or genuine Comet bug exposed on the new version.
311+
3. Implement the fix.
312+
4. Remove the corresponding `assume()` skip and rerun the suite.
313+
5. Reference the issue in the PR title or description so it auto-closes on
314+
merge.
315+
316+
Avoid bundling unrelated skip removals into one PR. A targeted PR per
317+
issue keeps the diff small and makes regressions easy to bisect.
318+
319+
## Stage 6: Promote from Experimental to Supported
320+
321+
The user guide currently uses two tiers, "Supported" and "Experimental".
322+
Comet uses "Experimental" to describe its confidence in its own
323+
integration with a Spark version, distinct from Spark's "preview" tag
324+
(which refers to upstream release qualifiers like `4.2.0-preview4`). The
325+
term is already established in `installation.md`, `operators.md`, and
326+
`datasources.md`, so keep using it rather than introducing a new label.
327+
328+
A version starts experimental in stage 4 and is promoted later. Promotion
329+
is its own small PR, gated by these criteria:
330+
331+
- The Spark version is a final upstream release, not a preview, snapshot,
332+
or release candidate.
333+
- Both "Comet Tests in CI" and "Spark SQL Tests in CI" are `Yes` for the
334+
version, and have been `Yes` continuously for at least one Comet release
335+
cycle.
336+
- No `assume(!isSparkXYPlus, ...)` skip remains for a known correctness
337+
issue. Skips for unrelated, infrastructural, or environment-specific
338+
reasons are acceptable; correctness skips are not.
339+
- No open `Critical` or `Blocker`-tagged issue references the version.
340+
341+
When the criteria are met, the promotion PR moves the version's row from
342+
the experimental table into the main "Supported Spark Versions" table and
343+
removes the `(Experimental)` qualifier from the jar download link. No
344+
shim, code, or test changes should be bundled with this promotion. Keeping
345+
it as a doc-only PR makes it easy to revert if a problem shows up after
346+
the promotion.
347+
348+
## Related Documentation
349+
350+
- [Development Guide](development.md): build prerequisites, test commands,
351+
and the canonical Maven invocations.
352+
- [Comet Plugin Overview](plugin_overview.md): how the planner rules and
353+
shims fit together, useful when diagnosing version-specific failures.
354+
- [Spark SQL Tests](spark-sql-tests.md): mechanics of running Spark's own
355+
SQL tests through Comet.
356+
- [Bug Triage](bug_triage.md): conventions for opening and labeling issues
357+
for the skipped tests.

docs/source/contributor-guide/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,7 @@ ANSI Error Propagation <sql_error_propagation>
4646
Benchmarking Guide <benchmarking>
4747
Adding a New Operator <adding_a_new_operator>
4848
Adding a New Expression <adding_a_new_expression>
49+
Adding a New Spark Version <adding_a_new_spark_version>
4950
Supported Spark Expressions <spark_expressions_support>
5051
Tracing <tracing>
5152
Profiling <profiling>

docs/source/contributor-guide/spark-sql-tests.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,10 @@ Once Comet has support for a new Spark version, we need to create a diff file th
104104
of Apache Spark to enable Comet when running tests. This is a highly manual process and the process can
105105
vary depending on the changes in the new version of Spark, but here is a general guide to the process.
106106

107+
If you are bringing up a brand new Spark major or minor version, this work is one stage of a larger sequence
108+
covered in [Adding Support for a New Spark Version](adding_a_new_spark_version.md). Start there for the
109+
overall plan; the steps below cover only the diff-file mechanics.
110+
107111
We typically start by applying a patch from a previous version of Spark. For example, when enabling the tests
108112
for Spark version 3.5.6 we may start by applying the existing diff for 3.5.5 first.
109113

0 commit comments

Comments
 (0)