|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | +
|
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | +
|
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +# Adding Support for a New Spark Version |
| 21 | + |
| 22 | +This guide describes how to bring up support for a new Apache Spark release in |
| 23 | +Comet. Past examples include the work to add Spark 4.0, Spark 4.1, and the |
| 24 | +Spark 4.2 preview profile. The goal is a repeatable recipe that keeps each |
| 25 | +pull request small, reviewable, and easy to revert if a problem is discovered |
| 26 | +later. |
| 27 | + |
| 28 | +## Why Stage the Work |
| 29 | + |
| 30 | +Adding a new Spark version touches the build, the shim layer, CI, and three |
| 31 | +different test suites (Comet's own JVM tests, Spark's SQL tests, and the |
| 32 | +plan stability golden files). Bundling everything into one pull request |
| 33 | +produces a diff that is hard to review and almost impossible to bisect. A |
| 34 | +staged approach instead introduces one capability at a time, with CI proving |
| 35 | +each stage green before the next one lands. |
| 36 | + |
| 37 | +A typical bring-up uses several focused PRs: |
| 38 | + |
| 39 | +1. **Stage 1**: Maven profile, shims, and a compile-only CI job. |
| 40 | +2. **Stage 2**: Enable Comet's own JVM test suite under the new profile. |
| 41 | +3. **Stage 3**: Enable Spark's SQL tests under the new profile, skipping |
| 42 | + the failing ones with linked issues. |
| 43 | +4. **Stage 4**: Add the version to the experimental tier in the user |
| 44 | + guide. |
| 45 | +5. **Stage 5**: Follow-up PRs that fix one skipped test (or one related |
| 46 | + cluster of tests) per PR, removing the skip as part of the fix. |
| 47 | +6. **Stage 6**: Eventual promotion PR that moves the version from |
| 48 | + experimental to supported in the user guide. |
| 49 | + |
| 50 | +The sections below describe each stage in detail. |
| 51 | + |
| 52 | +## Stage 1: Maven Profile, Shims, and Compile-Only CI |
| 53 | + |
| 54 | +The first PR should produce a configuration where `./mvnw -Pspark-X.Y compile` |
| 55 | +succeeds, but no tests are required to pass yet. Keeping this PR |
| 56 | +compilation-only avoids mixing build issues with test failures. |
| 57 | + |
| 58 | +### Add the Maven Profile |
| 59 | + |
| 60 | +Add a new `<profile>` block to the top-level `pom.xml`. Copy the most recent |
| 61 | +existing profile (for example `spark-4.1`) and update the version |
| 62 | +properties: |
| 63 | + |
| 64 | +- `spark.version`: the full upstream version, including any qualifier |
| 65 | + (for example `4.2.0-preview4`). |
| 66 | +- `spark.version.short`: the major.minor (for example `4.2`). |
| 67 | +- `parquet.version`, `slf4j.version`, `scala.version`, |
| 68 | + `scala.binary.version`, `java.version`: align with what the new Spark |
| 69 | + release actually publishes. Use the exact Scala patch version Spark |
| 70 | + publishes, not a looser pin; a mismatch causes `NoSuchMethodError` at |
| 71 | + runtime. |
| 72 | +- `shims.majorVerSrc` and `shims.minorVerSrc`: the directory names the |
| 73 | + build helper plugin will add to the source path. By convention the |
| 74 | + major-version directory groups shims that are identical across the family |
| 75 | + (for example `spark-4.x`), and the minor-version directory holds |
| 76 | + per-release overrides (for example `spark-4.2`). |
| 77 | + |
| 78 | +### Lay Out the Shim Directories |
| 79 | + |
| 80 | +The build helper plugin in `spark/pom.xml` adds `src/main/${shims.majorVerSrc}` |
| 81 | +and `src/main/${shims.minorVerSrc}` to the compile source roots. Files in the |
| 82 | +minor directory shadow files in the major directory, so the typical pattern |
| 83 | +is: |
| 84 | + |
| 85 | +- `spark/src/main/spark-4.x/org/apache/comet/shims/` for shims that work for |
| 86 | + the whole 4.x family. |
| 87 | +- `spark/src/main/spark-X.Y/org/apache/comet/shims/` for files that have to |
| 88 | + diverge for one specific release. |
| 89 | + |
| 90 | +When Spark X.Y is brand new and you do not yet know which shims will need to |
| 91 | +diverge, start by setting `shims.majorVerSrc` to an existing major directory |
| 92 | +(for example `spark-4.x`) and `shims.minorVerSrc` to a new empty |
| 93 | +`spark-X.Y` directory. Compile the project; the compiler will tell you which |
| 94 | +shims need a per-version override. Add only those, and leave the rest in the |
| 95 | +shared major directory. Commit `13e5f8cf5` ("refactor: consolidate identical |
| 96 | +spark-4.0 and spark-4.1 shims into spark-4.x") shows the cleanup that |
| 97 | +follows when shims that previously diverged turn out to be identical. |
| 98 | + |
| 99 | +The same layering applies to `spark/src/test/spark-X.Y/` and |
| 100 | +`common/src/main/spark-X.Y/`. |
| 101 | + |
| 102 | +### Add a Version-Detection Helper |
| 103 | + |
| 104 | +`CometSparkSessionExtensions.scala` exposes helpers like `isSpark40Plus` and |
| 105 | +`isSpark41Plus` that the rest of the codebase uses to gate version-specific |
| 106 | +logic and to skip tests. Add the matching helper for the new version |
| 107 | +(`isSpark42Plus`, etc.) in this PR so that later stages can use it. |
| 108 | + |
| 109 | +### Add a Compile-Only CI Job |
| 110 | + |
| 111 | +Edit `.github/workflows/pr_build_linux.yml` and `pr_build_macos.yml` to add |
| 112 | +the new Spark version to the `build-spark` (or equivalent compile-only) job |
| 113 | +matrix. Do not add it to the heavier test matrices yet. A compile-only job |
| 114 | +keeps the CI cost of stage 1 small and prevents test failures on the new |
| 115 | +version from blocking unrelated PRs. |
| 116 | + |
| 117 | +When CI capacity is constrained (the macOS runners in particular), it is |
| 118 | +acceptable to drop an older minor version from the macOS PR matrix while a |
| 119 | +preview version is being stabilized. PR #4104 ("ci: reduce macOS PR matrix |
| 120 | +to single Spark 4.0 profile") is a precedent for this kind of trim. |
| 121 | + |
| 122 | +### What to Avoid in Stage 1 |
| 123 | + |
| 124 | +- Do not enable any test job for the new version yet. |
| 125 | +- Do not regenerate golden files yet. Plan stability output is sensitive to |
| 126 | + shim correctness, and regenerating before the shims are stable produces |
| 127 | + noisy diffs that get overwritten in stage 2. |
| 128 | +- Do not modify `dev/generate-versions.py` or other release-doc scripts in |
| 129 | + this PR. Those are owned by the release process and have their own update |
| 130 | + cadence. |
| 131 | + |
| 132 | +## Stage 2: Enable Comet's JVM Tests |
| 133 | + |
| 134 | +Once the new profile compiles cleanly in CI, the next PR turns on Comet's |
| 135 | +own test suites under the new profile. |
| 136 | + |
| 137 | +### Add the New Profile to the Test Matrix |
| 138 | + |
| 139 | +Promote the new Spark version from the compile-only job to the main test |
| 140 | +jobs in `.github/workflows/pr_build_linux.yml` (and `pr_build_macos.yml` if |
| 141 | +capacity allows). Use `scan_impl: "auto"` so both `native_datafusion` and |
| 142 | +`native_iceberg_compat` get exercised, matching how earlier versions are |
| 143 | +configured. |
| 144 | + |
| 145 | +### Run the Suite Locally First |
| 146 | + |
| 147 | +Run the JVM test suite locally against the new profile before pushing, since |
| 148 | +CI iterations are slow: |
| 149 | + |
| 150 | +```sh |
| 151 | +make |
| 152 | +./mvnw -Pspark-X.Y test |
| 153 | +``` |
| 154 | + |
| 155 | +Expect failures. Triage them into three buckets: |
| 156 | + |
| 157 | +1. **Real shim gaps**: a Spark API changed and the shim still calls the old |
| 158 | + signature. Fix these in this PR. Per-version overrides go under |
| 159 | + `spark/src/main/spark-X.Y/`; if the change applies to the whole family, |
| 160 | + put the fix in the shared major-version directory. |
| 161 | +2. **Behavioral differences that need a code change**: for example a new |
| 162 | + Spark error class, a renamed config, or a new `OneRowRelation` planning |
| 163 | + path. Fix the small ones in this PR. Larger ones should be split into |
| 164 | + their own PRs and the affected tests skipped. |
| 165 | +3. **Things that are clearly broken and need real investigation**: skip with |
| 166 | + a linked issue (see the next section) and fix in a follow-up. |
| 167 | + |
| 168 | +### Skip Failing Tests with Linked Issues |
| 169 | + |
| 170 | +For tests that cannot be fixed in this PR, use ScalaTest's `assume()` with a |
| 171 | +GitHub issue link as the message: |
| 172 | + |
| 173 | +```scala |
| 174 | +assume(!isSpark42Plus, "https://github.com/apache/datafusion-comet/issues/NNNN") |
| 175 | +``` |
| 176 | + |
| 177 | +Open one issue per test or per cluster of related tests, and reference the |
| 178 | +issue from the `assume()` call. The link is what makes the skip |
| 179 | +recoverable: a contributor can grep for it later when the underlying problem |
| 180 | +is fixed. |
| 181 | + |
| 182 | +Resist the temptation to disable a whole test class. Per-test skips keep the |
| 183 | +coverage loss visible and minimize the risk of silently dropping a real |
| 184 | +regression. |
| 185 | + |
| 186 | +### Update Fallback Reason Strings If Needed |
| 187 | + |
| 188 | +Some Comet rules (notably in `CometScanRule.scala`) match on Spark error |
| 189 | +messages or class names. Spark releases occasionally rename these. Update |
| 190 | +matchers to use a common substring that works across all supported versions |
| 191 | +rather than branching on `isSparkXYPlus`, so the matcher stays compact. |
| 192 | + |
| 193 | +### What to Avoid in Stage 2 |
| 194 | + |
| 195 | +- Do not regenerate plan stability golden files yet. That belongs to |
| 196 | + stage 3 once Comet's own suite is green. |
| 197 | +- Do not enable Spark's SQL tests yet. They are larger, noisier, and |
| 198 | + benefit from landing on a known-good Comet test baseline. |
| 199 | + |
| 200 | +## Stage 3: Enable Spark SQL Tests and Plan Stability |
| 201 | + |
| 202 | +The third PR turns on the externally-driven test suites: Spark's own SQL |
| 203 | +tests run through Comet, and the TPC-DS plan stability golden files. |
| 204 | + |
| 205 | +### Plan Stability Golden Files |
| 206 | + |
| 207 | +Plan stability tests live under |
| 208 | +`spark/src/test/resources/tpcds-plan-stability/approved-plans-{v1_4,v2_7}-sparkX_Y/`. |
| 209 | +The suite (`CometPlanStabilitySuite`) falls back through earlier versions |
| 210 | +when no version-specific approved plan exists, so most queries do not need |
| 211 | +their own copy. Wire the new version into the fallback chain: |
| 212 | + |
| 213 | +```scala |
| 214 | +private val planName = if (isSpark42Plus) { |
| 215 | + "approved-plans-v1_4-spark4_2" |
| 216 | +} else if (isSpark41Plus) { |
| 217 | + "approved-plans-v1_4-spark4_1" |
| 218 | +} else if (isSpark40Plus) { |
| 219 | + "approved-plans-v1_4-spark4_0" |
| 220 | +} else { |
| 221 | + ... |
| 222 | +} |
| 223 | +``` |
| 224 | + |
| 225 | +Update the version regex in `dev/regenerate-golden-files.sh` to allow the |
| 226 | +new version, then regenerate: |
| 227 | + |
| 228 | +```sh |
| 229 | +./dev/regenerate-golden-files.sh --spark-version X.Y |
| 230 | +``` |
| 231 | + |
| 232 | +The script automatically deduplicates: if a regenerated plan matches the |
| 233 | +fallback chain it is removed from the version-specific directory. Only the |
| 234 | +queries whose plans actually differ on the new version end up under |
| 235 | +`approved-plans-*-sparkX_Y/`. Inspect each surviving diff: a small, |
| 236 | +explainable difference is fine, but a large or mysterious diff is usually a |
| 237 | +sign of a shim bug worth investigating before approving the plan. |
| 238 | + |
| 239 | +### Spark SQL Test Overrides |
| 240 | + |
| 241 | +Spark SQL tests run against patched Spark sources under `dev/diffs/`. Each |
| 242 | +supported Spark version has its own diff file. The mechanics of starting |
| 243 | +from the closest existing diff, applying it with `--reject`, resolving the |
| 244 | +rejects (often with `wiggle`), and regenerating the new diff file are |
| 245 | +described in detail in [Spark SQL Tests](spark-sql-tests.md). Follow that |
| 246 | +page for the diff workflow itself; the additional points specific to a |
| 247 | +new-version bring-up are: |
| 248 | + |
| 249 | +- When Spark introduces new error classes (Spark 4.1 changed |
| 250 | + `DIVIDE_BY_ZERO` to `REMAINDER_BY_ZERO` for modulo, for example), prefer |
| 251 | + matchers that work across versions, like matching on the substring |
| 252 | + `BY_ZERO`, rather than branching by version. |
| 253 | +- The same skip-with-linked-issue rule applies as in stage 2: one issue per |
| 254 | + test or cluster, and do not disable whole suites. |
| 255 | + |
| 256 | +### CI for the Spark SQL Tests |
| 257 | + |
| 258 | +Spark SQL tests do not run from the main PR build workflows. They have |
| 259 | +their own dedicated workflow files: |
| 260 | + |
| 261 | +- `.github/workflows/spark_sql_test.yml` |
| 262 | +- `.github/workflows/spark_sql_test_native_iceberg_compat.yml` |
| 263 | + |
| 264 | +Add the new version to the matrix in each of these files (`spark-short`, |
| 265 | +`spark-full`, `java`, `scan-impl`). Use the closest existing entry as a |
| 266 | +template. |
| 267 | + |
| 268 | +Before merging, run `make format`, run clippy |
| 269 | +(`cd native && cargo clippy --all-targets --workspace -- -D warnings`), and |
| 270 | +confirm every skip introduced in this PR has a linked GitHub issue. |
| 271 | + |
| 272 | +## Stage 4: Announce the Version in the User Guide |
| 273 | + |
| 274 | +Once stage 3 is merged and CI is green, advertise the version to users. |
| 275 | + |
| 276 | +The single source of truth for which Spark versions Comet works with is the |
| 277 | +`### Supported Spark Versions` section in |
| 278 | +`docs/source/user-guide/latest/installation.md`. It contains two tables and a |
| 279 | +list of per-version jar download links. Update each: |
| 280 | + |
| 281 | +- Add a row to the **experimental** table (the one introduced by the |
| 282 | + sentence "Experimental support is provided for the following versions |
| 283 | + ..."). Include the Java version, Scala version, and the `Yes`/`No` |
| 284 | + values for "Comet Tests in CI" and "Spark SQL Tests in CI" that match |
| 285 | + what stage 2 and stage 3 actually enabled. |
| 286 | +- Add a `(Experimental)` jar download link below the existing entries. |
| 287 | + |
| 288 | +Do not add the new version to the main "Supported Spark Versions" table |
| 289 | +yet. That table is reserved for versions that have completed the promotion |
| 290 | +criteria described in the next section. |
| 291 | + |
| 292 | +Other user-guide pages (`operators.md`, `datatypes.md`, |
| 293 | +`understanding-comet-plans.md`, etc.) generally do not mention specific |
| 294 | +Spark versions and do not need editing for a new bring-up. The exception is |
| 295 | +text that calls out a specific version's behavior, for example |
| 296 | +`understanding-comet-plans.md` mentions `Spark 4.0 and newer`. Search the |
| 297 | +user guide for the previous version string when adding a new one and |
| 298 | +extend any such phrases that should now apply. |
| 299 | + |
| 300 | +`docs/generate-versions.py` is about Comet release branches, not Spark |
| 301 | +versions, and does not need editing. |
| 302 | + |
| 303 | +## Stage 5: Fix the Skipped Tests (Follow-Up PRs) |
| 304 | + |
| 305 | +Each follow-up PR should target one issue (or one cluster of related issues) |
| 306 | +opened during stages 2 and 3. The pattern is: |
| 307 | + |
| 308 | +1. Reproduce the failure under `-Pspark-X.Y`. |
| 309 | +2. Identify the root cause: shim gap, expression behavior change, planner |
| 310 | + change, or genuine Comet bug exposed on the new version. |
| 311 | +3. Implement the fix. |
| 312 | +4. Remove the corresponding `assume()` skip and rerun the suite. |
| 313 | +5. Reference the issue in the PR title or description so it auto-closes on |
| 314 | + merge. |
| 315 | + |
| 316 | +Avoid bundling unrelated skip removals into one PR. A targeted PR per |
| 317 | +issue keeps the diff small and makes regressions easy to bisect. |
| 318 | + |
| 319 | +## Stage 6: Promote from Experimental to Supported |
| 320 | + |
| 321 | +The user guide currently uses two tiers, "Supported" and "Experimental". |
| 322 | +Comet uses "Experimental" to describe its confidence in its own |
| 323 | +integration with a Spark version, distinct from Spark's "preview" tag |
| 324 | +(which refers to upstream release qualifiers like `4.2.0-preview4`). The |
| 325 | +term is already established in `installation.md`, `operators.md`, and |
| 326 | +`datasources.md`, so keep using it rather than introducing a new label. |
| 327 | + |
| 328 | +A version starts experimental in stage 4 and is promoted later. Promotion |
| 329 | +is its own small PR, gated by these criteria: |
| 330 | + |
| 331 | +- The Spark version is a final upstream release, not a preview, snapshot, |
| 332 | + or release candidate. |
| 333 | +- Both "Comet Tests in CI" and "Spark SQL Tests in CI" are `Yes` for the |
| 334 | + version, and have been `Yes` continuously for at least one Comet release |
| 335 | + cycle. |
| 336 | +- No `assume(!isSparkXYPlus, ...)` skip remains for a known correctness |
| 337 | + issue. Skips for unrelated, infrastructural, or environment-specific |
| 338 | + reasons are acceptable; correctness skips are not. |
| 339 | +- No open `Critical` or `Blocker`-tagged issue references the version. |
| 340 | + |
| 341 | +When the criteria are met, the promotion PR moves the version's row from |
| 342 | +the experimental table into the main "Supported Spark Versions" table and |
| 343 | +removes the `(Experimental)` qualifier from the jar download link. No |
| 344 | +shim, code, or test changes should be bundled with this promotion. Keeping |
| 345 | +it as a doc-only PR makes it easy to revert if a problem shows up after |
| 346 | +the promotion. |
| 347 | + |
| 348 | +## Related Documentation |
| 349 | + |
| 350 | +- [Development Guide](development.md): build prerequisites, test commands, |
| 351 | + and the canonical Maven invocations. |
| 352 | +- [Comet Plugin Overview](plugin_overview.md): how the planner rules and |
| 353 | + shims fit together, useful when diagnosing version-specific failures. |
| 354 | +- [Spark SQL Tests](spark-sql-tests.md): mechanics of running Spark's own |
| 355 | + SQL tests through Comet. |
| 356 | +- [Bug Triage](bug_triage.md): conventions for opening and labeling issues |
| 357 | + for the skipped tests. |
0 commit comments