Skip to content

Commit 4820e97

Browse files
imariosclaudepomadchin
authored
Add Apache Spark 4.0 support (#787) (#941)
* Add Apache Spark 4.0 support (#787) Adds an in-tree `frameless-*-spark40` module set targeting Spark 4.0.2, cross-built for Scala 2.13 only (Spark 4 dropped 2.12) and requiring JDK 17. No external shim dependency: version-divergent Catalyst access is isolated behind FramelessInternals in a `src/main/spark-4` source overlay, mirroring the existing spark-3 / spark-3.4+ pattern. Key adaptations for Spark 4: - Column no longer wraps a Catalyst Expression; bridge through classic.ExpressionUtils.column and an eager ColumnNodeToExpressionConverter (the lazy ColumnNodeExpression is Unevaluable and hides children, which broke self-join disambiguation and codegen). - Dataset/SparkSession split into abstract API + classic impl; internal helpers downcast to classic for logicalPlan/sessionState/sqlContext. - ExpressionEncoder now takes a leading AgnosticEncoder (SPARK-49025); supply a metadata-only JavaBeanEncoder stand-in carrying the right ClassTag. - AnalysisException is errorClass-based; MapGroups gets a spark-4 variant. - joinCross re-encodes its result via TypedExpressionEncoder, consistent with the other joins. - Hide the new catalyst expressions.With from TypedColumn's wildcard import. Test harness: disable ANSI mode (Spark 4 default) so the property generators keep their wrap-around/null semantics, and strip field metadata in SchemaTests. All changes are no-ops on Spark 3.x. CI: add a JDK 17 leg and pin root-spark40 to Scala 2.13 / JDK 17. dataset-spark40 passes 414/414 tests; verified end-to-end on a 2-worker standalone Spark 4.0.2 cluster (groupBy/agg, self-join, joinWith, executor closures) to confirm cross-node serialization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Apply scalafmt formatting Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Make docs/mdoc build on JDK 17 (site CI job) Adding a JDK 17 CI leg for Spark 4 made sbt-typelevel run the Generate Site job on JDK 17 (it picks the last configured Java). mdoc executes Spark code, which needs the module --add-opens flags on JDK 17. Fork the docs run, pass the flags through (extracted into sparkJava17Options, shared with the test config), and pin the forked run's working directory to the repo root so docs keep finding their relative data files (docs/iris.data). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add MiMa filters for FramelessInternals compat-seam changes The Spark 4 port reworked FramelessInternals (internal version-compat plumbing, not intended public API): `column` is now the Expression->Column bridge and `mkDataset` derives the session from the source Dataset instead of taking a SQLContext. Both are binary-incompatible signature changes flagged by MiMa against the 3.x baselines (0.14.0/0.14.1), so exclude them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix Scala 2.12 scaladoc: use backticks instead of [[]] links Scala 2.12's scaladoc fails (fatally) on [[Expression]] / [[Column]] / [[ExpressionEncoder]] doc links in FramelessInternals because those Spark types aren't resolvable in the doc scope. Use backticks (code spans) instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add value-level self-join regression test The existing self-join tests only compare row counts. This collects and verifies the decoded (T, U) tuples through the colLeft/colRight disambiguation path - a regression guard for the Spark 4 ColumnNode rework, which broke that path (only count-level coverage would have missed a subtly wrong decode). Passes unchanged on Spark 3.x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Keep imports closer to source in TypedExpressionEncoder Revert the opinionated merge of the standalone `import ...Encoder` into a braced group; add FramelessInternals as a separate plain import instead. scalafmt does not merge imports, so this stays linter-clean while staying closer to the original source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Keep Spark 3.5 the default version, drop Spark 3.3 artifacts --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Grigory Pomadchin <grigory.pomadchin@disneystreaming.com>
1 parent 502bb54 commit 4820e97

19 files changed

Lines changed: 3834 additions & 2283 deletions

File tree

.github/workflows/ci.yml

Lines changed: 56 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -31,13 +31,19 @@ jobs:
3131
matrix:
3232
os: [ubuntu-22.04]
3333
scala: [2.13, 2.12]
34-
java: [temurin@8]
35-
project: [root-spark33, root-spark34, root-spark35]
34+
java: [temurin@8, temurin@17]
35+
project: [root-spark34, root-spark35, root-spark40]
3636
exclude:
37-
- scala: 2.13
38-
project: root-spark33
3937
- scala: 2.13
4038
project: root-spark34
39+
- scala: 2.12
40+
project: root-spark40
41+
- java: temurin@17
42+
project: root-spark34
43+
- java: temurin@17
44+
project: root-spark35
45+
- java: temurin@8
46+
project: root-spark40
4147
runs-on: ${{ matrix.os }}
4248
timeout-minutes: 60
4349
steps:
@@ -62,6 +68,19 @@ jobs:
6268
if: matrix.java == 'temurin@8' && steps.setup-java-temurin-8.outputs.cache-hit == 'false'
6369
run: sbt +update
6470

71+
- name: Setup Java (temurin@17)
72+
id: setup-java-temurin-17
73+
if: matrix.java == 'temurin@17'
74+
uses: actions/setup-java@v5
75+
with:
76+
distribution: temurin
77+
java-version: 17
78+
cache: sbt
79+
80+
- name: sbt update
81+
if: matrix.java == 'temurin@17' && steps.setup-java-temurin-17.outputs.cache-hit == 'false'
82+
run: sbt +update
83+
6584
- name: Check that workflows are up to date
6685
run: sbt githubWorkflowCheck
6786

@@ -115,6 +134,19 @@ jobs:
115134
if: matrix.java == 'temurin@8' && steps.setup-java-temurin-8.outputs.cache-hit == 'false'
116135
run: sbt +update
117136

137+
- name: Setup Java (temurin@17)
138+
id: setup-java-temurin-17
139+
if: matrix.java == 'temurin@17'
140+
uses: actions/setup-java@v5
141+
with:
142+
distribution: temurin
143+
java-version: 17
144+
cache: sbt
145+
146+
- name: sbt update
147+
if: matrix.java == 'temurin@17' && steps.setup-java-temurin-17.outputs.cache-hit == 'false'
148+
run: sbt +update
149+
118150
- name: Import signing key
119151
if: env.PGP_SECRET != '' && env.PGP_PASSPHRASE == ''
120152
env:
@@ -169,18 +201,31 @@ jobs:
169201
if: matrix.java == 'temurin@8' && steps.setup-java-temurin-8.outputs.cache-hit == 'false'
170202
run: sbt +update
171203

204+
- name: Setup Java (temurin@17)
205+
id: setup-java-temurin-17
206+
if: matrix.java == 'temurin@17'
207+
uses: actions/setup-java@v5
208+
with:
209+
distribution: temurin
210+
java-version: 17
211+
cache: sbt
212+
213+
- name: sbt update
214+
if: matrix.java == 'temurin@17' && steps.setup-java-temurin-17.outputs.cache-hit == 'false'
215+
run: sbt +update
216+
172217
- name: Submit Dependencies
173218
uses: scalacenter/sbt-dependency-submission@v2
174219
with:
175-
modules-ignore: root-spark33_2.13 root-spark33_2.12 docs_2.13 docs_2.12 root-spark34_2.13 root-spark34_2.12 root-spark35_2.13 root-spark35_2.12
220+
modules-ignore: docs_2.13 docs_2.12 root-spark34_2.13 root-spark34_2.12 root-spark35_2.13 root-spark35_2.12 root-spark40_2.13
176221
configs-ignore: test scala-tool scala-doc-tool test-internal
177222

178223
site:
179224
name: Generate Site
180225
strategy:
181226
matrix:
182227
os: [ubuntu-22.04]
183-
java: [temurin@11]
228+
java: [temurin@17]
184229
runs-on: ${{ matrix.os }}
185230
steps:
186231
- name: Checkout current branch (full)
@@ -204,17 +249,17 @@ jobs:
204249
if: matrix.java == 'temurin@8' && steps.setup-java-temurin-8.outputs.cache-hit == 'false'
205250
run: sbt +update
206251

207-
- name: Setup Java (temurin@11)
208-
id: setup-java-temurin-11
209-
if: matrix.java == 'temurin@11'
252+
- name: Setup Java (temurin@17)
253+
id: setup-java-temurin-17
254+
if: matrix.java == 'temurin@17'
210255
uses: actions/setup-java@v5
211256
with:
212257
distribution: temurin
213-
java-version: 11
258+
java-version: 17
214259
cache: sbt
215260

216261
- name: sbt update
217-
if: matrix.java == 'temurin@11' && steps.setup-java-temurin-11.outputs.cache-hit == 'false'
262+
if: matrix.java == 'temurin@17' && steps.setup-java-temurin-17.outputs.cache-hit == 'false'
218263
run: sbt +update
219264

220265
- name: Generate site

README.md

Lines changed: 24 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -25,37 +25,40 @@ associated channels (e.g. GitHub, Discord) to be a safe and friendly environment
2525
The compatible versions of [Spark](http://spark.apache.org/) and
2626
[cats](https://github.com/typelevel/cats) are as follows:
2727

28-
| Frameless | Spark | Cats | Cats-Effect | Scala |
29-
|-----------|-----------------------------|----------|-------------|-------------|
30-
| 0.16.0 | 3.5.0 / 3.4.0 / 3.3.0 | 2.x | 3.x | 2.12 / 2.13 |
31-
| 0.15.0 | 3.4.0 / 3.3.0 / 3.2.2 | 2.x | 3.x | 2.12 / 2.13 |
32-
| 0.14.1 | 3.4.0 / 3.3.0 / 3.2.2 | 2.x | 3.x | 2.12 / 2.13 |
33-
| 0.14.0 | 3.3.0 / 3.2.2 / 3.1.3 | 2.x | 3.x | 2.12 / 2.13 |
34-
| 0.13.0 | 3.3.0 / 3.2.2 / 3.1.3 | 2.x | 3.x | 2.12 / 2.13 |
35-
| 0.12.0 | 3.2.1 / 3.1.3 / 3.0.3 | 2.x | 3.x | 2.12 / 2.13 |
36-
| 0.11.1 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 |
37-
| 0.11.0* | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 |
38-
| 0.10.1 | 3.1.0 | 2.x | 2.x | 2.12 |
39-
| 0.9.0 | 3.0.0 | 1.x | 1.x | 2.12 |
40-
| 0.8.0 | 2.4.0 | 1.x | 1.x | 2.11 / 2.12 |
41-
| 0.7.0 | 2.3.1 | 1.x | 1.x | 2.11 |
42-
| 0.6.1 | 2.3.0 | 1.x | 0.8 | 2.11 |
43-
| 0.5.2 | 2.2.1 | 1.x | 0.8 | 2.11 |
44-
| 0.4.1 | 2.2.0 | 1.x | 0.8 | 2.11 |
45-
| 0.4.0 | 2.2.0 | 1.0.0-IF | 0.4 | 2.11 |
28+
| Frameless | Spark | Cats | Cats-Effect | Scala |
29+
|-----------|------------------------|----------|-------------|-------------|
30+
| 0.17.0 | 4.0.2† / 3.5.8 / 3.4.4 | 2.x | 3.x | 2.12 / 2.13 |
31+
| 0.16.0 | 3.5.0 / 3.4.0 / 3.3.0 | 2.x | 3.x | 2.12 / 2.13 |
32+
| 0.15.0 | 3.4.0 / 3.3.0 / 3.2.2 | 2.x | 3.x | 2.12 / 2.13 |
33+
| 0.14.1 | 3.4.0 / 3.3.0 / 3.2.2 | 2.x | 3.x | 2.12 / 2.13 |
34+
| 0.14.0 | 3.3.0 / 3.2.2 / 3.1.3 | 2.x | 3.x | 2.12 / 2.13 |
35+
| 0.13.0 | 3.3.0 / 3.2.2 / 3.1.3 | 2.x | 3.x | 2.12 / 2.13 |
36+
| 0.12.0 | 3.2.1 / 3.1.3 / 3.0.3 | 2.x | 3.x | 2.12 / 2.13 |
37+
| 0.11.1 | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 |
38+
| 0.11.0* | 3.2.0 / 3.1.2 / 3.0.1 | 2.x | 2.x | 2.12 / 2.13 |
39+
| 0.10.1 | 3.1.0 | 2.x | 2.x | 2.12 |
40+
| 0.9.0 | 3.0.0 | 1.x | 1.x | 2.12 |
41+
| 0.8.0 | 2.4.0 | 1.x | 1.x | 2.11 / 2.12 |
42+
| 0.7.0 | 2.3.1 | 1.x | 1.x | 2.11 |
43+
| 0.6.1 | 2.3.0 | 1.x | 0.8 | 2.11 |
44+
| 0.5.2 | 2.2.1 | 1.x | 0.8 | 2.11 |
45+
| 0.4.1 | 2.2.0 | 1.x | 0.8 | 2.11 |
46+
| 0.4.0 | 2.2.0 | 1.0.0-IF | 0.4 | 2.11 |
4647

4748
_\* 0.11.0 has broken Spark 3.1.2 and 3.0.1 artifacts published._
4849

50+
_† The Spark 4.0.x artifacts (`-spark40`) are published for **Scala 2.13 only** and require **JDK 17+**, since Spark 4 dropped Scala 2.12 and JDK 8/11. The default (unsuffixed) artifacts still target Spark 3.5._
51+
4952
Starting 0.11 we introduced Spark cross published artifacts:
5053

5154
* By default, frameless artifacts depend on the most recent Spark version
5255
* Suffix `-spark{major}{minor}` is added to artifacts that are released for the previous Spark version(s)
5356

5457
Artifact names examples:
5558

56-
* `frameless-dataset` (the latest Spark dependency)
57-
* `frameless-dataset-spark33` (Spark 3.3.x dependency)
58-
* `frameless-dataset-spark32` (Spark 3.2.x dependency)
59+
* `frameless-dataset` (the default Spark 3.5.x dependency)
60+
* `frameless-dataset-spark40` (Spark 4.0.x dependency; Scala 2.13 + JDK 17 only)
61+
* `frameless-dataset-spark34` (Spark 3.4.x dependency)
5962

6063
Versions 0.5.x and 0.6.x have identical features. The first is compatible with Spark 2.2.1 and the second with 2.3.0.
6164

0 commit comments

Comments
 (0)