Commit 21ce7b6

feat(ci): add scripts to run Spark SQL test suite locally for Spark 4.1
Add bash scripts under dev/ci/spark-sql-tests/ that reproduce the spark_sql_test.yml GitHub Actions workflow on a developer machine for Apache Spark 4.1. They run Spark's own SQL test suites with Comet enabled, which is useful for debugging a Spark SQL test failure locally instead of waiting on CI.

- config.sh: shared configuration and the seven CI module-shard definitions, copied from spark_sql_test.yml
- setup-spark.sh: maintains a persistent apache/spark checkout and applies dev/diffs/4.1.1.diff, preserving build artifacts across runs
- run.sh: builds Comet, runs the selected module shard(s), and prints a PASS/FAIL summary
- README.md: usage, prerequisites, and environment variables

Only Spark 4.1 is supported for now.

[skip ci]
1 parent c7cee9b commit 21ce7b6

5 files changed

Lines changed: 429 additions & 0 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -27,3 +27,4 @@ output
 docs/comet-*/
 docs/build/
 docs/temp/
+dev/ci/spark-sql-tests/logs/

dev/ci/spark-sql-tests/README.md

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Local Spark SQL Tests

These scripts reproduce the `spark_sql_test.yml` GitHub Actions workflow on a
developer machine for **Apache Spark 4.1**. They run Spark's own SQL test
suites with Comet enabled, which is useful for debugging a Spark SQL test
failure locally instead of waiting on CI.

## Prerequisites

- JDK 17 with `JAVA_HOME` set. Spark 4.1 also runs on newer JDKs, but CI uses 17.
- A Rust toolchain, plus `protobuf-compiler` and `clang`, for the Comet native build.
- Git, and enough disk space for an `apache/spark` checkout and its build output.
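
To confirm which JDK is active before starting a long run, something like the following may help. It is only a sketch: it reuses the `sed` expression from `run.sh`, and the `jdk_major` function name is hypothetical.

```sh
# Extract the major version from the first line of `java -version` output.
# jdk_major is a hypothetical helper name used only for this illustration.
jdk_major() {
  printf '%s\n' "$1" | sed -E 's/.*version "([0-9]+).*/\1/'
}

jdk_major 'openjdk version "17.0.9" 2023-10-17'   # prints 17
jdk_major 'java version "1.8.0_392"'              # prints 1 (legacy 1.x scheme)
```

On a real machine the argument would be `"$(java -version 2>&1 | head -n1)"`. A mismatch only triggers a warning in `run.sh`, so a newer JDK does not block the run.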

## Usage

The script resolves its own location, so it can be invoked from any directory.
The paths below are relative to the repository root:

```sh
dev/ci/spark-sql-tests/run.sh [module]
```

`module` is one of the seven CI shards, or `all` (the default):

| Module       | Spark suites |
|--------------|--------------|
| `catalyst`   | `catalyst/test` |
| `sql_core-1` | `sql` suites excluding `ExtendedSQLTest` / `SlowSQLTest` |
| `sql_core-2` | `sql` `ExtendedSQLTest` suites |
| `sql_core-3` | `sql` `SlowSQLTest` suites |
| `sql_hive-1` | `hive` suites excluding `ExtendedHiveTest` / `SlowHiveTest` |
| `sql_hive-2` | `hive` `ExtendedHiveTest` suites |
| `sql_hive-3` | `hive` `SlowHiveTest` suites |

Examples:

```sh
# Run a single shard
dev/ci/spark-sql-tests/run.sh sql_core-1

# Run all seven shards sequentially
dev/ci/spark-sql-tests/run.sh

# Re-run a shard without rebuilding Comet or re-applying the Spark diff
SKIP_BUILD=1 SKIP_SPARK_SETUP=1 dev/ci/spark-sql-tests/run.sh sql_core-1
```

The first run clones `apache/spark` and builds both Comet and Spark, which
takes a while. A full `all` run takes several hours, the same as CI. Per-module
output is written to `dev/ci/spark-sql-tests/logs/<module>.log`, and a
PASS/FAIL summary is printed at the end.

## Environment variables

| Variable           | Default                                   | Effect |
|--------------------|-------------------------------------------|--------|
| `SKIP_BUILD`       | unset                                     | `1` skips the Comet build and reuses existing artifacts. |
| `SKIP_SPARK_SETUP` | unset                                     | `1` skips the Spark clone/reset/diff step. |
| `COMET_SPARK_DIR`  | `~/.cache/datafusion-comet/apache-spark`  | Persistent Spark checkout location. |
| `SPARK_REF`        | `v4.1.1`                                  | Git ref checked out for the Spark sources. |
| `SBT_MEM`          | `4096`                                    | sbt heap size in MB. |
| `LC_ALL`           | `C.UTF-8`                                 | Locale for the sbt run. Use `en_US.UTF-8` on macOS if `C.UTF-8` is unavailable. |
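
As a rough illustration of how these defaults resolve, the sketch below uses the same `${VAR:-default}` expansion that `config.sh` applies to `SBT_MEM`. The `show_mem` variable is a hypothetical name for this example, and a child `sh -c` stands in for `run.sh`.

```sh
# Same expansion config.sh uses: keep the caller's value, else fall back.
# show_mem is a hypothetical stand-in for the script reading the variable.
show_mem='echo "SBT_MEM=${SBT_MEM:-4096}"'

unset SBT_MEM
default_out="$(sh -c "$show_mem")"                 # no override set
override_out="$(SBT_MEM=6144 sh -c "$show_mem")"   # one-off override

echo "$default_out"    # prints SBT_MEM=4096
echo "$override_out"   # prints SBT_MEM=6144
```

A variable set as a command prefix applies to that invocation only, so several overrides can be combined on one line in front of `dev/ci/spark-sql-tests/run.sh` without changing the shell's environment.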

## How it works

1. `run.sh` builds Comet with `PROFILES=-Pspark-4.1 make release` (unless
   `SKIP_BUILD=1`), then purges partial Maven cache entries so sbt's resolver
   does not choke on POM-only artifacts.
2. `setup-spark.sh` maintains a persistent `apache/spark` checkout: it clones
   the `v4.1.1` tag on first use, and on every run resets it to a clean state
   and applies `dev/diffs/4.1.1.diff`. Spark's compiled `target/` artifacts are
   preserved across runs so rebuilds are incremental.
3. `run.sh` runs the selected module shard(s) with `build/sbt`, using the same
   environment and arguments as the `spark_sql_test.yml` workflow.

Only Spark 4.1 is supported for now. The CI workflow's optional Comet
fallback-reason log collection (`workflow_dispatch`) is not reproduced.
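
The cache purge in step 1 can be tried safely on a throwaway directory. The sketch below is a simplified version of the loop in `run.sh`: it uses `grep -E` in place of the script's basic-regex alternation, removes only the POM (the script also removes checksum and bookkeeping files), and the `a`/`b` artifact names are made up for the illustration.

```sh
# Throwaway directory standing in for ~/.m2/repository (illustration only).
repo="$(mktemp -d)"
mkdir -p "$repo/a/1.0" "$repo/b/1.0"

# Artifact "a": the POM declares a JAR, but the JAR was never downloaded.
echo '<project><packaging>jar</packaging></project>' > "$repo/a/1.0/a-1.0.pom"
# Artifact "b": POM and JAR are both present, so it must be left alone.
echo '<project><packaging>jar</packaging></project>' > "$repo/b/1.0/b-1.0.pom"
: > "$repo/b/1.0/b-1.0.jar"

find "$repo" -name '*.pom' | while read -r pom; do
  jar="${pom%.pom}.jar"
  [ -f "$jar" ] && continue     # complete entry: keep it
  grep -Eq '<packaging>(jar|bundle)</packaging>' "$pom" || continue
  rm -f "$pom"                  # partial entry: drop it so it is re-fetched
done

remaining="$(find "$repo" -name '*.pom' | wc -l | tr -d ' ')"
echo "POMs remaining: $remaining"   # prints POMs remaining: 1
rm -rf "$repo"
```

Only the POM without a matching JAR is deleted, which is what lets sbt's resolver fetch the full artifact remotely instead of failing on the local partial entry.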

dev/ci/spark-sql-tests/config.sh

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Shared configuration for the local Spark SQL test scripts. This file is
# sourced by setup-spark.sh and run.sh; it is not meant to be run directly.
#
# The variables below are consumed by the sourcing scripts, so shellcheck
# cannot see their use when checking this file in isolation.
# shellcheck disable=SC2034

# --- Spark version under test ----------------------------------------------
SPARK_VERSION="4.1.1"
SPARK_SHORT="4.1"

# Git ref checked out for the Spark sources. Defaults to the released tag.
SPARK_REF="${SPARK_REF:-v${SPARK_VERSION}}"

# JDK major version the CI workflow uses for this Spark version.
REQUIRED_JDK="17"

# --- Paths -----------------------------------------------------------------
# Persistent apache/spark checkout. Reused across runs to avoid re-cloning.
COMET_SPARK_DIR="${COMET_SPARK_DIR:-$HOME/.cache/datafusion-comet/apache-spark}"

# Directory containing these scripts, and the Comet repository root.
COMET_SQL_TEST_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
COMET_REPO_ROOT="$(git -C "$COMET_SQL_TEST_DIR" rev-parse --show-toplevel)"

# --- sbt / locale ----------------------------------------------------------
# sbt heap size in MB. Higher than CI's 3072 since local machines are not
# constrained to 7 GB GitHub runners.
SBT_MEM="${SBT_MEM:-4096}"

# Locale for the sbt run. CI uses C.UTF-8; macOS users may need en_US.UTF-8.
export LC_ALL="${LC_ALL:-C.UTF-8}"

# --- Module shards ---------------------------------------------------------
# The seven module shards, copied verbatim from
# .github/workflows/spark_sql_test.yml. Order matches the CI matrix.
SPARK_SQL_MODULES=(
  catalyst
  sql_core-1
  sql_core-2
  sql_core-3
  sql_hive-1
  sql_hive-2
  sql_hive-3
)

# module_sbt_args <module>
#   Echoes the single build/sbt argument for the given module shard.
#   Returns non-zero for an unknown module.
module_sbt_args() {
  case "$1" in
    catalyst)
      echo 'catalyst/test' ;;
    sql_core-1)
      echo 'sql/testOnly * -- -l org.apache.spark.tags.ExtendedSQLTest -l org.apache.spark.tags.SlowSQLTest' ;;
    sql_core-2)
      echo 'sql/testOnly * -- -n org.apache.spark.tags.ExtendedSQLTest' ;;
    sql_core-3)
      echo 'sql/testOnly * -- -n org.apache.spark.tags.SlowSQLTest' ;;
    sql_hive-1)
      echo 'hive/testOnly * -- -l org.apache.spark.tags.ExtendedHiveTest -l org.apache.spark.tags.SlowHiveTest' ;;
    sql_hive-2)
      echo 'hive/testOnly * -- -n org.apache.spark.tags.ExtendedHiveTest' ;;
    sql_hive-3)
      echo 'hive/testOnly * -- -n org.apache.spark.tags.SlowHiveTest' ;;
    *)
      return 1 ;;
  esac
}
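
A short sketch of how a caller might use `module_sbt_args`, relying on its exit status to reject unknown shards. The function body here is an abridged copy (two shards only) so that the snippet runs on its own, and `describe_module` is a hypothetical helper that is not part of the commit.

```bash
# Abridged copy of module_sbt_args from config.sh (two shards only), so the
# snippet is self-contained. A real caller would source config.sh instead.
module_sbt_args() {
  case "$1" in
    catalyst)   echo 'catalyst/test' ;;
    sql_core-2) echo 'sql/testOnly * -- -n org.apache.spark.tags.ExtendedSQLTest' ;;
    *)          return 1 ;;
  esac
}

# describe_module is a hypothetical helper for this illustration.
describe_module() {
  if args="$(module_sbt_args "$1")"; then
    echo "$1 -> $args"
  else
    echo "unknown module: $1" >&2
    return 1
  fi
}

describe_module catalyst      # prints: catalyst -> catalyst/test
describe_module sql_core-2
describe_module nope || echo "rejected"
```

Keeping the whole sbt command in one string matters because it is later passed to `build/sbt` as a single quoted argument, so the `*` and the `--` separator reach sbt unchanged rather than being expanded or split by the shell.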

dev/ci/spark-sql-tests/run.sh

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Runs Apache Spark's SQL test suites locally with Comet enabled, reproducing
# the spark_sql_test.yml GitHub Actions workflow for Spark 4.1.
#
# -e is intentionally not set: when running all module shards, one failing
# shard must not stop the rest. Build and setup failures are checked
# explicitly below.

set -uo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=config.sh
source "$SCRIPT_DIR/config.sh"

usage() {
  cat <<EOF
Usage: $(basename "$0") [module]

Run Apache Spark SQL test suites locally with Comet enabled (Spark $SPARK_VERSION).

Arguments:
  module    One of: ${SPARK_SQL_MODULES[*]}
            or 'all' to run every shard sequentially (default).

Environment variables:
  SKIP_BUILD=1         Skip the Comet build; reuse existing artifacts.
  SKIP_SPARK_SETUP=1   Skip the Spark clone/reset/diff step.
  COMET_SPARK_DIR      Spark checkout path (default: \$HOME/.cache/datafusion-comet/apache-spark).
  SPARK_REF            Git ref for the Spark sources (default: v$SPARK_VERSION).
  SBT_MEM              sbt heap size in MB (default: 4096).
  LC_ALL               Locale for the sbt run (default: C.UTF-8; use en_US.UTF-8 on macOS).
EOF
}

module="${1:-all}"
case "$module" in
  -h|--help) usage; exit 0 ;;
esac

# Resolve the list of modules to run.
modules_to_run=()
if [ "$module" = "all" ]; then
  modules_to_run=("${SPARK_SQL_MODULES[@]}")
elif module_sbt_args "$module" >/dev/null 2>&1; then
  modules_to_run=("$module")
else
  echo "ERROR: unknown module '$module'" >&2
  echo >&2
  usage >&2
  exit 1
fi

# --- JDK version check (warning only) --------------------------------------
jdk_version="$(java -version 2>&1 | head -n1 | sed -E 's/.*version "([0-9]+).*/\1/')"
if [ "$jdk_version" != "$REQUIRED_JDK" ]; then
  echo "WARNING: active JDK reports major version '$jdk_version'; Spark $SPARK_VERSION CI uses JDK $REQUIRED_JDK." >&2
  echo "         Set JAVA_HOME to a JDK $REQUIRED_JDK install to match CI exactly." >&2
fi

# --- Build Comet -----------------------------------------------------------
if [ "${SKIP_BUILD:-}" = "1" ]; then
  echo "SKIP_BUILD=1: skipping Comet build."
else
  echo "Building Comet (PROFILES=-Pspark-$SPARK_SHORT make release) ..."
  if ! ( cd "$COMET_REPO_ROOT" && PROFILES="-Pspark-$SPARK_SHORT" make release ); then
    echo "ERROR: Comet build failed." >&2
    exit 1
  fi
fi

# --- Purge partial Maven cache entries -------------------------------------
# Mirrors .github/actions/setup-spark-builder/action.yaml. Comet's Maven phase
# downloads POMs for transitive artifacts whose JARs it never needs. sbt's
# Coursier resolver then treats the POM-only entry as "found locally" and
# fails on the missing JAR instead of fetching it remotely. Delete those
# partial entries so sbt re-fetches the full artifact.
maven_repo="$HOME/.m2/repository"
if [ -d "$maven_repo" ]; then
  echo "Purging partial Maven cache entries ..."
  find "$maven_repo" -name '*.pom' | while read -r pom; do
    jar="${pom%.pom}.jar"
    [ -f "$jar" ] && continue
    grep -q '<packaging>jar</packaging>\|<packaging>bundle</packaging>' "$pom" 2>/dev/null || continue
    rm -f "$pom" "${pom}.sha1" "${pom%.pom}.pom.lastUpdated" \
      "$(dirname "$pom")/_remote.repositories"
  done
fi

# --- Set up the Spark checkout ---------------------------------------------
if [ "${SKIP_SPARK_SETUP:-}" = "1" ]; then
  echo "SKIP_SPARK_SETUP=1: using the existing Spark checkout as-is."
  if [ ! -d "$COMET_SPARK_DIR/.git" ]; then
    echo "ERROR: SKIP_SPARK_SETUP=1 but no Spark checkout at $COMET_SPARK_DIR" >&2
    exit 1
  fi
else
  if ! "$SCRIPT_DIR/setup-spark.sh"; then
    echo "ERROR: Spark setup failed." >&2
    exit 1
  fi
fi

# --- Run the selected module shards ----------------------------------------
log_dir="$SCRIPT_DIR/logs"
mkdir -p "$log_dir"

results=()
overall_status=0

for m in "${modules_to_run[@]}"; do
  sbt_args="$(module_sbt_args "$m")"
  log_file="$log_dir/${m}.log"
  echo
  echo "=================================================================="
  echo "Module: $m"
  echo "sbt args: $sbt_args"
  echo "Log file: $log_file"
  echo "=================================================================="

  # Stale Parquet cache workaround (mirrors spark_sql_test.yml).
  rm -rf "$maven_repo/org/apache/parquet"

  (
    cd "$COMET_SPARK_DIR" || exit 1
    NOLINT_ON_COMPILE=true \
    ENABLE_COMET=true \
    ENABLE_COMET_ONHEAP=true \
    ENABLE_COMET_LOG_FALLBACK_REASONS=false \
    SERIAL_SBT_TESTS=1 \
      build/sbt -Dsbt.log.noformat=true -mem "$SBT_MEM" \
      'set Global / concurrentRestrictions := Seq(Tags.limit(Tags.ForkedTestGroup, 1))' \
      "$sbt_args"
  ) 2>&1 | tee "$log_file"
  status="${PIPESTATUS[0]}"

  if [ "$status" -eq 0 ]; then
    results+=("PASS $m")
  else
    results+=("FAIL $m (sbt exit $status)")
    overall_status=1
  fi
done

# --- Summary ---------------------------------------------------------------
echo
echo "=================================================================="
echo "Spark SQL test summary (Spark $SPARK_VERSION)"
echo "=================================================================="
for line in "${results[@]}"; do
  echo " $line"
done
echo "Logs written to: $log_dir"
exit "$overall_status"
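
The shard loop reads `PIPESTATUS[0]` because the sbt output is piped through `tee`, and a plain `$?` would report the exit status of `tee` rather than of sbt. The bash sketch below isolates that pattern. `run_shard` is a hypothetical stand-in for the `build/sbt` invocation, and the exit codes are invented for the illustration.

```bash
#!/bin/bash
# run_shard is a hypothetical stand-in for the build/sbt invocation: it
# prints one line and returns the exit code given as its second argument.
run_shard() {
  echo "running $1"
  return "$2"
}

results=()
overall_status=0
log_dir="$(mktemp -d)"

for spec in "catalyst 0" "sql_core-1 3"; do
  read -r name code <<<"$spec"
  # $? after this pipeline would be tee's status (normally 0), so the
  # first pipeline element's status is read from PIPESTATUS instead.
  run_shard "$name" "$code" 2>&1 | tee "$log_dir/$name.log"
  status="${PIPESTATUS[0]}"
  if [ "$status" -eq 0 ]; then
    results+=("PASS $name")
  else
    results+=("FAIL $name (exit $status)")
    overall_status=1
  fi
done

printf '%s\n' "${results[@]}"
rm -rf "$log_dir"
```

Recording a per-shard status and continuing, instead of aborting under `set -e`, is what allows a full `all` run to report every failing shard in one pass.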
