chore: add optional CI flow for parquet writes #4696

Open
comphead wants to merge 2 commits into apache:main from comphead:writer_tests

comphead commented Jun 20, 2026

Which issue does this PR close?

Closes #3209.

Rationale for this change

Adds a new workflow, .github/workflows/spark_sql_writer_tests.yml, that is triggered only by workflow_dispatch. It runs a focused set of upstream Spark Parquet writer suites with ENABLE_COMET_WRITE=true (spark.comet.parquet.write.enabled).
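Since the workflow is workflow_dispatch only, it has to be started by hand. A minimal sketch of doing that from the GitHub CLI follows. It assumes `gh` is installed and authenticated, that the workflow exposes a `spark-version` input (as the `inputs.spark-version` references in the diff suggest), and that the ref you pass contains the workflow file. The function name and the `DRY_RUN` switch are hypothetical helpers for illustration, not part of this PR.

```shell
# Sketch only: dispatch the writer-test workflow for one Spark minor version.
# Hypothetical helper; `spark-version` is assumed to be the workflow input name.
dispatch_writer_tests() {
  spark_version="$1"
  ref="${2:-main}"   # branch or tag that holds the workflow file (assumption)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of calling gh, useful for checking arguments.
    echo "gh workflow run spark_sql_writer_tests.yml --ref $ref -f spark-version=$spark_version"
  else
    gh workflow run spark_sql_writer_tests.yml --ref "$ref" -f "spark-version=$spark_version"
  fi
}
```

For example, `DRY_RUN=1 dispatch_writer_tests 4.1` prints the command that would be run for Spark 4.1 without contacting GitHub.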

What changes are included in this PR?

How are these changes tested?

Comment on lines +58 to +154
name: spark-sql-writer/spark-${{ inputs.spark-version }}
runs-on: ubuntu-24.04
container:
  image: amd64/rust
steps:
  - uses: actions/checkout@v7

  - name: Resolve Spark full version and JDK
    id: resolve
    shell: bash
    run: |
      # Map each supported Spark minor version to its full version + JDK.
      # Mirrors ci.yml's per-version reusable invocations (default-on PR
      # versions only; 3.4 and 4.0 are label-gated and not offered here).
      case "${{ inputs.spark-version }}" in
        3.5) spark_full=3.5.8; java=17 ;;
        4.1) spark_full=4.1.2; java=17 ;;
        *) echo "Unsupported spark-version: ${{ inputs.spark-version }}" >&2; exit 1 ;;
      esac
      echo "spark-full=$spark_full" >> "$GITHUB_OUTPUT"
      echo "java=$java" >> "$GITHUB_OUTPUT"

  - name: Setup Rust & Java toolchain
    uses: ./.github/actions/setup-builder
    with:
      rust-version: ${{ env.RUST_VERSION }}
      jdk-version: ${{ steps.resolve.outputs.java }}

  - name: Restore Cargo cache
    uses: actions/cache/restore@v5
    with:
      path: |
        ~/.cargo/registry
        ~/.cargo/git
        native/target
      key: ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-${{ hashFiles('native/**/*.rs') }}
      restore-keys: |
        ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-

  - name: Build native library (CI profile)
    run: |
      cd native
      cargo build --profile ci
    env:
      RUSTFLAGS: "-Ctarget-cpu=x86-64-v3 -Clink-arg=-fuse-ld=bfd"

  - name: Stage native library at release path
    run: |
      # setup-spark-builder's `mvnw install -DskipTests` (skip-native-build
      # path) bundles native/target/release/libcomet.so into the Comet JAR.
      # We built with --profile ci to avoid LTO, so the file lives at
      # native/target/ci/. Copy it to where the Maven build expects it.
      mkdir -p native/target/release
      cp native/target/ci/libcomet.so native/target/release/libcomet.so

  - name: Setup Spark
    uses: ./.github/actions/setup-spark-builder
    with:
      spark-version: ${{ steps.resolve.outputs.spark-full }}
      spark-short-version: ${{ inputs.spark-version }}
      skip-native-build: true

  - name: Run Parquet writer tests
    run: |
      cd apache-spark
      rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
      # SERIAL_SBT_TESTS gates SparkParallelTestGrouping in
      # project/SparkBuild.scala. We always set it to reduce peak memory
      # on standard 7 GB runners (3.5 and 4.1 are unaffected by the
      # 4.0+JDK 21 file-stream-leak case the reusable workflow handles).
      export SERIAL_SBT_TESTS=1
      # Same forked-test-JVM caps as sql_core-* in spark_sql_test_reusable.yml.
      export HEAP_SIZE=3g
      export METASPACE_SIZE=1g
      NOLINT_ON_COMPILE=true ENABLE_COMET=true ENABLE_COMET_ONHEAP=true ENABLE_COMET_WRITE=true \
        build/sbt -Dsbt.log.noformat=true -mem $SBT_MEM \
        'set Global / concurrentRestrictions := Seq(Tags.limit(Tags.ForkedTestGroup, 1))' \
        "sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite org.apache.spark.sql.execution.datasources.parquet.ParquetCommitterSuite org.apache.spark.sql.execution.datasources.parquet.ParquetEncodingSuite org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite org.apache.spark.sql.execution.datasources.parquet.ParquetV1QuerySuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2QuerySuite org.apache.spark.sql.execution.datasources.parquet.ParquetV1PartitionDiscoverySuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2PartitionDiscoverySuite org.apache.spark.sql.execution.datasources.parquet.ParquetFieldIdIOSuite org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaSuite org.apache.spark.sql.execution.datasources.FileFormatWriterSuite org.apache.spark.sql.sources.PartitionedWriteSuite"
    env:
      LC_ALL: "C.UTF-8"
      # Cap SBT orchestrator heap so the freed RAM goes to the forked test
      # JVM and OS/container overhead, fixing cgroup-OOM SIGKILLs under
      # 7 GB runners.
      SBT_MEM: "1024"
      # G1GC + tuning for the SBT orchestrator JVM. -Xss4m replaces the
      # launcher's -Xss64m default (no compile here, deep recursion not
      # needed). UseStringDeduplication and MaxMetaspaceSize cap real and
      # ceiling footprint. ExitOnOutOfMemoryError fails fast.
      SBT_OPTS: >-
        -Xss4m
        -XX:+UseG1GC
        -XX:+UseStringDeduplication
        -XX:MaxMetaspaceSize=384m
        -XX:G1HeapRegionSize=2m
        -XX:InitiatingHeapOccupancyPercent=35
        -XX:+ParallelRefProcEnabled
        -XX:+ExitOnOutOfMemoryError
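
As a back-of-the-envelope check on the memory comments in the last step, the explicit caps can be summed against the runner size. This is a rough sketch under stated assumptions: "7 GB" is taken as 7168 MiB, SBT_MEM is treated as the orchestrator heap cap (as the comment describes), and HEAP_SIZE and METASPACE_SIZE are treated as the forked test JVM caps. Thread stacks, JIT code cache, direct buffers, and Comet's native allocations are not counted, so real headroom is smaller than the figure computed here.

```shell
# Rough memory budget for the caps set in the test step (values in MiB).
# Assumption: a "7 GB" runner means 7168 MiB; only explicit caps are summed.
runner_mib=7168
sbt_heap_mib=1024        # SBT_MEM: "1024"
sbt_metaspace_mib=384    # -XX:MaxMetaspaceSize=384m in SBT_OPTS
test_heap_mib=3072       # HEAP_SIZE=3g
test_metaspace_mib=1024  # METASPACE_SIZE=1g

capped_mib=$((sbt_heap_mib + sbt_metaspace_mib + test_heap_mib + test_metaspace_mib))
headroom_mib=$((runner_mib - capped_mib))
echo "capped=${capped_mib} MiB headroom=${headroom_mib} MiB"
```

Under these assumptions the capped total is 5504 MiB, which leaves about 1664 MiB for the OS, container overhead, and uncounted JVM and native memory.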
Development

Successfully merging this pull request may close these issues.

Run Spark tests with Comet native writer