chore: add optional CI flow for parquet writes by comphead · Pull Request #4696 · apache/datafusion-comet

comphead · 2026-06-20T04:40:37Z

Which issue does this PR close?

Closes #3209.

Rationale for this change

Adds a new workflow, .github/workflows/spark_sql_writer_tests.yml, triggered by workflow_dispatch only. It runs a focused set of upstream Spark Parquet writer suites with ENABLE_COMET_WRITE=true (which corresponds to spark.comet.parquet.write.enabled).
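
Because the workflow is workflow_dispatch only, it never runs on push or pull_request and has to be started by hand. A minimal sketch of doing that with the GitHub CLI follows. It assumes `gh` is installed and authenticated against the repository, and that the workflow file is already on the default branch (GitHub only accepts a dispatch for workflows present there). The helper name `dispatch_cmd` and the default ref `main` are made up for illustration; the `spark-version` input name and its two accepted values come from the diff below. The helper only prints the command, so it is safe to try without credentials.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the `gh workflow run` command that would
# dispatch the writer-tests workflow. Nothing is sent to GitHub here;
# run the printed command yourself to actually start a run.
dispatch_cmd() {
  local spark_version="$1"
  local ref="${2:-main}"   # assumption: dispatch from the main branch by default
  case "$spark_version" in
    3.5|4.1) ;;            # the only values the workflow's resolve step accepts
    *) echo "Unsupported spark-version: $spark_version" >&2; return 1 ;;
  esac
  printf 'gh workflow run spark_sql_writer_tests.yml --ref %s -f spark-version=%s\n' \
    "$ref" "$spark_version"
}

dispatch_cmd 3.5
```

Validating the version locally mirrors the `case` statement in the workflow, so an unsupported value fails before a runner is allocated.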

What changes are included in this PR?

How are these changes tested?

+    name: spark-sql-writer/spark-${{ inputs.spark-version }}
+    runs-on: ubuntu-24.04
+    container:
+      image: amd64/rust
+    steps:
+      - uses: actions/checkout@v7
+
+      - name: Resolve Spark full version and JDK
+        id: resolve
+        shell: bash
+        run: |
+          # Map each supported Spark minor version to its full version + JDK.
+          # Mirrors ci.yml's per-version reusable invocations (default-on PR
+          # versions only; 3.4 and 4.0 are label-gated and not offered here).
+          case "${{ inputs.spark-version }}" in
+            3.5) spark_full=3.5.8; java=17 ;;
+            4.1) spark_full=4.1.2; java=17 ;;
+            *) echo "Unsupported spark-version: ${{ inputs.spark-version }}" >&2; exit 1 ;;
+          esac
+          echo "spark-full=$spark_full" >> "$GITHUB_OUTPUT"
+          echo "java=$java" >> "$GITHUB_OUTPUT"
+
+      - name: Setup Rust & Java toolchain
+        uses: ./.github/actions/setup-builder
+        with:
+          rust-version: ${{ env.RUST_VERSION }}
+          jdk-version: ${{ steps.resolve.outputs.java }}
+
+      - name: Restore Cargo cache
+        uses: actions/cache/restore@v5
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            native/target
+          key: ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-${{ hashFiles('native/**/*.rs') }}
+          restore-keys: |
+            ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-
+
+      - name: Build native library (CI profile)
+        run: |
+          cd native
+          cargo build --profile ci
+        env:
+          RUSTFLAGS: "-Ctarget-cpu=x86-64-v3 -Clink-arg=-fuse-ld=bfd"
+
+      - name: Stage native library at release path
+        run: |
+          # setup-spark-builder's `mvnw install -DskipTests` (skip-native-build
+          # path) bundles native/target/release/libcomet.so into the Comet JAR.
+          # We built with --profile ci to avoid LTO, so the file lives at
+          # native/target/ci/. Copy it to where the Maven build expects it.
+          mkdir -p native/target/release
+          cp native/target/ci/libcomet.so native/target/release/libcomet.so
+
+      - name: Setup Spark
+        uses: ./.github/actions/setup-spark-builder
+        with:
+          spark-version: ${{ steps.resolve.outputs.spark-full }}
+          spark-short-version: ${{ inputs.spark-version }}
+          skip-native-build: true
+
+      - name: Run Parquet writer tests
+        run: |
+          cd apache-spark
+          rm -rf /root/.m2/repository/org/apache/parquet # cached Parquet artifacts must be cleared before the run (cause not yet understood)
+          # SERIAL_SBT_TESTS gates SparkParallelTestGrouping in
+          # project/SparkBuild.scala. We always set it to reduce peak memory
+          # on standard 7 GB runners (3.5 and 4.1 are unaffected by the
+          # 4.0+JDK 21 file-stream-leak case the reusable workflow handles).
+          export SERIAL_SBT_TESTS=1
+          # Same forked-test-JVM caps as sql_core-* in spark_sql_test_reusable.yml.
+          export HEAP_SIZE=3g
+          export METASPACE_SIZE=1g
+          NOLINT_ON_COMPILE=true ENABLE_COMET=true ENABLE_COMET_ONHEAP=true ENABLE_COMET_WRITE=true \
+            build/sbt -Dsbt.log.noformat=true -mem $SBT_MEM \
+              'set Global / concurrentRestrictions := Seq(Tags.limit(Tags.ForkedTestGroup, 1))' \
+              "sql/testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite org.apache.spark.sql.execution.datasources.parquet.ParquetCommitterSuite org.apache.spark.sql.execution.datasources.parquet.ParquetEncodingSuite org.apache.spark.sql.execution.datasources.parquet.ParquetCompressionCodecPrecedenceSuite org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV1Suite org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormatV2Suite org.apache.spark.sql.execution.datasources.parquet.ParquetV1QuerySuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2QuerySuite org.apache.spark.sql.execution.datasources.parquet.ParquetV1PartitionDiscoverySuite org.apache.spark.sql.execution.datasources.parquet.ParquetV2PartitionDiscoverySuite org.apache.spark.sql.execution.datasources.parquet.ParquetFieldIdIOSuite org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaSuite org.apache.spark.sql.execution.datasources.FileFormatWriterSuite org.apache.spark.sql.sources.PartitionedWriteSuite"
+        env:
+          LC_ALL: "C.UTF-8"
+          # Cap SBT orchestrator heap so the freed RAM goes to the forked test
+          # JVM and OS/container overhead, fixing cgroup-OOM SIGKILLs under
+          # 7 GB runners.
+          SBT_MEM: "1024"
+          # G1GC + tuning for the SBT orchestrator JVM. -Xss4m replaces the
+          # launcher's -Xss64m default (no compile here, deep recursion not
+          # needed). UseStringDeduplication and MaxMetaspaceSize cap real and
+          # ceiling footprint. ExitOnOutOfMemoryError fails fast.
+          SBT_OPTS: >-
+            -Xss4m
+            -XX:+UseG1GC
+            -XX:+UseStringDeduplication
+            -XX:MaxMetaspaceSize=384m
+            -XX:G1HeapRegionSize=2m
+            -XX:InitiatingHeapOccupancyPercent=35
+            -XX:+ParallelRefProcEnabled
+            -XX:+ExitOnOutOfMemoryError
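
The memory comments in the last step can be checked with some quick arithmetic. The sketch below adds up only the explicit heap and metaspace caps set in the workflow and compares them with the runner size. It treats "7 GB" as 7168 MiB, which is an assumption, and it ignores thread stacks, code cache, GC bookkeeping, and any native allocations, so the real headroom is smaller than the number it prints.

```shell
#!/usr/bin/env bash
# Rough memory budget in MiB, using only the caps set in the workflow.
runner_mb=$((7 * 1024))      # "7 GB" runner, taken as 7168 MiB (assumption)
sbt_heap_mb=1024             # SBT_MEM, passed to build/sbt via -mem
sbt_metaspace_mb=384         # -XX:MaxMetaspaceSize=384m in SBT_OPTS
test_heap_mb=$((3 * 1024))   # HEAP_SIZE=3g for the forked test JVM
test_metaspace_mb=1024       # METASPACE_SIZE=1g for the forked test JVM

jvm_caps_mb=$((sbt_heap_mb + sbt_metaspace_mb + test_heap_mb + test_metaspace_mb))
headroom_mb=$((runner_mb - jvm_caps_mb))

echo "JVM caps: ${jvm_caps_mb} MiB, headroom: ${headroom_mb} MiB"
```

Under these assumptions the two JVMs are capped at about 5.4 GiB combined, which leaves roughly 1.6 GiB for the OS, the container, and everything the caps do not cover. This is consistent with the stated goal of keeping the SBT orchestrator small so the forked test JVM gets most of the memory.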


comphead added 2 commits June 19, 2026 21:38

chore: add optional CI flow for parquet writes (b6cc53a)

chore: add optional CI flow for parquet writes (6d60e30)

github-advanced-security AI found potential problems Jun 20, 2026


comphead requested review from andygrove and coderfender June 20, 2026 04:49

comphead wants to merge 2 commits into apache:main from comphead:writer_tests
