fix(gso): stop 9-judge eval SIGSEGV from concurrent Spark Connect #227

Open
hiydavid wants to merge 1 commit into main from fix/gso-syntax-validity-spark-concurrency-segfault

hiydavid (Collaborator) commented Jun 9, 2026

Problem

GSO's baseline_eval task crashed the Python kernel with a fatal SIGSEGV
(exit code 139, "The Python kernel is unresponsive") during Step 2b: Run
9-Judge Evaluation. It was reproducible across serverless and dedicated
compute, and reducing target_benchmark_count did not help.

Root cause

The 9-judge evaluation runs scorers across an 8-worker thread pool
(scorer_workers=8). Of the 9 judges, only syntax_validity touches Spark:
it called spark.sql("USE CATALOG …") / spark.sql("EXPLAIN …")
directly inside its scoring closure. Under the thread pool, up to 8 threads
drove the shared Spark Connect session concurrently, mutating session
state and issuing EXPLAINs over the same gRPC channel. Spark Connect's client
and the underlying gRPC / pyarrow C extensions are not safe under that
concurrency, producing a native segfault.
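
The access pattern described above can be illustrated without Spark. This is a minimal sketch, not the scorer's real code: `FakeSharedSession` and `run_scorers` are hypothetical stand-ins, and the fake session only counts how many threads are inside `sql()` at once. It shows that an 8-worker pool really does drive one shared object from several threads at the same time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeSharedSession:
    """Hypothetical stand-in for a shared session; records peak concurrent callers."""

    def __init__(self):
        self._guard = threading.Lock()
        self._active = 0
        self.peak = 0

    def sql(self, statement):
        with self._guard:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.05)  # simulate a network round trip
        with self._guard:
            self._active -= 1
        return f"plan for: {statement}"


def run_scorers(session, n_rows, workers=8):
    """Mimic the buggy pattern: every worker calls the same session directly."""

    def score(i):
        session.sql("USE CATALOG demo")  # mutates shared session state
        return session.sql(f"EXPLAIN SELECT {i}")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, range(n_rows)))
```

After `run_scorers(session, n_rows=16)`, `session.peak` is greater than 1, which is the condition the real Spark Connect client could not tolerate.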

Evidence:

  • Preflight passes (UC metadata + data profiling use Spark single-threaded);
    only Step 2b (8-way concurrent Spark) crashes.
  • Reproducible on any compute — a code-level concurrency bug, not infra.
  • Reducing benchmark count doesn't help — each row still fans out 8 scorers.
  • The faulthandler dump shows the main thread merely parked in the watchdog
    join; the watchdog did not time out, so the crash came from inside the eval.
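
For readers unfamiliar with the faulthandler evidence, the sketch below shows how a per-thread dump is produced with the standard library. It is an illustration only; `dump_all_threads` and `dump_from_worker` are hypothetical helper names, not functions from this PR. The dump lists one stack per thread, which is how a parked main thread (typically sitting in `join`) can be told apart from the thread doing the work.

```python
import faulthandler
import tempfile
import threading


def dump_all_threads() -> str:
    """Write every thread's traceback to a temp file and return the text."""
    with tempfile.TemporaryFile(mode="w+") as f:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


def dump_from_worker() -> str:
    """Take the dump from a worker thread while the caller waits on it."""
    captured = []
    worker = threading.Thread(target=lambda: captured.append(dump_all_threads()))
    worker.start()
    worker.join()  # the calling thread is usually parked here during the dump
    return captured[0]
```

In a real job, `faulthandler.enable()` at startup makes the interpreter emit the same kind of dump automatically when a fatal signal such as SIGSEGV arrives.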

Fix

  1. Route the scorer's EXPLAIN through the thread-safe SQL Warehouse Statement
    Execution API (_execute_sql_via_warehouse) when a warehouse_id is
    available: each call is an independent HTTP request with no shared session.
    This mirrors the existing benchmark precheck, and surfaces planning errors
    from the returned plan column (the warehouse EXPLAIN returns a plan rather
    than throwing).
  2. Add a process-wide spark_serialized() lock (common/spark_concurrency.py)
    as defense-in-depth for the no-warehouse fallback, so the shared Spark Connect
    session is never driven by two threads at once.

EXPLAIN's full catalog-aware validation (unresolved columns/tables/functions,
not just syntax) is preserved — the warehouse path runs the same EXPLAIN.
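
A process-wide lock of the kind described in item 2 could look roughly like the following. This is a sketch under assumptions: only the name `spark_serialized()` comes from the PR; the context-manager shape, the choice of `RLock`, and the `peak_concurrency` demo helper are guesses for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

# One lock per process; RLock (an assumption) lets a thread re-enter safely.
_SPARK_LOCK = threading.RLock()


@contextmanager
def spark_serialized():
    """Allow only one thread at a time to run the wrapped Spark calls."""
    with _SPARK_LOCK:
        yield


def peak_concurrency(workers=8, calls=16):
    """Demo helper: return the most threads ever inside the guarded section."""
    guard = threading.Lock()
    active = 0
    peak = 0

    def fake_spark_call(i):
        nonlocal active, peak
        with spark_serialized():
            with guard:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stands in for spark.sql(...)
            with guard:
                active -= 1
        return i

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_spark_call, range(calls)))
    return peak
```

With the lock in place, `peak_concurrency()` returns 1 even under an 8-worker pool. The cost is that fallback EXPLAINs run one at a time, which is why the warehouse path is preferred when available.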

Changes

  • common/spark_concurrency.py (new): module-level lock + spark_serialized().
  • scorers/syntax_validity.py: new _explain_sql() helper (warehouse-first,
    serialized-Spark fallback); factory gains w + warehouse_id.
  • scorers/__init__.py: make_all_scorers(...) gains warehouse_id, threaded
    into the syntax scorer.
  • harness.py: all 3 make_all_scorers call sites pass
    warehouse_id=resolve_warehouse_id("").
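
The warehouse-first routing with a serialized fallback might be sketched as below. None of this is the actual `_explain_sql()` code: the function name `explain_sql`, its parameters, the injected callables `run_on_warehouse` and `run_on_spark`, and the `ERROR_MARKERS` text are all assumptions made so the example runs without Databricks. The point it illustrates is the asymmetry noted in the Fix section: the warehouse path reports planning failures inside the returned plan text, while the Spark path raises.

```python
import threading
from typing import Callable, Optional, Tuple

# Assumed marker text; the real plan-column error wording may differ.
ERROR_MARKERS = ("error occurred during query planning",)


def explain_sql(
    sql: str,
    warehouse_id: Optional[str],
    run_on_warehouse: Callable[[str, str], str],  # hypothetical: (statement, id) -> plan
    run_on_spark: Callable[[str], str],           # hypothetical: statement -> plan
    spark_lock: threading.RLock,
) -> Tuple[bool, str]:
    """Return (is_valid, detail) for an EXPLAIN of the given SQL."""
    statement = f"EXPLAIN {sql}"
    if warehouse_id:
        # Independent request per call; errors arrive as plan text, not exceptions.
        plan = run_on_warehouse(statement, warehouse_id)
        failed = any(marker in plan.lower() for marker in ERROR_MARKERS)
        return (not failed, plan)
    # Fallback: shared session, so hold the process-wide lock for the call.
    with spark_lock:
        try:
            return (True, run_on_spark(statement))
        except Exception as exc:  # Spark raises on unresolved columns etc.
            return (False, str(exc))
```

Injecting the two callables keeps the routing decision testable on its own; in the real module they would presumably be bound to the warehouse helper and the Spark session.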

Verification

No local dev server — verify by deploying (./scripts/deploy.sh --update) and
running an Auto-Optimize pass against a test Genie Space:

  1. baseline_eval Step 2b completes without the exit-139 SIGSEGV.
  2. syntax_validity still emits yes/no verdicts — valid SQL → yes, SQL with a
    bad column/function → no with the right failure type (proves the
    warehouse-routed EXPLAIN does real catalog-aware validation).
  3. Job log shows no concurrent Spark Connect EXPLAINs from the scorer pool.

The syntax_validity scorer issued spark.sql("EXPLAIN ...") directly on the
shared Spark Connect session. MLflow runs scorers across an 8-worker thread
pool, so up to 8 threads drove that session concurrently — mutating session
state (USE CATALOG/SCHEMA) and issuing EXPLAINs over the same gRPC channel.
Spark Connect's client and the underlying gRPC/pyarrow C extensions aren't
thread-safe under that load, crashing the kernel with a native SIGSEGV
(exit 139) during Step 2b of baseline_eval. Preflight passed because it uses
Spark single-threaded; only the concurrent eval crashed, on any compute.

Route the scorer's EXPLAIN through the thread-safe SQL Warehouse Statement
Execution API (_execute_sql_via_warehouse) when a warehouse is available —
each call is an independent HTTP request with no shared session, matching the
benchmark precheck. Add a process-wide spark_serialized() lock for the
no-warehouse fallback so the shared session is never driven by two threads at
once. EXPLAIN's full catalog-aware validation is preserved.