fix(gso): stop 9-judge eval SIGSEGV from concurrent Spark Connect by hiydavid · Pull Request #227 · databricks-solutions/databricks-genie-workbench

hiydavid · 2026-06-09T03:09:50Z

Problem

GSO's baseline_eval task crashed the Python kernel with a fatal SIGSEGV
(exit code 139, "The Python kernel is unresponsive") during Step 2b: Run
9-Judge Evaluation. It was reproducible across serverless and dedicated
compute, and reducing target_benchmark_count did not help.

Root cause

The 9-judge evaluation runs scorers across an 8-worker thread pool
(scorer_workers=8). Of the 9 judges, only syntax_validity touches
Spark — it called spark.sql("USE CATALOG …") / spark.sql("EXPLAIN …")
directly inside its scoring closure. Under the thread pool, up to 8 threads
drove the shared Spark Connect session concurrently, mutating session
state and issuing EXPLAINs over the same gRPC channel. Spark Connect's client
and the underlying gRPC / pyarrow C extensions are not safe under that
concurrency, producing a native segfault.
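
To make the fan-out concrete, here is a minimal sketch of the access pattern described above. It does not use Spark at all: `FakeSharedSession` is a hypothetical stand-in for the shared Spark Connect session, and the barrier exists only to make the overlap deterministic. It shows that an 8-worker pool calling into one shared object puts 8 callers inside it at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

SCORER_WORKERS = 8  # mirrors scorer_workers=8 from the description


class FakeSharedSession:
    """Hypothetical stand-in for the shared session; not a Spark API."""

    def __init__(self, parties: int) -> None:
        self._guard = threading.Lock()  # protects the counters only
        self._in_flight = 0
        self.peak_in_flight = 0
        # Hold each caller inside sql() until all workers have arrived,
        # so the overlap does not depend on timing.
        self._barrier = threading.Barrier(parties)

    def sql(self, statement: str) -> str:
        with self._guard:
            self._in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
        try:
            self._barrier.wait(timeout=10)
        finally:
            with self._guard:
                self._in_flight -= 1
        return f"plan for: {statement}"


def run_unserialized_fanout() -> int:
    """Drive one shared session from the pool and report peak overlap."""
    session = FakeSharedSession(parties=SCORER_WORKERS)
    with ThreadPoolExecutor(max_workers=SCORER_WORKERS) as pool:
        futures = [
            pool.submit(session.sql, f"EXPLAIN SELECT {i}")
            for i in range(SCORER_WORKERS)
        ]
        for future in futures:
            future.result()
    return session.peak_in_flight
```

With a pure-Python object this overlap is harmless; the crash described above arises because the real session sits on top of native extensions that do not tolerate it.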

Evidence:

- Preflight passes (UC metadata + data profiling use Spark single-threaded);
  only Step 2b (8-way concurrent Spark) crashes.
- Reproducible on any compute: a code-level concurrency bug, not infra.
- Reducing benchmark count doesn't help: each row still fans out 8 scorers.
- The faulthandler dump shows the main thread merely parked in the watchdog
  join; the watchdog did not time out, so the crash came from inside the eval.

Fix

- Route the scorer's EXPLAIN through the thread-safe SQL Warehouse Statement
  Execution API (_execute_sql_via_warehouse) when a warehouse_id is
  available. Each call is an independent HTTP request with no shared session.
  This mirrors the existing benchmark precheck, and surfaces planning errors
  from the returned plan column (the warehouse EXPLAIN returns a plan rather
  than throwing).
- Add a process-wide spark_serialized() lock (common/spark_concurrency.py)
  as defense-in-depth for the no-warehouse fallback, so the shared Spark Connect
  session is never driven by two threads at once.
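
For readers who want a picture of the serialization guard, the following is a minimal sketch of what a process-wide lock like spark_serialized() could look like. The actual contents of common/spark_concurrency.py are not shown in this description, so the context-manager form, the use of a re-entrant lock, and the `peak_concurrency_under_lock` demo helper are all assumptions made for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

# Assumption: one module-level lock shared by every caller in the process.
# An RLock is used here so nested use on the same thread cannot deadlock.
_SPARK_LOCK = threading.RLock()


@contextmanager
def spark_serialized():
    """Allow only one thread at a time inside the guarded block."""
    with _SPARK_LOCK:
        yield


def peak_concurrency_under_lock(workers: int = 8, calls: int = 32) -> int:
    """Demo helper (hypothetical): measure overlap inside the guard."""
    guard = threading.Lock()  # protects the counters only
    state = {"in_flight": 0, "peak": 0}

    def fake_spark_call(_: int) -> None:
        with spark_serialized():
            with guard:
                state["in_flight"] += 1
                state["peak"] = max(state["peak"], state["in_flight"])
            time.sleep(0.001)  # stand-in for the real Spark round trip
            with guard:
                state["in_flight"] -= 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_spark_call, range(calls)))
    return state["peak"]
```

The trade-off is throughput: every guarded call runs one at a time, which is acceptable for a fallback path but is a reason to prefer the warehouse route when it is available.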

EXPLAIN's full catalog-aware validation (unresolved columns/tables/functions,
not just syntax) is preserved — the warehouse path runs the same EXPLAIN.
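
The warehouse-first routing can be sketched roughly as below. This is not the PR's _explain_sql(): `run_on_warehouse` and `run_on_spark` are hypothetical callables standing in for _execute_sql_via_warehouse and spark.sql, whose real signatures are not shown here, and `PlanningError` is an illustrative exception type. The error marker string is also an assumption about how a failed plan is reported in the returned plan text.

```python
import threading
from typing import Callable, Optional

# Assumption: a failed EXPLAIN reports the failure inside the plan text
# with this prefix instead of raising.
PLANNING_ERROR_MARKER = "Error occurred during query planning"

# Local lock for the fallback path in this sketch only.
_fallback_lock = threading.Lock()


class PlanningError(Exception):
    """Illustrative error type for a plan that reports a failure."""


def explain_sql(
    sql: str,
    *,
    warehouse_id: Optional[str],
    run_on_warehouse: Callable[[str, str], str],
    run_on_spark: Callable[[str], str],
) -> str:
    """Return the plan text, preferring the warehouse when one is set."""
    statement = f"EXPLAIN {sql}"
    if warehouse_id:
        # Independent request per call, so no shared session is touched.
        plan = run_on_warehouse(statement, warehouse_id)
        if PLANNING_ERROR_MARKER in plan:
            # The warehouse returns the error as plan text; surface it.
            raise PlanningError(plan)
        return plan
    # No warehouse (None or empty string): serialize the shared session.
    with _fallback_lock:
        return run_on_spark(statement)
```

Treating an empty warehouse_id the same as a missing one keeps the fallback reachable when warehouse resolution returns nothing.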

Changes

- common/spark_concurrency.py (new): module-level lock + spark_serialized().
- scorers/syntax_validity.py: new _explain_sql() helper (warehouse-first,
  serialized-Spark fallback); factory gains w + warehouse_id.
- scorers/__init__.py: make_all_scorers(...) gains warehouse_id, threaded
  into the syntax scorer.
- harness.py: all 3 make_all_scorers call sites pass
  warehouse_id=resolve_warehouse_id("").
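
As a rough illustration of how warehouse_id is threaded from the factory into the scorer closure, consider the sketch below. The function names carry a `_sketch` suffix because they are hypothetical simplifications: the real factories also take w and other arguments not shown in this description, and `explain` stands in for the _explain_sql() helper with an assumed signature.

```python
from typing import Callable, List, Optional

# Simplified scorer type for this sketch: SQL text in, "yes"/"no" out.
Scorer = Callable[[str], str]
# Assumed shape of the explain helper: (sql, warehouse_id) -> plan text,
# raising on invalid SQL.
Explain = Callable[[str, Optional[str]], str]


def make_syntax_validity_scorer_sketch(
    explain: Explain, warehouse_id: Optional[str]
) -> Scorer:
    """Build a scorer whose closure captures warehouse_id once."""

    def score(sql: str) -> str:
        try:
            explain(sql, warehouse_id)
        except Exception:
            return "no"  # EXPLAIN rejected the statement
        return "yes"

    return score


def make_all_scorers_sketch(
    explain: Explain, warehouse_id: Optional[str] = None
) -> List[Scorer]:
    """Pass warehouse_id down to the one scorer that needs it."""
    return [make_syntax_validity_scorer_sketch(explain, warehouse_id)]
```

Capturing warehouse_id at construction time means worker threads never resolve it themselves; each scoring call only reads a value fixed before the pool starts.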

Verification

No local dev server — verify by deploying (./scripts/deploy.sh --update) and
running an Auto-Optimize pass against a test Genie Space:

- baseline_eval → Step 2b completes without the exit-139 SIGSEGV.
- syntax_validity still emits yes/no verdicts: valid SQL → yes, SQL with a
  bad column/function → no with the right failure type (proves the
  warehouse-routed EXPLAIN does real catalog-aware validation).
- Job log shows no concurrent Spark Connect EXPLAINs from the scorer pool.

The syntax_validity scorer issued spark.sql("EXPLAIN ...") directly on the shared Spark Connect session. MLflow runs scorers across an 8-worker thread pool, so up to 8 threads drove that session concurrently, mutating session state (USE CATALOG/SCHEMA) and issuing EXPLAINs over the same gRPC channel. Spark Connect's client and the underlying gRPC/pyarrow C extensions aren't thread-safe under that load, crashing the kernel with a native SIGSEGV (exit 139) during Step 2b of baseline_eval. Preflight passed because it uses Spark single-threaded; only the concurrent eval crashed, on any compute.

Route the scorer's EXPLAIN through the thread-safe SQL Warehouse Statement Execution API (_execute_sql_via_warehouse) when a warehouse is available. Each call is an independent HTTP request with no shared session, matching the benchmark precheck. Add a process-wide spark_serialized() lock for the no-warehouse fallback so the shared session is never driven by two threads at once. EXPLAIN's full catalog-aware validation is preserved.

hiydavid wants to merge 1 commit into main from fix/gso-syntax-validity-spark-concurrency-segfault
