fix: decline native V1 scans on object_store-unsupported filesystem schemes #4525

Open
schenksj wants to merge 2 commits into apache:main from schenksj:fix/scheme-gate-object-store

@schenksj

Which issue does this PR close?

Closes #4520.

Rationale for this change

Comet's native readers go through object_store, which only understands a fixed set of URL schemes. When a scan's path uses a custom Hadoop FileSystem scheme (e.g. one registered via spark.hadoop.fs.<scheme>.impl), the native reader fails at execution with Generic URL error: Unable to recognise URL "...", and there is no graceful recovery once native execution has started. This was surfaced by Delta tables opened with custom filesystem options (DeltaTable.forPath(spark, path, fsOptions)): Delta reads its internal _delta_log/*.checkpoint.parquet files via ordinary V1 parquet scans, which Comet claimed and then crashed on. The failure reproduces for any V1 parquet scan on such a scheme.

What changes are included in this PR?

CometScanRule declines a V1 native scan when its root-path scheme isn't natively readable, so Spark's Hadoop-FS-aware reader handles it. Rather than hardcode the object_store-supported scheme set in the planner (a mirror that would drift), the answer comes from the native layer itself: a new NativeBase.isObjectStoreSchemeSupported JNI method backed by object_store's own ObjectStoreScheme::parse, the same path prepare_object_store_with_configs dispatches through. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is unioned in on the JVM side, and results are cached per scheme. If the native layer can't be consulted, the scheme is assumed supported rather than over-restricting.
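
For reviewers who want the decision order in one place, here is a minimal Java sketch of the gate described above. It is not the code in this PR. The class `SchemeGate`, its method names, and the injected `nativeProbe` (a stand-in for the NativeBase.isObjectStoreSchemeSupported JNI call) are hypothetical. It also assumes that scheme-less paths are left alone and that a fail-open answer is cached like any other, neither of which the description specifies.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch of a planner-side scheme gate; not Comet's actual implementation. */
public final class SchemeGate {
    // Stand-in for the JNI call NativeBase.isObjectStoreSchemeSupported(scheme).
    private final Predicate<String> nativeProbe;
    // Lower-cased schemes taken from spark.hadoop.fs.comet.libhdfs.schemes (assumed pre-parsed).
    private final Set<String> libhdfsSchemes;
    // One cached answer per scheme, so the native layer is asked at most once per scheme.
    private final ConcurrentHashMap<String, Boolean> cache = new ConcurrentHashMap<>();

    public SchemeGate(Predicate<String> nativeProbe, Set<String> libhdfsSchemes) {
        this.nativeProbe = nativeProbe;
        this.libhdfsSchemes = libhdfsSchemes;
    }

    /** Returns false when the planner should decline the native scan and fall back to Spark. */
    public boolean isNativelyReadable(String rootPath) {
        // URI.create throws IllegalArgumentException on malformed input; callers are
        // assumed to pass an already well-formed root path.
        String scheme = URI.create(rootPath).getScheme();
        if (scheme == null) {
            return true; // assumption: scheme-less paths are not this gate's concern
        }
        return cache.computeIfAbsent(scheme.toLowerCase(Locale.ROOT), this::probe);
    }

    private boolean probe(String scheme) {
        if (libhdfsSchemes.contains(scheme)) {
            return true; // user-configured libhdfs schemes are unioned in on the JVM side
        }
        try {
            return nativeProbe.test(scheme);
        } catch (RuntimeException | LinkageError e) {
            // Native layer could not be consulted: assume supported rather than over-restrict.
            return true;
        }
    }
}
```

With a probe that accepts only `s3` and `file` and a libhdfs set containing `hdfs`, this sketch reports `fake://host/t` as not natively readable and `hdfs://nn/t` as readable. A probe that throws `UnsatisfiedLinkError` yields "readable" for every scheme, which mirrors the fail-open behaviour described above.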

How are these changes tested?

CometScanSchemeFallbackSuite registers FakeHDFSFileSystem for a fake:// scheme (not routed through libhdfs) and applies CometScanRule to the scan's physical plan. It asserts the scan falls back to Spark (no CometScanExec). The test fails without the gate (Comet claims the fake:// scan) and passes with it. The libhdfs-scheme regression guard (ParquetReadFromFakeHadoopFsSuite) continues to engage Comet for configured libhdfs schemes.
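
The fallback assertion amounts to "no node of the Comet scan type appears anywhere in the rewritten plan". The self-contained Java sketch below illustrates that check on a toy plan tree. `PlanFallbackCheck` and its `Node` class are hypothetical stand-ins for Spark's physical plan, used only to show the shape of the check; the real suite operates on actual Spark plan nodes.

```java
import java.util.List;

/** Hypothetical illustration of a "no CometScanExec in the plan" check on a toy tree. */
public final class PlanFallbackCheck {
    /** Toy plan node: an operator name plus child nodes (not Spark's SparkPlan). */
    public static final class Node {
        final String operator;
        final List<Node> children;

        public Node(String operator, List<Node> children) {
            this.operator = operator;
            this.children = children;
        }
    }

    /** Depth-first search: true if any node in the tree has the given operator name. */
    public static boolean containsOperator(Node plan, String operator) {
        if (plan.operator.equals(operator)) {
            return true;
        }
        for (Node child : plan.children) {
            if (containsOperator(child, operator)) {
                return true;
            }
        }
        return false;
    }
}
```

In this toy model, a plan whose only scan node is named `FileSourceScanExec` makes `containsOperator(plan, "CometScanExec")` return false, which corresponds to the scan having fallen back to Spark.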

schenksj and others added 2 commits May 29, 2026 20:44
…chemes

Comet's native readers go through object_store, which only understands a fixed set
of URL schemes. A custom Hadoop FileSystem (e.g. registered via
spark.hadoop.fs.<scheme>.impl) crashes the native reader at execution with
"Generic URL error: Unable to recognise URL", with no graceful recovery. Decline
such scans at planning time so Spark's Hadoop-FS-aware reader handles them.

Whether object_store recognizes a scheme is answered by the native layer itself
(NativeBase.isObjectStoreSchemeSupported, backed by object_store's
ObjectStoreScheme::parse -- the same path prepare_object_store_with_configs uses)
rather than a hardcoded list, so the planner can't drift from object_store's actual
support. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is
unioned in on the JVM side; results are cached per scheme; if native can't be
consulted the scheme is assumed supported rather than over-restricting.

Adds CometScanSchemeFallbackSuite, which asserts a `fake://` scan falls back to
Spark; it fails without the gate (Comet claims the scan) and passes with it.

Closes apache#4520

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

CometScanRule: decline native V1 scans on object_store-unsupported filesystem schemes (fall back to Spark)