fix: decline native V1 scans on object_store-unsupported filesystem schemes #4525
Open
schenksj wants to merge 2 commits into
…chemes

Comet's native readers go through object_store, which only understands a fixed set of URL schemes. A custom Hadoop FileSystem (e.g. registered via spark.hadoop.fs.<scheme>.impl) crashes the native reader at execution with "Generic URL error: Unable to recognise URL", with no graceful recovery. Decline such scans at planning time so Spark's Hadoop-FS-aware reader handles them.

Whether object_store recognizes a scheme is answered by the native layer itself (NativeBase.isObjectStoreSchemeSupported, backed by object_store's ObjectStoreScheme::parse -- the same path prepare_object_store_with_configs uses) rather than a hardcoded list, so the planner can't drift from object_store's actual support. The user's libhdfs scheme config (spark.hadoop.fs.comet.libhdfs.schemes) is unioned in on the JVM side; results are cached per scheme; if native can't be consulted the scheme is assumed supported rather than over-restricting.

Adds CometScanSchemeFallbackSuite, which asserts a `fake://` scan falls back to Spark; it fails without the gate (Comet claims the scan) and passes with it.

Closes apache#4520

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Which issue does this PR close?
Closes #4520.
Rationale for this change
Comet's native readers go through `object_store`, which only understands a fixed set of URL schemes. When a scan's path uses a custom Hadoop `FileSystem` scheme (e.g. registered via `spark.hadoop.fs.<scheme>.impl`), the native reader fails at execution with `Generic URL error: Unable to recognise URL "..."`, and there is no graceful recovery once native execution has started. This was surfaced by Delta tables opened with custom filesystem options (`DeltaTable.forPath(spark, path, fsOptions)`), where Delta reads its internal `_delta_log/*.checkpoint.parquet` via ordinary V1 parquet scans that Comet then claimed and crashed on, but it reproduces for any V1 parquet scan on such a scheme.
What changes are included in this PR?
`CometScanRule` declines a V1 native scan when its root-path scheme isn't natively readable, so Spark's Hadoop-FS-aware reader handles it. Rather than hardcode the object_store-supported scheme set in the planner (a mirror that drifts), the answer comes from the native layer itself: a new `NativeBase.isObjectStoreSchemeSupported` JNI method backed by `object_store`'s own `ObjectStoreScheme::parse`, the same path `prepare_object_store_with_configs` dispatches through. The user's libhdfs scheme config (`spark.hadoop.fs.comet.libhdfs.schemes`) is unioned in on the JVM side; results are cached per scheme; and if native can't be consulted the scheme is assumed supported rather than over-restricting.
How are these changes tested?
`CometScanSchemeFallbackSuite` registers `FakeHDFSFileSystem` for a `fake://` scheme (not routed through libhdfs) and applies `CometScanRule` to the scan's physical plan. It asserts the scan falls back to Spark (no `CometScanExec`). The test fails without the gate (Comet claims the `fake://` scan) and passes with it. The libhdfs-scheme regression guard (`ParquetReadFromFakeHadoopFsSuite`) continues to engage Comet for configured libhdfs schemes.
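
For reviewers who want the gate's decision order at a glance, here is a minimal Java sketch of the behavior described under "What changes are included in this PR?". It is not the code in this PR. The class `SchemeGate`, its method names, and the `nativeProbe` predicate (a stand-in for the `NativeBase.isObjectStoreSchemeSupported` JNI call) are hypothetical. Treating a scheme-less path as supported, and caching the fail-open answer, are also assumptions made for illustration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch of a planning-time scheme gate; names are illustrative only. */
final class SchemeGate {
    // Stand-in for the JNI probe; the real call is not reproduced here.
    private final Predicate<String> nativeProbe;
    // Lowercase schemes the user routed through libhdfs
    // (the value of spark.hadoop.fs.comet.libhdfs.schemes, already split).
    private final Set<String> libhdfsSchemes;
    // One native lookup per distinct scheme.
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

    SchemeGate(Predicate<String> nativeProbe, Set<String> libhdfsSchemes) {
        this.nativeProbe = nativeProbe;
        this.libhdfsSchemes = libhdfsSchemes;
    }

    /** Returns false when the scan should be declined and left to Spark's reader. */
    boolean isNativelyReadable(URI rootPath) {
        String scheme = rootPath.getScheme();
        if (scheme == null) {
            return true; // assumption: a scheme-less path is treated as supported
        }
        String key = scheme.toLowerCase(Locale.ROOT);
        if (libhdfsSchemes.contains(key)) {
            return true; // union with the user's libhdfs schemes, no native call
        }
        return cache.computeIfAbsent(key, this::probe);
    }

    private boolean probe(String scheme) {
        try {
            return nativeProbe.test(scheme);
        } catch (RuntimeException | LinkageError e) {
            // Fail open: if native cannot be consulted, assume the scheme is supported.
            return true;
        }
    }
}
```

In this sketch a planner rule would call `isNativelyReadable` on each root path and skip the native scan when it returns `false`. Because a failed probe resolves to `true`, a broken or missing native library leaves behavior as it was before the gate instead of forcing every scan back to Spark.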