[SPARK-54807][SQL] Allow qualified names for built-in and session functions #53570

Closed
srielau wants to merge 126 commits into apache:master from srielau:search-path

@srielau

@srielau srielau commented Dec 22, 2025

Contributor

design_sketch.md

What changes were proposed in this pull request?

  • Allow built-in functions to be referenced with the qualifier builtin or system.builtin, and temporary functions with session or system.session. Functions registered as extensions can be qualified with extension or system.extension.

  • Clean up the function-resolution APIs in preparation for a configurable path.

  • Register builtin, extension, and session functions with qualified names so they can coexist.

  • Fix a bug that allowed session functions with the same name to coexist as table and scalar functions.

Why are the changes needed?

This portion of the work allows users to explicitly pick a builtin or temporary function, the same way they would pick a persisted function by fully qualifying it.
This improves security because the user can state exactly which function is meant.
With this change, function resolution follows a fixed order:
extension -> builtin -> session -> current schema.
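
As a rough illustration of the fixed order above, here is a minimal Scala sketch. The object, the Registry type, and the current_schema label are hypothetical stand-ins chosen for this example; they are not Spark's actual classes.

```scala
// Simplified model of the fixed resolution order; not Spark's real implementation.
object ResolutionOrderSketch {
  // Namespaces in the order listed in the PR description.
  val searchOrder: Seq[String] =
    Seq("extension", "builtin", "session", "current_schema")

  // Hypothetical registry: namespace -> lower-cased function names registered there.
  type Registry = Map[String, Set[String]]

  // Return the first namespace, in search order, that defines the unqualified name.
  // Lower-casing the name is an assumption made for this sketch.
  def resolve(registry: Registry, name: String): Option[String] =
    searchOrder.find { ns =>
      registry.getOrElse(ns, Set.empty[String]).contains(name.toLowerCase)
    }
}
```

With a registry that holds concat under both builtin and session, an unqualified concat resolves to the builtin entry in this model, because builtin comes earlier in the order.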

In follow-on work we plan to make the priority of function resolution configurable, for example to push temporary functions after built-ins or even after persisted functions.
Ultimately we aim for proper SQL Standard PATH support where a user can add "libraries" of functions to the path.

Does this PR introduce any user-facing change?

You can now reference builtin functions such as concat as builtin.concat or system.builtin.concat. The same applies to temporary functions, which can be qualified with session or system.session.
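
A minimal sketch of how the qualified forms described above could be told apart, assuming case-insensitive matching on the name parts. QualifierSketch, Kind, and classify are hypothetical names for illustration and are not the PR's API; the extension qualifier is left out for brevity.

```scala
// Hypothetical classifier for multi-part function names; not Spark code.
object QualifierSketch {
  sealed trait Kind
  case object Builtin extends Kind
  case object Session extends Kind
  case object Other extends Kind // unqualified, or qualified with a catalog/schema

  // Classify a name such as Seq("system", "builtin", "concat").
  // Case-insensitive comparison is an assumption of this sketch.
  def classify(nameParts: Seq[String]): Kind =
    nameParts.map(_.toLowerCase) match {
      case Seq("builtin", _) | Seq("system", "builtin", _) => Builtin
      case Seq("session", _) | Seq("system", "session", _) => Session
      case _ => Other
    }
}
```

In this model, Seq("builtin", "concat") and Seq("system", "builtin", "concat") both classify as Builtin, while a plain Seq("concat") falls through to Other and would go through ordinary resolution.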

How was this patch tested?

A new suite, FunctionQualificationSuite.scala, has been added.

Was this patch authored or co-authored using generative AI tooling?

Yes: Claude Sonnet

@github-actions github-actions Bot added the SQL label Dec 22, 2025

- Persistent functions now cached with unqualified keys for compatibility
- Temporary functions use composite keys (session.funcName)
- Both can coexist in the same registry
- Views correctly exclude temp functions from resolution

Issue: View test still failing - needs investigation into function builder resolution

- Persistent functions now stored with qualified keys (catalog.db.func)
- Prevents conflicts when multiple databases have same function name
- Temporary functions still use composite keys (session.func)

Known issues:
- View function resolution test still failing
- Possible function listing regressions to investigate

Added extensive debug logging to understand why views capture
wrong function class. Ready for detailed tracing.

**THE BUG:**
In resolveBuiltinOrTempFunctionInternal, the 'isBuiltin' parameter was
incorrectly checking if the temp/builtin identifier existed in the
session registry, instead of checking the static FunctionRegistry.builtin.

This caused lookupTempFuncWithViewContext to treat temp functions as
builtins, bypassing view context checks and allowing temp functions
created AFTER a view to incorrectly shadow the persistent function
that the view should use.

**THE FIX:**
Changed the isBuiltin check to use FunctionRegistry.builtin.functionExists
and TableFunctionRegistry.builtin.functionExists directly, matching
master's behavior.
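
The difference between the buggy and the fixed check can be sketched in plain Scala. The sets below are simplified stand-ins for the session registry and for FunctionRegistry.builtin, and the object and method names are illustrative only.

```scala
// Simplified model of the isBuiltin check; not the actual SessionCatalog code.
object IsBuiltinSketch {
  // Stand-in for the static builtin registry.
  val staticBuiltins: Set[String] = Set("sum", "avg", "concat")

  // Buggy variant: asks the session registry, which also holds temp functions,
  // so a temp function is misclassified as a builtin.
  def isBuiltinBuggy(sessionRegistry: Set[String], name: String): Boolean =
    sessionRegistry.contains(name)

  // Fixed variant: consults only the static builtin set.
  def isBuiltinFixed(name: String): Boolean =
    staticBuiltins.contains(name)
}
```

If the session registry contains the builtins plus a temp function mydoublesum, the buggy variant reports mydoublesum as a builtin and the fixed variant does not, which is the misclassification that let the temp function bypass the view context check.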

**TEST RESULTS:**
✅ All 62 tests pass (PersistedViewTestSuite + FunctionQualificationSuite)
✅ SPARK-33692 view test now passes
✅ View correctly uses MyDoubleAvg and ignores temp MyDoubleSum

Removed leftover test scripts that were causing compilation errors:
- test_simple_function.scala
- test_view_function.scala

All code now compiles cleanly.

Analysis covers:
- Complete API surface (read/write operations)
- Current architecture and memory usage
- Three proposed optimization approaches
- Detailed feasibility assessment

KEY DISCOVERY: Internal functions already use separate static registry!
- FunctionRegistry.internal contains ~20 ML/Pandas/Connect functions
- Resolved directly, bypassing SessionCatalog
- Proves composite registry pattern works in production
- Validates proposed optimization approach

Memory savings potential: 98% reduction for high-session deployments
Implementation effort: 2-3 days coding + testing
Risk: Low (pattern already proven with internal functions)

Comprehensive comparison covering:
- Registry architecture (cloned vs static)
- User-facing vs implementation details
- Resolution paths and shadowing behavior
- Examples and use cases
- Historical context (Spark 4 separation)

KEY FINDINGS:
- Builtin: ~500 user-facing SQL functions, cloned per session
- Internal: ~20 implementation functions for Connect/ML/Pandas, single global registry
- Internal functions already use separate static registry pattern
- Proves composite registry approach is production-ready

This validates our proposed optimization approach for builtins.

Updated test to expect INVALID_TEMP_OBJ_QUALIFIER (AnalysisException)
instead of INVALID_SQL_SYNTAX.CREATE_TEMP_FUNC_WITH_DATABASE (ParseException)
for invalid temporary function qualifications.

This aligns with the user's request to treat invalid temp function
qualifications as semantic errors (42602 SQLSTATE) rather than
syntax errors.

Test cases updated:
- CREATE TEMPORARY FUNCTION a.b() - now expects INVALID_TEMP_OBJ_QUALIFIER
- CREATE TEMPORARY FUNCTION a.b.c() - now expects INVALID_TEMP_OBJ_QUALIFIER

All tests pass.

These files were working notes created during development and should
not be committed to the repository:
- BUILTIN_VS_INTERNAL_FUNCTIONS.md
- CURSOR_IMPLEMENTATION_SUMMARY.md
- CURSOR_TEST_RESULTS.md
- FUNCTION_QUALIFICATION_ANALYSIS.md
- FUNCTION_QUALIFICATION_COMPLETE.md
- FUNCTION_QUALIFICATION_SUMMARY.md
- FUNCTION_REGISTRY_API_ANALYSIS.md
- IMPLEMENTATION_COMPLETE.md
- TABLE_FUNCTION_REGISTRY_ANALYSIS.md
- UNIFIED_FUNCTION_NAMESPACE.md

Only production code and tests should be in the repository.

These were working test files created during development:
- test_both.sql
- test_namespace.sql
- test_range.sql

They should not be committed to the repository.

…apsulation

Refactoring #1: Extract Scalar/Table Function Duplication
- Added handleViewContext() helper to centralize view resolution logic
- Added lookupFunctionWithShadowing() generic helper for both scalar and table functions
- Eliminated ~50 lines of duplicated shadowing and view context logic
- Simplified lookupBuiltinOrTempFunction() and lookupBuiltinOrTempTableFunction()

Refactoring #2: Unified Qualification Checker
- Added isQualifiedWithNamespace() helper to check namespace qualifications
- Refactored maybeBuiltinFunctionName() and maybeTempFunctionName() to use common helper
- Eliminated ~15 lines of duplication
- Prepares for future PATH implementation

Documentation Improvements:
- Added comprehensive scaladoc to TEMP_FUNCTION_DB explaining composite key pattern
- Added scaladoc to tempFunctionIdentifier() and isTempFunctionIdentifier()
- Added detailed comments to new helper methods

Benefits:
- Single source of truth for shadowing logic
- Easier to maintain and test
- Better encapsulation of view resolution context
- More consistent code structure

All tests pass:
- FunctionQualificationSuite: 24/24 tests passed
- SessionCatalogSuite: 100/100 tests passed

Naming Improvements:
- Renamed 'useTempIdentifier' → 'lookupAsTemporary' for clarity
- More semantic boolean parameter names

Code Quality:
- Made handleViewContext() functional (removed imperative 'return')
- Removed unused 'tempIdentifier' and 'builtinIdentifier' variables
- Extracted resolveFunctionWithFallback() to eliminate duplication
- Used Option.filter for more idiomatic Scala

Benefits:
- More functional programming style
- Clearer intent with better parameter names
- No unused variables cluttering the code
- Extracted common pattern reduces duplication by ~30 lines
- More readable and maintainable

All 24 tests in FunctionQualificationSuite pass.

Removed sql-function-qualifiers.sql and its golden files.
All test coverage is comprehensively provided by FunctionQualificationSuite.scala.

Rationale:
- Eliminates duplication between SQL and Scala tests
- Scala tests provide better error validation with checkError()
- Scala tests are easier to maintain and debug
- Scala tests have better test isolation
- No loss of coverage: Scala suite has 24 tests covering all scenarios

FunctionQualificationSuite.scala provides complete coverage:
- 8 tests: Reference qualification (SELECT statements)
- 9 tests: DDL (CREATE/DROP TEMPORARY FUNCTION)
- 4 tests: Type mismatch errors
- 3 tests: Integration scenarios

All 24 tests pass.
@cloud-fan

cloud-fan commented Mar 2, 2026

Contributor

I don't think we need EXTENSION_NAMESPACE. The purpose of SparkSessionExtensions is to allow power users to overwrite built-in functions — they register functions with a FunctionIdentifier they fully control, and those functions simply replace the builtin at that key. There's no need for a separate namespace to distinguish them. Furthermore, we should not let end users know the concept of extension functions. The power users may want to do it silently.

A few concrete issues with the current Extension kind:

  1. extensionFunctionIdentifier(name) produces FunctionIdentifier(format(name)) — identical to builtins. Extensions are stored at the exact same key as builtins, so the Extension kind has no storage-level distinction.

  2. SparkSessionExtensions.registerFunctions allows users to provide arbitrary FunctionIdentifier with database/catalog. The PR silently strips qualification for unqualified extensions and keeps it as-is for qualified ones, but qualified extensions can never be reached via extension.func or system.extension.func since those only look up unqualified keys. The abstraction is incomplete.

  3. Every code path treats Extension and Builtin identically: resolveScalarFunction, resolveTableFunction, lookupFunctionType, ResolveCatalogs, etc. all collapse them together. This adds branches and complexity with no behavioral difference.

Suggestion: remove Extension from SessionFunctionKind, remove EXTENSION_NAMESPACE from CatalogManager, and let extension functions continue to work as they do on master — they overwrite builtins at injection time, no special namespace needed.


This comment was generated with GitHub MCP.

@cloud-fan

Contributor

This review was generated by Cursor with Claude Opus 4.6.

Additional implementation review comments beyond what's already been raised:

1. lookupFunctionInfo — table functions with kind == Temp skip view context filtering (bug)

For kind == Temp, the scalar branch applies view context filtering via lookupTempFuncWithViewContext, but the table function branch does a plain registry.functionExists / registry.lookupFunction with no view filtering:

case SessionCatalog.Temp =>
  synchronized {
    if (tableFunction) {
      // No view context filtering!
      if (registry.functionExists(identifier)) registry.lookupFunction(identifier) else None
    } else {
      lookupTempFuncWithViewContext(
        name, _ => false, _ => registry.lookupFunction(identifier))
    }
  }

This means a temp table function could be visible inside a view even when it shouldn't be. Both branches should apply handleViewContext.

2. registerUserDefinedFunction: the isTableFunction flag is redundant

The method already receives registry: FunctionRegistryBase[T] which is either functionRegistry or tableFunctionRegistry. The "other" registry can be derived:

val otherRegistry = if (registry eq functionRegistry) tableFunctionRegistry else functionRegistry

No need for the extra isTableFunction parameter.

3. resolveQualifiedTableFunction: the catch-all "case _: Exception => // ignore" is dangerous

In the inner try-catch that checks whether a scalar function exists (to throw NOT_A_TABLE_FUNCTION), all exceptions are silently swallowed:

} catch {
  case _: Exception => // ignore
}

This could mask real errors (permission failures, catalog connectivity, etc.). It should catch only the specific exceptions that indicate "function not found" — NoSuchFunctionException, NoSuchNamespaceException, CatalogNotFoundException, and the FORBIDDEN_OPERATION AnalysisException (matching the outer catch).
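
A minimal sketch of the narrower catch suggested here. The exception classes are local stand-ins so the example is self-contained (the real ones are Spark's), scalarFunctionExists is a hypothetical helper, and the FORBIDDEN_OPERATION AnalysisException case is omitted for brevity.

```scala
// Illustrative only: stand-in exception types and a hypothetical helper.
object NarrowCatchSketch {
  class NoSuchFunctionException(msg: String) extends Exception(msg)
  class NoSuchNamespaceException(msg: String) extends Exception(msg)
  class CatalogNotFoundException(msg: String) extends Exception(msg)

  // Returns true if the lookup succeeds and false if it fails with one of the
  // "not found" exceptions. Any other exception propagates to the caller.
  def scalarFunctionExists(lookup: () => Unit): Boolean =
    try {
      lookup()
      true
    } catch {
      case _: NoSuchFunctionException |
           _: NoSuchNamespaceException |
           _: CatalogNotFoundException => false
    }
}
```

With this shape, a permission or connectivity failure thrown by the lookup is no longer swallowed; only the listed "not found" cases are treated as a negative answer.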

4. FunctionType sealed trait is over-engineered for its single use

FunctionType (with 5 case objects) is only used by lookupFunctionType, which is only called from LookupFunctions in Analyzer.scala. The Builtin/Temporary distinction is never actually used — both hit the throw SparkException.internalError(...) branch. Consider simplifying or inlining the logic.

5. resolveBuiltinOrTempTableFunctionRespectingPathOrder — appears to be dead code

This public method returns Option[Either[LogicalPlan, Unit]] but doesn't appear to be called anywhere in the diff. If it is needed, Right(()) to signal "scalar found but no table function" is not self-documenting and should use a clearer type. If it's not called, it should be removed.

6. createOrReplaceTempFunction — the source == "internal" check is dead code

The PR adds a source == "internal" branch to avoid qualifying internal functions with SESSION_NAMESPACE. But FunctionRegistry.internal is a completely separate SimpleFunctionRegistry instance — internal functions are registered via registerInternalExpression directly into that registry, never through createOrReplaceTempFunction. No caller ever passes "internal" as the source. The conditional should be removed and the method should always qualify with SESSION_NAMESPACE.

7. Duplicated scaladoc on lookupFunctionWithShadowing

There are two consecutive scaladoc blocks before lookupFunctionWithShadowing — the first is an orphaned leftover that should be removed.

8. Indentation-only changes pollute the diff

Several places in SessionCatalog.scala have whitespace-only changes that shift indentation without any logical change (e.g. fetchCatalogFunction, the persistent branch of registerUserDefinedFunction, registerFunction call sites). These should be reverted to keep git blame clean.


This comment was generated with GitHub MCP.

@gengliangwang

Member

I will merge this one once CI passes. Thanks for the work, @srielau

@MaxGekk

MaxGekk commented Jun 29, 2026

Member

Hi @srielau @cloud-fan @gengliangwang @vladimirg-db, a heads-up on a performance regression introduced by this change (SPARK-54807), tracked as SPARK-57758.

Before this PR, an unqualified built-in function (count, sum, coalesce, ...) resolved with a single in-memory registry lookup. Now FunctionResolution.resolveFunction/resolveTableFunction builds an ordered candidate search path for every UnresolvedFunction: it reads the AnalysisContext/CatalogManager, reads spark.sql.functionResolution.sessionOrder, allocates the search-path and candidate Seqs, and iterates candidates doing a name-kind parse plus a registry lookup. None of it is memoized, so it runs for every function node, on every analysis pass.

The impact is amplified under Spark Connect, which re-analyzes the whole (growing) plan on every AnalyzePlan call, so the per-function overhead is paid repeatedly and analysis time regresses multi-fold. Execution is unaffected.

I opened #56869 to fix it: memoize the resolution path per analysis pass and add a built-in-only fast path for single-part names (gated so it only short-circuits when system.builtin precedes system.session, preserving the configurable order). Resolution semantics are unchanged.
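
The general shape of the two ideas described above (memoizing the path per analysis pass and a gated built-in fast path) can be sketched as follows. This is a simplified model under stated assumptions, not the code in #56869: AnalysisPass, buildSearchPath, and the pathBuilds counter are hypothetical names introduced for illustration.

```scala
// Simplified model of per-pass memoization plus a gated fast path; not Spark code.
object MemoizedPathSketch {
  // Counts how often the (hypothetically expensive) path computation runs.
  var pathBuilds: Int = 0

  def buildSearchPath(builtinFirst: Boolean): Seq[String] = {
    pathBuilds += 1
    if (builtinFirst) Seq("system.builtin", "system.session")
    else Seq("system.session", "system.builtin")
  }

  // One instance per analysis pass, so the path is computed at most once per pass.
  final class AnalysisPass(
      builtinFirst: Boolean,
      builtins: Set[String],
      temps: Set[String]) {

    private lazy val searchPath: Seq[String] = buildSearchPath(builtinFirst)

    def resolve(nameParts: Seq[String]): Option[String] = nameParts match {
      // Fast path: single-part name, builtin precedes session, and the builtin exists.
      case Seq(name) if builtinFirst && builtins.contains(name) =>
        Some("system.builtin")
      // Slow path: walk the memoized search path.
      case Seq(name) =>
        searchPath.find {
          case "system.builtin" => builtins.contains(name)
          case "system.session" => temps.contains(name)
          case _ => false
        }
      // Qualified names are out of scope for this sketch.
      case _ => None
    }
  }
}
```

In this model, resolving a built-in name never touches the search path when built-ins come first, and repeated resolutions of other names within one pass build the path only once.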

Would appreciate your review. Thanks!

5 participants