3 changes: 3 additions & 0 deletions docs/sql-migration-guide.md
@@ -31,6 +31,9 @@ license: |
- Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to `false`.
- Since Spark 4.2, support for Derby JDBC datasource is deprecated.
- Since Spark 4.2, a new default method `mergeWith` has been added to the `CustomTaskMetric` interface. The default implementation sums the two metric values, which is correct for count-type metrics. Data source connector implementations that report non-additive metrics (e.g., maximum, average, compression ratio, or gauge values) must override `mergeWith` to provide correct merge semantics.
- Since Spark 4.2, the virtual `system` catalog hosts the new `system.builtin` and `system.session` namespaces. `system.builtin` exposes built-in functions and functions injected through `SparkSessionExtensions`; `system.session` exposes temporary views, temporary functions, and session variables created in the current session. As a result, 2-part references like `builtin.func()` and `session.func()` now follow a mini-path that tries the system namespace first and the current catalog second, so a persistent schema named `builtin` or `session` is no longer reached by `builtin.func()` / `session.func()` when the system namespace contains an object of the same name. To restore the previous behavior (current catalog first), set `spark.sql.legacy.persistentCatalogFirst` to `true`. Persistent schemas with these names are still allowed but should be reached with an explicit catalog prefix (for example, `spark_catalog.session.x`). See [Reserved system names](sql-ref-identifier.html#reserved-system-names).
- Since Spark 4.2, `CREATE TEMPORARY VIEW`, `CREATE TEMPORARY FUNCTION`, and the corresponding `DROP` statements accept the `session` and `system.session` qualifiers on the object name (in addition to the previously supported unqualified form); for example, `CREATE TEMPORARY VIEW system.session.v AS ...` and `DROP TEMPORARY FUNCTION session.f` are now valid. Any other qualifier on a temporary object is rejected with `INVALID_TEMP_OBJ_QUALIFIER`.
- Since Spark 4.2, the SQL standard `PATH` feature is available: the `SET PATH` statement, the `current_path()` function, path-based resolution of unqualified routines, tables, views, and session variables, and the configurations `spark.sql.path.enabled` (default `false`) and `spark.sql.defaultPath`. The feature is opt-in; when `spark.sql.path.enabled` is `false`, unqualified resolution falls back to a fixed default path and `SET PATH` is rejected with `UNSUPPORTED_FEATURE.SET_PATH_WHEN_DISABLED`. See [SET PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html) and [Name Resolution](sql-ref-name-resolution.html).
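
For the `mergeWith` item above, the following is a minimal Python sketch of why a summing default is only safe for additive metrics. It is not Spark's Java interface: `TaskMetric`, `PeakMemoryMetric`, and `merge_with` are hypothetical names used only to model the described behavior.

```python
class TaskMetric:
    """Hypothetical stand-in for a task metric with a single numeric value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def merge_with(self, other):
        # Models the described default: add the two values.
        return type(self)(self.name, self.value + other.value)


class PeakMemoryMetric(TaskMetric):
    """Non-additive metric: the merged value is the larger of the two."""

    def merge_with(self, other):
        return PeakMemoryMetric(self.name, max(self.value, other.value))


rows = TaskMetric("rowsRead", 100).merge_with(TaskMetric("rowsRead", 50))
peak = PeakMemoryMetric("peakBytes", 100).merge_with(PeakMemoryMetric("peakBytes", 50))
print(rows.value)  # 150: summing is right for a count
print(peak.value)  # 100: summing would have reported 150 for a maximum
```

A connector that reports a maximum, average, ratio, or gauge and keeps the summing default would silently report inflated values after a merge, which is why the note asks such implementations to override the method.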

## Upgrading from Spark SQL 4.0 to 4.1

85 changes: 85 additions & 0 deletions docs/sql-ref-function-current-path.md
@@ -0,0 +1,85 @@
---
layout: global
title: current_path function
displayTitle: current_path function
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
---

Returns the effective SQL Path for the current session as a comma-separated string of
qualified namespace names. See [`SET PATH`](sql-ref-syntax-aux-conf-mgmt-set-path.html) for a
description of what the path is, how to enable it, and how to change it, and
[Name Resolution](sql-ref-name-resolution.html) for how the path drives unqualified name
resolution.

### Syntax

```sql
current_path()
```

### Arguments

This function takes no arguments. The parentheses may be omitted.

### Returns

A non-nullable `STRING`. Each path entry is written as a dotted name with backticks added only
where required by Spark's identifier rules. Entries are separated by a single comma.

When the path contains the virtual `CURRENT_SCHEMA` marker, the marker is materialized as the
catalog-qualified current schema (`current_catalog.current_schema`) each time
`current_path()` is evaluated, so subsequent `USE SCHEMA` statements are reflected without
re-issuing `SET PATH`.
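
The rendering described above can be modeled with a short Python sketch. It is an illustration, not Spark's implementation: `render_path`, `quote_part`, and the `CURRENT_SCHEMA` sentinel are hypothetical names, and the rule that a part needs backticks unless it consists only of letters, digits, and underscores is a simplifying assumption about Spark's identifier rules.

```python
import re

# Assumption: a name part is "plain" when it contains only letters, digits,
# and underscores. Spark's real quoting rule may differ in detail.
_PLAIN = re.compile(r"^[A-Za-z0-9_]+$")

CURRENT_SCHEMA = object()  # hypothetical stand-in for the virtual marker


def quote_part(part):
    """Backtick-quote one name part only when it is not a plain identifier."""
    if _PLAIN.match(part):
        return part
    return "`" + part.replace("`", "``") + "`"


def render_path(entries, current_catalog, current_schema):
    """Render path entries as a comma-separated string of dotted names."""
    rendered = []
    for entry in entries:
        if entry is CURRENT_SCHEMA:
            # Materialized on every call, so a later USE SCHEMA is picked up.
            entry = (current_catalog, current_schema)
        rendered.append(".".join(quote_part(p) for p in entry))
    return ",".join(rendered)


path = [CURRENT_SCHEMA, ("system", "builtin")]
print(render_path(path, "spark_catalog", "finance"))  # spark_catalog.finance,system.builtin
print(render_path(path, "spark_catalog", "default"))  # spark_catalog.default,system.builtin
```

Because the marker is expanded at render time rather than stored as a fixed schema, the same stored path yields different strings as the current schema changes.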

### Examples

```sql
> SELECT current_path();
system.builtin,system.session,spark_catalog.default

-- ANSI no-parens form returns the same value.
> SELECT CURRENT_PATH;
system.builtin,system.session,spark_catalog.default

-- The output reflects the latest SET PATH.
> SET PATH = spark_catalog.default, system.builtin;
> SELECT current_path();
spark_catalog.default,system.builtin

-- CURRENT_SCHEMA on the path is re-evaluated on every call.
> SET PATH = CURRENT_SCHEMA, system.builtin;
> USE spark_catalog.finance;
> SELECT current_path();
spark_catalog.finance,system.builtin
> USE spark_catalog.default;
> SELECT current_path();
spark_catalog.default,system.builtin

-- Inside a persistent view or SQL function body, current_path() returns the invoker's path,
-- not the frozen path captured at creation time.
> SET PATH = spark_catalog.default, system.builtin;
> CREATE VIEW v_path AS SELECT current_path() AS p;
> SET PATH = spark_catalog.other, system.builtin;
> SELECT * FROM v_path;
spark_catalog.other,system.builtin
```

### Related Statements

* [Name Resolution](sql-ref-name-resolution.html)
* [SET PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html)
* [Built-in Functions](sql-ref-functions-builtin.html)
4 changes: 4 additions & 0 deletions docs/sql-ref-functions-builtin.md
@@ -17,6 +17,10 @@ license: |
limitations under the License.
---

All built-in functions live in the virtual schema `system.builtin`. They can always be referenced
unambiguously by their fully qualified name (for example `system.builtin.abs`), regardless of any
user-defined function that may share the same name.

### Aggregate Functions
{% include_api_gen generated-agg-funcs-table.html %}
#### Examples
24 changes: 24 additions & 0 deletions docs/sql-ref-identifier.md
@@ -52,6 +52,30 @@ An identifier is a string used to identify a database object such as a table, vi

Any character from the character set. Use <code>`</code> to escape special characters (e.g., <code>`</code>).

### Reserved system names

`system`, `session`, and `builtin` have special meaning and should not be used as user-defined
catalog or schema names.

| Name | Position | Notes |
| :--- | :------- | :---- |
| `system` | catalog | Virtual catalog hosting `system.builtin` and `system.session`. Spark does not load `system` through the v2 catalog API; setting `spark.sql.catalog.system = ...` is unsupported and produces undefined results. The current catalog cannot be `system`. |
| `builtin` | schema | A persistent schema named `builtin` is allowed but discouraged because it collides with `system.builtin`. |
| `session` | schema | A persistent schema named `session` is allowed but discouraged because it collides with `system.session`. |

An unqualified 2-part reference like `builtin.x` or `session.x` walks a small **mini-path** to
Contributor


"An unqualified 2-part reference" is in tension with the taxonomy this PR establishes in sql-ref-name-resolution.md:273, which heads exactly this case as ### Partially qualified (2 parts) — schema.object. describe-function.md:47 (also new in this PR) just says "2-part names". A 2-part reference like builtin.x is partially qualified — it carries one level of qualifier (the schema), so calling it "unqualified" reads as self-contradictory.

(Late catch — this wording was already in the prior review's snapshot and I should have flagged it then. Apologies for the second pass.)

Suggested change
An unqualified 2-part reference like `builtin.x` or `session.x` walks a small **mini-path** to
A partially qualified 2-part reference like `builtin.x` or `session.x` walks a small **mini-path** to

choose the implicit catalog: by default it resolves to `system.builtin.x` / `system.session.x`
if such an object exists, and otherwise falls back to the same name in the current catalog. So
an object in a persistent `builtin` or `session` schema is shadowed only when an object of the
same name exists in the corresponding system namespace. The shadowed object stays reachable via its fully qualified 3-part name (for example
`spark_catalog.session.x`). Set `spark.sql.legacy.persistentCatalogFirst` to `true` to reverse
the preference: the current catalog is tried first and the system namespace becomes the fallback.
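
The mini-path described above can be sketched as a small Python model. This is a sketch under stated assumptions, not Spark's resolver: `resolve_two_part` and the nested-dictionary `catalogs` layout are hypothetical, and the model does not distinguish functions from relations.

```python
SYSTEM_NAMESPACES = {"builtin", "session"}


def resolve_two_part(schema, name, catalogs, current_catalog,
                     persistent_catalog_first=False):
    """Return the (catalog, schema, name) a 2-part reference resolves to, or None.

    `catalogs` is a hypothetical in-memory model: {catalog: {schema: {names}}}.
    `persistent_catalog_first` models spark.sql.legacy.persistentCatalogFirst.
    """
    if schema not in SYSTEM_NAMESPACES:
        # Ordinary schema: only the current catalog is consulted.
        candidates = [current_catalog]
    elif persistent_catalog_first:
        candidates = [current_catalog, "system"]
    else:
        candidates = ["system", current_catalog]

    for catalog in candidates:
        if name in catalogs.get(catalog, {}).get(schema, set()):
            return (catalog, schema, name)
    return None


catalogs = {
    "system": {"session": {"x"}},
    "spark_catalog": {"session": {"x", "y"}},
}
# Default: the system namespace wins when it has the object.
print(resolve_two_part("session", "x", catalogs, "spark_catalog"))
# No system.session.y, so the persistent schema is reached.
print(resolve_two_part("session", "y", catalogs, "spark_catalog"))
# Legacy flag: the current catalog is tried first.
print(resolve_two_part("session", "x", catalogs, "spark_catalog",
                       persistent_catalog_first=True))
```

In this model the persistent `session.x` is hidden only while `system.session.x` exists, which matches the shadowing rule stated in the paragraph.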

The `system.builtin` and `system.session` namespaces are described in
[SET PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html). Temporary objects in `system.session` are
documented under [CREATE VIEW](sql-ref-syntax-ddl-create-view.html) and
[CREATE FUNCTION (SQL)](sql-ref-syntax-ddl-create-sql-function.html).

### Examples

```sql
164 changes: 110 additions & 54 deletions docs/sql-ref-name-resolution.md
@@ -19,7 +19,7 @@ license: |
limitations under the License.
---

Name resolution is the process by which [identifiers](sql-ref-identifier.html) are resolved to specific column-, field-, parameter-, or table-references.
Name resolution is the process by which [identifiers](sql-ref-identifier.html) are resolved to specific column-, field-, parameter-, table-, function-, or variable-references.

## Column, field, parameter, and variable resolution

@@ -50,7 +50,7 @@ In detail, resolution of identifiers to a specific reference follows these rules

1. **Parameterless function reference**

If the identifier is unqualified and matches `current_user`, `current_date`, or `current_timestamp`: Resolve it as one of these functions.
If the identifier is unqualified and matches `current_user`, `current_date`, `current_time`, `current_timestamp`, or `current_path`: Resolve it as one of these functions.

1. **Column DEFAULT specification**

@@ -137,7 +137,10 @@ In detail, resolution of identifiers to a specific reference follows these rules

1. **Session Variables**

1. Match the identifier to a variable name. If the identifier is qualified, the qualifier must be `session` or `system.session`.
1. Match the identifier to a session variable name.
If the identifier is qualified, the qualifier must be `session` or `system.session`.
If the identifier is unqualified, `system.session` must be present on the
[SQL Path](sql-ref-syntax-aux-conf-mgmt-set-path.html) (the default path includes it).
1. If the identifier is qualified, match to a field or map key of a variable following rule 1.c

### Limitations
@@ -256,37 +259,54 @@ This restriction also applies to parameter references in SQL functions.
frm.a lat.b func.c
```

## Table and view resolution

An identifier in table-reference can be any one of the following:
## Object name resolution

- Persistent table or view
- Common table expression (CTE)
- [Temporary view](sql-ref-syntax-ddl-create-view.html)
Tables, views, and functions follow the same resolution rule. It depends on how many parts the
identifier has.

Resolution of an identifier depends on whether it is qualified:
### Fully qualified (3 parts) &mdash; `catalog.schema.object`

- **Qualified**
The reference is unique and is looked up in `catalog.schema`. `system.builtin.object` identifies
a built-in function; `system.session.object` identifies a temporary view, function, or session
variable.

If the identifier is fully qualified with three parts: `catalog.schema.relation`, it is unique.
### Partially qualified (2 parts) &mdash; `schema.object`

If the identifier consists of two parts: `schema.relation`, it is further qualified with the result of `SELECT current_catalog()` to make it unique.
The identifier is qualified with `current_catalog` &mdash; producing
`current_catalog.schema.object` &mdash; unless the leading part is `session` (or `builtin`, for
functions). In that case Spark uses the
[mini-path](sql-ref-identifier.html#reserved-system-names) to choose the implicit catalog,
returning the first match:

- **Unqualified**
| `spark.sql.legacy.persistentCatalogFirst` | Mini-path tried in order |
| :-------------------------------------- | :----------------------- |
| `false` (default) | the system namespace (`system.session.x` / `system.builtin.x`), then the current catalog's `session.x` / `builtin.x` |
| `true` (legacy) | the current catalog's `session.x` / `builtin.x`, then the system namespace (`system.session.x` / `system.builtin.x`) |

1. **Common table expression**
### Unqualified (1 part) &mdash; `object`

If the reference is within the scope of a `WITH` clause, match the identifier to a CTE starting with the immediately containing `WITH` clause and moving outwards from there.
In queries and DML, Spark walks the [SQL Path](sql-ref-syntax-aux-conf-mgmt-set-path.html) and
returns the first match. In DDL, the identifier is qualified with `current_catalog.current_schema`.
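
As an illustration of the unqualified rule, the Python sketch below walks a path and returns the first schema that contains the name, and qualifies the name directly for DDL. The function name `resolve_unqualified` and the `catalogs` dictionary are hypothetical; the real resolver is more involved.

```python
def resolve_unqualified(name, sql_path, catalogs, current_catalog,
                        current_schema, is_ddl=False):
    """Resolve a 1-part name to (catalog, schema, name), or None if not found.

    `sql_path` is an ordered list of (catalog, schema) pairs.
    `catalogs` is a hypothetical in-memory model: {catalog: {schema: {names}}}.
    """
    if is_ddl:
        # DDL does not search: the name is qualified with the current schema.
        return (current_catalog, current_schema, name)
    for catalog, schema in sql_path:
        if name in catalogs.get(catalog, {}).get(schema, set()):
            return (catalog, schema, name)
    return None


catalogs = {"spark_catalog": {"db_a": {"t"}, "db_b": {"t"}}}
path_ab = [("spark_catalog", "db_a"), ("spark_catalog", "db_b"), ("system", "builtin")]
path_ba = [("spark_catalog", "db_b"), ("spark_catalog", "db_a"), ("system", "builtin")]

print(resolve_unqualified("t", path_ab, catalogs, "spark_catalog", "default"))
print(resolve_unqualified("t", path_ba, catalogs, "spark_catalog", "default"))
print(resolve_unqualified("t", path_ab, catalogs, "spark_catalog", "default", is_ddl=True))
```

Swapping the order of `db_a` and `db_b` on the path changes which `t` is found, while the DDL case ignores the path entirely.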

1. **Temporary view**
> Note: persistent views and SQL UDFs capture the SQL Path at `CREATE` time. When the view or
> function is invoked, its body resolves names &mdash; tables, views, and functions &mdash;
> against that frozen path, not the invoker's current path. `current_schema()` and
> `current_path()` inside the body still return the invoker's context. See
> [SET PATH](sql-ref-syntax-aux-conf-mgmt-set-path.html).
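
The note above can be illustrated with a minimal Python model of a stored view that keeps the path it was created with. `StoredView`, `create_view`, and `invoke` are hypothetical names; the sketch assumes a view body that references a single unqualified table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredView:
    body_table: str     # unqualified table name referenced by the view body
    frozen_path: tuple  # SQL Path captured at CREATE time


def first_match(name, path, catalogs):
    """Return the first (catalog, schema, name) on `path` that contains `name`."""
    for catalog, schema in path:
        if name in catalogs.get(catalog, {}).get(schema, set()):
            return (catalog, schema, name)
    return None


def create_view(body_table, session_path):
    # The session path in effect at CREATE time is stored with the view.
    return StoredView(body_table, tuple(session_path))


def invoke(view, invoker_path, catalogs):
    # Names in the body resolve against the frozen path...
    resolved = first_match(view.body_table, view.frozen_path, catalogs)
    # ...while a current_path() call in the body reports the invoker's path.
    reported_path = ",".join(f"{c}.{s}" for c, s in invoker_path)
    return resolved, reported_path


catalogs = {"spark_catalog": {"db_a": {"t"}, "db_b": {"t"}}}
view = create_view("t", [("spark_catalog", "db_a"), ("system", "builtin")])
invoker_path = [("spark_catalog", "db_b"), ("system", "builtin")]
print(invoke(view, invoker_path, catalogs))
```

Even though the invoker's path lists `db_b` first, the body still reads `db_a.t`, because only the reported path follows the invoker.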

Match the identifier to any temporary view defined within the current session.
## Table and view resolution

1. **Persisted table**
A table reference can be a persistent table or view, a temporary view, or a common table
expression (CTE).

Fully qualify the identifier by pre-pending the result of `SELECT current_catalog()` and `SELECT current_schema()` and look it up as a persistent relation.
Resolution follows [Object name resolution](#object-name-resolution), with one addition for
unqualified references: when the reference is inside a `WITH` clause, Spark first matches the
identifier against CTEs from the innermost `WITH` outward. If no CTE matches, Spark walks the
SQL Path.
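
A rough Python sketch of this lookup order follows. It assumes the enclosing `WITH` clauses are given as a list of name sets, outermost first; `resolve_relation` and the error text are illustrative and do not reproduce Spark's actual message format.

```python
def resolve_relation(name, with_scopes, sql_path, catalogs):
    """Resolve an unqualified relation: CTEs first (innermost outward), then the path.

    `with_scopes` lists the CTE names of each enclosing WITH, outermost first.
    `catalogs` is a hypothetical in-memory model: {catalog: {schema: {names}}}.
    """
    # Walk from the innermost WITH clause outward.
    for depth in range(len(with_scopes) - 1, -1, -1):
        if name in with_scopes[depth]:
            return ("cte", depth, name)
    # No CTE matched: fall back to the SQL Path.
    for catalog, schema in sql_path:
        if name in catalogs.get(catalog, {}).get(schema, set()):
            return (catalog, schema, name)
    searched = ", ".join(f"{c}.{s}" for c, s in sql_path)
    raise LookupError(f"TABLE_OR_VIEW_NOT_FOUND: {name}; searchPath = [{searched}]")


catalogs = {"spark_catalog": {"default": {"rel"}}}
sql_path = [("system", "builtin"), ("system", "session"), ("spark_catalog", "default")]

# A CTE named `rel` in the innermost WITH shadows the persistent table.
print(resolve_relation("rel", [{"outer_cte"}, {"rel"}], sql_path, catalogs))
# Without a matching CTE, the path is walked.
print(resolve_relation("rel", [{"outer_cte"}], sql_path, catalogs))
```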

If the relation cannot be resolved to any table, view, or CTE, Databricks raises a TABLE_OR_VIEW_NOT_FOUND error.
If the relation cannot be resolved, Spark raises `TABLE_OR_VIEW_NOT_FOUND`. The error includes
the effective search path, for example
`searchPath = [system.builtin, system.session, spark_catalog.default]`.

### Examples

@@ -317,7 +337,13 @@ If the relation cannot be resolved to any table, view, or CTE, Databricks raises
> SELECT c1 FROM rel;
2

-- Temporary views cannot be qualified, so qualifiecation resolved to the table:
-- A temporary view can be qualified with `session` or `system.session`:
> SELECT c1 FROM session.rel;
2
> SELECT c1 FROM system.session.rel;
2

-- Other 2-part qualifications resolve to the persisted table:
> SELECT c1 FROM default.rel;
1

@@ -343,45 +369,34 @@ If the relation cannot be resolved to any table, view, or CTE, Databricks raises
SELECT 1),
cte;
[TABLE_OR_VIEW_NOT_FOUND] The table or view `cte` cannot be found.
```

## Function resolution

A function reference is recognized by the mandatory trailing set of parentheses.

It can resolve to:

- A builtin function provided by Spark,
- A temporary user defined function scoped to the current session, or
- A persistent user defined function.

Resolution of a function name depends on whether it is qualified:
-- PATH drives unqualified relation lookup order
> CREATE SCHEMA db_a;
> CREATE SCHEMA db_b;
> CREATE TABLE db_a.t USING parquet AS SELECT 1 AS v;
> CREATE TABLE db_b.t USING parquet AS SELECT 2 AS v;

- **Qualified**

If the name is fully qualified with three parts: `catalog.schema.function`, it is unique.

If the name consists of two parts: `schema.function`, it is further qualified with the result of `SELECT current_catalog()` to make it unique.

The function is then looked up in the catalog.

- **Unqualified**

For unqualified function names Spark follows a fixed order of precedence (`PATH`):

1. **Builtin function**

If a function by this name exists among the set of built-in functions, that function is chosen.
> SET PATH = spark_catalog.db_a, spark_catalog.db_b, system.builtin;
> SELECT v FROM t;
1

1. **Temporary function**
> SET PATH = spark_catalog.db_b, spark_catalog.db_a, system.builtin;
> SELECT v FROM t;
2

If a function by this name exists among the set of temporary functions, that function is chosen.
-- Three-part `system.session.x` references the temporary scope only:
> SELECT * FROM system.session.no_such_view;
[TABLE_OR_VIEW_NOT_FOUND] ... `system`.`session`.`no_such_view` ...
```

1. **Persisted function**
## Function resolution

Fully qualify the function name by pre-pending the result of `SELECT current_catalog()` and `SELECT current_schema()` and look it up as a persistent function.
A function reference is recognized by the trailing parentheses, and follows
[Object name resolution](#object-name-resolution).

If the function cannot be resolved Spark raises an `UNRESOLVED_ROUTINE` error.
If the function cannot be resolved, Spark raises `UNRESOLVED_ROUTINE`. The error includes the
effective search path, for example
`searchPath = [system.builtin, system.session, spark_catalog.default]`.

### Examples

@@ -420,4 +435,45 @@ If the function cannot be resolved Spark raises an `UNRESOLVED_ROUTINE` error.
-- To resolve the persistent function it now needs qualification
> SELECT spark_catalog.default.func(4, 3);
6

-- A built-in can always be reached by qualification, even when shadowed.
-- Put system.session ahead of system.builtin so a matching temp `abs` shadows the built-in.
> SET PATH = system.session, system.builtin, spark_catalog.default;
> CREATE TEMPORARY FUNCTION abs(x INT) RETURNS INT RETURN x + 100;

-- Unqualified abs(-5) resolves to the temp (-5 + 100 = 95).
> SELECT abs(-5);
95

-- system.builtin.abs and builtin.abs reach the built-in around the shadow.
> SELECT system.builtin.abs(-5);
5
> SELECT builtin.abs(-5);
5

-- session.abs reaches the temp explicitly.
> SELECT session.abs(-5);
95

> DROP TEMPORARY FUNCTION abs;
> SET PATH = DEFAULT_PATH;

-- PATH controls unqualified routine lookup order
> CREATE SCHEMA path_a;
> CREATE SCHEMA path_b;
> CREATE FUNCTION path_a.pick() RETURNS INT RETURN 10;
> CREATE FUNCTION path_b.pick() RETURNS INT RETURN 20;

> SET PATH = spark_catalog.path_a, spark_catalog.path_b, system.builtin;
> SELECT pick();
10

> SET PATH = spark_catalog.path_b, spark_catalog.path_a, system.builtin;
> SELECT pick();
20

-- Unresolved routine lists the effective search path
> SET PATH = spark_catalog.default, system.builtin;
> SELECT does_not_exist();
[UNRESOLVED_ROUTINE] ... searchPath: [`spark_catalog`.`default`, `system`.`builtin`] ...
```