Commit 194ef24

[Enhancement](udf) support deterministic property for udf
1 parent 96f1410 commit 194ef24

6 files changed

Lines changed: 384 additions & 8 deletions

docs/query-data/udf/python-user-defined-function.md

Lines changed: 10 additions & 2 deletions
@@ -40,6 +40,7 @@ PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "entry_function_name",
 "runtime_version" = "python_version",
+"deterministic" = "true|false",
 "always_nullable" = "true|false"
 )
 AS $$
@@ -58,7 +59,8 @@ RETURNS INT
 PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "evaluate",
-"runtime_version" = "3.10.12"
+"runtime_version" = "3.10.12",
+"deterministic" = "true"
 )
 AS $$
 def evaluate(a, b):
@@ -77,7 +79,8 @@ RETURNS STRING
 PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "evaluate",
-"runtime_version" = "3.10.12"
+"runtime_version" = "3.10.12",
+"deterministic" = "true"
 )
 AS $$
 def evaluate(s1, s2):
@@ -362,6 +365,7 @@ DROP FUNCTION IF EXISTS py_is_prime(INT);
 | `symbol` | Yes | - | Python function entry name.<br>• **Inline Mode**: Write function name directly, such as `"evaluate"`<br>• **Module Mode**: Format is `[package_name.]module_name.func_name`, see module mode description |
 | `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 | `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"`, requires complete version number |
+| `deterministic` | No | `false` | Whether the Python UDF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs, and the implementation does not depend on current time, random numbers, or external mutable state.<br>Correctly marking this property allows the optimizer to handle rewrite and other optimization scenarios more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 | `always_nullable` | No | `true` | Whether to always return nullable results |
 
 #### Runtime Version Description
@@ -979,6 +983,7 @@ PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "ClassName",
 "runtime_version" = "python_version",
+"deterministic" = "true|false",
 "always_nullable" = "true|false"
 )
 AS $$
@@ -1409,6 +1414,7 @@ DROP FUNCTION IF EXISTS py_variance(DOUBLE);
 | `symbol` | Yes | - | Python class name.<br>• **Inline Mode**: Write class name directly, such as `"SumUDAF"`<br>• **Module Mode**: Format is `[package_name.]module_name.ClassName` |
 | `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 | `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
+| `deterministic` | No | `false` | Whether the Python UDAF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs, and the implementation does not depend on current time, random numbers, or external mutable state.<br>Correctly marking this property allows the optimizer to handle rewrite and other optimization scenarios more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 | `always_nullable` | No | `true` | Whether to always return nullable results |
 
 #### runtime_version Description
@@ -1907,6 +1913,7 @@ PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "function_name",
 "runtime_version" = "python_version",
+"deterministic" = "true|false",
 "always_nullable" = "true|false"
 )
 AS $$
@@ -2405,6 +2412,7 @@ CREATE TABLES FUNCTION py_split(STRING, STRING) ...;
 | `symbol` | Yes | - | Python function name.<br>• **Inline Mode**: Write function name directly, such as `"split_string_udtf"`<br>• **Module Mode**: Format is `[package_name.]module_name.function_name` |
 | `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 | `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
+| `deterministic` | No | `false` | Whether the Python UDTF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs, and the implementation does not depend on current time, random numbers, or external mutable state.<br>Correctly marking this property allows the optimizer to handle rewrite and other optimization scenarios more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 | `always_nullable` | No | `true` | Whether to always return nullable results |
 
 #### runtime_version Description
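
The `deterministic` rows added above say the property should be `true` only when the same inputs always give the same outputs. As a rough illustration of that rule, the sketch below contrasts two plain Python function bodies and applies a repeat-call check. The names `add_ints`, `uuid_token`, and `looks_deterministic` are hypothetical and are not part of Doris; the check is only a heuristic, since passing it does not prove a function is deterministic.

```python
import uuid


def add_ints(a, b):
    """Deterministic: the result depends only on the arguments."""
    if a is None or b is None:
        return None
    return a + b


def uuid_token(x):
    """Non-deterministic: uuid.uuid4() returns a fresh random value per call."""
    return f"{x}-{uuid.uuid4()}"


def looks_deterministic(func, sample_args, repeats=5):
    """Heuristic: call func several times per sample and compare the results.

    A False result disproves determinism; a True result does not prove it.
    """
    for args in sample_args:
        first = func(*args)
        for _ in range(repeats - 1):
            if func(*args) != first:
                return False
    return True


print(looks_deterministic(add_ints, [(1, 2), (0, 0), (None, 3)]))
print(looks_deterministic(uuid_token, [(1,), (2,)]))
```

A body like `add_ints` would be a reasonable candidate for `"deterministic" = "true"`, while one like `uuid_token` should keep the default `false`.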

docs/sql-manual/sql-statements/function/CREATE-FUNCTION.md

Lines changed: 89 additions & 1 deletion
@@ -75,6 +75,7 @@ CREATE [ GLOBAL ]
 > - `symbol`: Indicates the class name containing the UDF class. This parameter is mandatory.
 > - `type`: Indicates the UDF call type. The default is Native. Use JAVA_UDF when using a Java UDF.
 > - `always_nullable`: Indicates whether the UDF result may contain NULL values. This is an optional parameter with a default value of true.
+> - `deterministic`: Indicates whether a Java UDF or Python UDF is deterministic. This is an optional parameter with a default value of false. Set it to true only when identical inputs always produce identical outputs, and the implementation does not depend on current time, random numbers, or external mutable state. Correct marking allows the optimizer to handle query rewrites more safely; incorrect marking may lead to wrong query results.
 
 ## Access Control Requirements
 
@@ -135,4 +136,91 @@ To execute this command, the user must have `ADMIN_PRIV` privileges.
 
 ```sql
 CREATE GLOBAL ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
-```
+```
+
+6. Create a non-deterministic Python UDF. Functions such as `uuid.uuid4()` that depend on randomness should keep the default `deterministic = false` and must not be incorrectly marked as `true`.
+
+```sql
+CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
+INSERT INTO cte_uuid_seed VALUES (1),(2),(3);
+
+DROP FUNCTION IF EXISTS py_uuid_token(INT);
+CREATE FUNCTION py_uuid_token(INT)
+RETURNS STRING
+PROPERTIES (
+"type" = "PYTHON_UDF",
+"symbol" = "py_uuid_token_impl",
+"always_nullable" = "false",
+"runtime_version" = "3.12.11"
+)
+AS $$
+import uuid
+def py_uuid_token_impl(x):
+    return f"{x}-{uuid.uuid4()}"
+$$;
+
+SET enable_cte_materialize = true;
+SET inline_cte_referenced_threshold = 10;
+
+WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
+SELECT id, COUNT(DISTINCT token) AS distinct_tokens
+FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
+GROUP BY id ORDER BY id;
+```
+
+Correct result:
+
+```text
++------+-----------------+
+| id   | distinct_tokens |
++------+-----------------+
+|    1 |               1 |
+|    2 |               1 |
+|    3 |               1 |
++------+-----------------+
+```
+
+For this function, the following definition is incorrect:
+
+```sql
+DROP FUNCTION IF EXISTS py_uuid_token(INT);
+CREATE FUNCTION py_uuid_token(INT)
+RETURNS STRING
+PROPERTIES (
+"type" = "PYTHON_UDF",
+"symbol" = "py_uuid_token_impl",
+"always_nullable" = "false",
+"runtime_version" = "3.12.11",
+"deterministic" = "true"
+)
+AS $$
+import uuid
+def py_uuid_token_impl(x):
+    return f"{x}-{uuid.uuid4()}"
+$$;
+```
+
+Run the same query again:
+
+```sql
+WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
+SELECT id, COUNT(DISTINCT token) AS distinct_tokens
+FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
+GROUP BY id ORDER BY id;
+```
+
+Incorrect result:
+
+```text
++------+-----------------+
+| id   | distinct_tokens |
++------+-----------------+
+|    1 |               2 |
+|    2 |               2 |
+|    3 |               2 |
++------+-----------------+
+```
+
+Why this is wrong:
+Because `py_uuid_token` is non-deterministic, each call to `uuid.uuid4()` generates a new value. If the function is incorrectly marked as `deterministic = true`, the optimizer may treat repeated references as safe to rewrite and may choose a plan that evaluates the UDF separately on both sides of `UNION ALL`. As a result, the same `id` can produce two different `token` values, and `COUNT(DISTINCT token)` changes from `1` to `2`.
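
The difference between the two results can be reproduced outside the database with a small emulation. The sketch below is plain Python, not Doris's optimizer: `run_materialized` stands in for a plan that computes the CTE once and reuses it, and `run_inlined` stands in for a plan that re-evaluates the CTE body in each `UNION ALL` branch. Both helper names, and `count_distinct_tokens`, are hypothetical.

```python
import uuid


def py_uuid_token_impl(x):
    # Same body as the UDF in the example above.
    return f"{x}-{uuid.uuid4()}"


def count_distinct_tokens(rows):
    """Emulates SELECT id, COUNT(DISTINCT token) ... GROUP BY id ORDER BY id."""
    tokens_by_id = {}
    for row_id, token in rows:
        tokens_by_id.setdefault(row_id, set()).add(token)
    return [(row_id, len(tokens)) for row_id, tokens in sorted(tokens_by_id.items())]


def run_materialized(ids):
    """CTE computed once; both UNION ALL branches read the stored rows."""
    cte = [(i, py_uuid_token_impl(i)) for i in ids]
    return count_distinct_tokens(cte + cte)


def run_inlined(ids):
    """CTE body re-evaluated per branch, so the UDF runs twice for each id."""
    left = [(i, py_uuid_token_impl(i)) for i in ids]
    right = [(i, py_uuid_token_impl(i)) for i in ids]
    return count_distinct_tokens(left + right)


ids = [1, 2, 3]
print(run_materialized(ids))  # [(1, 1), (2, 1), (3, 1)]
print(run_inlined(ids))       # [(1, 2), (2, 2), (3, 2)]
```

With a truly deterministic function both strategies would return the same counts, which is why the optimizer is only free to choose between them when the `deterministic` flag is accurate.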

i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/udf/python-user-defined-function.md

Lines changed: 10 additions & 2 deletions
@@ -40,6 +40,7 @@ PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "entry_function_name",
 "runtime_version" = "python_version",
+"deterministic" = "true|false",
 "always_nullable" = "true|false"
 )
 AS $$
@@ -58,7 +59,8 @@ RETURNS INT
 PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "evaluate",
-"runtime_version" = "3.10.12"
+"runtime_version" = "3.10.12",
+"deterministic" = "true"
 )
 AS $$
 def evaluate(a, b):
@@ -77,7 +79,8 @@ RETURNS STRING
 PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "evaluate",
-"runtime_version" = "3.10.12"
+"runtime_version" = "3.10.12",
+"deterministic" = "true"
 )
 AS $$
 def evaluate(s1, s2):
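@@ -362,6 +365,7 @@ DROP FUNCTION IF EXISTS py_is_prime(INT);
 | `symbol` | Yes | - | Python function entry name.<br>• **Inline Mode**: Write the function name directly, such as `"evaluate"`<br>• **Module Mode**: Format is `[package_name.]module_name.func_name`; see the module mode description |
 | `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 | `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"`; the complete version number is required |
+| `deterministic` | No | `false` | Whether the Python UDF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>When marked correctly, the optimizer can rely on stable semantics to handle query rewrites and other optimizations more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 | `always_nullable` | No | `true` | Whether to always return nullable results |
 
 #### Runtime Version Description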
@@ -979,6 +983,7 @@ PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "ClassName",
 "runtime_version" = "python_version",
+"deterministic" = "true|false",
 "always_nullable" = "true|false"
 )
 AS $$
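@@ -1409,6 +1414,7 @@ DROP FUNCTION IF EXISTS py_variance(DOUBLE);
 | `symbol` | Yes | - | Python class name.<br>• **Inline Mode**: Write the class name directly, such as `"SumUDAF"`<br>• **Module Mode**: Format is `[package_name.]module_name.ClassName` |
 | `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 | `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
+| `deterministic` | No | `false` | Whether the Python UDAF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>When marked correctly, the optimizer can rely on stable semantics to handle query rewrites and other optimizations more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 | `always_nullable` | No | `true` | Whether to always return nullable results |
 
 #### runtime_version Description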
@@ -1907,6 +1913,7 @@ PROPERTIES (
 "type" = "PYTHON_UDF",
 "symbol" = "function_name",
 "runtime_version" = "python_version",
+"deterministic" = "true|false",
 "always_nullable" = "true|false"
 )
 AS $$
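@@ -2405,6 +2412,7 @@ CREATE TABLES FUNCTION py_split(STRING, STRING) ...;
 | `symbol` | Yes | - | Python function name.<br>• **Inline Mode**: Write the function name directly, such as `"split_string_udtf"`<br>• **Module Mode**: Format is `[package_name.]module_name.function_name` |
 | `file` | No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 | `runtime_version` | Yes | - | Python runtime version, such as `"3.10.12"` |
+| `deterministic` | No | `false` | Whether the Python UDTF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state.<br>When marked correctly, the optimizer can rely on stable semantics to handle query rewrites and other optimizations more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 | `always_nullable` | No | `true` | Whether to always return nullable results |
 
 #### runtime_version Description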

i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/function/CREATE-FUNCTION.md

Lines changed: 89 additions & 1 deletion
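@@ -74,6 +74,7 @@ CREATE [ GLOBAL ]
 > - `symbol`: Indicates the class name containing the UDF class. This parameter is mandatory.
 > - `type`: Indicates the UDF call type. The default is Native; pass JAVA_UDF when using a Java UDF.
 > - `always_nullable`: Indicates whether the UDF result may contain NULL values. This is an optional parameter with a default value of true.
+> - `deterministic`: Indicates whether a Java UDF or Python UDF is deterministic. This is an optional parameter with a default value of false. Set it to true only when the same inputs always produce the same outputs and the implementation does not depend on the current time, random numbers, or external mutable state. When marked correctly, the optimizer can rely on stable semantics to handle query rewrites and similar scenarios more safely; incorrect marking may lead to wrong query results.
 
 ## Access Control Requirements
 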
@@ -124,4 +125,91 @@ CREATE [ GLOBAL ]
 
 ```sql
 CREATE GLOBAL ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
-```
+```
+
+6. Create a non-deterministic Python UDF. Functions such as `uuid.uuid4()` that depend on random numbers should keep the default `deterministic` value of `false` and should not be incorrectly marked as `true`.
+
+```sql
+CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num" = "1");
+INSERT INTO cte_uuid_seed VALUES (1),(2),(3);
+
+DROP FUNCTION IF EXISTS py_uuid_token(INT);
+CREATE FUNCTION py_uuid_token(INT)
+RETURNS STRING
+PROPERTIES (
+"type" = "PYTHON_UDF",
+"symbol" = "py_uuid_token_impl",
+"always_nullable" = "false",
+"runtime_version" = "3.12.11"
+)
+AS $$
+import uuid
+def py_uuid_token_impl(x):
+    return f"{x}-{uuid.uuid4()}"
+$$;
+
+SET enable_cte_materialize = true;
+SET inline_cte_referenced_threshold = 10;
+
+WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
+SELECT id, COUNT(DISTINCT token) AS distinct_tokens
+FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
+GROUP BY id ORDER BY id;
+```
+
+Correct result:
+
+```text
++------+-----------------+
+| id   | distinct_tokens |
++------+-----------------+
+|    1 |               1 |
+|    2 |               1 |
+|    3 |               1 |
++------+-----------------+
+```
+
+For the function above, do not write the definition like this:
+
+```sql
+DROP FUNCTION IF EXISTS py_uuid_token(INT);
+CREATE FUNCTION py_uuid_token(INT)
+RETURNS STRING
+PROPERTIES (
+"type" = "PYTHON_UDF",
+"symbol" = "py_uuid_token_impl",
+"always_nullable" = "false",
+"runtime_version" = "3.12.11",
+"deterministic" = "true"
+)
+AS $$
+import uuid
+def py_uuid_token_impl(x):
+    return f"{x}-{uuid.uuid4()}"
+$$;
+```
+
+Run the same query again:
+
+```sql
+WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
+SELECT id, COUNT(DISTINCT token) AS distinct_tokens
+FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
+GROUP BY id ORDER BY id;
+```
+
+Incorrect result:
+
+```text
++------+-----------------+
+| id   | distinct_tokens |
++------+-----------------+
+|    1 |               2 |
+|    2 |               2 |
+|    3 |               2 |
++------+-----------------+
+```
+
+Cause of the error:
+`py_uuid_token` is a non-deterministic function; `uuid.uuid4()` generates a new value on every call. If it is incorrectly marked as `deterministic = true`, the optimizer may treat repeated references as safe to rewrite and choose a plan that runs the UDF separately on each side of `UNION ALL`. The same `id` then produces two different `token` values, and `COUNT(DISTINCT token)` changes from `1` to `2`.
