docs/query-data/udf/python-user-defined-function.md
Lines changed: 10 additions & 2 deletions
@@ -40,6 +40,7 @@ PROPERTIES (
     "type"="PYTHON_UDF",
     "symbol"="entry_function_name",
     "runtime_version"="python_version",
+    "deterministic"="true|false",
     "always_nullable"="true|false"
 )
 AS $$
@@ -58,7 +59,8 @@ RETURNS INT
 PROPERTIES (
     "type"="PYTHON_UDF",
     "symbol"="evaluate",
-    "runtime_version"="3.10.12"
+    "runtime_version"="3.10.12",
+    "deterministic"="true"
 )
 AS $$
 def evaluate(a, b):
@@ -77,7 +79,8 @@ RETURNS STRING
 PROPERTIES (
     "type"="PYTHON_UDF",
     "symbol"="evaluate",
-    "runtime_version"="3.10.12"
+    "runtime_version"="3.10.12",
+    "deterministic"="true"
 )
 AS $$
 def evaluate(s1, s2):
@@ -362,6 +365,7 @@ DROP FUNCTION IF EXISTS py_is_prime(INT);
 |`symbol`| Yes | - | Python function entry name.<br>• **Inline Mode**: Write function name directly, such as `"evaluate"`<br>• **Module Mode**: Format is `[package_name.]module_name.func_name`, see module mode description |
 |`file`| No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 |`runtime_version`| Yes | - | Python runtime version, such as `"3.10.12"`, requires complete version number |
+|`deterministic`| No |`false`| Whether the Python UDF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs, and the implementation does not depend on current time, random numbers, or external mutable state.<br>Correctly marking this property allows the optimizer to handle rewrite and other optimization scenarios more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 |`always_nullable`| No |`true`| Whether to always return nullable results |

 #### Runtime Version Description
@@ -979,6 +983,7 @@ PROPERTIES (
     "type"="PYTHON_UDF",
     "symbol"="ClassName",
     "runtime_version"="python_version",
+    "deterministic"="true|false",
     "always_nullable"="true|false"
 )
 AS $$
@@ -1409,6 +1414,7 @@ DROP FUNCTION IF EXISTS py_variance(DOUBLE);
 |`symbol`| Yes | - | Python class name.<br>• **Inline Mode**: Write class name directly, such as `"SumUDAF"`<br>• **Module Mode**: Format is `[package_name.]module_name.ClassName`|
 |`file`| No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 |`runtime_version`| Yes | - | Python runtime version, such as `"3.10.12"`|
+|`deterministic`| No |`false`| Whether the Python UDAF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs, and the implementation does not depend on current time, random numbers, or external mutable state.<br>Correctly marking this property allows the optimizer to handle rewrite and other optimization scenarios more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 |`always_nullable`| No |`true`| Whether to always return nullable results |

 #### runtime_version Description
@@ -1907,6 +1913,7 @@ PROPERTIES (
     "type"="PYTHON_UDF",
     "symbol"="function_name",
     "runtime_version"="python_version",
+    "deterministic"="true|false",
     "always_nullable"="true|false"
 )
 AS $$
@@ -2405,6 +2412,7 @@ CREATE TABLES FUNCTION py_split(STRING, STRING) ...;
 |`symbol`| Yes | - | Python function name.<br>• **Inline Mode**: Write function name directly, such as `"split_string_udtf"`<br>• **Module Mode**: Format is `[package_name.]module_name.function_name`|
 |`file`| No | - | Python `.zip` package path, only required for module mode. Supports three protocols:<br>• `file://` - Local filesystem path<br>• `http://` - HTTP remote download<br>• `https://` - HTTPS remote download |
 |`runtime_version`| Yes | - | Python runtime version, such as `"3.10.12"`|
+|`deterministic`| No |`false`| Whether the Python UDTF is deterministic.<br>Set it to `true` only when the same inputs always produce the same outputs, and the implementation does not depend on current time, random numbers, or external mutable state.<br>Correctly marking this property allows the optimizer to handle rewrite and other optimization scenarios more safely; incorrect marking may cause wrong query rewrite or pushdown behavior. |
 |`always_nullable`| No |`true`| Whether to always return nullable results |
docs/sql-manual/sql-statements/function/CREATE-FUNCTION.md
Lines changed: 89 additions & 1 deletion
@@ -75,6 +75,7 @@ CREATE [ GLOBAL ]
 > - `symbol`: Indicates the class name containing the UDF class. This parameter is mandatory.
 > - `type`: Indicates the UDF call type. The default is Native. Use JAVA_UDF when using a Java UDF.
 > - `always_nullable`: Indicates whether the UDF result may contain NULL values. This is an optional parameter with a default value of true.
+> - `deterministic`: Indicates whether a Java UDF or Python UDF is deterministic. This is an optional parameter with a default value of false. Set it to true only when identical inputs always produce identical outputs, and the implementation does not depend on current time, random numbers, or external mutable state. Correct marking allows the optimizer to handle query rewrites more safely; incorrect marking may lead to wrong query results.

 ## Access Control Requirements

@@ -135,4 +136,91 @@ To execute this command, the user must have `ADMIN_PRIV` privileges.

 ```sql
 CREATE GLOBAL ALIAS FUNCTION id_masking(INT) WITH PARAMETER(id) AS CONCAT(LEFT(id, 3), '****', RIGHT(id, 4));
-```
+```
+
+6. Create a non-deterministic Python UDF. A function that depends on randomness, such as one that calls `uuid.uuid4()`, should keep the default `deterministic = false` and must not be incorrectly marked as `true`.
+
+```sql
+CREATE TABLE cte_uuid_seed (id INT) ENGINE=OLAP DUPLICATE KEY(id)
+DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ("replication_num"="1");
+INSERT INTO cte_uuid_seed VALUES (1),(2),(3);
+
+DROP FUNCTION IF EXISTS py_uuid_token(INT);
+CREATE FUNCTION py_uuid_token(INT)
+RETURNS STRING
+PROPERTIES (
+    "type"="PYTHON_UDF",
+    "symbol"="py_uuid_token_impl",
+    "always_nullable"="false",
+    "runtime_version"="3.12.11"
+)
+AS $$
+import uuid
+def py_uuid_token_impl(x):
+    return f"{x}-{uuid.uuid4()}"
+$$;
+
+SET enable_cte_materialize = true;
+SET inline_cte_referenced_threshold = 10;
+
+WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
+SELECT id, COUNT(DISTINCT token) AS distinct_tokens
+FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
+GROUP BY id ORDER BY id;
+```
+
+Correct result:
+
+```text
++------+-----------------+
+| id   | distinct_tokens |
++------+-----------------+
+|    1 |               1 |
+|    2 |               1 |
+|    3 |               1 |
++------+-----------------+
+```
+
+For this function, the following definition is incorrect:
+
+```sql
+DROP FUNCTION IF EXISTS py_uuid_token(INT);
+CREATE FUNCTION py_uuid_token(INT)
+RETURNS STRING
+PROPERTIES (
+    "type"="PYTHON_UDF",
+    "symbol"="py_uuid_token_impl",
+    "always_nullable"="false",
+    "runtime_version"="3.12.11",
+    "deterministic"="true"
+)
+AS $$
+import uuid
+def py_uuid_token_impl(x):
+    return f"{x}-{uuid.uuid4()}"
+$$;
+```
+
+Run the same query again:
+
+```sql
+WITH cte AS (SELECT id, py_uuid_token(id) AS token FROM cte_uuid_seed)
+SELECT id, COUNT(DISTINCT token) AS distinct_tokens
+FROM (SELECT id, token FROM cte UNION ALL SELECT id, token FROM cte) u
+GROUP BY id ORDER BY id;
+```
+
+Incorrect result:
+
+```text
++------+-----------------+
+| id   | distinct_tokens |
++------+-----------------+
+|    1 |               2 |
+|    2 |               2 |
+|    3 |               2 |
++------+-----------------+
+```
+
+Why this is wrong:
+Because `py_uuid_token` is non-deterministic, each call to `uuid.uuid4()` generates a new value. If the function is incorrectly marked as `deterministic = true`, the optimizer may treat repeated references as safe to rewrite and may choose a plan that evaluates the UDF separately on both sides of `UNION ALL`. As a result, the same `id` can produce two different `token` values, and `COUNT(DISTINCT token)` changes from `1` to `2`.
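
The difference between the two results can be reproduced outside the database with a small plain-Python model. This is only a sketch of the reasoning in the explanation above, not Doris's optimizer: `materialized_plan`, `inlined_plan`, and `distinct_tokens` are hypothetical helper names introduced here for illustration, and the real planner's decisions depend on more than this one property.

```python
import uuid
from collections import defaultdict


def py_uuid_token_impl(x):
    # Same body as the UDF in the example: a fresh random UUID on every call.
    return f"{x}-{uuid.uuid4()}"


def distinct_tokens(rows):
    # Plain-Python stand-in for COUNT(DISTINCT token) ... GROUP BY id ORDER BY id.
    seen = defaultdict(set)
    for row_id, token in rows:
        seen[row_id].add(token)
    return {row_id: len(tokens) for row_id, tokens in sorted(seen.items())}


def materialized_plan(ids):
    # Hypothetical model: the CTE is computed once and both
    # UNION ALL branches read the same stored rows.
    cte = [(i, py_uuid_token_impl(i)) for i in ids]
    return distinct_tokens(cte + cte)


def inlined_plan(ids):
    # Hypothetical model: the CTE body is expanded into each branch,
    # so the UDF runs once per branch for every id.
    left = [(i, py_uuid_token_impl(i)) for i in ids]
    right = [(i, py_uuid_token_impl(i)) for i in ids]
    return distinct_tokens(left + right)


ids = [1, 2, 3]
print(materialized_plan(ids))  # {1: 1, 2: 1, 3: 1}
print(inlined_plan(ids))       # {1: 2, 2: 2, 3: 2}
```

Evaluating the function once and reusing the value gives one token per `id`, while evaluating it per branch gives two. Per-branch evaluation is only harmless when the function really is deterministic, which is why the property must describe the implementation accurately.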