Python Table UDFs by paultiq · Pull Request #99 · duckdb/duckdb-python

paultiq · 2025-10-03T14:24:21Z

Per discussion #84, this PR implements Python Table Valued Functions (also known as User Defined Table Functions)*.

Table Valued Functions* allow Python callables to be registered as DuckDB Table Functions, returning either an Iterator[Sequence] of rows or an Arrow table.

This is implemented primarily in python_tvf.cpp. Tuple TVFs scan the py::iter from the Callable. Arrow TVFs delegate to ArrowTableFunction.

* While the main code is ready for review, I'll need a pointer on how to properly add and regenerate the bindings. I added them manually in pyconnection.cpp, since I know some work was being done on the stubs and I didn't want to conflict.

** Getting the GIL and reference counting part right took a bit of work, especially around destruction. I found PythonObjectContainer late (after trying other approaches), so please let me know if this is the right approach.

Edits

Edit 1: Changed schemas to dictionary, rather than a List[Tuple]

Changes in this PR

New functions: create_table_function and unregister_table_function
New enum: PythonTVFType
Add: case_insensitive_map_t<unique_ptr<ExternalDependency>> registered_table_functions; to PyConnection
Change: PyConnection now releases the GIL when materializing. This was needed to prevent a deadlock where the TVF needs the GIL to scan.
Tests: Added tests for TUPLES, ARROW_TABLE, along with registration & schema mismatches.

Registration

This implementation adds two new functions:

create_table_function
unregister_table_function

import duckdb
from duckdb.sqltypes import VARCHAR, BIGINT

conn = duckdb.connect()

conn.create_table_function(
    name="gen_function",
    callable=lambda count: [(f"name_{i}", i) for i in range(count)] ,
    parameters=["count"],
    schema={"name": VARCHAR, "value": BIGINT},
    type="tuples", # or "arrow_table" or PythonTVFType.TUPLES or PythonTVFType.ARROW_TABLE
)

table = conn.execute("select * from gen_function(count:=10)").fetch_arrow_table()

Parameters

Parameters are declared as a list of parameter names, such as parameters = ["col1", "col2"].

Positional parameters do not need to be declared: parameters=None still allows positional parameters to be called.
Parameters called by name need to be declared: myfunction(count:=10)

Schema

The schema of a Table Function must be declared at bind time. This is done by capturing the return type at registration time. *

Schemas are defined as a dict mapping each column name to its data type, for example {"name": VARCHAR, "value": BIGINT}.

* It's possible, but not implemented, to infer the schema for Arrow Tables rather than requiring it. Or, perhaps, to be more lenient.
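To illustrate why a schema declared up front is useful for tuple-returning functions, here is a minimal sketch of a row-width check against a declared schema. It is plain Python and not the PR's C++ code: `check_rows` is a hypothetical helper, and the string type names stand in for the duckdb type objects.

```python
from typing import Any, Iterable, Iterator, Mapping, Sequence


def check_rows(
    rows: Iterable[Sequence[Any]], schema: Mapping[str, str]
) -> Iterator[Sequence[Any]]:
    """Yield rows unchanged, raising if a row's width differs from the schema."""
    width = len(schema)
    for index, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(
                f"row {index} has {len(row)} values, schema declares {width} columns"
            )
        yield row


# Hypothetical schema; strings stand in for VARCHAR / BIGINT type objects.
schema = {"name": "VARCHAR", "value": "BIGINT"}
good = list(check_rows([("name_0", 0), ("name_1", 1)], schema))
```

Because the check is itself a generator, it validates rows lazily as they are consumed rather than forcing the whole result into memory.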

Test Failures:

There are two test failures:

tests/fast/api/test_duckdb_connection.py::TestDuckDBConnection::test_wrap_coverage : Once the main code is reviewed, I just need help figuring out how to do this properly. I didn't want to interfere with the stubs work being done.
tests/fast/arrow/test_polars.py::TestPolars::test_polars_lazy_pushdown_numeric: See Polars Lazy Filter not working with latest Polars (1.34.0) #98

Tuples Example with positional args

import duckdb
from duckdb.sqltypes import VARCHAR, BIGINT

def simple_generator(count: int = 10):
    for i in range(count):
        yield (f"name_{i}", i)

with duckdb.connect() as conn:
    conn.create_table_function(
        name="gen_function",
        callable=simple_generator,
        parameters=["count"],
        schema={"name": VARCHAR, "value": BIGINT},
        type="tuples",
    )

    result = conn.sql(
        "SELECT * FROM gen_function(?)",
        params=(5,),
    ).fetchall()

Example with Parameters=None

import duckdb
from duckdb.sqltypes import VARCHAR, BIGINT

conn = duckdb.connect()

conn.create_table_function(
    name="gen_function",
    callable=lambda x,count=10: [(f"name_{x}", i) for i in range(count)] ,
    parameters=None,
    schema={"name": VARCHAR, "value": BIGINT},
    type="tuples",
)

table = conn.execute("select * from gen_function('foo')").fetch_arrow_table()

Arrow Example

import pyarrow as pa
import duckdb
from duckdb.sqltypes import BIGINT, VARCHAR

conn = duckdb.connect()

def my_function(count=10):
    return pa.table({
        "id": list(range(count)),
        "value": [i * 2 for i in range(count)],
        "name": [f"row_{i}" for i in range(count)],
    })

conn.create_table_function(
    name="somefunction",
    callable=my_function,
    parameters=["count"],
    schema={"id": BIGINT, "value": BIGINT, "name": VARCHAR},
    type="arrow_table",
)

table = conn.execute("select * from somefunction(10)").fetch_arrow_table()

Discussion

Feature Name

What to call this feature?

Some databases refer to the feature as Table Valued Functions, and others as User Defined Table Functions. Either name works for me... I started with TVF but I think I'm now leaning towards UDTFs.

Deleting / Unregistering

There doesn't appear to be a way to truly "delete" or "unregister" table functions from a connection... so does it make sense to even have an unregister? I chose to require an explicit unregister prior to registering a different function with same name, but this is somewhat arbitrary.
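The "explicit unregister before re-registering" rule can be sketched in plain Python as follows. This is only a model of the behaviour described above, not the PR's implementation: `TableFunctionRegistry` and its methods are hypothetical, and the case-insensitive lookup is an assumption based on the `case_insensitive_map_t` member listed under the PR's changes.

```python
from typing import Callable, Dict


class TableFunctionRegistry:
    """Hypothetical registry: case-insensitive names, no silent overwrite."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, name: str, func: Callable) -> None:
        key = name.lower()
        if key in self._functions:
            # Mirrors the choice above: the caller must unregister first.
            raise ValueError(f"table function '{name}' is already registered")
        self._functions[key] = func

    def unregister(self, name: str) -> None:
        key = name.lower()
        if key not in self._functions:
            raise KeyError(f"no table function named '{name}'")
        del self._functions[key]

    def lookup(self, name: str) -> Callable:
        return self._functions[name.lower()]


registry = TableFunctionRegistry()
registry.register("gen_function", lambda: [(1,)])
registry.unregister("GEN_FUNCTION")  # the same entry, matched case-insensitively
registry.register("gen_function", lambda: [(2,)])
```

Raising on a duplicate name makes accidental replacement visible, at the cost of an extra call when replacement is intended.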

Materializing vs Streaming the Iterator

I didn't notice any significant performance difference between streaming the iterator and fully materializing it. The benefits of streaming outweighed the added complexity.

There is perhaps some room for checking the callable result to see if it has a Length to set the Cardinality initially which may help the optimizer.
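A minimal sketch of that length check, assuming the callable's result is inspected before scanning starts. `cardinality_hint` is a hypothetical helper name, and how the value would be handed to the optimizer is not covered here.

```python
from typing import Any, Optional


def cardinality_hint(result: Any) -> Optional[int]:
    """Return len(result) for sized results, or None (e.g. for generators)."""
    try:
        return len(result)
    except TypeError:
        # Generators and other plain iterators do not define __len__.
        return None


sized = cardinality_hint([("a", 1), ("b", 2)])
unsized = cardinality_hint(row for row in [("a", 1)])
```

A list-returning callable yields a usable row count, while a generator falls back to "unknown".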

Other Callable Types

This PR supports TUPLES and ARROW_TABLES. The TUPLE implementation supports streaming, but the ARROW_TABLES are fully materialized.
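To make the streaming case concrete, the sketch below pulls rows from an iterator in fixed-size chunks instead of collecting everything first. It is plain Python rather than the PR's scan function: `scan_chunks` is a hypothetical name, and the default of 2048 is an assumption (the PR text does not state the chunk size it uses).

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Sequence


def scan_chunks(
    rows: Iterable[Sequence[Any]], chunk_size: int = 2048
) -> Iterator[List[Sequence[Any]]]:
    """Pull at most chunk_size rows at a time from the source iterator."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # source exhausted
        yield chunk


def simple_generator(count: int):
    for i in range(count):
        yield (f"name_{i}", i)


chunk_lengths = [len(chunk) for chunk in scan_chunks(simple_generator(5), chunk_size=2)]
```

Only one chunk is held at a time, so memory use is bounded by the chunk size rather than by the total number of rows.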

ARROW_BATCHED_READER would make sense as a next type. I am assuming this is not required in an initial implementation.

(future) Passing a cursor

For "table in" situations, I'd think we could add a "pass_connection" option to create_table_function. If enabled, a cursor to the current connection would be passed to the TVF as a kwarg.

* I left it out of this initial PR to keep it simple.

Something like:

def myfunction(conn, values=10):
    return conn.execute("select * from range(10)").fetchall()

conn.create_table_function(
    name="gen_function",
    callable=myfunction,
    parameters=["values"],
    pass_connection=True,  # when true, a cursor is passed to the 'conn' kwarg.
    schema=[["name", "int"]],
    type=PythonTVFType.TUPLES,
)

TODO

Add support for duckdb.typing types in the schema
Consider implementing parameter inference: look at the function signature to populate parameters.
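One possible shape for parameter inference, using only the standard library's `inspect.signature`. `infer_parameters` is a hypothetical helper, and the decision to skip `*args`, `**kwargs` and positional-only parameters is an assumption, not something decided in this PR.

```python
import inspect
from typing import Callable, List


def infer_parameters(func: Callable) -> List[str]:
    """Return the names of parameters that can be passed by keyword."""
    allowed = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return [
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind in allowed
    ]


def simple_generator(count: int = 10):
    for i in range(count):
        yield (f"name_{i}", i)


inferred = infer_parameters(simple_generator)
inferred_lambda = infer_parameters(
    lambda x, count=10: [(f"name_{x}", i) for i in range(count)]
)
```

With something like this, `parameters` could default to the inferred names while still allowing an explicit override.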

Tishj · 2025-10-05T20:40:58Z

        python_import_cache.cpp
        python_replacement_scan.cpp
        python_udf.cpp
+        python_tvf.cpp


python_table_udf.cpp has my preference

For types like PythonTVFType, would you prefer PythonTUDF, or PythonUDTF, or PythonTableUDF?

For what it's worth, Snowflake and Databricks have gone with UDTF: https://docs.snowflake.com/en/developer-guide/udf/sql/udf-sql-tabular-functions and https://docs.databricks.com/aws/en/udf/python-udtf

I don't like TUDF as an abbreviation, but UDTF or TableUDF both sound good to me.

My preference is table udf
So a search for "udf" finds both versions

Done: renamed files to "table_udf" and in code to TableUDF

Tishj · 2025-10-09T16:54:19Z

+			throw InvalidInputException("Invalid schema format: expected [name, type] pairs, got string '%s'",
+			                            py::str(item).cast<std::string>());
+		}
+		if (!py::hasattr(item, "__getitem__") || py::len(item) < 2) {


This ignores cases where >2 items are given

But I don't get why we are taking schemas as [[name, type], [name, type]] instead of {name: type, name: type} ?

At initial design, I wasn't sure if I'd need any other attributes other than Name and Type, so left it as a List of Tuples (or List of Lists).

But, looking back, a mapping makes more sense.

Will do.

Done:

Modified to schema={"x": sqltypes.BIGINT, "y": sqltypes.BIGINT, "name": sqltypes.VARCHAR}.

Updated PR examples and test cases to match.

Tishj · 2025-10-09T16:57:55Z

+	switch (type) {
+	case PythonTVFType::TUPLES:
+		tf =
+		    duckdb::TableFunction(name, {}, +PyTVFTuplesScanFunction, +PyTVFTuplesBindFunction, +PyTVFTuplesInitGlobal);


What is this syntax? +PyTVFTuplesScanFunction ?
I've seen &PyTVFTuplesScanFunction, which makes sense because you're taking the address of the function, but even that is redundant, you can use PyTVFTuplesScanFunction directly afaik

This was a holdover from a fight with the linter, it's unnecessary / will remove.

Done, removed.

https://github.com/paultiq/duckdb-pythonf/blob/e2f465e13bb29a707123f776ffbafe790648fd78/src/duckdb_py/python_table_udf.cpp#L344-L350

Tishj · 2025-10-09T17:03:33Z


+	connection_module.def("create_table_function", &DuckDBPyConnection::RegisterTableFunction,
+	                      "Register a table valued function via Callable", py::arg("name"), py::arg("callable"),
+	                      py::arg("parameters") = py::none(), py::arg("schema") = py::none(),


parameters should be a keyword-only argument
making the python signature equivalent to: create_table_function(name, callable, schema, type, *, parameters)
We can even infer the parameters of the function, we have similar logic for scalar udfs

I also feel like type can be a keyword-only argument, defaulting to TUPLES if omitted

Done:

kwargs for type & parameters

default for type is TUPLES

parameters is optional

I did not (yet?) do the parameter inference. I want to think about that a little bit. Added to the TODO in the PR comment.

connection_module.def("create_table_function", &DuckDBPyConnection::RegisterTableFunction,
                      "Register a table user defined function via Callable", py::arg("name"),
                      py::arg("callable"), py::arg("schema"), py::kw_only(),
                      py::arg("type") = PythonTableUDFType::TUPLES, py::arg("parameters") = py::none());
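From the Python side, the binding above should behave roughly like the pure-Python stand-in below. This stub is not the real binding: it only mirrors the argument layout (three positional arguments, then keyword-only `type` and `parameters`) and echoes what it received, with the string "tuples" standing in for the enum default.

```python
def create_table_function(name, callable, schema, *, type="tuples", parameters=None):
    """Stand-in with the same argument layout as the binding; registers nothing."""
    # 'callable' and 'type' shadow builtins here only to mirror the binding's names.
    return {"name": name, "schema": schema, "type": type, "parameters": parameters}


# Keyword-only arguments may be omitted or passed by name.
ok = create_table_function("gen_function", print, {"value": "BIGINT"}, parameters=["count"])

# Passing 'type' positionally is rejected by Python itself.
try:
    create_table_function("gen_function", print, {"value": "BIGINT"}, "tuples")
    positional_type_rejected = False
except TypeError:
    positional_type_rejected = True
```

Keeping `type` and `parameters` keyword-only leaves room to add or reorder options later without breaking positional callers.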

Tishj · 2025-10-09T17:08:50Z

+			throw InvalidInputException("Invalid schema format: each schema item must be a [name, type] pair");
+		}
+		names.emplace_back(py::str(item[py::int_(0)]));
+		types.emplace_back(TransformStringToLogicalType(py::str(item[py::int_(1)])));


This can accept a DuckDBPyType instead, we can extract the logical type from that
see this for example:

case PythonObjectType::Value: {
	// Extract the internal object and the type from the Value instance
	auto object = ele.attr("object");
	auto type = ele.attr("type");
	shared_ptr<DuckDBPyType> internal_type;
	if (!py::try_cast<shared_ptr<DuckDBPyType>>(type, internal_type)) {
		string actual_type = py::str(type.get_type());
		throw InvalidInputException("The 'type' of a Value should be of type DuckDBPyType, not '%s'", actual_type);
	}
	return TransformPythonValue(object, internal_type->Type());
}

Done:

schema is a mapping of str=>DuckDBPyType

Tishj · 2025-10-09T17:14:51Z

+	}
+};
+
+struct PyTVFTuplesGlobalState : public GlobalTableFunctionState {


Keep in mind that this only supports single threaded execution, because the virtual function MaxThreads returns 1 by default.
You are using the global state directly so I think you're aware of this, but just double checking.
I think it's correct though, because enabling multi-threaded execution for a Python table UDF sounds like it wouldn't help much: all the time is spent in Python, so the GIL would make it essentially single-threaded anyway.

> because the virtual function MaxThreads returns 1 by default.

Do you think we should make this explicit in table_udf, or is it fine to rely on the default?

> enabling multi-threaded execution for a Python table UDF sounds like it wouldn't help much

Agree. Although, with free-threading, the GIL constraint goes away... but I think such complex cases (that need multi-threaded consumption of a Python callable) are best handled in Python-land.

Tishj · 2025-10-09T17:16:45Z

+	named_parameter_map_t kwargs;
+	vector<LogicalType> return_types;
+	vector<string> return_names;
+	PythonObjectContainer python_objects; // Holds the callable


I would like a test where we create a table function, create a view that uses the table function, then unregister the table function, and make sure to exit the scope where the python table function was created.
Then execute the view.

Just to make sure that this is enough to keep the python callable alive

(This message is misplaced, it's related to the callable of the PyTVFInfo)

Done:

tests/fast/table_udf/test_tuples.py::test_callable_lifetime_in_view

Tishj · 2025-10-09T17:25:21Z

+static unique_ptr<PyTVFBindData> PyTVFBindInternal(ClientContext &context, TableFunctionBindInput &in,
+                                                   vector<LogicalType> &return_types, vector<string> &return_names) {
+	// Disable progress bar to prevent GIL deadlock with Jupyter
+	// TODO: Decide if this is still needed - was a problem when fully materializing, but switched to streaming


unresolved TODO

Done:

disabling progress bar was not needed.

This was a holdover from my first implementation, which materialized everything at init time and never released the GIL.

paultiq · 2025-10-12T16:52:42Z

Note: There's an unrelated issue with some CTE tests - the tests are also failing on main.

paultiq · 2025-10-28T14:02:27Z

I'm closing so I can take a step back and revisit this and re-evaluate the performance characteristics of this approach.

While this approach worked, I found a few opportunities to improve it. In particular: using arrow_c_stream instead of requiring Arrow (which would unlock arrow/polars/pandas inputs), redoing how the schema is handled, and rethinking the single-threaded implication.

paultiq added 2 commits October 3, 2025 13:35

feat: implement table valued functions / user defined table functions

a9bb62d

tests: add TVF test cases

013081f

Tishj reviewed Oct 5, 2025

Tishj reviewed Oct 9, 2025

paultiq changed the title ~~Python Table Valued Functions / User Defined Table Functions~~ Python Table UDFs Oct 10, 2025

Merge branch 'main' into tvf_impl

262ccfd

paultiq marked this pull request as draft October 10, 2025 12:12

paultiq and others added 3 commits October 11, 2025 13:45

Cleanup from first round of PR review: Rename from TVF to Table UDF, …

8384dc3

…use a dict[str,duckdb.sqltype] for schema, kwargs for create_table_function, clean up tests

chore: linting and formatting (triggered by new pre-commits).

e2f465e

Merge branch 'main' into tvf_impl

88198b7

paultiq marked this pull request as ready for review October 12, 2025 16:52

paultiq closed this Oct 28, 2025
