Description
Background
Users of datafusion-python can create table providers from at least two different pathways:
- A table provider created via PyCapsule (i.e. custom providers implemented in Rust or elsewhere, exposed to Python via __datafusion_table_provider__).
- A view-based provider created via into_view() on an existing DataFrame, then registered with a SessionContext.
These two types serve similar roles (they supply data / logical plans to DataFusion), but currently they behave differently, which can lead to confusion.
Problem / Confusion
- A user might reasonably expect that a table provider object created with into_view() could be registered with a session context in the same way as a PyCapsule-exposed provider, but that may not always work (or may not be documented).
- There is a risk of mismatch in how the internals treat providers from the two sources (views vs PyCapsules).
- Without a unified type or interface, it’s unclear whether certain operations should/can be supported for both.
- The divergence might cause unexpected errors or surprising behavior for the user, especially around registration, reuse, or compatibility of providers.
Desired Behavior / Suggestion
- Define a single common PyTableProvider (or similarly named abstraction) that works identically whether created via into_view() or via a PyCapsule / external source.
- Ensure that SessionContext.register_table_provider(...) accepts this common type regardless of source.
- Document clearly:
  - what kinds of table providers are accepted (views, PyCapsules, external)
  - how to obtain them from each path
  - equivalence or limitations (if any)
- Possibly enhance the implementation so that a view-based provider can be converted (or wrapped) into the same internal abstraction that a PyCapsule provider uses.
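To make the suggestion concrete, here is a rough pure-Python sketch of what "one common type, accepted regardless of source" could look like. It does not use the real datafusion-python classes: UnifiedTableProvider, as_table_provider, ToyContext, FakeCapsuleSource and FakeDataFrame are all hypothetical names made up for illustration, and the actual implementation would most likely live on the Rust side. Only __datafusion_table_provider__, into_view() and register_table_provider(...) come from the description above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class UnifiedTableProvider:
    """Hypothetical stand-in for the proposed common provider type."""
    inner: Any   # the capsule or view object obtained from the source
    source: str  # "capsule" or "view", kept only for diagnostics


def as_table_provider(obj: Any) -> UnifiedTableProvider:
    """Normalize any supported source into the single common type (assumed helper)."""
    if isinstance(obj, UnifiedTableProvider):
        return obj
    # PyCapsule path: the object exports __datafusion_table_provider__.
    export = getattr(obj, "__datafusion_table_provider__", None)
    if callable(export):
        return UnifiedTableProvider(inner=export(), source="capsule")
    # View path: a DataFrame-like object exposes into_view().
    into_view = getattr(obj, "into_view", None)
    if callable(into_view):
        return UnifiedTableProvider(inner=into_view(), source="view")
    raise TypeError(f"{type(obj).__name__} is not a supported table provider source")


class ToyContext:
    """Toy registry standing in for SessionContext (not the real class)."""

    def __init__(self) -> None:
        self.tables = {}

    def register_table_provider(self, name: str, provider: Any) -> None:
        # Every source goes through the same normalization step.
        self.tables[name] = as_table_provider(provider)


class FakeCapsuleSource:
    """Mimics an external provider; returns a dummy object, not a real PyCapsule."""

    def __datafusion_table_provider__(self) -> object:
        return object()


class FakeDataFrame:
    """Mimics a DataFrame; returns a dummy object, not a real view."""

    def into_view(self) -> object:
        return object()


ctx = ToyContext()
ctx.register_table_provider("from_capsule", FakeCapsuleSource())
ctx.register_table_provider("from_view", FakeDataFrame())
print({name: p.source for name, p in ctx.tables.items()})
```

The point of the sketch is that registration has a single entry point and a single stored type, so any later operation only needs to handle UnifiedTableProvider rather than branching on where the provider came from.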
Benefits
- Reduced confusion for users.
- More consistency in the API.
- Easier to reason about table providers across different parts of a codebase.
- Potentially fewer bugs when mixing providers from different sources.
Context
This issue is motivated by this comment: #1016 (comment)