Commit 9227cf7

[python] Introduce DataFusion SQL to PyPaimon (#7599)
This PR adds SQL query capabilities to PyPaimon, built on pypaimon-rust and DataFusion.
1 parent 051dc75 commit 9227cf7

10 files changed: +864 -1 lines changed

.github/workflows/paimon-python-checks.yml

Lines changed: 13 additions & 1 deletion
@@ -71,6 +71,8 @@ jobs:
             build-essential \
             git \
             curl \
+            pkg-config \
+            libssl-dev \
           && apt-get clean \
           && rm -rf /var/lib/apt/lists/*
 
@@ -139,12 +141,22 @@ jobs:
         if: matrix.python-version != '3.6.15'
         shell: bash
         run: |
-          pip install maturin
+          pip install maturin[patchelf]
           git clone -b support_directory https://github.com/JingsongLi/tantivy-py.git /tmp/tantivy-py
           cd /tmp/tantivy-py
           maturin build --release
           pip install target/wheels/tantivy-*.whl
 
+      - name: Build and install pypaimon-rust from source
+        if: matrix.python-version != '3.6.15'
+        shell: bash
+        run: |
+          git clone https://github.com/apache/paimon-rust.git /tmp/paimon-rust
+          cd /tmp/paimon-rust/bindings/python
+          maturin build --release -o dist
+          pip install dist/pypaimon_rust-*.whl
+          pip install 'datafusion>=52'
+
       - name: Run lint-python.sh
         shell: bash
         run: |

docs/content/pypaimon/cli.md

Lines changed: 102 additions & 0 deletions
@@ -621,3 +621,105 @@ default
 mydb
 analytics
 ```
+
+## SQL Command
+
+Execute SQL queries on Paimon tables directly from the command line. This feature is powered by pypaimon-rust and DataFusion.
+
+**Prerequisites:**
+
+```shell
+pip install pypaimon[sql]
+```
+
+### One-Shot Query
+
+Execute a single SQL query and display the result:
+
+```shell
+paimon sql "SELECT * FROM users LIMIT 10"
+```
+
+Output:
+```
+id  name     age  city
+1   Alice    25   Beijing
+2   Bob      30   Shanghai
+3   Charlie  35   Guangzhou
+```
+
+**Options:**
+
+- `--format, -f`: Output format: `table` (default) or `json`
+
+**Examples:**
+
+```shell
+# Direct table name (uses default catalog and database)
+paimon sql "SELECT * FROM users"
+
+# Two-part: database.table
+paimon sql "SELECT * FROM mydb.users"
+
+# Query with filter and aggregation
+paimon sql "SELECT city, COUNT(*) AS cnt FROM users GROUP BY city ORDER BY cnt DESC"
+
+# Output as JSON
+paimon sql "SELECT * FROM users LIMIT 5" --format json
+```
+
+### Interactive REPL
+
+Start an interactive SQL session by running `paimon sql` without a query argument. The REPL supports arrow keys for line editing, and command history is persisted across sessions in `~/.paimon_history`.
+
+```shell
+paimon sql
+```
+
+Output:
+```
+    ____        _
+   / __ \____ _(_)___ ___  ____  ____
+  / /_/ / __ `/ / __ `__ \/ __ \/ __ \
+ / ____/ /_/ / / / / / / / /_/ / / / /
+/_/    \__,_/_/_/ /_/ /_/\____/_/ /_/
+
+Powered by pypaimon-rust + DataFusion
+Type 'help' for usage, 'exit' to quit.
+
+paimon> SHOW DATABASES;
+default
+mydb
+
+paimon> USE mydb;
+Using database 'mydb'.
+
+paimon> SHOW TABLES;
+orders
+users
+
+paimon> SELECT count(*) AS cnt
+      > FROM users
+      > WHERE age > 18;
+cnt
+42
+(1 row in 0.05s)
+
+paimon> exit
+Bye!
+```
+
+SQL statements end with `;` and can span multiple lines. The continuation prompt ` >` indicates that more input is expected.
+
+**REPL Commands:**
+
+| Command | Description |
+|---|---|
+| `USE <database>;` | Switch the default database |
+| `SHOW DATABASES;` | List all databases |
+| `SHOW TABLES;` | List tables in the current database |
+| `SELECT ...;` | Execute a SQL query |
+| `help` | Show usage information |
+| `exit` / `quit` | Exit the REPL |
+
+For more details on SQL syntax and the Python API, see [SQL Query]({{< ref "pypaimon/sql" >}}).
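
Note on the REPL text added above: the rule that statements end with `;` and may span several lines can be sketched as a small input buffer. This is a minimal illustration only. `split_statements` is a hypothetical helper, not code from this commit (`cli_sql.py` is not shown in this excerpt), and the naive end-of-line check ignores semicolons inside string literals.

```python
from typing import Iterable, Iterator, List


def split_statements(lines: Iterable[str]) -> Iterator[str]:
    """Yield complete statements from a stream of input lines.

    Hypothetical helper: a statement is complete once a line ends with ';'.
    Bare commands such as 'help', 'exit', and 'quit' need no terminator.
    """
    bare_commands = {"help", "exit", "quit"}
    buffer: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between or inside statements
        if not buffer and line.lower() in bare_commands:
            yield line.lower()
            continue
        buffer.append(line)
        if line.endswith(";"):
            # Join the continuation lines and drop the trailing ';'.
            yield " ".join(buffer)[:-1].rstrip()
            buffer = []


session = [
    "SHOW TABLES;",
    "SELECT count(*) AS cnt",
    "FROM users",
    "WHERE age > 18;",
    "exit",
]
statements = list(split_statements(session))
print(statements)
```

While the buffer is non-empty, a real REPL would show the continuation prompt instead of `paimon>`.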

docs/content/pypaimon/sql.md

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
+---
+title: "SQL Query"
+weight: 8
+type: docs
+aliases:
+- /pypaimon/sql.html
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# SQL Query
+
+PyPaimon supports executing SQL queries on Paimon tables, powered by [pypaimon-rust](https://github.com/apache/paimon-rust/tree/main/bindings/python) and [DataFusion](https://datafusion.apache.org/python/).
+
+## Installation
+
+SQL query support requires additional dependencies. Install them with:
+
+```shell
+pip install pypaimon[sql]
+```
+
+This will install `pypaimon-rust` and `datafusion`.
+
+## Usage
+
+Create a `SQLContext`, register one or more catalogs with their options, and run SQL queries.
+
+### Basic Query
+
+```python
+from pypaimon.sql import SQLContext
+
+ctx = SQLContext()
+ctx.register_catalog("paimon", {"warehouse": "/path/to/warehouse"})
+ctx.set_current_catalog("paimon")
+ctx.set_current_database("default")
+
+# Execute SQL and get PyArrow Table
+table = ctx.sql("SELECT * FROM my_table")
+print(table)
+
+# Convert to Pandas DataFrame
+df = table.to_pandas()
+print(df)
+```
+
+### Table Reference Format
+
+The default catalog and default database can be configured via `set_current_catalog()` and `set_current_database()`, so you can reference tables in two ways:
+
+```python
+# Direct table name (uses default database)
+ctx.sql("SELECT * FROM my_table")
+
+# Two-part: database.table
+ctx.sql("SELECT * FROM mydb.my_table")
+```
+
+### Filtering
+
+```python
+table = ctx.sql("""
+    SELECT id, name, age
+    FROM users
+    WHERE age > 18 AND city = 'Beijing'
+""")
+```
+
+### Aggregation
+
+```python
+table = ctx.sql("""
+    SELECT city, COUNT(*) AS cnt, AVG(age) AS avg_age
+    FROM users
+    GROUP BY city
+    ORDER BY cnt DESC
+""")
+```
+
+### Join
+
+```python
+table = ctx.sql("""
+    SELECT u.name, o.order_id, o.amount
+    FROM users u
+    JOIN orders o ON u.id = o.user_id
+    WHERE o.amount > 100
+""")
+```
+
+### Subquery
+
+```python
+table = ctx.sql("""
+    SELECT * FROM users
+    WHERE id IN (
+        SELECT user_id FROM orders
+        WHERE amount > 1000
+    )
+""")
+```
+
+### Cross-Database Query
+
+```python
+# Query a table in another database using two-part syntax
+table = ctx.sql("""
+    SELECT u.name, o.amount
+    FROM default.users u
+    JOIN analytics.orders o ON u.id = o.user_id
+""")
+```
+
+### Multi-Catalog Query
+
+`SQLContext` supports registering multiple catalogs for cross-catalog queries:
+
+```python
+from pypaimon.sql import SQLContext
+
+ctx = SQLContext()
+ctx.register_catalog("a", {"warehouse": "/path/to/warehouse_a"})
+ctx.register_catalog("b", {
+    "metastore": "rest",
+    "uri": "http://localhost:8080",
+    "warehouse": "warehouse_b",
+})
+ctx.set_current_catalog("a")
+ctx.set_current_database("default")
+
+# Cross-catalog join
+table = ctx.sql("""
+    SELECT a_users.name, b_orders.amount
+    FROM a.default.users AS a_users
+    JOIN b.default.orders AS b_orders ON a_users.id = b_orders.user_id
+""")
+```
+
+## Supported SQL Syntax
+
+The SQL engine is powered by Apache DataFusion, which supports a rich set of SQL syntax including:
+
+- `SELECT`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT`
+- `JOIN` (INNER, LEFT, RIGHT, FULL, CROSS)
+- Subqueries and CTEs (`WITH`)
+- Aggregate functions (`COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, etc.)
+- Window functions (`ROW_NUMBER`, `RANK`, `LAG`, `LEAD`, etc.)
+- `UNION`, `INTERSECT`, `EXCEPT`
+
+For the full SQL reference, see the [DataFusion SQL documentation](https://datafusion.apache.org/user-guide/sql/index.html).
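
Note on the table reference forms in sql.md: the new page uses one-part (`my_table`), two-part (`mydb.my_table`), and, in the multi-catalog example, three-part (`a.default.users`) names. The sketch below shows how such names could expand against the current catalog and database. It is an assumption-laden illustration: `resolve_table` is a hypothetical function, the actual resolution is presumably performed by DataFusion, and quoted identifiers are not handled here.

```python
from typing import Tuple


def resolve_table(ref: str, current_catalog: str, current_database: str) -> Tuple[str, str, str]:
    """Expand a one-, two-, or three-part name to (catalog, database, table)."""
    parts = ref.split(".")
    if len(parts) == 1:
        # Bare table name: fill in both defaults.
        return (current_catalog, current_database, parts[0])
    if len(parts) == 2:
        # database.table: fill in the current catalog only.
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 3:
        # catalog.database.table: fully qualified, nothing to fill in.
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"expected at most three name parts, got {ref!r}")


# Mirrors the multi-catalog example: current catalog "a", database "default".
for name in ("users", "analytics.orders", "b.default.orders"):
    print(name, "->", resolve_table(name, "a", "default"))
```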

paimon-python/pypaimon/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -28,11 +28,13 @@
 from pypaimon.schema.schema import Schema
 from pypaimon.tag.tag import Tag
 from pypaimon.tag.tag_manager import TagManager
+from pypaimon.sql.sql_context import SQLContext
 
 __all__ = [
     "PaimonVirtualFileSystem",
     "CatalogFactory",
     "Schema",
     "Tag",
     "TagManager",
+    "SQLContext",
 ]
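
Note on the top-level import: `SQLContext` is now imported whenever `pypaimon` is imported, while `pypaimon-rust` and `datafusion` are optional extras (`pip install pypaimon[sql]`). One common pattern for that situation is to defer the optional imports until first use and fail with an install hint. Whether `sql_context.py` does this is not visible in this excerpt, so the sketch below is an assumption, and `require_optional` is a hypothetical helper.

```python
import importlib


def require_optional(module_name: str, extra: str = "sql"):
    """Import an optional module, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for SQL support; "
            f"install it with: pip install pypaimon[{extra}]"
        ) from exc


# 'json' stands in for an installed dependency, and the second name is
# deliberately nonexistent, so the sketch runs without pypaimon installed.
json_module = require_optional("json")
try:
    require_optional("module_that_is_not_installed_hypothetical")
    hint = ""
except ImportError as err:
    hint = str(err)
print(hint)
```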

paimon-python/pypaimon/cli/cli.py

Lines changed: 4 additions & 0 deletions
@@ -121,6 +121,10 @@ def main():
     from pypaimon.cli.cli_catalog import add_catalog_subcommands
     add_catalog_subcommands(catalog_parser)
 
+    # SQL command
+    from pypaimon.cli.cli_sql import add_sql_subcommand
+    add_sql_subcommand(subparsers)
+
     args = parser.parse_args()
 
     if args.command is None:
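
Note on `add_sql_subcommand`: its body lives in `pypaimon/cli/cli_sql.py`, which this excerpt does not show. Going only by the CLI documentation in this commit (an optional query argument, and `--format, -f` with `table` or `json`), the registration could look roughly like the following. This is an illustrative sketch with argparse, not the committed implementation; the help strings are invented.

```python
import argparse


def add_sql_subcommand(subparsers) -> None:
    """Register a 'sql' subcommand (illustrative; not the code from cli_sql.py)."""
    sql_parser = subparsers.add_parser("sql", help="Run SQL on Paimon tables")
    # The query is optional: omitting it is how the REPL is requested.
    sql_parser.add_argument("query", nargs="?", default=None)
    sql_parser.add_argument(
        "--format", "-f", choices=["table", "json"], default="table",
        help="Output format",
    )


parser = argparse.ArgumentParser(prog="paimon")
subparsers = parser.add_subparsers(dest="command")
add_sql_subcommand(subparsers)

one_shot = parser.parse_args(["sql", "SELECT * FROM users LIMIT 5", "--format", "json"])
repl = parser.parse_args(["sql"])
print(one_shot.query, one_shot.format)
print(repl.query is None, repl.format)
```

Using `nargs="?"` lets a single subcommand serve both modes: a present query means one-shot execution, and `None` means start the interactive session.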
