
Commit 59102c5

[doc] add SeaweedFS integration doc (#3607)
## Versions

- [x] dev
- [x] 4.x
- [x] 3.x
- [ ] 2.1

## Languages

- [x] Chinese
- [x] English

## Summary

Adds an Iceberg lakehouse integration page for [SeaweedFS](https://github.com/seaweedfs/seaweedfs), which exposes both an S3 object endpoint and an Apache Iceberg REST Catalog from the same `weed` process.

The doc walks through:

1. Starting `weed mini` with a single IAM config and a pre-created S3 Tables bucket.
2. Registering the catalog in Doris with `iceberg.catalog.type = "rest"`, where the OAuth2 client credentials and the S3 keys are the same access pair.
3. Reading and writing an Iceberg table.

The same end-to-end path is exercised in CI by the `TestDorisIcebergCatalog` integration test in the SeaweedFS repo (`test/s3tables/catalog_doris/`), which boots SeaweedFS, registers a Doris Iceberg catalog against it, writes rows via PyIceberg, and reads them back from `apache/doris:doris-all-in-one-2.1.0`. The Doris catalog properties in the doc are the ones the test uses verbatim.

## Files

- 4 English pages: `docs/`, `docs-next/`, `versioned_docs/version-3.x`, `versioned_docs/version-4.x`
- 4 Chinese pages: corresponding `i18n/zh-CN/...` paths
- 4 sidebars: `sidebars.ts`, `sidebars-next.ts`, `versioned_sidebars/version-{3,4}.x-sidebars.json`, with `doris-seaweedfs` slotted under the Iceberg Catalog category, after `doris-lakekeeper`.

I'm the maintainer of SeaweedFS, so I can keep this page in sync with future changes upstream.
1 parent dde83e9 commit 59102c5

9 files changed

Lines changed: 1026 additions & 0 deletions


Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
1+
---
2+
{
3+
"title": "Integration with SeaweedFS",
4+
"language": "en"
5+
}
6+
---
7+
8+
[SeaweedFS](https://seaweedfs.com/) is a distributed storage system that exposes both an S3-compatible object API and an Apache Iceberg REST Catalog from the same `weed` process. Parquet data and Iceberg metadata are served by one executable, authenticated by one S3 credential pair.
9+
10+
This page shows the minimal configuration for using SeaweedFS as the storage and catalog backend of a Doris Iceberg lakehouse. The same end-to-end path is exercised by the [`TestDorisIcebergCatalog`](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris) integration test in the SeaweedFS repository, which boots a SeaweedFS mini cluster, registers a Doris Iceberg catalog against it, writes rows with PyIceberg, and reads them back from `apache/doris:doris-all-in-one-2.1.0`.
11+
12+
## Why SeaweedFS for an Iceberg lakehouse
13+
14+
A typical lakehouse stack today stitches together three layers:
15+
16+
* Object storage (S3 or compatible)
17+
* A standalone Iceberg catalog (Hive Metastore, Glue, Polaris, Lakekeeper, Nessie, ...)
18+
* A query engine (Doris, Spark, Trino, ...)
19+
20+
SeaweedFS collapses the first two into one process. The same `weed` executable is both:
21+
22+
* the S3-compatible object store that holds the parquet files, and
23+
* the Iceberg REST Catalog that holds the table metadata.
24+
25+
So Doris talks to one system instead of two. The practical implications:
26+
27+
* **Fewer moving parts.** No Hive Metastore, no Glue, no Postgres backing a separate catalog, no STS role to provision.
28+
* **Simpler deployment.** One executable, one IAM config, one S3 credential pair shared by Doris's Iceberg REST client and its S3 reader.
29+
* **Local or on-prem friendly.** Nothing in the path requires a cloud-native service. The same setup runs on a laptop, a single VM, or a Kubernetes cluster.
30+
* **Lower latency on the metadata path.** Catalog state lives in the same SeaweedFS filer that serves the data, so namespace and table lookups don't cross a separate service boundary.
31+
* **S3-native on disk.** Tables are stored as standard Iceberg directories in S3 buckets. Any S3 client (rclone, `aws s3`, Spark, Trino, Dremio, RisingWave) can read or replicate them alongside Doris.
32+
33+
Architecturally:
34+
35+
```text
36+
Doris
37+
|
38+
v
39+
Iceberg tables
40+
|
41+
v
42+
SeaweedFS (S3 storage + REST catalog)
43+
```
44+
45+
For smaller teams or internal platforms, this is a clean way to build a lakehouse without depending on a separate metastore service.
46+
47+
## 1. Start SeaweedFS
48+
49+
Build or install `weed` from [github.com/seaweedfs/seaweedfs](https://github.com/seaweedfs/seaweedfs).
50+
51+
Create an IAM config that grants an access key full S3 access. The same key pair also serves as the OAuth2 client credentials for the Iceberg REST endpoint:
52+
53+
```json
54+
{
55+
"identities": [
56+
{
57+
"name": "doris",
58+
"credentials": [
59+
{
60+
"accessKey": "AKIAIOSFODNN7EXAMPLE",
61+
"secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
62+
}
63+
],
64+
"actions": ["Admin"]
65+
}
66+
]
67+
}
68+
```
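If the key pair is generated per environment, the IAM file can be produced by a script instead of edited by hand. The following is a minimal sketch that reproduces the JSON layout shown above. The function names `build_iam_config` and `write_iam_config` are hypothetical helpers for illustration and are not part of SeaweedFS.

```python
import json


def build_iam_config(name: str, access_key: str, secret_key: str) -> dict:
    """Return an IAM config with one identity that holds the "Admin" action.

    The layout mirrors the JSON example on this page; nothing else is assumed
    about the SeaweedFS IAM schema.
    """
    return {
        "identities": [
            {
                "name": name,
                "credentials": [
                    {"accessKey": access_key, "secretKey": secret_key}
                ],
                "actions": ["Admin"],
            }
        ]
    }


def write_iam_config(path: str, config: dict) -> None:
    """Write the config as indented JSON to `path`."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
        fh.write("\n")


if __name__ == "__main__":
    cfg = build_iam_config(
        "doris",
        "AKIAIOSFODNN7EXAMPLE",
        "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    )
    print(json.dumps(cfg, indent=2))
```

Calling `write_iam_config("/etc/seaweedfs/iam_config.json", cfg)` would produce the file that `-s3.config` points at in the next step.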
69+
70+
Start a single-process cluster with the Iceberg REST endpoint and a pre-created table bucket:
71+
72+
```bash
73+
weed mini \
74+
-ip "$(hostname -I | awk '{print $1}')" \
75+
-dir /var/lib/seaweedfs \
76+
-s3.config /etc/seaweedfs/iam_config.json \
77+
-tableBucket iceberg-tables
78+
```
79+
80+
`weed mini` runs master, volume, filer, S3, and the Iceberg REST catalog in one process. Default ports:
81+
82+
| Component | Port | Override flag |
83+
| --------- | ---- | ------------- |
84+
| Master HTTP | 9333 | `-master.port` |
85+
| Filer HTTP | 8888 | `-filer.port` |
86+
| S3 | 8333 | `-s3.port` |
87+
| Iceberg REST | 8181 | `-s3.port.iceberg` |
88+
89+
`-tableBucket iceberg-tables` creates an S3 Tables bucket named `iceberg-tables` on startup. This is the Iceberg-aware bucket type that Doris writes into.
90+
91+
To verify that the catalog is reachable, query its config endpoint, replacing `SEAWEED_HOST` with the address passed to `-ip`:
92+
93+
```bash
94+
curl -s http://SEAWEED_HOST:8181/v1/config | jq .
95+
```
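In a provisioning script it can be useful to wait until the endpoint answers before registering the catalog in Doris. The sketch below polls the same `/v1/config` URL using only the Python standard library. The names `catalog_config_url` and `wait_for_catalog`, the retry count, and the delay are illustrative assumptions. The `opener` parameter exists only so the function can be exercised without a running server.

```python
import json
import time
import urllib.error
import urllib.request


def catalog_config_url(host: str, port: int = 8181) -> str:
    """Build the Iceberg REST config URL for the given host and port."""
    return f"http://{host}:{port}/v1/config"


def wait_for_catalog(url, attempts=10, delay=1.0, opener=urllib.request.urlopen):
    """Poll `url` until it returns JSON, or raise after `attempts` tries."""
    last_error = None
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as resp:
                # Any parseable JSON body counts as "reachable" in this sketch.
                return json.loads(resp.read().decode("utf-8"))
        except (urllib.error.URLError, OSError, ValueError) as exc:
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"catalog not reachable at {url}: {last_error}")


if __name__ == "__main__":
    # SEAWEED_HOST is the same placeholder used in the curl example.
    print(wait_for_catalog(catalog_config_url("SEAWEED_HOST"), attempts=3))
```

The function returns the decoded response body, so a caller can log it or simply treat a successful return as the readiness signal.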
96+
97+
## 2. Register the Iceberg catalog in Doris
98+
99+
```sql
100+
CREATE CATALOG seaweedfs PROPERTIES (
101+
"type" = "iceberg",
102+
"iceberg.catalog.type" = "rest",
103+
"uri" = "http://SEAWEED_HOST:8181",
104+
"warehouse" = "s3://iceberg-tables",
105+
"credential" = "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
106+
"s3.endpoint" = "http://SEAWEED_HOST:8333",
107+
"s3.access_key" = "AKIAIOSFODNN7EXAMPLE",
108+
"s3.secret_key" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
109+
"s3.region" = "us-west-2",
110+
"use_path_style" = "true"
111+
);
112+
```
113+
114+
Notes:
115+
116+
* `credential = "<access_key>:<secret_key>"` is forwarded by Doris's Iceberg REST client as OAuth2 client credentials. SeaweedFS validates them against the same IAM config that secures the S3 endpoint.
117+
* The `s3.*` properties are used by Doris's own parquet reader and writer. They point at the same `weed` process, with the same host and the same key pair.
118+
* `use_path_style = "true"` is required because SeaweedFS serves S3 with path-style addressing by default.
119+
* The integration test uses these exact properties; see [`createDorisIcebergCatalog`](https://github.com/seaweedfs/seaweedfs/blob/master/test/s3tables/catalog_doris/doris_catalog_test.go) for the canonical form.
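
Because the access key and secret appear twice in the statement (once in `credential`, once in the `s3.*` properties), generating the SQL from a single pair avoids the two copies drifting apart. The sketch below renders the statement shown above from a host and a key pair. `render_create_catalog` is a hypothetical helper, and it assumes that no value contains a double quote.

```python
def render_create_catalog(name, host, access_key, secret_key,
                          warehouse="s3://iceberg-tables",
                          region="us-west-2",
                          rest_port=8181, s3_port=8333):
    """Render the Doris CREATE CATALOG statement from one key pair."""
    props = {
        "type": "iceberg",
        "iceberg.catalog.type": "rest",
        "uri": f"http://{host}:{rest_port}",
        "warehouse": warehouse,
        # OAuth2 client credentials and S3 keys come from the same pair.
        "credential": f"{access_key}:{secret_key}",
        "s3.endpoint": f"http://{host}:{s3_port}",
        "s3.access_key": access_key,
        "s3.secret_key": secret_key,
        "s3.region": region,
        "use_path_style": "true",
    }
    for key, value in props.items():
        # No escaping is attempted, so reject values that would break quoting.
        if '"' in value:
            raise ValueError(f"property {key} contains a double quote")
    body = ",\n".join(f'    "{k}" = "{v}"' for k, v in props.items())
    return f"CREATE CATALOG {name} PROPERTIES (\n{body}\n);"


if __name__ == "__main__":
    print(render_create_catalog(
        "seaweedfs",
        "SEAWEED_HOST",
        "AKIAIOSFODNN7EXAMPLE",
        "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    ))
```

The rendered text can be pasted into a MySQL-protocol client connected to Doris, and rotating the key only requires re-rendering with the new pair.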
120+
121+
If namespaces or tables created outside Doris (for example with PyIceberg) do not show up in the catalog, refresh the metadata cache:
122+
123+
```sql
124+
REFRESH CATALOG seaweedfs;
125+
```
126+
127+
## 3. Use the catalog
128+
129+
```sql
130+
USE seaweedfs;
131+
132+
CREATE DATABASE IF NOT EXISTS demo;
133+
134+
USE seaweedfs.demo;
135+
136+
CREATE TABLE iceberg_smoke (
137+
id BIGINT,
138+
label STRING
139+
);
140+
141+
INSERT INTO iceberg_smoke VALUES (1, 'one'), (2, 'two'), (3, 'three');
142+
143+
SELECT id, label FROM iceberg_smoke ORDER BY id;
144+
```
145+
146+
Expected output:
147+
148+
```text
149+
+----+-------+
150+
| id | label |
151+
+----+-------+
152+
| 1 | one |
153+
| 2 | two |
154+
| 3 | three |
155+
+----+-------+
156+
```
157+
158+
The SeaweedFS integration test exercises the same catalog and read path, with one difference on the write side: the namespace and table are created through the Iceberg REST catalog and the rows are appended via PyIceberg rather than with a Doris `INSERT`. Doris then serves the reads through the standard S3 plus Iceberg metadata flow.
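
For readers who want to try the PyIceberg write side mentioned above, the following is a rough sketch and not the test's actual code. It assumes `pyiceberg` and `pyarrow` are installed, that SeaweedFS listens on the default ports from section 1, and that the namespace and table do not exist yet. The table name `demo.pyiceberg_smoke` and both function names are hypothetical. Only the commonly documented PyIceberg REST and S3 properties are set here.

```python
def pyiceberg_properties(host, access_key, secret_key,
                         warehouse="s3://iceberg-tables", region="us-west-2"):
    """Map the Doris catalog settings on this page to PyIceberg property names."""
    return {
        "type": "rest",
        "uri": f"http://{host}:8181",
        "warehouse": warehouse,
        "credential": f"{access_key}:{secret_key}",
        "s3.endpoint": f"http://{host}:8333",
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
        "s3.region": region,
    }


def append_smoke_rows(properties, identifier="demo.pyiceberg_smoke"):
    """Create a namespace and table, append three rows, return the row count."""
    # Third-party imports stay local so the helper above works without them.
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("seaweedfs", **properties)
    namespace = identifier.split(".")[0]
    catalog.create_namespace(namespace)  # raises if the namespace already exists
    rows = pa.table({
        "id": pa.array([1, 2, 3], type=pa.int64()),
        "label": pa.array(["one", "two", "three"], type=pa.string()),
    })
    table = catalog.create_table(identifier, schema=rows.schema)
    table.append(rows)
    return table.scan().to_arrow().num_rows


if __name__ == "__main__":
    props = pyiceberg_properties(
        "SEAWEED_HOST",
        "AKIAIOSFODNN7EXAMPLE",
        "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    )
    print(append_smoke_rows(props))
```

After the append, running `REFRESH CATALOG seaweedfs;` in Doris and selecting from the new table should return the same three rows.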
159+
160+
## Production notes
161+
162+
* For a production cluster, replace `weed mini` with `weed master`, `weed volume`, `weed filer`, and `weed s3 -iceberg.port=8181` (or use the SeaweedFS Helm chart). The Doris-side configuration is identical; only the host and ports change.
163+
* The OAuth2 credential is the S3 access key. To rotate Doris's catalog access, rotate the IAM identity that holds it, the same way you rotate any S3 user.
164+
* Iceberg table maintenance (compaction, snapshot expiration, orphan removal, manifest rewriting) is built into SeaweedFS and runs against the same bucket. See the [SeaweedFS Iceberg Catalog wiki](https://github.com/seaweedfs/seaweedfs/wiki/SeaweedFS-Iceberg-Catalog) for details.
165+
166+
## References
167+
168+
* [SeaweedFS](https://github.com/seaweedfs/seaweedfs)
169+
* [Doris Iceberg integration test in SeaweedFS](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris)
170+
* [Doris Iceberg Catalog reference](https://doris.apache.org/docs/lakehouse/catalogs/iceberg-catalog)
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
1+
---
2+
{
3+
"title": "集成 SeaweedFS",
4+
"language": "zh-CN",
5+
"description": "使用 SeaweedFS 同时承载 Iceberg 表的对象存储和 REST Catalog,凭证、部署、运维三位一体。"
6+
}
7+
---
8+
9+
[SeaweedFS](https://seaweedfs.com/) 是一个分布式存储系统,单个 `weed` 进程即可同时提供 S3 兼容的对象存储接口和 Apache Iceberg REST Catalog。Parquet 数据和 Iceberg 元数据由同一个执行文件对外服务,并使用同一对 S3 凭证完成鉴权。
10+
11+
本文介绍将 SeaweedFS 作为 Doris 的 Iceberg Lakehouse 后端的最小配置。完整的端到端路径已经在 SeaweedFS 仓库的 [`TestDorisIcebergCatalog`](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris) 集成测试中验证:测试会启动 SeaweedFS mini 集群,在 Doris 中注册 Iceberg Catalog,通过 PyIceberg 写入数据,再由 `apache/doris:doris-all-in-one-2.1.0` 容器读回。
12+
13+
## 为什么用 SeaweedFS 搭 Iceberg Lakehouse
14+
15+
当下的 Lakehouse 架构通常需要把三层系统拼起来:
16+
17+
* 对象存储(S3 或兼容实现)
18+
* 独立的 Iceberg Catalog(Hive Metastore、Glue、Polaris、Lakekeeper、Nessie 等)
19+
* 查询引擎(Doris、Spark、Trino 等)
20+
21+
SeaweedFS 把前两层合并到了同一个进程里。同一个 `weed` 执行文件既是:
22+
23+
* 存放 parquet 文件的 S3 兼容对象存储,
24+
* 也是存放表元数据的 Iceberg REST Catalog。
25+
26+
也就是说,Doris 只需要对接一个系统,而不是两个。具体好处:
27+
28+
* **更少的组件。** 不再需要 Hive Metastore、Glue,不需要为 Catalog 单独部署 Postgres,也不需要单独维护 STS 角色。
29+
* **更简单的部署。** 一个执行文件、一份 IAM 配置;Doris 的 Iceberg REST 客户端和 S3 读写器共用同一对 S3 凭证。
30+
* **适合本地与私有化场景。** 整个链路不依赖任何云服务,从笔记本、单台 VM 到 Kubernetes 集群,部署方式一致。
31+
* **元数据路径更低延时。** Catalog 状态保存在同一个 SeaweedFS filer 中,与数据为邻;命名空间和表元数据查询不再跨独立服务。
32+
* **磁盘上是标准 S3。** 表以标准 Iceberg 目录结构存放在 S3 桶中,任何 S3 客户端(rclone、`aws s3`、Spark、Trino、Dremio、RisingWave)都可以与 Doris 一同读取或复制。
33+
34+
架构上:
35+
36+
```text
37+
Doris
38+
|
39+
v
40+
Iceberg 表
41+
|
42+
v
43+
SeaweedFS (S3 存储 + REST Catalog)
44+
```
45+
46+
对于小团队和内部数据平台来说,这是一种不依赖独立 Catalog 服务、就能搭起 Lakehouse 的干净方式。
47+
48+
## 1. 启动 SeaweedFS
49+
50+
从 [github.com/seaweedfs/seaweedfs](https://github.com/seaweedfs/seaweedfs) 编译或安装 `weed`。
51+
52+
准备一份 IAM 配置,给一个访问密钥授予 S3 权限。同一个密钥也作为 Iceberg REST 端点的 OAuth2 客户端:
53+
54+
```json
55+
{
56+
"identities": [
57+
{
58+
"name": "doris",
59+
"credentials": [
60+
{
61+
"accessKey": "AKIAIOSFODNN7EXAMPLE",
62+
"secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
63+
}
64+
],
65+
"actions": ["Admin"]
66+
}
67+
]
68+
}
69+
```
70+
71+
启动单进程集群,并在启动时创建用于 Iceberg 的 Table Bucket:
72+
73+
```bash
74+
weed mini \
75+
-ip "$(hostname -I | awk '{print $1}')" \
76+
-dir /var/lib/seaweedfs \
77+
-s3.config /etc/seaweedfs/iam_config.json \
78+
-tableBucket iceberg-tables
79+
```
80+
81+
`weed mini` 会在一个进程内同时启动 master、volume、filer、S3 和 Iceberg REST Catalog。默认端口:
82+
83+
| 组件 | 端口 | 修改参数 |
84+
| ---- | ---- | -------- |
85+
| Master HTTP | 9333 | `-master.port` |
86+
| Filer HTTP | 8888 | `-filer.port` |
87+
| S3 | 8333 | `-s3.port` |
88+
| Iceberg REST | 8181 | `-s3.port.iceberg` |
89+
90+
`-tableBucket iceberg-tables` 会在启动时创建一个 S3 Tables 类型的 Bucket,也就是 Doris 后续写入 Iceberg 表所用的 Bucket。
91+
92+
验证 Catalog 端点可用:
93+
94+
```bash
95+
curl -s http://SEAWEED_HOST:8181/v1/config | jq .
96+
```
97+
98+
## 2. 在 Doris 中注册 Iceberg Catalog
99+
100+
```sql
101+
CREATE CATALOG seaweedfs PROPERTIES (
102+
"type" = "iceberg",
103+
"iceberg.catalog.type" = "rest",
104+
"uri" = "http://SEAWEED_HOST:8181",
105+
"warehouse" = "s3://iceberg-tables",
106+
"credential" = "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
107+
"s3.endpoint" = "http://SEAWEED_HOST:8333",
108+
"s3.access_key" = "AKIAIOSFODNN7EXAMPLE",
109+
"s3.secret_key" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
110+
"s3.region" = "us-west-2",
111+
"use_path_style" = "true"
112+
);
113+
```
114+
115+
说明:
116+
117+
* `credential = "<access_key>:<secret_key>"` 会被 Doris 的 Iceberg REST 客户端作为 OAuth2 client credentials 发起鉴权。SeaweedFS 用同一份 IAM 配置校验。
118+
* `s3.*` 系列属性给 Doris 本地的 parquet 读写器使用,指向同一个 `weed` 进程,主机和密钥都和上面一致。
119+
* `use_path_style = "true"` 是必需的,SeaweedFS 默认采用 path-style 的 S3 协议。
120+
* 集成测试使用的就是上述属性,可参考 [`createDorisIcebergCatalog`](https://github.com/seaweedfs/seaweedfs/blob/master/test/s3tables/catalog_doris/doris_catalog_test.go)。
121+
122+
如果在注册 Catalog 前已经通过其他客户端(例如 PyIceberg)创建了 Namespace 或表,需要刷新元数据缓存:
123+
124+
```sql
125+
REFRESH CATALOG seaweedfs;
126+
```
127+
128+
## 3. 使用 Catalog
129+
130+
```sql
131+
USE seaweedfs;
132+
133+
CREATE DATABASE IF NOT EXISTS demo;
134+
135+
USE seaweedfs.demo;
136+
137+
CREATE TABLE iceberg_smoke (
138+
id BIGINT,
139+
label STRING
140+
);
141+
142+
INSERT INTO iceberg_smoke VALUES (1, 'one'), (2, 'two'), (3, 'three');
143+
144+
SELECT id, label FROM iceberg_smoke ORDER BY id;
145+
```
146+
147+
预期结果:
148+
149+
```text
150+
+----+-------+
151+
| id | label |
152+
+----+-------+
153+
| 1 | one |
154+
| 2 | two |
155+
| 3 | three |
156+
+----+-------+
157+
```
158+
159+
这正是 SeaweedFS 集成测试覆盖的路径:通过 Iceberg REST Catalog 创建 Namespace 和表,由 PyIceberg 追加数据,再由 Doris 通过 S3 加 Iceberg 元数据走标准链路读回。
160+
161+
## 生产部署建议
162+
163+
* 生产环境可以把 `weed mini` 拆成 `weed master`、`weed volume`、`weed filer`,再加 `weed s3 -iceberg.port=8181`,也可以使用 SeaweedFS Helm Chart。Doris 这边的配置完全不用改,只需替换主机和端口。
164+
* OAuth2 credential 就是 S3 访问密钥,需要轮换 Doris 的 Catalog 凭证时,按普通 S3 用户的方式轮换 IAM 身份即可。
165+
* Iceberg 表的运维任务(Compaction、Snapshot Expiration、Orphan Removal、Manifest Rewriting)由 SeaweedFS 内置实现,针对同一个 Bucket 运行,详见 [SeaweedFS Iceberg Catalog Wiki](https://github.com/seaweedfs/seaweedfs/wiki/SeaweedFS-Iceberg-Catalog)。
166+
167+
## 相关链接
168+
169+
* [SeaweedFS](https://github.com/seaweedfs/seaweedfs)
170+
* [SeaweedFS 中的 Doris Iceberg 集成测试](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris)
171+
* [Doris Iceberg Catalog 文档](https://doris.apache.org/zh-CN/docs/lakehouse/catalogs/iceberg-catalog)
