---
{
"title": "Integration with SeaweedFS",
"language": "en"
}
---

[SeaweedFS](https://seaweedfs.com/) is a distributed storage system that exposes both an S3-compatible object API and an Apache Iceberg REST Catalog from the same `weed` process. Parquet data and Iceberg metadata are served by one executable, authenticated by one S3 credential pair.

This page shows the minimal configuration needed to use SeaweedFS as the storage and catalog layer of a Doris Iceberg lakehouse. The same end-to-end path is exercised by the [`TestDorisIcebergCatalog`](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris) integration test in the SeaweedFS repository, which boots a SeaweedFS mini cluster, registers a Doris Iceberg catalog against it, writes rows with PyIceberg, and reads them back with `apache/doris:doris-all-in-one-2.1.0`.

## Why SeaweedFS for an Iceberg lakehouse

A typical lakehouse stack today stitches together three layers:

* Object storage (S3 or compatible)
* A standalone Iceberg catalog (Hive Metastore, Glue, Polaris, Lakekeeper, Nessie, ...)
* A query engine (Doris, Spark, Trino, ...)

SeaweedFS collapses the first two into one process. The same `weed` executable is both:

* the S3-compatible object store that holds the Parquet files, and
* the Iceberg REST Catalog that holds the table metadata.

So Doris talks to one system instead of two. The practical implications:

* **Fewer moving parts.** No Hive Metastore, no Glue, no Postgres backing a separate catalog, no STS role to provision.
* **Simpler deployment.** One executable, one IAM config, one S3 credential pair shared by Doris's Iceberg REST client and its S3 reader.
* **Local or on-prem friendly.** Nothing in the path requires a cloud-native service. The same setup runs on a laptop, a single VM, or a Kubernetes cluster.
* **Lower latency on the metadata path.** Catalog state lives in the same SeaweedFS filer that serves the data, so namespace and table lookups don't cross a separate service boundary.
* **S3-native on disk.** Tables are stored as standard Iceberg directories in S3 buckets. Any S3 client (rclone, `aws s3`, Spark, Trino, Dremio, RisingWave) can read or replicate them alongside Doris (see the sketch after this list).
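
As a sketch of that last point: any S3 SDK pointed at the same endpoint can list a table's files. The snippet below assumes `boto3` is installed and that the `demo.iceberg_smoke` table from step 3 exists; the key prefix is an assumption about how the catalog lays out table locations, so adjust it to the location the table metadata reports.

```python
import boto3
from botocore.config import Config

# Same endpoint and credentials used by the Doris catalog below.
s3 = boto3.client(
    "s3",
    endpoint_url="http://SEAWEED_HOST:8333",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    region_name="us-west-2",
    config=Config(s3={"addressing_style": "path"}),  # SeaweedFS serves path-style S3
)

# Hypothetical prefix: Iceberg tables typically live under <namespace>/<table>/,
# with data/ (Parquet) and metadata/ (JSON + Avro) below it.
resp = s3.list_objects_v2(Bucket="iceberg-tables", Prefix="demo/iceberg_smoke/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```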

Architecturally:

```text
Doris
  |
  v
Iceberg tables
  |
  v
SeaweedFS (S3 storage + REST catalog)
```

For smaller teams or internal platforms, this is a clean way to build a lakehouse without depending on a separate metastore service.

## 1. Start SeaweedFS

Build or install `weed` from [github.com/seaweedfs/seaweedfs](https://github.com/seaweedfs/seaweedfs).

Create an IAM config that grants an access key full S3 access. The same key is also used as the OAuth2 client for the Iceberg REST endpoint:

```json
{
  "identities": [
    {
      "name": "doris",
      "credentials": [
        {
          "accessKey": "AKIAIOSFODNN7EXAMPLE",
          "secretKey": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
        }
      ],
      "actions": ["Admin"]
    }
  ]
}
```

Start a single-process cluster with the Iceberg REST endpoint and a pre-created table bucket:

```bash
weed mini \
  -ip $(hostname -I | awk '{print $1}') \
  -dir /var/lib/seaweedfs \
  -s3.config /etc/seaweedfs/iam_config.json \
  -tableBucket iceberg-tables
```

`weed mini` runs master, volume, filer, S3, and the Iceberg REST catalog in one process. Default ports:

| Component | Port | Override flag |
| --------- | ---- | ------------- |
| Master HTTP | 9333 | `-master.port` |
| Filer HTTP | 8888 | `-filer.port` |
| S3 | 8333 | `-s3.port` |
| Iceberg REST | 8181 | `-s3.port.iceberg` |

`-tableBucket iceberg-tables` creates the S3 Tables bucket on startup, which is the Iceberg-aware bucket type Doris will write into.

To verify the catalog is reachable:

```bash
curl -s http://SEAWEED_HOST:8181/v1/config | jq .
```
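
A healthy endpoint returns a JSON object with `defaults` and `overrides` keys, matching the Iceberg REST `ConfigResponse` schema.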

## 2. Register the Iceberg catalog in Doris

```sql
CREATE CATALOG seaweedfs PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "uri" = "http://SEAWEED_HOST:8181",
    "warehouse" = "s3://iceberg-tables",
    "credential" = "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "s3.endpoint" = "http://SEAWEED_HOST:8333",
    "s3.access_key" = "AKIAIOSFODNN7EXAMPLE",
    "s3.secret_key" = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "s3.region" = "us-west-2",
    "use_path_style" = "true"
);
```

Notes:

* `credential = "<access_key>:<secret_key>"` is forwarded by Doris's Iceberg REST client as OAuth2 client credentials (see the token-request sketch after these notes). SeaweedFS validates them against the same IAM config that secures the S3 endpoint.
* The `s3.*` properties are used by Doris's own Parquet reader and writer. They point at the same `weed` process — same host, same key pair.
* `use_path_style = "true"` is required because SeaweedFS serves S3 in path-style by default.
* The integration test uses these exact properties; see [`createDorisIcebergCatalog`](https://github.com/seaweedfs/seaweedfs/blob/master/test/s3tables/catalog_doris/doris_catalog_test.go) for the canonical form.
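
To debug that OAuth2 handshake independently of Doris, you can request a token directly. This sketch assumes the standard Iceberg REST token endpoint (`/v1/oauth2/tokens`) is exposed by SeaweedFS at this path, which is the spec default rather than something confirmed here, and that the `requests` package is installed:

```python
import requests

# Exchange the S3 key pair for a catalog token via the Iceberg REST
# client-credentials flow. The endpoint path is the spec default; whether
# SeaweedFS serves it exactly here is an assumption worth verifying.
resp = requests.post(
    "http://SEAWEED_HOST:8181/v1/oauth2/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": "AKIAIOSFODNN7EXAMPLE",
        "client_secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
)
resp.raise_for_status()
print(resp.json())  # expect an "access_token" field on success
```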

If you create namespaces or tables outside Doris (for example with PyIceberg) after the catalog is registered, Doris may keep serving stale results from its metadata cache. Refresh it:

```sql
REFRESH CATALOG seaweedfs;
```
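
For reference, a minimal PyIceberg session against the same endpoint might look like the sketch below. It assumes `pyiceberg` and `pyarrow` are installed; the table name `demo.pyiceberg_rows` is hypothetical, and the property keys follow PyIceberg's documented REST and S3 configuration:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Same endpoint, credential, and S3 settings as the Doris catalog above.
catalog = load_catalog(
    "seaweedfs",
    **{
        "type": "rest",
        "uri": "http://SEAWEED_HOST:8181",
        "warehouse": "s3://iceberg-tables",
        "credential": "AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "s3.endpoint": "http://SEAWEED_HOST:8333",
        "s3.access-key-id": "AKIAIOSFODNN7EXAMPLE",
        "s3.secret-access-key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "s3.region": "us-west-2",
    },
)

# Create a namespace and a hypothetical table, then append three rows.
catalog.create_namespace_if_not_exists("demo")
rows = pa.table({"id": [1, 2, 3], "label": ["one", "two", "three"]})
table = catalog.create_table_if_not_exists("demo.pyiceberg_rows", schema=rows.schema)
table.append(rows)
```

Run the `REFRESH CATALOG` statement above after a session like this so Doris picks up the new table.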

## 3. Use the catalog

```sql
USE seaweedfs;

CREATE DATABASE IF NOT EXISTS demo;

USE seaweedfs.demo;

CREATE TABLE iceberg_smoke (
    id BIGINT,
    label STRING
);

INSERT INTO iceberg_smoke VALUES (1, 'one'), (2, 'two'), (3, 'three');

SELECT id, label FROM iceberg_smoke ORDER BY id;
```

Expected output:

```text
+----+-------+
| id | label |
+----+-------+
|  1 | one   |
|  2 | two   |
|  3 | three |
+----+-------+
```

This is the same path the SeaweedFS integration test exercises: namespace and table created through the Iceberg REST catalog, rows appended via PyIceberg, and reads served by Doris through the standard Iceberg metadata and S3 read path.

## Production notes

* For a production cluster, replace `weed mini` with `weed master`, `weed volume`, `weed filer`, and `weed s3 -iceberg.port=8181` (or use the SeaweedFS Helm chart). The Doris-side configuration is identical — only the host and ports change.
* The OAuth2 credential is the S3 access key. To rotate Doris's catalog access, rotate the IAM identity that holds it, the same way you rotate any S3 user.
* Iceberg table maintenance (compaction, snapshot expiration, orphan removal, manifest rewriting) is built into SeaweedFS and runs against the same bucket. See the [SeaweedFS Iceberg Catalog wiki](https://github.com/seaweedfs/seaweedfs/wiki/SeaweedFS-Iceberg-Catalog) for details.

## References

* [SeaweedFS](https://github.com/seaweedfs/seaweedfs)
* [Doris Iceberg integration test in SeaweedFS](https://github.com/seaweedfs/seaweedfs/tree/master/test/s3tables/catalog_doris)
* [Doris Iceberg Catalog reference](https://doris.apache.org/docs/lakehouse/catalogs/iceberg-catalog)