Commit c8e618b

feat: improve semantic type docs
1 parent 3e981dc commit c8e618b

1 file changed

Lines changed: 134 additions & 22 deletions

docs/SEMANTIC_TYPES.md

@@ -1,37 +1,137 @@
1-
# Semantic Type Annotations for SQL Schema Generation
1+
# Protobuf Annotations Used by from-proto
22

33
## Overview
44

5-
The semantic type annotation system allows you to specify high-level semantic meanings for protobuf fields that get automatically mapped to optimal SQL types for each database dialect. This enables support for specialized types like RisingWave's `rw_int256` while maintaining compatibility across PostgreSQL, RisingWave, and ClickHouse.
5+
The from-proto path derives your SQL schema directly from protobuf descriptors. It recognizes a set of annotations that control table/column naming, relationships, constraints, dialect-specific table options, and semantic typing of fields.
66

7-
## Quick Start
7+
When the package includes `sf/substreams/sink/sql/schema/v1/schema.proto`, from-proto honors these annotations (`useProtoOption` is enabled). Without it, from-proto falls back to best‑effort inference: each message becomes a table named after the message, simple fields become columns, and no explicit constraints are created unless the dialect adds them for system integrity.
8+
9+
This document explains all supported annotations, how each dialect uses them, and why they matter.
10+
11+
---
12+
13+
## Message Options: table
14+
15+
Import the annotations definition:
816

9-
1. Import the schema annotations in your protobuf:
1017
```protobuf
1118
import "sf/substreams/sink/sql/schema/v1/schema.proto";
1219
```
1320

14-
2. Add semantic type annotations to your fields:
21+
Annotate any message that should materialize as a table:
22+
1523
```protobuf
16-
message EthereumTransaction {
17-
option (sf.substreams.sink.sql.schema.v1.table) = { name: "eth_transactions" };
18-
19-
string hash = 1 [(sf.substreams.sink.sql.schema.v1.field) = {
24+
message Orders {
25+
option (sf.substreams.sink.sql.schema.v1.table) = {
26+
name: "orders"
27+
child_of: "accounts on id" // optional parent relation
28+
clickhouse_table_options: { // optional, ClickHouse only
29+
order_by_fields: [{ name: "order_id" }]
30+
partition_fields: [{ name: "_block_timestamp_", function: toYYYYMM }]
31+
replacing_fields: [{ name: "order_id" }]
32+
index_fields: [{ name: "idx_product", field_name: "product", type: set, granularity: 4 }]
33+
}
34+
};
35+
...
36+
}
37+
```
38+
39+
Fields:
40+
- name: Required. The SQL table name.
41+
- child_of: Optional. Defines a parent/child relation: `"<parent_table> on <parent_pk_field>"`.
42+
- Postgres: Adds a NOT NULL parent reference column to the child table, plus a FK to the parent’s PK. Separately, every table gets a FK to `_blocks_` on `_block_number_` (ON DELETE CASCADE).
43+
- RisingWave: Adds the parent reference column (no FK constraints; autocommit system).
44+
- ClickHouse: Adds the parent reference column (no FK constraints; column used for modeling joins).
45+
- clickhouse_table_options: Optional, ClickHouse‑only. See “ClickHouse Table Options” below.
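
The `"<parent_table> on <parent_pk_field>"` shape used by `child_of` (and by `foreign_key` on fields) can be illustrated with a small parser. This is a minimal Python sketch for illustration only: `Relation` and `parse_relation` are hypothetical names, not part of from-proto, and the real parser may accept or reject different inputs.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    table: str  # referenced (parent) table
    field: str  # referenced column in that table


def parse_relation(spec: str) -> Relation:
    """Split a "<table> on <field>" string into its two parts (hypothetical helper)."""
    parts = spec.split()
    # Expect exactly three whitespace-separated tokens with "on" in the middle.
    if len(parts) != 3 or parts[1].lower() != "on":
        raise ValueError(f"expected '<table> on <field>', got {spec!r}")
    return Relation(table=parts[0], field=parts[2])


relation = parse_relation("accounts on id")
print(relation.table, relation.field)
```

Rejecting anything that does not match the three-token shape keeps malformed annotations from silently producing a wrong reference column.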
46+
47+
Defaults when no table option is present:
48+
- With proto options (schema.proto present): messages without `(table)` are ignored (no table).
49+
- Without proto options: every message becomes a table named after the message.
50+
51+
System columns added to every table:
52+
- `_block_number_` (all dialects) — tracks the originating block.
53+
- `_block_timestamp_` (all dialects).
54+
- `_version_`, `_deleted_` (ClickHouse only) — used by ReplacingMergeTree and retraction modeling.
55+
56+
Primary keys when not specified explicitly:
57+
- Postgres: A PK is created only if one is specified via a column annotation; otherwise the table has no PK (the FK to `_blocks_` is still added).
58+
- RisingWave: If no explicit PK is set, a composite PK is created: `(_block_number_, <parent keys...>)` to preserve uniqueness in streaming mode.
59+
- ClickHouse: PRIMARY KEY/ORDER BY derived from ClickHouse options or defaults (see below).
60+
61+
---
62+
63+
## Field Options: column
64+
65+
Annotate fields that should map to specific columns/constraints:
66+
67+
```protobuf
68+
message Orders {
69+
option (sf.substreams.sink.sql.schema.v1.table) = { name: "orders" };
70+
71+
string order_id = 1 [(sf.substreams.sink.sql.schema.v1.field) = {
72+
name: "order_id",
2073
primary_key: true,
21-
semantic_type: "hash" // Optimized hash storage
74+
unique: true, // adds uniqueness constraint (dialect-specific)
75+
semantic_type: "hash", // see Semantic Types below
76+
format_hint: "hex" // optional value format hint
2277
}];
23-
24-
string value = 2 [(sf.substreams.sink.sql.schema.v1.field) = {
25-
semantic_type: "uint256", // Uses RisingWave's rw_int256
26-
format_hint: "decimal"
27-
}];
28-
29-
string from_address = 3 [(sf.substreams.sink.sql.schema.v1.field) = {
30-
semantic_type: "address" // Blockchain address format
78+
79+
string account_id = 2 [(sf.substreams.sink.sql.schema.v1.field) = {
80+
foreign_key: "accounts on id" // FK to accounts(id); enforced only by Postgres
3181
}];
3282
}
3383
```
3484

85+
Fields:
86+
- name: Optional. Overrides the SQL column name (default is the proto field name).
87+
- primary_key: Optional. Marks this column as the table’s primary key.
88+
- Postgres: Adds a PK constraint.
89+
- RisingWave: Declares an inline PK.
90+
- ClickHouse: Used for defaults in ORDER BY/PRIMARY KEY where applicable.
91+
- unique: Optional. Enforces uniqueness.
92+
- Postgres: Adds a unique constraint.
93+
- RisingWave: Emits `UNIQUE` in column definition.
94+
- ClickHouse: No native unique constraint — ignored (consider indexes).
95+
- foreign_key: Optional. Declares a FK to another table: `"<table> on <field>"`.
96+
- Postgres: Adds a FK constraint to the referenced table/field.
97+
- RisingWave/ClickHouse: Presence is validated for existence, but no constraint is created.
98+
- semantic_type, format_hint: Optional. See “Semantic Types & Format Hints”. Affects column type selection in each dialect. Value conversion helpers exist but are not applied automatically by from‑proto inserts (see “Runtime Conversion” below).
99+
100+
---
101+
102+
## ClickHouse Table Options
103+
104+
ClickHouse engines need explicit ORDER/PARTITION configuration for good performance and correctness. The `clickhouse_table_options` block lets you control this per table.
105+
106+
Fields (repeated lists):
107+
- order_by_fields: Required for ClickHouse. Defines the ORDER BY tuple. Each item supports:
108+
- name: Column to order by (e.g., `_block_number_`, your PK, other fields).
109+
- descending: Optional.
110+
- function: Optional function wrapper (e.g., `toYYYYMM` for dates).
111+
- partition_fields: Optional additional partition keys. If none are provided, the dialect adds a default month partition on `_block_timestamp_`.
112+
- replacing_fields: Optional extra fields in `ReplacingMergeTree(_version_, <replacing_fields...>)` for conflict resolution.
113+
- index_fields: Optional skip indexes to accelerate predicates:
114+
- name: Index name.
115+
- field_name: Column to index.
116+
- type: One of `minmax`, `set`, `ngrambf_v1`, `tokenbf_v1`, `bloom_filter`.
117+
- granularity: Index granularity.
118+
- function: Optional function wrapper.
119+
120+
Defaults when options are omitted:
121+
- Engine: `ReplacingMergeTree(_version_)`.
122+
- PARTITION BY: `toYYYYMM(_block_timestamp_)`.
123+
- ORDER BY: if not provided, dialect defaults to PK or `_block_number_`.
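
To make the fallback rules above concrete, here is a minimal Python sketch that resolves the three clauses from optional inputs. `clickhouse_clauses` and its parameters are hypothetical, and the exact DDL text emitted by the ClickHouse dialect may differ; only the default values come from the list above.

```python
from typing import Optional, Sequence


def clickhouse_clauses(
    order_by: Sequence[str] = (),
    partition_by: Sequence[str] = (),
    replacing_fields: Sequence[str] = (),
    primary_key: Optional[str] = None,
) -> str:
    """Render ENGINE / PARTITION BY / ORDER BY using the documented defaults."""
    # _version_ always comes first; extra replacing fields are appended.
    engine_args = ", ".join(["_version_", *replacing_fields])
    # Default partition: month of the block timestamp.
    partition = ", ".join(partition_by) or "toYYYYMM(_block_timestamp_)"
    # Default order: explicit fields, then the PK, then _block_number_.
    order = ", ".join(order_by) or primary_key or "_block_number_"
    return (
        f"ENGINE = ReplacingMergeTree({engine_args})\n"
        f"PARTITION BY ({partition})\n"
        f"ORDER BY ({order})"
    )


print(clickhouse_clauses(primary_key="order_id"))
```

Calling it with no arguments yields the all-defaults case; passing `order_by=["order_id"]` overrides only the ORDER BY clause.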
124+
125+
---
126+
127+
## Semantic Types & Format Hints
128+
129+
Semantic types give the dialect a clue to select the best storage type for a field (e.g., 256‑bit integers, addresses, hashes). Format hints help interpret the incoming literal representation when conversion is needed (hex vs decimal, etc.).
130+
131+
Supported semantic types and their column type mappings:
132+
133+
Note: If a semantic type is not supported by a dialect, the dialect falls back to its default mapping for the underlying protobuf type.
134+
35135
## Supported Semantic Types
36136

37137
### Blockchain/Crypto Types
@@ -63,9 +163,9 @@ message EthereumTransaction {
63163
| `unix_timestamp_ms` | Unix timestamp (milliseconds) | `TIMESTAMP WITH TIME ZONE` | `TIMESTAMP WITH TIME ZONE` | `DateTime64(3)` |
64164
| `block_timestamp` | Blockchain timestamp | `TIMESTAMP WITH TIME ZONE` | `TIMESTAMP WITH TIME ZONE` | `DateTime` |
65165

66-
## Format Hints
166+
### Format Hints
67167

68-
Format hints provide additional guidance for value conversion:
168+
Format hints provide additional guidance for value conversion (when conversions are used):
69169

70170
| Format Hint | Description | Usage |
71171
|-------------|-------------|-------|
@@ -74,7 +174,19 @@ Format hints provide additional guidance for value conversion:
74174
| `base64` | Base64 format | For binary data encoded as base64 |
75175
| `string` | String format | Default string handling |
76176

77-
## Complete Example
177+
---
178+
179+
## Runtime Conversion (advanced)
180+
181+
The codebase contains per‑dialect helpers to convert annotated values at insert time (e.g., converting a `uint256` hex string to a decimal literal for PostgreSQL, or casting to `rw_int256` in RisingWave). Today, from‑proto uses prepared statements and passes values exactly as they appear in your message; it does not apply semantic conversions automatically. The annotations primarily affect column type selection.
182+
183+
Practical guidance:
184+
- Emit values in the “natural” format for your chosen dialect when possible (e.g., strings for `rw_int256` or `NUMERIC`).
185+
- If you require strict conversions, adapt your Substreams output to provide appropriately typed/encoded values. The conversion helpers in `db_proto/sql/*/types.go` show how to transform values if you build a custom inserter.
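
As an illustration of the kind of normalization a custom inserter (or your Substreams output stage) might perform, the following Python sketch turns a `uint256` value given as hex or decimal text into a decimal string. It is an assumption-laden example, not the logic in `db_proto/sql/*/types.go`: `uint256_to_decimal` is a hypothetical name and only the `hex` and `decimal` format hints are handled.

```python
UINT256_MAX = 2**256 - 1


def uint256_to_decimal(value: str, format_hint: str = "decimal") -> str:
    """Normalize a uint256 given as hex or decimal text to a decimal string."""
    text = value.strip()
    if format_hint == "hex":
        number = int(text, 16)  # int() accepts an optional "0x" prefix for base 16
    elif format_hint == "decimal":
        number = int(text, 10)
    else:
        raise ValueError(f"unsupported format hint: {format_hint!r}")
    # Reject negatives and anything that does not fit in 256 bits.
    if not 0 <= number <= UINT256_MAX:
        raise ValueError(f"value out of uint256 range: {value!r}")
    return str(number)


print(uint256_to_decimal("0xff", "hex"))
```

Returning a string rather than an integer keeps the value safe to pass to drivers that would otherwise truncate it to 64 bits.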
186+
187+
---
188+
189+
## End‑to‑End Example (with annotations)
78190

79191
```protobuf
80192
syntax = "proto3";
@@ -391,4 +503,4 @@ substreams-sink-sql from-proto "clickhouse://..." manifest.yaml
391503
- Proper type selection improves index performance
392504
- Reduced type conversion overhead in queries
393505

394-
This semantic type system provides a powerful way to leverage database-specific features like RisingWave's `rw_int256` while maintaining broad compatibility across different SQL databases.
506+
This semantic type system provides a powerful way to leverage database-specific features like RisingWave's `rw_int256` while maintaining broad compatibility across different SQL databases.
