diff --git a/docs/admin-manual/cluster-management/tso.md b/docs/admin-manual/cluster-management/tso.md new file mode 100644 index 0000000000000..2859b85872d7e --- /dev/null +++ b/docs/admin-manual/cluster-management/tso.md @@ -0,0 +1,159 @@ +--- +{ + "title": "Timestamp Oracle (TSO)", + "language": "en", + "description": "Timestamp Oracle (TSO) provides globally monotonic timestamps for Doris." +} +--- + +## Overview + +Timestamp Oracle (TSO) is a service running on the **Master FE** that generates **globally monotonic** 64-bit timestamps. Doris uses TSO as a unified version reference in distributed scenarios, avoiding the correctness risks caused by physical clock skew across nodes. + +Typical use cases include: + +- A unified “transaction version” across multiple tables and nodes. +- Incremental processing / version-based reads using a single global ordering. +- Better observability: a timestamp is easier to interpret than an internal version counter. + +## Timestamp Format + +TSO is a 64-bit integer: + +- High bits: **physical time (milliseconds)** since Unix epoch +- Low bits: **logical counter** for issuing multiple unique timestamps within the same millisecond + +The core guarantee of TSO is **monotonicity**, not being an exact wall clock. + +## Architecture and Lifecycle + +- **Master FE** hosts the `TSOService` daemon. +- FE components (for example, transaction publish and metadata repair flows) obtain timestamps from `Env.getCurrentEnv().getTSOService().getTSO()`. +- The service uses a **time window lease** (window end physical time) to reduce persistence overhead while ensuring monotonicity across master failover. + +### Monotonicity Guarantee + +TSO monotonicity is guaranteed by combining three layers: + +- **Within the same millisecond**: Doris keeps the physical time unchanged and increases the logical counter, so a later TSO in the same millisecond is always larger. 
+- **Across milliseconds**: once physical time moves forward, the logical counter is reset, so the next TSO still remains greater than previous ones. +- **Across restart or master switch**: Doris replays the persisted TSO window end and calibrates the new starting physical time to be greater than the previously persisted upper bound. + +This is why Doris treats TSO as a **monotonic version generator**, not as a direct wall-clock mirror. + +### Monotonicity Across Master Failover + +On master switch, the new Master FE replays the persisted window end and calibrates the initial physical time to ensure the first TSO it issues is strictly greater than any TSO issued by the previous master. + +### Why Only Master FE Issues TSO + +Only the Master FE is allowed to issue TSO values and expose `/api/tso`. + +- This avoids multiple FE nodes issuing timestamps independently. +- The active master owns both timestamp generation and persistence of the leased window end. +- After role change, the old master is not supposed to continue serving as a TSO allocator. + +Without this master-only rule, Doris could not safely guarantee a single global TSO order. + +### Persistence and Recovery + +The key persisted state is the **window end physical time** (`windowEndTSO`), not every individual issued TSO. + +- Doris leases a future time window and persists the **right boundary** of that window to EditLog. +- Persisting the window boundary is much cheaper than writing every issued timestamp while still providing a safe upper bound for recovery. +- If enabled, the checkpoint image can also store the TSO module so that recovery can restore the same boundary faster. +- During recovery, the new master replays the persisted boundary and chooses a new physical time that is greater than the historical upper bound before issuing new TSO values. 
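The lease-and-recover steps above can be sketched roughly as follows. This is a minimal Python model, not Doris's `TSOService`: the class name `WindowLeaseTso`, the list standing in for EditLog, the 18-bit counter width, and the choice to renew the window inside `get_tso` (the text describes a periodic daemon doing the renewal) are all assumptions made for illustration.

```python
LOGICAL_BITS = 18  # assumption for illustration; the source does not state the bit split


class WindowLeaseTso:
    """Toy window-leased allocator; a stand-in, not Doris's TSOService."""

    def __init__(self, journal, now_ms, window_ms=5000):
        self.journal = journal      # plain list standing in for EditLog
        self.window_ms = window_ms
        self.logical = 0
        # Recovery: start strictly above the last persisted window end, if any.
        last_end = journal[-1] if journal else -1
        self.physical = max(now_ms, last_end + 1)
        self._lease()

    def _lease(self):
        # Persist only the right boundary of the leased window.
        self.window_end = self.physical + self.window_ms
        self.journal.append(self.window_end)

    def get_tso(self, now_ms):
        if now_ms > self.physical:
            # Clock moved forward: adopt the new millisecond and reset the counter.
            self.physical, self.logical = now_ms, 0
        else:
            # Same millisecond (or a clock that went backward): bump the counter.
            self.logical += 1
            if self.logical >= (1 << LOGICAL_BITS):
                raise RuntimeError("logical counter exhausted")
        if self.physical >= self.window_end:
            self._lease()           # never issue at or past the persisted boundary
        return (self.physical << LOGICAL_BITS) | self.logical


journal = []
old = WindowLeaseTso(journal, now_ms=1_000)
a = old.get_tso(1_000)
b = old.get_tso(1_000)              # same millisecond
c = old.get_tso(1_001)              # next millisecond
# Failover: the new master's clock is behind, but the journal supplies a safe floor.
new = WindowLeaseTso(journal, now_ms=900)
d = new.get_tso(900)
```

In this model four timestamps are issued while only two window boundaries are written, and the new allocator starts above every value the old one could have handed out, because each issued physical time stays below a persisted window end.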
+ +This design is what lets Doris preserve monotonicity across restarts and master switches without turning every TSO allocation into a persistence operation. + +### End-to-End Flow + +- Master FE runs `TSOService` and allocates TSO values. +- The daemon periodically renews the time window and writes the new window end to EditLog. +- The checkpoint image can optionally persist the TSO module for faster recovery. +- After a restart or master switch, Doris replays the window end and calibrates a safe new starting point. +- Transactions on tables with `enable_tso = true` record the commit TSO into rowset metadata. +- `/api/tso` shows the current service state, while `information_schema.rowsets.COMMIT_TSO` shows committed results written into rowsets. + +## Configuration + +TSO is controlled by FE configuration items (see [FE Configuration](../config/fe-config.md) for how to set and persist configs): + +- `enable_tso_feature` +- `tso_service_update_interval_ms` +- `tso_max_update_retry_count` +- `tso_max_get_retry_count` +- `tso_service_window_duration_ms` +- `tso_clock_backward_startup_threshold_ms` +- `tso_time_offset_debug_mode` (test only) +- `enable_tso_persist_journal` (may affect rollback compatibility) +- `enable_tso_checkpoint_module` (may affect older versions reading newer images) +- `enable_tso_forward_when_counter_full` + +## Clock Backward Behavior + +TSO handles clock backward differently during startup calibration and normal runtime: + +- During startup calibration, the new Master FE compares the persisted TSO window end with the current system time. +- If the backward gap exceeds `tso_clock_backward_startup_threshold_ms`, TSO initialization fails fast and the Master FE cannot safely issue new TSOs. +- During normal runtime, detecting clock backward only triggers warning logs and metrics. The service does not immediately stop. + +This means a clock rollback does not always fail transactions immediately. 
The actual risk depends on whether physical time can move forward again before the logical counter is exhausted. + +Runtime rollback detection is intentionally softer than startup calibration. During runtime, Doris prefers to keep the master available and relies on the existing monotonicity guards, logical counter, and persisted window boundary. The hard failure happens at startup calibration because that is the point where Doris must prove the next TSO can still be greater than historical values. + +## Logical Counter Exhaustion + +TSO uses a logical counter to generate multiple timestamps within the same millisecond. If physical time cannot advance for a while, the service keeps consuming the logical counter under the same physical millisecond. + +- When the logical counter reaches its limit, `getTSO()` retries according to `tso_max_get_retry_count`. +- If retries are exhausted before a new physical millisecond becomes available, TSO allocation fails. +- Transactions that need a commit TSO may then fail because FE cannot obtain a valid TSO. + +This is the main reason clock rollback can eventually surface as transaction errors even though runtime rollback detection itself is not a hard-stop mechanism. + +## Configuration Impact + +- `tso_clock_backward_startup_threshold_ms`: only affects startup calibration. It defines how much backward clock drift is tolerated before TSO initialization fails. +- `enable_tso_forward_when_counter_full`: when enabled, the TSO service proactively advances physical time by 1ms once the logical counter becomes high, which helps reduce the chance of hitting the logical counter limit. +- `enable_tso_forward_when_counter_full = false`: the service depends more strictly on real wall-clock progress and does not proactively advance physical time. Under clock stall or rollback, logical-counter exhaustion is more likely. +- `tso_max_get_retry_count`: controls how many retries FE performs before returning a TSO allocation failure. 
+- `tso_service_update_interval_ms`: affects how often the daemon checks clock conditions and refreshes the TSO window. +- `enable_tso_persist_journal`: provides the persistence foundation that lets a restart or master switch resume from a safe upper bound instead of risking rollback. +- `enable_tso_checkpoint_module`: affects whether the checkpoint image also carries the TSO boundary for faster recovery; it does not change the runtime allocation algorithm. + +## Observability and Debugging + +### FE HTTP API + +You can fetch the current TSO without consuming the logical counter via the FE HTTP API: + +- `GET /api/tso` + +The response is a read-only snapshot of the current TSO state, including the current logical counter and the current window end. It is useful for observation, but it does not guarantee that future transactions will always be able to obtain a new TSO. + +`window_end_physical_time` is the leased upper bound of the current TSO window, while `current_tso` represents the current allocation cursor. It is normal for the window end to be ahead of the current TSO physical time. + +See [TSO Action](../open-api/fe-http/tso-action.md) for authentication, response fields, examples, and caveats. + +### System Table: `information_schema.rowsets` + +When enabled, Doris records the commit TSO into rowset metadata and exposes it via: + +- `information_schema.rowsets.COMMIT_TSO` + +This requires both FE-level `enable_tso_feature = true` and table-level `enable_tso = true`. + +Table-level `enable_tso` only controls whether the commit TSO is recorded for that table. It does not change how `TSOService` allocates timestamps or how monotonicity is protected. + +See [rowsets](../system-tables/information_schema/rowsets.md). + +## FAQ + +### Can I treat TSO as a wall clock? + +No. 
Although the physical part is in milliseconds, the physical time may be advanced proactively (for example, to handle high logical counter usage), so TSO should be used as a **monotonic version** rather than a precise wall clock. + +### Why can transactions fail during clock rollback? + +Clock rollback during runtime only raises warnings and metrics, but it can keep TSO in the same physical millisecond for longer than expected. If the logical counter is consumed faster than physical time recovers, FE may fail to obtain a new TSO after `tso_max_get_retry_count` retries, and transactions that require commit TSO may fail. diff --git a/docs/admin-manual/config/fe-config.md b/docs/admin-manual/config/fe-config.md index b9bd59857c349..92cbc4c5d9cd0 100644 --- a/docs/admin-manual/config/fe-config.md +++ b/docs/admin-manual/config/fe-config.md @@ -360,6 +360,108 @@ Is it possible to dynamically configure: true Is it a configuration item unique to the Master FE node: false +### TSO (Timestamp Oracle) + +#### `enable_tso_feature` + +Default:false + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Whether to enable TSO (Timestamp Oracle) related features on FE. This is the global switch for TSO service availability and for table-level `enable_tso` usage, including recording rowset commit TSO and exposing it via system tables. + +#### `tso_service_update_interval_ms` + +Default:50(ms) + +IsMutable:false + +Is it a configuration item unique to the Master FE node: true + +The update interval of the TSO service in milliseconds. The daemon periodically checks clock drift/backward and renews the time window. + +#### `tso_max_update_retry_count` + +Default:3 + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Maximum retry count when the TSO service updates the global timestamp (for example, when persisting a new window end). 
+ +#### `tso_max_get_retry_count` + +Default:10 + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Maximum retry count when generating a new TSO. If FE still cannot obtain a valid TSO after these retries, requests such as transaction commit may fail. + +#### `tso_service_window_duration_ms` + +Default:5000(ms) + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +The duration of a leased TSO time window in milliseconds. The Master FE persists the leased future window end, not every issued TSO, so a larger window reduces persistence frequency while still preserving a safe upper bound for restart or master failover recovery. + +#### `tso_clock_backward_startup_threshold_ms` + +Default:1800000(ms) + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Maximum tolerated clock-backward threshold during TSO startup calibration. If the persisted TSO window end is ahead of current system time by more than this threshold, TSO initialization fails. This threshold only affects startup calibration and is not a runtime circuit breaker. + +#### `tso_time_offset_debug_mode` + +Default:0(ms) + +IsMutable:true + +Is it a configuration item unique to the Master FE node: false + +Time offset for the TSO service in milliseconds. For test/debug only. + +#### `enable_tso_persist_journal` + +Default:false + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Whether to persist the TSO window end into edit log. This is the persistence foundation for restarting or switching master without TSO rollback, because startup calibration must recover a historical upper bound before issuing new timestamps. Enabling this may emit new operation codes and may break rollback compatibility with older versions. 
+ +#### `enable_tso_checkpoint_module` + +Default:false + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Whether to include TSO information as a checkpoint image module for faster recovery. This mainly affects checkpoint/image recovery speed and completeness; it does not change the runtime TSO allocation algorithm itself. Older versions may need to ignore unknown modules when reading newer images. + +#### `enable_tso_forward_when_counter_full` + +Default:true + +IsMutable:true + +Is it a configuration item unique to the Master FE node: true + +Whether to proactively advance TSO physical time by 1ms when the logical counter becomes high. Enabling this reduces the chance of logical-counter exhaustion when the wall clock does not move forward fast enough. This forward step is part of monotonicity protection and does not mean TSO is intended to be an exact wall clock. If disabled, TSO depends more strictly on actual clock progress, so clock stall or rollback is more likely to surface as TSO allocation failure and transaction errors. + ### Service #### `query_port` diff --git a/docs/admin-manual/open-api/fe-http/tso-action.md b/docs/admin-manual/open-api/fe-http/tso-action.md new file mode 100644 index 0000000000000..ad46d3830b61f --- /dev/null +++ b/docs/admin-manual/open-api/fe-http/tso-action.md @@ -0,0 +1,79 @@ +--- +{ + "title": "TSO Action", + "language": "en", + "description": "Get current TSO (Timestamp Oracle) information from the Master FE." +} +--- + +## Request + +`GET /api/tso` + +## Description + +Returns the current TSO (Timestamp Oracle) information from the **Master FE**. + +- This endpoint is **read-only**: it returns the current TSO value **without increasing** it. +- Authentication is required. Use an account with **administrator privileges**. +- This endpoint is useful for observing the current TSO window end, physical time part, and logical counter part. +- This endpoint is only a snapshot of current state. 
It does not guarantee that a later transaction can always obtain a new TSO. + +## Path parameters + +None. + +## Query parameters + +None. + +## Request body + +None. + +## Response + +On success, the response body has `code = 0` and the `data` field contains: + +| Field | Type | Description | +| --- | --- | --- | +| window_end_physical_time | long | The end physical time (ms) of the current TSO window on the Master FE. | +| current_tso | long | The current composed 64-bit TSO value. | +| current_tso_physical_time | long | The extracted physical time part (ms) from `current_tso`. | +| current_tso_logical_counter | long | The extracted logical counter part from `current_tso`. | + +Interpretation: + +- `window_end_physical_time` is the upper bound of the currently leased TSO window, not the time of the latest issued TSO. +- `current_tso_physical_time` and `current_tso_logical_counter` together describe the current global allocation cursor. +- It is normal for `window_end_physical_time` to be greater than `current_tso_physical_time`, because the window end is a pre-leased future upper bound. + +Example: + +```json +{ + "code": 0, + "msg": "success", + "data": { + "window_end_physical_time": 1625097600000, + "current_tso": 123456789012345678, + "current_tso_physical_time": 1625097600000, + "current_tso_logical_counter": 123 + } +} +``` + +## Errors + +Common error cases include: + +- FE is not ready +- Current FE is not master +- Authentication failure + +## Notes + +- Calling this API does not consume the logical counter. +- If the system is experiencing clock rollback or clock stall, the returned TSO may still look normal at the instant of observation, while later transaction commits can fail because FE cannot obtain a new TSO after retries. +- A single normal response only proves the current snapshot looks healthy; it is not a guarantee that later allocations will succeed. 
+- See [TSO](../../cluster-management/tso.md) for clock-backward behavior and [FE Configuration](../../config/fe-config.md) for related settings such as `tso_clock_backward_startup_threshold_ms` and `enable_tso_forward_when_counter_full`. diff --git a/docs/admin-manual/system-tables/information_schema/rowsets.md b/docs/admin-manual/system-tables/information_schema/rowsets.md index c295b76668813..e6ad888495d3e 100644 --- a/docs/admin-manual/system-tables/information_schema/rowsets.md +++ b/docs/admin-manual/system-tables/information_schema/rowsets.md @@ -32,4 +32,21 @@ Returns basic information about the Rowset. | DATA_DISK_SIZE | bigint | The storage space for data within the Rowset. | | CREATION_TIME | datetime | The creation time of the Rowset. | | NEWEST_WRITE_TIMESTAMP | datetime | The most recent write time of the Rowset. | -| SCHEMA_VERSION | int | The Schema version number of the table corresponding to the Rowset data. | \ No newline at end of file +| SCHEMA_VERSION | int | The Schema version number of the table corresponding to the Rowset data. | +| COMMIT_TSO | bigint | The commit TSO recorded in the Rowset metadata (64-bit). This is typically available only when FE-level `enable_tso_feature = true`, table-level `enable_tso = true`, and the transaction successfully obtained a valid TSO. If commit TSO is not recorded, the value is typically `-1`. | + +## Usage Notes + +- `COMMIT_TSO` is useful for tracing the global commit order of rowsets created by TSO-enabled tables. +- `COMMIT_TSO` being `-1` usually means TSO recording was not enabled for that table or the transaction did not persist a commit TSO. +- `COMMIT_TSO` reflects committed rowset metadata only. It does not expose the current internal state of `TSOService`, and table-level TSO settings do not change how timestamps are allocated by the service. 
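To read a `COMMIT_TSO` value by hand, it can be split into its physical and logical parts. The sketch below is a hedged illustration only: the helper name `split_tso` is hypothetical, and the width of the logical counter is not stated in this documentation, so it is passed in explicitly rather than assumed as a default.

```python
def split_tso(tso: int, logical_bits: int) -> tuple[int, int]:
    """Split a composed TSO into (physical_ms, logical_counter).

    logical_bits is the width of the low-order counter field. The docs do not
    state it, so the caller must supply the value used by their build.
    """
    if tso < 0:
        raise ValueError("negative TSO (for example -1) means no commit TSO was recorded")
    return tso >> logical_bits, tso & ((1 << logical_bits) - 1)


# Hypothetical 18-bit counter width, chosen only to make the example concrete.
physical_ms, logical = split_tso((1625097600000 << 18) | 123, logical_bits=18)
```

Keeping the counter width as a required argument makes the assumption visible at every call site, which matters because a wrong width silently yields a plausible but incorrect physical time.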
+ +Example: + +```sql +SELECT BACKEND_ID, TXN_ID, TABLET_ID, ROWSET_ID, COMMIT_TSO +FROM information_schema.rowsets +WHERE COMMIT_TSO != -1 +ORDER BY COMMIT_TSO DESC +LIMIT 20; +``` diff --git a/docs/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md b/docs/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md index 767b1e034a0ab..c473b8c3f24d5 100644 --- a/docs/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md +++ b/docs/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md @@ -370,6 +370,7 @@ The functionality of creating synchronized materialized views with rollup is lim | enable_mow_light_delete | Whether to enable writing Delete predicate with Delete statements on Unique tables with Mow. If enabled, it will improve the performance of Delete statements, but partial column updates after Delete may result in some data errors. If disabled, it will reduce the performance of Delete statements to ensure correctness. The default value of this property is `false`. This property can only be enabled on Unique Merge-on-Write tables. | | Dynamic Partitioning Related Properties | For dynamic partitioning, refer to [Data Partitioning - Dynamic Partitioning](../../../../table-design/data-partitioning/dynamic-partitioning) | | enable_unique_key_skip_bitmap_column | Whether to enable the [Flexible Column Update feature](../../../../data-operate/update/update-of-unique-model.md#flexible-partial-column-updates) on Unique Merge-on-Write tables. This property can only be enabled on Unique Merge-on-Write tables. | +| enable_tso | Whether to enable TSO-related features for this table. This property requires FE-level `enable_tso_feature = true`. When enabled, successful commits on this table will try to record Rowset commit TSO and expose it through `information_schema.rowsets.COMMIT_TSO`. 
This table property only controls commit TSO recording for the table; it does not change the master-only allocation model, monotonicity rules, or persistence/recovery behavior of `TSOService`. This property does not bypass TSO service limitations such as clock rollback or logical-counter exhaustion. See [TSO](../../../../admin-manual/cluster-management/tso.md), [FE Configuration](../../../../admin-manual/config/fe-config.md), and [rowsets](../../../../admin-manual/system-tables/information_schema/rowsets.md). | ## Access Control Requirements @@ -735,4 +736,4 @@ AS SELECT * FROM t1; ```sql CREATE TABLE t11 LIKE t10; -``` \ No newline at end of file +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/cluster-management/tso.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/cluster-management/tso.md new file mode 100644 index 0000000000000..96eb6a52e75f5 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/cluster-management/tso.md @@ -0,0 +1,156 @@ +--- +{ + "title": "全局时间戳服务(TSO)", + "language": "zh-CN", + "description": "TSO(Timestamp Oracle)为 Doris 提供全局单调递增的时间戳。" +} +--- + +## 概述 + +TSO(Timestamp Oracle)是运行在 **Master FE** 上的服务,用于生成 **全局单调递增** 的 64 位时间戳。Doris 在分布式场景中将 TSO 作为统一的版本基准,从而规避多节点物理时钟偏移带来的正确性风险。 + +典型使用场景包括: + +- 跨表、跨节点的统一“事务版本号”。 +- 基于全局顺序的增量计算 / 分版本读取。 +- 更易观测:时间戳相比内部版本号更具可读性。 + +## 时间戳结构 + +TSO 是一个 64 位整数: + +- 高位:自 Unix 纪元以来的**物理时间(毫秒)** +- 低位:用于同一毫秒内发号的**逻辑计数器** + +TSO 的核心保证是**单调递增**,而不是精确反映物理时钟(wall clock)。 + +## 架构与生命周期 + +- **Master FE** 上运行 `TSOService` 守护线程。 +- FE 内部组件(例如事务发布与元数据修复流程)通过 `Env.getCurrentEnv().getTSOService().getTSO()` 获取时间戳。 +- 服务采用“**时间窗口租约**”(窗口右界物理时间)来降低持久化开销,同时保证切主后的单调性。 + +### 单调性保证原理 + +TSO 的单调性由三层机制共同保证: + +- **同一物理毫秒内**:Doris 保持物理时间不变,仅递增逻辑计数器,因此同一毫秒内后发出的 TSO 一定更大。 +- **跨物理毫秒**:当物理时间向前推进后,逻辑计数器会重置,因此新的 TSO 仍然会大于之前发出的值。 +- **跨重启或切主**:Doris 会回放持久化的 TSO 窗口右界,并把新的起始物理时间校准到历史上界之后,再继续发号。 + +这也是为什么 Doris 应将 TSO 视为**单调递增的版本生成器**,而不是直接映射物理时钟的时间值。 + +### 
Master 切换时的单调性保证 + +当发生切主时,新 Master FE 会回放持久化的窗口右界并执行时间校准,确保新主发出的第一个 TSO 严格大于旧主已经发出的所有 TSO。 + +### 只有 Master FE 才能发号 + +只有 Master FE 被允许生成 TSO 和暴露 `/api/tso`。 + +- 这样可以避免多个 FE 节点各自独立发号。 +- 当前 master 同时负责时间戳分配和窗口右界持久化。 +- 角色切换后,旧 master 不应继续作为 TSO 分配器对外服务。 + + +### 持久化与恢复 + +TSO 重点持久化的是**时间窗口右界**(`windowEndTSO`),而不是每一个已经发出的 TSO。 + +- Doris 会预先租约一个未来时间窗口,并把该窗口的**右界**写入 EditLog。 +- 持久化窗口右界比“每发一个 TSO 就持久化一次”开销更低,但仍然能为恢复提供一个安全的历史上界。 +- 如果开启相关能力,checkpoint image 也可以保存 TSO 模块,用于更快恢复这条边界。 +- 在恢复过程中,新 master 会先回放这个历史边界,再选择一个比它更大的物理时间作为新的起点,然后继续发号。 + +正是这个设计,使 Doris 可以在不把每次 TSO 申请都变成持久化操作的前提下,仍然在重启和切主后保持单调性。 + +### 端到端链路 + +- Master FE 上的 `TSOService` 负责发放 TSO。 +- 守护线程会周期性续租时间窗口,并把新的窗口右界写入 EditLog。 +- checkpoint image 可以按需保存 TSO 模块,加快恢复速度。 +- 重启或切主后,Doris 会回放窗口右界并校准新的安全起点。 +- 对开启 `enable_tso = true` 的表,事务提交时会把 commit TSO 写入 rowset 元数据。 +- `/api/tso` 观测的是当前服务状态,`information_schema.rowsets.COMMIT_TSO` 观测的是已经提交落盘的结果。 + +## 配置项 + +TSO 由 FE 配置项控制(如何配置与持久化请参见 [FE 配置项](../config/fe-config.md)): + +- `enable_tso_feature` +- `tso_service_update_interval_ms` +- `tso_max_update_retry_count` +- `tso_max_get_retry_count` +- `tso_service_window_duration_ms` +- `tso_clock_backward_startup_threshold_ms` +- `tso_time_offset_debug_mode`(仅测试/调试) +- `enable_tso_persist_journal`(可能影响回滚兼容性) +- `enable_tso_checkpoint_module`(旧版本读取新镜像可能需忽略未知模块) +- `enable_tso_forward_when_counter_full` + +## 时钟回拨行为 + +TSO 在“启动校准”和“正常运行”两个阶段,对时钟回拨的处理方式不同: + +- 启动校准阶段,新 Master FE 会比较“持久化的 TSO 窗口右界”与当前系统时间。 +- 如果回拨幅度超过 `tso_clock_backward_startup_threshold_ms`,TSO 初始化会直接失败,Master FE 不能安全地继续发放新的 TSO。 +- 正常运行阶段,检测到时钟回拨只会记录告警日志和指标,不会立即停止服务。 + +因此,时钟回拨并不一定会立刻让事务失败;真正的风险在于物理时间能否及时重新向前推进,以及逻辑计数器是否会先被耗尽。 + +运行阶段对回拨采用较软的处理策略,是因为 Doris 更倾向于先保持 master 可用,再依靠已有的单调性保护、逻辑计数器和已持久化窗口右界继续运行。真正的硬失败发生在启动校准阶段,因为那时 Doris 必须先证明“下一次发号仍然会严格大于历史值”。 + +## 逻辑计数器耗尽 + +TSO 使用逻辑计数器在同一毫秒内发放多个唯一时间戳。如果物理时间在一段时间内无法前进,服务就会持续消耗同一个物理毫秒下的逻辑计数器。 + +- 当逻辑计数器达到上限后,`getTSO()` 会按照 `tso_max_get_retry_count` 进行重试。 +- 如果在重试耗尽前仍然等不到新的物理毫秒,TSO 申请会失败。 +- 需要 commit 
TSO 的事务随后可能因为 FE 无法获取有效 TSO 而提交失败。 + +## 配置影响 + +- `tso_clock_backward_startup_threshold_ms`:只影响启动校准阶段,用于定义在初始化失败前可容忍的最大时钟回拨量。 +- `enable_tso_forward_when_counter_full`:开启后,当逻辑计数器占用较高时,TSO 服务会主动把物理时间前推 1ms,以降低命中逻辑计数器上限的概率。 +- `enable_tso_forward_when_counter_full = false`:服务会更依赖真实物理时钟前进,不会主动推进物理时间;在时钟停滞或回拨场景下,更容易出现逻辑计数器耗尽。 +- `tso_max_get_retry_count`:控制 FE 在返回 TSO 申请失败前最多重试多少次。 +- `tso_service_update_interval_ms`:影响守护线程检查时钟状态与刷新 TSO 时间窗口的频率。 +- `enable_tso_persist_journal`:是重启或切主后继续从安全历史上界恢复的基础;没有它就无法可靠避免恢复后的回退风险。 +- `enable_tso_checkpoint_module`:影响 checkpoint image 是否也携带 TSO 边界,用于加快恢复;它不会改变运行期发号算法本身。 + +## 可观测与调试 + +### FE HTTP 接口 + +可以通过 FE HTTP 接口在不消耗逻辑计数器的情况下读取当前 TSO 信息: + +- `GET /api/tso` + +返回结果是当前 TSO 状态的只读快照,包括当前逻辑计数器与当前时间窗口右界。它适合用于观测,但不能保证后续事务一定还能成功获取新的 TSO。 + +其中,`window_end_physical_time` 表示当前租约窗口的上界,`current_tso` 表示当前的发号游标。窗口右界领先于当前 TSO 的物理时间是正常现象,因为窗口右界本来就是预先租约的未来上界。 + +参见 [TSO Action](../open-api/fe-http/tso-action.md) 获取鉴权方式、返回字段、示例与注意事项。 + +### 系统表:`information_schema.rowsets` + +在相关能力开启后,Doris 会将提交时的 commit TSO 写入 Rowset 元数据,并通过系统表暴露: + +- `information_schema.rowsets.COMMIT_TSO` + +这依赖 FE 级别 `enable_tso_feature = true` 以及表级 `enable_tso = true` 同时开启。 + +表级 `enable_tso` 只决定该表是否记录 commit TSO,不会改变 `TSOService` 的发号方式,也不会放宽单调性保护约束。 + +参见 [rowsets](../system-tables/information_schema/rowsets.md)。 + +## FAQ + +### TSO 能否当作精确的物理时钟(wall clock)使用? + +不能。虽然高位包含毫秒级物理时间,但在某些情况下(例如逻辑计数器使用量较高)物理部分可能会被主动推进。因此,应将 TSO 视为**单调递增的版本**,而不是精确的物理时钟。 + +### 为什么时钟回拨时事务可能报错? 
+ +运行期时钟回拨本身只会触发告警和指标,但它可能让 TSO 在同一个物理毫秒内停留更久。如果逻辑计数器的消耗速度快于物理时间恢复速度,FE 在 `tso_max_get_retry_count` 次重试后仍可能拿不到新的 TSO,从而导致需要 commit TSO 的事务提交失败。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/config/fe-config.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/config/fe-config.md index 813131a9e91b5..77e538a2d81f5 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/config/fe-config.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/config/fe-config.md @@ -361,6 +361,108 @@ heartbeat_mgr 中处理心跳事件的线程数。 是否为 Master FE 节点独有的配置项:false +### TSO(Timestamp Oracle) + +#### `enable_tso_feature` + +默认值:false + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +是否启用 FE 侧 TSO(全局时间戳)相关能力。这是 TSO 服务可用性以及表级 `enable_tso` 能力的全局开关,包括记录 Rowset 的提交 TSO 并在系统表中暴露相关字段。 + +#### `tso_service_update_interval_ms` + +默认值:50(ms) + +是否可以动态配置:false + +是否为 Master FE 节点独有配置项:true + +TSO 服务的更新间隔(毫秒)。守护线程会周期性检查时钟漂移/回拨,并在需要时续租时间窗口。 + +#### `tso_max_update_retry_count` + +默认值:3 + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +TSO 服务更新全局时间戳(例如持久化新的时间窗口右界)失败时的最大重试次数。 + +#### `tso_max_get_retry_count` + +默认值:10 + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +获取/生成 TSO 失败时的最大重试次数。如果在这些重试之后 FE 仍无法拿到有效 TSO,请求(例如事务提交)可能失败。 + +#### `tso_service_window_duration_ms` + +默认值:5000(ms) + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +TSO 时间窗口时长(毫秒)。Master FE 持久化的是“已租约的未来窗口右界”,而不是每一次已经发出的 TSO,因此窗口越大,持久化频率越低,同时仍能在重启或切主恢复时提供一个安全的历史上界。 + +#### `tso_clock_backward_startup_threshold_ms` + +默认值:1800000(ms) + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +TSO 启动校准阶段允许的最大时钟回拨阈值。如果持久化的 TSO 窗口右界领先当前系统时间超过该阈值,TSO 初始化会失败。该阈值只影响启动校准阶段,不是运行期熔断阈值。 + +#### `tso_time_offset_debug_mode` + +默认值:0(ms) + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:false + +TSO 服务时间偏移(毫秒),仅用于测试/调试。 + +#### `enable_tso_persist_journal` + +默认值:false + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +是否启用将 TSO 时间窗口右界写入 EditLog。这是重启或切主后避免 TSO 
回退的持久化基础,因为启动校准必须先恢复出一个历史上界,再继续发放新的时间戳。开启后可能会产生新的操作码,回滚到旧版本可能不兼容。 + +#### `enable_tso_checkpoint_module` + +默认值:false + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +是否启用将 TSO 信息作为 checkpoint 镜像模块参与持久化。它主要影响 checkpoint/image 恢复路径的速度与完整性,不改变运行期发号算法本身。开启后镜像中包含新模块,旧版本读取新镜像可能需要忽略未知模块。 + +#### `enable_tso_forward_when_counter_full` + +默认值:true + +是否可以动态配置:true + +是否为 Master FE 节点独有配置项:true + +当逻辑计数器占用较高时,是否主动将 TSO 的物理时间前推 1ms。开启后可以降低在物理时钟未及时前进时触发逻辑计数器耗尽的概率。这个前推动作属于单调性保护的一部分,并不意味着 TSO 旨在精确反映物理时钟。关闭后,TSO 会更依赖真实时钟推进,因此在时钟停滞或回拨场景下更容易表现为 TSO 分配失败和事务报错。 + ### 服务 #### `query_port` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/open-api/fe-http/tso-action.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/open-api/fe-http/tso-action.md new file mode 100644 index 0000000000000..cceaa53ca5fec --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/open-api/fe-http/tso-action.md @@ -0,0 +1,79 @@ +--- +{ + "title": "TSO Action", + "language": "zh-CN", + "description": "从 Master FE 获取当前 TSO(Timestamp Oracle)信息。" +} +--- + +## Request + +`GET /api/tso` + +## Description + +从 **Master FE** 获取当前 TSO(Timestamp Oracle)信息。 + +- 该接口为**只读**:返回当前 TSO,但**不会递增** TSO 值。 +- 需要鉴权,请使用具有**管理员权限**的账号访问。 +- 该接口适合用于观测当前 TSO 时间窗口右界、物理时间部分和逻辑计数器部分。 +- 该接口只反映当前时刻的状态快照,不能保证后续事务一定还能成功获取新的 TSO。 + +## Path parameters + +无 + +## Query parameters + +无 + +## Request body + +无 + +## Response + +成功时,返回体 `code = 0`,并在 `data` 中包含: + +| 字段 | 类型 | 含义 | +| --- | --- | --- | +| window_end_physical_time | long | Master FE 当前 TSO 时间窗口的右界物理时间(毫秒)。 | +| current_tso | long | 当前完整的 64 位 TSO 值。 | +| current_tso_physical_time | long | 从 `current_tso` 解析出的物理时间部分(毫秒)。 | +| current_tso_logical_counter | long | 从 `current_tso` 解析出的逻辑计数器部分。 | + +字段解读: + +- `window_end_physical_time` 表示当前已租约 TSO 时间窗口的上界,而不是“最近一次已经发出的 TSO 时间”。 +- `current_tso_physical_time` 与 `current_tso_logical_counter` 一起表示当前全局发号游标。 +- `window_end_physical_time` 大于 
`current_tso_physical_time` 是正常现象,因为窗口右界本来就是预先租约的未来上界。 + +示例: + +```json +{ + "code": 0, + "msg": "success", + "data": { + "window_end_physical_time": 1625097600000, + "current_tso": 123456789012345678, + "current_tso_physical_time": 1625097600000, + "current_tso_logical_counter": 123 + } +} +``` + +## 错误 + +常见错误包括: + +- FE 尚未就绪 +- 当前 FE 不是 Master +- 鉴权失败 + +## 注意事项 + +- 调用该接口不会消耗逻辑计数器。 +- 如果系统正处于时钟回拨或时钟停滞场景,当前返回的 TSO 在观测时刻仍可能看起来正常,但后续事务提交仍可能因为 FE 在重试后拿不到新的 TSO 而失败。 +- 单次返回正常只说明当前快照看起来健康,并不保证后续分配一定成功。 +- 关于时钟回拨行为,请参见 [全局时间戳服务(TSO)](../../cluster-management/tso.md);关于相关配置,请参见 [FE 配置项](../../config/fe-config.md) 中的 `tso_clock_backward_startup_threshold_ms` 和 `enable_tso_forward_when_counter_full`。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/system-tables/information_schema/rowsets.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/system-tables/information_schema/rowsets.md index 821adff23711a..a63771fe3782b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/system-tables/information_schema/rowsets.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/system-tables/information_schema/rowsets.md @@ -32,4 +32,21 @@ | DATA_DISK_SIZE | bigint | Rowset 内数据的存储空间。 | | CREATION_TIME | datetime | Rowset 的创建时间。 | | NEWEST_WRITE_TIMESTAMP | datetime | Rowset 的最近写入时间。 | -| SCHEMA_VERSION | int | Rowset 数据对应的表 Schema 版本号。 | \ No newline at end of file +| SCHEMA_VERSION | int | Rowset 数据对应的表 Schema 版本号。 | +| COMMIT_TSO | bigint | Rowset 元数据中记录的提交 TSO(64 位)。通常只有在 FE 级别 `enable_tso_feature = true`、表级 `enable_tso = true`,且事务成功拿到有效 TSO 时才会有值;未记录时通常为 `-1`。 | + +## 使用说明 + +- `COMMIT_TSO` 适合用于追踪开启 TSO 的表所生成 Rowset 的全局提交顺序。 +- 如果 `COMMIT_TSO = -1`,通常表示该表未开启 TSO 记录能力,或者该事务没有持久化 commit TSO。 +- `COMMIT_TSO` 反映的是已经提交到 Rowset 元数据中的结果,而不是 `TSOService` 当前的内部状态;表级 TSO 开关也不会改变服务如何发号。 + +示例: + +```sql +SELECT BACKEND_ID, TXN_ID, TABLET_ID, ROWSET_ID, COMMIT_TSO +FROM information_schema.rowsets +WHERE 
COMMIT_TSO != -1 +ORDER BY COMMIT_TSO DESC +LIMIT 20; +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md index a5a8c5de85fb7..bf33639dbdee5 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-TABLE.md @@ -371,6 +371,7 @@ rollup 可以创建的同步物化视图功能有限。已不再推荐使用。 | enable_mow_light_delete | 是否在 Unique 表 Mow 上开启 Delete 语句写 Delete predicate。若开启,会提升 Delete 语句的性能,但 Delete 后进行部分列更新可能会出现部分数据错误的情况。若关闭,会降低 Delete 语句的性能来保证正确性。此属性的默认值为 `false`。此属性只能在 Unique Merge-on-Write 表上开启。 | | 动态分区相关属性 | 动态分区相关参考[数据划分 - 动态分区](../../../../table-design/data-partitioning/dynamic-partitioning) | | enable_unique_key_skip_bitmap_column | 是否在 Unique Merge-on-Write 表上开启[灵活列更新功能](../../../../data-operate/update/update-of-unique-model.md#灵活部分列更新)。此属性只能在 Unique Merge-on-Write 表上开启。 | +| enable_tso | 是否对该表开启 TSO 相关能力。该属性依赖 FE 级别 `enable_tso_feature = true`。开启后,该表成功提交的事务会尝试记录 Rowset commit TSO,并通过 `information_schema.rowsets.COMMIT_TSO` 暴露。这个表属性只控制该表是否记录 commit TSO,不会改变 `TSOService` 的 master-only 发号模型、单调性规则或持久化恢复行为。该属性不会绕过 TSO 服务本身的限制,例如时钟回拨或逻辑计数器耗尽。参见 [全局时间戳服务(TSO)](../../../../admin-manual/cluster-management/tso.md)、[FE 配置项](../../../../admin-manual/config/fe-config.md) 和 [rowsets](../../../../admin-manual/system-tables/information_schema/rowsets.md)。 | ## 权限控制 执行此 SQL 命令的[用户](../../../../admin-manual/auth/authentication-and-authorization.md)必须至少具有以下[权限](../../../../admin-manual/auth/authentication-and-authorization.md): @@ -734,4 +735,4 @@ AS SELECT * FROM t1 ```sql CREATE TABLE t11 LIKE t10 -``` \ No newline at end of file +``` diff --git a/sidebars.ts b/sidebars.ts index 5f05731a615d4..9b36c7de0c9c9 100644 --- a/sidebars.ts +++ 
b/sidebars.ts @@ -755,6 +755,7 @@ const sidebars: SidebarsConfig = { 'admin-manual/cluster-management/load-balancing', 'admin-manual/cluster-management/time-zone', 'admin-manual/cluster-management/fqdn', + 'admin-manual/cluster-management/tso', ], }, { @@ -1014,6 +1015,7 @@ const sidebars: SidebarsConfig = { 'admin-manual/open-api/fe-http/meta-info-action-V2', 'admin-manual/open-api/fe-http/debug-point-action', 'admin-manual/open-api/fe-http/statistic-action', + 'admin-manual/open-api/fe-http/tso-action', ], }, {