Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
159 changes: 159 additions & 0 deletions docs/admin-manual/cluster-management/tso.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
---
{
"title": "Timestamp Oracle (TSO)",
"language": "en",
"description": "Timestamp Oracle (TSO) provides globally monotonic timestamps for Doris."
}
---

## Overview

Timestamp Oracle (TSO) is a service running on the **Master FE** that generates **globally monotonic** 64-bit timestamps. Doris uses TSO as a unified version reference in distributed scenarios, avoiding the correctness risks caused by physical clock skew across nodes.

Typical use cases include:

- A unified “transaction version” across multiple tables and nodes.
- Incremental processing / version-based reads using a single global ordering.
- Better observability: a timestamp is easier to interpret than an internal version counter.

## Timestamp Format

TSO is a 64-bit integer:

- High bits: **physical time (milliseconds)** since Unix epoch
- Low bits: **logical counter** for issuing multiple unique timestamps within the same millisecond

The core guarantee of TSO is **monotonicity**, not being an exact wall clock.

## Architecture and Lifecycle

- **Master FE** hosts the `TSOService` daemon.
- FE components (for example, transaction publish and metadata repair flows) obtain timestamps from `Env.getCurrentEnv().getTSOService().getTSO()`.
- The service uses a **time window lease** (window end physical time) to reduce persistence overhead while ensuring monotonicity across master failover.

### Monotonicity Guarantee

TSO monotonicity is guaranteed by combining three layers:

- **Within the same millisecond**: Doris keeps the physical time unchanged and increases the logical counter, so a later TSO in the same millisecond is always larger.
- **Across milliseconds**: once physical time moves forward, the logical counter is reset, so the next TSO still remains greater than previous ones.
- **Across restart or master switch**: Doris replays the persisted TSO window end and calibrates the new starting physical time to be greater than the previously persisted upper bound.

This is why Doris treats TSO as a **monotonic version generator**, not as a direct wall-clock mirror.

### Monotonicity Across Master Failover

On master switch, the new Master FE replays the persisted window end and calibrates the initial physical time to ensure the first TSO it issues is strictly greater than any TSO issued by the previous master.

### Why Only Master FE Issues TSO

Only the Master FE is allowed to issue TSO values and expose `/api/tso`.

- This avoids multiple FE nodes issuing timestamps independently.
- The active master owns both timestamp generation and persistence of the leased window end.
- After role change, the old master is not supposed to continue serving as a TSO allocator.

Without this master-only rule, Doris could not safely guarantee a single global TSO order.

### Persistence and Recovery

The key persisted state is the **window end physical time** (`windowEndTSO`), not every individual issued TSO.

- Doris leases a future time window and persists the **right boundary** of that window to EditLog.
- Persisting the window boundary is much cheaper than writing every issued timestamp while still providing a safe upper bound for recovery.
- If enabled, the checkpoint image can also store the TSO module so that recovery can restore the same boundary faster.
- During recovery, the new master replays the persisted boundary and chooses a new physical time that is greater than the historical upper bound before issuing new TSO values.

This design is what lets Doris preserve monotonicity across restart and master switch without turning every TSO allocation into a persistence operation.

### End-to-End Flow

- Master FE runs `TSOService` and allocates TSO values.
- The daemon periodically renews the time window and writes the new window end to EditLog.
- Checkpoint image can optionally persist the TSO module for faster recovery.
- After restart or master switch, Doris replays the window end and calibrates a safe new starting point.
- Transactions on tables with `enable_tso = true` record commit TSO into rowset metadata.
- `/api/tso` shows current service state, while `information_schema.rowsets.COMMIT_TSO` shows committed results written into rowsets.

## Configuration

TSO is controlled by FE configuration items (see [FE Configuration](../config/fe-config.md) for how to set and persist configs):

- `enable_tso_feature`
- `tso_service_update_interval_ms`
- `tso_max_update_retry_count`
- `tso_max_get_retry_count`
- `tso_service_window_duration_ms`
- `tso_clock_backward_startup_threshold_ms`
- `tso_time_offset_debug_mode` (test only)
- `enable_tso_persist_journal` (may affect rollback compatibility)
- `enable_tso_checkpoint_module` (may affect older versions reading newer images)
- `enable_tso_forward_when_counter_full`

## Clock Backward Behavior

TSO handles clock backward differently during startup calibration and normal runtime:

- During startup calibration, the new Master FE compares the persisted TSO window end with the current system time.
- If the backward gap exceeds `tso_clock_backward_startup_threshold_ms`, TSO initialization fails fast and the Master FE cannot safely issue new TSOs.
- During normal runtime, detecting clock backward only triggers warning logs and metrics. The service does not immediately stop.

This means a clock rollback does not always fail transactions immediately. The actual risk depends on whether physical time can move forward again before the logical counter is exhausted.

Runtime rollback detection is intentionally softer than startup calibration. During runtime, Doris prefers to keep the master available and relies on the existing monotonicity guards, logical counter, and persisted window boundary. The hard failure happens at startup calibration because that is the point where Doris must prove the next TSO can still be greater than historical values.

## Logical Counter Exhaustion

TSO uses a logical counter to generate multiple timestamps within the same millisecond. If physical time cannot advance for a while, the service keeps consuming the logical counter under the same physical millisecond.

- When the logical counter reaches its limit, `getTSO()` retries according to `tso_max_get_retry_count`.
- If retries are exhausted before a new physical millisecond becomes available, TSO allocation fails.
- Transactions that need a commit TSO may then fail because FE cannot obtain a valid TSO.

This is the main reason clock rollback can eventually surface as transaction errors even though runtime rollback detection itself is not a hard-stop mechanism.

## Configuration Impact

- `tso_clock_backward_startup_threshold_ms`: only affects startup calibration. It defines how much backward clock drift is tolerated before TSO initialization fails.
- `enable_tso_forward_when_counter_full`: when enabled, the TSO service proactively advances physical time by 1ms once the logical counter becomes high, which helps reduce the chance of hitting the logical counter limit.
- `enable_tso_forward_when_counter_full = false`: the service depends more strictly on real wall-clock progress and does not proactively advance physical time. Under clock stall or rollback, logical-counter exhaustion is more likely.
- `tso_max_get_retry_count`: controls how many retries FE performs before returning a TSO allocation failure.
- `tso_service_update_interval_ms`: affects how often the daemon checks clock conditions and refreshes the TSO window.
- `enable_tso_persist_journal`: is the persistence foundation that allows restart or master switch to resume from a safe upper bound instead of risking rollback.
- `enable_tso_checkpoint_module`: affects whether checkpoint image also carries the TSO boundary for faster recovery; it does not change the runtime allocation algorithm.

## Observability and Debugging

### FE HTTP API

You can fetch the current TSO without consuming the logical counter via FE HTTP API:

- `GET /api/tso`

The response is a read-only snapshot of the current TSO state, including the current logical counter and the current window end. It is useful for observation, but it does not guarantee that future transactions will always be able to obtain a new TSO.

`window_end_physical_time` is the leased upper bound of the current TSO window, while `current_tso` represents the current allocation cursor. It is normal for the window end to be ahead of the current TSO physical time.

See [TSO Action](../open-api/fe-http/tso-action.md) for authentication, response fields, examples, and caveats.

### System Table: `information_schema.rowsets`

When enabled, Doris records the commit TSO into rowset metadata and exposes it via:

- `information_schema.rowsets.COMMIT_TSO`

This requires both FE-level `enable_tso_feature = true` and table-level `enable_tso = true`.

Table-level `enable_tso` only controls whether commit TSO is recorded for that table. It does not change how `TSOService` allocates timestamps or how monotonicity is protected.

See [rowsets](../system-tables/information_schema/rowsets.md).

## FAQ

### Can I treat TSO as a wall clock?

No. Although the physical part is in milliseconds, the physical time may be advanced proactively (for example, to handle high logical counter usage), so TSO should be used as a **monotonic version** rather than a precise wall clock.

### Why can transactions fail during clock rollback?

Clock rollback during runtime only raises warnings and metrics, but it can keep TSO in the same physical millisecond for longer than expected. If the logical counter is consumed faster than physical time recovers, FE may fail to obtain a new TSO after `tso_max_get_retry_count` retries, and transactions that require commit TSO may fail.
102 changes: 102 additions & 0 deletions docs/admin-manual/config/fe-config.md
Original file line number Diff line number Diff line change
Expand Up @@ -360,6 +360,108 @@ Is it possible to dynamically configure: true

Is it a configuration item unique to the Master FE node: false

### TSO (Timestamp Oracle)

#### `enable_tso_feature`

Default:false

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Whether to enable TSO (Timestamp Oracle) related features on FE. This is the global switch for TSO service availability and for table-level `enable_tso` usage, including recording rowset commit TSO and exposing it via system tables.

#### `tso_service_update_interval_ms`

Default:50(ms)

IsMutable:false

Is it a configuration item unique to the Master FE node: true

The update interval of the TSO service in milliseconds. The daemon periodically checks clock drift/backward and renews the time window.

#### `tso_max_update_retry_count`

Default:3

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Maximum retry count when the TSO service updates the global timestamp (for example, when persisting a new window end).

#### `tso_max_get_retry_count`

Default:10

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Maximum retry count when generating a new TSO. If FE still cannot obtain a valid TSO after these retries, requests such as transaction commit may fail.

#### `tso_service_window_duration_ms`

Default:5000(ms)

IsMutable:true

Is it a configuration item unique to the Master FE node: true

The duration of a leased TSO time window in milliseconds. The Master FE persists the leased future window end, not every issued TSO, so a larger window reduces persistence frequency while still preserving a safe upper bound for restart or master failover recovery.

#### `tso_clock_backward_startup_threshold_ms`

Default:1800000(ms)

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Maximum tolerated clock-backward threshold during TSO startup calibration. If the persisted TSO window end is ahead of current system time by more than this threshold, TSO initialization fails. This threshold only affects startup calibration and is not a runtime circuit breaker.

#### `tso_time_offset_debug_mode`

Default:0(ms)

IsMutable:true

Is it a configuration item unique to the Master FE node: false

Time offset for the TSO service in milliseconds. For test/debug only.

#### `enable_tso_persist_journal`

Default:false

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Whether to persist the TSO window end into edit log. This is the persistence foundation for restarting or switching master without TSO rollback, because startup calibration must recover a historical upper bound before issuing new timestamps. Enabling this may emit new operation codes and may break rollback compatibility with older versions.

#### `enable_tso_checkpoint_module`

Default:false

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Whether to include TSO information as a checkpoint image module for faster recovery. This mainly affects checkpoint/image recovery speed and completeness; it does not change the runtime TSO allocation algorithm itself. Older versions may need to ignore unknown modules when reading newer images.

#### `enable_tso_forward_when_counter_full`

Default:true

IsMutable:true

Is it a configuration item unique to the Master FE node: true

Whether to proactively advance TSO physical time by 1ms when the logical counter becomes high. Enabling this reduces the chance of logical-counter exhaustion when the wall clock does not move forward fast enough. This forward step is part of monotonicity protection and does not mean TSO is intended to be an exact wall clock. If disabled, TSO depends more strictly on actual clock progress, so clock stall or rollback is more likely to surface as TSO allocation failure and transaction errors.

### Service

#### `query_port`
Expand Down
79 changes: 79 additions & 0 deletions docs/admin-manual/open-api/fe-http/tso-action.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
---
{
"title": "TSO Action",
"language": "en",
"description": "Get current TSO (Timestamp Oracle) information from the Master FE."
}
---

## Request

`GET /api/tso`

## Description

Returns the current TSO (Timestamp Oracle) information from the **Master FE**.

- This endpoint is **read-only**: it returns the current TSO value **without increasing** it.
- Authentication is required. Use an account with **administrator privileges**.
- This endpoint is useful for observing the current TSO window end, physical time part, and logical counter part.
- This endpoint is only a snapshot of current state. It does not guarantee that a later transaction can always obtain a new TSO.

## Path parameters

None.

## Query parameters

None.

## Request body

None.

## Response

On success, the response body has `code = 0` and the `data` field contains:

| Field | Type | Description |
| --- | --- | --- |
| window_end_physical_time | long | The end physical time (ms) of the current TSO window on the Master FE. |
| current_tso | long | The current composed 64-bit TSO value. |
| current_tso_physical_time | long | The extracted physical time part (ms) from `current_tso`. |
| current_tso_logical_counter | long | The extracted logical counter part from `current_tso`. |

Interpretation:

- `window_end_physical_time` is the upper bound of the currently leased TSO window, not the time of the latest issued TSO.
- `current_tso_physical_time` and `current_tso_logical_counter` together describe the current global allocation cursor.
- It is normal for `window_end_physical_time` to be greater than `current_tso_physical_time`, because the window end is a pre-leased future upper bound.

Example:

```json
{
"code": 0,
"msg": "success",
"data": {
"window_end_physical_time": 1625097600000,
"current_tso": 123456789012345678,
"current_tso_physical_time": 1625097600000,
"current_tso_logical_counter": 123
}
}
```

## Errors

Common error cases include:

- FE is not ready
- Current FE is not master
- Authentication failure

## Notes

- Calling this API does not consume the logical counter.
- If the system is experiencing clock rollback or clock stall, the returned TSO may still look normal at the instant of observation, while later transaction commits can fail because FE cannot obtain a new TSO after retries.
- A single normal response only proves the current snapshot looks healthy; it is not a guarantee that later allocations will succeed.
- See [TSO](../../cluster-management/tso.md) for clock-backward behavior and [FE Configuration](../../config/fe-config.md) for related settings such as `tso_clock_backward_startup_threshold_ms` and `enable_tso_forward_when_counter_full`.
19 changes: 18 additions & 1 deletion docs/admin-manual/system-tables/information_schema/rowsets.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,4 +32,21 @@ Returns basic information about the Rowset.
| DATA_DISK_SIZE | bigint | The storage space for data within the Rowset. |
| CREATION_TIME | datetime | The creation time of the Rowset. |
| NEWEST_WRITE_TIMESTAMP | datetime | The most recent write time of the Rowset. |
| SCHEMA_VERSION | int | The Schema version number of the table corresponding to the Rowset data. |
| SCHEMA_VERSION | int | The Schema version number of the table corresponding to the Rowset data. |
| COMMIT_TSO | bigint | The commit TSO recorded in the Rowset metadata (64-bit). This is typically available only when FE-level `enable_tso_feature = true`, table-level `enable_tso = true`, and the transaction successfully obtained a valid TSO. If commit TSO is not recorded, the value is typically `-1`. |

## Usage Notes

- `COMMIT_TSO` is useful for tracing the global commit order of rowsets created by TSO-enabled tables.
- `COMMIT_TSO` being `-1` usually means TSO recording was not enabled for that table or the transaction did not persist a commit TSO.
- `COMMIT_TSO` reflects committed rowset metadata only. It does not expose the current internal state of `TSOService`, and table-level TSO settings do not change how timestamps are allocated by the service.

Example:

```sql
SELECT BACKEND_ID, TXN_ID, TABLET_ID, ROWSET_ID, COMMIT_TSO
FROM information_schema.rowsets
WHERE COMMIT_TSO != -1
ORDER BY COMMIT_TSO DESC
LIMIT 20;
```
Original file line number Diff line number Diff line change
Expand Up @@ -370,6 +370,7 @@ The functionality of creating synchronized materialized views with rollup is lim
| enable_mow_light_delete | Whether to enable writing Delete predicate with Delete statements on Unique tables with Mow. If enabled, it will improve the performance of Delete statements, but partial column updates after Delete may result in some data errors. If disabled, it will reduce the performance of Delete statements to ensure correctness. The default value of this property is `false`. This property can only be enabled on Unique Merge-on-Write tables. |
| Dynamic Partitioning Related Properties | For dynamic partitioning, refer to [Data Partitioning - Dynamic Partitioning](../../../../table-design/data-partitioning/dynamic-partitioning) |
| enable_unique_key_skip_bitmap_column | Whether to enable the [Flexible Column Update feature](../../../../data-operate/update/update-of-unique-model.md#flexible-partial-column-updates) on Unique Merge-on-Write tables. This property can only be enabled on Unique Merge-on-Write tables. |
| enable_tso | Whether to enable TSO-related features for this table. This property requires FE-level `enable_tso_feature = true`. When enabled, successful commits on this table will try to record Rowset commit TSO and expose it through `information_schema.rowsets.COMMIT_TSO`. This table property only controls commit TSO recording for the table; it does not change the master-only allocation model, monotonicity rules, or persistence/recovery behavior of `TSOService`. This property does not bypass TSO service limitations such as clock rollback or logical-counter exhaustion. See [TSO](../../../../admin-manual/cluster-management/tso.md), [FE Configuration](../../../../admin-manual/config/fe-config.md), and [rowsets](../../../../admin-manual/system-tables/information_schema/rowsets.md). |

## Access Control Requirements

Expand Down Expand Up @@ -735,4 +736,4 @@ AS SELECT * FROM t1;

```sql
CREATE TABLE t11 LIKE t10;
```
```
Loading