Commit b02fed5

RATIS-2388 (Further) Enhancing content for concept in ratis-docs (#1338)
<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Introduction to Apache Ratis

Previous: [Operations and Management](operations.md) | Top: [Overview of Raft and Ratis](index.md)

## Section 5: Advanced Topics

* [Scaling with Multi-Raft Groups](#scaling-with-multi-raft-groups)

### Scaling with Multi-Raft Groups

As your application grows, you may find that a single Raft group becomes a bottleneck. This is
where Ratis's multi-group capability becomes valuable.
#### Understanding Multi-Raft

Multi-Raft is an implementation pattern that Ratis supports for scaling beyond the limits of a
single Raft group. In a multi-Raft setup, you run multiple independent Raft groups, each
handling a subset of your application's operations. Each group operates independently with its
own leader election, consensus, log, and state machine.

#### What is a Raft Group in Ratis?

In Ratis terminology, a "Raft Group" is a collection of servers that participate in a single
Raft cluster. Each group has a unique RaftGroupId (a UUID) that distinguishes it from other groups.
Each group consists of a set of RaftPeer objects representing the servers that participate in that
group's consensus.
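The relationship between a group id and its peers can be sketched with a stdlib-only model. Note that `GroupId`, `Peer`, and `Group` below are simplified stand-ins for illustration, not the actual Ratis `RaftGroupId`, `RaftPeer`, and `RaftGroup` classes:

```java
import java.util.Set;
import java.util.UUID;

// Simplified stand-ins for a Ratis group id, peer, and group.
class GroupModel {
    // A group id is just a UUID, so two groups never collide.
    record GroupId(UUID uuid) {}

    // A peer is a server identity plus the address other peers use to reach it.
    record Peer(String id, String address) {}

    // A group is an id plus the fixed set of peers taking part in its consensus.
    record Group(GroupId id, Set<Peer> peers) {}

    static Group newGroup(Peer... peers) {
        return new Group(new GroupId(UUID.randomUUID()), Set.of(peers));
    }

    public static void main(String[] args) {
        Group g = newGroup(new Peer("s1", "host1:9872"),
                           new Peer("s2", "host2:9872"),
                           new Peer("s3", "host3:9872"));
        System.out.println(g.peers().size() + " peers in group " + g.id().uuid());
    }
}
```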
#### When to Use Multiple Groups

Consider using multiple Raft groups when:

* a single group cannot handle the required throughput;
* you can logically partition your data or operations (such as having one group per geographic
  region, per customer tenant, or per data type);
* you need better fault isolation (if one group fails, other groups can continue operating);
* you need different operational characteristics for different parts of your system.
#### Implementation Considerations

A single RaftServer instance can participate in multiple groups simultaneously. Each group gets
its own "Division" within the server, with its own state machine and storage. Since groups don't
coordinate at the Raft level, your application must handle any cross-group consistency
requirements through distributed transactions, saga patterns, or eventual consistency approaches.

To use multi-Raft effectively, you need to partition your application state:

* **Horizontal partitioning** distributes data across groups based on some key (e.g., user ID
  hash, geographic region).
* **Functional partitioning** assigns different groups to handle different types of operations
  or services.
* **Hierarchical partitioning** uses a tree-like structure where higher-level groups coordinate
  lower-level groups.

Clients need to know which group to send requests to, through client-side routing logic, a proxy
layer that routes requests, or consistent hashing schemes.
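Client-side routing for horizontal partitioning can be sketched as follows. This is a stdlib-only illustration; the router class and the simple modulo scheme are assumptions for the example, not a Ratis API:

```java
import java.util.List;
import java.util.UUID;

// Routes each request key to one of N Raft groups by hashing the key.
class GroupRouter {
    private final List<UUID> groupIds;   // one id per Raft group

    GroupRouter(List<UUID> groupIds) {
        this.groupIds = groupIds;
    }

    // The same key always maps to the same group (horizontal partitioning).
    UUID route(String key) {
        int idx = Math.floorMod(key.hashCode(), groupIds.size());
        return groupIds.get(idx);
    }

    public static void main(String[] args) {
        GroupRouter r = new GroupRouter(
            List.of(UUID.randomUUID(), UUID.randomUUID(), UUID.randomUUID()));
        System.out.println("user-42 -> group " + r.route("user-42"));
    }
}
```

Plain modulo routing reshuffles most keys when the number of groups changes, which is why production systems often prefer a consistent hashing ring for this step.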
#### Trade-offs and Limitations

Multi-group setups are significantly more complex than single-group setups. Maintaining
consistency across groups requires application-level coordination, which can be complex and
error-prone. More groups mean more leaders to monitor, more logs to manage, and more complex
failure scenarios. Each group consumes resources, so there's a practical limit to the number of
groups per server.

#### Best Practices

Begin with a single group and only move to multiple groups when you have a clear scalability
need. Design your data model and operations to be partition-friendly from the start if you
anticipate needing multiple groups. Implement comprehensive monitoring for all groups, including
leader stability, replication lag, and resource usage.

Multi-Raft groups are a powerful scaling tool, but they should be used judiciously. The added
complexity is only worthwhile when you have clear scalability requirements that cannot be met
with a single Raft cluster.

---

<!---
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Introduction to Apache Ratis

Previous: [Overview of Raft and Ratis](index.md) | Top: [Overview of Raft and Ratis](index.md)

## Section 2: Core Concepts

* [The Raft Log](#the-raft-log---foundation-of-consensus)
* [The State Machine](#the-state-machine---your-applications-heart)
* [Consistency Models and Read Patterns](#consistency-models-and-read-patterns)
### The Raft Log - Foundation of Consensus

The Raft log is the central data structure that makes distributed consensus possible. Each server
in a Raft group maintains its own copy of this append-only ledger, which records operations in
the exact order they should be applied to the state machine.

Each entry in the log contains three key pieces of information: the operation itself (what should
be done), a log index (a sequential number indicating the entry's position), and a term number
(the period during which a leader created this entry). Terms represent periods of leadership and
increase each time a new leader is elected, preventing old leaders from overwriting newer entries.
The combination of the term and log index is referred to as a term-index (`TermIndex`) and
establishes the ordering of entries in the log.
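That ordering can be sketched as a small comparable value type. This is a stdlib-only illustration of the idea, not the actual Ratis `TermIndex` class: entries order by term first, then by index within a term:

```java
// Stdlib-only sketch of a term-index pair: ordering is by term, then index.
class TermIndexDemo {
    record TermIndex(long term, long index) implements Comparable<TermIndex> {
        @Override
        public int compareTo(TermIndex that) {
            int byTerm = Long.compare(this.term, that.term);
            return byTerm != 0 ? byTerm : Long.compare(this.index, that.index);
        }
    }

    public static void main(String[] args) {
        TermIndex a = new TermIndex(2, 7);  // written by the term-2 leader
        TermIndex b = new TermIndex(3, 5);  // written under a later leader
        // true: a later term orders after, regardless of the index values
        System.out.println(a.compareTo(b) < 0);
    }
}
```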
The log serves as both the mechanism for replication (leaders send log entries to followers) and
the source of truth for recovery (servers can rebuild their state by replaying the log). When we
talk about "committing" an operation, we mean that a majority of servers have acknowledged
storing that log entry, making it safe to apply to the state machine.
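The majority rule is straight Raft arithmetic (the helper below is an illustration, not a Ratis API): an entry is committed once a strict majority of the group's peers have stored it.

```java
// An entry is committed once acknowledged by a strict majority of peers.
class MajorityDemo {
    static int majority(int peerCount) {
        return peerCount / 2 + 1;           // 3 peers -> 2 acks, 5 peers -> 3 acks
    }

    static boolean isCommitted(int acks, int peerCount) {
        return acks >= majority(peerCount);
    }

    public static void main(String[] args) {
        // In a 3-peer group, the leader's own ack plus one follower commits.
        System.out.println(isCommitted(2, 3));  // true
        System.out.println(isCommitted(2, 5));  // false: 5 peers need 3 acks
    }
}
```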
### The State Machine - Your Application's Heart

In Ratis, the state machine is your application's primary integration point. Your business logic
or data storage operations are implemented by the state machine.

The state machine is a deterministic computation engine that processes a sequence of operations
and maintains some internal state. The state machine must be deterministic: given the same
sequence of operations, it must always produce the same results and end up in the same final state.
Operations are processed sequentially, one at a time, in the order they appear in the Raft log.
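As an illustration of determinism (a stdlib-only sketch, not Ratis's `StateMachine` interface), consider a key-value state machine that applies operations in log order: replaying the same log on any peer yields the same state:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A deterministic key-value state machine: its state is a function of the
// sequence of applied operations only.
class KvStateMachine {
    private final Map<String, String> state = new TreeMap<>();

    // Each operation is "PUT key value" or "DEL key", applied in log order.
    void apply(String op) {
        String[] parts = op.split(" ", 3);
        switch (parts[0]) {
            case "PUT" -> state.put(parts[1], parts[2]);
            case "DEL" -> state.remove(parts[1]);
            default -> throw new IllegalArgumentException("unknown op: " + op);
        }
    }

    Map<String, String> snapshot() {
        return Map.copyOf(state);
    }

    public static void main(String[] args) {
        List<String> log = List.of("PUT a 1", "PUT b 2", "DEL a", "PUT b 3");
        KvStateMachine peer1 = new KvStateMachine();
        KvStateMachine peer2 = new KvStateMachine();
        log.forEach(peer1::apply);   // replaying the same log on two peers...
        log.forEach(peer2::apply);
        // ...produces identical state on both
        System.out.println(peer1.snapshot().equals(peer2.snapshot()));  // true
    }
}
```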
#### State Machine Responsibilities

Your state machine has three primary responsibilities. First, it processes Raft transactions by
validating incoming requests before they're replicated and applying committed operations to your
application state. Second, it maintains your application's actual data, which might be an
in-memory data structure, a local database, files on disk, or any combination of these. Third,
it creates point-in-time representations of its state (snapshots) and can restore its state from
snapshots during recovery.

#### The State Machine Lifecycle

The state machine operates at two different lifecycle levels: an overall peer lifecycle and a
per-transaction processing lifecycle.

##### Peer Lifecycle

During initialization, when a peer starts up, the state machine loads any existing snapshots and
prepares its internal data structures. The Raft layer then replays any log entries that occurred
after the snapshot, bringing the peer up to the current state of the group.

During normal operation, the state machine continuously processes transactions as they're
committed by the Raft group, handles read-only queries, and may respond to changes in the node's
status as a leader or follower. For read-only operations, the state machine can answer queries
directly without going through the Raft log, providing better performance for reads but with
[consistency trade-offs](#consistency-models-and-read-patterns).

Periodically, the state machine creates snapshots of its current state. This happens either
automatically based on configuration (like log size thresholds) or manually through
administrative commands.
##### Transaction Processing Lifecycle

For each individual transaction, the state machine follows a multistep processing sequence. In
the validation phase, the leader's state machine examines incoming requests through the
`startTransaction` method. This is where you validate that the operation is properly structured
and valid in the current context.

In the pre-append phase, just before the operation is written to the log, the state machine can
perform any final preparations through the `preAppendTransaction` method. After the operation is
committed by the Raft group, the state machine is notified via `applyTransactionSerial` and can
handle any order-sensitive logic that must happen before the main application logic is invoked.

Finally, in the application phase, the operation is applied to the actual application state
through the `applyTransaction` method. This is where your business logic executes and where the
operation's effects become visible to future queries.
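The sequence above can be sketched as a pipeline. The method names mirror the Ratis hooks, but the class below is a stdlib-only toy (a counter), not an actual Ratis state machine; replication is elided to a comment:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the per-transaction hooks, in the order they fire for a
// committed write: startTransaction -> preAppendTransaction
// -> applyTransactionSerial -> applyTransaction.
class TxnLifecycleDemo {
    final List<String> phases = new ArrayList<>();
    long counter = 0;

    boolean startTransaction(long delta) {        // validation on the leader
        phases.add("startTransaction");
        return delta > 0;                         // reject non-positive deltas
    }

    void preAppendTransaction()   { phases.add("preAppendTransaction"); }

    void applyTransactionSerial() { phases.add("applyTransactionSerial"); }

    void applyTransaction(long delta) {           // business logic runs here
        phases.add("applyTransaction");
        counter += delta;
    }

    void submit(long delta) {
        if (!startTransaction(delta)) return;     // invalid: never replicated
        preAppendTransaction();
        // ... entry replicated and committed by the Raft group here ...
        applyTransactionSerial();
        applyTransaction(delta);
    }

    public static void main(String[] args) {
        TxnLifecycleDemo sm = new TxnLifecycleDemo();
        sm.submit(5);
        System.out.println(sm.phases);
        System.out.println("counter = " + sm.counter);
    }
}
```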
#### Designing Your State Machine

When designing your state machine, ensure your operations are deterministic and can be efficiently
serialized for replication. Operations are not required to be idempotent because the Raft protocol
ensures that each operation is applied exactly once on each peer; however, idempotent operations
may make it easier to reason about your application.

Plan how you'll represent your application's state for both runtime efficiency and snapshot
serialization. If your state machine maintains state in external systems (databases, files),
ensure your snapshot process captures this external state consistently.

Robust error handling is crucial. Server-side errors require distinguishing between recoverable
errors (like validation failures) and fatal errors (like storage failures). Errors in
`startTransaction` prevent operations from being committed and replicated. Errors in
`applyTransaction` are considered fatal since they indicate the state machine cannot process
already-committed operations.
### Consistency Models and Read Patterns

In a distributed system, consistency refers to the guarantees you have about seeing the effects
of write operations when you read data. For write operations, Raft and Ratis provide strong
consistency: once a write operation is acknowledged as committed, all subsequent reads will see
the effects of that write. Read operations are more complex because Ratis offers several
different approaches with different consistency and performance characteristics.

#### Write Consistency

Write operations in Ratis follow a straightforward path that provides strong consistency. Clients
send write requests to the leader, which validates the operation through the state machine's
`startTransaction` method, then replicates it to a majority of followers. Once a majority
acknowledges, the operation is committed. The leader applies the operation to its state machine
and returns the result to the client, while followers eventually apply the same operation in the
same order.
#### Read Consistency Options

Ratis provides several read patterns with different consistency and performance characteristics.

Read requests query the state machine of a server directly without going through the Raft consensus
protocol. The `sendReadOnly()` API sends a read request to the leader. If a non-leader server
receives such a request, it throws a `NotLeaderException`, and the client will then retry other
servers. In contrast, the `sendReadOnly(message, serverId)` API sends the request to a particular
server, which may be a leader or a follower.

The server's `raft.server.read.option` configuration affects read consistency behavior:

* **DEFAULT (default setting)**: `sendReadOnly()` performs leader reads for efficiency. It provides
strong consistency under normal conditions. However, if an old leader has been
partitioned from the majority and a new leader has been elected, reading from the old leader can
return stale data, since the old leader does not have the new transactions committed by the new
leader (referred to as the "split-brain problem").
* **LINEARIZABLE**: both `sendReadOnly()` and `sendReadOnly(message, serverId)` use the ReadIndex
protocol to provide linearizable consistency, ensuring you always read the most up-to-date committed
data and won't read stale data as described in the split-brain problem above.
* **Non-linearizable API**: clients may use `sendReadOnlyNonLinearizable()` to read from the leader's
state machine directly, without a linearizable guarantee.

Server-side configuration allows operators to choose between performance (leader reads) and strong
consistency guarantees (linearizable reads) for their entire cluster.
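The idea behind the ReadIndex protocol can be sketched in a few lines (a stdlib-only simplification, not Ratis's implementation): the server notes the commit index at the moment the read arrives and serves the read only after the state machine has applied at least up to that index.

```java
// Simplified ReadIndex sketch: a read is held until the state machine has
// applied everything that was committed when the read arrived.
class ReadIndexDemo {
    long commitIndex;    // highest log index known to be committed
    long appliedIndex;   // highest log index applied to the state machine

    // The index this read must wait for. (A real leader also confirms it is
    // still the leader before answering, to rule out split-brain reads.)
    long acquireReadIndex() {
        return commitIndex;
    }

    boolean canServe(long readIndex) {
        return appliedIndex >= readIndex;
    }

    public static void main(String[] args) {
        ReadIndexDemo server = new ReadIndexDemo();
        server.commitIndex = 10;
        server.appliedIndex = 8;
        long ri = server.acquireReadIndex();
        System.out.println(server.canServe(ri));  // false: entries 9, 10 pending
        server.appliedIndex = 10;
        System.out.println(server.canServe(ri));  // true: read is now linearizable
    }
}
```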
Stale reads with a minimum index let you specify a minimum log index that the peer must have
applied before serving the read. Call `sendStaleRead()`: if the peer hasn't caught up to your
minimum index, it will throw a `StaleReadException`.
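That bound can be sketched as follows (a stdlib-only illustration; the exception name is the only Ratis name reused here, and the lookup is a placeholder):

```java
// Sketch of a stale read with a minimum-index bound: the peer serves the
// read only if it has applied at least up to the requested index.
class StaleReadDemo {
    static class StaleReadException extends RuntimeException {
        StaleReadException(String msg) { super(msg); }
    }

    long appliedIndex;

    String staleRead(long minIndex, String key) {
        if (appliedIndex < minIndex) {
            throw new StaleReadException(
                "applied " + appliedIndex + " < required " + minIndex);
        }
        return "value-of-" + key;   // placeholder for a state machine lookup
    }

    public static void main(String[] args) {
        StaleReadDemo peer = new StaleReadDemo();
        peer.appliedIndex = 7;
        System.out.println(peer.staleRead(5, "a"));   // served: 7 >= 5
        try {
            peer.staleRead(9, "a");                   // peer not caught up yet
        } catch (StaleReadException e) {
            System.out.println("StaleReadException: " + e.getMessage());
        }
    }
}
```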
In summary:

* **Leader reads** query the current leader's state machine directly without going through the Raft
consensus protocol. Call `sendReadOnly()` for the strongest consistency supported by the server.
* Use `sendReadOnlyNonLinearizable()` for leader reads without a linearizable guarantee.
* Use `sendReadOnly(message, serverId)` with a specific follower's server ID for **follower reads**,
which offer better performance but may return stale data.
* Use `sendStaleRead()` to specify the minimum log index that the server must have applied.
* Use `sendReadAfterWrite()` to ensure the read reflects the latest successful write by the
same client, for **read-after-write consistency**.

Note that all of these operations may be performed as blocking or async operations. See
[Client API Patterns](integration.md#client-api-patterns) for more information.
#### The Query Method and Read-Only Operations

The state machine's `query` method enables you to handle read-only operations without going
through the Raft protocol. This provides significant performance benefits but requires careful
consideration of consistency requirements. Your state machine's `query` method will be called
for explicit read-only requests from clients, queries that need to read state without modifying
it, and health checks or monitoring queries.

#### Choosing the Right Read Pattern

Use **linearizable reads** when correctness is more important than performance, you need to read
your own writes immediately, or the application cannot tolerate any stale data. Use **leader
reads** when you need strong consistency but can tolerate very brief staleness during network
partitions, or when building interactive applications where users expect to see their recent
changes.

Use **follower reads** when you can tolerate stale data in exchange for better performance and
availability, you're implementing read replicas for scaling read-heavy workloads, or the data
being read doesn't change frequently. Use **stale reads** when you need fine-grained control
over the consistency/performance trade-off.

---
Next: [Integration](integration.md)
