|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one or more |
| 3 | + contributor license agreements. See the NOTICE file distributed with |
| 4 | + this work for additional information regarding copyright ownership. |
| 5 | + The ASF licenses this file to You under the Apache License, Version 2.0 |
| 6 | + (the "License"); you may not use this file except in compliance with |
| 7 | + the License. You may obtain a copy of the License at |
| 8 | +
|
| 9 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | + Unless required by applicable law or agreed to in writing, software |
| 12 | + distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | + See the License for the specific language governing permissions and |
| 15 | + limitations under the License. |
| 16 | +--> |
| 17 | +# Introduction to Apache Ratis |
| 18 | + |
| 19 | +Previous: [Overview of Raft and Ratis](index.md) | Top: [Overview of Raft and Ratis](index.md) |
| 20 | + |
| 21 | +## Section 2: Core Concepts |
| 22 | + |
| 23 | +* [The Raft Log](#the-raft-log---foundation-of-consensus) |
| 24 | +* [The State Machine](#the-state-machine---your-applications-heart) |
| 25 | +* [Consistency Models and Read Patterns](#consistency-models-and-read-patterns) |
| 26 | + |
| 27 | +### The Raft Log - Foundation of Consensus |
| 28 | + |
| 29 | +The Raft log is the central data structure that makes distributed consensus possible. Each server |
| 30 | +in a Raft group maintains its own copy of this append-only ledger, which records every operation |
| 31 | +in the exact order in which it should be applied to the state machine. |
| 32 | + |
| 33 | +Each entry in the log contains three key pieces of information: the operation itself (what should |
| 34 | +be done), a log index (a sequential number indicating the entry's position), and a term number |
| 35 | +(the period during which a leader created this entry). Terms represent periods of leadership and |
| 36 | +increase each time a new leader is elected, preventing old leaders from overwriting newer entries. |
| 37 | +The combination of the term and log index is referred to as a term-index (`TermIndex`) and |
| 38 | +establishes the ordering of entries in the log. |
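The ordering described above can be sketched in a few lines of Java. This is a minimal illustration, not the Ratis `TermIndex` interface: the class name `SketchTermIndex` is hypothetical, and the sketch assumes entries are compared by term first and by log index second.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a term-index pair; not the Ratis TermIndex interface.
final class SketchTermIndex implements Comparable<SketchTermIndex> {
    final long term;
    final long index;

    SketchTermIndex(long term, long index) {
        this.term = term;
        this.index = index;
    }

    // Order by term first, then by log index within the same term.
    @Override
    public int compareTo(SketchTermIndex other) {
        int byTerm = Long.compare(term, other.term);
        return byTerm != 0 ? byTerm : Long.compare(index, other.index);
    }

    @Override
    public String toString() {
        return "(t:" + term + ", i:" + index + ")";
    }

    public static void main(String[] args) {
        List<SketchTermIndex> entries = new ArrayList<>();
        entries.add(new SketchTermIndex(2, 4));
        entries.add(new SketchTermIndex(1, 3));
        entries.add(new SketchTermIndex(2, 5));
        Collections.sort(entries);
        // Entries from the older term sort before entries from the newer term.
        System.out.println(entries);
    }
}
```

Comparing the term before the index is what lets an entry written under a newer leader outrank any entry from an older term, regardless of position.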
| 39 | + |
| 40 | +The log serves as both the mechanism for replication (leaders send log entries to followers) and |
| 41 | +the source of truth for recovery (servers can rebuild their state by replaying the log). When we |
| 42 | +talk about "committing" an operation, we mean that a majority of servers have acknowledged |
| 43 | +storing that log entry, making it safe to apply to the state machine. |
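To make the majority rule concrete, here is a small sketch of how a leader could derive the highest index stored on a majority of servers. The class `SketchCommitIndex` and its method are hypothetical and are not taken from the Ratis code base. The sketch also leaves out a rule from the Raft protocol: a leader only commits this way for entries created in its own current term.

```java
import java.util.Arrays;

final class SketchCommitIndex {
    // matchIndexes[i] is the highest log index known to be stored on server i
    // (the leader's own last index is included in the array).
    static long majorityIndex(long[] matchIndexes) {
        long[] sorted = matchIndexes.clone();
        Arrays.sort(sorted);                    // ascending order
        int majority = sorted.length / 2 + 1;   // for example 2 of 3, or 3 of 5
        // Every server at or after this position stores at least this index,
        // so the returned index is present on `majority` servers.
        return sorted[sorted.length - majority];
    }

    public static void main(String[] args) {
        // Three servers have stored up to index 7, 5 and 3 respectively.
        System.out.println(majorityIndex(new long[] {7, 5, 3}));
    }
}
```

With three servers at indexes 7, 5 and 3, two servers hold index 5, so entries up to index 5 are safe to apply while entries 6 and 7 are not yet committed.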
| 44 | + |
| 45 | +### The State Machine - Your Application's Heart |
| 46 | + |
| 47 | +In Ratis, the state machine is your application's primary integration point. Your business logic |
| 48 | +or data storage operations are implemented by the state machine. |
| 49 | + |
| 50 | +The state machine is a deterministic computation engine that processes a sequence of operations |
| 51 | +and maintains some internal state. The state machine must be deterministic: given the same |
| 52 | +sequence of operations, it must always produce the same results and end up in the same final state. |
| 53 | +Operations are processed sequentially, one at a time, in the order they appear in the Raft log. |
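The determinism requirement can be illustrated with a toy key-value state machine. Everything in this sketch is an assumption made for illustration: the class `SketchKvStateMachine`, the `PUT`/`DEL` text format, and the `replay` helper are hypothetical and do not use the Ratis `StateMachine` API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SketchKvStateMachine {
    private final Map<String, String> state = new TreeMap<>();
    private long lastApplied = 0;

    // Apply one operation. The outcome depends only on the operation and the
    // prior state: no clocks, random numbers, or other external inputs.
    void apply(String op) {
        String[] parts = op.split(" ", 3);
        if ("PUT".equals(parts[0])) {
            state.put(parts[1], parts[2]);
        } else if ("DEL".equals(parts[0])) {
            state.remove(parts[1]);
        } else {
            throw new IllegalArgumentException("unknown operation: " + op);
        }
        lastApplied++;
    }

    Map<String, String> copyOfState() {
        return new TreeMap<>(state);
    }

    long lastApplied() {
        return lastApplied;
    }

    // Build a fresh replica by applying the log sequentially, in log order.
    static SketchKvStateMachine replay(List<String> log) {
        SketchKvStateMachine sm = new SketchKvStateMachine();
        for (String op : log) {
            sm.apply(op);
        }
        return sm;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("PUT a 1", "PUT b 2", "DEL a", "PUT b 3");
        // Two replicas that apply the same log end in the same state.
        System.out.println(replay(log).copyOfState().equals(replay(log).copyOfState()));
    }
}
```

If `apply` read the wall clock or iterated over an unordered collection to pick a value, two replicas could diverge even though they processed identical logs.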
| 54 | + |
| 55 | +#### State Machine Responsibilities |
| 56 | + |
| 57 | +Your state machine has three primary responsibilities. First, it processes Raft transactions by |
| 58 | +validating incoming requests before they're replicated and applying committed operations to your |
| 59 | +application state. Second, it maintains your application's actual data, which might be an |
| 60 | +in-memory data structure, a local database, files on disk, or any combination of these. Third, |
| 61 | +it creates point-in-time representations of its state (snapshots) and can restore its state from |
| 62 | +snapshots during recovery. |
| 63 | + |
| 64 | +#### The State Machine Lifecycle |
| 65 | + |
| 66 | +The state machine operates at two different lifecycle levels: an overall peer lifecycle and a |
| 67 | +per-transaction processing lifecycle. |
| 68 | + |
| 69 | +##### Peer Lifecycle |
| 70 | + |
| 71 | +During initialization, when a peer starts up, the state machine loads any existing snapshots and |
| 72 | +prepares its internal data structures. The Raft layer then replays any log entries that occurred |
| 73 | +after the snapshot, bringing the peer up to the current state of the group. |
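The startup sequence above (load a snapshot, then replay later entries) can be sketched with a simple counter. The names `SketchRecovery`, `Counter`, and `recover` are hypothetical, and the sketch assumes log indexes start at 1; it does not reflect how Ratis stores snapshots or log segments.

```java
import java.util.Arrays;
import java.util.List;

final class SketchRecovery {
    // A counter whose state is a running total plus the index of the last applied entry.
    static final class Counter {
        long total;
        long lastAppliedIndex;
    }

    // Restore from a snapshot taken at snapshotIndex, then replay only later entries.
    // log.get(i) holds the increment stored at log index i + 1.
    static Counter recover(long snapshotTotal, long snapshotIndex, List<Long> log) {
        Counter c = new Counter();
        c.total = snapshotTotal;
        c.lastAppliedIndex = snapshotIndex;
        for (long index = snapshotIndex + 1; index <= log.size(); index++) {
            c.total += log.get((int) (index - 1));
            c.lastAppliedIndex = index;
        }
        return c;
    }

    public static void main(String[] args) {
        List<Long> log = Arrays.asList(1L, 2L, 3L, 4L, 5L);
        Counter fromScratch = recover(0, 0, log);   // no snapshot: replay everything
        Counter fromSnapshot = recover(6, 3, log);  // snapshot covers entries 1..3
        System.out.println(fromScratch.total == fromSnapshot.total);
    }
}
```

Both recovery paths reach the same total, which is why a snapshot can stand in for the log prefix it covers.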
| 74 | + |
| 75 | +During normal operation, the state machine continuously processes transactions as they're |
| 76 | +committed by the Raft group, handles read-only queries, and may respond to changes in the node's |
| 77 | +status as a leader or follower. For read-only operations, the state machine can answer queries |
| 78 | +directly without going through the Raft log, providing better performance for reads but with |
| 79 | +[consistency trade-offs](#consistency-models-and-read-patterns). |
| 80 | + |
| 81 | +Periodically, the state machine creates snapshots of its current state. This happens either |
| 82 | +automatically based on configuration (like log size thresholds) or manually through |
| 83 | +administrative commands. |
| 84 | + |
| 85 | +##### Transaction Processing Lifecycle |
| 86 | + |
| 87 | +For each individual transaction, the state machine follows a multistep processing sequence. In |
| 88 | +the validation phase, the leader's state machine examines incoming requests through the |
| 89 | +`startTransaction` method. This is where you validate that the operation is properly structured |
| 90 | +and valid in the current context. |
| 91 | + |
| 92 | +In the pre-append phase, just before the operation is written to the log, the state machine can |
| 93 | +perform any final preparations through the `preAppendTransaction` method. After the operation is |
| 94 | +committed by the Raft group, the state machine is notified via `applyTransactionSerial` and can |
| 95 | +handle any order-sensitive logic that must happen before the main application logic is invoked. |
| 96 | + |
| 97 | +Finally, in the application phase, the operation is applied to the actual application state |
| 98 | +through the `applyTransaction` method. This is where your business logic executes and where the |
| 99 | +operation's effects become visible to future queries. |
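The phases above can be sketched as a sequence of calls. The method names follow the text, but the signatures here are invented for illustration and do not match the Ratis `StateMachine` interface; the class `SketchLifecycle`, the `process` driver, and the account-balance example are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in: method names follow the text, signatures are invented.
final class SketchLifecycle {
    final List<String> calls = new ArrayList<>();
    long balance = 100;

    // Validation phase (leader only): reject malformed or currently invalid requests.
    void startTransaction(long withdrawal) {
        calls.add("startTransaction");
        if (withdrawal <= 0 || withdrawal > balance) {
            throw new IllegalArgumentException("invalid withdrawal: " + withdrawal);
        }
    }

    // Pre-append phase: final preparation just before the entry is written to the log.
    void preAppendTransaction(long withdrawal) {
        calls.add("preAppendTransaction");
    }

    // Post-commit hook for order-sensitive logic that must precede the main apply step.
    void applyTransactionSerial(long withdrawal) {
        calls.add("applyTransactionSerial");
    }

    // Application phase: the committed operation changes the application state.
    long applyTransaction(long withdrawal) {
        calls.add("applyTransaction");
        balance -= withdrawal;
        return balance;
    }

    // Drives one request through the phases in the order described above.
    long process(long withdrawal) {
        startTransaction(withdrawal);
        preAppendTransaction(withdrawal);
        // In a real group the entry would be appended, replicated, and committed here.
        applyTransactionSerial(withdrawal);
        return applyTransaction(withdrawal);
    }

    public static void main(String[] args) {
        SketchLifecycle sm = new SketchLifecycle();
        System.out.println(sm.process(30) + " " + sm.calls);
    }
}
```

A request rejected in the validation step never reaches the later phases, so it leaves both the log and the application state untouched.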
| 100 | + |
| 101 | +#### Designing Your State Machine |
| 102 | + |
| 103 | +When designing your state machine, ensure your operations are deterministic and can be efficiently |
| 104 | +serialized for replication. Operations are not required to be idempotent because the Raft protocol |
| 105 | +ensures that each operation is applied exactly once on each peer; however, idempotent operations may |
| 106 | +make it easier to reason about your application. |
| 107 | + |
| 108 | +Plan how you'll represent your application's state for both runtime efficiency and snapshot |
| 109 | +serialization. If your state machine maintains state in external systems (databases, files), |
| 110 | +ensure your snapshot process captures this external state consistently. |
| 111 | + |
| 112 | +Robust error handling is crucial. Server-side errors require distinguishing between recoverable |
| 113 | +errors (like validation failures) and fatal errors (like storage failures). Errors in |
| 114 | +`startTransaction` prevent operations from being committed and replicated. Errors in |
| 115 | +`applyTransaction` are considered fatal since they indicate the state machine cannot process |
| 116 | +already-committed operations. |
| 117 | + |
| 118 | +### Consistency Models and Read Patterns |
| 119 | + |
| 120 | +In a distributed system, consistency refers to the guarantees you have about seeing the effects |
| 121 | +of write operations when you read data. For write operations, Raft and Ratis provide strong |
| 122 | +consistency: once a write operation is acknowledged as committed, all subsequent linearizable reads will see |
| 123 | +the effects of that write. Read operations are more complex because Ratis offers several |
| 124 | +different approaches with different consistency and performance characteristics. |
| 125 | + |
| 126 | +#### Write Consistency |
| 127 | + |
| 128 | +Write operations in Ratis follow a straightforward path that provides strong consistency. Clients |
| 129 | +send write requests to the leader, which validates the operation through the state machine's |
| 130 | +`startTransaction` method, then replicates it to a majority of followers. Once a majority |
| 131 | +acknowledges, the operation is committed. The leader applies the operation to its state machine |
| 132 | +and returns the result to the client, while followers eventually apply the same operation in the |
| 133 | +same order. |
| 134 | + |
| 135 | +#### Read Consistency Options |
| 136 | + |
| 137 | +Ratis provides several read patterns with different consistency and performance characteristics. |
| 138 | + |
| 139 | +Read requests query the state machine of a server directly without going through the Raft consensus |
| 140 | +protocol. The `sendReadOnly()` API sends a read request to the leader. If a non-leader server |
| 141 | +receives such a request, it throws a `NotLeaderException`, and the client then retries with other |
| 142 | +servers. In contrast, the `sendReadOnly(message, serverId)` API sends the request to a particular |
| 143 | +server, which may be a leader or a follower. |
| 144 | + |
| 145 | +The server's `raft.server.read.option` configuration affects read consistency behavior: |
| 146 | + |
| 147 | +* **DEFAULT (default setting)**: `sendReadOnly()` performs leader reads for efficiency. It provides |
| 148 | +strong consistency under normal conditions. However, if an old leader has been |
| 149 | +partitioned from the majority and a new leader has been elected, reading from the old leader can |
| 150 | +return stale data since the old leader does not have the new transactions committed by the new |
| 151 | +leader (referred to as the "split-brain problem"). |
| 152 | +* **LINEARIZABLE**: both `sendReadOnly()` and `sendReadOnly(message, serverId)` use the ReadIndex |
| 153 | +protocol to provide linearizable consistency, ensuring you always read the most up-to-date committed |
| 154 | +data and never read stale data from a partitioned old leader (the "split-brain problem" above). |
| 155 | + * Non-linearizable API: Clients may use `sendReadOnlyNonLinearizable()` to read from the leader's |
| 156 | + state machine directly without a linearizable guarantee. |
| 157 | + |
| 158 | +Server-side configuration allows operators to choose between performance (leader reads) and strong |
| 159 | +consistency guarantees (linearizable reads) for their entire cluster. |
| 160 | + |
| 161 | +Stale reads with minimum index let you specify a minimum log index that the peer must have |
| 162 | +applied before serving the read. Call `sendStaleRead()`: if the peer hasn't caught up to your |
| 163 | +minimum index, it will throw a `StaleReadException`. |
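The minimum-index check behind a stale read can be sketched as follows. This models only the server-side decision: the class `SketchStaleRead` and its nested `SketchStaleReadException` are hypothetical stand-ins and do not use the Ratis client API or the Ratis `StaleReadException` type.

```java
final class SketchStaleRead {
    // Hypothetical exception for this sketch; Ratis has its own StaleReadException type.
    static final class SketchStaleReadException extends RuntimeException {
        SketchStaleReadException(String message) {
            super(message);
        }
    }

    private final long lastAppliedIndex;
    private final String value;

    SketchStaleRead(long lastAppliedIndex, String value) {
        this.lastAppliedIndex = lastAppliedIndex;
        this.value = value;
    }

    // Serve the read only if this peer has applied at least minIndex.
    String read(long minIndex) {
        if (lastAppliedIndex < minIndex) {
            throw new SketchStaleReadException(
                "applied index " + lastAppliedIndex + " is behind requested minimum " + minIndex);
        }
        return value;
    }

    public static void main(String[] args) {
        SketchStaleRead peer = new SketchStaleRead(10, "v10");
        System.out.println(peer.read(8));   // the peer is far enough along
        try {
            peer.read(11);                  // the peer has not applied index 11 yet
        } catch (SketchStaleReadException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A client would typically pass the log index returned by one of its earlier writes as the minimum, so that the read is never older than that write.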
| 164 | + |
| 165 | +In summary: |
| 166 | +* **Leader reads** query the current leader's state machine directly without going through the Raft |
| 167 | +consensus protocol. Call `sendReadOnly()` for the strongest consistency supported by the server. |
| 168 | +* Use `sendReadOnlyNonLinearizable()` for leader reads without a linearizable guarantee. |
| 169 | +* Use `sendReadOnly(message, serverId)` with a specific follower's server ID for **follower reads**, |
| 170 | +which offer better performance but may return stale data. |
| 171 | +* Use `sendStaleRead()` to specify the minimum log index that the server must have applied. |
| 172 | +* Use `sendReadAfterWrite()` to ensure the read reflects the latest successful write by the |
| 173 | +same client, for **read-after-write consistency**. |
| 174 | + |
| 175 | +Note that all of these operations may be performed as blocking or async operations. See |
| 176 | +[Client API Patterns](integration.md#client-api-patterns) for more information. |
| 177 | + |
| 178 | +#### The Query Method and Read-Only Operations |
| 179 | + |
| 180 | +The state machine's `query` method enables you to handle read-only operations without going |
| 181 | +through the Raft protocol. This provides significant performance benefits but requires careful |
| 182 | +consideration of consistency requirements. Your state machine's `query` method will be called |
| 183 | +for explicit read-only requests from clients, queries that need to read state without modifying |
| 184 | +it, and health checks or monitoring queries. |
| 185 | + |
| 186 | +#### Choosing the Right Read Pattern |
| 187 | + |
| 188 | +Use **linearizable reads** when correctness is more important than performance, you need to read |
| 189 | +your own writes immediately, or the application cannot tolerate any stale data. Use **leader |
| 190 | +reads** when you need strong consistency but can tolerate very brief staleness during network |
| 191 | +partitions, or when building interactive applications where users expect to see their recent |
| 192 | +changes. |
| 193 | + |
| 194 | +Use **follower reads** when you can tolerate stale data in exchange for better performance and |
| 195 | +availability, you're implementing read replicas for scaling read-heavy workloads, or the data |
| 196 | +being read doesn't change frequently. Use **stale reads** when you need fine-grained control |
| 197 | +over the consistency/performance trade-off. |
| 198 | + |
| 199 | +--- |
| 200 | +Next: [Integration](integration.md) |