
Commit c5bcdff

[docs] Reorganize Doris Operator Kubernetes documentation (#3948)
## Summary

This PR reorganizes and expands the Doris Operator documentation under the Kubernetes deployment section.

Changes include:

- Adds dedicated Doris Operator concept pages for:
  - architecture
  - resource model
  - lifecycle management
  - status and troubleshooting
- Updates the Doris Operator overview page to merge the previous introductory context with the newer conceptual structure.
- Adds clickable landing pages for:
  - Doris Operator Concepts and Capabilities
  - Preparation
- Moves cloud-specific preparation guidance into a separate Preparation category with Alibaba Cloud and AWS EKS pages.
- Updates current and 4.x docs in both English and Chinese so the active documentation sets remain aligned.
- Updates sidebar navigation and Chinese sidebar translations to reduce repeated Doris Operator prefixes in child items.
- Updates docs governance sync detection so current English and Chinese docs are included in the strong synchronization checks.

## Validation

- `yarn docs:links:changed`
- `yarn docs-governance:test`
- Started the local Chinese dev server and verified the new Doris Operator landing pages are reachable.

## Notes

The page URLs for the main Doris Operator content are preserved. This PR mainly changes navigation grouping, landing pages, and documentation coverage.
1 parent 4c91427 commit c5bcdff

48 files changed

Lines changed: 2913 additions & 375 deletions

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
---
2+
{
3+
"title": "Doris Operator Architecture",
4+
"sidebar_label": "Architecture",
5+
"language": "en",
6+
"description": "Describes the Doris Operator control plane, data plane, Reconcile model, and control paths for compute-storage integrated and compute-storage decoupled clusters.",
7+
"keywords": ["Doris Operator architecture", "Kubernetes Operator", "Reconcile", "DorisCluster", "DorisDisaggregatedCluster"]
8+
}
9+
---
10+
11+
Doris Operator follows the standard Kubernetes Operator pattern. Users declare the desired state through CRDs, and the Operator continuously observes actual state and reconciles the difference by creating, updating, or deleting Kubernetes resources.
12+
13+
This document focuses on the overall architecture. For field definitions and deployment procedures, see the corresponding installation and configuration documents.
14+
15+
## Architecture layers
16+
17+
Doris Operator can be viewed as two layers: the control plane and the data plane.
18+
19+
![Doris Operator architecture layers](/images/doris-operator/mermaid/02-architecture-layers.jpg)
20+
21+
The control plane reads Doris custom resources and executes the Reconcile flow. The data plane is made of native Kubernetes resources that actually run Doris components.
22+
23+
## Reconcile model
24+
25+
The Reconcile loop is the core of Doris Operator. Whenever a Doris custom resource is created or updated, or when related resources change status, the Operator recalculates the difference between desired state and actual state, then performs the necessary operations.
26+
27+
A typical flow is:
28+
29+
1. Read the Doris custom resource.
30+
2. Check whether the resource is being deleted.
31+
3. Parse component, storage, authentication, and access configuration.
32+
4. Create or update the corresponding `StatefulSet`, `Service`, PVC, and related resources.
33+
5. Aggregate component state and write it into CR `status`.
34+
6. If the cluster is not ready, wait for the next Reconcile loop.
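
The six steps above can be sketched as a single function. This is a minimal illustration in Python, not the Operator's actual implementation: `FakeCluster`, the `deleting` flag, and the shape of the `spec` dictionary are all hypothetical stand-ins for the Kubernetes API and the real CRD schema.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical, illustration only)."""
    resources: dict = field(default_factory=dict)  # component name -> desired replicas
    ready: dict = field(default_factory=dict)      # component name -> ready replicas


def reconcile(cr, cluster):
    """Run one Reconcile pass and return the status to write back to the CR."""
    # Step 2: a CR that is being deleted goes to cleanup instead of create/update.
    if cr.get("deleting"):
        cluster.resources.clear()
        return {"phase": "Deleting", "requeue": False}

    # Step 3: parse the component configuration from the spec.
    desired = {name: spec["replicas"] for name, spec in cr["spec"].items()}

    # Step 4: create or update one workload per component.
    for name, replicas in desired.items():
        cluster.resources[name] = replicas

    # Step 5: aggregate component state.
    components = {
        name: {"ready": cluster.ready.get(name, 0), "desired": replicas}
        for name, replicas in desired.items()
    }
    all_ready = all(c["ready"] >= c["desired"] for c in components.values())

    # Step 6: request another pass if the cluster has not converged yet.
    return {
        "phase": "Ready" if all_ready else "Reconciling",
        "components": components,
        "requeue": not all_ready,
    }
```

Because each pass recomputes everything from the desired state, calling `reconcile` repeatedly with the same input is harmless, which is the property that makes "wait for the next Reconcile loop" a safe retry strategy.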
35+
36+
![Doris Operator reconcile model](/images/doris-operator/mermaid/03-architecture-reconcile-model.jpg)
37+
38+
## Control path for compute-storage integrated clusters
39+
40+
`DorisCluster` is used for compute-storage integrated deployments. A single resource can include FE, BE, CN, and Broker configuration.
41+
42+
![Control path for compute-storage integrated clusters](/images/doris-operator/mermaid/04-architecture-integrated-control-path.jpg)
43+
44+
In this path, the main Reconciler is responsible for reading the CR, ordering component actions, cleaning up invalid resources, and aggregating status. Component controllers create and inspect resources for their own components.
45+
46+
## Control path for compute-storage decoupled clusters
47+
48+
`DorisDisaggregatedCluster` is used for compute-storage decoupled deployments. A single resource contains MetaService, FE, and one or more ComputeGroups.
49+
50+
![Control path for compute-storage decoupled clusters](/images/doris-operator/mermaid/05-architecture-decoupled-control-path.jpg)
51+
52+
Compared with compute-storage integrated mode, this path involves more Doris metadata-level actions. For example, when scaling down a ComputeGroup, the Operator may need to perform decommission or drop actions first, and then update Kubernetes resources.
53+
54+
## Webhook and validation
55+
56+
Doris Operator can optionally enable a Webhook to apply defaults and reject obviously invalid configurations before they enter the Reconcile flow.
57+
58+
Common checks include:
59+
60+
- Whether FE replica and election-related settings are compatible.
61+
- Whether resource field formats satisfy CRD constraints.
62+
- Whether component configuration matches models supported by the Operator.
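
As an illustration of the first check, the sketch below validates an FE spec in Python. It assumes the election-related setting is a field named `electionNumber` next to `replicas`; that field name and the error messages are assumptions for this example, and the real Webhook logic may differ.

```python
def validate_fe_spec(fe_spec):
    """Return a list of validation errors; an empty list means the spec is accepted."""
    errors = []
    replicas = fe_spec.get("replicas")
    election_number = fe_spec.get("electionNumber")  # assumed field name

    if not isinstance(replicas, int) or replicas < 1:
        errors.append("replicas must be a positive integer")
        return errors

    if election_number is None:
        # Nothing to cross-check when the election setting is not declared.
        return errors

    if not isinstance(election_number, int) or election_number < 1:
        errors.append("electionNumber must be a positive integer")
    elif replicas < election_number:
        errors.append(
            f"replicas ({replicas}) is lower than electionNumber ({election_number})"
        )
    return errors
```

Rejecting the object at admission time means the invalid combination never reaches the Reconcile flow, so the error shows up in the `kubectl apply` output rather than in CR `status`.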
63+
64+
Whether the Webhook is enabled depends on the deployment method and environment. Even without the Webhook, CRD schema validation and Reconcile status feedback still provide baseline constraints.
65+
66+
## Status aggregation
67+
68+
The Operator does not only create resources. It also writes component state back into the Doris custom resource. You can use `kubectl get` or `kubectl describe` to see whether components are available, scaling, waiting for scheduling, or failing.
69+
70+
For compute-storage integrated clusters, status is aggregated by FE, BE, CN, and Broker. For compute-storage decoupled clusters, status is aggregated by MetaService, FE, and ComputeGroups, with an additional overall health summary.
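
The aggregation idea can be sketched as follows. The per-component labels (`available`, `progressing`, `unavailable`) and the `green`/`yellow`/`red` summary are hypothetical names chosen for this example; check the CR `status` of your Operator version for the real field names and values.

```python
def aggregate_status(components):
    """Summarize per-component readiness and derive one overall health value.

    `components` maps a component name to {"desired": int, "ready": int}.
    """
    summary = {}
    for name, counts in components.items():
        if counts["ready"] >= counts["desired"]:
            summary[name] = "available"
        elif counts["ready"] == 0:
            summary[name] = "unavailable"
        else:
            summary[name] = "progressing"

    states = set(summary.values())
    if states <= {"available"}:
        health = "green"   # every component has all desired replicas ready
    elif "unavailable" in states:
        health = "red"     # at least one component has no ready replica
    else:
        health = "yellow"  # some replicas are still missing
    return {"components": summary, "health": health}
```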

docs/install/deploy-on-kubernetes/doris-operator/doris-operator-overview.md

Lines changed: 68 additions & 82 deletions
Large diffs are not rendered by default.
Lines changed: 22 additions & 10 deletions
@@ -1,31 +1,43 @@
11
---
22
{
3-
"title": "Pre-deployment Preparation",
3+
"title": "Doris Operator Concepts and Capabilities",
44
"language": "en",
5-
"description": "Manage Apache Doris clusters on Kubernetes with Doris Operator."
5+
"description": "Concepts, architecture, resource model, lifecycle, and troubleshooting guidance for Doris Operator."
66
}
77
---
88

99
import GettingStartedCard from '@site/src/components/getting-started-card/getting-started-card';
1010

11-
Doris Operator is a tool for natively deploying and managing Apache Doris clusters on Kubernetes. Read the overview first, then choose the installation guide that matches your cloud provider.
11+
Doris Operator is the control plane for deploying and managing Apache Doris on Kubernetes. Read these documents first to understand what the Operator manages, how it works, and how to locate problems through resource status.
1212

1313
<div className="cards-grid">
1414
<GettingStartedCard
1515
title="Doris Operator Overview"
16-
description="Learn what Doris Operator can do and how it manages Doris on Kubernetes"
16+
description="Learn what Doris Operator manages, when to use it, and its capability boundaries"
1717
link="doris-operator-overview"
1818
/>
1919

2020
<GettingStartedCard
21-
title="Deploy on Alibaba Cloud"
22-
description="Install Doris Operator on Alibaba Cloud Container Service for Kubernetes (ACK)"
23-
link="on-alibaba"
21+
title="Architecture"
22+
description="Understand the control plane, data plane, and Reconcile model"
23+
link="architecture"
2424
/>
2525

2626
<GettingStartedCard
27-
title="Deploy on AWS"
28-
description="Install Doris Operator on Amazon Elastic Kubernetes Service (EKS)"
29-
link="on-aws"
27+
title="Resource Model"
28+
description="See how Doris custom resources map to StatefulSet, Service, PVC, and other resources"
29+
link="resource-model"
30+
/>
31+
32+
<GettingStartedCard
33+
title="Lifecycle Management"
34+
description="Understand creation, scaling, configuration changes, and deletion behavior"
35+
link="lifecycle"
36+
/>
37+
38+
<GettingStartedCard
39+
title="Status and Troubleshooting"
40+
description="Check status fields and follow the recommended troubleshooting path"
41+
link="status-and-troubleshooting"
3042
/>
3143
</div>
Lines changed: 149 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,149 @@
1+
---
2+
{
3+
"title": "Doris Operator Lifecycle Management",
4+
"sidebar_label": "Lifecycle Management",
5+
"language": "en",
6+
"description": "Explains the main behavior of Doris Operator during cluster creation, scaling, configuration changes, rolling updates, and deletion.",
7+
"keywords": ["Doris Operator lifecycle", "scale out", "scale in", "rolling update", "configuration change", "DorisCluster", "DorisDisaggregatedCluster"]
8+
}
9+
---
10+
11+
Doris Operator manages Doris clusters through the Reconcile loop. After a Doris custom resource is changed, the Operator updates underlying Kubernetes resources according to the new desired state and, when needed, performs Doris metadata-level actions.
12+
13+
This document explains the operational semantics behind those changes.
14+
15+
## Cluster creation
16+
17+
When you create a `DorisCluster` or `DorisDisaggregatedCluster`, the Operator creates the underlying Kubernetes resources according to the component configuration in the resource.
18+
19+
For compute-storage integrated clusters, FE is usually created first:
20+
21+
![Cluster creation flow for DorisCluster](/images/doris-operator/mermaid/11-lifecycle-doriscluster-creation.jpg)
22+
23+
For compute-storage decoupled clusters, MetaService is created first:
24+
25+
![Cluster creation flow for DorisDisaggregatedCluster](/images/doris-operator/mermaid/12-lifecycle-decoupled-cluster-creation.jpg)
26+
27+
If dependencies are not ready, the Operator keeps waiting and retries in later Reconcile loops.
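
A rough sketch of this dependency gating is shown below. The text above only states which component comes first in each mode; the rest of the two orderings and the strict one-after-another gating are assumptions made for illustration.

```python
# Assumed orderings: only the first entry of each list is stated in this document.
INTEGRATED_ORDER = ["fe", "be", "cn", "broker"]
DECOUPLED_ORDER = ["metaService", "fe", "computeGroups"]


def plan_creation(order, declared, ready):
    """Split declared components into those to apply now and those still waiting.

    A component is applied only when every declared component before it is ready.
    """
    to_apply, waiting = [], []
    blocked = False
    for name in order:
        if name not in declared:
            continue
        if blocked:
            waiting.append(name)
            continue
        to_apply.append(name)
        if name not in ready:
            # Later components wait for a future Reconcile pass.
            blocked = True
    return to_apply, waiting
```

For example, `plan_creation(INTEGRATED_ORDER, {"fe", "be"}, set())` applies only FE and leaves BE waiting; once FE is reported ready, the next pass applies both.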
28+
29+
## Scale out
30+
31+
Scaling out is usually done by increasing a component's `replicas` field. For example:
32+
33+
```yaml
spec:
  beSpec:
    replicas: 5
```
38+
39+
After the Operator detects the change, it updates the corresponding `StatefulSet`. Kubernetes creates the new Pods, and the Operator continues checking readiness and updating CR `status`.
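
The readiness check can be approximated by comparing the StatefulSet's desired replica count with its reported status. The sketch below reads `spec.replicas`, `status.replicas`, and `status.readyReplicas`, which are standard StatefulSet fields; the three return labels are hypothetical.

```python
def scale_out_progress(sts):
    """Classify how far a StatefulSet scale-out has progressed."""
    desired = sts["spec"]["replicas"]
    status = sts.get("status", {})
    current = status.get("replicas", 0)     # Pods created so far
    ready = status.get("readyReplicas", 0)  # Pods passing readiness checks

    if ready >= desired:
        return "ready"
    if current < desired:
        return "creating pods"
    return "waiting for readiness"
```

A StatefulSet that stays in `creating pods` usually points to scheduling or PVC binding, while one stuck in `waiting for readiness` points to the Doris process itself, such as a node that cannot register with the cluster.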
40+
41+
For compute-storage decoupled clusters, scaling out a ComputeGroup is done in the same way:
42+
43+
```yaml
spec:
  computeGroups:
  - uniqueId: adhoc-query
    replicas: 5
```
49+
50+
Pay attention to:
51+
52+
- Whether the cluster has enough CPU, memory, and storage.
53+
- Whether new Pods can be scheduled.
54+
- Whether PVCs can be bound.
55+
- Whether new nodes can register with the Doris cluster.
56+
57+
## Scale in
58+
59+
Scale-in is riskier than scale-out because it may affect data replicas, metadata quorum, node roles, or service capacity.
60+
61+
:::caution Caution
62+
Before scaling in a production cluster, confirm Doris replica status, business traffic, and rollback options.
63+
:::
64+
65+
### Compute-storage integrated clusters
66+
67+
Scale-in risks differ by component:
68+
69+
| Component | Main concern |
| --- | --- |
| FE | Follower and Observer roles can affect metadata quorum |
| BE | Replica migration and data availability must be considered |
| CN | No data replicas, but scale-in affects compute capacity and cache |
| Broker | Check whether external access tasks still depend on it |
75+
76+
For FE, the replica count cannot be lower than the number of election nodes. When scaling in a compute-storage integrated cluster, evaluate cluster topology and Doris-level risk separately.
77+
78+
### Compute-storage decoupled clusters
79+
80+
When scaling in a ComputeGroup, Doris metadata-level actions may be required. The behavior depends on `enableDecommission`:
81+
82+
| Configuration | Behavior |
| --- | --- |
| `enableDecommission: true` | Run decommission before scale-in and wait for safe removal |
| `enableDecommission: false` | Directly drop the corresponding node |
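
The two behaviors in the table can be sketched as an ordered plan. The step names below are hypothetical labels, not Operator or Doris commands. The sketch relies on two things: Doris metadata actions run before Kubernetes resources are updated, and a StatefulSet removes Pods with the highest ordinals first when scaled down.

```python
def plan_compute_group_scale_in(current, target, enable_decommission):
    """Return the ordered steps for shrinking a ComputeGroup from current to target replicas."""
    if target >= current:
        return []  # not a scale-in

    steps = []
    # StatefulSet scale-down removes the highest ordinals first.
    for ordinal in range(current - 1, target - 1, -1):
        if enable_decommission:
            steps.append(("decommission", ordinal))
            steps.append(("wait_until_safe", ordinal))
        else:
            steps.append(("drop", ordinal))

    # Kubernetes resources are updated only after the metadata-level actions.
    steps.append(("update_statefulset_replicas", target))
    return steps
```

With `enable_decommission=True`, shrinking from 5 to 3 replicas yields a decommission and a wait step for ordinals 4 and 3 before the replica count changes; with `False`, the same nodes are dropped immediately.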
86+
87+
![ComputeGroup scale-in flow](/images/doris-operator/mermaid/13-lifecycle-computegroup-scale-in.jpg)
88+
89+
Before scaling in, confirm current data distribution and business traffic.
90+
91+
## Configuration changes
92+
93+
Doris startup configuration is usually mounted through `ConfigMap`. Whether a change requires a restart depends on the configuration type and Operator settings.
94+
95+
For compute-storage integrated clusters:
96+
97+
```yaml
spec:
  enableRestartWhenConfigChange: true
```
101+
102+
When this is enabled, core ConfigMap changes can trigger a rolling restart.
103+
104+
Check the following when changing configuration:
105+
106+
- Whether ConfigMap keys match component requirements, such as `fe.conf`.
107+
- Whether configured directories match PVC mount paths.
108+
- Whether ports, FQDN, and authentication settings match the Kubernetes network model.
109+
- Whether the change takes effect only after component restart.
110+
111+
## Rolling updates
112+
113+
Changing component images, some Pod template fields, or configuration hashes can trigger a StatefulSet rolling update.
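
One common way a configuration hash leads to a rolling update is to store the hash in a Pod template annotation: any change to the template makes the StatefulSet controller replace Pods. The sketch below illustrates that general pattern. It is not confirmed to be how Doris Operator computes or stores its hash, and the annotation key `example.com/config-hash` is made up for the example.

```python
import copy
import hashlib
import json


def config_hash(configmap_data):
    """Return a stable short hash of ConfigMap data, independent of key order."""
    canonical = json.dumps(configmap_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def stamp_pod_template(template, configmap_data):
    """Return a copy of the Pod template annotated with the configuration hash."""
    stamped = copy.deepcopy(template)
    annotations = stamped.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["example.com/config-hash"] = config_hash(configmap_data)  # hypothetical key
    return stamped
```

If the ConfigMap content is unchanged, the stamped template is identical and no Pods restart; changing a single value in `fe.conf` changes the hash, the template, and therefore triggers the rolling update.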
114+
115+
Recommended practice:
116+
117+
- Perform the update during off-peak hours.
118+
- Make sure the cluster has no unresolved failures first.
119+
- Confirm client retry behavior.
120+
- Follow Doris version upgrade documentation for upgrade order.
121+
122+
## Cluster deletion
123+
124+
After the Doris custom resource is deleted, the Operator enters the cleanup flow and removes the Kubernetes resources that it manages.
125+
126+
Before deletion, confirm:
127+
128+
- Whether PVCs should be retained.
129+
- Whether object storage, FoundationDB, or other external dependencies are shared.
130+
- Whether data and metadata backups are needed.
131+
- Whether clients are still connected.
132+
133+
:::caution Caution
134+
Deleting the CR can remove the corresponding Kubernetes resources. Confirm cleanup scope and backup strategy before proceeding.
135+
:::
136+
137+
## Observing lifecycle operations
138+
139+
Lifecycle completion should be judged using CR `status`, Kubernetes resource state, and Doris component state together.
140+
141+
```shell
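# Compute-storage integrated clusters (DorisCluster)
kubectl get dcr -n ${namespace}
kubectl describe dcr ${cluster_name} -n ${namespace}
# Compute-storage decoupled clusters (DorisDisaggregatedCluster)
kubectl get ddc -n ${namespace}
kubectl describe ddc ${cluster_name} -n ${namespace}
# Underlying Kubernetes resources
kubectl get pod,sts,svc,pvc -n ${namespace}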
```
148+
149+
If status does not converge for a long time, continue with Operator logs and Kubernetes Events to determine whether the problem is in scheduling, storage binding, configuration mounting, node registration, or Doris metadata operations.
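
As a starting point for that triage, the sketch below maps a few well-known Kubernetes Event reasons to the problem areas named above. The mapping and category labels are a simplification chosen for this example, not an exhaustive or official list. Node registration and Doris metadata problems generally do not surface as Kubernetes Events, so anything unrecognized falls back to the Operator logs.

```python
# Event reasons emitted by Kubernetes components; the categories are illustrative.
REASON_TO_CATEGORY = {
    "FailedScheduling": "scheduling",
    "ProvisioningFailed": "storage binding",
    "FailedAttachVolume": "storage binding",
    "FailedMount": "volume mounting (ConfigMap or PVC)",
    "BackOff": "container start",
}


def triage(events):
    """Return the distinct problem categories for Warning events, in first-seen order."""
    categories = []
    for event in events:
        if event.get("type") != "Warning":
            continue
        category = REASON_TO_CATEGORY.get(event.get("reason"), "check Operator logs")
        if category not in categories:
            categories.append(category)
    return categories
```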
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
---
2+
{
3+
"title": "Preparation",
4+
"language": "en",
5+
"description": "Environment preparation and deployment recommendations before deploying Doris with Doris Operator on cloud Kubernetes services."
6+
}
7+
---
8+
9+
import GettingStartedCard from '@site/src/components/getting-started-card/getting-started-card';
10+
11+
Before deploying Apache Doris with Doris Operator, check the Kubernetes service, node system parameters, image registry, storage, and privilege requirements for your cloud environment.
12+
13+
<div className="cards-grid">
14+
<GettingStartedCard
15+
title="Alibaba Cloud Container Service Deployment Recommendations"
16+
description="Prepare ACK or ACS before deploying Doris with Doris Operator"
17+
link="on-alibaba"
18+
/>
19+
20+
<GettingStartedCard
21+
title="AWS EKS Deployment Recommendations"
22+
description="Prepare Amazon EKS before deploying Doris with Doris Operator"
23+
link="on-aws"
24+
/>
25+
</div>
