add dataload docs for cache runtime (#5807)

xliuqq · web-flow · commit b2263fb67308 · 2026-04-28T15:00:30.000+08:00
* add dataload docs for cache runtime

Signed-off-by: xliuqq <xlzq1992@gmail.com>

* fix typo

Signed-off-by: xliuqq <xlzq1992@gmail.com>

* add trailing newline

Signed-off-by: xliuqq <xlzq1992@gmail.com>

---------

Signed-off-by: xliuqq <xlzq1992@gmail.com>
diff --git a/docs/en/TOC.md b/docs/en/TOC.md
@@ -20,6 +20,7 @@
     - [Share data across namespace (Sidecar mode)](samples/dataset_across_namespace_with_sidecar.md)
   + Operation
     - [Data Preloading](samples/data_warmup.md)
+    - [CacheRuntime Data Operations](samples/cacheruntime_data_operations.md)
     - [Cache Runtime Manually Scaling](samples/dataset_scaling.md)
     - [Automatic Cleanup Data Operation](samples/automatic_clean_up_data_operation.md)
   + Security
diff --git a/docs/en/samples/cacheruntime_data_operations.md b/docs/en/samples/cacheruntime_data_operations.md
@@ -0,0 +1,98 @@
+# Example - CacheRuntime Data Operations
+
+## Prerequisites
+
+This document is an extension of the [CacheRuntime Integration Guide](../dev/generic_cache_runtime_integration.md) and **assumes you have already completed the basic CacheRuntime integration** (including defining topology, configuring components, etc.).
+
+This document only explains how to **add data operation support** to an existing `CacheRuntimeClass`. The core change involves just one field: `dataOperationSpecs`.
+
+## Background
+
+Fluid's CacheRuntime provides a generic cache runtime abstraction that allows users to define implementation details for different caching systems through `CacheRuntimeClass`. Starting from the latest version of Fluid, CacheRuntime natively supports data operations, including DataLoad (data preloading) and DataProcess (data processing).
+
+This document demonstrates how to configure and use the DataLoad feature for CacheRuntime.
+
+> Note: A DataProcess spec defines its own Pod and mounts the Dataset as a PVC, so the DataProcess feature works without any changes to the caching system.
+
+## Environment Verification
+
+Before running this example, please refer to the [Installation Guide](../userguide/install.md) to complete the Fluid installation and verify that all Fluid components are running properly:
+
+```shell
+$ kubectl get pod -n fluid-system
+cacheruntime-controller-xxxxx              1/1     Running   0          8h
+csi-nodeplugin-fluid-fwgjh                  2/2     Running   0          8h
+dataset-controller-5b7848dbbb-n44dj         1/1     Running   0          8h
+```
+
+Ensure that a CacheRuntime controller that supports data operations is installed in your cluster.
+
+## Core Concepts
+
+### The Only Change Compared to Basic Integration
+
+Compared to the basic CacheRuntime integration, supporting data operations **only requires adding one top-level field** to the CacheRuntimeClass:
+
+- **Only add** the `dataOperationSpecs` field
+- **No need to modify** any existing fields (topology, fileSystemType, extraResources, etc.)
+- **Backward compatible**: a CacheRuntimeClass without this field still provides basic caching functionality
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: CacheRuntimeClass
+metadata:
+  name: curvine-demo
+fileSystemType: curvinefs
+
+# [NEW] Only this field is added; other configurations (topology, etc.) remain unchanged
+dataOperationSpecs:
+  - name: DataLoad
+    command: ["/bin/bash", "-c"]
+    args: ["..."]
+
+# [ORIGINAL CONFIGURATION] Fields like topology and extraResources require no modifications
+topology:
+  master:
+    # ... Exactly the same as basic integration
+  worker:
+    # ... Exactly the same as basic integration
+  client:
+    # ... Exactly the same as basic integration
+```
+
+### Detailed Explanation of dataOperationSpecs Field
+
+`dataOperationSpecs` is an array where each element defines the execution specification for a type of data operation.
+
+#### Field Structure
+
+```yaml
+dataOperationSpecs:
+  - name: <operation type>
+    command: [<command>, <parameters>]
+    args: [<script or parameters>]
+    image: <optional: dedicated image>
+```
+
+#### Field Description
+
+| Field Name | Type | Required | Description                                                                                                                        |
+|--------|------|----|---------------------------------------------------------------------------------------------------------------------------|
+| `name` | string | Yes | Operation type identifier. Defined values:<br>• `DataLoad`: Data preloading operation (supported)<br>• `DataMigrate`: Data migration operation (not yet supported)<br>• `DataBackup`: Data backup operation (not yet supported) |
+| `command` | []string | Yes | Command to execute in the container (entrypoint), typically set to `["/bin/bash", "-c"]` to support script execution                                                                  |
+| `args` | []string | Yes | Arguments for the command, usually containing the complete execution script. The script can use environment variables injected by Fluid (see below)                                                                               |
+| `image` | string | No | Container image used for the operation.<br>• **If not specified**: Defaults to using the `worker` component image from `CacheRuntimeClass`<br>• **If specified**: Uses a custom dedicated image (suitable for scenarios requiring special tools)                |
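+
+As a minimal sketch of the optional `image` field, the entry below runs the DataLoad operation in a dedicated image instead of the `worker` image. The image name `example.com/cache-tools:latest` and the script path `/opt/dataload.sh` are hypothetical placeholders chosen for illustration; they are not part of Fluid.
+
+```yaml
+dataOperationSpecs:
+  - name: DataLoad
+    # Hypothetical dedicated image; omit `image` to fall back to the worker image.
+    image: example.com/cache-tools:latest
+    command: ["/bin/bash", "-c"]
+    # Hypothetical script path, assumed to be packaged into the image above.
+    args: ["/opt/dataload.sh"]
+```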
+
+### Available Environment Variables
+
+During data operation execution, Fluid automatically injects the following environment variables into the container:
+
+#### DataLoad-Specific Environment Variables
+
+| Environment Variable Name | Description                             | Example Value |
+|-----------|--------------------------------|--------|
+| `FLUID_DATALOAD_METADATA` | Whether to load metadata                        | `"true"` or `"false"` |
+| `FLUID_DATALOAD_DATA_PATH` | Data paths to be loaded (multiple paths separated by colons)           | `/spark/spark-3.0.1:/spark/spark-2.4.7` |
+| `FLUID_DATALOAD_PATH_REPLICAS` | Number of replicas for each path (separated by colons, in the same order as the paths in `FLUID_DATALOAD_DATA_PATH`) | `1:2` |
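+
+To illustrate where these values could come from, here is a sketch of a `DataLoad` resource whose targets would plausibly yield the example values in the table. The resource name `spark-dataload` and the dataset name `spark` are hypothetical, and the mapping from spec fields to environment variables shown in the comments is an assumption inferred from the table, not something this document states.
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: DataLoad
+metadata:
+  name: spark-dataload        # hypothetical name
+spec:
+  dataset:
+    name: spark               # hypothetical Dataset bound to a CacheRuntime
+    namespace: default
+  # Assumed to surface as FLUID_DATALOAD_METADATA="true"
+  loadMetadata: true
+  # Assumed to surface as FLUID_DATALOAD_DATA_PATH=/spark/spark-3.0.1:/spark/spark-2.4.7
+  # and FLUID_DATALOAD_PATH_REPLICAS=1:2
+  target:
+    - path: /spark/spark-3.0.1
+      replicas: 1
+    - path: /spark/spark-2.4.7
+      replicas: 2
+```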
+
+The integrator of the underlying caching system is expected to write a data preloading script that reads the environment variables above and to package it into the image. When defining a DataLoad operation, users then invoke that script through the `command` and `args` fields.
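+
+The following is a minimal sketch of how such a script might consume the injected variables, written inline in `args` for brevity. The `echo` lines are placeholders: the actual preload command depends on the caching system and is not specified here.
+
+```yaml
+dataOperationSpecs:
+  - name: DataLoad
+    command: ["/bin/bash", "-c"]
+    args:
+      - |
+        set -euo pipefail
+        # Split the colon-separated values injected by Fluid into bash arrays.
+        IFS=':' read -r -a paths <<< "${FLUID_DATALOAD_DATA_PATH:-}"
+        IFS=':' read -r -a replicas <<< "${FLUID_DATALOAD_PATH_REPLICAS:-}"
+        if [ "${FLUID_DATALOAD_METADATA:-false}" = "true" ]; then
+          # Placeholder: call the caching system's metadata load command here.
+          echo "metadata loading requested"
+        fi
+        for i in "${!paths[@]}"; do
+          # Placeholder: replace echo with the caching system's own load command.
+          # Falls back to 1 replica if no matching entry exists (an assumption).
+          echo "load path=${paths[$i]} replicas=${replicas[$i]:-1}"
+        done
+```
+
+Keeping the path and replica lists index-aligned mirrors the one-to-one correspondence described in the table; a real script would likely also validate that both lists have the same length.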
diff --git a/docs/zh/TOC.md b/docs/zh/TOC.md
@@ -25,6 +25,7 @@
     - [跨namespace共享数据(sidecar模式)](samples/dataset_across_namespace_with_sidecar.md)
   + 操作
     - [数据预加载](samples/data_warmup.md)
+    - [CacheRuntime 数据操作](samples/cacheruntime_data_operations.md)
     - [Cache Runtime手动扩缩容](samples/dataset_scaling.md)
     - [数据操作自动清理](samples/automatic_clean_up_data_operation.md)
   + 安全
diff --git a/docs/zh/samples/cacheruntime_data_operations.md b/docs/zh/samples/cacheruntime_data_operations.md
@@ -0,0 +1,98 @@
+# 示例 - CacheRuntime 数据操作
+
+## 前提说明
+
+本文档是 [CacheRuntime 对接社区文档](../dev/generic_cache_runtime_integration.md) 的扩展，**假设您已经完成了基础的 CacheRuntime 集成**（包括定义 topology、配置组件等）。
+
+本文档仅说明如何在已有的 `CacheRuntimeClass` 基础上**新增数据操作支持**，核心改动只有一个字段：`dataOperationSpecs`。
+
+## 背景介绍
+
+Fluid 的 CacheRuntime 提供了一种通用的缓存运行时抽象，允许用户通过 `CacheRuntimeClass` 定义不同缓存系统的实现细节。从 Fluid 最新版本开始，CacheRuntime 原生支持数据操作（Data Operations），包括数据预热（DataLoad）和数据处理（DataProcess）。
+
+本文档将阐述如何为 CacheRuntime 配置和使用 DataLoad 功能。
+
+> 注意：DataProcess Spec 中定义了 Pod 信息，并把 Dataset 当作 PVC 进行挂载使用，因此不需要缓存系统做改动即可使用 DataProcess 功能。
+
+## 前提条件
+
+在运行该示例之前，请参考[安装文档](../userguide/install.md)完成 Fluid 安装，并检查 Fluid 各组件正常运行：
+
+```shell
+$ kubectl get pod -n fluid-system
+cacheruntime-controller-xxxxx              1/1     Running   0          8h
+csi-nodeplugin-fluid-fwgjh                  2/2     Running   0          8h
+dataset-controller-5b7848dbbb-n44dj         1/1     Running   0          8h
+```
+
+确保您的集群中已安装了支持数据操作的 CacheRuntime 控制器。
+
+## 核心概念
+
+### 相对于基础集成的唯一改动
+
+与基础的 CacheRuntime 集成相比，支持数据操作**仅需在 CacheRuntimeClass 中添加一个顶层字段**：
+
+- **只需添加** `dataOperationSpecs` 字段
+- **无需修改** 任何现有字段（topology、fileSystemType、extraResources 等）
+- **向后兼容**：未配置此字段的 CacheRuntimeClass 仍可正常使用基础缓存功能
+
+```yaml
+apiVersion: data.fluid.io/v1alpha1
+kind: CacheRuntimeClass
+metadata:
+  name: curvine-demo
+fileSystemType: curvinefs
+
+# 【新增】仅此一个字段，其他配置（topology等）完全保持不变
+dataOperationSpecs:
+  - name: DataLoad
+    command: ["/bin/bash", "-c"]
+    args: ["..."]
+
+# 【原有配置】topology、extraResources 等字段无需任何修改
+topology:
+  master:
+    # ... 与基础集成完全一致
+  worker:
+    # ... 与基础集成完全一致
+  client:
+    # ... 与基础集成完全一致
+```
+
+### dataOperationSpecs 字段详解
+
+`dataOperationSpecs` 是一个数组，每个元素定义一种数据操作的执行规范。
+
+#### 字段结构
+
+```yaml
+dataOperationSpecs:
+  - name: <操作类型>
+    command: [<命令>, <参数>]
+    args: [<脚本或参数>]
+    image: <可选：专用镜像>
+```
+
+#### 字段说明
+
+| 字段名 | 类型 | 必填 | 说明                                                                                                                        |
+|--------|------|----|---------------------------------------------------------------------------------------------------------------------------|
+| `name` | string |  是 | 操作类型标识符，可选值：<br>• `DataLoad`：数据预热操作（已支持）<br>• `DataMigrate`：数据迁移操作（暂未支持）<br>• `DataBackup`：数据备份操作（暂未支持） |
+| `command` | []string |  是 | 容器中执行的命令（entrypoint），通常设置为 `["/bin/bash", "-c"]` 以支持脚本执行                                                                  |
+| `args` | []string |  是 | 命令的参数，通常包含完整的执行脚本。脚本中可使用 Fluid 注入的环境变量（见下文）                                                                               |
+| `image` | string |  否 | 操作使用的容器镜像。<br>• **如果不指定**：默认使用 `CacheRuntimeClass` 中 `worker` 组件的镜像<br>• **如果指定**：使用自定义的专用镜像（适用于需要特殊工具的场景）                |
+
+### 可用环境变量
+
+在数据操作执行过程中，Fluid 会自动向容器中注入以下环境变量：
+
+#### DataLoad 专属环境变量
+
+| 环境变量名 | 说明                             | 示例值 |
+|-----------|--------------------------------|--------|
+| `FLUID_DATALOAD_METADATA` | 是否加载元数据                        | `"true"` 或 `"false"` |
+| `FLUID_DATALOAD_DATA_PATH` | 需要加载的数据路径（多个路径用冒号分隔）           | `/spark/spark-3.0.1:/spark/spark-2.4.7` |
+| `FLUID_DATALOAD_PATH_REPLICAS` | 每个路径的副本数（用冒号分隔，与 `FLUID_DATALOAD_DATA_PATH` 中的路径一一对应） | `1:2` |
+
+底层缓存系统的对接方需要根据上面的环境变量编写数据预热脚本并打包到镜像中；用户在定义 DataLoad 操作时，即可通过 `command` 和 `args` 字段调用该脚本。