|
| 1 | +# Example - CacheRuntime Data Operations |
| 2 | + |
| 3 | +## Prerequisites |
| 4 | + |
| 5 | +This document is an extension of the [CacheRuntime Integration Guide](../dev/generic_cache_runtime_integration.md) and **assumes you have already completed the basic CacheRuntime integration** (including defining topology, configuring components, etc.). |
| 6 | + |
| 7 | +This document only explains how to **add data operation support** to an existing `CacheRuntimeClass`. The core change involves just one field: `dataOperationSpecs`. |
| 8 | + |
| 9 | +## Background |
| 10 | + |
| 11 | +Fluid's CacheRuntime provides a generic cache runtime abstraction that allows users to define implementation details for different caching systems through `CacheRuntimeClass`. Starting from the latest version of Fluid, CacheRuntime natively supports data operations, including DataLoad (data preloading) and DataProcess (data processing). |
| 12 | + |
| 13 | +This document demonstrates how to configure and use the DataLoad feature for CacheRuntime. |
| 14 | + |
| 15 | +> Note: The DataProcess Spec defines Pod information and mounts the Dataset as a PVC, so no modifications to the caching system are required to use the DataProcess feature. |
| 16 | +
|
| 17 | +## Environment Verification |
| 18 | + |
| 19 | +Before running this example, please refer to the [Installation Guide](../userguide/install.md) to complete the Fluid installation and verify that all Fluid components are running properly: |
| 20 | + |
| 21 | +```shell |
| 22 | +$ kubectl get pod -n fluid-system |
| 23 | +cacheruntime-controller-xxxxx 1/1 Running 0 8h |
| 24 | +csi-nodeplugin-fluid-fwgjh 2/2 Running 0 8h |
| 25 | +dataset-controller-5b7848dbbb-n44dj 1/1 Running 0 8h |
| 26 | +``` |
| 27 | + |
| 28 | +Ensure that the CacheRuntime controller installed in your cluster supports data operations. |
| 29 | + |
| 30 | +## Core Concepts |
| 31 | + |
| 32 | +### The Only Change Compared to Basic Integration |
| 33 | + |
| 34 | +Compared to the basic CacheRuntime integration, supporting data operations **only requires adding one top-level field** to the CacheRuntimeClass: |
| 35 | + |
| 36 | +- **Only add** the `dataOperationSpecs` field |
| 37 | +- **No need to modify** any existing fields (topology, fileSystemType, extraResources, etc.) |
| 38 | +- **Backward compatible**: A CacheRuntimeClass that omits this field still provides basic caching functionality |
| 39 | + |
| 40 | +```yaml |
| 41 | +apiVersion: data.fluid.io/v1alpha1 |
| 42 | +kind: CacheRuntimeClass |
| 43 | +metadata: |
| 44 | + name: curvine-demo |
| 45 | +fileSystemType: curvinefs |
| 46 | + |
| 47 | +# [NEW] Only this field is added; other configurations (topology, etc.) remain unchanged |
| 48 | +dataOperationSpecs: |
| 49 | + - name: DataLoad |
| 50 | + command: ["/bin/bash", "-c"] |
| 51 | + args: ["..."] |
| 52 | + |
| 53 | +# [ORIGINAL CONFIGURATION] Fields like topology and extraResources require no modifications |
| 54 | +topology: |
| 55 | + master: |
| 56 | + # ... Exactly the same as basic integration |
| 57 | + worker: |
| 58 | + # ... Exactly the same as basic integration |
| 59 | + client: |
| 60 | + # ... Exactly the same as basic integration |
| 61 | +``` |
| 62 | + |
| 63 | +### Detailed Explanation of the `dataOperationSpecs` Field |
| 64 | + |
| 65 | +`dataOperationSpecs` is an array where each element defines the execution specification for a type of data operation. |
| 66 | + |
| 67 | +#### Field Structure |
| 68 | + |
| 69 | +```yaml |
| 70 | +dataOperationSpecs: |
| 71 | + - name: <operation type> |
| 72 | + command: [<command>, <parameters>] |
| 73 | + args: [<script or parameters>] |
| 74 | + image: <optional: dedicated image> |
| 75 | +``` |
| 76 | + |
| 77 | +#### Field Description |
| 78 | + |
| 79 | +| Field Name | Type | Required | Description | |
| 80 | +|--------|------|----|---------------------------------------------------------------------------------------------------------------------------| |
| 81 | +| `name` | string | Yes | Operation type identifier. Defined values:<br>• `DataLoad`: Data preloading operation (currently the only supported value)<br>• `DataMigrate`: Data migration operation (not yet supported)<br>• `DataBackup`: Data backup operation (not yet supported) | |
| 82 | +| `command` | []string | Yes | Command to execute in the container (entrypoint), typically set to `["/bin/bash", "-c"]` to support script execution | |
| 83 | +| `args` | []string | Yes | Arguments for the command, usually containing the complete execution script. The script can use environment variables injected by Fluid (see below) | |
| 84 | +| `image` | string | No | Container image used for the operation.<br>• **If not specified**: Defaults to using the `worker` component image from `CacheRuntimeClass`<br>• **If specified**: Uses a custom dedicated image (suitable for scenarios requiring special tools) | |
| 85 | + |
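To make the `command`/`args` split concrete: under standard Kubernetes container semantics, `command` replaces the image entrypoint and `args` supplies its arguments, so the typical setting amounts to running `/bin/bash -c '<args[0]>'`. The sketch below reproduces that locally. It assumes Fluid passes both fields through to the container unchanged; the script body and the sample value are illustrative only.

```shell
# Emulate command: ["/bin/bash", "-c"] with a single-element args array.
# The script is single-quoted so that variables expand inside the container
# shell at run time, not in the shell that launches it.
args='echo "loading: ${FLUID_DATALOAD_DATA_PATH}"'

# Fluid would inject this variable; it is set by hand here for illustration.
FLUID_DATALOAD_DATA_PATH="/spark/spark-3.0.1" /bin/bash -c "$args"
# prints: loading: /spark/spark-3.0.1
```

Keeping the whole script in one `args` element avoids surprises: with `bash -c`, any additional elements would become positional parameters (`$0`, `$1`, ...) rather than further script lines.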
| 86 | +### Available Environment Variables |
| 87 | + |
| 88 | +During data operation execution, Fluid automatically injects the following environment variables into the container: |
| 89 | + |
| 90 | +#### DataLoad-Specific Environment Variables |
| 91 | + |
| 92 | +| Environment Variable Name | Description | Example Value | |
| 93 | +|-----------|--------------------------------|--------| |
| 94 | +| `FLUID_DATALOAD_METADATA` | Whether to load metadata | `"true"` or `"false"` | |
| 95 | +| `FLUID_DATALOAD_DATA_PATH` | Data paths to be loaded (multiple paths separated by colons) | `/spark/spark-3.0.1:/spark/spark-2.4.7` | |
| 96 | +| `FLUID_DATALOAD_PATH_REPLICAS` | Replica count for each path (colon-separated, in the same order as `FLUID_DATALOAD_DATA_PATH`) | `1:2` | |
| 97 | + |
| 98 | +The integrator of the underlying caching system is expected to write a data preloading script that reads the above environment variables and to package it into the image. When defining a DataLoad operation, invoke that script through the `command` and `args` fields. |
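As a rough illustration of such a script, the following bash sketch splits the colon-separated path and replica lists and pairs them by position. The function names `run_dataload` and `load_path` are hypothetical, and `load_path` only prints what it would do; a real integration would replace its body with the caching system's own preload command.

```shell
#!/bin/bash
# Sketch of a DataLoad entry script driven by the FLUID_DATALOAD_* variables.
set -e

# Hypothetical placeholder: a real image would call the caching system's
# own preload command here instead of printing.
load_path() {
  echo "load path=$1 replicas=$2 metadata=$3"
}

run_dataload() {
  local metadata="${FLUID_DATALOAD_METADATA:-false}"
  local -a paths replicas
  local i

  if [ -z "${FLUID_DATALOAD_DATA_PATH:-}" ]; then
    echo "FLUID_DATALOAD_DATA_PATH is empty" >&2
    return 1
  fi

  # Split the colon-separated lists into arrays.
  IFS=':' read -r -a paths <<< "${FLUID_DATALOAD_DATA_PATH}"
  IFS=':' read -r -a replicas <<< "${FLUID_DATALOAD_PATH_REPLICAS:-}"

  # The two lists correspond one-to-one, so their lengths must match.
  if [ "${#paths[@]}" -ne "${#replicas[@]}" ]; then
    echo "path count and replica count differ" >&2
    return 1
  fi

  for i in "${!paths[@]}"; do
    load_path "${paths[$i]}" "${replicas[$i]}" "${metadata}"
  done
}

# Example run with the sample values from the table above.
FLUID_DATALOAD_METADATA="true" \
FLUID_DATALOAD_DATA_PATH="/spark/spark-3.0.1:/spark/spark-2.4.7" \
FLUID_DATALOAD_PATH_REPLICAS="1:2" \
run_dataload
# prints:
#   load path=/spark/spark-3.0.1 replicas=1 metadata=true
#   load path=/spark/spark-2.4.7 replicas=2 metadata=true
```

Inside a real container the variables are already set by Fluid, so the script would end with a bare `run_dataload` call. Validating the list lengths before loading anything keeps a malformed request from being partially applied.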