This document is an extension of the CacheRuntime Integration Guide and assumes you have already completed the basic CacheRuntime integration (including defining topology, configuring components, etc.).
This document only explains how to add data operation support to an existing CacheRuntimeClass. The core change involves just one field: `dataOperationSpecs`.
Fluid's CacheRuntime provides a generic cache runtime abstraction that allows users to define implementation details for different caching systems through CacheRuntimeClass. Starting from the latest version of Fluid, CacheRuntime natively supports data operations, including DataLoad (data preloading) and DataProcess (data processing).
This document demonstrates how to configure and use the DataLoad feature for CacheRuntime.
Note: The DataProcess Spec defines Pod information and mounts the Dataset as a PVC, so no modifications to the caching system are required to use the DataProcess feature.
Before running this example, please refer to the Installation Guide to complete the Fluid installation and verify that all Fluid components are running properly:
```shell
$ kubectl get pod -n fluid-system
NAME                                  READY   STATUS    RESTARTS   AGE
cacheruntime-controller-xxxxx         1/1     Running   0          8h
csi-nodeplugin-fluid-fwgjh            2/2     Running   0          8h
dataset-controller-5b7848dbbb-n44dj   1/1     Running   0          8h
```

Ensure that your cluster has installed a CacheRuntime controller that supports data operations.
Compared to the basic CacheRuntime integration, supporting data operations only requires adding one top-level field to the CacheRuntimeClass:
- Only the `dataOperationSpecs` field is added
- No existing fields need to be modified (`topology`, `fileSystemType`, `extraResources`, etc.)
- Backward compatible: a CacheRuntimeClass without this field can still use basic caching functionality normally
```yaml
apiVersion: data.fluid.io/v1alpha1
kind: CacheRuntimeClass
metadata:
  name: curvine-demo
fileSystemType: curvinefs
# [NEW] Only this field is added; other configurations (topology, etc.) remain unchanged
dataOperationSpecs:
  - name: DataLoad
    command: ["/bin/bash", "-c"]
    args: ["..."]
# [ORIGINAL CONFIGURATION] Fields like topology and extraResources require no modifications
topology:
  master:
    # ... Exactly the same as the basic integration
  worker:
    # ... Exactly the same as the basic integration
  client:
    # ... Exactly the same as the basic integration
```

`dataOperationSpecs` is an array where each element defines the execution specification for one type of data operation.
```yaml
dataOperationSpecs:
  - name: <operation type>
    command: [<command>, <parameters>]
    args: [<script or parameters>]
    image: <optional: dedicated image>
```

| Field Name | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Operation type identifier. Currently supported values:<br>• `DataLoad`: data preloading operation<br>• `DataMigrate`: data migration operation (not yet supported)<br>• `DataBackup`: data backup operation (not yet supported) |
| `command` | []string | Yes | Command to execute in the container (entrypoint), typically set to `["/bin/bash", "-c"]` to support script execution |
| `args` | []string | Yes | Arguments for the command, usually containing the complete execution script. The script can use environment variables injected by Fluid (see below) |
| `image` | string | No | Container image used for the operation.<br>• If not specified: defaults to the worker component image from the CacheRuntimeClass<br>• If specified: uses a custom dedicated image (suitable for scenarios requiring special tools) |
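Putting these fields together, a filled-in entry might look like the following sketch. The image name `example.com/curvine-dataload:v1` and the inline script body are illustrative placeholders, not real artifacts:

```yaml
dataOperationSpecs:
  - name: DataLoad
    # Optional: a dedicated image carrying preload tooling;
    # falls back to the worker image if omitted.
    image: example.com/curvine-dataload:v1
    command: ["/bin/bash", "-c"]
    args:
      - |
        if [[ "$FLUID_DATALOAD_METADATA" == "true" ]]; then
          echo "loading metadata"
        fi
        IFS=':' read -r -a paths <<< "$FLUID_DATALOAD_DATA_PATH"
        for p in "${paths[@]}"; do
          echo "preloading $p"
        done
```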
During data operation execution, Fluid automatically injects the following environment variables into the container:
| Environment Variable Name | Description | Example Value |
|---|---|---|
| `FLUID_DATALOAD_METADATA` | Whether to load metadata | `"true"` or `"false"` |
| `FLUID_DATALOAD_DATA_PATH` | Data paths to be loaded (multiple paths separated by colons) | `/spark/spark-3.0.1:/spark/spark-2.4.7` |
| `FLUID_DATALOAD_PATH_REPLICAS` | Number of replicas for each path (colon-separated, corresponding one-to-one with `DATA_PATH`) | `1:2` |
The integrator of the underlying caching system writes a data preloading script against the environment variables above and packages it into the image. When a DataLoad operation runs, that script is invoked through the `command` and `args` fields.
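As a sketch of such a script, the snippet below parses the injected environment variables and drives one preload call per path. The actual preload command (`my-cache-cli`, commented out) is a placeholder: substitute your caching system's real CLI.

```shell
#!/bin/bash
# Sketch of a DataLoad script built on Fluid's injected environment variables.
set -euo pipefail

run_dataload() {
  # Split the colon-separated lists into parallel arrays.
  IFS=':' read -r -a paths    <<< "${FLUID_DATALOAD_DATA_PATH}"
  IFS=':' read -r -a replicas <<< "${FLUID_DATALOAD_PATH_REPLICAS}"

  # Optionally load metadata first.
  if [[ "${FLUID_DATALOAD_METADATA:-false}" == "true" ]]; then
    echo "loading metadata"
    # my-cache-cli load-metadata   # placeholder
  fi

  # Preload each path with its corresponding replica count (default 1).
  for i in "${!paths[@]}"; do
    echo "preloading ${paths[$i]} with ${replicas[$i]:-1} replica(s)"
    # my-cache-cli load --path "${paths[$i]}" --replicas "${replicas[$i]:-1}"   # placeholder
  done
}

# Example invocation with the values Fluid would inject:
FLUID_DATALOAD_METADATA="true" \
FLUID_DATALOAD_DATA_PATH="/spark/spark-3.0.1:/spark/spark-2.4.7" \
FLUID_DATALOAD_PATH_REPLICAS="1:2" \
run_dataload
```

Because the replica list corresponds one-to-one with the path list, indexing both arrays with the same loop variable keeps each path paired with its replica count.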