[FLINK-40005][cdc-cli] CliExecutor application mode resolves pipeline definition from local path, remote path or inline content

Jiyong Wang · Jiyong Wang · commit 0e10c043582b · 2026-06-30T14:26:31.000+08:00
In application mode, CliExecutor.main received args[0] and passed it to YamlPipelineDefinitionParser.parse(String, Configuration), which treats the String as YAML content. The path string itself was parsed as YAML and failed validation with:

    Missing required field "source" in top-level configuration.

The two native application deployment executors share CliExecutor.main but pass args[0] with different semantics:

- K8SApplicationDeploymentExecutor sets APPLICATION_ARGS = commandLine.getArgList(), so args[0] is the pipeline definition FILE PATH (shipped into the JobManager container, e.g. mounted via a ConfigMap by the Flink Kubernetes Operator).
- YarnApplicationDeploymentExecutor reads the file on the client side and sets APPLICATION_ARGS to the pipeline definition CONTENT.

Resolve args[0] in three cases, in order:

1. An explicit-scheme path (s3://, hdfs://, oss://, file://) is read through Flink's FileSystem so the matching plugin resolves it. The scheme is explicit, so it is not at risk of being hijacked by the cluster default FileSystem.
2. A bare local file path is read with the local JVM file API (java.nio Files.readAllBytes) instead of Flink's FileSystem, whose cluster default may be S3/HDFS and would not resolve a local path shipped next to the JobManager.
3. Otherwise the value is already the pipeline definition content (YARN) and is parsed verbatim.

This fixes K8S application mode without regressing YARN application mode, and also supports placing the pipeline definition file on remote storage.

Also update the Kubernetes deployment docs (EN + ZH): the FlinkDeployment example now uses entryClass org.apache.flink.cdc.cli.CliExecutor without --use-mini-cluster, and the "native application mode is not supported" note is removed.

Regression introduced by #3643 (FLINK-35360, Support Yarn application mode for yaml job).
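
For reviewers who want to try the resolution order without a Flink classpath, here is a minimal JDK-only sketch of the three-way decision. PipelineArgClassifier and its Kind enum are hypothetical names used only for illustration and are not part of this change. The sketch only labels which branch an argument would take; it does not read the file, and it does not touch Flink's FileSystem.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Flink CDC): labels args[0] by the branch it would take. */
final class PipelineArgClassifier {

    enum Kind { SCHEME_PATH, LOCAL_FILE, INLINE_CONTENT }

    static Kind classify(String arg) {
        // Case 1: the value parses as a URI and carries a scheme (s3://, hdfs://, file://, ...).
        if (hasScheme(arg)) {
            return Kind.SCHEME_PATH;
        }
        // Case 2: the value names an existing regular file on the local file system.
        try {
            if (Files.isRegularFile(Paths.get(arg))) {
                return Kind.LOCAL_FILE;
            }
        } catch (InvalidPathException ignored) {
            // Not a valid local path on this platform; fall through.
        }
        // Case 3: anything else is assumed to be the YAML content itself.
        return Kind.INLINE_CONTENT;
    }

    private static boolean hasScheme(String value) {
        try {
            return new URI(value).getScheme() != null;
        } catch (URISyntaxException e) {
            // Multi-line YAML contains spaces and newlines, which java.net.URI rejects.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pipeline", ".yaml");
        try {
            System.out.println(classify("s3://bucket/jobs/pipeline.yaml"));
            System.out.println(classify(tmp.toString()));
            System.out.println(classify("source:\n  type: mysql\n"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

One consequence of this ordering worth keeping in mind: a bare path that does not exist inside the container fails the regular-file check and therefore lands in the third branch, so it would presumably be handed to the parser as YAML content and surface the same "Missing required field" error rather than a file-not-found error.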
diff --git a/docs/content.zh/docs/deployment/kubernetes.md b/docs/content.zh/docs/deployment/kubernetes.md
@@ -241,10 +241,9 @@ spec:
   imagePullPolicy: Always
   job:
     args:
-      - '--use-mini-cluster'
       - /opt/flink/flink-cdc-{{< param Version >}}/conf/mysql-to-doris.yaml
-    entryClass: org.apache.flink.cdc.cli.CliFrontend
-    jarURI: 'local:///opt/flink/flink-cdc-{{< param Version >}}/lib/flink-cdc-dist-{{< param Version >}}.jar'
+    entryClass: org.apache.flink.cdc.cli.CliExecutor
+    jarURI: 'local:///opt/flink/lib/flink-cdc-dist-{{< param Version >}}.jar'
     parallelism: 1
     state: running
     upgradeMode: savepoint
@@ -276,7 +275,7 @@ spec:
 ```
 {{< hint info >}}
 1. 由于Flink的类加载机制，参数`classloader.resolve-order`必须设置为`parent-first`。 
-2. Flink CDC默认提交作业到远程Flink集群，在Operator模式下，您需要通过指定`--use-mini-cluster`参数在pod内部启动一个Standalone Flink集群。  
+2. `entryClass`必须设置为`org.apache.flink.cdc.cli.CliExecutor`，它是Flink **native application mode** 的入口类。Pipeline 定义文件路径通过`args`传入，由 JobManager 读取并构建作业图，作业随后在独立的 TaskManager 上执行。  
 {{< /hint >}}
 
 ### 提交Flink CDC作业
@@ -289,9 +288,4 @@ kubectl apply -f flink-cdc-pipeline-job.yaml
 ```shell
 flinkdeployment.flink.apache.org/flink-cdc-pipeline-job created
 ```
-如您需要查看日志、暴露Flink Web UI等，请参考：[Flink Kubernetes Operator文档](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/)。
-
-
-{{< hint info >}}  
-请注意，目前不支持使用**native application mode**提交作业。  
-{{< /hint >}}
+如您需要查看日志、暴露Flink Web UI等，请参考：[Flink Kubernetes Operator文档](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/)。
diff --git a/docs/content/docs/deployment/kubernetes.md b/docs/content/docs/deployment/kubernetes.md
@@ -244,9 +244,8 @@ spec:
   imagePullPolicy: Always
   job:
     args:
-      - '--use-mini-cluster'
       - /opt/flink/flink-cdc-{{< param Version >}}/conf/mysql-to-doris.yaml
-    entryClass: org.apache.flink.cdc.cli.CliFrontend
+    entryClass: org.apache.flink.cdc.cli.CliExecutor
     jarURI: 'local:///opt/flink/lib/flink-cdc-dist-{{< param Version >}}.jar'
     parallelism: 1
     state: running
@@ -279,7 +278,7 @@ spec:
 ```
 {{< hint info >}}  
 1. Due to Flink's class loader, the parameter of `classloader.resolve-order` must be `parent-first`.
-2. Flink CDC submits a job to a remote Flink cluster by default, you should start a Standalone Flink cluster in the pod by `--use-mini-cluster` in Operator mode.  
+2. The `entryClass` must be `org.apache.flink.cdc.cli.CliExecutor`, which is the entrypoint of Flink **native application mode**. The pipeline definition file path is passed through `args`; the JobManager reads it, builds the job graph, and the job is executed on dedicated TaskManagers.  
 {{< /hint >}}
 
 ### Submit a Flink CDC Job
@@ -292,8 +291,4 @@ After successful submission, the return information is as follows：
 ```shell
 flinkdeployment.flink.apache.org/flink-cdc-pipeline-job created
 ```
-If you want to trace the logs or expose the Flink Web UI, please refer to: [Flink Kubernetes Operator documentation](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/)。
-
-{{< hint info >}}  
-Please note that submitting with **native application mode** is not supported for now.  
-{{< /hint >}}
+If you want to trace the logs or expose the Flink Web UI, please refer to: [Flink Kubernetes Operator documentation](https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/)。
diff --git a/flink-cdc-cli/src/main/java/org/apache/flink/cdc/cli/CliExecutor.java b/flink-cdc-cli/src/main/java/org/apache/flink/cdc/cli/CliExecutor.java
@@ -30,11 +30,20 @@
 import org.apache.flink.cdc.composer.flink.deployment.K8SApplicationDeploymentExecutor;
 import org.apache.flink.cdc.composer.flink.deployment.YarnApplicationDeploymentExecutor;
 import org.apache.flink.configuration.DeploymentOptions;
+import org.apache.flink.core.fs.FSDataInputStream;
+import org.apache.flink.core.fs.FileSystem;
 import org.apache.flink.core.fs.Path;
 import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
 
 import org.apache.commons.cli.CommandLine;
 
+import java.io.ByteArrayOutputStream;
+import java.net.URI;
+import java.net.URISyntaxException;
+import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.InvalidPathException;
+import java.nio.file.Paths;
 import java.util.List;
 
 /** Executor for doing the composing and submitting logic for {@link CliFrontend}. */
@@ -108,14 +117,90 @@ public PipelineExecution.ExecutionInfo deployWithNoOpComposer() throws Exception
     // The main class for running application mode
     public static void main(String[] args) throws Exception {
         PipelineDefinitionParser pipelineDefinitionParser = new YamlPipelineDefinitionParser();
-        PipelineDef pipelineDef = pipelineDefinitionParser.parse(args[0], new Configuration());
+        PipelineDef pipelineDef =
+                pipelineDefinitionParser.parse(resolvePipelineDef(args[0]), new Configuration());
         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
         FlinkPipelineComposer flinkPipelineComposer =
                 FlinkPipelineComposer.ofApplicationCluster(env);
         PipelineExecution execution = flinkPipelineComposer.compose(pipelineDef);
         execution.execute();
     }
 
+    /**
+     * Resolves the application-mode entrypoint argument into the pipeline definition YAML content.
+     *
+     * <p>The two native application deployment executors share this single entrypoint but pass
+     * {@code args[0]} with different semantics:
+     *
+     * <ul>
+     *   <li>{@link org.apache.flink.cdc.composer.flink.deployment.K8SApplicationDeploymentExecutor}
+     *       sets {@code APPLICATION_ARGS = commandLine.getArgList()}, so {@code args[0]} is the
+     *       pipeline definition FILE PATH shipped into the JobManager container (e.g. mounted via a
+     *       ConfigMap by the Flink Kubernetes Operator).
+     *   <li>{@link
+     *       org.apache.flink.cdc.composer.flink.deployment.YarnApplicationDeploymentExecutor} reads
+     *       the file on the client side and sets {@code APPLICATION_ARGS} to the pipeline
+     *       definition CONTENT, because it does not ship the file into the YARN container.
+     * </ul>
+     *
+     * <p>Three cases are handled, in order:
+     *
+     * <ol>
+     *   <li>An explicit-scheme path (e.g. {@code s3://}, {@code hdfs://}, {@code oss://}, {@code
+     *       file://}) is read through Flink's FileSystem so the matching plugin resolves it. The
+     *       scheme is explicit, so — unlike a bare local path — it is not at risk of being hijacked
+     *       by the cluster default FileSystem.
+     *   <li>A bare local file path (e.g. shipped into the JobManager container / mounted by a
+     *       ConfigMap by the Flink Kubernetes Operator) is read with the local JVM file API rather
+     *       than Flink's FileSystem, whose cluster default may be S3/HDFS and would not resolve a
+     *       local path.
+     *   <li>Otherwise the value is already the pipeline definition CONTENT (the YARN application
+     *       executor reads the file on the client side and passes the content) and is used
+     *       verbatim. Without distinguishing these, the parser's String overload would treat a file
+     *       path as YAML content and fail with: Missing required field "source".
+     * </ol>
+     */
+    @VisibleForTesting
+    static String resolvePipelineDef(String pipelineDefPathOrContent) throws Exception {
+        // Case 1: explicit-scheme path -> read through Flink's FileSystem (plugin-aware).
+        URI uri = tryParseUri(pipelineDefPathOrContent);
+        if (uri != null && uri.getScheme() != null) {
+            Path remotePath = new Path(pipelineDefPathOrContent);
+            FileSystem fileSystem = remotePath.getFileSystem();
+            try (FSDataInputStream in = fileSystem.open(remotePath);
+                    ByteArrayOutputStream out = new ByteArrayOutputStream()) {
+                byte[] buffer = new byte[4096];
+                int bytesRead;
+                while ((bytesRead = in.read(buffer)) != -1) {
+                    out.write(buffer, 0, bytesRead);
+                }
+                return new String(out.toByteArray(), StandardCharsets.UTF_8);
+            }
+        }
+
+        // Case 2: bare local file path -> read with the local JVM file API to avoid the cluster
+        // default FileSystem (which may be S3/HDFS) hijacking a local path.
+        try {
+            java.nio.file.Path localPath = Paths.get(pipelineDefPathOrContent);
+            if (Files.isRegularFile(localPath)) {
+                return new String(Files.readAllBytes(localPath), StandardCharsets.UTF_8);
+            }
+        } catch (InvalidPathException ignored) {
+            // Not a valid local path; fall through.
+        }
+
+        // Case 3: not a path -> the YARN application executor already passes the CONTENT.
+        return pipelineDefPathOrContent;
+    }
+
+    private static URI tryParseUri(String value) {
+        try {
+            return new URI(value);
+        } catch (URISyntaxException e) {
+            return null;
+        }
+    }
+
     @VisibleForTesting
     void setComposer(PipelineComposer composer) {
         this.composer = composer;
diff --git a/flink-cdc-cli/src/test/java/org/apache/flink/cdc/cli/CliExecutorTest.java b/flink-cdc-cli/src/test/java/org/apache/flink/cdc/cli/CliExecutorTest.java
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.cdc.cli;
+
+import org.apache.flink.cdc.cli.parser.YamlPipelineDefinitionParser;
+import org.apache.flink.cdc.common.configuration.Configuration;
+import org.apache.flink.cdc.composer.definition.PipelineDef;
+
+import org.apache.flink.shaded.guava31.com.google.common.io.Resources;
+
+import org.junit.jupiter.api.Test;
+
+import java.net.URI;
+import java.net.URL;
+import java.nio.file.Paths;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+/**
+ * Tests covering how {@link CliExecutor#main(String[])} (the application-mode entrypoint) loads the
+ * pipeline definition through {@link CliExecutor#resolvePipelineDef(String)}.
+ *
+ * <p>The two native application deployment executors share this single entrypoint but pass {@code
+ * args[0]} with different semantics, so {@code resolvePipelineDef} must handle both:
+ *
+ * <ul>
+ *   <li><b>Kubernetes</b> ({@code K8SApplicationDeploymentExecutor}) passes the pipeline definition
+ *       FILE PATH (shipped into the JobManager container, e.g. mounted by a ConfigMap), so it must
+ *       be read from the file system.
+ *   <li><b>YARN</b> ({@code YarnApplicationDeploymentExecutor}) reads the file on the client side
+ *       and passes the pipeline definition CONTENT, which must be used verbatim.
+ * </ul>
+ */
+class CliExecutorTest {
+
+    /**
+     * Kubernetes application mode: {@code args[0]} is a file path, so {@code resolvePipelineDef}
+     * reads the file content, which then parses into a valid pipeline definition.
+     */
+    @Test
+    void testResolvePipelineDefFromFilePath() throws Exception {
+        URL resource = Resources.getResource("definitions/pipeline-definition-minimized.yaml");
+        String pipelineDefPath = Paths.get(resource.toURI()).toString();
+
+        String content = CliExecutor.resolvePipelineDef(pipelineDefPath);
+        assertThat(content).contains("source:").contains("type: mysql");
+
+        PipelineDef pipelineDef =
+                new YamlPipelineDefinitionParser().parse(content, new Configuration());
+        assertThat(pipelineDef.getSource()).isNotNull();
+    }
+
+    /**
+     * Explicit-scheme path (here {@code file://}, the same code path as {@code s3://}, {@code
+     * hdfs://}, {@code oss://}): read through Flink's FileSystem so the matching plugin resolves
+     * it.
+     */
+    @Test
+    void testResolvePipelineDefFromSchemePath() throws Exception {
+        URL resource = Resources.getResource("definitions/pipeline-definition-minimized.yaml");
+        String schemePath = resource.toURI().toString();
+        assertThat(new URI(schemePath).getScheme()).isNotNull();
+
+        String content = CliExecutor.resolvePipelineDef(schemePath);
+        assertThat(content).contains("source:").contains("type: mysql");
+
+        PipelineDef pipelineDef =
+                new YamlPipelineDefinitionParser().parse(content, new Configuration());
+        assertThat(pipelineDef.getSource()).isNotNull();
+    }
+
+    /**
+     * YARN application mode: {@code args[0]} is already the pipeline definition content (read on
+     * the client side), so {@code resolvePipelineDef} returns it verbatim and it parses correctly.
+     */
+    @Test
+    void testResolvePipelineDefFromInlineContent() throws Exception {
+        String pipelineDefContent = "source:\n  type: mysql\n\nsink:\n  type: kafka\n";
+
+        String resolved = CliExecutor.resolvePipelineDef(pipelineDefContent);
+        assertThat(resolved).isEqualTo(pipelineDefContent);
+
+        PipelineDef pipelineDef =
+                new YamlPipelineDefinitionParser().parse(resolved, new Configuration());
+        assertThat(pipelineDef.getSource()).isNotNull();
+    }
+
+    /**
+     * The FLINK-40005 root cause: passing a file PATH straight to the String (content) overload
+     * makes the parser treat the path as YAML content, yielding a scalar node without a {@code
+     * source}. {@link CliExecutor#resolvePipelineDef(String)} avoids this for the Kubernetes path
+     * by reading the file first.
+     */
+    @Test
+    void testParsingFilePathAsYamlContentFails() {
+        String pipelineDefPath = "/opt/flink/config/pipeline.yaml";
+        assertThatThrownBy(
+                        () ->
+                                new YamlPipelineDefinitionParser()
+                                        .parse(pipelineDefPath, new Configuration()))
+                .isInstanceOf(IllegalArgumentException.class)
+                .hasMessageContaining("Missing required field \"source\"");
+    }
+}