Commit d1a42c6

[doc] add FAQ for Stream Load Broken pipe when syncing large tables via Flink CDC (#3598)
## Summary

Add a new FAQ entry to the Flink Doris Connector docs (English/Chinese, dev/4.x) for the `I/O exception (java.net.SocketException) ... Broken pipe` error that users may encounter when syncing large tables (e.g. Oracle) via Flink CDC.
1 parent a3000c1 commit d1a42c6

4 files changed

Lines changed: 42 additions & 6 deletions


docs/ecosystem/flink-doris-connector/flink-doris-connector.md

Lines changed: 11 additions & 2 deletions
@@ -1130,7 +1130,7 @@ In the whole database synchronization tool provided by the Connector, no additio
 3. **errCode = 2, detailMessage = current running txns on db 10006 is 100, larger than limit 100**

-    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in `fe.conf`. For specific details, please refer to [max_running_txn_num_per_db](../../admin-manual/config/fe-config#max_running_txn_num_per_db).
+    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in `fe.conf`. For specific details, please refer to [FE Configuration](../../admin-manual/config/fe-config).

     Meanwhile, frequently modifying the label and restarting a task may also lead to this error. In the 2pc scenario (for Duplicate/Aggregate models), the label of each task needs to be unique. And when restarting from a checkpoint, the Flink task will actively abort the transactions that have been pre-committed successfully but not yet committed. Frequent label modifications and restarts will result in a large number of pre-committed successful transactions that cannot be aborted and thus occupy transactions. In the Unique model, 2pc can also be disabled to achieve idempotent writes.
@@ -1148,4 +1148,13 @@ In the whole database synchronization tool provided by the Connector, no additio
 7. **stream load error: HTTP/1.1 307 Temporary Redirect**

-    Flink first sends the request to the FE and, after receiving a 307, re-sends it to the BE it is redirected to. When the FE is in Full GC, under heavy pressure, or affected by network delay, HttpClient by default sends the data after waiting a certain period (3 seconds) without a response. Since the request body is an InputStream by default, the data cannot be replayed once a 307 response has been received, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to the BE (not applicable to some cloud scenarios); 3. For the unique key model, enable batch mode.
+    Flink first sends the request to the FE and, after receiving a 307, re-sends it to the BE it is redirected to. When the FE is in Full GC, under heavy pressure, or affected by network delay, HttpClient by default sends the data after waiting a certain period (3 seconds) without a response. Since the request body is an InputStream by default, the data cannot be replayed once a 307 response has been received, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to the BE (not applicable to some cloud scenarios); 3. For the unique key model, enable batch mode.
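As a sketch of options 2 and 3 above, both settings can be passed as sink options in a Flink SQL DDL. The table schema and connection values below are hypothetical placeholders; only `auto-redirect` and `sink.enable.batch-mode` are the options named in this FAQ entry.

```sql
-- Hypothetical table and connection values; only the last two options
-- come from this FAQ entry.
CREATE TABLE doris_sink (
    id   INT,
    name STRING
) WITH (
    'connector'        = 'doris',
    'fenodes'          = '127.0.0.1:8030',
    'table.identifier' = 'db.tbl',
    'username'         = 'root',
    'password'         = '',
    'auto-redirect'    = 'false',        -- option 2: request the BE directly
    'sink.enable.batch-mode' = 'true'    -- option 3: batch mode (unique key model)
);
```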
+
+8. **When using Flink CDC to sync large tables from databases such as Oracle, an `I/O exception (java.net.SocketException) ... Broken pipe` error is reported. How to handle it?**
+
+    This error usually occurs when the data volume of a single Stream Load request exceeds the limit on the BE side. It can be adjusted in the following ways:
+
+    - Increase the `streaming_load_max_mb` parameter in `be.conf` on the BE side (default 10240, in MB) so that a single Stream Load can carry more data; the BE must be restarted for the change to take effect.
+    - Enable batch mode (`sink.enable.batch-mode=true`) so that the Connector automatically splits the data into batches internally, avoiding too much data in a single Stream Load.
+    - Increase the parallelism of Oracle CDC by adding `--oracle-conf scan.incremental.snapshot.enabled=true` (an experimental feature) to the startup command, which enables parallel reading of Oracle full data.
+
+    For more Flink CDC related issues, please refer to the [Flink CDC FAQ](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.6/docs/faq/faq/).
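A minimal sketch of how the flags above might be combined in a whole-database sync startup command. The JAR name, paths, and connection values are hypothetical placeholders; only `scan.incremental.snapshot.enabled` and `sink.enable.batch-mode` come from this FAQ entry.

```shell
# Hypothetical paths, versions, and connection values; only the
# scan.incremental.snapshot.enabled and sink.enable.batch-mode flags
# are the ones named in this FAQ entry.
<FLINK_HOME>/bin/flink run \
    -c org.apache.doris.flink.tools.cdc.CdcTools \
    lib/flink-doris-connector-1.18-25.1.0.jar \
    oracle-sync-database \
    --database test_db \
    --oracle-conf hostname=127.0.0.1 \
    --oracle-conf port=1521 \
    --oracle-conf username=flinkuser \
    --oracle-conf password='******' \
    --oracle-conf database-name=ORCL \
    --oracle-conf schema-name=ADMIN \
    --oracle-conf scan.incremental.snapshot.enabled=true \
    --sink-conf fenodes=127.0.0.1:8030 \
    --sink-conf username=root \
    --sink-conf password= \
    --sink-conf sink.enable.batch-mode=true
```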

i18n/zh-CN/docusaurus-plugin-content-docs/current/ecosystem/flink-doris-connector/flink-doris-connector.md

Lines changed: 10 additions & 1 deletion
@@ -1131,7 +1131,7 @@ from KAFKA_SOURCE;
 3. **errCode = 2, detailMessage = current running txns on db 10006 is 100, larger than limit 100**

-    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in fe.conf. For details, see [max_running_txn_num_per_db](../../admin-manual/config/fe-config#max_running_txn_num_per_db).
+    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in fe.conf. For details, see [FE Configuration](../../admin-manual/config/fe-config).

     Meanwhile, frequently modifying the label and restarting a task may also lead to this error. In the 2pc scenario (Duplicate/Aggregate models), the label of each task must be unique, and only when restarting from a checkpoint will the Flink task actively abort transactions that were pre-committed successfully but not yet committed. Frequent label changes and restarts leave a large number of successfully pre-committed transactions that cannot be aborted, occupying transactions. In the Unique model, 2pc can also be disabled to achieve idempotent writes.

 4. **tablet writer write failed, tablet_id=190958, txn_id=3505530, err=-235**
@@ -1149,3 +1149,12 @@ from KAFKA_SOURCE;
 7. **stream load error: HTTP/1.1 307 Temporary Redirect**

     Flink first sends the request to the FE and, after receiving a 307, re-sends it to the BE it is redirected to. When the FE is in Full GC, under heavy pressure, or affected by network delay, HttpClient by default sends the data after waiting a certain period (3 seconds) without a response. Since the request body is an InputStream by default, the data cannot be replayed once a 307 response has been received, and an error is reported directly. There are three ways to solve this: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to the BE (not applicable to some cloud scenarios); 3. For the primary key (Unique) model, enable batch mode.
+
+8. **When using Flink CDC to sync large tables from databases such as Oracle, an `I/O exception (java.net.SocketException) ... Broken pipe` error is reported. How to handle it?**
+
+    This error is usually caused by the data volume of a single Stream Load write exceeding the limit on the BE side. It can be adjusted in the following ways:
+
+    - Increase the `streaming_load_max_mb` parameter in `be.conf` on the BE side (default 10240, in MB) to allow a single Stream Load to write more data; the BE must be restarted for the change to take effect.
+    - Enable batch mode (`sink.enable.batch-mode=true`) so the Connector automatically splits writes by batch size internally, avoiding too much data in a single write.
+    - Increase the parallelism of Oracle CDC by adding `--oracle-conf scan.incremental.snapshot.enabled=true` (an experimental feature) to the startup command, which enables parallel reading of Oracle full data.
+
+    For more Flink CDC related issues, see the [Flink CDC FAQ](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.6/docs/faq/faq/).

i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ecosystem/flink-doris-connector/flink-doris-connector.md

Lines changed: 10 additions & 1 deletion
@@ -1131,7 +1131,7 @@ from KAFKA_SOURCE;
 3. **errCode = 2, detailMessage = current running txns on db 10006 is 100, larger than limit 100**

-    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in fe.conf. For details, see [max_running_txn_num_per_db](../../admin-manual/config/fe-config#max_running_txn_num_per_db).
+    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in fe.conf. For details, see [FE Configuration](../../admin-manual/config/fe-config).

     Meanwhile, frequently modifying the label and restarting a task may also lead to this error. In the 2pc scenario (Duplicate/Aggregate models), the label of each task must be unique, and only when restarting from a checkpoint will the Flink task actively abort transactions that were pre-committed successfully but not yet committed. Frequent label changes and restarts leave a large number of successfully pre-committed transactions that cannot be aborted, occupying transactions. In the Unique model, 2pc can also be disabled to achieve idempotent writes.

 4. **tablet writer write failed, tablet_id=190958, txn_id=3505530, err=-235**
@@ -1149,3 +1149,12 @@ from KAFKA_SOURCE;
 7. **stream load error: HTTP/1.1 307 Temporary Redirect**

     Flink first sends the request to the FE and, after receiving a 307, re-sends it to the BE it is redirected to. When the FE is in Full GC, under heavy pressure, or affected by network delay, HttpClient by default sends the data after waiting a certain period (3 seconds) without a response. Since the request body is an InputStream by default, the data cannot be replayed once a 307 response has been received, and an error is reported directly. There are three ways to solve this: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to the BE (not applicable to some cloud scenarios); 3. For the primary key (Unique) model, enable batch mode.
+
+8. **When using Flink CDC to sync large tables from databases such as Oracle, an `I/O exception (java.net.SocketException) ... Broken pipe` error is reported. How to handle it?**
+
+    This error is usually caused by the data volume of a single Stream Load write exceeding the limit on the BE side. It can be adjusted in the following ways:
+
+    - Increase the `streaming_load_max_mb` parameter in `be.conf` on the BE side (default 10240, in MB) to allow a single Stream Load to write more data; the BE must be restarted for the change to take effect.
+    - Enable batch mode (`sink.enable.batch-mode=true`) so the Connector automatically splits writes by batch size internally, avoiding too much data in a single write.
+    - Increase the parallelism of Oracle CDC by adding `--oracle-conf scan.incremental.snapshot.enabled=true` (an experimental feature) to the startup command, which enables parallel reading of Oracle full data.
+
+    For more Flink CDC related issues, see the [Flink CDC FAQ](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.6/docs/faq/faq/).

versioned_docs/version-4.x/ecosystem/flink-doris-connector/flink-doris-connector.md

Lines changed: 11 additions & 2 deletions
@@ -1130,7 +1130,7 @@ In the whole database synchronization tool provided by the Connector, no additio
 3. **errCode = 2, detailMessage = current running txns on db 10006 is 100, larger than limit 100**

-    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in `fe.conf`. For specific details, please refer to [max_running_txn_num_per_db](../../admin-manual/config/fe-config#max_running_txn_num_per_db).
+    This is because the concurrent imports into the same database exceed 100. It can be solved by adjusting the parameter `max_running_txn_num_per_db` in `fe.conf`. For specific details, please refer to [FE Configuration](../../admin-manual/config/fe-config).

     Meanwhile, frequently modifying the label and restarting a task may also lead to this error. In the 2pc scenario (for Duplicate/Aggregate models), the label of each task needs to be unique. And when restarting from a checkpoint, the Flink task will actively abort the transactions that have been pre-committed successfully but not yet committed. Frequent label modifications and restarts will result in a large number of pre-committed successful transactions that cannot be aborted and thus occupy transactions. In the Unique model, 2pc can also be disabled to achieve idempotent writes.
@@ -1148,4 +1148,13 @@ In the whole database synchronization tool provided by the Connector, no additio
 7. **stream load error: HTTP/1.1 307 Temporary Redirect**

-    Flink first sends the request to the FE and, after receiving a 307, re-sends it to the BE it is redirected to. When the FE is in Full GC, under heavy pressure, or affected by network delay, HttpClient by default sends the data after waiting a certain period (3 seconds) without a response. Since the request body is an InputStream by default, the data cannot be replayed once a 307 response has been received, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to the BE (not applicable to some cloud scenarios); 3. For the unique key model, enable batch mode.
+    Flink first sends the request to the FE and, after receiving a 307, re-sends it to the BE it is redirected to. When the FE is in Full GC, under heavy pressure, or affected by network delay, HttpClient by default sends the data after waiting a certain period (3 seconds) without a response. Since the request body is an InputStream by default, the data cannot be replayed once a 307 response has been received, and an error is reported directly. There are three ways to solve this problem: 1. Upgrade to Connector 25.1.0 or above, which increases the default wait time; 2. Set auto-redirect=false to send requests directly to the BE (not applicable to some cloud scenarios); 3. For the unique key model, enable batch mode.
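The "cannot be replayed" behavior above can be illustrated with a small sketch (a plain Python stand-in, not connector code): once a stream-style body has been consumed by the first send, a naive resend after the 307 redirect finds nothing left to transmit.

```python
import io

# Stand-in for an InputStream request body holding the load payload.
body = io.BytesIO(b"1,alice\n2,bob\n")

# First attempt: the stream is fully consumed while sending to the FE.
first_send = body.read()
assert first_send == b"1,alice\n2,bob\n"

# After the 307 redirect, a naive retry re-reads the same stream:
# it is already exhausted, so the payload cannot be replayed.
second_send = body.read()
assert second_send == b""
```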
+
+8. **When using Flink CDC to sync large tables from databases such as Oracle, an `I/O exception (java.net.SocketException) ... Broken pipe` error is reported. How to handle it?**
+
+    This error usually occurs when the data volume of a single Stream Load request exceeds the limit on the BE side. It can be adjusted in the following ways:
+
+    - Increase the `streaming_load_max_mb` parameter in `be.conf` on the BE side (default 10240, in MB) so that a single Stream Load can carry more data; the BE must be restarted for the change to take effect.
+    - Enable batch mode (`sink.enable.batch-mode=true`) so that the Connector automatically splits the data into batches internally, avoiding too much data in a single Stream Load.
+    - Increase the parallelism of Oracle CDC by adding `--oracle-conf scan.incremental.snapshot.enabled=true` (an experimental feature) to the startup command, which enables parallel reading of Oracle full data.
+
+    For more Flink CDC related issues, please refer to the [Flink CDC FAQ](https://nightlies.apache.org/flink/flink-cdc-docs-release-3.6/docs/faq/faq/).
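A sketch of the first bullet as it might appear in `be.conf`; 20480 is an illustrative value, not a recommendation (10240 is the stated default).

```conf
# be.conf: raise the single Stream Load size cap from 10 GB to 20 GB.
# Restart the BE for the change to take effect.
streaming_load_max_mb = 20480
```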
