Commit ab7ae60

fix(ci): raise vm.max_map_count so the Doris 4.0.3 BE can start
Root cause of the doris CI hang, found by reproducing the container locally: 4.0.3's start_be.sh hard-`exit 1`s unless vm.max_map_count >= 2000000 (it also checks that swap is off and that ulimit -n >= 60000). The 2.1.0 image had no such gate, which is exactly why it passed in 13min while 4.0.3 did not.

On the runner's low default the BE exits immediately; that ends the entrypoint's `wait $child_pid`, the container exits and never reports healthy, and every doris test then stalls ~360s in wait_for_be_alive (480s nextest timeout x 4 retries -> 1h job cap).

Fix:
- Shared container: raise vm.max_map_count on the host before launch (inherited by all containers) and pass SKIP_CHECK_ULIMIT=true to bypass the swap/ulimit gates without swapoff'ing the runner. Verified locally that this brings the BE alive=true (~40s) even under arm64 emulation.
- Self-boot fixture: set SKIP_CHECK_ULIMIT=true too, so the fallback path (and local Linux dev, which can't easily sysctl) isn't blocked by the same gates.

The earlier FE-2GB / BE-4GB memory caps stay: this still boots beside the cargo build on a 16GB runner, so bounding Doris to ~6GB keeps room for rustc.
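The gate logic described above can be modelled as a small sketch. This is not the code in start_be.sh; it is a hypothetical Rust model (the `HostLimits` type and `be_start_blockers` function are invented for illustration) that encodes the thresholds quoted in this message. It assumes, as the fix implies, that SKIP_CHECK_ULIMIT covers only the swap and `ulimit -n` checks, so vm.max_map_count still has to be raised on the host.

```rust
/// Host settings that, per the commit message, Doris 4.0.3's start_be.sh inspects.
/// Hypothetical type, used only for this illustration.
#[derive(Debug, Clone, Copy)]
pub struct HostLimits {
    pub max_map_count: u64,
    pub swap_enabled: bool,
    pub open_files: u64,
}

/// Thresholds quoted in the commit message.
pub const MIN_MAX_MAP_COUNT: u64 = 2_000_000;
pub const MIN_OPEN_FILES: u64 = 60_000;

/// Returns the reasons the BE would refuse to start; an empty list means
/// every modelled gate passes.
pub fn be_start_blockers(host: HostLimits, skip_check_ulimit: bool) -> Vec<&'static str> {
    let mut blockers = Vec::new();

    // Assumption: this gate is NOT bypassed by SKIP_CHECK_ULIMIT, which is
    // why the fix also runs `sysctl -w vm.max_map_count=2000000` on the host.
    if host.max_map_count < MIN_MAX_MAP_COUNT {
        blockers.push("vm.max_map_count below 2000000");
    }

    // Assumption: SKIP_CHECK_ULIMIT=true skips the swap and open-file gates.
    if !skip_check_ulimit {
        if host.swap_enabled {
            blockers.push("swap is enabled");
        }
        if host.open_files < MIN_OPEN_FILES {
            blockers.push("ulimit -n below 60000");
        }
    }

    blockers
}
```

Under this model, a host with a low vm.max_map_count, swap on, and a small `ulimit -n` reports three blockers; setting SKIP_CHECK_ULIMIT leaves only the vm.max_map_count blocker, and raising that value on the host clears the list.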
1 parent aa53589 commit ab7ae60

2 files changed

Lines changed: 23 additions & 13 deletions

.github/actions/rust/pre-merge/action.yml

Lines changed: 16 additions & 13 deletions
@@ -229,21 +229,24 @@ runs:
  DORIS_STARTED=0
  if [[ "$RUNNER_OS" == "Linux" ]] && { [[ -z "$NEXTEST_FILTER" ]] || echo "$NEXTEST_FILTER" | grep -qi 'integration'; }; then
  echo "::group::Launch shared Doris container (boots during build)"
- # This container boots *concurrently with the cargo build* on a 16GB
- # GitHub-hosted runner, so Doris and rustc share that RAM. Out of the box
- # 4.0.3-all-slim sizes itself for a dedicated host: the FE reserves an 8GB
- # JVM heap (-Xms8192m, a hard commit) and the BE's mem_limit defaults to
- # `auto` (~90% of RAM ≈ 14GB). Under build pressure that over-commit gets
- # the BE OOM-killed; its death ends the entrypoint's `wait`, so the
- # container exits and never becomes healthy — which forces every doris
- # test onto the slow per-test self-boot path (1h timeout). Reproduced the
- # bare run locally: capping the FE heap brought it healthy in ~10s.
- # Bound both (FE 2GB hard heap + BE 4GB) so Doris fits beside the build,
- # roughly matching the footprint of the 2.1.0 image that used to pass.
- # Override the entrypoint to rewrite the confs, then hand off to the
- # image's normal entry_point.sh.
+ # ROOT CAUSE of the regression: 4.0.3's start_be.sh hard-`exit 1`s unless
+ # `vm.max_map_count >= 2000000` (the 2.1.0 image had no such gate, which is
+ # why it worked). The runner default is far lower, so the BE died on start;
+ # its death ends the entrypoint's `wait`, the container exits, and every
+ # doris test falls onto the slow per-test self-boot path → 1h timeout.
+ # Raise it on the host (inherited by all containers, shared and self-boot);
+ # reproduced locally that this brings the BE alive. `SKIP_CHECK_ULIMIT`
+ # additionally bypasses start_be.sh's swap/ulimit gates so we needn't
+ # swapoff the runner.
+ sudo sysctl -w vm.max_map_count=2000000 || true
+ # Secondary: this boots *concurrently with the cargo build* on a 16GB
+ # runner, and 4.0.3 sizes for a dedicated host (FE 8GB -Xms heap, BE
+ # mem_limit `auto` ≈ 90% of RAM). Cap both (FE 2GB + BE 4GB) so Doris fits
+ # beside the build, ~matching the 2.1.0 footprint. Override the entrypoint
+ # to rewrite the confs, then hand off to the image's normal entry_point.sh.
  if docker run -d --name iggy-doris-shared \
  -p 8030 -p 9030 -p 8040:8040 \
+ -e SKIP_CHECK_ULIMIT=true \
  --entrypoint bash apache/doris:4.0.3-all-slim \
  -c "sed -i 's/-Xmx8192m -Xms8192m/-Xmx2048m -Xms2048m/' /opt/apache-doris/fe/conf/fe.conf && echo 'mem_limit = 4096M' >> /opt/apache-doris/be/conf/be.conf && exec bash /usr/local/bin/entry_point.sh"; then
  DORIS_STARTED=1

core/integration/tests/connectors/fixtures/doris/container.rs

Lines changed: 7 additions & 0 deletions
@@ -212,6 +212,13 @@ impl DorisContainer {
  .with_port(FE_HTTP_PORT.tcp())
  .with_expected_status_code(200u16),
  ))
+ // Doris 4.0.3's start_be.sh refuses to start (exit 1) unless
+ // vm.max_map_count >= 2000000 and swap is off and ulimit -n >= 60000.
+ // Hosts (CI runners, dev laptops) rarely satisfy all three, and the BE
+ // dying makes the container exit before it ever reports healthy. Skip
+ // those gate checks; the test workload is tiny so the conservative
+ // thresholds don't matter. (CI also raises vm.max_map_count on the host.)
+ .with_env_var("SKIP_CHECK_ULIMIT", "true")
  .with_network(unique_network)
  .with_mapped_port(0, FE_HTTP_PORT.tcp())
  .with_mapped_port(0, FE_MYSQL_PORT.tcp())
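
Because the fixture comment notes that local Linux hosts rarely meet the vm.max_map_count threshold, a host-side preflight could surface the problem before the container is started. The following is a minimal sketch, not part of this commit: the function names are hypothetical, and it relies only on the standard Linux procfs path `/proc/sys/vm/max_map_count`.

```rust
use std::fs;

/// Threshold quoted in the fixture comment above.
pub const REQUIRED_MAX_MAP_COUNT: u64 = 2_000_000;

/// Parses the contents of /proc/sys/vm/max_map_count
/// (a decimal number followed by a newline).
pub fn parse_max_map_count(raw: &str) -> Option<u64> {
    raw.trim().parse().ok()
}

/// Reads the current host value. Returns None when the file is missing or
/// unparsable, for example on a non-Linux development machine.
pub fn host_max_map_count() -> Option<u64> {
    fs::read_to_string("/proc/sys/vm/max_map_count")
        .ok()
        .and_then(|s| parse_max_map_count(&s))
}

/// Builds a human-readable preflight message for the given reading.
/// Hypothetical helper; the wording is illustrative only.
pub fn preflight_message(value: Option<u64>) -> String {
    match value {
        Some(v) if v >= REQUIRED_MAX_MAP_COUNT => {
            format!("ok: vm.max_map_count={}", v)
        }
        Some(v) => format!(
            "too low: vm.max_map_count={}; try `sudo sysctl -w vm.max_map_count={}`",
            v, REQUIRED_MAX_MAP_COUNT
        ),
        None => String::from(
            "unknown: could not read /proc/sys/vm/max_map_count (non-Linux host?)",
        ),
    }
}
```

A test fixture could log `preflight_message(host_max_map_count())` before launching the container, so that a BE which exits on start is explained up front rather than after the health-check timeout.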
