
Commit 734bfe8

fix(ci): make the Doris 4.0.3 shared container healthy on CI runners
After bumping the shared CI container from doris-all-in-one-2.1.0 to 4.0.3-all-slim (so CI matches the fixture image), the doris tests began hitting the 1h job cap.

Root cause, found by reproducing the container locally: 4.0.3's start_be.sh hard-`exit 1`s unless vm.max_map_count >= 2000000 (it also gates on swap and ulimit). The 2.1.0 image had no such gate, which is why it passed in 13min and 4.0.3 did not. On the runner's low default the BE exits on start; its death ends the entrypoint's `wait $child_pid`, so the container exits 0 and never reports healthy, which forces every test onto the slow per-test self-boot path (480s nextest timeout x 4 retries).

- Raise vm.max_map_count to 2000000 on the runner before launch (inherited by all containers) and pass SKIP_CHECK_ULIMIT=true to bypass the swap/ulimit gates without swapoff'ing the runner. Verified locally: BE reaches alive=true.
- Self-boot fixture sets SKIP_CHECK_ULIMIT=true too, so the fallback path and local Linux dev aren't blocked by the same gates.
- Cap FE heap to 2GB and BE mem_limit to 4GB: the container boots beside the cargo build on a 16GB runner, and 4.0.3 otherwise sizes for a dedicated host (8GB FE heap + ~90%-of-RAM BE), so bounding it keeps room for rustc.
- Make the shared-container wait non-fatal and dump docker logs/inspect on the unhealthy path, so a launch failure degrades to self-boot instead of aborting the step.

Also rustfmt the sqlx AssertSqlSafe call sites introduced by the sqlx 0.9 migration in the preceding master merge.

Verified in CI: shared container reports healthy and test-1/test-2 pass (~13min).
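
The vm.max_map_count gate from the root-cause analysis can be reproduced as a small pre-flight check. This is a minimal sketch, not the actual start_be.sh logic: the function name `map_count_ok` is hypothetical, and the threshold is the 2000000 quoted in the commit message.

```shell
# Hypothetical helper: succeed only if the given vm.max_map_count value
# meets the threshold (2000000, per the commit message).
# With no argument it reads the live value from /proc (Linux only).
map_count_ok() {
  mc_required=2000000
  mc_current="${1:-$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)}"
  case "$mc_current" in
    ''|*[!0-9]*) return 2 ;;  # not a non-negative integer: cannot evaluate
  esac
  [ "$mc_current" -ge "$mc_required" ]
}

# Example: a low value such as 65530 fails the check.
if map_count_ok 65530; then
  echo "gate passes"
else
  echo "gate fails"
fi
```

Running such a check before `docker run` would surface the problem as a one-line message rather than a container that exits without ever reporting healthy.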
1 parent e66c62b commit 734bfe8

2 files changed

Lines changed: 48 additions & 12 deletions


.github/actions/rust/pre-merge/action.yml

Lines changed: 30 additions & 1 deletion
@@ -229,9 +229,26 @@ runs:
 DORIS_STARTED=0
 if [[ "$RUNNER_OS" == "Linux" ]] && { [[ -z "$NEXTEST_FILTER" ]] || echo "$NEXTEST_FILTER" | grep -qi 'integration'; }; then
   echo "::group::Launch shared Doris container (boots during build)"
+  # ROOT CAUSE of the regression: 4.0.3's start_be.sh hard-`exit 1`s unless
+  # `vm.max_map_count >= 2000000` (the 2.1.0 image had no such gate, which is
+  # why it worked). The runner default is far lower, so the BE died on start;
+  # its death ends the entrypoint's `wait`, the container exits, and every
+  # doris test falls onto the slow per-test self-boot path → 1h timeout.
+  # Raise it on the host (inherited by all containers, shared and self-boot);
+  # reproduced locally that this brings the BE alive. `SKIP_CHECK_ULIMIT`
+  # additionally bypasses start_be.sh's swap/ulimit gates so we needn't
+  # swapoff the runner.
+  sudo sysctl -w vm.max_map_count=2000000 || true
+  # Secondary: this boots *concurrently with the cargo build* on a 16GB
+  # runner, and 4.0.3 sizes for a dedicated host (FE 8GB -Xms heap, BE
+  # mem_limit `auto` ≈ 90% of RAM). Cap both (FE 2GB + BE 4GB) so Doris fits
+  # beside the build, ~matching the 2.1.0 footprint. Override the entrypoint
+  # to rewrite the confs, then hand off to the image's normal entry_point.sh.
   if docker run -d --name iggy-doris-shared \
     -p 8030 -p 9030 -p 8040:8040 \
-    apache/doris:4.0.3-all-slim; then
+    -e SKIP_CHECK_ULIMIT=true \
+    --entrypoint bash apache/doris:4.0.3-all-slim \
+    -c "sed -i 's/-Xmx8192m -Xms8192m/-Xmx2048m -Xms2048m/' /opt/apache-doris/fe/conf/fe.conf && echo 'mem_limit = 4096M' >> /opt/apache-doris/be/conf/be.conf && exec bash /usr/local/bin/entry_point.sh"; then
     DORIS_STARTED=1
   else
     echo "::warning::Failed to start shared Doris container; doris tests will self-boot"
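
One fragility of the entrypoint override in this hunk is that `sed` exits 0 even when its pattern matches nothing, so an image whose default heap flags differ would silently keep the uncapped heap. A minimal sketch of a guarded rewrite follows, run against a throwaway file rather than the real fe.conf. The function name `cap_fe_heap` and the sample `JAVA_OPTS` line are illustrative assumptions, not the image's actual contents.

```shell
# Hypothetical helper: rewrite the heap flags in the given conf file and
# fail loudly if the expected flags are not present.
cap_fe_heap() {
  conf="$1"
  if ! grep -q -- '-Xmx8192m -Xms8192m' "$conf"; then
    echo "heap flags not found in $conf; leaving it unchanged" >&2
    return 1
  fi
  # Write to a temp file and move it into place (portable alternative to sed -i).
  sed 's/-Xmx8192m -Xms8192m/-Xmx2048m -Xms2048m/' "$conf" > "$conf.tmp" &&
    mv "$conf.tmp" "$conf"
}

# Demo against a throwaway file with an illustrative JAVA_OPTS line.
demo_conf="$(mktemp)"
echo 'JAVA_OPTS="-Xmx8192m -Xms8192m -Dfile.encoding=UTF-8"' > "$demo_conf"
cap_fe_heap "$demo_conf" && cat "$demo_conf"
rm -f "$demo_conf"
```

The same idea could be folded into the `-c` string by adding a `grep -q` before the `sed`, at the cost of a longer one-liner.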
@@ -296,6 +313,18 @@ runs:
     echo "Shared Doris ready: FE ${DORIS_FE_URL}, MySQL ${MYSQL_PORT}"
   else
     echo "::warning::Shared Doris did not become healthy; doris tests will self-boot"
+    # Diagnostics: why did 4.0.3-all-slim not become healthy as a bare
+    # `docker run` (2.1.0 did, and 4.0.3 self-boots fine via testcontainers)?
+    # Capture container state (Exited? OOMKilled? exit code) and the tail
+    # of its logs so we can tell crash/OOM apart from slow-boot.
+    echo "::group::Shared Doris diagnostics"
+    docker ps -a --filter name=iggy-doris-shared \
+      --format 'status={{.Status}} state={{.State}}' 2>&1 || true
+    docker inspect iggy-doris-shared \
+      --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}}' 2>&1 || true
+    echo "--- last 60 lines of container logs ---"
+    docker logs --tail 60 iggy-doris-shared 2>&1 || true
+    echo "::endgroup::"
     docker rm -f iggy-doris-shared >/dev/null 2>&1 || true
   fi
   echo "::endgroup::"
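
The non-fatal wait described in the commit message can be sketched as a bounded polling loop that returns a status instead of exiting the step. This is an illustration only: `wait_for` is a hypothetical name, and the real step's health probe and timeouts are not shown in this diff, so `true` and `false` stand in for the probe.

```shell
# Hypothetical helper: run a probe command until it succeeds or the attempt
# budget is spent. Returns 0 on success, 1 on timeout; never exits the shell.
# usage: wait_for <attempts> <sleep_seconds> <command...>
wait_for() {
  wf_attempts="$1"; wf_sleep="$2"; shift 2
  wf_i=0
  while [ "$wf_i" -lt "$wf_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wf_i=$((wf_i + 1))
    sleep "$wf_sleep"
  done
  return 1
}

# Demo with stand-in probes: `true` is healthy at once, `false` never is.
if wait_for 3 0 true; then echo "ready"; else echo "fall back to self-boot"; fi
if wait_for 3 0 false; then echo "ready"; else echo "fall back to self-boot"; fi
```

Because the helper only returns a status, the caller decides what a timeout means. Here that is the diagnostics dump followed by the self-boot fallback, rather than a failed step.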

core/integration/tests/connectors/fixtures/doris/container.rs

Lines changed: 18 additions & 11 deletions
@@ -212,6 +212,13 @@ impl DorisContainer {
                     .with_port(FE_HTTP_PORT.tcp())
                     .with_expected_status_code(200u16),
             ))
+            // Doris 4.0.3's start_be.sh refuses to start (exit 1) unless
+            // vm.max_map_count >= 2000000 and swap is off and ulimit -n >= 60000.
+            // Hosts (CI runners, dev laptops) rarely satisfy all three, and the BE
+            // dying makes the container exit before it ever reports healthy. Skip
+            // those gate checks; the test workload is tiny so the conservative
+            // thresholds don't matter. (CI also raises vm.max_map_count on the host.)
+            .with_env_var("SKIP_CHECK_ULIMIT", "true")
             .with_network(unique_network)
             .with_mapped_port(0, FE_HTTP_PORT.tcp())
             .with_mapped_port(0, FE_MYSQL_PORT.tcp())
@@ -357,12 +364,12 @@ impl DorisContainer {
             "CREATE DATABASE IF NOT EXISTS {}",
             self.database
         )))
-            .execute(&pool)
-            .await
-            .map_err(|e| TestBinaryError::FixtureSetup {
-                fixture_type: "DorisContainer".to_string(),
-                message: format!("Failed to create test database: {e}"),
-            })?;
+        .execute(&pool)
+        .await
+        .map_err(|e| TestBinaryError::FixtureSetup {
+            fixture_type: "DorisContainer".to_string(),
+            message: format!("Failed to create test database: {e}"),
+        })?;
         Ok(())
     }
 }
@@ -457,11 +464,11 @@ pub trait DorisOps: Sync {
         let row = sqlx::raw_sql(sqlx::AssertSqlSafe(format!(
             "SELECT COUNT(*) AS c FROM {database}.{table}"
         )))
-            .fetch_one(&pool)
-            .await
-            .map_err(|e| TestBinaryError::InvalidState {
-                message: format!("Failed to count rows in {database}.{table}: {e}"),
-            })?;
+        .fetch_one(&pool)
+        .await
+        .map_err(|e| TestBinaryError::InvalidState {
+            message: format!("Failed to count rows in {database}.{table}: {e}"),
+        })?;
         // Doris returns COUNT(*) as BIGINT; sqlx decodes that to i64.
         let count: i64 = row.try_get(0).map_err(|e| TestBinaryError::InvalidState {
             message: format!("Failed to read count column: {e}"),
