[improvement](build) Speed up Hive thirdparty startup with named volumes, OSS baseline, and parallel HQL refresh #62103
Open
suxiaogang223 wants to merge 4 commits into apache:master
suxiaogang223 commented: run external
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 7, 2026
### What problem does this PR solve?
Issue Number: None
Related PR: apache#62103
Problem Summary: Fix Hive startup timeout in external pipelines by replacing netcat-based Hive metastore port probes with bash /dev/tcp checks, because the Hive container image does not reliably provide nc and the previous implementation could block metastore readiness forever.
### Release note
None
### Check List (For Author)
- Test: Script validation only
    - No need to test (with reason): validated updated shell scripts with bash -n; waiting for external pipeline rerun to verify runtime behavior
- Behavior changed: No
- Does this need documentation: No
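A port probe built on bash's `/dev/tcp` redirection might look like the sketch below. It is only an illustration of the technique named in the commit message, not the code from the commit: the function names `port_open` and `wait_for_port`, the attempt count, and the delay are all assumptions.

```shell
#!/usr/bin/env bash
# Minimal sketch: probe a TCP port without nc, using bash's /dev/tcp.
# port_open and wait_for_port are hypothetical names for illustration.

# Return 0 if HOST:PORT accepts a TCP connection, non-zero otherwise.
port_open() {
    local host="$1" port="$2"
    # The subshell opens fd 3 on the socket; the fd closes when it exits.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Poll until the port opens or the attempt budget is exhausted,
# so a missing service can never block the caller forever.
wait_for_port() {
    local host="$1" port="$2" attempts="${3:-60}" delay="${4:-1}"
    local i
    for ((i = 1; i <= attempts; i++)); do
        if port_open "$host" "$port"; then
            return 0
        fi
        sleep "$delay"
    done
    echo "timed out waiting for ${host}:${port}" >&2
    return 1
}
```

A caller could run something like `wait_for_port <metastore-host> 9083 120 1`, where 9083 is the default Hive metastore port and the host is whatever the compose file assigns. If a firewall silently drops packets, a single connect can still wait for the kernel timeout; wrapping the probe in coreutils `timeout` is one way to bound that.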
Force-pushed from bec5cc8 to aca9035.
…, and metadata backend
### What problem does this PR solve?
Related Issue: apache#62101
Related PR: apache#62102
Problem Summary: This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.
Main changes:
- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting
### Release note
None
### Check List (For Author)
- Test: Manual test
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
    - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
    - Hive startup now follows mode/module-based refresh semantics
    - Default Hive JuiceFS metadata backend is PostgreSQL (still overridable by JFS_CLUSTER_META)
- Does this need documentation: No
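To make the option semantics concrete, here is a rough sketch of how `--hive-mode` and `--hive-modules` could be parsed and validated. It is not taken from `run-thirdparties-docker.sh`: the function name `parse_hive_args` and the variables `HIVE_MODE` and `HIVE_MODULES` are hypothetical, and the sketch stores the `--hive-modules` value verbatim because the commit message does not specify its list syntax. The `refresh` default follows the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical parser for the --hive-mode / --hive-modules options.
# Names are illustrative; they are not the real script's variables.

HIVE_MODE="refresh"   # the PR description names refresh as the default
HIVE_MODULES=""       # empty is taken here to mean "all modules"

parse_hive_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --hive-mode)
                case "${2:-}" in
                    fast|refresh|rebuild) HIVE_MODE="$2" ;;
                    *) echo "invalid --hive-mode: '${2:-}'" >&2; return 2 ;;
                esac
                shift 2
                ;;
            --hive-modules)
                if [ -z "${2:-}" ]; then
                    echo "--hive-modules needs a value" >&2
                    return 2
                fi
                HIVE_MODULES="$2"
                shift 2
                ;;
            *)
                # Options unrelated to Hive (for example -c hive3) are skipped.
                shift
                ;;
        esac
    done
}
```

Rejecting unknown modes up front keeps a typo such as `--hive-mode rebuld` from silently falling back to the default.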
Force-pushed from aca9035 to 3943667.
suxiaogang223 commented: run external
suxiaogang223 commented: run external
Force-pushed from 88c6ce3 to d256249.
suxiaogang223 commented: run buildall
suxiaogang223 commented: run external
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Consolidate the Hive docker startup, shared-volume baseline, and OSS baseline publishing changes into a single commit. This keeps Hive metadata on a stable shared hostname, makes rebuild/refresh work with shared container names, versions the downloaded baseline cache and OSS filenames, and bumps the baseline version used by CI.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Verified hive2/hive3 rebuild and refresh on aliyun shared-volume environment
- Published updated hive2/hive3 baseline tarballs to OSS with versioned filenames
- Behavior changed: Yes (Hive shared-volume startup and baseline restore now use stable host identity and versioned baseline artifacts)
- Does this need documentation: No
Force-pushed from 4a92cca to 554e816.
suxiaogang223 commented: run external
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Update the Hive docker documentation to reflect the versioned baseline cache naming, versioned OSS tarball naming, and the fact that HIVE_BASELINE_VERSION is the single variable that controls baseline rollout.
### Release note
None
### Check List (For Author)
- Test: No need to test (documentation only)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Prevent CI hosts from reusing truncated Hive baseline cache tarballs by validating cached and freshly downloaded baseline archives before restore, and by downloading into a temporary file before replacing the cache.
### Release note
None
### Check List (For Author)
- Test: No need to test (download/cache hardening only)
- Behavior changed: Yes (corrupt Hive baseline cache files are discarded and re-downloaded)
- Does this need documentation: No
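The hardening described above (validate before restore, download to a temporary file, then replace the cache) can be sketched as follows. This is an assumption-laden illustration rather than the commit's code: `is_valid_tarball`, `fetch_baseline`, and `ensure_baseline_cache` are hypothetical names, the validity check is simply `tar -tzf`, and the download is replaced by a local `cp` so the sketch runs without network access.

```shell
#!/usr/bin/env bash
# Sketch: validate a cached baseline archive and replace it atomically.
# All function names here are hypothetical.

# Succeeds only if the file is non-empty and tar can list the whole
# gzip-compressed archive; a truncated download fails this check.
is_valid_tarball() {
    [ -s "$1" ] && tar -tzf "$1" >/dev/null 2>&1
}

# Stand-in downloader: copies from a local path. A real script would
# call curl or wget here instead.
fetch_baseline() {
    cp "$1" "$2"
}

# Ensure CACHE holds a valid archive, re-fetching from SRC if needed.
ensure_baseline_cache() {
    local src="$1" cache="$2" tmp
    if is_valid_tarball "$cache"; then
        return 0                      # cached copy is intact, reuse it
    fi
    rm -f "$cache"                    # discard truncated or corrupt cache
    tmp="$(mktemp "${cache}.XXXXXX")" || return 1
    if fetch_baseline "$src" "$tmp" && is_valid_tarball "$tmp"; then
        mv -f "$tmp" "$cache"         # rename within the same directory
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Creating the temporary file next to the cache file keeps the final `mv` on one filesystem, so other processes see either the old cache state or the complete new archive, never a partial download.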
suxiaogang223 commented: run external
### What problem does this PR solve?
Related Issue: #62101
Problem Summary:
Hive thirdparty startup in `docker/thirdparties` was slow, stateful in hard-to-reason ways, and hard to iterate on in CI. This PR restructures it; the main changes are listed below.
### Main changes
- **Docker named volumes replace bind mounts.** Hive NameNode, DataNode, Postgres metastore, and the Hive state dir now live in four Docker named volumes per version (`<CONTAINER_UID><hive_version>-{namenode,datanode,pgdata,state}`). This makes volume lifecycle explicit (`docker volume create/rm`), avoids host-path permission issues, and lets us snapshot/restore volumes atomically.
- **OSS baseline snapshot + auto-download.** On first start, if the volumes are empty, the script downloads a pre-tarred baseline (`<hive_version>-baseline-<arch>.tar.gz`) from a configurable OSS prefix (defaults to `doris-thirdparty.oss-cn-beijing.aliyuncs.com/thirdparties/hive-baseline`) and unpacks it directly into the four volumes via a single `alpine tar` container. Tarballs are cached under `/tmp/hive-baseline-cache/` and keyed per Hive version. `snapshot-hive-baseline.sh` is the inverse operation used to produce the tarball.
- **Structured startup modes (`--hive-mode fast|refresh|rebuild`).**
    - `fast`: skip compose up if the stack is already healthy; skip data refresh entirely.
    - `refresh` (default): skip compose up if healthy; re-run only modules / HQL files whose SHA changed.
    - `rebuild`: tear down compose, wipe the four volumes, full cold start.
- **Module-scoped, SHA-based incremental refresh.** Nine modules (`default`, `multi_catalog`, `partition_type`, `statistics`, `tvf`, `regression`, `test`, `preinstalled_hql`, `view`) each track a content SHA under `/mnt/state/modules/`. Only modules with a SHA mismatch get re-run. `--hive-modules` limits refresh to a subset.
- **Parallel `preinstalled_hql` execution.** The ~77 HQL files in `create_preinstalled_scripts/` are now refreshed in two phases: a serial SHA check, then `xargs -P ${LOAD_PARALLEL}` for the files that actually changed. Combined with a beeline-backed `hive` shim on PATH (avoids the per-call Hive CLI JVM cold-start), this is the single largest wall-clock win.
- **Bootstrap groups (`common`, `hive2_only`, `hive3_only`).** Files that only belong to one Hive version are listed under `scripts/bootstrap/<group>.<kind>.list` and skipped for the other version, so Hive2 no longer runs Hive3-only HQL and vice versa.
- **HQL idempotency.** All `create_preinstalled_scripts/*.hql` and module `create_table.hql` files are rewritten to `DROP TABLE IF EXISTS` + `CREATE`, so reruns are safe whether the baseline is fresh or already populated.
- **Logs and observability.** Per-phase timing (`compose up`, `init-hive-baseline`, `refresh-hive-modules`, per-HQL `BEGIN/END took=Ns`) is printed to the component log. Verbose xtrace is gated by `HIVE_DEBUG=1`.
- **Documentation.** New Hive docker README covers architecture, modes, modules, bootstrap groups, how to add test data, and troubleshooting.
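The two-phase refresh (serial SHA check, then `xargs -P` over the changed files) could be sketched roughly as below. This illustrates the pattern only and is not the PR's implementation: `STATE_DIR`, the one-`.sha`-file-per-HQL layout, the choice of `sha256sum`, and the helpers `content_sha`, `list_changed`, `run_hql`, and `refresh_changed` are assumptions, and `run_hql` just echoes instead of invoking beeline.

```shell
#!/usr/bin/env bash
# Sketch of a two-phase refresh: serial SHA check, then parallel runs.
# STATE_DIR, the .sha layout, and all function names are assumptions.

STATE_DIR="${STATE_DIR:-/tmp/hive-state-demo}"
LOAD_PARALLEL="${LOAD_PARALLEL:-4}"

# Print the SHA-256 of a file's content.
content_sha() {
    sha256sum "$1" | cut -d' ' -f1
}

# Phase 1 (serial): print the files whose SHA differs from the record.
list_changed() {
    local f recorded
    for f in "$@"; do
        recorded="$(cat "${STATE_DIR}/$(basename "$f").sha" 2>/dev/null || true)"
        if [ "$(content_sha "$f")" != "$recorded" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Placeholder for the real work (for example, running the file through
# beeline). The new SHA is recorded only after the run succeeds.
run_hql() {
    echo "ran $1" && content_sha "$1" > "${STATE_DIR}/$(basename "$1").sha"
}
export -f run_hql content_sha
export STATE_DIR

# Phase 2 (parallel): run only changed files, LOAD_PARALLEL at a time.
# Assumes file names without quotes, backslashes, or newlines.
refresh_changed() {
    mkdir -p "$STATE_DIR"
    list_changed "$@" |
        xargs -P "$LOAD_PARALLEL" -I{} bash -c 'run_hql "$1"' _ {}
}
```

Recording the SHA only after a successful run means a file that fails is picked up again on the next refresh, and a second invocation with unchanged files does no work at all.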
### How to use
### Release note
None
### Check List (For Author)
- `--hive-mode refresh` (healthy stack, no change): data refresh ~6-7s
- `--hive-mode refresh` (HQL content change): only changed files re-run in parallel
- `--hive-mode rebuild`: full cold start from empty volumes
- `--hive-mode fast`: stack reuse, ~1-2s
- `--hive-mode` / `--hive-modules` semantics
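Timings like the ones quoted in the checklist can be collected with a small wrapper that prints `BEGIN` and `END took=Ns` lines around each phase. The sketch below is one possible shape for that; the helper name `timed_phase`, the bracketed prefix, and the `rc=` suffix are assumptions, not the log format used by the PR.

```shell
#!/usr/bin/env bash
# Sketch of per-phase timing in a "BEGIN / END took=Ns" style.
# timed_phase and the exact line format are hypothetical.

timed_phase() {
    local name="$1" start end rc=0
    shift
    start="$(date +%s)"
    echo "[${name}] BEGIN"
    # Run the phase; capture its exit status without aborting the wrapper.
    "$@" || rc=$?
    end="$(date +%s)"
    echo "[${name}] END took=$((end - start))s rc=${rc}"
    return "$rc"
}
```

A phase would then be invoked as, for example, `timed_phase refresh-hive-modules some_refresh_command`, where `some_refresh_command` stands for whatever function or script implements that phase. Whole-second resolution from `date +%s` is enough to tell a cold start from a quick stack reuse.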