[improvement](build) Speed up Hive thirdparty startup with named volumes, OSS baseline, and parallel HQL refresh #62103

Open
suxiaogang223 wants to merge 4 commits into apache:master from suxiaogang223:codex/hive-startup-structured-modes

@suxiaogang223 suxiaogang223 commented Apr 3, 2026

What problem does this PR solve?

Related Issue: #62101

Problem Summary:

Hive thirdparty startup in docker/thirdparties was slow, carried state in ways that were hard to reason about, and was hard to iterate on in CI. This PR restructures it so that:

  • a fresh CI host can restore a ready-to-use Hive stack from a pre-built baseline in under a minute, instead of paying the full HDFS + HMS + HQL bootstrap cost every run,
  • repeated local runs only re-execute the HQL / run.sh files that actually changed,
  • Hive2 and Hive3 no longer share any on-disk state, so the two versions can be started, rebuilt, or refreshed independently.

Main changes

  1. Docker named volumes replace bind mounts. Hive NameNode, DataNode, Postgres metastore, and the Hive state dir now live in four Docker named volumes per version (<CONTAINER_UID><hive_version>-{namenode,datanode,pgdata,state}). This makes volume lifecycle explicit (docker volume create/rm), avoids host-path permission issues, and lets us snapshot/restore volumes atomically.

  2. OSS baseline snapshot + auto-download. On first start, if the volumes are empty, the script downloads a pre-tarred baseline (<hive_version>-baseline-<arch>.tar.gz) from a configurable OSS prefix (defaults to doris-thirdparty.oss-cn-beijing.aliyuncs.com/thirdparties/hive-baseline) and unpacks it directly into the four volumes via a single alpine tar container. Tarballs are cached under /tmp/hive-baseline-cache/ and keyed per Hive version. snapshot-hive-baseline.sh is the inverse operation used to produce the tarball.

  3. Structured startup modes (--hive-mode fast|refresh|rebuild).

    • fast: skip compose up if the stack is already healthy; skip data refresh entirely.
    • refresh (default): skip compose up if healthy; re-run only modules / HQL files whose SHA changed.
    • rebuild: tear down compose, wipe the four volumes, full cold start.
  4. Module-scoped, SHA-based incremental refresh. Nine modules (default, multi_catalog, partition_type, statistics, tvf, regression, test, preinstalled_hql, view) each track a content SHA under /mnt/state/modules/. Only modules with a SHA mismatch get re-run. --hive-modules limits refresh to a subset.

  5. Parallel preinstalled_hql execution. The ~77 HQL files in create_preinstalled_scripts/ are now refreshed in two phases — serial SHA check, then xargs -P ${LOAD_PARALLEL} for the files that actually changed. Combined with a beeline-backed hive shim on PATH (avoids the per-call Hive CLI JVM cold-start), this is the single largest wall-clock win.

  6. Bootstrap groups (common, hive2_only, hive3_only). Files that only belong to one Hive version are listed under scripts/bootstrap/<group>.<kind>.list and skipped for the other version, so Hive2 no longer runs Hive3-only HQL and vice versa.

  7. HQL idempotency. All create_preinstalled_scripts/*.hql and module create_table.hql files are rewritten to DROP TABLE IF EXISTS + CREATE, so reruns are safe whether the baseline is fresh or already populated.

  8. Logs and observability. Per-phase timing (compose up, init-hive-baseline, refresh-hive-modules, per-HQL BEGIN/END took=Ns) is printed to the component log. Verbose xtrace is gated by HIVE_DEBUG=1.

  9. Documentation. New Hive docker README covers architecture, modes, modules, bootstrap groups, how to add test data, and troubleshooting.
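
The two-phase refresh described in item 5 (serial SHA check, then a parallel run of only the changed files) can be sketched roughly as follows. This is a minimal illustration and not the PR's actual script: the function names, the per-file `.sha` state layout, and the single-executable `runner` argument are all hypothetical. The PR itself tracks SHAs under /mnt/state/modules/ and runs HQL through a beeline-backed hive shim.

```bash
#!/usr/bin/env bash
# Sketch of a two-phase, SHA-based refresh. All names here are hypothetical.

# Phase 1 (serial): print each *.hql file under $1 whose SHA-256 differs from
# the digest recorded for it under the state directory $2.
list_changed_hql() {
    local hql_dir="$1" state_dir="$2" f new_sha old_sha
    for f in "$hql_dir"/*.hql; do
        [ -e "$f" ] || continue
        new_sha=$(sha256sum "$f" | cut -d' ' -f1)
        old_sha=$(cat "$state_dir/$(basename "$f").sha" 2>/dev/null || true)
        [ "$new_sha" = "$old_sha" ] || printf '%s\n' "$f"
    done
}

# Phase 2 (parallel): run the executable $3 on each changed file, at most $4
# at a time, and record the new digest only when the runner succeeds.
refresh_changed_hql() {
    local hql_dir="$1" state_dir="$2" runner="$3" parallel="${4:-4}"
    mkdir -p "$state_dir"
    list_changed_hql "$hql_dir" "$state_dir" |
        RUNNER="$runner" STATE_DIR="$state_dir" xargs -P "$parallel" -I{} sh -c '
            "$RUNNER" "$1" &&
            sha256sum "$1" | cut -d" " -f1 > "$STATE_DIR/$(basename "$1").sha"
        ' _ {}
}
```

Recording the digest only after the runner succeeds means a failed file is retried on the next refresh. The sketch assumes file names without quotes or newlines, since `xargs -I` treats those specially.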

How to use

# Default (refresh mode, all modules)
./docker/thirdparties/run-thirdparties-docker.sh -c hive3

# Fast mode — stack is healthy, no data changes
./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode fast

# Rebuild from scratch (wipes volumes; re-downloads baseline if available)
./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode rebuild

# Refresh only one module
./docker/thirdparties/run-thirdparties-docker.sh -c hive3 \
  --hive-mode refresh --hive-modules preinstalled_hql
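
One way to read the three `--hive-mode` values used above is as a mapping from (mode, stack health) to an ordered list of steps. The helper below is only a sketch of that reading: `plan_hive_steps` and the step labels are hypothetical names, and the behavior of `fast` on an unhealthy stack (start it, but still skip data refresh) is an assumption drawn from the mode descriptions rather than from the PR's code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a --hive-mode value and the current stack health
# ("yes" or "no") to the ordered steps to run. Step names are labels only.
plan_hive_steps() {
    local mode="${1:-refresh}" healthy="${2:-no}" steps=""
    case "$mode" in
        fast | refresh)
            # Both modes reuse a healthy stack instead of running compose up.
            [ "$healthy" = "yes" ] || steps="compose_up"
            # Only refresh re-runs modules / HQL files whose SHA changed.
            [ "$mode" = "fast" ] || steps="${steps:+$steps }refresh_changed_modules"
            ;;
        rebuild)
            # Tear down, wipe the four volumes, then do a full cold start.
            steps="compose_down wipe_volumes compose_up load_all_modules"
            ;;
        *)
            echo "unknown --hive-mode: $mode (expected fast|refresh|rebuild)" >&2
            return 1
            ;;
    esac
    echo "$steps"
}
```

For example, `plan_hive_steps refresh yes` prints only `refresh_changed_modules`, which matches the healthy-stack refresh path described in the PR.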

Release note

None

Check List (For Author)

  • Test: Manual test on aliyun dev machine
    • --hive-mode refresh (healthy stack, no change) — data refresh ~6-7s
    • --hive-mode refresh (HQL content change) — only changed files re-run in parallel
    • --hive-mode rebuild — full cold start from empty volumes
    • --hive-mode fast — stack reuse, ~1-2s
    • Baseline restore from OSS for both hive2 and hive3, then healthy compose up
  • Behavior changed: Yes
    • Hive state now lives in Docker named volumes, not bind mounts
    • Startup follows --hive-mode / --hive-modules semantics
  • Does this need documentation: No (README included)
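
For reviewers unfamiliar with the named-volume layout mentioned in the checklist, the naming pattern `<CONTAINER_UID><hive_version>-{namenode,datanode,pgdata,state}` can be expanded as below. This is a sketch under assumptions: the function names and the `DOCKER` override are hypothetical, and the prefix is concatenated exactly as the pattern shows, without guessing at any separator inside `CONTAINER_UID`.

```bash
#!/usr/bin/env bash
# Print the four per-version volume names, one per line.
# $1 = CONTAINER_UID prefix, $2 = hive version (for example hive2 or hive3).
hive_volume_names() {
    local container_uid="$1" hive_version="$2" suffix
    for suffix in namenode datanode pgdata state; do
        printf '%s%s-%s\n' "$container_uid" "$hive_version" "$suffix"
    done
}

# Create any of the four volumes that does not exist yet.
# DOCKER is a hypothetical override hook (for example DOCKER=echo for a dry run).
ensure_hive_volumes() {
    local name
    hive_volume_names "$1" "$2" | while IFS= read -r name; do
        ${DOCKER:-docker} volume inspect "$name" >/dev/null 2>&1 ||
            ${DOCKER:-docker} volume create "$name" >/dev/null
    done
}
```

Because the version is part of every name, Hive2 and Hive3 volumes never collide, which is what allows the two stacks to be rebuilt independently.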

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run external

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 7, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#62103

Problem Summary: Fix the Hive startup timeout in external pipelines by replacing the netcat-based Hive metastore port probes with bash /dev/tcp checks. The Hive container image does not reliably provide nc, and the previous implementation could block waiting for metastore readiness forever.

### Release note

None

### Check List (For Author)

- Test: Script validation only
    - No need to test (with reason): validated updated shell scripts with bash -n; waiting for external pipeline rerun to verify runtime behavior
- Behavior changed: No
- Does this need documentation: No
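
A bounded /dev/tcp probe of the kind this commit describes might look like the sketch below. It is not the committed code: `port_is_open` and `wait_for_port` are hypothetical names, and it assumes `bash` (built with network redirections) and coreutils `timeout` are available in the container.

```bash
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within timeout_s seconds.
# Uses bash's /dev/tcp pseudo-device, so no nc binary is required.
port_is_open() {
    local host="$1" port="$2" timeout_s="${3:-2}"
    timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Poll about once per second until the port opens or max_attempts probes have
# failed, so the caller can never block forever.
wait_for_port() {
    local host="$1" port="$2" max_attempts="${3:-120}" attempts=0
    until port_is_open "$host" "$port" 1; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge "$max_attempts" ]; then
            echo "timed out waiting for $host:$port" >&2
            return 1
        fi
        sleep 1
    done
}
```

A call such as `wait_for_port 127.0.0.1 9083 60` would wait for a metastore on its usual Thrift port; the host and port used by this stack are assumptions here.
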
@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from bec5cc8 to aca9035 Compare April 8, 2026 03:20
@suxiaogang223 suxiaogang223 changed the title [improvement](build) Add structured Hive startup modes and state reuse [improvement](build) Improve Hive docker startup refresh, idempotency, and metadata backend Apr 8, 2026
…, and metadata backend

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary:
This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:
- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
    - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
    - Hive startup now follows mode/module-based refresh semantics
    - Default Hive JuiceFS metadata backend is PostgreSQL (still overridable via JFS_CLUSTER_META)
- Does this need documentation: No
@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from aca9035 to 3943667 Compare April 8, 2026 03:25
@suxiaogang223
Contributor Author

run external

@suxiaogang223 suxiaogang223 marked this pull request as ready for review April 8, 2026 03:33
@suxiaogang223 suxiaogang223 changed the title [improvement](build) Improve Hive docker startup refresh, idempotency, and metadata backend [improvement](build) Improve Hive docker startup refresh and idempotency Apr 8, 2026
@suxiaogang223
Contributor Author

run external

@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from 88c6ce3 to d256249 Compare April 9, 2026 06:46
@suxiaogang223
Contributor Author

run buildall

@suxiaogang223
Contributor Author

run external

@suxiaogang223 suxiaogang223 changed the title [improvement](build) Improve Hive docker startup refresh and idempotency [improvement](build) Speed up Hive thirdparty startup with named volumes, OSS baseline, and parallel HQL refresh Apr 14, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Consolidate the Hive docker startup, shared-volume baseline, and OSS baseline publishing changes into a single commit. This keeps Hive metadata on a stable shared hostname, makes rebuild/refresh work with shared container names, versions the downloaded baseline cache and OSS filenames, and bumps the baseline version used by CI.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Verified hive2/hive3 rebuild and refresh on aliyun shared-volume environment
    - Published updated hive2/hive3 baseline tarballs to OSS with versioned filenames
- Behavior changed: Yes (Hive shared-volume startup and baseline restore now use stable host identity and versioned baseline artifacts)
- Does this need documentation: No
@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from 4a92cca to 554e816 Compare April 14, 2026 17:24
@suxiaogang223
Contributor Author

run external

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Update the Hive docker documentation to reflect the versioned baseline cache naming, versioned OSS tarball naming, and the fact that HIVE_BASELINE_VERSION is the single variable that controls baseline rollout.

### Release note

None

### Check List (For Author)

- Test: No need to test (documentation only)
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Prevent CI hosts from reusing truncated Hive baseline cache tarballs by validating cached and freshly downloaded baseline archives before restore, and by downloading into a temporary file before replacing the cache.

### Release note

None

### Check List (For Author)

- Test: No need to test (download/cache hardening only)
- Behavior changed: Yes (corrupt Hive baseline cache files are discarded and re-downloaded)
- Does this need documentation: No
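
The hardening described in this commit (validate the archive, download to a temporary file, then replace the cache) can be sketched as follows. The function names are hypothetical, and the use of `gzip -t`, `tar -tzf`, and `curl` is an assumption about one reasonable implementation, not a description of the committed script.

```bash
#!/usr/bin/env bash
# Return 0 only if the file is a complete gzip stream and a listable tar.
baseline_is_valid() {
    local tarball="$1"
    [ -s "$tarball" ] || return 1
    gzip -t "$tarball" 2>/dev/null || return 1
    tar -tzf "$tarball" >/dev/null 2>&1
}

# Reuse a valid cached tarball; otherwise download into a temp file next to
# the cache entry, validate it, and only then move it into place.
fetch_baseline() {
    local url="$1" cache_file="$2" tmp_file
    if baseline_is_valid "$cache_file"; then
        return 0
    fi
    rm -f "$cache_file"                     # discard a corrupt cache entry
    mkdir -p "$(dirname "$cache_file")"
    tmp_file=$(mktemp "$cache_file.XXXXXX") || return 1
    if curl -fsSL --retry 3 -o "$tmp_file" "$url" && baseline_is_valid "$tmp_file"; then
        mv -f "$tmp_file" "$cache_file"     # same directory, so a plain rename
    else
        rm -f "$tmp_file"
        return 1
    fi
}
```

Keeping the temporary file in the cache directory means an interrupted download never leaves a truncated file under the final cache name.
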
@suxiaogang223
Contributor Author

run external

2 participants