Commit 3943667

[improvement](build) Improve Hive docker startup refresh, idempotency, and metadata backend
### What problem does this PR solve?

Related Issue: #62101

Related PR: #62102

Problem Summary:

This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:

- add structured Hive startup modes: `--hive-mode fast|refresh|rebuild`
- add module-scoped refresh: `--hive-modules`
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by `HIVE_DEBUG=1`, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting

### Release note

None

### Check List (For Author)

- Test: Manual test
  - Ran `./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh`
  - Ran `./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql`
  - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
  - Hive startup now follows mode/module-based refresh semantics
  - Default Hive JuiceFS metadata backend is PostgreSQL (still overrideable by `JFS_CLUSTER_META`)
- Does this need documentation: No
1 parent 632f791 commit 3943667

197 files changed

Lines changed: 2432 additions & 895 deletions

Lines changed: 169 additions & 0 deletions
@@ -0,0 +1,169 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either implied. See the License for the specific
+language governing permissions and limitations
+under the License.
+-->
+
+# Hive Docker Environment
+
+This directory contains Hive2/Hive3 Docker Compose templates and bootstrap scripts used by Doris thirdparty startup.
+
+## Components
+
+- `hive-server`: HiveServer2 endpoint
+- `hive-metastore`: Hive Metastore service
+- `hive-metastore-postgresql`: metastore backend database
+- `namenode` / `datanode`: HDFS services for Hive test data
+
+## Component Segmentation
+
+Hive startup can be understood in 3 layers:
+
+### 1) Docker Service Layer
+
+- Compute/SQL entry:
+  - `hive-server`
+- Metadata:
+  - `hive-metastore`
+  - `hive-metastore-postgresql`
+- Storage:
+  - `namenode`
+  - `datanode`
+
+### 2) Refresh Module Layer (`--hive-modules`)
+
+- `default`: basic default-db external tables
+- `multi_catalog`: multi-format and multi-path external table cases
+- `partition_type`: partition type coverage cases
+- `statistics`: table stats and empty-table stats cases
+- `tvf`: tvf data/bootstrap cases
+- `regression`: special regression datasets (serde, delimiters, etc.)
+- `test`: lightweight smoke test datasets
+- `preinstalled_hql`: centralized preinstalled HQL scripts (`create_preinstalled_scripts/*.hql`)
+- `view`: view bootstrap (`create_view_scripts/create_view.hql`)
+
+### 3) Bootstrap Group Layer
+
+- `common`: shared items for hive2/hive3
+- `hive2_only`: hive2-only items
+- `hive3_only`: hive3-only items
+
+By default:
+
+- Hive2 uses: `common,hive2_only`
+- Hive3 uses: `common,hive3_only`
+
+This grouping controls which files are selected during `run.sh`/HQL refresh.
+
+## Start/Stop
+
+```bash
+# Start Hive3
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3
+
+# Start Hive2
+./docker/thirdparties/run-thirdparties-docker.sh -c hive2
+
+# Stop Hive3
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --stop
+```
+
+## Startup Modes
+
+Use `--hive-mode` to control startup behavior:
+
+- `fast`: reuse existing state as much as possible
+- `refresh` (default): refresh only changed modules by SHA
+- `rebuild`: force reset and rebuild hive state
+
+Examples:
+
+```bash
+# Default mode (refresh)
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3
+
+# Explicit refresh
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
+
+# Full rebuild
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode rebuild
+```
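Note on the `refresh` mode above: the README says it refreshes only modules whose SHA changed, but this excerpt does not show how the digest is computed or stored. The following is a minimal sketch of that idea, not the script's actual logic. It assumes each module is a directory and that the last applied digest lives in a plain file; the function names `module_sha`, `module_changed`, and `record_module` and the state-file layout are hypothetical. It also assumes GNU `sha256sum` is available.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: detect whether a module directory changed since the
# last refresh by comparing a content digest with a stored one.
# Function names and the state-file layout are assumptions for illustration.

# Print one SHA-256 digest covering every file under the module directory.
# Per-file digests are sorted so the result does not depend on find's order.
module_sha() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d' ' -f1 )
}

# Return 0 (changed) when no digest is recorded yet or the digest differs.
module_changed() {
    module_dir=$1
    state_file=$2
    [ -f "$state_file" ] || return 0
    [ "$(module_sha "$module_dir")" != "$(cat "$state_file")" ]
}

# Record the current digest after a successful refresh.
record_module() {
    module_sha "$1" > "$2"
}
```

A caller would run `module_changed` before each module's bootstrap and `record_module` only after the bootstrap succeeds, so a failed refresh is retried on the next run.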
+
+## Module Refresh
+
+Use `--hive-modules` to limit refresh scope:
+
+- `default,multi_catalog,partition_type,statistics,tvf,regression,test,preinstalled_hql,view`
+- `all` means all modules
+
+Examples:
+
+```bash
+# Refresh only preinstalled HQL scripts
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
+
+# Refresh selected data modules
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules default,multi_catalog
+```
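Note on `--hive-modules`: a comma-separated value like the ones above has to be split and checked against the known module names. The sketch below shows one way that could be done. The module names are taken from the README; the function name `resolve_hive_modules` and the error message are hypothetical, and treating `all` as a standalone value is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: validate a comma-separated --hive-modules value.
# Module names come from the README; the function name is an assumption.

ALL_HIVE_MODULES="default multi_catalog partition_type statistics tvf regression test preinstalled_hql view"

# Print the resolved modules, one per line; fail on an unknown name.
resolve_hive_modules() {
    if [ "$1" = "all" ]; then
        # Unquoted on purpose: split the list into one argument per module.
        printf '%s\n' $ALL_HIVE_MODULES
        return 0
    fi
    for requested in $(printf '%s' "$1" | tr ',' ' '); do
        case " $ALL_HIVE_MODULES " in
            *" $requested "*) printf '%s\n' "$requested" ;;
            *) echo "unknown hive module: $requested" >&2; return 1 ;;
        esac
    done
}
```

Rejecting unknown names early avoids a run that silently refreshes nothing because of a typo in the module list.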
+
+## Idempotency Rules
+
+To keep `refresh` repeatable:
+
+- `run.sh` scripts should be idempotent
+- HQL should use `DROP ... IF EXISTS` then `CREATE ...`
+- avoid relying on `CREATE ... IF NOT EXISTS` for table/view recreation
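Note on the drop-then-create rule: `CREATE ... IF NOT EXISTS` leaves an existing table untouched on a rerun, so a changed definition would not take effect, whereas dropping first always applies the current definition. The sketch below prints HQL in that style. The helper name `emit_idempotent_hql`, the table name, the columns, and the location are made-up examples, not scripts from this commit.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: emit rerunnable HQL in the drop-then-create style.
# The helper name, columns, and storage format are made-up examples.

emit_idempotent_hql() {
    table=$1
    location=$2
    cat <<EOF
DROP TABLE IF EXISTS ${table};
CREATE EXTERNAL TABLE ${table} (
  id INT,
  name STRING
)
STORED AS PARQUET
LOCATION '${location}';
EOF
}
```

For example, `emit_idempotent_hql demo_tbl /user/doris/demo_tbl` prints a statement pair that can be executed any number of times. Dropping an external table removes only its metadata by default, so the data under the location is normally left in place.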
+
+## JuiceFS Metadata Backend
+
+Hive now defaults JuiceFS metadata to PostgreSQL (Hive metastore DB), so Hive startup no longer auto-requires MySQL.
+
+- Hive2 default (in `hive-2x_settings.env`):
+  - `postgres://postgres@127.0.0.1:${PG_PORT}/juicefs_meta?sslmode=disable`
+- Hive3 default (in `hive-3x_settings.env`):
+  - `postgres://postgres@127.0.0.1:${PG_PORT}/juicefs_meta?sslmode=disable`
+
+If your environment still needs MySQL metadata, override before startup:
+
+```bash
+export JFS_CLUSTER_META="mysql://root:123456@(127.0.0.1:3316)/juicefs_meta"
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3
+```
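Note on the override: the settings files in this commit assign `JFS_CLUSTER_META` with the shell's `${VAR:-default}` expansion, which is why an exported value wins over the PostgreSQL default. The sketch below isolates that behavior. The wrapper function `resolve_jfs_meta` and its port argument are hypothetical, added only so the fallback can be exercised on its own.

```bash
#!/usr/bin/env bash
# Sketch of the ${VAR:-default} fallback used for JFS_CLUSTER_META.
# The wrapper function and its port argument are assumptions for illustration.

resolve_jfs_meta() {
    pg_port=$1
    # Use the caller's JFS_CLUSTER_META if set and non-empty,
    # otherwise fall back to the local PostgreSQL metadata URL.
    printf '%s\n' "${JFS_CLUSTER_META:-postgres://postgres@127.0.0.1:${pg_port}/juicefs_meta?sslmode=disable}"
}
```

Because `:-` also applies when the variable is set but empty, exporting `JFS_CLUSTER_META=""` still yields the PostgreSQL default.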
+
+## Logs and Debug
+
+- Hive3 startup log: `docker/thirdparties/logs/start_hive3.log`
+- Hive2 startup log: `docker/thirdparties/logs/start_hive2.log`
+
+By default, helper scripts keep xtrace off to reduce log noise.
+Enable debug trace when needed:
+
+```bash
+export HIVE_DEBUG=1
+./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
+```
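Note on `HIVE_DEBUG`: the helper scripts themselves are not shown in this excerpt, so the following is only a sketch of how xtrace could be gated on that variable. The function names `hive_debug_enabled` and `maybe_enable_xtrace` are hypothetical; only the `HIVE_DEBUG=1` convention comes from the README.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: enable xtrace only when HIVE_DEBUG=1.
# Helper names are assumptions; only HIVE_DEBUG comes from the README.

# Succeed only when HIVE_DEBUG is exactly "1"; unset counts as off.
hive_debug_enabled() {
    [ "${HIVE_DEBUG:-0}" = "1" ]
}

# Turn on command tracing for the current shell when debugging is requested.
maybe_enable_xtrace() {
    if hive_debug_enabled; then
        set -x
    fi
}
```

A script would call `maybe_enable_xtrace` once near the top, in place of an unconditional `set -x`.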
+
+## Common Troubleshooting
+
+- Metastore health check fails:
+  - check `hive-metastore-postgresql` is healthy
+  - inspect `start_hive3.log` or `start_hive2.log`
+- JuiceFS format/status fails:
+  - verify `JFS_CLUSTER_META` is reachable
+  - ensure target metadata database exists (startup script auto-creates for local MySQL/PostgreSQL)
+- Refresh is unexpectedly slow:
+  - confirm `--hive-mode refresh` is used
+  - use `--hive-modules` to narrow refresh scope

docker/thirdparties/docker-compose/hive/hive-2x.yaml.tpl

Lines changed: 11 additions & 3 deletions
@@ -16,8 +16,6 @@
 #


-version: "3.8"
-
 services:
   namenode:
     image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
@@ -35,6 +33,8 @@ services:
       interval: 5s
       timeout: 120s
       retries: 120
+    volumes:
+      - ${HIVE_STATE_ROOT}/namenode:/hadoop/dfs/name
     network_mode: "host"

   datanode:
@@ -52,6 +52,8 @@ services:
       interval: 5s
       timeout: 60s
       retries: 120
+    volumes:
+      - ${HIVE_STATE_ROOT}/datanode:/hadoop/dfs/data
     network_mode: "host"

   hive-server:
@@ -61,9 +63,12 @@ services:
     environment:
       HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://${IP_HOST}:${PG_PORT}/metastore"
       SERVICE_PRECONDITION: "${IP_HOST}:${HMS_PORT}"
+      HIVE_SITE_CONF_hive_aux_jars_path: "file:///mnt/scripts/auxlib/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar"
     container_name: ${CONTAINER_UID}hive2-server
     expose:
       - "${HS_PORT}"
+    volumes:
+      - ./scripts:/mnt/scripts
     depends_on:
       datanode:
         condition: service_healthy
@@ -81,7 +86,7 @@ services:
     image: bde2020/hive:2.3.2-postgresql-metastore
     env_file:
       - ./hadoop-hive-2x.env
-    command: /bin/bash /mnt/scripts/hive-metastore.sh
+    command: /bin/bash /mnt/scripts/start-hive-metastore.sh
     environment:
       SERVICE_PRECONDITION: "${IP_HOST}:50070 ${IP_HOST}:50075 ${IP_HOST}:${PG_PORT}"
       HMS_PORT: "${HMS_PORT}"
@@ -90,6 +95,7 @@ services:
       - "${HMS_PORT}"
     volumes:
       - ./scripts:/mnt/scripts
+      - ${HIVE_STATE_ROOT}/state:/mnt/state
     depends_on:
       hive-metastore-postgresql:
         condition: service_healthy
@@ -105,6 +111,8 @@ services:
     container_name: ${CONTAINER_UID}hive2-metastore-postgresql
     ports:
       - "${PG_PORT}:5432"
+    volumes:
+      - ${HIVE_STATE_ROOT}/pgdata:/var/lib/postgresql/data
     healthcheck:
       test: ["CMD-SHELL", "pg_isready -U postgres"]
       interval: 5s

docker/thirdparties/docker-compose/hive/hive-2x_settings.env

Lines changed: 3 additions & 3 deletions
@@ -26,7 +26,7 @@ export HS_PORT=10000 # should be same as hive2ServerPort in regression-conf.groo
 export PG_PORT=5432 # should be same as hive2PgPort in regression-conf.groovy

 # JuiceFS metadata endpoint for property `juicefs.cluster.meta`.
-# CI can override this env, e.g.:
+# CI can override this env, e.g. to point at the docker-published mysql_57 port:
 # export JFS_CLUSTER_META="mysql://user:pwd@(127.0.0.1:3316)/juicefs_meta"
-# default to mysql_57 (3316) because external pipeline always starts mysql, but not redis.
-export JFS_CLUSTER_META="${JFS_CLUSTER_META:-mysql://root:123456@(127.0.0.1:3316)/juicefs_meta}"
+# default to hive metastore postgresql to avoid external mysql dependency.
+export JFS_CLUSTER_META="${JFS_CLUSTER_META:-postgres://postgres@127.0.0.1:${PG_PORT}/juicefs_meta?sslmode=disable}"

docker/thirdparties/docker-compose/hive/hive-3x.yaml.tpl

Lines changed: 11 additions & 3 deletions
@@ -16,8 +16,6 @@
 #


-version: "3.8"
-
 services:
   namenode:
     image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
@@ -35,6 +33,8 @@ services:
       interval: 5s
       timeout: 120s
       retries: 120
+    volumes:
+      - ${HIVE_STATE_ROOT}/namenode:/hadoop/dfs/name
     network_mode: "host"

   datanode:
@@ -52,6 +52,8 @@ services:
       interval: 5s
       timeout: 60s
       retries: 120
+    volumes:
+      - ${HIVE_STATE_ROOT}/datanode:/hadoop/dfs/data
     network_mode: "host"

   hive-server:
@@ -63,9 +65,12 @@ services:
       HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://${IP_HOST}:${PG_PORT}/metastore"
       SERVICE_PRECONDITION: "${IP_HOST}:${HMS_PORT}"
       JVM_OPTS: -Xmx2g
+      HIVE_SITE_CONF_hive_aux_jars_path: "file:///mnt/scripts/auxlib/json-serde-1.3.9-SNAPSHOT-jar-with-dependencies.jar"
     container_name: ${CONTAINER_UID}hive3-server
     expose:
       - "${HS_PORT}"
+    volumes:
+      - ./scripts:/mnt/scripts
     depends_on:
       datanode:
         condition: service_healthy
@@ -83,7 +88,7 @@ services:
     image: doristhirdpartydocker/hive:3.1.2-postgresql-metastore
     env_file:
       - ./hadoop-hive-3x.env
-    command: /bin/bash /mnt/scripts/hive-metastore.sh
+    command: /bin/bash /mnt/scripts/start-hive-metastore.sh
     environment:
       SERVICE_PRECONDITION: "${IP_HOST}:9870 ${IP_HOST}:9864 ${IP_HOST}:${PG_PORT}"
       HMS_PORT: "${HMS_PORT}"
@@ -92,6 +97,7 @@ services:
       - "${HMS_PORT}"
     volumes:
       - ./scripts:/mnt/scripts
+      - ${HIVE_STATE_ROOT}/state:/mnt/state
       - /tmp/jfs-bucket:/tmp/jfs-bucket
     depends_on:
       hive-metastore-postgresql:
@@ -108,6 +114,8 @@ services:
     container_name: ${CONTAINER_UID}hive3-metastore-postgresql
     ports:
       - "${PG_PORT}:5432"
+    volumes:
+      - ${HIVE_STATE_ROOT}/pgdata:/var/lib/postgresql/data
     healthcheck:
       test: ["CMD-SHELL", "pg_isready -U postgres"]
       interval: 5s

docker/thirdparties/docker-compose/hive/hive-3x_settings.env

Lines changed: 3 additions & 2 deletions
@@ -26,9 +26,10 @@ export HS_PORT=13000 # should be same as hive3ServerPort in regression-conf.groo
 export PG_PORT=5732 # should be same as hive3PgPort in regression-conf.groovy

 # JuiceFS metadata endpoint for property `juicefs.cluster.meta`.
-# CI can override this env, e.g.:
+# Default to hive metastore postgresql to avoid external mysql dependency.
+# CI can still override this env, e.g.:
 # export JFS_CLUSTER_META="mysql://user:pwd@(127.0.0.1:3316)/juicefs_meta"
-export JFS_CLUSTER_META="${JFS_CLUSTER_META:-mysql://root:123456@(127.0.0.1:3316)/juicefs_meta}"
+export JFS_CLUSTER_META="${JFS_CLUSTER_META:-postgres://postgres@127.0.0.1:${PG_PORT}/juicefs_meta?sslmode=disable}"

 # prepare for paimon hms test,control load paimon hms data or not
 export enablePaimonHms="false"
