Commit cbcb555

jhambleton authored
feat: mapreduce hashtable/synctable jobs (#3688)
* feat: mapreduce hashtable/synctable jobs
* refactored SyncTable, added job args unit test, added matches validation tests
* add additional testing for verifying args/options set, fixes for pr
* revised configs, cleanup, and updated readme
* moved all access with configuration outside accessor, isolated mock connection in its own method in syncmapper, addressed other comments
* minor readme changes, fail upon ioe for initialization in map tasks and throw up exception

Co-authored-by: jhambleton <jhambleton@google.com>
1 parent 48abb6d commit cbcb555

9 files changed

Lines changed: 1996 additions & 51 deletions

bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce/README.md

Lines changed: 180 additions & 51 deletions
Original file line number | Diff line number | Diff line change
@@ -2,51 +2,89 @@
22

33
This module provides a work alike to some of the jobs implemented in hbase-server.
44
Specifically this currently has the ability to export and import SequenceFiles
5-
from/to Cloud Bigtable and import HBase snapshots using a Map Reduce cluster (ie. dataproc).
5+
from/to Cloud Bigtable, import HBase snapshots using a Map Reduce cluster (ie.
6+
dataproc), and HashTable/SyncTable for validation.
67

7-
## Expected Usage
8+
## Setup
9+
10+
To use the tools in this folder, you can download them from the maven repository, or
11+
you can build them using Maven.
812

913
[//]: # ({x-version-update-start:bigtable-client-parent:released})
14+
### Download the jars
15+
Download [bigtable-hbase-1.x-mapreduce jars](https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-hbase-1.x-mapreduce), which is an aggregation of all required jars.
16+
17+
### Build the jars yourself
18+
19+
Go to the top level directory and build the repo
20+
then return to this sub directory.
21+
22+
```
23+
cd ../../
24+
mvn clean install -DskipTests=true
25+
cd bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce
26+
```
27+
28+
## Expected Usage
29+
1030
### On-prem Hadoop
1131

12-
1. Download or build bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar
32+
1. Download or build bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar
1333
2. Download service account credentials json from Google Cloud Console.
1434
3. Submit the job using your edge node's hadoop installation.
1535
```bash
16-
# Export to SequenceFiles
17-
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
18-
hadoop jar bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar \
19-
export-table \
20-
-Dgoogle.bigtable.project.id=<project-id> \
21-
-Dgoogle.bigtable.instance.id=<instance-id> \
22-
<table-id> \
23-
<outputdir>
36+
# Export to SequenceFiles
37+
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
38+
hadoop jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
39+
export-table \
40+
-Dgoogle.bigtable.project.id=<project-id> \
41+
-Dgoogle.bigtable.instance.id=<instance-id> \
42+
<table-id> \
43+
<outputdir>
2444

25-
# Import from SequenceFiles
26-
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
27-
hadoop jar bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar \
28-
import-table \
29-
-Dgoogle.bigtable.project.id=<project-id> \
30-
-Dgoogle.bigtable.instance.id=<instance-id> \
31-
<table-id> \
32-
<inputdir>
45+
# Import from SequenceFiles
46+
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
47+
hadoop jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
48+
import-table \
49+
-Dgoogle.bigtable.project.id=<project-id> \
50+
-Dgoogle.bigtable.instance.id=<instance-id> \
51+
<table-id> \
52+
<inputdir>
53+
54+
# Import from HBase snapshot
55+
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
56+
hadoop jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
57+
import-snapshot \
58+
-Dgoogle.bigtable.project.id=<project-id> \
59+
-Dgoogle.bigtable.instance.id=<instance-id> \
60+
<snapshot-name> \
61+
<snapshot-dir> \
62+
<table-id> \
63+
<tmp-dir>
64+
65+
# HashTable on HBase
66+
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
67+
hadoop jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
68+
hash-table \
69+
-Dhbase.zookeeper.quorum=<source-zk-quorum> \
70+
<source-table-id> \
71+
<hash-outputdir-hbase>
3372

34-
# Import from HBase snapshot
35-
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
36-
hadoop jar bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar \
37-
import-snapshot \
38-
-Dgoogle.bigtable.project.id=<project-id> \
39-
-Dgoogle.bigtable.instance.id=<instance-id> \
40-
<snapshot-name> \
41-
<snapshot-dir> \
42-
<table-id> \
43-
<tmp-dir>
73+
# SyncTable on Bigtable (dryrun enabled by default)
74+
GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json \
75+
hadoop jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
76+
sync-table \
77+
--sourcezkcluster=<source-zk-quorum> \
78+
--targetbigtableproject=<project-id> \
79+
--targetbigtableinstance=<instance-id> \
80+
<hash-outputdir-hbase> \
81+
<source-table-id> \
82+
<target-table-id>
4483
```
4584

46-
4785
### Dataproc
4886

49-
1. Download or build bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar.
87+
1. Download or build bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar.
5088
2. Install the gcloud sdk.
5189
3. Configure [Bigtable IAM roles](https://cloud.google.com/bigtable/docs/access-control#roles)
5290
for the [Dataproc Service Account](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts#what_are_service_accounts)
@@ -55,9 +93,10 @@ from/to Cloud Bigtable and import HBase snapshots using a Map Reduce cluster (ie
5593
```bash
5694
# Export to SequenceFiles
5795
gcloud dataproc jobs submit hadoop \
96+
--project <project-id> \
5897
--cluster <dataproc-cluster> \
5998
--region <dataproc-region> \
60-
--jar bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar \
99+
--jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
61100
-- \
62101
export-table \
63102
-Dgoogle.bigtable.project.id=<project-id> \
@@ -67,9 +106,10 @@ from/to Cloud Bigtable and import HBase snapshots using a Map Reduce cluster (ie
67106

68107
# Import from SequenceFiles
69108
gcloud dataproc jobs submit hadoop \
109+
--project <project-id> \
70110
--cluster <dataproc-cluster> \
71111
--region <dataproc-region> \
72-
--jar bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar \
112+
--jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
73113
-- \
74114
import-table \
75115
-Dgoogle.bigtable.project.id=<project-id> \
@@ -79,17 +119,45 @@ from/to Cloud Bigtable and import HBase snapshots using a Map Reduce cluster (ie
79119

80120
# Import from HBase snapshot
81121
gcloud dataproc jobs submit hadoop \
122+
--project <project-id> \
82123
--cluster <dataproc-cluster> \
83124
--region <dataproc-region> \
84-
--jar bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar \
125+
--jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
85126
-- \
86127
import-snapshot \
87128
-Dgoogle.bigtable.project.id=<project-id> \
88129
-Dgoogle.bigtable.instance.id=<instance-id> \
89130
<snapshot-name> \
90131
<snapshot-dir> \
91132
<table-id> \
92-
<tmp-dir>
133+
<tmp-dir>
134+
135+
# HashTable on HBase
136+
gcloud dataproc jobs submit hadoop \
137+
--project <project-id> \
138+
--cluster <dataproc-cluster> \
139+
--region <dataproc-region> \
140+
--jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
141+
-- \
142+
hash-table \
143+
-Dhbase.zookeeper.quorum=<source-zk-quorum> \
144+
<table-id> \
145+
<hash-outputdir-hbase>
146+
147+
# SyncTable on Bigtable (dryrun enabled by default)
148+
gcloud dataproc jobs submit hadoop \
149+
--project <project-id> \
150+
--cluster <dataproc-cluster> \
151+
--region <dataproc-region> \
152+
--jar bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar \
153+
-- \
154+
sync-table \
155+
--sourcezkcluster=<source-zk-quorum> \
156+
--targetbigtableproject=<project-id> \
157+
--targetbigtableinstance=<instance-id> \
158+
<hash-outputdir-hbase> \
159+
<source-table-id> \
160+
<target-table-id>
93161
```
94162

95163
## Examples
@@ -109,7 +177,7 @@ for the on-prem application to write to GCS).
109177
```bash
110178
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
111179
-snapshot <snapshotName> \
112-
-copy-to gs://<bucket/<snapshot-dir> \
180+
-copy-to gs://<bucket>/<snapshot-dir> \
113181
-mappers <num-mappers>
114182
```
115183

@@ -121,24 +189,24 @@ environment variables for running the subsequent steps.
121189
GCP environment properties
122190
```bash
123191
# set env properties
124-
export PROJECT_ID=<PROJECT_ID>
125-
export ZONE=<ZONE>
126-
export REGION=${ZONE%-*}
127-
export DATAPROC_CLUSTER=<DATAPROC_CLUSTER_NAME>
192+
PROJECT_ID=<PROJECT_ID>
193+
ZONE=<ZONE>
194+
REGION=${ZONE%-*}
195+
DATAPROC_CLUSTER=<DATAPROC_CLUSTER_NAME>
128196
129197
# bigtable table properties
130-
export CBT_INSTANCE=<BIGTABLE_INSTANCE>
131-
export CBT_CLUSTER=<BIGTABLE_CLUSTER>
132-
export CBT_TABLENAME=<TABLENAME>
133-
export CBT_COLUMN_FAMILY=<CF1[,CF]>
198+
CBT_INSTANCE=<BIGTABLE_INSTANCE>
199+
CBT_CLUSTER=<BIGTABLE_CLUSTER>
200+
CBT_TABLENAME=<TABLENAME>
201+
CBT_COLUMN_FAMILY=<CF1[,CF]>
134202
135203
# dataproc job jar
136-
export JOB_JAR=bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar
204+
JOB_JAR=bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar
137205
138206
# dataproc job args
139-
export JOB_ARG_SNAPSHOT_NAME=<SNAPSHOT_NAME>
140-
export JOB_ARG_SNAPSHOT_DIR=<SNAPSHOT_DIR>
141-
export JOB_ARG_TEMP_DIR=<JOB_TEMP_DIR>
207+
JOB_ARG_SNAPSHOT_NAME=<SNAPSHOT_NAME>
208+
JOB_ARG_SNAPSHOT_DIR=<SNAPSHOT_DIR>
209+
JOB_ARG_TEMP_DIR=<JOB_TEMP_DIR>
142210
```
143211

144212
2. [Create a Dataproc Cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) for executing the import snapshot job.
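The `REGION=${ZONE%-*}` assignment in the environment properties of step 1 relies on shell suffix removal. A minimal sketch of what it evaluates to, assuming a hypothetical zone name `us-central1-a` that is not part of the original instructions:

```bash
# Hypothetical example zone; substitute your own.
ZONE=us-central1-a

# ${ZONE%-*} strips the shortest suffix matching "-*", i.e. the trailing zone letter.
REGION=${ZONE%-*}

echo "zone=${ZONE} region=${REGION}"
```

Deriving the region this way keeps the two values consistent, so only `ZONE` needs to be edited when moving to another location.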
@@ -172,13 +240,14 @@ the command:
172240

173241
#### Run the import snapshot job
174242

175-
1. Run the following command to start the import snapshot job on the Dataproc cluster
243+
1. Run the following command to start the `import-snapshot` job on the Dataproc cluster
176244
that was created. Slowly scale the dataproc cluster to increase/decrease throughput
177245
and similarly scale up/down the bigtable cluster to meet the throughput demand. See
178246
Bigtable [scaling limitations](https://cloud.google.com/bigtable/docs/scaling#limitations) if observing slower performance than expected.
179247

180248
```bash
181249
gcloud dataproc jobs submit hadoop \
250+
--project ${PROJECT_ID} \
182251
--cluster ${DATAPROC_CLUSTER} \
183252
--region ${REGION} \
184253
--project ${PROJECT_ID} \
@@ -226,12 +295,72 @@ setting the properties for the job. For example:
226295
-Dhbase.snapshot.thread.pool.max=10
227296
```
228297

298+
### Example jobs to validate the data migrated from source to target
299+
300+
1. Set the following additional environment variables for running the validation steps.
301+
```bash
302+
# hash-table validation job
303+
HBASE_TABLENAME=<HBASE_TABLENAME>
304+
# hbase zookeeper quorum (ie. zk1.example.com:2181)
305+
HBASE_ZK_QUORUM=<ZK_QUORUM>
306+
HASH_OUTPUTDIR=<HASH_OUTPUTDIR>
307+
308+
# sync-table validation job
309+
HBASE_ZK_QUORUM_FULL=${HBASE_ZK_QUORUM}:/hbase
310+
```
311+
312+
2. Run `hash-table` and compute hashes for ranges on the source table and output
313+
results to a GCS bucket (See [HashTable/SyncTable](https://hbase.apache.org/book.html#_step_1_hashtable) doc for more details).
314+
```bash
315+
hadoop jar ${JOB_JAR} \
316+
hash-table \
317+
-Dhbase.zookeeper.quorum=${HBASE_ZK_QUORUM} \
318+
${HBASE_TABLENAME} \
319+
${HASH_OUTPUTDIR}
320+
```
321+
322+
3. Run `sync-table` to generate hashes on the target table and compare these hashes with
323+
the output from `hash-table`. For diverging hashes, a cell-level comparison is performed
324+
between the source and target and summarized in the job counters.
325+
```bash
326+
# dryrun mode (readonly) enabled by default
327+
gcloud dataproc jobs submit hadoop \
328+
--project ${PROJECT_ID} \
329+
--cluster ${DATAPROC_CLUSTER} \
330+
--region ${REGION} \
331+
--project ${PROJECT_ID} \
332+
--jar ${JOB_JAR} \
333+
-- \
334+
sync-table \
335+
--sourcezkcluster=${HBASE_ZK_QUORUM_FULL} \
336+
--targetbigtableproject=${PROJECT_ID} \
337+
--targetbigtableinstance=${CBT_INSTANCE} \
338+
${HASH_OUTPUTDIR} \
339+
${HBASE_TABLENAME} \
340+
${CBT_TABLENAME}
341+
```
342+
Note: Connection with the source is required for providing cell-level comparison. Users may
343+
enable debug mode `--properties mapreduce.map.log.level=DEBUG` on the job to provide additional
344+
details on the diverging hash ranges and cell mismatches if divergence is detected. Job
345+
configurations may also be updated to run `hash-table` against bigtable and `sync-table` run
346+
against hbase.
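
As a rough sketch of the note above, the debug property can be passed as a `gcloud` flag ahead of the `--` separator, while the `sync-table` options follow it. All values below are hypothetical placeholders, and the command is only printed rather than executed so the assembled arguments can be inspected first.

```bash
# Hypothetical placeholder values; substitute your own.
PROJECT_ID=my-project
REGION=us-central1
DATAPROC_CLUSTER=my-dataproc-cluster
CBT_INSTANCE=my-instance
CBT_TABLENAME=my-table
HBASE_TABLENAME=my-hbase-table
HASH_OUTPUTDIR=gs://my-bucket/hashes
JOB_JAR=bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar

# The source cluster key has the form <quorum>:<port>:<znode parent>.
HBASE_ZK_QUORUM=zk1.example.com:2181
HBASE_ZK_QUORUM_FULL=${HBASE_ZK_QUORUM}:/hbase

# Assemble the command as positional parameters instead of running it.
set -- gcloud dataproc jobs submit hadoop \
  --project "${PROJECT_ID}" \
  --cluster "${DATAPROC_CLUSTER}" \
  --region "${REGION}" \
  --properties mapreduce.map.log.level=DEBUG \
  --jar "${JOB_JAR}" \
  -- \
  sync-table \
  --sourcezkcluster="${HBASE_ZK_QUORUM_FULL}" \
  --targetbigtableproject="${PROJECT_ID}" \
  --targetbigtableinstance="${CBT_INSTANCE}" \
  "${HASH_OUTPUTDIR}" \
  "${HBASE_TABLENAME}" \
  "${CBT_TABLENAME}"

# Print the assembled command for review; run it with "$@" once it looks right.
printf '%s\n' "$*"
```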
347+
348+
Additional Options:
349+
350+
1. Disable dry run mode to perform synchronization between source and target for diverging hash ranges.
351+
352+
```bash
353+
--dryrun=false
354+
```
355+
356+
2. Other job configuration options and details are described in the [HBase SyncTable description](https://hbase.apache.org/book.html#_step_2_synctable).
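
A sketch of where the `--dryrun=false` option from item 1 might sit, assuming it is passed alongside the other `sync-table` options ahead of the positional arguments. That placement is an assumption based on the commands above, and all values are hypothetical placeholders. With dry run disabled the job synchronizes diverging ranges on the target, so it may be worth reviewing the counters from a dry run first.

```bash
# Hypothetical placeholder values; substitute your own.
PROJECT_ID=my-project
CBT_INSTANCE=my-instance
CBT_TABLENAME=my-table
HBASE_TABLENAME=my-hbase-table
HASH_OUTPUTDIR=gs://my-bucket/hashes
HBASE_ZK_QUORUM_FULL=zk1.example.com:2181:/hbase

# Job arguments that would follow the "--" separator of the gcloud command.
set -- sync-table \
  --dryrun=false \
  --sourcezkcluster="${HBASE_ZK_QUORUM_FULL}" \
  --targetbigtableproject="${PROJECT_ID}" \
  --targetbigtableinstance="${CBT_INSTANCE}" \
  "${HASH_OUTPUTDIR}" \
  "${HBASE_TABLENAME}" \
  "${CBT_TABLENAME}"

# Print one argument per line for review before submitting the job.
printf '%s\n' "$@"
```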
357+
229358
## Backwards compatibility
230359

231360
To maintain backwards compatibility of this artifact, we still provide
232-
`bigtable-hbase-1.x-mapreduce-2.0.0-alpha1.jar` artifact that includes
361+
`bigtable-hbase-1.x-mapreduce-2.5.0-shaded.jar` artifact that includes
233362
hadoop jars. However we encourage our users to migrate to
234-
`bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar` to avoid dependency
363+
`bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar` to avoid dependency
235364
conflicts with the existing classpath on Hadoop workers.
236365

237366
[//]: # ({x-version-update-end})

bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce/src/main/java/com/google/cloud/bigtable/mapreduce/Driver.java

Lines changed: 10 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -16,7 +16,9 @@
1616
package com.google.cloud.bigtable.mapreduce;
1717

1818
import com.google.cloud.bigtable.mapreduce.hbasesnapshots.ImportHBaseSnapshotJob;
19+
import com.google.cloud.bigtable.mapreduce.validation.BigtableSyncTableJob;
1920
import org.apache.hadoop.classification.InterfaceStability.Evolving;
21+
import org.apache.hadoop.hbase.mapreduce.HashTable;
2022
import org.apache.hadoop.util.ProgramDriver;
2123

2224
/** Driver for bigtable mapreduce jobs. Select which to run by passing name of job to this main. */
@@ -44,6 +46,14 @@ public static void main(String[] args) {
4446
"import-snapshot",
4547
ImportHBaseSnapshotJob.class,
4648
"A map/reduce program that imports an hbase snapshot to a table.");
49+
programDriver.addClass(
50+
"hash-table",
51+
HashTable.class,
52+
"A map/reduce program that computes hashes on source and outputs to filesystem (or cloud storage).");
53+
programDriver.addClass(
54+
"sync-table",
55+
BigtableSyncTableJob.class,
56+
"A map/reduce program that computes hashes on target and compares with hashes from source.");
4757
programDriver.driver(args);
4858
exitCode = programDriver.run(args);
4959
} catch (Throwable e) {

bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce/src/main/java/com/google/cloud/bigtable/mapreduce/hbasesnapshots/ImportHBaseSnapshotJob.java

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -184,6 +184,9 @@ protected static int setConfFromArgs(Configuration conf, String[] args) {
184184
conf.get(BigtableOptionsFactory.INSTANCE_ID_KEY),
185185
conf.get(BigtableOptionsFactory.APP_PROFILE_ID_KEY, ""));
186186

187+
// Set user agent
188+
conf.set(BigtableOptionsFactory.CUSTOM_USER_AGENT_KEY, "HBaseMRImport");
189+
187190
// implicit table outputformat configs that are used in the job to write map output to a table
188191
conf.set(TableOutputFormat.OUTPUT_TABLE, conf.get(TABLENAME_KEY));
189192
conf.setStrings(
