* feat: mapreduce hashtable/synctable jobs
* refactored SyncTable, added job args unit test, added matches validation tests
* add additional testing for verifying args/options set, fixes for pr
* revised configs, cleanup, and updated readme
* moved all access with configuration outside accessor, isolated mock connection in its own method in syncmapper, addressed other comments
* minor readme changes, fail upon ioe for initialization in map tasks and throw up exception
Co-authored-by: jhambleton <jhambleton@google.com>
Download [bigtable-hbase-1.x-mapreduce jars](https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-hbase-1.x-mapreduce), which is an aggregation of all required jars.
+
+### Build the jars yourself
+
+Go to the top level directory and build the repo
+then return to this sub directory.
+
+```
+cd ../../
+mvn clean install -DskipTests=true
+cd bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce
+```
+
+## Expected Usage
+
 ### On-prem Hadoop

-1. Download or build bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar
+1. Download or build bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar
 2. Download service account credentials json from Google Cloud Console.
 3. Submit the job using your edge node's hadoop installation.
2. [Create a Dataproc Cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) for executing the import snapshot job.
@@ -172,13 +240,14 @@ the command:
 #### Run the import snapshot job

-1. Run the following command to start the importsnapshot job on the Dataproc cluster
+1. Run the following command to start the `import-snapshot` job on the Dataproc cluster
 that was created. Slowly scale the dataproc cluster to increase/decrease throughput
 and similarly scale up/down the bigtable cluster to meet the throughput demand. See
 Bigtable [scaling limitations](https://cloud.google.com/bigtable/docs/scaling#limitations) if observing slower performance than expected.

 ```bash
 gcloud dataproc jobs submit hadoop \
+--project ${PROJECT_ID} \
 --cluster ${DATAPROC_CLUSTER} \
 --region ${REGION} \
 --project ${PROJECT_ID} \
@@ -226,12 +295,72 @@ setting the properties for the job. For example:
 -Dhbase.snapshot.thread.pool.max=10
 ```

+### Example jobs to validate the data migrated from source to target
+
+1. Set the following additional environment variables for running the validation steps.
+```bash
+# hash-table validation job
+HBASE_TABLENAME=<HBASE_TABLENAME>
+# hbase zookeeper quorum (e.g. zk1.example.com:2181)
+HBASE_ZK_QUORUM=<ZK_QUORUM>
+HASH_OUTPUTDIR=<HASH_OUTPUTDIR>
+
+# sync-table validation job
+HBASE_ZK_QUORUM_FULL=${HBASE_ZK_QUORUM}:/hbase
+```
+
+2. Run `hash-table` and compute hashes for ranges on the source table and output
+results to a GCS bucket (See [HashTable/SyncTable](https://hbase.apache.org/book.html#_step_1_hashtable) doc for more details).
+```bash
+hadoop jar ${JOB_JAR} \
+hash-table \
+-Dhbase.zookeeper.quorum=${HBASE_ZK_QUORUM} \
+${HBASE_TABLENAME} \
+${HASH_OUTPUTDIR}
+```
+
+3. Run `sync-table` to generate hashes on the target table and compare these hashes with
+the output from `hash-table`. For diverging hashes, a cell-level comparison is performed
+between the source and target and summarized in the job counters.
+```bash
+# dryrun mode (readonly) enabled by default
+gcloud dataproc jobs submit hadoop \
+--project ${PROJECT_ID} \
+--cluster ${DATAPROC_CLUSTER} \
+--region ${REGION} \
+--project ${PROJECT_ID} \
+--jar ${JOB_JAR} \
+-- \
+sync-table \
+--sourcezkcluster=${HBASE_ZK_QUORUM_FULL} \
+--targetbigtableproject=${PROJECT_ID} \
+--targetbigtableinstance=${CBT_INSTANCE} \
+${HASH_OUTPUTDIR} \
+${HBASE_TABLENAME} \
+${CBT_TABLENAME}
+```
+Note: A connection to the source is required for cell-level comparison. Users may
+enable debug mode `--properties mapreduce.map.log.level=DEBUG` on the job to provide additional
+details on the diverging hash ranges and cell mismatches if divergence is detected. Job
+configurations may also be updated to run `hash-table` against Bigtable and `sync-table`
+against HBase.
+
+Additional Options:
+
+1. Disable dry run mode to perform synchronization between source and target for diverging hash ranges.
+
+```bash
+--dryrun=false
+```
+
+2. For other job configuration options and details, see the [HBase SyncTable description](https://hbase.apache.org/book.html#_step_2_synctable).
+
 ## Backwards compatibility

 To maintain backwards compatibility of this artifact, we still provide
-`bigtable-hbase-1.x-mapreduce-2.0.0-alpha1.jar` artifact that includes
+`bigtable-hbase-1.x-mapreduce-2.5.0-shaded.jar` artifact that includes
 hadoop jars. However we encourage our users to migrate to
-`bigtable-hbase-1.x-mapreduce-2.0.0-alpha1-hadoop.jar` to avoid dependency
+`bigtable-hbase-1.x-mapreduce-2.5.0-shaded-byo-hadoop.jar` to avoid dependency
 conflicts with the existing classpath on Hadoop workers.
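
A minimal sketch of how the `sync-table` arguments in the README change above fit together once dry run is disabled. The values `my-project`, `my-instance`, `my-table`, and `gs://my-bucket/hashes` are hypothetical placeholders, and the quorum reuses the README's own example host. Placing `--dryrun=false` alongside the other options, ahead of the positional arguments, is an assumption based on how the README orders its command. The script only prints the arguments that would follow `--` in the `gcloud dataproc jobs submit hadoop` call; it does not submit anything.

```shell
#!/bin/sh
# Hypothetical placeholder values; replace with your own.
PROJECT_ID="my-project"
CBT_INSTANCE="my-instance"
CBT_TABLENAME="my-table"
HBASE_TABLENAME="my-table"
HBASE_ZK_QUORUM="zk1.example.com:2181"
HASH_OUTPUTDIR="gs://my-bucket/hashes"

# The README derives the sync-table source address by appending the
# znode parent to the quorum (host:port).
HBASE_ZK_QUORUM_FULL="${HBASE_ZK_QUORUM}:/hbase"

# One argument per line: job name, options, then the three positional
# arguments (hash output dir, source table, target table).
# Assumption: --dryrun=false is passed together with the other options.
SYNC_ARGS=$(printf '%s\n' \
  "sync-table" \
  "--dryrun=false" \
  "--sourcezkcluster=${HBASE_ZK_QUORUM_FULL}" \
  "--targetbigtableproject=${PROJECT_ID}" \
  "--targetbigtableinstance=${CBT_INSTANCE}" \
  "${HASH_OUTPUTDIR}" \
  "${HBASE_TABLENAME}" \
  "${CBT_TABLENAME}")

# Print the arguments for review before pasting them into a job submission.
printf '%s\n' "${SYNC_ARGS}"
```

Because dry run is read-only and enabled by default, it may be worth running the job once without `--dryrun=false` and checking the job counters before letting it write to the target.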
bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce/src/main/java/com/google/cloud/bigtable/mapreduce/Driver.java
bigtable-hbase-1.x-parent/bigtable-hbase-1.x-mapreduce/src/main/java/com/google/cloud/bigtable/mapreduce/hbasesnapshots/ImportHBaseSnapshotJob.java