Migrate Data Catalog to Dataplex Knowledge Catalog for CDC templates by stankiewicz · Pull Request #3927 · GoogleCloudPlatform/DataflowTemplates

stankiewicz · 2026-06-17T10:47:20Z

This PR resolves the INVALID_ARGUMENT write operation failure (Project is not allowed to perform write operations due to Data Catalog deprecation) caused by the deprecation of the legacy Google Cloud Data Catalog API.

It migrates the Debezium-to-PubSub CDC pipeline's schema publishing and schema retrieval logic to use the new Dataplex Knowledge Catalog API (com.google.cloud.dataplex.v1.CatalogServiceClient).

Key Changes

Dependency Update: Replaced google-cloud-datacatalog with google-cloud-dataplex (v1.90.0) in cdc-common.
Schema Aspect Migration: Refactored SchemaUtils.java to convert Apache Beam Schema objects into Dataplex technical schema aspects (represented as google.protobuf.Struct following the system-defined dataplex-types.global.schema format).
API Client Update:
- Replaced all usages of the legacy DataCatalogClient with CatalogServiceClient inside DataCatalogSchemaUtils.java.
- Migrated createEntryGroup to utilize createEntryGroupAsync as required by the Dataplex SDK for long-running operations.
Metadata Restoration: Preserved userSpecifiedSystem and userSpecifiedType logic by mapping them appropriately to the new Dataplex EntrySource object (system field and user_specified_type label).

gemini-code-assist · 2026-06-17T10:47:44Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical failure in the CDC pipeline caused by the deprecation of the legacy Google Cloud Data Catalog API. By migrating the schema publishing and retrieval logic to the Dataplex Knowledge Catalog API, the pipeline maintains compatibility with current Google Cloud infrastructure standards. The changes ensure that metadata, including system types and labels, is correctly preserved and mapped to the new Dataplex entry structures.

Highlights

API Migration: Migrated the CDC pipeline's schema management from the deprecated Google Cloud Data Catalog API to the new Dataplex Knowledge Catalog API.
Dependency Update: Replaced the google-cloud-datacatalog dependency with google-cloud-dataplex (v1.90.0) in the cdc-common module.
Schema Refactoring: Updated SchemaUtils to convert Apache Beam schemas into Dataplex technical schema aspects using protobuf Structs.
Asynchronous Operations: Refactored entry group creation to use asynchronous operations as required by the Dataplex SDK.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double-check its output and use code with caution.

gemini-code-assist

Code Review

This pull request migrates the CDC parent and connector modules from Google Cloud Data Catalog to Google Cloud Dataplex Catalog, updating dependencies, client initializations, and schema mapping utilities to use Dataplex's aspect-based schema model. The review feedback identifies several critical issues, including resource leaks from unclosed CatalogServiceClient instances, an N+1 query performance bottleneck when retrieving entry schemas, improper handling of InterruptedException, and potential NullPointerException risks when parsing protobuf Struct fields without defensive checks.

stankiewicz · 2026-06-17T11:04:39Z

/gemini review

gemini-code-assist

Code Review

This pull request migrates the CDC common library from Google Cloud Data Catalog to Google Cloud Dataplex Catalog, updating dependencies, clients, and schema mapping utilities to use Dataplex aspects. Feedback on these changes suggests propagating exceptions during asynchronous entry group creation to avoid invalid states, filtering search results in lookupPubSubEntry to verify they belong to the @pubsub entry group, and normalizing data types to lowercase in SchemaUtils to ensure robust, case-insensitive type mapping.
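The case-insensitive type mapping suggested in this review could look roughly like the sketch below. `TypeNames`, its `canonicalize` method, and the four type names are hypothetical and only illustrate the lowercase-then-look-up idea; they are not the mapping used in SchemaUtils.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; the type names are illustrative, not the PR's actual mapping. */
final class TypeNames {
  // Keys are stored in lowercase so that lookups can ignore the caller's casing.
  private static final Map<String, String> CANONICAL =
      Map.of(
          "string", "STRING",
          "int64", "INT64",
          "double", "DOUBLE",
          "boolean", "BOOLEAN");

  private TypeNames() {}

  /** Returns the canonical name for a raw type string, or empty if it is unknown. */
  static Optional<String> canonicalize(String rawType) {
    if (rawType == null) {
      return Optional.empty();
    }
    // Locale.ROOT avoids locale-specific case rules such as the Turkish dotless i.
    String key = rawType.trim().toLowerCase(Locale.ROOT);
    return Optional.ofNullable(CANONICAL.get(key));
  }
}
```

Returning an `Optional` rather than throwing leaves the decision about unknown types to the caller, which is one possible design choice among several.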

gemini-code-assist · 2026-06-17T11:07:26Z

+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+        LOG.error("Interrupted while creating EntryGroup", e);
+      } catch (ExecutionException e) {
+        if (e.getCause() instanceof AlreadyExistsException) {
+          this.entryGroupCreated = true;
+        } else {
+          LOG.error("Failed to create EntryGroup", e);
+        }
      }


If createEntryGroupAsync fails with an exception other than AlreadyExistsException (or if the thread is interrupted), the error is logged but the exception is swallowed. This allows the constructor of SingleTopicSchemaManager to complete successfully in an invalid state, which will cause subsequent schema updates to fail.

To ensure fail-fast behavior and make debugging easier, we should propagate these failures by throwing a RuntimeException.

      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LOG.error("Interrupted while creating EntryGroup", e);
        throw new RuntimeException("Interrupted while creating EntryGroup", e);
      } catch (ExecutionException e) {
        if (e.getCause() instanceof AlreadyExistsException) {
          this.entryGroupCreated = true;
        } else {
          LOG.error("Failed to create EntryGroup", e);
          throw new RuntimeException("Failed to create EntryGroup", e);
        }
      }

References

When throwing exceptions in environments where the caller or global exception handler might not log them (such as certain Dataflow templates), log the error message before throwing the exception to ensure the failure is recorded.
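The fail-fast pattern from this comment can be tried in isolation with only the JDK. In the sketch below, `AlreadyExistsStandIn` and `FailFastCreate` are hypothetical names standing in for the API exception and the schema manager; the Dataplex client itself is not involved.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Stand-in for the API's AlreadyExistsException; hypothetical, for illustration only. */
final class AlreadyExistsStandIn extends RuntimeException {
  AlreadyExistsStandIn(String message) {
    super(message);
  }
}

final class FailFastCreate {
  private FailFastCreate() {}

  /**
   * Waits for a creation future. Returns true if the resource now exists (newly created or
   * already present) and throws for any other outcome, so callers never continue in a
   * half-initialized state.
   */
  static boolean awaitCreated(Future<?> creation) {
    try {
      creation.get();
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the interrupt flag for callers
      throw new RuntimeException("Interrupted while creating resource", e);
    } catch (ExecutionException e) {
      if (e.getCause() instanceof AlreadyExistsStandIn) {
        return true; // benign: the resource was created earlier
      }
      throw new RuntimeException("Failed to create resource", e);
    }
  }
}
```

A caller such as a constructor would invoke `awaitCreated` and let the `RuntimeException` propagate, which surfaces the failure at startup instead of at the first schema update.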

codecov · 2026-06-17T11:32:31Z

Codecov Report

❌ Patch coverage is 0% with 255 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.58%. Comparing base (65d422a) to head (85b0612).
⚠️ Report is 20 commits behind head on main.

Files with missing lines	Patch %	Lines
...ud/dataflow/cdc/common/DataCatalogSchemaUtils.java	0.00%	151 Missing ⚠️
.../google/cloud/dataflow/cdc/common/SchemaUtils.java	0.00%	75 Missing ⚠️
.../dataflow/cdc/applier/CdcPCollectionsFetchers.java	0.00%	28 Missing ⚠️
...d/dataflow/cdc/connector/PubSubChangeConsumer.java	0.00%	1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3927      +/-   ##
============================================
- Coverage     55.59%   55.58%   -0.01%     
+ Complexity     7052     6731     -321     
============================================
  Files          1103     1114      +11     
  Lines         67681    68377     +696     
  Branches       7603     7718     +115     
============================================
+ Hits          37625    38006     +381     
- Misses        27636    27906     +270     
- Partials       2420     2465      +45

Components	Coverage Δ
spanner-templates	`87.80% <ø> (-0.70%)`	⬇️
spanner-import-export	`68.69% <ø> (-0.02%)`	⬇️
spanner-live-forward-migration	`90.21% <ø> (-0.04%)`	⬇️
spanner-live-reverse-replication	`83.33% <ø> (-1.11%)`	⬇️
spanner-bulk-migration	`92.58% <ø> (-0.05%)`	⬇️
gcs-spanner-dv	`88.88% <ø> (-0.02%)`	⬇️

Files with missing lines	Coverage Δ
...d/dataflow/cdc/connector/PubSubChangeConsumer.java	`0.00% <0.00%> (ø)`
.../dataflow/cdc/applier/CdcPCollectionsFetchers.java	`0.00% <0.00%> (ø)`
.../google/cloud/dataflow/cdc/common/SchemaUtils.java	`0.00% <0.00%> (ø)`
...ud/dataflow/cdc/common/DataCatalogSchemaUtils.java	`0.00% <0.00%> (ø)`

... and 55 files with indirect coverage changes


derrickaw · 2026-06-17T12:22:23Z

Fixes: #3921

stankiewicz · 2026-06-26T21:22:19Z

/gemini review

gemini-code-assist

Code Review

This pull request migrates the CDC application from Google Cloud Data Catalog to Google Cloud Dataplex Catalog, updating dependencies, client implementations, and schema conversion utilities to use Dataplex aspects. The review feedback highlights several critical issues: a resource leak due to an unclosed CatalogServiceClient in DataCatalogSchemaManager, swallowed exceptions in getSchemasForEntryGroup and createEntryGroup that hinder debugging, a potential thread hang from calling .get() without a timeout, misleading error handling during JSON parsing, and a lack of string trimming when parsing comma-separated topics or subscriptions.

gemini-code-assist · 2026-06-26T21:25:14Z

  public abstract static class DataCatalogSchemaManager {
    final String gcpProject;
    final String location;
-    DataCatalogClient client;
+    CatalogServiceClient client;
+    private final java.util.Set<String> createdEntryGroups = new java.util.HashSet<>();


The CatalogServiceClient is instantiated as an instance variable in DataCatalogSchemaManager but is never closed, which will leak the client and its underlying gRPC channels/connections. To prevent resource leaks, DataCatalogSchemaManager should implement AutoCloseable and close the client in its close() method.

Suggested change

-  public abstract static class DataCatalogSchemaManager {
+  public abstract static class DataCatalogSchemaManager implements AutoCloseable {
     final String gcpProject;
     final String location;
     CatalogServiceClient client;
     private final java.util.Set<String> createdEntryGroups = new java.util.HashSet<>();
+
+    @Override
+    public void close() {
+      if (client != null) {
+        client.close();
+        client = null;
+      }
+    }
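To show what implementing AutoCloseable buys the caller, here is a minimal JDK-only sketch. `FakeClient`, `OwningManager`, and `OwningManagerDemo` are hypothetical stand-ins for CatalogServiceClient and DataCatalogSchemaManager; the real client is not used.

```java
/** Hypothetical stand-in for a client that holds network resources. */
final class FakeClient implements AutoCloseable {
  private boolean closed;

  boolean isClosed() {
    return closed;
  }

  @Override
  public void close() {
    closed = true;
  }
}

/** Minimal manager that owns its client and releases it at most once. */
final class OwningManager implements AutoCloseable {
  private FakeClient client;

  OwningManager(FakeClient client) {
    this.client = client;
  }

  @Override
  public void close() {
    if (client != null) {
      client.close();
      client = null; // makes a second close() call a no-op
    }
  }
}

final class OwningManagerDemo {
  private OwningManagerDemo() {}

  /** Uses the manager in try-with-resources and reports whether the client was closed. */
  static boolean clientClosedAfterUse() {
    FakeClient client = new FakeClient();
    try (OwningManager manager = new OwningManager(client)) {
      // Work with the manager here; close() runs even if this block throws.
    }
    return client.isClosed();
  }
}
```

With this shape, call sites that construct a manager can wrap it in try-with-resources, assuming the manager's lifetime is scoped to that block.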

gemini-code-assist · 2026-06-26T21:25:14Z

+    } catch (Exception e) {
+      LOG.error("Failed to list entries: ", e);
    }


Catching and silently swallowing all exceptions in getSchemasForEntryGroup is a regression from the original code. If listing entries fails (e.g., due to permission or network issues), returning an empty map silently makes debugging extremely difficult. The exception should be logged and rethrown (wrapped in a RuntimeException).

Suggested change

     } catch (Exception e) {
       LOG.error("Failed to list entries: ", e);
+      throw new RuntimeException("Failed to list entries for entry group " + entryGroupId, e);
     }

gemini-code-assist · 2026-06-26T21:25:14Z

+        client.createEntryGroupAsync(entryGroupRequest).get();
+        LOG.info("Created EntryGroup: {}", entryGroupId);

-        this.entryGroupCreated = true;
+        this.createdEntryGroups.add(entryGroupId);
      } catch (AlreadyExistsException e) {
-        // EntryGroup already exists. There is no further action needed.
+        // EntryGroup already exists.
+        this.createdEntryGroups.add(entryGroupId);
+      } catch (InterruptedException e) {
+        Thread.currentThread().interrupt();
+        LOG.error("Interrupted while creating EntryGroup", e);
+      } catch (ExecutionException e) {
+        if (e.getCause() instanceof AlreadyExistsException) {
+          this.createdEntryGroups.add(entryGroupId);
+        } else {
+          LOG.error("Failed to create EntryGroup", e);
+        }
      }


In createEntryGroup, InterruptedException and ExecutionException (for non-AlreadyExists causes) are logged but swallowed, allowing the code to proceed and fail later with a less clear error. Additionally, calling .get() without a timeout can cause the thread to hang indefinitely if the service is unresponsive. We should use a timeout and propagate these exceptions wrapped in a RuntimeException to fail fast and preserve the root cause.

        client.createEntryGroupAsync(entryGroupRequest).get(1, java.util.concurrent.TimeUnit.MINUTES);
        LOG.info("Created EntryGroup: {}", entryGroupId);
        this.createdEntryGroups.add(entryGroupId);
      } catch (AlreadyExistsException e) {
        // EntryGroup already exists.
        this.createdEntryGroups.add(entryGroupId);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LOG.error("Interrupted while creating EntryGroup", e);
        throw new RuntimeException("Interrupted while creating EntryGroup " + entryGroupId, e);
      } catch (ExecutionException e) {
        if (e.getCause() instanceof AlreadyExistsException) {
          this.createdEntryGroups.add(entryGroupId);
        } else {
          LOG.error("Failed to create EntryGroup", e);
          throw new RuntimeException("Failed to create EntryGroup " + entryGroupId, e.getCause());
        }
      } catch (java.util.concurrent.TimeoutException e) {
        LOG.error("Timeout while creating EntryGroup", e);
        throw new RuntimeException("Timeout while creating EntryGroup " + entryGroupId, e);
      }
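The timeout part of this suggestion can be isolated into a small helper. The sketch below uses only `java.util.concurrent`; `BoundedWait` and its `await` method are hypothetical names, and the millisecond timeout is chosen to keep the example quick rather than to match the one-minute value above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedWait {
  private BoundedWait() {}

  /** Waits up to timeoutMillis for the future and converts every failure into a RuntimeException. */
  static <T> T await(Future<T> future, long timeoutMillis, String what) {
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      throw new RuntimeException("Timed out waiting for " + what, e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the interrupt flag for callers
      throw new RuntimeException("Interrupted waiting for " + what, e);
    } catch (ExecutionException e) {
      // Unwrap so that the root cause is the direct cause of the thrown exception.
      throw new RuntimeException("Failed: " + what, e.getCause());
    }
  }
}
```

A future that never completes now produces a `RuntimeException` whose cause is a `TimeoutException` once the limit elapses, instead of blocking the calling thread indefinitely.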

gemini-code-assist · 2026-06-26T21:25:14Z

+        JsonFormat.parser().merge(schemaData.getFieldsOrThrow("type").getStringValue(), builder);
+        return SchemaUtils.toBeamSchema(builder.build());
+      }


JsonFormat.parser().merge throws an IOException (specifically InvalidProtocolBufferException) if the JSON is malformed. Currently, this is caught by the outer catch block, which logs a misleading error message: "Unable to create a CatalogServiceClient". This exception should be caught separately with an accurate error message.

        Struct.Builder builder = Struct.newBuilder();
        try {
          JsonFormat.parser().merge(schemaData.getFieldsOrThrow("type").getStringValue(), builder);
        } catch (IOException e) {
          LOG.error("Failed to parse schema JSON from aspect", e);
          throw new RuntimeException("Failed to parse schema JSON from aspect", e);
        }
        return SchemaUtils.toBeamSchema(builder.build());

gemini-code-assist · 2026-06-26T21:25:14Z

+      if (options.getInputSubscriptions() != null && !options.getInputSubscriptions().isEmpty()) {
+        subscriptionList = Arrays.asList(options.getInputSubscriptions().split(","));
+        topicList =
+            subscriptionList.stream()
+                .map(
+                    s -> {
+                      try {
+                        return PubsubUtils.getPubSubTopicFromSubscription(project, s).getTopic();
+                      } catch (IOException e) {
+                        throw new RuntimeException(e);
+                      }
+                    })
+                .collect(Collectors.toList());
+      } else {
+        Preconditions.checkArgument(
+            options.getInputTopics() != null && !options.getInputTopics().isEmpty(),
+            "Must provide an inputSubscriptions or inputTopics parameter.");
+        topicList = Arrays.asList(options.getInputTopics().split(","));
+        subscriptionList = topicList.stream().map(t -> (String) null).collect(Collectors.toList());
+      }


Splitting comma-separated options (subscriptions or topics) without trimming whitespace can cause runtime failures if users include spaces (e.g., "sub1, sub2"). It is safer to trim each element after splitting.

      if (options.getInputSubscriptions() != null && !options.getInputSubscriptions().isEmpty()) {
        subscriptionList =
            Arrays.stream(options.getInputSubscriptions().split(","))
                .map(String::trim)
                .collect(Collectors.toList());
        topicList =
            subscriptionList.stream()
                .map(
                    s -> {
                      try {
                        return PubsubUtils.getPubSubTopicFromSubscription(project, s).getTopic();
                      } catch (IOException e) {
                        throw new RuntimeException(e);
                      }
                    })
                .collect(Collectors.toList());
      } else {
        Preconditions.checkArgument(
            options.getInputTopics() != null && !options.getInputTopics().isEmpty(),
            "Must provide an inputSubscriptions or inputTopics parameter.");
        topicList =
            Arrays.stream(options.getInputTopics().split(","))
                .map(String::trim)
                .collect(Collectors.toList());
        subscriptionList = topicList.stream().map(t -> (String) null).collect(Collectors.toList());
      }
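The split-and-trim step can be factored into one helper so that both branches share it. The sketch below is JDK-only; `CsvOptions.splitAndTrim` is a hypothetical name, and dropping empty items is an extra assumption that goes beyond what the review comment asks for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class CsvOptions {
  private CsvOptions() {}

  /** Splits a comma-separated option value, trimming whitespace and dropping empty items. */
  static List<String> splitAndTrim(String value) {
    if (value == null) {
      return List.of();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        // Assumption: empty items (for example from "a,,b") are noise, not real names.
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }
}
```

With such a helper, an input like `"sub1, sub2"` yields `sub1` and `sub2` without the leading space that a bare `split(",")` would leave on the second item.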

Migrate Data Catalog to Dataplex Knowledge Catalog for CDC templates

88a40f5

pull-request-size Bot added the size/L label Jun 17, 2026

gemini-code-assist Bot reviewed Jun 17, 2026


Fix ID sanitization to comply with Dataplex Resource Name constraints

233c899

stankiewicz added the improvement and bug-fix labels Jun 17, 2026

Address code review comments for Dataplex integration

f5574de

gemini-code-assist Bot reviewed Jun 17, 2026


stankiewicz added 2 commits June 17, 2026 13:13

Normalize Dataplex field types to lowercase for robust mapping

02ff5e2

Filter Dataplex search results to system @PubSub entry group

fc9f8d2

stankiewicz added 2 commits June 17, 2026 14:01

Add missing required metadataType to Dataplex schema aspect fields

e0a53a5

Map Beam types to valid Dataplex technical types for metadataType

240f836

stankiewicz added 6 commits June 17, 2026 14:44

Fix missing generic aspect for Dataplex Entries

bde4fb0

revert sanitation

3301277

revert filter

d69242e

add more verbose logging, change location

6fe779c

add more verbose logging, change location

05cc1ec

Handle system aspect keys dynamically in Dataplex

3c584b4

pull-request-size Bot added size/XL and removed size/L labels Jun 17, 2026

stankiewicz added 6 commits June 17, 2026 23:49

spotless

e9ceefc

verbose logging

2c8f12c

Map dataType to Dataplex expected capitalized types

e2c4812

verbose logging

35e24cc

verbose logging

494f2a5

refactor schema persistence

f174180

stankiewicz added 4 commits June 18, 2026 19:25

add sleep to dataplex to ingest

a03ec04

fix entrygroups in dataplex

a72111c

fix entrygroups in dataplex

e59dea4

move to entrygroups

85b0612

gemini-code-assist Bot reviewed Jun 26, 2026
