|
| 1 | +# Project Context: SourceDb to Spanner |
| 2 | + |
| 3 | +<!-- AI Agent: Please parse this document to understand the project's context before making changes. --> |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +* **Core Intent:** A Dataflow pipeline for bulk migration of data from source databases (MySQL, PostgreSQL, Cassandra) into Cloud Spanner. It handles both sharded and non-sharded databases. |
| 8 | +* **Primary Users:** Internal SREs, external customers migrating to Cloud Spanner, and users of Spanner Migration Tool. |
| 9 | +* **Critical SLOs/Guarantees:** Must extract data in bulk and map it to Cloud Spanner mutations while maintaining data integrity. Failed mutations are written to a Dead Letter Queue (DLQ). |
| 10 | +* **Terminology:** |
| 11 | + * **DLQ:** Dead Letter Queue (for failed records). |
| 12 | + * **SourceRow:** Intermediate representation of a row read from the source database. |
| 13 | + * **Mutation:** Spanner mutation to be applied. |
| 14 | + |
| 15 | +## Technical Details |
| 16 | + |
| 17 | +* **Tech Stack & Versions:** |
| 18 | + * **Languages:** Java 17 |
| 19 | + * **Frameworks/Libraries:** Apache Beam 2.73.0, Maven |
| 20 | + * **Key Google Technologies:** Cloud Spanner, Cloud Storage (GCS), Dataflow |
| 21 | +* **Code Location:** `v2/sourcedb-to-spanner` (relative to the root of the DataflowTemplates repository) |
| 22 | +* **Data Flow:** Data is read from the source database (MySQL/PostgreSQL/Cassandra) using JDBC or the DataStax driver -> Mapped into SourceRows -> Transformed to Spanner Mutations -> Written to Cloud Spanner. Failed mutations are logged to a DLQ in GCS. |
| 23 | +* **Project Structure (Logical Architecture Mapping):** |
| 24 | + * `src/main/java/com/google/cloud/teleport/v2/source/reader`: Source Readers (IoWrappers for Cassandra, JDBC, etc., RowMappers) |
| 25 | + * `src/main/java/com/google/cloud/teleport/v2/transformer`: Transformers (e.g., `SourceRowToMutationDoFn`) |
| 26 | + * `src/main/java/com/google/cloud/teleport/v2/writer`: Writers and error handling (`SpannerWriter`, `DeadLetterQueue`) |
| 27 | + * `src/main/java/com/google/cloud/teleport/v2/templates`: Main pipeline definition (`SourceDbToSpanner`) |
| 28 | +* **Build/Run Commands:** |
| 29 | + ```bash |
| 30 | + # To build the flex template |
| 31 | + export PROJECT=span-cloud-ck-testing-external |
| 32 | + export BUCKET_NAME=ea-functional-tests |
| 33 | + mvn clean package -PtemplatesStage -DskipTests -DprojectId="$PROJECT" -DbucketName="$BUCKET_NAME" -DstagePrefix="templates-<replace-with-your-prefix>" -DtemplateName="Sourcedb_to_Spanner_Flex" -pl v2/sourcedb-to-spanner -am |
| 34 | + |
| 35 | + # To run tests |
| 36 | + mvn clean test -pl v2/sourcedb-to-spanner -am |
| 37 | + |
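| 38 | + # To run pipeline (set PROJECT_ID, REGION, GCS_SHARDING_PATH, SPANNER_INSTANCE_NAME, SPANNER_DATABASE_NAME, GCS_OVERRIDES_PATH and CUSTOM_JAR_PATH first; note the build step above exports PROJECT, not PROJECT_ID) |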
| 39 | + export JOB_NAME="bulk-migrate-to-spanner-$(date +%Y%m%d-%H%M%S)" |
| 40 | + export OUTPUT_DIR="gs://${BUCKET_NAME}/bulk-migration" |
| 41 | + gcloud dataflow flex-template run $JOB_NAME \ |
| 42 | + --project=$PROJECT_ID \ |
| 43 | + --region=$REGION \ |
| 44 | + --template-file-gcs-location="gs://dataflow-templates-${REGION}/latest/flex/Sourcedb_to_Spanner_Flex" \ |
| 45 | + --max-workers=2 \ |
| 46 | + --num-workers=1 \ |
| 47 | + --worker-machine-type=n2-highmem-8 \ |
| 48 | + --parameters sourceConfigURL=$GCS_SHARDING_PATH,instanceId=$SPANNER_INSTANCE_NAME,databaseId=$SPANNER_DATABASE_NAME,projectId=$PROJECT_ID,outputDirectory=$OUTPUT_DIR,username=datastream_user,password=complex_password_123,schemaOverridesFilePath=$GCS_OVERRIDES_PATH,transformationJarPath=$CUSTOM_JAR_PATH,transformationClassName=com.custom.CustomTransformationFetcher |
| 49 | + ``` |
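The data flow described above (read rows, map to SourceRows, transform to mutations, write, and divert failures to the DLQ) can be modeled in plain Java as a rough sketch. Everything here is illustrative: `DlqRoutingSketch`, the `SourceRow` and `Mutation` records, and the "missing `id`" failure rule are hypothetical stand-ins, not the template's real classes. In the actual pipeline the routing is done with Apache Beam side outputs rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the read -> map -> transform -> write flow; not the template's real classes. */
final class DlqRoutingSketch {

  /** Hypothetical stand-in for the pipeline's SourceRow: a table name plus column values. */
  record SourceRow(String table, Map<String, Object> values) {}

  /** Hypothetical stand-in for a Spanner mutation (the real type lives in the Spanner client). */
  record Mutation(String table, Map<String, Object> columns) {}

  /** Main output (mutations to write) and side output (rows destined for the DLQ). */
  record Result(List<Mutation> mutations, List<SourceRow> deadLetters) {}

  /** Assumed rule, for illustration only: a row without an "id" value cannot be mapped. */
  static Mutation toMutation(SourceRow row) {
    if (row.values().get("id") == null) {
      throw new IllegalArgumentException("missing primary key in table " + row.table());
    }
    return new Mutation(row.table(), row.values());
  }

  /** Transforms each row; rows that fail are collected separately instead of aborting the batch. */
  static Result process(List<SourceRow> rows) {
    List<Mutation> mutations = new ArrayList<>();
    List<SourceRow> deadLetters = new ArrayList<>();
    for (SourceRow row : rows) {
      try {
        mutations.add(toMutation(row));
      } catch (IllegalArgumentException e) {
        deadLetters.add(row);
      }
    }
    return new Result(mutations, deadLetters);
  }

  public static void main(String[] args) {
    Result result = process(List.of(
        new SourceRow("users", Map.of("id", 1L, "name", "a")),
        new SourceRow("users", Map.of("name", "no-key"))));
    System.out.println(result.mutations().size() + " mutation(s), "
        + result.deadLetters().size() + " dead letter(s)");
  }
}
```

The point of the sketch is that one bad row does not fail the whole batch: it is set aside with enough context to be inspected or retried later, while the remaining rows continue to the writer.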
| 50 | + |
| 51 | +## Project Management |
| 52 | + |
| 53 | +* **Buganizer Component:** [Infrastructure > Spanner > Cloud > Migrations](https://b.corp.google.com/issues?q=componentid:1008064) - (Cloud Spanner migrations component) |
| 54 | +* **Key Contacts:** |
| 55 | + * **Recent Contributors:** darshan-sj, aasthabharill, shreyakhajanchi, sm745052 |
| 56 | + |
| 57 | +## Documentation |
| 58 | + |
| 59 | +* **Key Design Docs:** |
| 60 | + * [Bulk Migration to Spanner Design](http://go/bulk-migration-to-spanner-design) - Overall pipeline design. |
| 61 | + * [CS Reader for Bulk Migration](http://go/cs-reader-for-bulk-migration-to-spanner) - Reader design. |
| 62 | + * [Spanner Bulk Migration User Guide](http://go/spanner-bulk-migration-user-guide) - Usage instructions. |
| 63 | +* **Architecture Diagram:** [architecture.svg](architecture.svg) |
| 64 | + |
| 65 | +## AI Agent Tips |
| 66 | + |
| 67 | +* **Common Tasks:** Adding new JDBC dialects, fixing parsing errors, implementing new transformations or schema overrides, adding new source reader capabilities. |
| 68 | +* **Coding Standards & Best Practices:** |
| 69 | + * Use `AutoValue` for POJOs. |
| 70 | + * Strict adherence to Apache Beam paradigms (PTransforms, DoFns). Use `TupleTag` for side outputs like the DLQ. |
| 71 | + * Use structured logging (`com.google.cloud.teleport.structured-logging`). |
| 72 | +* **Testing Frameworks & Guidelines:** |
| 73 | + * **Frameworks:** JUnit 4, Google Truth for assertions, Mockito for mocking. |
| 74 | + * **Rules:** Ensure tests use `@RunWith(JUnit4.class)`. Use embedded databases for testing when possible (e.g., `derby` or `embedded-cassandra`). |
| 75 | +* **Areas to be Careful:** Cross-shard querying logic, causal ordering around the DLQ, and parsing of schema mappings. |
| 76 | +* **Example CLs:** |
| 77 | + * [39a8ae5e0](https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/39a8ae5e0) - Fix GCS Avro Export flow |
| 78 | + * [90964dca6](https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/90964dca6) - Add Support for UUID-based Partitioning |