Commit ac452be

Generate project-context.md for sourcedb-to-spanner
1 parent b8ef535 commit ac452be

3 files changed

Lines changed: 247 additions & 0 deletions

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
digraph Architecture {
  node [shape=box, style=filled, color=lightblue];

  SourceDb [label="Source Database\n(Cassandra, MySQL, PostgreSQL)"];

  subgraph cluster_Reader {
    label = "com.google.cloud.teleport.v2.source.reader";
    ReaderImpl [label="ReaderImpl"];
    IoWrapper [label="IoWrapper (Cassandra, JDBC)"];
    RowMapper [label="RowMapper"];
  }

  subgraph cluster_Transformer {
    label = "com.google.cloud.teleport.v2.transformer";
    SourceRowToMutation [label="SourceRowToMutationDoFn"];
  }

  subgraph cluster_Writer {
    label = "com.google.cloud.teleport.v2.writer";
    SpannerWriter [label="SpannerWriter"];
    DLQ [label="DeadLetterQueue"];
  }

  Spanner [label="Cloud Spanner"];
  GCS [label="GCS (DLQ)"];

  SourceDb -> IoWrapper;
  IoWrapper -> RowMapper;
  RowMapper -> ReaderImpl;
  ReaderImpl -> SourceRowToMutation [label="SourceRow"];
  SourceRowToMutation -> SpannerWriter [label="Mutation"];
  SpannerWriter -> Spanner;
  SpannerWriter -> DLQ [label="Failed Mutations"];
  DLQ -> GCS;
}
Lines changed: 134 additions & 0 deletions
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Project Context: SourceDb to Spanner

<!-- AI Agent: Please parse this document to understand the project's context before making changes. -->

## Overview

* **Core Intent:** A Dataflow pipeline for bulk migration of data from source databases (MySQL, PostgreSQL, Cassandra) into Cloud Spanner. It handles both sharded and non-sharded databases.
* **Primary Users:** Internal SREs, external customers migrating to Cloud Spanner, and users of the Spanner Migration Tool.
* **Critical SLOs/Guarantees:** Must handle bulk data extraction and mapping to Cloud Spanner mutations while maintaining data integrity. Failed mutations are routed to a Dead Letter Queue (DLQ).
* **Terminology:**
    * **DLQ:** Dead Letter Queue (for failed records).
    * **SourceRow:** Intermediate representation of a row read from the source database.
    * **Mutation:** Spanner mutation to be applied.

## Technical Details

* **Tech Stack & Versions:**
    * **Languages:** Java 17
    * **Frameworks/Libraries:** Apache Beam 2.73.0, Maven
    * **Key Google Technologies:** Cloud Spanner, Cloud Storage (GCS), Dataflow
* **Code Location:** `/usr/local/google/home/aasthabharill/DataflowTemplates/v2/sourcedb-to-spanner`
* **Data Flow:** Data is read from source databases (MySQL/PostgreSQL/Cassandra) using JDBC or the DataStax driver -> mapped into SourceRows -> transformed to Spanner Mutations -> written to Cloud Spanner. Failed mutations are logged to a GCS DLQ.
* **Project Structure (Logical Architecture Mapping):**
    * `src/main/java/com/google/cloud/teleport/v2/source/reader`: Source Readers (IoWrappers for Cassandra, JDBC, etc., RowMappers)
    * `src/main/java/com/google/cloud/teleport/v2/transformer`: Transformers (e.g., `SourceRowToMutationDoFn`)
    * `src/main/java/com/google/cloud/teleport/v2/writer`: Writers and error handling (`SpannerWriter`, `DeadLetterQueue`)
    * `src/main/java/com/google/cloud/teleport/v2/templates`: Main pipeline definition (`SourceDbToSpanner`)
* **Build/Run Commands:**
    ```bash
    # To build the flex template
    export PROJECT=span-cloud-ck-testing-external
    export BUCKET_NAME=ea-functional-tests
    mvn clean package -PtemplatesStage -DskipTests -DprojectId="$PROJECT" -DbucketName="$BUCKET_NAME" -DstagePrefix="templates-<replace-with-your-prefix>" -DtemplateName="Sourcedb_to_Spanner_Flex" -pl v2/sourcedb-to-spanner -am

    # To run tests
    mvn clean test -pl v2/sourcedb-to-spanner -am

    # To run pipeline
    # PROJECT_ID, REGION, GCS_SHARDING_PATH, SPANNER_INSTANCE_NAME, SPANNER_DATABASE_NAME,
    # GCS_OVERRIDES_PATH, and CUSTOM_JAR_PATH are not set above and must be exported first.
    export JOB_NAME="bulk-migrate-to-spanner-$(date +%Y%m%d-%H%M%S)"
    export OUTPUT_DIR="gs://${BUCKET_NAME}/bulk-migration"
    gcloud dataflow flex-template run "$JOB_NAME" \
      --project="$PROJECT_ID" \
      --region="$REGION" \
      --template-file-gcs-location="gs://dataflow-templates-${REGION}/latest/flex/Sourcedb_to_Spanner_Flex" \
      --max-workers=2 \
      --num-workers=1 \
      --worker-machine-type=n2-highmem-8 \
      --parameters "sourceConfigURL=$GCS_SHARDING_PATH,instanceId=$SPANNER_INSTANCE_NAME,databaseId=$SPANNER_DATABASE_NAME,projectId=$PROJECT_ID,outputDirectory=$OUTPUT_DIR,username=datastream_user,password=complex_password_123,schemaOverridesFilePath=$GCS_OVERRIDES_PATH,transformationJarPath=$CUSTOM_JAR_PATH,transformationClassName=com.custom.CustomTransformationFetcher"
    ```

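The data flow described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the pipeline's code: `BulkFlowSketch`, its nested records, and the `Writer` interface are hypothetical stand-ins for `SourceRow`, Spanner mutations, `SpannerWriter`, and the DLQ, and the real pipeline expresses these steps as Apache Beam transforms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the read -> transform -> write flow with a DLQ.
// None of these types are the project's real classes.
final class BulkFlowSketch {
  /** Stand-in for the intermediate row read from the source database. */
  record SourceRow(String table, Map<String, Object> values) {}

  /** Stand-in for a Spanner mutation. */
  record Mutation(String table, Map<String, Object> columns) {}

  /** A mutation that failed to write, with the failure reason. */
  record FailedMutation(Mutation mutation, String error) {}

  /** Outcome of one run: what was written and what goes to the DLQ. */
  record Result(List<Mutation> written, List<FailedMutation> deadLetters) {}

  /** Stand-in for the writer; throws when a mutation cannot be applied. */
  interface Writer {
    void write(Mutation mutation) throws Exception;
  }

  /** Transform step: one SourceRow becomes one Mutation. */
  static Mutation toMutation(SourceRow row) {
    return new Mutation(row.table(), Map.copyOf(row.values()));
  }

  /** Writes every row; failed mutations are collected instead of aborting the run. */
  static Result run(List<SourceRow> rows, Writer writer) {
    List<Mutation> written = new ArrayList<>();
    List<FailedMutation> deadLetters = new ArrayList<>();
    for (SourceRow row : rows) {
      Mutation mutation = toMutation(row);
      try {
        writer.write(mutation);
        written.add(mutation);
      } catch (Exception e) {
        deadLetters.add(new FailedMutation(mutation, e.getMessage()));
      }
    }
    return new Result(List.copyOf(written), List.copyOf(deadLetters));
  }

  public static void main(String[] args) {
    // Assumed failure rule, for illustration only: the "orders" table rejects writes.
    Writer writer = m -> {
      if (m.table().equals("orders")) {
        throw new IllegalStateException("write rejected");
      }
    };
    Result result = run(
        List.of(
            new SourceRow("users", Map.of("id", 1L)),
            new SourceRow("orders", Map.of("id", 2L))),
        writer);
    System.out.println(result.written().size() + " written, "
        + result.deadLetters().size() + " sent to DLQ");
  }
}
```

The sketch keeps the failed mutation together with its error message, which is the kind of record a DLQ needs for later inspection or retry. What the real DLQ stores is not specified here.
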
## Project Management

* **Buganizer Component:** [Infrastructure > Spanner > Cloud > Migrations](https://b.corp.google.com/issues?q=componentid:1008064) - (Cloud Spanner migrations component)
* **Key Contacts:**
    * **Recent Contributors:** darshan-sj, aasthabharill, shreyakhajanchi, sm745052

## Documentation

* **Key Design Docs:**
    * [Bulk Migration to Spanner Design](http://go/bulk-migration-to-spanner-design) - Overall pipeline design.
    * [CS Reader for Bulk Migration](http://go/cs-reader-for-bulk-migration-to-spanner) - Reader design.
    * [Spanner Bulk Migration User Guide](http://go/spanner-bulk-migration-user-guide) - Usage instructions.
* **Architecture Diagram:** [architecture.svg](architecture.svg)

## AI Agent Tips

* **Common Tasks:** Adding new JDBC dialects, fixing parsing errors, implementing new transformations or schema overrides, adding new source reader capabilities.
* **Coding Standards & Best Practices:**
    * Use `AutoValue` for POJOs.
    * Strict adherence to Apache Beam paradigms (PTransforms, DoFns). Use `TupleTag` for side outputs like the DLQ.
    * Use structured logging (`com.google.cloud.teleport.structured-logging`).
* **Testing Frameworks & Guidelines:**
    * **Frameworks:** JUnit 4, Google Truth for assertions, Mockito for mocking.
    * **Rules:** Ensure tests use `@RunWith(JUnit4.class)`. Use embedded databases for testing when possible (e.g. `derby` or `embedded-cassandra`).
* **Areas to be Careful:** Cross-shard querying logic, causal ordering around the DLQ, and schema mappings parsing.
* **Example CLs:**
    * [39a8ae5e0](https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/39a8ae5e0) - Fix GCS Avro Export flow
    * [90964dca6](https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/90964dca6) - Add Support for UUID-based Partitioning
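
The `TupleTag` guideline above is about naming each output of a step with a typed key, so that failures can be routed to a side output such as the DLQ. The sketch below mimics that idea in plain Java with no Beam dependency. `Tag`, `Outputs`, and `route` are hypothetical names rather than Beam or project APIs, and a real `DoFn` would use Beam's own `TupleTag` instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of typed output tags; this is not Beam's API.
final class TaggedOutputSketch {
  /** Typed key naming one output; ids are assumed to be unique. */
  record Tag<T>(String id) {}

  /** Collects emitted elements per tag. */
  static final class Outputs {
    private final Map<Tag<?>, List<Object>> byTag = new HashMap<>();

    <T> void emit(Tag<T> tag, T value) {
      byTag.computeIfAbsent(tag, t -> new ArrayList<>()).add(value);
    }

    // The unchecked cast is sound only while each tag id is used with a single type.
    @SuppressWarnings("unchecked")
    <T> List<T> get(Tag<T> tag) {
      List<Object> stored = byTag.getOrDefault(tag, List.of());
      return (List<T>) List.copyOf(stored);
    }
  }

  static final Tag<Integer> PARSED = new Tag<>("parsed");
  static final Tag<String> DLQ = new Tag<>("dlq");

  /** Parses each input as an integer; unparseable inputs go to the DLQ tag. */
  static Outputs route(List<String> inputs) {
    Outputs out = new Outputs();
    for (String input : inputs) {
      try {
        out.emit(PARSED, Integer.parseInt(input.trim()));
      } catch (NumberFormatException e) {
        out.emit(DLQ, input);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Outputs out = route(List.of("1", "x", " 3 "));
    System.out.println("parsed=" + out.get(PARSED) + " dlq=" + out.get(DLQ));
  }
}
```

Because each tag carries its element type, a caller reading the DLQ output gets a `List<String>` and a caller reading the main output gets a `List<Integer>`, without casts at the call site.
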
