This document is the practical runbook for operating the DBLog paper implementation in this repository.
If you are trying to get one real sync run working quickly, start with one of
the canonical demo scripts under scripts/demo/ (see §7), then use this file
as the deeper operator reference.
Use this file for:
- choosing a boot mode,
- wiring runtime properties,
- enabling sinks and target apply,
- enabling the local control plane,
- interpreting health and schema-issue state,
- deciding what to do when DBLog fails closed.
For semantic guarantees and fail-closed boundaries, use README.md, §6.2 in
this file, plus the Javadocs on the key source types under src/main/java
(TableSchema, WindowReconciler, DefaultDumpWindowCoordinator,
DumpTableProgress, CheckpointFlushPolicy, and ControlPlaneCommandService).
For a compact supported-source summary, see README.md.
For control-plane semantics and payload guidance, use docs/CONTROL_PLANE.md.
Current boot modes:
| Mode | Purpose | Typical use |
|---|---|---|
| `runtime` | Long-running replication host | Local demos and real replication runs |
| `scenario` | Proof/fault-injection harness | Verification and adversarial testing |
| `startup-check` | One-shot startup validation | Smoke checks and preflight-only runs |
For normal replication, use:
dblog.boot-mode=runtime

Every runtime deployment needs:
dblog.boot-mode=runtime
dblog.runtime.state-path=/path/to/state
dblog.source.adapter=mysql|postgres
dblog.source.id=your_source_id
dblog.source.tables[0]=schema_or_database.table

Note on state-path: this is the H2 database-file prefix, not a directory.
Given dblog.runtime.state-path=/path/to/state, H2 writes the data file at
/path/to/state.mv.db and the trace file at /path/to/state.trace.db. To
reset state between runs, remove /path/to/state* — rm -rf /path/to/state
(treating it as a directory) leaves the real state file behind. The same
shape applies to dblog.scenario.state-path.
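As a minimal sketch of that reset, assuming the two H2 suffixes named above are the only files you want to clear: the helper name `reset_state` is hypothetical and not part of DBLog. A shell glob on the prefix would also catch any other files that share it.

```python
from pathlib import Path
from typing import List

# Suffixes H2 appends to the configured prefix, per the note above.
H2_SUFFIXES = (".mv.db", ".trace.db")


def reset_state(state_path: str) -> List[str]:
    """Delete the H2 files derived from a state-path prefix.

    Returns the paths that were actually removed. Hypothetical helper,
    not part of DBLog. A directory that happens to share the prefix name
    is left alone, because the state lives in the suffixed files.
    """
    removed = []
    for suffix in H2_SUFFIXES:
        candidate = Path(state_path + suffix)
        if candidate.is_file():
            candidate.unlink()
            removed.append(str(candidate))
    return removed
```

Run it only while the DBLog process is stopped; the state file is locked while the runtime has it open.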
Then add the adapter-specific source properties.
DBLog also requires at least one explicit sink configuration for runtime and
startup-check; it does not auto-install an implicit discard sink.
The repo Dockerfile now sets a production-oriented default JAVA_TOOL_OPTIONS
for the packaged container runtime:
- `-XX:+UseG1GC`
- `-XX:InitialRAMPercentage=25.0`
- `-XX:MaxRAMPercentage=65.0`
- `-XX:+HeapDumpOnOutOfMemoryError`
- `-XX:HeapDumpPath=/var/lib/dblog/state/heapdump.hprof`
- `-XX:+ExitOnOutOfMemoryError`
- `-Xlog:gc*:stdout:time,uptime,level,tags`
- `-XX:StartFlightRecording=filename=/var/lib/dblog/state/runtime.jfr,settings=profile,maxage=30m,maxsize=256m,dumponexit=true`
Practical effect for the shipped packaged example:
- heap sizing follows container memory instead of a hard-coded `-Xmx`,
- GC logs go to stdout/stderr with timestamps,
- OOM exits fail fast and produce a heap dump under the mounted state path,
- a rolling profile JFR is written to `/var/lib/dblog/state/runtime.jfr`.
The host-run ./gradlew bootRun examples still use the host JVM defaults unless
you export your own JAVA_TOOL_OPTIONS or run java -jar with explicit flags.
Canonical MySQL source properties:
dblog.source.adapter=mysql
dblog.source.id=my_mysql_source
dblog.source.tables[0]=app.sample_orders
dblog.source.mysql.jdbc-url=jdbc:mysql://host:3306/app
dblog.source.mysql.username=dblog
dblog.source.mysql.password=dblog
dblog.source.mysql.hostname=host
dblog.source.mysql.port=3306
dblog.source.mysql.server-id=223401

Optional MySQL runtime tuning:
- `dblog.source.mysql.connect-timeout`
- `dblog.source.mysql.source-event-queue-capacity`
- `dblog.source.mysql.retry-log-connection-loss`
- `dblog.source.mysql.reconnect-backoff`
DBLog ships no typed TLS configuration. See docs/adapters/mysql.md for the full picture
— in short, the binlog client cannot negotiate TLS, so a server with
require_secure_transport=ON is unsupported by the shipped DBLog.
Canonical PostgreSQL source properties:
dblog.source.adapter=postgres
dblog.source.id=my_postgres_source
dblog.source.tables[0]=demo.sample_orders
dblog.source.postgres.jdbc-url=jdbc:postgresql://host:5432/app
dblog.source.postgres.replication-jdbc-url=jdbc:postgresql://host:5432/app
dblog.source.postgres.database-name=app
dblog.source.postgres.username=dblog
dblog.source.postgres.password=dblog
dblog.source.postgres.publication-name=my_pub
dblog.source.postgres.slot-name=my_slot

Use a dedicated runtime role rather than the postgres superuser. With the
default DBLog-managed publication and slot settings, that role needs logical
replication privileges, write access to dblog_meta, and ownership of every
captured table so PostgreSQL permits DBLog to create or repair the explicit
publication. One-time setup, fixture reset, and emergency cleanup can still use
an administrator connection out of band.
Optional PostgreSQL runtime tuning:
- `dblog.source.postgres.publication-ownership`
- `dblog.source.postgres.slot-ownership`
- `dblog.source.postgres.status-interval`
- `dblog.source.postgres.retry-log-connection-loss`
- `dblog.source.postgres.reconnect-backoff`
DBLog ships no typed TLS configuration. If you need TLS, append the standard pgJDBC
parameters (sslmode, sslrootcert, sslcert, sslkey) to both jdbc-url and
replication-jdbc-url directly — see docs/adapters/postgres.md.
For a healthy runtime startup, expect all of the following:
- the process starts with `dblog.boot-mode=runtime`,
- the configured state path is created or reused successfully,
- source preflight passes for the chosen adapter,
- the runtime logs show the application has started,
- if the control plane is enabled,
  - `GET /api/v1/runtime` becomes reachable,
  - `GET /api/v1/runtime/health` reports `UP`,
  - `GET /api/v1/runtime/status` shows a healthy source component and, if enabled, a healthy sink component.
Fail-fast startup boundary:
- the embedded H2 state-store schema is treated as fixed for the current repo build; DBLog does not perform in-place H2 schema migration or version negotiation.
For TABLE and ALL_TABLES dump requests, the current runtime captures the
table's maximum primary key before the first chunk of that table is coordinated.
Practical effect:
- later chunk reads stop at that captured upper bound,
- rows inserted above that bound are left to the live log path rather than being pulled into the in-flight full-table scan,
- the table scan finishes when no more rows remain inside that captured request-scoped primary-key frontier.
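The bounded scan described above can be sketched over an in-memory key list. This is an illustration under stated assumptions, not DBLog's code: the function name `scan_chunks` is hypothetical, the primary key is a single integer column, and the list comprehension stands in for a `WHERE pk > ? AND pk <= ? ORDER BY pk LIMIT ?` style query.

```python
from typing import Iterator, List


def scan_chunks(table_keys: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield primary-key chunks bounded by the max key captured up front.

    `table_keys` stands in for the live table and may grow while the scan
    runs. Hypothetical sketch; assumes a single integer primary key.
    """
    if not table_keys:
        return
    upper_bound = max(table_keys)  # captured once, before the first chunk
    last_seen = None
    while True:
        # Keyset pagination: strictly after the last key read, never past
        # the captured upper bound.
        chunk = sorted(
            k for k in table_keys
            if (last_seen is None or k > last_seen) and k <= upper_bound
        )[:chunk_size]
        if not chunk:
            return  # nothing left inside the captured frontier
        yield chunk
        last_seen = chunk[-1]
```

With keys 1 through 5 and a chunk size of 2, a row with key 6 inserted after the first chunk has been read is never returned by the scan; in the real runtime that row would arrive through the live log path instead.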
The DBLog paper names chunk-selection throttling as a production requirement for large-table dumps against hot sources. This implementation exposes two levers for that today and does not ship a runtime rate limiter:
- `chunkSize`: rows per chunk SELECT. Lower values reduce per-SELECT load on the source primary-key index; higher values reduce coordination overhead. Tune per source, per table family.
- Process-level start / stop. Killing the DBLog process is safe; chunk-level progress is persisted at batch boundaries, so a restart resumes `ACTIVE` requests from the last completed chunk without source-side re-reading (see `docs/CONTROL_PLANE.md` §8).
Not present (deliberate):
- `chunksPerSecond` / `rowsPerSecond` config-level rate limiter,
- adaptive throttling driven by source CPU / IO feedback,
- inter-chunk delay knob,
- operator pause / resume / cancel endpoints on running requests.
If you need sustained throttling against a hot source, combine a smaller
chunkSize with process restarts (stop the DBLog process during peak hours;
start it again off-peak). A future configuration-level rate limiter would be a
design decision beyond the current scope of this implementation.
Both live-runtime adapters buffer the full committed transaction in heap before
handing it to the sink in a single sink.appendEvents call. The PostgreSQL
pgoutput session accumulates events in PostgresTransactionStreamingSession's
internal transaction buffer until the COMMIT message arrives; the MySQL
binlog session does the same between BEGIN / Gtid and Xid / COMMIT.
This keeps commit-boundary semantics intact at the sink (downstream target apply sees whole transactions, never partial) but caps scalability at largest-single-committed-tx × retained-heap-per-event.
Empirical measurements against the PostgreSQL adapter on a typical four-column
(id BIGINT, name TEXT, status TEXT, updated_at TIMESTAMPTZ) schema:
| single-tx rows | observed peak heap | converge time |
|---|---|---|
| 50 000 | ~430 MB | 3 s |
| 200 000 | ~855 MB | 6 s |
| 1 000 000 | ~4.0 GB | 45 s |
That works out to roughly 4 KB of retained heap per in-flight ChangeEvent
under G1 with default settings. Wider rows (long strings, large blobs) push
this up; narrower rows bring it down. Treat 4 KB/event as an order-of-magnitude
planning figure, not a guarantee.
Size -Xmx so the planned largest-committed-tx can fit with a safety margin
for GC young-generation headroom:
-Xmx >= largest_expected_tx_events * 4 KB * 1.5
For a workload where the largest single transaction is 500 000 rows, that's about 3 GB. For 1 M rows, about 6 GB.
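The sizing rule is simple enough to script. The sketch below is a planning aid only: the function names are hypothetical, 4 KB is taken as 4,000 bytes to match the round figures above, and the per-event cost is the order-of-magnitude estimate from the measurements, not a guarantee.

```python
def recommended_xmx_bytes(
    largest_tx_events: int,
    bytes_per_event: int = 4_000,   # planning figure from the table above
    safety_factor: float = 1.5,     # headroom for GC young generation
) -> int:
    """Apply -Xmx >= largest_expected_tx_events * 4 KB * 1.5."""
    return int(largest_tx_events * bytes_per_event * safety_factor)


def xmx_flag(largest_tx_events: int) -> str:
    """Format the result as a JVM flag, rounded up to whole MiB."""
    mib = 1024 * 1024
    needed = recommended_xmx_bytes(largest_tx_events)
    return f"-Xmx{-(-needed // mib)}m"  # integer ceiling division
```

For example, `recommended_xmx_bytes(500_000)` gives 3,000,000,000 bytes (about 3 GB) and `recommended_xmx_bytes(1_000_000)` gives 6,000,000,000 bytes (about 6 GB). Pass a larger `bytes_per_event` for wide rows.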
Today there is no fail-closed configurable cap. A transaction large enough to
exhaust the configured heap produces a generic
java.lang.OutOfMemoryError: Java heap space inside the pgoutput decode or
binlog decode path, the JVM logs the stack, and the runtime exits. Durable
state persisted before the oversized transaction remains intact (checkpoint,
dump progress, request status), so a restart on a larger heap — or against a
source whose oversized workload has completed — resumes cleanly.
If your workload includes occasional very large transactions (bulk inserts,
DELETE FROM ..., bulk UPDATE), either size -Xmx for the worst case or
route those through application-level chunking at the source.
This implementation is designed as a single-process CDC runtime. It is not a clustered, multi-instance, leader-elected system, and it does not ship a lease, fence token, or takeover protocol. The DBLog paper describes an active-passive deployment coordinated by Zookeeper; this implementation intentionally stays within a smaller single-process scope.
What the code does provide:
- `SourceOwnershipRepository` records a single owning `sourceId` in the local state store on first claim. A second process started with a different `sourceId` pointing at the same state store crashes at startup with `IllegalStateException`.
- The default H2-backed state store file is held with an exclusive file lock, so two processes on the same host cannot open the same state store concurrently. A second process fails to start rather than silently corrupting state.
What the code does not provide:
- No liveness detection. If the owning process dies without shutdown, the owning-source-id row sits forever. A replacement process with the same `sourceId` can restart and resume, but there is no automated takeover.
- No lease / fence-token protection. If two processes somehow both connect to the same upstream (e.g. a DBA-induced state-store copy, or the H2 file lock is defeated), each will independently read the source replication stream and emit events. Downstream sinks that are not idempotent will see duplicates.
- No passive-standby process. Any standby you run is a fresh cold start, not a hot-failover replica.
For production-like deployments that need resilience, pair DBLog with an external single-instance supervisor. Representative choices:
- Kubernetes: a `Deployment` with `replicas: 1` and a `RollingUpdate` strategy (or a `StatefulSet` if you want stable network identity plus a PersistentVolume for the H2 file). Let the pod controller handle restart. If you need hot failover, front it with a k8s lease-based leader-elector sidecar that holds a lease external to DBLog and only starts the DBLog process while it holds the lease.
- systemd: a `Restart=always` unit with `StartLimitBurst=`. Pair with `RuntimeDirectory=` for the state store.
- Nomad: a single-count job with `restart { attempts = ... }`.
- Docker Swarm / ECS: single-replica service with a restart policy.
If the supervisor needs to discover the OS-assigned control-plane port (e.g. when
dblog.control-plane.port=0), use dblog.control-plane.port-file — see §5.1.
Two instances accidentally pointed at the same source with the same
sourceId produce duplicate events on the sink. The JDBC apply sink is
idempotent via upsert-on-primary-key and will absorb duplicates at the cost of
redundant writes; the NDJSON sink is not idempotent and will emit both
copies to the output file. If your pipeline downstream of DBLog is not
idempotent, treat "single live instance per sourceId" as a hard operational
invariant enforced by the supervisor.
Adding real HA to DBLog would require, at minimum: a lease table with TTL and monotonic fence tokens, fence-token checks on every state-store write, a renewal thread, a graceful release on shutdown, and tests for expiry, takeover, and stale-fence rejection. This is a meaningful project, not a small patch. If it lands, it will be behind a configuration flag and the single-process posture above will remain the default.
Current PostgreSQL runtime requirements:
- captured tables must use `REPLICA IDENTITY FULL`
- the configured publication/slot contract must be valid for the captured set
- the metadata tables must be writable and visible on the same logical stream
- the runtime role must satisfy the privilege and ownership contract documented in `docs/adapters/postgres.md`
Slot lifecycle is an operator responsibility. DBLog creates the logical
replication slot on first run, reuses it across restarts, and never drops it
— matching Debezium's default posture. A decommissioned DBLog process leaves
its slot on the source, inactive but still pinning WAL; without cleanup the server accumulates WAL
indefinitely and will eventually exhaust disk. The max_slot_wal_keep_size
server setting (PG13+) is the recommended safety net. For cleanup procedures,
health monitoring queries, and the relationship between slot health and the
runtime's slot-feedback WARN log, see the slot lifecycle section in
docs/adapters/postgres.md.
Current MySQL runtime requirements:
- binary logging enabled
- `binlog_format=ROW`
- `binlog_row_image=FULL`
- one actual MySQL database per runtime request
- a unique replication `serverId`
Enable NDJSON to stdout:
dblog.sink.ndjson.stdout=true

Enable NDJSON to file:

dblog.sink.ndjson.path=/path/to/events.ndjson

The NDJSON sink is intended for debugging, CDC stream inspection, and local demos, not production event delivery. Its durability guarantees are deliberately thin, and operators who treat the output file as an audit log will be surprised.
What DBLog does on write:
- Appends each encoded event followed by `\n` into a `BufferedWriter`.
- Calls `BufferedWriter.flush()` at the end of every batch. This pushes bytes from the JVM buffer into the operating-system buffer; it does not call `fsync` on the file descriptor.
What this means in practice:
- JVM crash / kill -9 mid-batch: any events already written in the current batch that have not yet been flushed are lost. Events from prior batches that were flushed are held by the OS buffer but may still not be on stable storage.
- Kernel crash / power loss: any OS-buffered events are lost, even if the JVM flushed them. DBLog never fsyncs.
- Process restart: the sink opens the target file with `StandardOpenOption.APPEND`. Prior content is preserved; new events are appended after it. A restart after a crash therefore replays any events that were not acknowledged at the source, producing duplicate lines for already-persisted events. This is the documented at-least-once behavior; downstream consumers are responsible for deduplication if needed.
- Long-running processes: the file grows without bound. There is no rotation, size cap, or retention policy. Operators should pipe through `logrotate` externally or point the sink at a path they rotate.
- Disk full: writes begin to throw `IOException`, which propagates as a runtime failure. There is no back-pressure to the source.
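The flush-versus-fsync distinction behind the first two points is easy to see in a small Python analogue. This is not the Java sink; `append_batch` and its `durable` switch are hypothetical names used only to contrast the two calls. Per the description above, the shipped sink corresponds to the `durable=False` path.

```python
import os
from typing import Iterable


def append_batch(path: str, lines: Iterable[str], durable: bool = False) -> None:
    """Append one batch of NDJSON lines to a file.

    Hypothetical sketch. flush() only moves bytes from the process buffer
    to the operating system; os.fsync() asks the OS to push them to
    stable storage.
    """
    with open(path, "a", encoding="utf-8") as handle:  # append mode
        for line in lines:
            handle.write(line + "\n")
        handle.flush()  # survives a process crash, not a power loss
        if durable:
            os.fsync(handle.fileno())  # the step the NDJSON sink skips
```

Calling it twice on the same path appends the second batch after the first, which is also why a replay after restart shows up as duplicate lines rather than overwriting earlier output.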
The NDJSON sink also does not provide:
- atomic file replacement (no write-to-temp + rename on success),
- idempotency (replays after recovery produce duplicate lines for any observer who reads the file),
- ordering guarantees across sink restarts (events from the replay window overlap with events from before the crash),
- schema-change tracking in the output header.
For production workloads use the JDBC apply sink (see §4.3), which provides idempotent upsert semantics, or the typed H2 inspection sink (see §4.2) for structured local inspection with proper relational semantics.
Enable the strict typed H2 inspection sink:
dblog.sink.typed-h2.path=/path/to/typed-events

Enable JDBC target apply:
dblog.target.enabled=true
dblog.target.dialect=POSTGRES|MYSQL
dblog.target.jdbc-url=jdbc:...
dblog.target.username=...
dblog.target.password=...
dblog.target.maximum-pool-size=4

Optional target tuning:
- `dblog.target.connection-timeout`
- `dblog.target.retry-backoff`
- `dblog.target.table-mappings[]`
Current target dialects:
- PostgreSQL
- MySQL
DBLog does not currently claim support for arbitrary JDBC databases.
- target tables must already exist,
- target table names/namespaces must match incoming event identity unless explicit table mappings are configured,
- target apply supports single-column and composite primary keys,
- the sink uses the ordered primary-key columns from DBLog's neutral model.
Current operation mapping is intentionally modest:
- `INSERT` and `UPDATE` become target upsert operations,
- `DELETE` becomes a target delete keyed only by the ordered primary-key values,
- internal watermark/heartbeat events are not target-applied as user-table work.
For upsert:
- primary-key columns are always taken from `event.primaryKey`,
- non-primary-key columns come from `event.afterRow`,
- omitted non-primary-key columns are omitted from generated SQL,
- on insert, omitted columns therefore fall back to target defaults / `NULL`,
- on conflict update, omitted columns are left unchanged.
For delete:
- delete uses only the ordered primary-key values,
- deleting an absent target row is treated as a successful no-op.
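The upsert and delete rules above can be modelled with a dictionary standing in for the target table. This is a sketch of the semantics only, with hypothetical function names; the real sink generates SQL against PostgreSQL or MySQL, and a missing dictionary key here plays the role of `NULL`.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, object]
Table = Dict[Tuple[object, ...], Row]


def apply_upsert(
    table: Table,
    primary_key: Row,
    after_row: Row,
    defaults: Optional[Row] = None,
) -> None:
    """Insert or update one row, keyed by the ordered primary-key values."""
    key = tuple(primary_key.values())
    # Primary-key values always come from primary_key, never from after_row.
    non_pk = {c: v for c, v in after_row.items() if c not in primary_key}
    if key not in table:
        row = dict(defaults or {})  # omitted columns fall back to defaults
        row.update(primary_key)
        row.update(non_pk)
        table[key] = row
    else:
        table[key].update(non_pk)   # omitted columns are left unchanged


def apply_delete(table: Table, primary_key: Row) -> None:
    """Delete by primary key; deleting an absent row is a no-op."""
    table.pop(tuple(primary_key.values()), None)
```

Applying the same upsert or the same delete twice leaves the table in the same state as applying it once, which is the narrow replay shape the sink is designed to tolerate.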
One appendEvents(List<ChangeEvent>) call is applied inside one target JDBC
transaction:
- auto-commit is disabled for that apply transaction,
- every applicable event in the batch is translated to target SQL,
- if all statements succeed, the batch is committed once,
- if any statement fails, the whole target batch is rolled back.
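The all-or-nothing batch behaviour can be sketched with Python's built-in `sqlite3` module. SQLite is only a stand-in for the target database here, and `apply_batch` is a hypothetical name; the shipped sink does the equivalent over JDBC.

```python
import sqlite3
from typing import Iterable, Sequence, Tuple

Statement = Tuple[str, Sequence[object]]


def apply_batch(conn: sqlite3.Connection, statements: Iterable[Statement]) -> bool:
    """Run every statement of one batch inside a single transaction.

    Returns True if the batch committed. On any database error the whole
    batch is rolled back and False is returned; a caller would then not
    advance the source checkpoint past this batch.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for sql, params in statements:
                conn.execute(sql, params)
        return True
    except sqlite3.Error:
        return False
```

If the second statement of a two-statement batch fails, the first statement's row is not visible afterwards, so a later replay of the same batch starts from a clean slate.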
The current delivery model remains at-least-once.
The sink is designed to tolerate replay in the specific shape that DBLog currently produces:
- unacknowledged source work may be replayed after restart or recovery,
- replay of the same upserted row state converges because apply uses upsert,
- replay of an already-applied delete converges because delete is a no-op when the row is absent.
This is not a generic claim that arbitrary out-of-order stale event replay is always safe.
If target apply fails before source acknowledgement:
- the target batch is rolled back,
- the source checkpoint is not durably advanced through that failed batch,
- DBLog may replay the same source events later.
Transient target availability failures are retried indefinitely with bounded per-attempt connection timeout plus configured backoff. Target contract breaches such as auth failure, privilege failure, or missing target objects fail closed instead of retrying forever. If a JDBC batch reports both target contract evidence and later availability-looking follow-on errors, DBLog treats the batch as a hard target contract failure.
When source and target engines differ, source TableId shapes may not map
directly to target-side schemas. MySQL events carry the database name in the
schemaName position (MySQL has no analog of Postgres schemas), so the
default identity resolver sends a MySQL event from database app to target
"app"."<table>". This works only if the Postgres target happens to have a
schema literally named after the MySQL database.
For cross-engine flows, configure explicit table mappings:
# Example: MySQL db=app table=orders → Postgres schema=public table=orders
dblog.target.table-mappings[0].source-schema=app
dblog.target.table-mappings[0].source-table=orders
dblog.target.table-mappings[0].target-schema=public
dblog.target.table-mappings[0].target-table=orders

If a source table does not resolve to an existing target table, DBLog fails
closed at preflight with a missingTargetTable error. The error message names
the expected target identity and hints at this dblog.target.table-mappings
configuration.
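A minimal model of that resolution and preflight check, assuming a table identity is just a `(schema, table)` pair: the names `resolve_target` and `missing_targets` are hypothetical and the real resolver may consider more than these two fields.

```python
from typing import Dict, Iterable, List, Set, Tuple

TableId = Tuple[str, str]  # (schema or database, table)


def resolve_target(source: TableId, mappings: Dict[TableId, TableId]) -> TableId:
    """Use an explicit mapping if present, else keep the source identity."""
    return mappings.get(source, source)


def missing_targets(
    sources: Iterable[TableId],
    mappings: Dict[TableId, TableId],
    existing: Set[TableId],
) -> List[TableId]:
    """Return the resolved target identities that do not exist.

    A non-empty result corresponds to the fail-closed preflight case.
    """
    resolved = (resolve_target(source, mappings) for source in sources)
    return [target for target in resolved if target not in existing]
```

With the mapping from the example above, MySQL `app.orders` resolves to `public.orders`, while an unmapped `app.users` still resolves to `app.users` and is reported as missing unless the target really has an `app` schema.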
The JDBC-apply sink caches target-table metadata, compiled statement plans,
and prepared-statement factories keyed by source TableId and statement
shape. The caches grow with table_count × distinct_column_shape_count × statement_kinds and do not evict. For typical replication flows (a
fixed set of tables with stable column shapes) the caches stay small. For
sources with many tables or highly heterogeneous column shapes across events,
the caches grow without bound — operators who anticipate high cardinality
should budget heap accordingly. This implementation does not bound
these caches.
If you intentionally want DBLog to discard emitted events, configure the explicit no-op sink:
dblog.sink.noop.enabled=true

This is the only discard path. If no sink is configured, runtime and
startup-check now fail fast instead of silently dropping data.
dblog.control-plane.enabled=true
dblog.control-plane.host=127.0.0.1
dblog.control-plane.port=8085

Use the loopback host for host-run deployments. The packaged Docker example is
the only different bind mechanic in the shipped assets: DBLog listens on
0.0.0.0 inside the container, but ops/docker/compose.runtime.yml publishes
the host port on 127.0.0.1:8085 only so operator access remains local.
For supervisor integration (k8s sidecars, systemd units, integration tests) where
the operator wants the OS to assign a free port, set dblog.control-plane.port=0
and point dblog.control-plane.port-file at a path the supervisor can read:
dblog.control-plane.port=0
dblog.control-plane.port-file=/var/run/dblog/port

The bound port is written to that file once the HTTP server is up. The write is atomic (temp file + rename), so a concurrent reader cannot observe a partial value. The file is removed on graceful shutdown. This pairs naturally with the single-instance supervisor patterns in §2.5.
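On the supervisor side, discovering the port amounts to polling for the file. The sketch below assumes the file holds the port as plain decimal text, which this runbook does not spell out, so check the actual contents before relying on it. `wait_for_port` is a hypothetical helper name.

```python
import time
from pathlib import Path


def wait_for_port(port_file: str, timeout_s: float = 30.0, poll_s: float = 0.1) -> int:
    """Poll until the port file appears, then return the port number.

    Assumes the file contains the port as decimal text (an assumption,
    not a documented format). Raises TimeoutError if it never appears.
    """
    path = Path(port_file)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            text = path.read_text(encoding="ascii").strip()
            if text:
                return int(text)
        except FileNotFoundError:
            pass  # server not up yet; keep polling
        time.sleep(poll_s)
    raise TimeoutError(f"no port file at {port_file} after {timeout_s}s")
```

Because the write is atomic, the reader does not need to retry on a partially written value; it only has to wait for the file to exist.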
The shipped control plane has no built-in authentication or authorization. This is intentional for a study-friendly implementation and relies on an operator-local access model:
- The default bind is `127.0.0.1`, which scopes reachability to the host.
- The packaged Docker compose publishes `127.0.0.1:8085` on the host only.
- Anyone with host access (local shells, co-located processes, compromised sidecars) can submit dump requests and read all runtime state without a credential.
Any deployment that binds beyond loopback is the operator's responsibility to front with a reverse proxy (e.g. nginx, Caddy, Envoy, a cloud load balancer) that enforces:
- TLS termination,
- authentication (bearer token, mTLS, OIDC — whatever your org already operates),
- optional authorization policy per route (e.g. read-only `GET` access for monitoring systems, admin access for request mutation endpoints).
Do not expose the control plane directly on an untrusted network. The runtime writes control-plane state (dump request lifecycle, schema drift signals) to the local state store, and a malicious caller with unfettered access can trivially submit arbitrary dump requests or induce a full re-dump via the primary-key-drift signal path.
If a future release adds in-process auth, it will be opt-in via a property and
the loopback bind will remain the default posture. Until then, the shipped
control plane is designed for local operator tools (curl on the same host,
the shipped demo scripts) and nothing further.
Useful optional properties:
- `dblog.control-plane.executor-max-threads`
- `dblog.control-plane.executor-queue-capacity`
- `dblog.control-plane.max-request-body-bytes`
Current important endpoints:
- `GET /api/v1/runtime`
- `GET /api/v1/runtime/health`
- `GET /api/v1/runtime/status`
- `GET /api/v1/runtime/schemas`
- `GET /api/v1/runtime/schema-issues`
- `GET /api/v1/requests`
- `POST /api/v1/requests`
- `GET /api/v1/requests/{requestId}`
- `GET /api/v1/metrics`
- `GET /api/v1/events/summary`
Current backpressure visibility:
- MySQL now reads source-log events through a bounded in-process queue and pauses source fetching when that queue is full instead of failing closed or dropping events,
- PostgreSQL remains direct-poll rather than prefetch-queue based, so sink pressure slows later source reads without an intermediate queue inside DBLog,
- `runtime`, `runtime/status`, and `metrics` now expose source flow-control state such as queue depth/capacity, whether source fetching is paused, and whether the current pressure diagnosis looks like sink unavailability or sink slowdown under load.
Never enable the tap in production. The tap deliberately blocks the DBLog pump thread whenever a subscriber cannot keep up — that is the feature's entire purpose and the reason it exists only as a teaching artefact. A slow subscriber will stall CDC for the whole runtime.
dblog.tap.enabled=false
# dblog.tap.queue-capacity=65536
# dblog.tap.standby-threshold-ms=1000
# dblog.tap.heartbeat-interval=2s

When enabled, the tap mounts GET /api/v1/tap/stream (chunked NDJSON,
one subscriber at a time) on the control plane. The TUI reads events
from this endpoint; the underlying TCP flow control is what stalls the
pump when the subscriber can't keep up (step-mode). The tap route is
only reachable when the control plane itself is enabled.
The startup WARN means only that the educational tap is enabled. Actual tap-induced pump blocking is surfaced separately:
- server logs emit one `DBLog tap queue is full ...` WARN per tap queue lifetime when the queue first fills and the pump blocks,
- the tap stream emits `stream.standby` after `dblog.tap.standby-threshold-ms`, then `stream.resumed` after the HTTP writer catches up,
- tap `stream.heartbeat` events include `queue_depth` and `queue_capacity` for the tap queue.
Do not use /api/v1/runtime/status sourceFlowControl.queueDepth as tap
queue health. That field describes source-log flow control, not the tap
HTTP stream queue.
HTTP request submission currently requires:
- the process to report request submission as available,
- an H2 driver on the runtime path,
- one resolved state path from:
  - `dblog.runtime.state-path`
  - `dblog.scenario.state-path`
Operational response expectations:
- unknown request ids return `404 not_found`,
- invalid `PRIMARY_KEYS` literal content that is rejected during submission-time schema-aware canonicalization returns `400 bad_request`,
- invalid `PRIMARY_KEYS` literal content that is only discovered later during runtime binding marks that accepted request `FAILED`; the DBLog runtime itself continues operating,
- `TABLE` or `PRIMARY_KEYS` requests that target a table outside the current captured-schema set are marked `FAILED`; the DBLog runtime itself continues operating,
- missing state-path / embedded state-store wiring returns service unavailable.
Current request lifecycle summary:
- new submissions start as `QUEUED`,
- the runtime transitions work to `ACTIVE`,
- completed work lands in terminal `COMPLETED`,
- work that cannot be completed (schema drift, missing table, unsupported PK type) lands in terminal `FAILED`.
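Read as a state machine, the lifecycle looks roughly like the sketch below. The enum and helper names are hypothetical, and the `QUEUED` to `FAILED` edge is inferred from the invalidation behaviour this file describes for stale queued requests rather than from a documented transition table.

```python
from enum import Enum


class RequestStatus(Enum):
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Allowed next states; terminal states have none.
ALLOWED = {
    RequestStatus.QUEUED: {RequestStatus.ACTIVE, RequestStatus.FAILED},
    RequestStatus.ACTIVE: {RequestStatus.COMPLETED, RequestStatus.FAILED},
    RequestStatus.COMPLETED: set(),
    RequestStatus.FAILED: set(),
}


def can_transition(current: RequestStatus, target: RequestStatus) -> bool:
    """True if the sketch's lifecycle permits current -> target."""
    return target in ALLOWED[current]


def is_terminal(status: RequestStatus) -> bool:
    """COMPLETED and FAILED never move again."""
    return not ALLOWED[status]
```

Note that there is no paused or cancelled state in this model, matching the absence of pause / resume / cancel endpoints.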
The control plane does not ship operator pause / resume / cancel endpoints.
To pause ingest, stop the DBLog process; to resume, start it again. The
embedded state store persists chunk-level progress at batch boundaries, so a
restart resumes ACTIVE requests from the last completed chunk without
re-reading from the source. COMPLETED and FAILED requests are unaffected
by restart. See CONTROL_PLANE.md §8 for the detailed
kill-safety contract.
At a glance:
- `GET /api/v1/runtime/health` should report `UP`
- `GET /api/v1/runtime/status` should normally show healthy source and sink components
- recent logs and metrics should continue moving forward
DBLog does not implement online schema evolution as a supported feature. The runtime treats schema as a selected-column contract:
- startup inspects each captured source table and stores the observed schema,
- the first successful startup stores the contract schema,
- later startups reconcile the live schema against the stored contract,
- dump and repair chunks must keep the same selected-column fingerprint,
- live stream adapters decode by the selected contract where the source protocol gives enough metadata,
- when continuity is uncertain, DBLog records a signal and fails closed rather than widening the emitted row shape silently.
The selected contract is the set of columns DBLog emits and fingerprints. Columns outside that surface are ignored; they do not appear in emitted rows and do not change the selected-column fingerprint. An extra supported column added after the contract exists is therefore treated like an ignored column if DBLog encounters it at a safe inspection boundary.
Startup and restart reconciliation:
| Source shape at startup | Result |
|---|---|
| no stored contract | live schema becomes contract |
| extra non-PK column | accepted as ignored |
| selected column removed | fail closed |
| selected type/source/nullability drift | fail closed |
| primary-key shape drift | fail closed |
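The table above reads naturally as a decision function. The sketch below is a simplification under stated assumptions: `Schema` and `reconcile` are hypothetical names, and a single string per column stands in for the type/source/nullability descriptor that the real fingerprint covers.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Schema:
    primary_key: Tuple[str, ...]   # ordered primary-key column names
    columns: Dict[str, str]        # selected column name -> descriptor


def reconcile(stored: Optional[Schema], live: Schema) -> str:
    """Return the startup outcome for a stored contract and a live schema."""
    if stored is None:
        return "live schema becomes contract"
    if stored.primary_key != live.primary_key:
        return "fail closed"       # primary-key shape drift
    for name, descriptor in stored.columns.items():
        if name not in live.columns:
            return "fail closed"   # selected column removed
        if live.columns[name] != descriptor:
            return "fail closed"   # selected type/source/nullability drift
    return "accepted"              # extra live columns are ignored
```

Only the stored contract's columns are compared, so a live table that gained an extra non-PK column still reconciles as accepted.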
Live stream behavior differs by adapter because the source protocols expose different metadata:
| Live condition | MySQL | PostgreSQL |
|---|---|---|
| captured-table `ADD COLUMN` | fail closed | may decode if unselected |
| extra row metadata column | ignored by contract | ignored by contract |
| missing selected column | fail closed | fail closed |
| selected type drift | fail via DDL path | not an OID guard |
| `TRUNCATE` | fail closed | fail closed |
| live PK update | fail closed | fail closed |
| key-only old tuple | n/a | fail closed |
| replica identity drift | n/a | fail closed |
Important details for code reviewers and AI agents:
- MySQL `Query` events expose raw SQL plus a default database, not a structured target table id. DBLog therefore fails closed on row-state- or schema-affecting DDL in a captured database, including additive `ADD COLUMN`.
- The lower MySQL row decoder can ignore extra `TABLE_MAP` columns when no DDL query has stopped the stream first, but online MySQL DDL is still unsupported.
- PostgreSQL `pgoutput` relation messages are structured. The live decoder maps tuple values by selected column name, so added unselected relation columns can be tolerated mechanically.
- PostgreSQL relation messages carry type OIDs, but the live runtime does not use those OIDs as a selected-column type-drift detector. Startup/restart schema inspection is the supported point that catches selected type/source/nullability drift.
- A tolerated adapter mechanism is not a promise that operators can perform online schema evolution. Stop DBLog, make the source change, verify startup reconciliation, and submit a fresh full dump when the selected contract changed or correctness is uncertain.
Useful code and test anchors for review:
- `core/schema/SchemaPolicyEngine.java`
- `runtime/bootstrap/RelationalRuntimeBootstrap.java`
- `adapter/mysql/internal/MySqlBinlogSession.java`
- `adapter/postgres/internal/PostgresTransactionStreamingSession.java`
- `adapter/mysql/internal/MySqlBinlogSessionTests.java`
- `adapter/postgres/internal/PostgresTransactionStreamingSessionTests.java`
GET /api/v1/runtime/schemas reports per-table schemaStatus values:
| Status | Meaning |
|---|---|
| `OK` | no persisted schema signal for that table |
| `SCHEMA_UNCERTAIN` | schema/metadata continuity is suspect |
| `FULL_DUMP_REQUIRED` | operator action and fresh dump required |
GET /api/v1/runtime/schema-issues returns the raw signal lists:
- `fullDumpRequiredSignals`
- `schemaUncertaintySignals`
FULL_DUMP_REQUIRED wins over SCHEMA_UNCERTAIN for the per-table status.
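That precedence rule can be written as a small function. The sketch assumes each signal list has already been reduced to the set of affected table names; the actual signal payload shape is not described here, and `schema_status` is a hypothetical name.

```python
from typing import Iterable


def schema_status(
    table: str,
    full_dump_required_tables: Iterable[str],
    schema_uncertain_tables: Iterable[str],
) -> str:
    """Derive the per-table status; FULL_DUMP_REQUIRED takes precedence."""
    if table in set(full_dump_required_tables):
        return "FULL_DUMP_REQUIRED"
    if table in set(schema_uncertain_tables):
        return "SCHEMA_UNCERTAIN"
    return "OK"
```

A table that appears in both lists therefore reports `FULL_DUMP_REQUIRED`, and a table in neither reports `OK`.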
If DBLog reports FULL_DUMP_REQUIRED:
- inspect the reported reason
- fix the underlying source issue if needed
- submit a fresh `ALL_TABLES` dump when the source is trustworthy again
Current invalidation behavior:
- PK-drift / full-dump-required invalidation retires stale queued and active dump requests as `FAILED`,
- DBLog does not silently put old `TABLE` / `ALL_TABLES` requests back into the queue after that invalidation boundary,
- full-dump-required signals for that invalidation stay keyed under the configured runtime `sourceId`,
- operator recovery should use a fresh submitted request rather than resuming a retired pre-drift request.
Typical triggers:
- purged upstream history
- unresolved schema continuity after DDL
- primary-key drift that invalidated in-flight progress
- live primary-key update on a captured table
- PostgreSQL key-only or missing old tuple for a captured-table update/delete
MySQL-specific limitation: the live binlog path does not parse DDL. MySQL Query events carry a default database and raw SQL text, so row-state- or schema-affecting DDL in the captured database can fail closed even when it targets an uncaptured table. Treat that as a conservative false positive: verify the source and target state, then use the same fresh full-dump recovery path.
If DBLog reports schema uncertainty without a full-dump-required state:
- monitor the runtime and logs
- verify whether the adapter can recover once trustworthy schema evidence appears
- treat repeated or escalating uncertainty as a candidate for a fresh full dump
Typical triggers include malformed DBLog metadata rows, unexpected singleton metadata behavior, or another DBLog run writing heartbeats on the same source stream after this runtime has confirmed its own heartbeat.
Current canonical local demos:
- scripts/demo/mysql_to_ndjson.py
- scripts/demo/mysql_to_postgres.py
- scripts/demo/postgres_to_mysql.py
Packaged Docker example:
- ops/docker/README.md
- ops/docker/compose.runtime.yml
- ops/docker/examples/mysql-to-postgres/application.properties
These are the best starting points if you want a real property set instead of
inventing one from scratch. The packaged example still keeps the operator-facing
control plane local by publishing 127.0.0.1:8085 on the host. When you
recreate the fixture databases from scratch, also clear the bind-mounted example
runtime state under ops/docker/example-state/mysql-to-postgres/ before the
next packaged start so DBLog does not resume from a stale example checkpoint
against a fresh source history.
The canonical demos and local host-run examples enable source reconnect retry
for transient availability failures. Contract and correctness failures, such as
unsupported DDL or incompatible schema changes, still fail closed by design.
For unsupported live DDL, restart alone is not recovery: the same persisted
state/checkpoint can replay the DDL before a fresh dump can be submitted. When
the state store is available, destructive live DDL, truncate, selected-column
source metadata drift, selected-column row tuple drift, and live primary-key
update paths record a full-dump-required signal before failing closed. Verify
or rebootstrap the target, clear or replace the affected runtime state files,
then restart and submit a fresh ALL_TABLES dump. Remember that
dblog.runtime.state-path is an H2 file prefix, not a directory: remove
<state-path>.mv.db and any <state-path>.trace.db / <state-path>.lock.db
files, or use a new state path.
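Because the state path is a file prefix, a reset helper has to match the exact H2 file names rather than delete a directory. The following is a minimal sketch under that assumption. The helper names are hypothetical and not part of this repository, and it should only be run while DBLog is stopped.

```python
from pathlib import Path
from typing import List

# File suffixes named in this runbook for the H2 state store.
H2_SUFFIXES = (".mv.db", ".trace.db", ".lock.db")


def h2_state_files(state_path: str) -> List[Path]:
    """Return the existing H2 files that belong to the given file prefix."""
    prefix = Path(state_path)
    candidates = [prefix.with_name(prefix.name + suffix) for suffix in H2_SUFFIXES]
    return [p for p in candidates if p.is_file()]


def reset_state(state_path: str, dry_run: bool = True) -> List[Path]:
    """Delete the H2 files for state_path; with dry_run=True only report them."""
    files = h2_state_files(state_path)
    if not dry_run:
        for path in files:
            path.unlink()
    return files
```

The dry-run default lets an operator review the matched files first, for example `reset_state("/path/to/state")`, before repeating the call with `dry_run=False`.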
For the packaged container path, do not stop at runtime/status: use the
packaged proof flow in ops/docker/README.md.
It seeds MySQL, starts the packaged runtime, submits ALL_TABLES, verifies
PostgreSQL convergence, applies a later live MySQL change, and verifies
PostgreSQL converges again.
Use these as the practical "it is working" signals:
| Example | Startup success | Replication success |
|---|---|---|
| mysql_to_ndjson.py | runtime starts and writes logs cleanly | NDJSON file receives live events |
| mysql_to_postgres.py | runtime starts and control plane is reachable | initial ALL_TABLES dump converges, then later live MySQL changes converge into PostgreSQL |
| postgres_to_mysql.py | runtime starts and control plane is reachable | initial ALL_TABLES dump converges, then later live PostgreSQL changes converge into MySQL |
| Packaged Docker example | container starts and control plane is reachable on 127.0.0.1:8085 | initial ALL_TABLES dump converges, then later live MySQL changes converge into PostgreSQL |
Good general success signals:
- request submission is available when expected,
- queued requests drain,
- no schema-issue escalation appears unless intentionally provoked,
- recent logs and metrics continue to move forward,
- source flow-control remains healthy, or if source fetching is paused because a bounded queue is full, the operator-facing diagnostics clearly explain whether the sink is unavailable or simply slower than the current inbound source pressure,
- target-side final row state matches the source-side intended outcome.
For first-contact validation, start with one local demo or one
startup-check run.
Quick live proof (fastest way to see data flowing end-to-end):
# macOS/Linux
python3 scripts/demo/mysql_to_postgres.py
# Windows (PowerShell / CMD)
py -3 scripts/demo/mysql_to_postgres.py

Other canonical demos (see §7):
# macOS/Linux
python3 scripts/demo/mysql_to_ndjson.py
python3 scripts/demo/postgres_to_mysql.py
# Windows (PowerShell / CMD)
py -3 scripts/demo/mysql_to_ndjson.py
py -3 scripts/demo/postgres_to_mysql.py

On macOS/Linux, the demo entrypoints are also executable, so
./scripts/demo/mysql_to_postgres.py and its siblings work too.
Short local verification:
check is the normal local developer verification task for this repo.
./gradlew check

Unit and in-process integration tests:
./gradlew test
./gradlew integrationTest

CI-style verification (includes the Docker-backed integration and e2e lanes):
./gradlew check
./gradlew integrationTestDocker
./gradlew e2eTestDocker

Cross-version support matrix:
./gradlew compatibilityMatrix

Current practical interpretation:
- startup preflight failures usually mean source prerequisites or credentials are wrong
- target apply retries usually indicate transient target unavailability
- target apply hard failures usually indicate schema/privilege/contract drift
- missing or malformed metadata tables are treated as contract boundaries, not soft warnings
- purged source history is a fail-closed condition rather than an automatic resnapshot path
The local H2 state store does not carry a schema version. If a DBLog binary
upgrade changes a state-store table shape, the new binary will attempt to open
the old store and may fail mid-boot with an H2 column-shape error. This
implementation does not auto-migrate. Operators upgrading across
binary versions should reset the local state files for
dblog.runtime.state-path before starting the new binary: remove
<state-path>.mv.db and any <state-path>.trace.db / <state-path>.lock.db
files, or use a new state path. Then re-submit any long-running dump requests.
This matches the study-friendly positioning of this implementation. For a
production deployment, run a versioned migration outside DBLog.
Source and target passwords (dblog.source.mysql.password, dblog.source.postgres.password,
dblog.target.password) are plain string properties bound by Spring's
@ConfigurationProperties. DBLog does not wrap them in a redacting
secret-value type. Two practical consequences:
- Do not enable DEBUG-level logging on Spring's property-binding packages (org.springframework.boot.context.properties.bind, org.springframework.core.env). At DEBUG, Spring may log bound property values including passwords. The default log level does not print them.
- The embedded Spring Boot actuator is on the classpath but is not reachable over HTTP in the default configuration (spring.main.web-application-type=none). If you ever enable a web application type or expose the actuator separately, configure management.endpoint.env.keys-to-sanitize to redact password properties. That property exists in Spring Boot 2.x; Spring Boot 3.x removed it and masks env values by default, governed by management.endpoint.env.show-values.
Heap/thread dumps will contain the password strings in plaintext regardless.
For deployments that carry real secrets, consider injecting credentials from
an external secrets store rather than pinning them in
application.properties.
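One common way to keep passwords out of application.properties is to pass them as environment variables, which Spring Boot maps onto property names through its relaxed binding rules (dots become underscores, dashes are removed, the name is upper-cased, and list indices are surrounded by underscores). The sketch below only computes those variable names. It assumes the DBLog process is started with Spring Boot's default environment property source active. The `build_env` helper and the secrets mapping are hypothetical.

```python
import re
from typing import Dict


def spring_env_name(prop: str) -> str:
    """Convert a Spring property name to its relaxed-binding env var name."""
    name = re.sub(r"\[(\d+)\]", r"_\1", prop)  # tables[0] -> tables_0
    name = name.replace(".", "_").replace("-", "")
    return name.upper()


def build_env(secrets: Dict[str, str]) -> Dict[str, str]:
    """Map {property name: secret value} to {env var name: secret value}.

    The caller is expected to fetch the values from an external secrets
    store and pass the result as the child process environment.
    """
    return {spring_env_name(prop): value for prop, value in secrets.items()}
```

For example, `spring_env_name("dblog.source.mysql.password")` yields `DBLOG_SOURCE_MYSQL_PASSWORD`. Note that environment variables are still visible to anyone who can inspect the process, so this narrows exposure rather than removing it.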
Use this table as the first-pass operator guide.
| Symptom | Likely cause | Where to look | Operator action |
|---|---|---|---|
| Startup fails before runtime begins | bad credentials, missing table, invalid source prerequisite, or selected-column schema drift against a stored contract | startup logs, implementation spec for the adapter; persisted schema issues if a state store was available | fix source config or source DB state; for contract drift, submit a fresh ALL_TABLES dump after the source is trustworthy |
| Control plane is enabled but request submission is unavailable | no active request-processing runtime or unresolved state path | GET /api/v1/runtime, request submission message | ensure the process is in the right boot mode and a valid runtime/scenario state path is configured |
| runtime/health is DOWN | runtime fail-closed boundary hit | GET /api/v1/runtime/health, recent logs | inspect the failure class/message, fix the underlying contract problem, then restart or rerun |
| schemaStatus=FULL_DUMP_REQUIRED | purged history, unresolved schema continuity, or primary-key drift | GET /api/v1/runtime/schemas, GET /api/v1/runtime/schema-issues, logs | fix the underlying issue, then submit a fresh ALL_TABLES dump |
| schemaStatus=SCHEMA_UNCERTAIN persists too long | adapter cannot obtain trustworthy schema or metadata evidence | runtime/schemas, runtime/schema-issues, recent logs | monitor first; if it does not clear, treat it as a candidate for a fresh full dump |
| Target apply keeps retrying | transient target outage or target DB not yet reachable | runtime/status, recent logs | restore target availability and wait for retry convergence |
| Target apply fails hard instead of retrying | target contract breach such as missing schema/table/column or PK mismatch | recent logs, target apply spec | align target schema/privileges/PK contract, then restart or retry |
| No requests drain after submission | runtime not processing requests or state-path mismatch | GET /api/v1/requests, runtime/status | confirm request submission path and active runtime state |
| Demo starts but no convergence happens | source not changing, dump not submitted, or target not aligned | demo log, control plane, target DB query | submit the expected request and verify target schema plus runtime health |
When in doubt:
- check runtime/health
- check runtime/status
- check runtime/schema-issues
- inspect recent logs
- trigger a fresh ALL_TABLES dump only after the underlying source/target contract is healthy again
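The first three checks in that order can be scripted against the control plane. This is a rough sketch under several assumptions: the health payload exposing a top-level `status` field is a guess, the full path `/api/v1/runtime/status` is inferred from the shorthand used in this document, and the function names are hypothetical. Only the schema-issue list names and the health path come from the text above. Log inspection stays a manual step.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

BASE = "/api/v1"  # prefix of the control-plane endpoints named in this runbook


def http_fetch(base_url: str) -> Callable[[str], Dict[str, Any]]:
    """Build a fetch function for a control plane, e.g. http://127.0.0.1:8085."""
    def fetch(path: str) -> Dict[str, Any]:
        with urllib.request.urlopen(base_url + path, timeout=5) as resp:
            return json.load(resp)
    return fetch


def triage(fetch: Callable[[str], Dict[str, Any]]) -> List[str]:
    """Run the checks in runbook order and return operator findings."""
    findings: List[str] = []

    health = fetch(f"{BASE}/runtime/health")
    # Assumption: the health payload has a top-level "status" field.
    if health.get("status") == "DOWN":
        findings.append("runtime/health is DOWN: inspect failure class/message and logs")

    # Assumed path; the payload is fetched for manual review only.
    fetch(f"{BASE}/runtime/status")

    issues = fetch(f"{BASE}/runtime/schema-issues")
    if issues.get("fullDumpRequiredSignals"):
        findings.append("full dump required: fix the cause, then submit a fresh ALL_TABLES dump")
    elif issues.get("schemaUncertaintySignals"):
        findings.append("schema uncertain: monitor, escalate to a full dump if it persists")
    return findings
```

Against the packaged example, `triage(http_fetch("http://127.0.0.1:8085"))` would run the checks through the locally published port. Injecting `fetch` keeps the decision logic testable without a running DBLog.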