---
layout: post
title: "Exploring Change Data Capture with Debezium and Jupyter"
date: 2026-04-29
tags: [ debezium, jupyter, python, pandas, demo ]
author: jpechane
---

:imagesdir: /assets/images/2026-05-05-debezium-and-jupyter-integration

When people think about Debezium integrations, the first idea is usually a production pipeline:
capture changes from a database, deliver them to Kafka, and route them to downstream services or analytical systems.
That is, of course, a very common and very good use case.
But there is another side to change data capture that deserves attention too: exploration.

Sometimes you do not want to start with a full deployment and a chain of consumers.
Sometimes you want to inspect the events, validate a connector configuration, build a quick proof of concept, or show colleagues what the stream actually looks like.
In these situations, an interactive environment can be much more effective than a traditional application.

This is where Jupyter notebooks fit surprisingly well.

In this post, we will look at a simple Debezium and Jupyter integration based on the https://github.com/debezium/debezium-examples/tree/main/jupyter[`jupyter` example] in the `debezium-examples` repository.
The demo runs Debezium Engine from a notebook by using https://github.com/memiiso/pydbzengine[`pydbzengine`], captures changes from PostgreSQL, stores the resulting records in a pandas data frame, and lets you inspect and aggregate them interactively.

+++<!-- more -->+++

== Why Jupyter?

Jupyter is usually associated with data science, experimentation, and teaching.
All three are relevant to CDC as well.

When working with Debezium, there are several recurring tasks for which notebooks can be very convenient:

* exploring the structure of change events without writing a full consumer application,
* validating connector properties and table filters,
* demonstrating snapshot and streaming phases during workshops or internal training,
* performing ad-hoc analysis over captured events with pandas,
* troubleshooting whether changes are actually emitted for a particular table or primary key.

This does not mean that Jupyter replaces Kafka Connect, Debezium Server, or the embedded engine in a production deployment.
It does not.
What it gives you is a very low-friction working environment for understanding your data stream before you decide what should happen with it next.

The interesting part is that the integration is not based on any artificial mock.
The notebook uses the same Debezium connector logic that you would use elsewhere.
The only difference is the runtime environment and the consumer implementation.

== The Example

The demo is intentionally small.
It uses two containers:

* a PostgreSQL instance based on the Debezium tutorial database,
* a JupyterLab container with Python dependencies and a JDK installed.

The complete example is available in https://github.com/debezium/debezium-examples/tree/main/jupyter[`debezium-examples/jupyter`].

The flow is straightforward:

1. PostgreSQL stores the source data.
2. Debezium Engine runs inside the notebook kernel through `pydbzengine`.
3. A Python change handler receives Debezium events.
4. The handler converts the incoming records into Python structures.
5. pandas is used to inspect and aggregate the captured changes.

This is a good example of a pattern that is often overlooked.
Debezium is not limited to "database to Kafka" pipelines.
If you can consume change events in-process, you can place Debezium into developer tools, notebooks, scripts, and exploratory workflows too.

image::overview.svg[image,caption="Figure 1: Architecture of the demo: PostgreSQL emits changes, Debezium Engine captures them inside the notebook through pydbzengine, and pandas exposes the resulting events for interactive analysis.",width=800,role=centered-image]

== Starting the Demo

The example is designed to be easy to run locally.
From the `jupyter` directory in `debezium-examples`, start it with:

[source,bash]
----
docker compose up --build
----

Once the services are ready, open JupyterLab at `http://localhost:8888`.
The prepared notebook is named `postgres_cdc_pk_change_counts.ipynb`.

The environment is configured via Docker Compose.
The Jupyter container receives the PostgreSQL connection settings as well as several Debezium properties, for example:

[source,yaml]
----
environment:
  PG_HOST: postgres
  PG_PORT: "5432"
  PG_USER: postgres
  PG_PASSWORD: postgres
  PG_DBNAME: postgres
  TABLE_INCLUDE_LIST: inventory.customers
  SCHEMA_INCLUDE_LIST: inventory
  TOPIC_PREFIX: dbserver1
  SLOT_NAME: pydbzengine_slot
  PLUGIN_NAME: pgoutput
  SNAPSHOT_MODE: initial
----

This is already enough to show an important point.
The notebook is not built around hardcoded sample logic only.
You can adjust the target table, schema, replication slot, or snapshot behavior and immediately observe the effect.
For workshops and experiments, that feedback loop is very useful.
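
Inside the container, the notebook picks these values up from the environment. A minimal sketch of that pattern (the variable names mirror the Compose file above; the fallback defaults are illustrative):

[source,python]
----
import os

# Read connection and connector settings from the container environment,
# falling back to the defaults used by the Compose file.
PG_HOST = os.environ.get("PG_HOST", "postgres")
PG_PORT = os.environ.get("PG_PORT", "5432")
TABLE_INCLUDE_LIST = os.environ.get("TABLE_INCLUDE_LIST", "inventory.customers")
SNAPSHOT_MODE = os.environ.get("SNAPSHOT_MODE", "initial")
----

Changing a value in the Compose file and restarting the stack is then all it takes to reconfigure the connector.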

image::jupyter-lab-notebook.png[image,caption="Figure 2: The prepared notebook opened in JupyterLab, showing the CDC walkthrough in a single interactive workspace.",width=900,role=centered-image]

== Configuring Debezium in the Notebook

The notebook starts by defining the connector configuration in Python:

[source,python]
----
dbz_props = {
    'name': 'pydbzengine-postgres-to-pandas',
    'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
    'database.hostname': PG_HOST,
    'database.port': PG_PORT,
    'database.user': PG_USER,
    'database.password': PG_PASSWORD,
    'database.dbname': PG_DBNAME,
    'topic.prefix': TOPIC_PREFIX,
    'schema.include.list': SCHEMA_INCLUDE_LIST,
    'table.include.list': TABLE_INCLUDE_LIST,
    'slot.name': SLOT_NAME,
    'plugin.name': PLUGIN_NAME,
    'publication.autocreate.mode': PUBLICATION_AUTOCREATE_MODE,
    'include.schema.changes': 'false',
    'snapshot.mode': SNAPSHOT_MODE,
    'offset.storage': 'org.apache.kafka.connect.storage.FileOffsetBackingStore',
    'offset.storage.file.filename': OFFSET_FILE,
    'offset.flush.interval.ms': '1000'
}
----

If you have configured Debezium connectors before, this should look very familiar.
That is precisely the benefit.
The notebook approach does not introduce a new conceptual model.
It reuses the same connector properties and the same source connector implementation.

In other words, the distance between "I am experimenting in a notebook" and "I am configuring a real Debezium deployment" stays small.
That reduces surprises later.

The example uses PostgreSQL and captures changes from `inventory.customers`.
For a first notebook-based CDC demo this is a sensible choice, because the table is small, easy to understand, and already known from the tutorial.

== Collecting Records in Python

The core of the integration is a very small Python change handler.
It receives records from Debezium and stores a normalized representation in memory:

[source,python]
----
import json
from typing import List

from pydbzengine import BasePythonChangeHandler, ChangeEvent

class DataFrameCollector(BasePythonChangeHandler):
    def handleJsonBatch(self, records: List[ChangeEvent]):
        rows = []
        for record in records:
            key_raw = _as_python_str(record.key())
            key_json = json.loads(key_raw if key_raw else '{}')
            value_raw = _as_python_str(record.value())

            value_json = json.loads(value_raw)
            payload = value_json.get('payload', {})

            rows.append({
                'destination': _as_python_str(record.destination()),
                'pk': json.dumps(key_json, sort_keys = True),
                'op': payload.get('op'),
                'ts_ms': payload.get('ts_ms'),
                'before': payload.get('before'),
                'after': payload.get('after'),
            })
        # The notebook appends each batch to a shared in-memory list;
        # _as_python_str and the list itself are defined in earlier cells.
        captured_rows.extend(rows)
----

This is probably my favorite part of the demo because it shows how little code is needed to become productive.

The handler does not try to solve every possible processing scenario.
It simply extracts a few useful fields:

* destination,
* primary key,
* operation type,
* event timestamp,
* before state,
* after state.

That is enough to inspect snapshots, updates, and deletes, and to build compact summaries on top of them.

The notebook then turns the in-memory list into a pandas data frame.
From there you can use standard Python data analysis techniques instead of writing custom Java or Kafka consumer logic.

For example, a timestamp column is derived from `ts_ms`, making it easy to sort, filter, or plot events later.
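
That conversion is a one-liner with pandas. A small self-contained sketch, using hand-written rows shaped like the handler's output instead of live events:

[source,python]
----
import pandas as pd

# Two hand-made rows shaped like the handler's output (illustrative values).
rows = [
    {'destination': 'dbserver1.inventory.customers', 'pk': '{"id": 1001}',
     'op': 'r', 'ts_ms': 1714380000000, 'before': None, 'after': {'id': 1001}},
    {'destination': 'dbserver1.inventory.customers', 'pk': '{"id": 1001}',
     'op': 'u', 'ts_ms': 1714380060000, 'before': {'id': 1001}, 'after': {'id': 1001}},
]

df = pd.DataFrame(rows)
# Derive a proper datetime column from the epoch-millisecond ts_ms field.
df['ts'] = pd.to_datetime(df['ts_ms'], unit='ms')
----

From here, sorting by `ts` or filtering on `op` works like on any other data frame.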

image::captured-records-dataframe.png[image,caption="Figure 3: Real notebook output after running the demo, with snapshot rows and subsequent update events materialized as a pandas DataFrame.",width=900,role=centered-image]

== Running the Engine Interactively

Debezium Engine is a blocking process, so the notebook starts it in a background thread.
Again, the implementation is intentionally minimal:

[source,python]
----
import threading

engine = None
engine_thread = None

def start_engine():
    global engine, engine_thread
    engine = DebeziumJsonEngine(properties = dbz_props, handler = DataFrameCollector())
    engine_thread = threading.Thread(target = engine.run, daemon = True)
    engine_thread.start()
    print('Engine started in background.')

def stop_engine():
    global engine, engine_thread
    if engine is not None:
        engine.close()
    if engine_thread is not None:
        engine_thread.join(timeout = 10)
    print('Engine stopped.')
----

This is one of the areas where Jupyter provides a noticeably better experience than a standalone command-line example.
You can start the engine, leave it running, execute SQL changes, inspect captured events, run another batch of SQL statements, and continue analysis in the same session.
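
The start/stop pattern itself is plain Python threading and can be tried without Debezium at all. A self-contained sketch in which a `threading.Event` stands in for the blocking `engine.run()` loop:

[source,python]
----
import threading
import time

stop_requested = threading.Event()

def blocking_loop():
    # Stand-in for engine.run(): blocks until asked to stop.
    while not stop_requested.is_set():
        time.sleep(0.05)

# Start: the daemon thread runs the loop while the notebook stays responsive.
worker = threading.Thread(target=blocking_loop, daemon=True)
worker.start()

# Stop: signal the loop to finish, then wait for the thread to exit.
stop_requested.set()
worker.join(timeout=5)
----

The daemon flag ensures the background thread does not keep the kernel process alive on shutdown.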

The demo adds `ipywidgets` buttons on top of these functions:

* `Start Engine`
* `Run Sample SQL`
* `Stop Engine`

That makes the notebook suitable not only for developers, but also for demos and technical presentations.
Instead of switching among several terminals, you keep the setup, execution, and analysis in a single view.

== Generating and Inspecting Changes

To produce a stream of updates, the notebook contains a helper function that modifies records in the source table:

[source,python]
----
def run_sample_changes(pk: int = SAMPLE_PK):
    sql_statements = [
        'SET search_path TO inventory',
        f'UPDATE customers SET first_name = first_name || \'-x\' WHERE id = {pk}'
    ]
    # ... the statements are then executed against PostgreSQL (omitted here)
----

This is intentionally simple, but very effective for demonstration purposes.
After you start the engine and execute the sample SQL, you can refresh the data frame cell and immediately see additional change events.

The notebook also makes the distinction between snapshot and streaming events visible.
Debezium operation codes are preserved, so you can decide whether snapshot reads (`r`) should be counted together with actual changes (`c`, `u`, `d`) or filtered out.

That is something many people struggle with when they first start consuming Debezium events.
Seeing it in a notebook helps a lot.
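
Separating the two categories is a simple mask over the `op` column. A sketch with hand-made events (two snapshot reads followed by streamed changes):

[source,python]
----
import pandas as pd

# Hand-written events: op 'r' marks snapshot reads, 'c'/'u'/'d' are streamed changes.
df = pd.DataFrame({
    'pk': ['{"id": 1001}', '{"id": 1002}', '{"id": 1001}', '{"id": 1001}'],
    'op': ['r', 'r', 'u', 'd'],
})

snapshot_reads = df[df['op'] == 'r']                  # initial snapshot rows
actual_changes = df[df['op'].isin(['c', 'u', 'd'])]   # streamed inserts/updates/deletes
----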

For example, the aggregation cell counts changes per primary key:

[source,python]
----
ops = ['c', 'u', 'd']
working_df = records_dataframe()

change_counts = (
    working_df[working_df['op'].fillna('').astype(str).isin(ops)]
    .groupby('pk', as_index = False)
    .size()
    .rename(columns = {'size': 'change_count'})
    .sort_values('change_count', ascending = False)
)
----

This is exactly the kind of analysis that would be cumbersome if you only inspected raw JSON in logs.
With pandas, it becomes trivial.
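
You can try the same aggregation without a running engine by feeding it a hand-made frame; here `working_df` is built inline instead of coming from `records_dataframe()`:

[source,python]
----
import pandas as pd

# Three updates (two for id 1001, one for id 1002) and one snapshot read.
working_df = pd.DataFrame({
    'pk': ['{"id": 1001}', '{"id": 1001}', '{"id": 1002}', '{"id": 1001}'],
    'op': ['u', 'u', 'u', 'r'],
})

ops = ['c', 'u', 'd']
change_counts = (
    working_df[working_df['op'].fillna('').astype(str).isin(ops)]
    .groupby('pk', as_index=False)
    .size()
    .rename(columns={'size': 'change_count'})
    .sort_values('change_count', ascending=False)
)
----

The snapshot read is filtered out, so id 1001 counts two changes and id 1002 counts one.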

image::change-counts-per-primary-key.png[image,caption="Figure 4: Aggregated change counts per primary key after executing sample updates against the source table.",width=700,role=centered-image]

And once the data is in a data frame, you can naturally extend the notebook further:

* compare activity across keys,
* inspect change velocity over time,
* identify hot rows,
* validate whether a filter includes the expected tables,
* experiment with delete handling and snapshots.

== Why This Matters

At first glance, a notebook may seem like a toy compared to a full CDC pipeline.
I do not think that is the right way to look at it.

Interactive environments are often the fastest route to clarity.
When a team evaluates CDC for a new use case, the first obstacle is usually not throughput or deployment topology.
It is understanding the shape and semantics of the emitted events.

Questions tend to be very basic:

* What does the key look like?
* What is in `before` and `after`?
* How do snapshot records differ from streaming updates?
* Did my table filter work?
* Are deletes represented the way my downstream processing expects?

You can answer all these questions in a notebook in minutes.
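
Most of these questions come down to looking at the event envelope itself. A sketch with a hand-written value shaped like Debezium's default JSON envelope (abbreviated; real events also carry schema and source metadata):

[source,python]
----
import json

# Hand-made event value: an update to a customer row (illustrative fields).
value_raw = '''{
  "payload": {
    "op": "u",
    "ts_ms": 1714380060000,
    "before": {"id": 1001, "first_name": "Sally"},
    "after":  {"id": 1001, "first_name": "Sally-x"}
  }
}'''

payload = json.loads(value_raw)['payload']
# before/after hold the full row state, so the change is directly visible.
change = (payload['before']['first_name'], payload['after']['first_name'])
----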

There is also a broader Python angle here.
A lot of analytical and data engineering work today happens in Python-first environments.
By using `pydbzengine`, Debezium becomes accessible in that ecosystem without forcing users to abandon the tools they already know.

This does not replace the Java APIs, Kafka-based deployments, or Debezium Server.
It complements them.
In practice, that means:

* analysts can explore CDC interactively,
* data engineers can prototype consumers quickly,
* platform teams can validate source behavior before provisioning infrastructure,
* educators can explain Debezium concepts with immediate feedback.

== Where to Go Next

The provided demo intentionally keeps the scope narrow.
It focuses on one table, one notebook, and one simple aggregation.
That is the right choice for a first example, but it should also give you ideas for extensions.

For instance, you could:

* capture multiple tables and analyze them together,
* flatten events before analysis,
* persist the captured records into Parquet or DuckDB,
* visualize event rates with matplotlib,
* connect the notebook to a machine learning workflow,
* compare snapshot and streaming latency under different connector settings.
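
As a sketch of the event-rate idea: resampling the derived timestamp column gives per-second counts that can be handed straight to matplotlib for plotting (hand-made timestamps here, three events in one second and one event two seconds later):

[source,python]
----
import pandas as pd

# Hand-written epoch-millisecond timestamps standing in for captured ts_ms values.
df = pd.DataFrame({'ts_ms': [1714380000100, 1714380000500, 1714380000900, 1714380002200]})
df['ts'] = pd.to_datetime(df['ts_ms'], unit='ms')

# Events per second, including empty intervals; rate.plot() would chart it.
rate = df.set_index('ts').resample('1s').size()
----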

If this sounds familiar, it should.
We already saw in earlier Debezium examples that notebooks can be useful for machine learning scenarios too.
The difference here is that the new Jupyter example is much more direct: it focuses on CDC exploration itself instead of using the notebook only as a UI on top of another pipeline.

That makes it a good building block for many other demos.

== Final Thoughts

Debezium is often introduced through large-scale streaming architectures, and rightly so.
But one of its strengths is that the same CDC foundation can be applied at very different scales and in very different runtimes.

The Jupyter integration shows that Debezium can be just as useful in an interactive notebook as it is in a production data pipeline.
You can configure a real connector, capture real change events, and analyze them with familiar Python tools in a matter of minutes.

If you are evaluating Debezium, teaching CDC concepts, or simply trying to understand the behavior of a source table, this is a very practical place to start.
And if you already use Debezium in production, a notebook like this can become a handy addition to your toolbox for experimentation, validation, and troubleshooting.

You can find the full demo in the https://github.com/debezium/debezium-examples/tree/main/jupyter[`debezium-examples` repository].
If you build on top of it, let us know what kind of notebook-based CDC workflows you find useful.