---
layout: post
title: "Exploring Change Data Capture with Debezium and Jupyter"
date: 2026-04-29
tags: [ debezium, jupyter, python, pandas, demo ]
author: jpechane
---

:imagesdir: /assets/images/2026-05-05-debezium-and-jupyter-integration

When people think about Debezium integrations, the first idea is usually a production pipeline:
capture changes from a database, deliver them to Kafka, and route them to downstream services or analytical systems.
That is, of course, a very common and very good use case.
But there is another side to change data capture that deserves attention too: exploration.

Sometimes you do not want to start with a full deployment and a chain of consumers.
Sometimes you want to inspect the events, validate a connector configuration, build a quick proof of concept, or show colleagues what the stream actually looks like.
In these situations, an interactive environment can be much more effective than a traditional application.

This is where Jupyter notebooks fit surprisingly well.

In this post, we will look at a simple Debezium and Jupyter integration based on the https://github.com/debezium/debezium-examples/tree/main/jupyter[`jupyter` example] in the `debezium-examples` repository.
The demo runs Debezium Engine from a notebook by using https://github.com/memiiso/pydbzengine[`pydbzengine`], captures changes from PostgreSQL, stores the resulting records in a pandas data frame, and lets you inspect and aggregate them interactively.

+++<!-- more -->+++

== Why Jupyter?

Jupyter is usually associated with data science, experimentation, and teaching.
All three are relevant to CDC as well.

When working with Debezium, there are several recurring tasks for which notebooks can be very convenient:

* exploring the structure of change events without writing a full consumer application,
* validating connector properties and table filters,
* demonstrating snapshot and streaming phases during workshops or internal training,
* performing ad-hoc analysis over captured events with pandas,
* troubleshooting whether changes are actually emitted for a particular table or primary key.

This does not mean that Jupyter replaces Kafka Connect, Debezium Server, or the embedded engine in a production deployment.
It does not.
What it gives you is a very low-friction working environment for understanding your data stream before you decide what should happen with it next.

The interesting part is that the integration is not based on any artificial mock.
The notebook uses the same Debezium connector logic that you would use elsewhere.
The only difference is the runtime environment and the consumer implementation.

== The Example

The demo is intentionally small.
It uses two containers:

* a PostgreSQL instance based on the Debezium tutorial database,
* a Jupyter Lab container with Python dependencies and a JDK installed.

The complete example is available in https://github.com/debezium/debezium-examples/tree/main/jupyter[`debezium-examples/jupyter`].

The flow is straightforward:

1. PostgreSQL stores the source data.
2. Debezium Engine runs inside the notebook kernel through `pydbzengine`.
3. A Python change handler receives Debezium events.
4. The handler converts the incoming records into Python structures.
5. pandas is used to inspect and aggregate the captured changes.

This is a good example of a pattern that is often overlooked.
Debezium is not limited to "database to Kafka" pipelines.
If you can consume change events in-process, you can place Debezium into developer tools, notebooks, scripts, and exploratory workflows too.

image::overview.svg[image,caption="Figure 1: Architecture of the demo: PostgreSQL emits changes, Debezium Engine captures them inside the notebook through pydbzengine, and pandas exposes the resulting events for interactive analysis.",width=800,role=centered-image]

== Starting the Demo

The example is designed to be easy to run locally.
From the `jupyter` directory in `debezium-examples`, start it with:

[source,bash]
----
docker compose up --build
----

Once the services are ready, open Jupyter Lab at `http://localhost:8888`.
The prepared notebook is named `postgres_cdc_pk_change_counts.ipynb`.

The environment is configured via Docker Compose.
The Jupyter container receives the PostgreSQL connection settings as well as several Debezium properties, for example:

[source,yaml]
----
environment:
  PG_HOST: postgres
  PG_PORT: "5432"
  PG_USER: postgres
  PG_PASSWORD: postgres
  PG_DBNAME: postgres
  TABLE_INCLUDE_LIST: inventory.customers
  SCHEMA_INCLUDE_LIST: inventory
  TOPIC_PREFIX: dbserver1
  SLOT_NAME: pydbzengine_slot
  PLUGIN_NAME: pgoutput
  SNAPSHOT_MODE: initial
----

This is already enough to show an important point.
The notebook is not built around hardcoded sample logic only.
You can adjust the target table, schema, replication slot, or snapshot behavior and immediately observe the effect.
For workshops and experiments, that feedback loop is very useful.
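
Inside the notebook these variables end up as plain Python values; a minimal sketch of reading them with `os.environ` follows, where the fallback defaults are an assumption for illustration rather than code from the demo:

[source,python]
----
import os

# Pick up the connection settings injected by Docker Compose;
# the defaults here are assumed fallbacks, not the demo's exact code.
PG_HOST = os.environ.get('PG_HOST', 'postgres')
PG_PORT = os.environ.get('PG_PORT', '5432')
PG_USER = os.environ.get('PG_USER', 'postgres')
PG_PASSWORD = os.environ.get('PG_PASSWORD', 'postgres')
PG_DBNAME = os.environ.get('PG_DBNAME', 'postgres')
TABLE_INCLUDE_LIST = os.environ.get('TABLE_INCLUDE_LIST', 'inventory.customers')
----

Changing a value in the Compose file and restarting the container is then enough to reconfigure the whole notebook.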

image::jupyter-lab-notebook.png[image,caption="Figure 2: The prepared notebook opened in JupyterLab, showing the CDC walkthrough in a single interactive workspace.",width=900,role=centered-image]

== Configuring Debezium in the Notebook

The notebook starts by defining the connector configuration in Python:

[source,python]
----
dbz_props = {
    'name': 'pydbzengine-postgres-to-pandas',
    'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
    'database.hostname': PG_HOST,
    'database.port': PG_PORT,
    'database.user': PG_USER,
    'database.password': PG_PASSWORD,
    'database.dbname': PG_DBNAME,
    'topic.prefix': TOPIC_PREFIX,
    'schema.include.list': SCHEMA_INCLUDE_LIST,
    'table.include.list': TABLE_INCLUDE_LIST,
    'slot.name': SLOT_NAME,
    'plugin.name': PLUGIN_NAME,
    'publication.autocreate.mode': PUBLICATION_AUTOCREATE_MODE,
    'include.schema.changes': 'false',
    'snapshot.mode': SNAPSHOT_MODE,
    'offset.storage': 'org.apache.kafka.connect.storage.FileOffsetBackingStore',
    'offset.storage.file.filename': OFFSET_FILE,
    'offset.flush.interval.ms': '1000'
}
----

If you have configured Debezium connectors before, this should look very familiar.
That is precisely the benefit.
The notebook approach does not introduce a new conceptual model.
It reuses the same connector properties and the same source connector implementation.

In other words, the distance between "I am experimenting in a notebook" and "I am configuring a real Debezium deployment" stays small.
That reduces surprises later.

The example uses PostgreSQL and captures changes from `inventory.customers`.
For a first notebook-based CDC demo this is a sensible choice, because the table is small, easy to understand, and already known from the tutorial.

== Collecting Records in Python

The core of the integration is a very small Python change handler.
It receives records from Debezium and stores a normalized representation in memory:

[source,python]
----
class DataFrameCollector(BasePythonChangeHandler):
    def handleJsonBatch(self, records: List[ChangeEvent]):
        rows = []
        for record in records:
            key_raw = _as_python_str(record.key())
            key_json = json.loads(key_raw if key_raw else '{}')
            value_raw = _as_python_str(record.value())

            value_json = json.loads(value_raw)
            payload = value_json.get('payload', {})

            rows.append({
                'destination': _as_python_str(record.destination()),
                'pk': json.dumps(key_json, sort_keys = True),
                'op': payload.get('op'),
                'ts_ms': payload.get('ts_ms'),
                'before': payload.get('before'),
                'after': payload.get('after'),
            })
----

This is probably my favorite part of the demo because it shows how little code is needed to become productive.

The handler does not try to solve every possible processing scenario.
It simply extracts a few useful fields:

* destination,
* primary key,
* operation type,
* event timestamp,
* before state,
* after state.

That is enough to inspect snapshots, updates, and deletes, and to build compact summaries on top of them.

The notebook then turns the in-memory list into a pandas data frame.
From there you can use standard Python data analysis techniques instead of writing custom Java or Kafka consumer logic.

For example, a timestamp column is derived from `ts_ms`, making it easy to sort, filter, or plot events later.
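
The exact cell differs in the notebook, but the conversion itself is a single line of pandas; the `rows` below are stand-in sample data in the shape the handler produces:

[source,python]
----
import pandas as pd

# Stand-in rows shaped like the handler output; the real notebook
# builds these from captured Debezium events.
rows = [
    {'pk': '{"id": 1001}', 'op': 'r', 'ts_ms': 1714380000000},
    {'pk': '{"id": 1001}', 'op': 'u', 'ts_ms': 1714380005000},
]

df = pd.DataFrame(rows)
# Debezium reports event time as epoch milliseconds.
df['ts'] = pd.to_datetime(df['ts_ms'], unit='ms')
----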

image::captured-records-dataframe.png[image,caption="Figure 3: Real notebook output after running the demo, with snapshot rows and subsequent update events materialized as a pandas DataFrame.",width=900,role=centered-image]

== Running the Engine Interactively

Debezium Engine is a blocking process, so the notebook starts it in a background thread.
Again, the implementation is intentionally minimal:

[source,python]
----
def start_engine():
    global engine, engine_thread
    engine = DebeziumJsonEngine(properties = dbz_props, handler = DataFrameCollector())
    engine_thread = threading.Thread(target = engine.run, daemon = True)
    engine_thread.start()
    print('Engine started in background.')

def stop_engine():
    global engine, engine_thread
    engine.close()
    if engine_thread is not None:
        engine_thread.join(timeout = 10)
    print('Engine stopped.')
----

This is one of the areas where Jupyter provides a noticeably better experience than a standalone command-line example.
You can start the engine, leave it running, execute SQL changes, inspect captured events, run another batch of SQL statements, and continue analysis in the same session.

The demo adds `ipywidgets` buttons on top of these functions:

* `Start Engine`
* `Run Sample SQL`
* `Stop Engine`

That makes the notebook suitable not only for developers, but also for demos and technical presentations.
Instead of switching among several terminals, you keep the setup, execution, and analysis in a single view.

== Generating and Inspecting Changes

To produce a stream of updates, the notebook contains a helper function that modifies records in the source table:

[source,python]
----
def run_sample_changes(pk: int = SAMPLE_PK):
    sql_statements = [
        'SET search_path TO inventory',
        f'UPDATE customers SET first_name = first_name || \'-x\' WHERE id = {pk}'
    ]
----

This is intentionally simple, but very effective for demonstration purposes.
After you start the engine and execute the sample SQL, you can refresh the data frame cell and immediately see additional change events.

The notebook also makes the distinction between snapshot and streaming events visible.
Debezium operation codes are preserved, so you can decide whether snapshot reads (`r`) should be counted together with actual changes (`c`, `u`, `d`) or filtered out.

That is something many people struggle with when they first start consuming Debezium events.
Seeing it in a notebook helps a lot.

For example, the aggregation cell counts changes per primary key:

[source,python]
----
ops = ['c', 'u', 'd']
working_df = records_dataframe()

change_counts = (
    working_df[working_df['op'].fillna('').astype(str).isin(ops)]
    .groupby('pk', as_index = False)
    .size()
    .rename(columns = {'size': 'change_count'})
    .sort_values('change_count', ascending = False)
)
----

This is exactly the kind of analysis that would be cumbersome if you only inspected raw JSON in logs.
With pandas, it becomes trivial.

image::change-counts-per-primary-key.png[image,caption="Figure 4: Aggregated change counts per primary key after executing sample updates against the source table.",width=700,role=centered-image]

And once the data is in a data frame, you can naturally extend the notebook further:

* compare activity across keys,
* inspect change velocity over time,
* identify hot rows,
* validate whether a filter includes the expected tables,
* experiment with delete handling and snapshots.
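
As a sketch of the change-velocity idea, events can be bucketed per minute with `resample`; the sample data below is invented for illustration, not taken from the demo:

[source,python]
----
import pandas as pd

# Invented events; in the notebook these come from the captured stream.
events = pd.DataFrame({
    'pk': ['{"id": 1001}', '{"id": 1001}', '{"id": 1002}'],
    'op': ['u', 'u', 'c'],
    'ts_ms': [1714380000000, 1714380030000, 1714380090000],
})

events['ts'] = pd.to_datetime(events['ts_ms'], unit='ms')

# Events per one-minute bucket: a simple view of change velocity.
velocity = events.set_index('ts').resample('1min').size()
----

The same frame, grouped by `pk` instead of time, immediately surfaces hot rows.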

== Why This Matters

At first glance, a notebook may seem like a toy compared to a full CDC pipeline.
I do not think that is the right way to look at it.

Interactive environments are often the fastest route to clarity.
When a team evaluates CDC for a new use case, the first obstacle is usually not throughput or deployment topology.
It is understanding the shape and semantics of the emitted events.

Questions tend to be very basic:

* What does the key look like?
* What is in `before` and `after`?
* How do snapshot records differ from streaming updates?
* Did my table filter work?
* Are deletes represented the way my downstream processing expects?

You can answer all these questions in a notebook in minutes.
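
The delete question, for instance, reduces to a one-line filter; the frame below is fabricated sample data in the event shape the handler emits:

[source,python]
----
import pandas as pd

# Fabricated events in the shape the handler emits.
df = pd.DataFrame([
    {'pk': '{"id": 1001}', 'op': 'u', 'before': {'id': 1001}, 'after': {'id': 1001}},
    {'pk': '{"id": 1002}', 'op': 'd', 'before': {'id': 1002}, 'after': None},
])

# Deletes carry the old row in 'before' and no 'after' state.
deletes = df[df['op'] == 'd']
----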

There is also a broader Python angle here.
A lot of analytical and data engineering work today happens in Python-first environments.
By using `pydbzengine`, Debezium becomes accessible in that ecosystem without forcing users to abandon the tools they already know.

This does not replace the Java APIs, Kafka-based deployments, or Debezium Server.
It complements them.
In practice, that means:

* analysts can explore CDC interactively,
* data engineers can prototype consumers quickly,
* platform teams can validate source behavior before provisioning infrastructure,
* educators can explain Debezium concepts with immediate feedback.

== Where to Go Next

The provided demo intentionally keeps the scope narrow.
It focuses on one table, one notebook, and one simple aggregation.
That is the right choice for a first example, but it should also give you ideas for extensions.

For instance, you could:

* capture multiple tables and analyze them together,
* flatten events before analysis,
* persist the captured records into Parquet or DuckDB,
* visualize event rates with matplotlib,
* connect the notebook to a machine learning workflow,
* compare snapshot and streaming latency under different connector settings.
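
As one concrete extension, flattening could use `pandas.json_normalize` on the captured `after` states; the input rows here are again invented for illustration:

[source,python]
----
import pandas as pd

# Invented 'after' payloads; in the demo they come from captured events.
after_states = [
    {'id': 1001, 'first_name': 'Sally', 'last_name': 'Thomas'},
    {'id': 1002, 'first_name': 'George', 'last_name': 'Bailey'},
]

# One column per field instead of a nested dict per row.
flat = pd.json_normalize(after_states)
----

A frame like this is then ready to be written out with `to_parquet` or queried further.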

If this sounds familiar, it should.
We already saw in earlier Debezium examples that notebooks can be useful for machine learning scenarios too.
The difference here is that the new Jupyter example is much more direct: it focuses on CDC exploration itself instead of using the notebook only as a UI on top of another pipeline.

That makes it a good building block for many other demos.

== Final Thoughts

Debezium is often introduced through large-scale streaming architectures, and rightly so.
But one of its strengths is that the same CDC foundation can be applied at very different scales and in very different runtimes.

The Jupyter integration shows that Debezium can be just as useful in an interactive notebook as it is in a production data pipeline.
You can configure a real connector, capture real change events, and analyze them with familiar Python tools in a matter of minutes.

If you are evaluating Debezium, teaching CDC concepts, or simply trying to understand the behavior of a source table, this is a very practical place to start.
And if you already use Debezium in production, a notebook like this can become a handy addition to your toolbox for experimentation, validation, and troubleshooting.

You can find the full demo in the https://github.com/debezium/debezium-examples/tree/main/jupyter[`debezium-examples` repository].
If you build on top of it, let us know what kind of notebook-based CDC workflows you find useful.