# dlt-openlineage

OpenLineage integration for [dlt](https://dlthub.com/) pipelines. Automatically emits lineage events to [Marquez](https://marquezproject.ai/) or any OpenLineage-compatible backend.

## Features

- **Input/output dataset lineage**: Track which tables a pipeline reads and writes
- **Schema capture**: Column names and types for each destination table
- **Row counts**: Per-table row counts from the normalize step
- **Destination-aware namespaces**: Output datasets namespaced by destination type and fingerprint
- **Per-step events**: `START`, `RUNNING` (extract), `COMPLETE` (load), and `FAIL` events
- **Processing engine metadata**: dlt version and adapter version in every event

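To make those facets concrete, here is roughly what a terminal `COMPLETE` event for a small pipeline could carry. This is a hand-written illustration, not captured output: all values are hypothetical, and required facet boilerplate such as `_producer`/`_schemaURL` is omitted.

```python
# Illustrative RunEvent payload as a plain dict. Per the OpenLineage
# spec, the schema facet sits under "facets" and per-run output
# statistics (row counts) under "outputFacets".
complete_event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "job": {"namespace": "my_project", "name": "my_pipeline"},
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "inputs": [],
    "outputs": [
        {
            "namespace": "duckdb://local",
            "name": "raw_data.users",
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "id", "type": "bigint"},
                        {"name": "name", "type": "text"},
                        {"name": "email", "type": "text"},
                    ]
                }
            },
            "outputFacets": {"outputStatistics": {"rowCount": 2}},
        }
    ],
}
```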
## Installation

```bash
pip install dlt-openlineage
```

Or with uv:

```bash
uv add dlt-openlineage
```

## Quick Start

Add two lines before your pipeline runs:

```python
import dlt
import dlt_openlineage

dlt_openlineage.install(url="http://localhost:5000")

@dlt.resource
def users():
    yield [
        {"id": 1, "name": "Alice", "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "email": "bob@example.com"},
    ]

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="raw_data",
)

pipeline.run(users())
```

That's it. OpenLineage events are emitted automatically during pipeline execution.

## Environment Variables

You can also configure via environment variables:

```bash
export OPENLINEAGE_URL=http://localhost:5000
export OPENLINEAGE_NAMESPACE=my_project
export OPENLINEAGE_API_KEY=... # optional, for authenticated endpoints
```

Then:

```python
import dlt_openlineage

dlt_openlineage.install()  # reads from env vars
```

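A reasonable expectation (and a common convention) is that explicit arguments win over the environment. A sketch of that resolution logic, illustrative only and not the package's actual code:

```python
import os

def resolve_config(url=None, namespace=None, api_key=None):
    """Illustrative config resolution: explicit arguments take
    precedence, then the OPENLINEAGE_* environment variables."""
    return {
        "url": url or os.environ.get("OPENLINEAGE_URL"),
        "namespace": namespace or os.environ.get("OPENLINEAGE_NAMESPACE", "default"),
        "api_key": api_key or os.environ.get("OPENLINEAGE_API_KEY"),
    }
```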
## How It Works

This package implements dlt's `SupportsTracking` protocol and registers via `dlt.pipeline.trace.TRACKING_MODULES`. The tracker intercepts pipeline lifecycle events and emits corresponding OpenLineage events:

| dlt Step | OpenLineage Event | Data Included |
|----------|-------------------|---------------|
| Pipeline start | RunEvent(START) | Job type, processing engine |
| Extract complete | RunEvent(RUNNING) | Input datasets (extracted tables) |
| Load complete | RunEvent(COMPLETE) | Output datasets with schema, row counts, input datasets |
| Any step failure | RunEvent(FAIL) | Error message and stack trace |
| Pipeline end (no load) | RunEvent(COMPLETE) | Fallback terminal event from pipeline schema |

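A tracking module in this style is just a module exposing hook functions that dlt calls around each step. The sketch below records events instead of emitting OpenLineage; the hook names follow dlt's `SupportsTracking` protocol, but the exact signatures vary between dlt versions, so treat them as assumptions and check your dlt release.

```python
# Sketch of a tracking module: dlt would call these hooks around each
# pipeline step. Here they only record what happened; a real adapter
# would translate each call into an OpenLineage RunEvent.
# Hook names and signatures are assumptions -- verify against your
# dlt version's SupportsTracking protocol.
emitted = []

def on_start_trace(trace, step, pipeline):
    emitted.append(("START", step))

def on_end_trace_step(trace, step, pipeline, step_info, send_state):
    # In dlt, `step.step` would be "extract", "normalize", or "load".
    emitted.append(("STEP_DONE", step.step))

def on_end_trace(trace, pipeline, send_state):
    emitted.append(("COMPLETE", None))
```

Listing such a module in `dlt.pipeline.trace.TRACKING_MODULES` is what wires the hooks into pipeline execution.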
Output dataset namespaces are derived from the destination (e.g., `duckdb://local`, `postgres://host:5432`), and each output dataset name is the table name qualified by the pipeline's dataset name (e.g., `raw_data.users`).

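That naming scheme can be pictured as a simple pure function. This is a simplified sketch built from the examples above, not the adapter's actual code; in particular, the real namespace also folds in the destination fingerprint.

```python
def dataset_identity(destination_type, location, dataset_name, table_name):
    """Illustrative: derive an OpenLineage (namespace, name) pair in the
    style of the examples above. Not the adapter's actual logic."""
    namespace = f"{destination_type}://{location}"
    name = f"{dataset_name}.{table_name}"
    return namespace, name
```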
## Testing with Marquez

```bash
# Start Marquez locally
docker run -p 5000:5000 -p 5001:5001 -p 3000:3000 marquezproject/marquez

# Point your pipeline at it
export OPENLINEAGE_URL=http://localhost:5000

# Run your dlt pipeline
python my_pipeline.py

# View lineage at http://localhost:3000
```

## Development

```bash
# Install dependencies
uv sync --dev

# Run tests (unit + integration with real dlt+DuckDB)
uv run pytest tests/ -v

# Lint
uv run ruff check src/
```

## License

MIT
