| 1 | +# dlt-openlineage |
| 2 | + |
| 3 | +OpenLineage integration for [dlt](https://dlthub.com/) pipelines. It automatically emits lineage events to [Marquez](https://marquezproject.ai/) or any other OpenLineage-compatible backend. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +- **Input/output dataset lineage**: Track which tables a pipeline reads and writes |
| 8 | +- **Schema capture**: Column names and types for each destination table |
| 9 | +- **Row counts**: Per-table row counts from the normalize step |
| 10 | +- **Destination-aware namespaces**: Output datasets namespaced by destination type and fingerprint |
| 11 | +- **Per-step events**: START, RUNNING (extract), COMPLETE (load), FAIL events |
| 12 | +- **Processing engine metadata**: dlt version and adapter version in every event |
| 13 | + |
| 14 | +## Installation |
| 15 | + |
| 16 | +```bash |
| 17 | +pip install dlt-openlineage |
| 18 | +``` |
| 19 | + |
| 20 | +Or with uv: |
| 21 | + |
| 22 | +```bash |
| 23 | +uv add dlt-openlineage |
| 24 | +``` |
| 25 | + |
| 26 | +## Quick Start |
| 27 | + |
| 28 | +Add two lines before your pipeline runs: |
| 29 | + |
| 30 | +```python |
| 31 | +import dlt |
| 32 | +import dlt_openlineage |
| 33 | + |
| 34 | +dlt_openlineage.install(url="http://localhost:5000") |
| 35 | + |
| 36 | +@dlt.resource |
| 37 | +def users(): |
| 38 | +    yield [ |
| 39 | +        {"id": 1, "name": "Alice", "email": "alice@example.com"}, |
| 40 | +        {"id": 2, "name": "Bob", "email": "bob@example.com"}, |
| 41 | +    ] |
| 42 | + |
| 43 | +pipeline = dlt.pipeline( |
| 44 | +    pipeline_name="my_pipeline", |
| 45 | +    destination="duckdb", |
| 46 | +    dataset_name="raw_data", |
| 47 | +) |
| 48 | + |
| 49 | +pipeline.run(users()) |
| 50 | +``` |
| 51 | + |
| 52 | +That's it. OpenLineage events are emitted automatically during pipeline execution. |
| 53 | + |
| 54 | +## Environment Variables |
| 55 | + |
| 56 | +You can also configure the integration through environment variables instead of passing arguments to `install()`: |
| 57 | + |
| 58 | +```bash |
| 59 | +export OPENLINEAGE_URL=http://localhost:5000 |
| 60 | +export OPENLINEAGE_NAMESPACE=my_project |
| 61 | +export OPENLINEAGE_API_KEY=... # optional, for authenticated endpoints |
| 62 | +``` |
| 63 | + |
| 64 | +Then: |
| 65 | + |
| 66 | +```python |
| 67 | +import dlt_openlineage |
| 68 | +dlt_openlineage.install() # reads from env vars |
| 69 | +``` |
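To make the environment-variable path concrete, here is a minimal sketch of how such settings could be resolved. It is not the package's actual implementation: `LineageConfig`, `resolve_config`, the rule that explicit arguments win over the environment, and the `"default"` fallback namespace are all assumptions made for illustration. Only the three variable names come from the section above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LineageConfig:
    """Hypothetical container for the three settings shown above."""

    url: Optional[str]
    namespace: str
    api_key: Optional[str]


def resolve_config(
    url: Optional[str] = None,
    namespace: Optional[str] = None,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> LineageConfig:
    """Merge explicit arguments with environment variables.

    Assumption: an explicit argument takes precedence over the matching
    environment variable, and the namespace falls back to "default".
    """
    env = os.environ if env is None else env
    return LineageConfig(
        url=url or env.get("OPENLINEAGE_URL"),
        namespace=namespace or env.get("OPENLINEAGE_NAMESPACE") or "default",
        api_key=api_key or env.get("OPENLINEAGE_API_KEY"),
    )
```

Passing a plain dictionary as `env` keeps the function easy to test without touching the real process environment.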
| 70 | + |
| 71 | +## How It Works |
| 72 | + |
| 73 | +This package implements dlt's `SupportsTracking` protocol and registers via `dlt.pipeline.trace.TRACKING_MODULES`. The tracker intercepts pipeline lifecycle events and emits corresponding OpenLineage events: |
| 74 | + |
| 75 | +| dlt Step | OpenLineage Event | Data Included | |
| 76 | +|----------|-------------------|---------------| |
| 77 | +| Pipeline start | RunEvent(START) | Job type, processing engine | |
| 78 | +| Extract complete | RunEvent(RUNNING) | Input datasets (extracted tables) | |
| 79 | +| Load complete | RunEvent(COMPLETE) | Output datasets with schema, row counts, input datasets | |
| 80 | +| Any step failure | RunEvent(FAIL) | Error message and stack trace | |
| 81 | +| Pipeline end (no load) | RunEvent(COMPLETE) | Fallback terminal event from pipeline schema | |
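The table above can be read as a small decision function. The sketch below is only an illustration of that mapping: the `moment` labels and the `event_type_for` helper are hypothetical names, not dlt or dlt-openlineage APIs, and the real tracker may structure this logic differently.

```python
from typing import Optional


def event_type_for(
    moment: str,
    failed: bool = False,
    load_completed: bool = False,
) -> Optional[str]:
    """Return the OpenLineage event type for a pipeline moment, or None.

    `moment` uses hypothetical labels that mirror the rows of the table:
    "pipeline_start", "extract_complete", "load_complete", "pipeline_end".
    """
    if failed:
        # Any step failure ends the run with FAIL.
        return "FAIL"
    if moment == "pipeline_start":
        return "START"
    if moment == "extract_complete":
        return "RUNNING"
    if moment == "load_complete":
        return "COMPLETE"
    if moment == "pipeline_end":
        # Fallback terminal event: only needed when no load step
        # has already emitted COMPLETE for this run.
        return None if load_completed else "COMPLETE"
    return None
```

Treating the pipeline-end case as a fallback means a run that stops after extract or normalize still gets a terminal event, while a run that loaded successfully does not get a second COMPLETE.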
| 82 | + |
| 83 | +Output dataset namespaces are derived from the destination (e.g., `duckdb://local`, `postgres://host:5432`), and each output dataset name is the table name qualified by the pipeline's `dataset_name` (e.g., `raw_data.users`). |
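As a rough picture of what a COMPLETE event for the Quick Start pipeline might carry, the sketch below builds a simplified event as a plain dictionary. The helper names are hypothetical and the column types are example values. The bookkeeping fields the OpenLineage spec requires (`producer` and `schemaURL` on the event, `_producer` and `_schemaURL` on each facet) are left out for brevity, so this is not a payload a backend would accept as is.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple
from uuid import uuid4


def output_dataset(
    namespace: str,
    dataset_name: str,
    table: str,
    columns: List[Tuple[str, str]],
    row_count: int,
) -> Dict[str, Any]:
    """Describe one destination table as an OpenLineage output dataset."""
    return {
        "namespace": namespace,  # e.g. "duckdb://local"
        "name": f"{dataset_name}.{table}",  # e.g. "raw_data.users"
        "facets": {
            # Schema capture: column names and types.
            "schema": {"fields": [{"name": n, "type": t} for n, t in columns]},
        },
        "outputFacets": {
            # Row counts reported per table.
            "outputStatistics": {"rowCount": row_count},
        },
    }


def complete_event(
    job_namespace: str,
    job_name: str,
    outputs: List[Dict[str, Any]],
    run_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a simplified RunEvent(COMPLETE) payload."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [],
        "outputs": outputs,
    }


users = output_dataset(
    "duckdb://local",
    "raw_data",
    "users",
    [("id", "bigint"), ("name", "text"), ("email", "text")],
    row_count=2,
)
event = complete_event("my_project", "my_pipeline", [users])
```

Keeping the job namespace (`my_project`) separate from the dataset namespace (`duckdb://local`) is what lets a backend link the same physical table across different pipelines.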
| 84 | + |
| 85 | +## Testing with Marquez |
| 86 | + |
| 87 | +```bash |
| 88 | +# Start Marquez locally |
| 89 | +docker run -p 5000:5000 -p 5001:5001 -p 3000:3000 marquezproject/marquez |
| 90 | + |
| 91 | +# Point your pipeline at it |
| 92 | +export OPENLINEAGE_URL=http://localhost:5000 |
| 93 | + |
| 94 | +# Run your dlt pipeline |
| 95 | +python my_pipeline.py |
| 96 | + |
| 97 | +# View lineage at http://localhost:3000 |
| 98 | +``` |
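Besides the web UI, you can check that events arrived by querying Marquez's REST API under `/api/v1/namespaces`. The sketch below assumes the API is reachable on port 5000, as in the command above. `marquez_url` and `fetch_json` are illustrative helpers, not part of this package. Namespaces such as `duckdb://local` must be URL-encoded when used as a path segment.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import quote


def marquez_url(
    base_url: str,
    namespace: Optional[str] = None,
    kind: str = "jobs",
) -> str:
    """Build a Marquez API URL.

    With no namespace, returns the namespace listing endpoint.
    Otherwise returns the "jobs" or "datasets" listing for that namespace.
    """
    base = base_url.rstrip("/") + "/api/v1/namespaces"
    if namespace is None:
        return base
    # safe="" also encodes ":" and "/" so the namespace stays one path segment.
    return f"{base}/{quote(namespace, safe='')}/{kind}"


def fetch_json(url: str, timeout: float = 5.0) -> Any:
    """GET a URL and parse the JSON response (requires a running Marquez)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_json(marquez_url("http://localhost:5000", "my_project"))` would list the jobs recorded under the `my_project` namespace once a pipeline has run, and `marquez_url("http://localhost:5000", "duckdb://local", kind="datasets")` points at the output datasets for a local DuckDB destination.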
| 99 | + |
| 100 | +## Development |
| 101 | + |
| 102 | +```bash |
| 103 | +# Install dependencies |
| 104 | +uv sync --dev |
| 105 | + |
| 106 | +# Run tests (unit + integration with real dlt+DuckDB) |
| 107 | +uv run pytest tests/ -v |
| 108 | + |
| 109 | +# Lint |
| 110 | +uv run ruff check src/ |
| 111 | +``` |
| 112 | + |
| 113 | +## License |
| 114 | + |
| 115 | +MIT |