Commit f96b917

Expand README with detailed data emission docs
Clarify what gets emitted as inputs/outputs, how dlt's reshaping (flattening, renaming, type coercion) affects lineage, why there's no column-level lineage, and what facets are included on each event.
1 parent 7606c76 commit f96b917

1 file changed

Lines changed: 39 additions & 1 deletion

File tree

README.md

Original file line number | Diff line number | Diff line change
@@ -80,7 +80,45 @@ This package implements dlt's `SupportsTracking` protocol and registers via `dlt
80 80
| Any step failure | RunEvent(FAIL) | Error message and stack trace |
81 81
| Pipeline end (no load) | RunEvent(COMPLETE) | Fallback terminal event from pipeline schema |
82 82

83-
Output dataset namespaces are derived from the destination (e.g., `duckdb://local`, `postgres://host:5432`), and dataset names are qualified with the dataset name (e.g., `raw_data.users`).
83+
## What Gets Emitted
84+
85+
### Input datasets
86+
87+
Input datasets are derived from dlt's `ExtractInfo.metrics`, which records the tables and resources that were extracted. The names are the ones dlt assigns after extraction (i.e., the resource names or the table names from `table_metrics`). Internal dlt tables (`_dlt_loads`, `_dlt_version`, etc.) are filtered out.
88+
89+
Input datasets appear in both the RUNNING event (after extract) and the COMPLETE event (after load), so lineage consumers can see the full input-to-output picture on the terminal event.
90+
91+
Inputs are namespaced with the configured namespace (defaults to `"dlt"`), not the destination, since they represent the logical source data before loading.
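The input-side rules above can be sketched in a few lines. This is a minimal illustration, not the package's actual code: `InputDataset`, `build_input_datasets`, and the `_dlt_` prefix check are hypothetical names and assumptions chosen for the example, and the list of extracted tables is passed in directly rather than read from `ExtractInfo.metrics`.

```python
from dataclasses import dataclass

# Assumption: dlt's bookkeeping tables can be recognized by a "_dlt_" prefix.
INTERNAL_PREFIX = "_dlt_"


@dataclass(frozen=True)
class InputDataset:
    """Hypothetical stand-in for an OpenLineage input dataset."""
    namespace: str
    name: str


def build_input_datasets(table_names, namespace="dlt"):
    """Map extracted table names to input datasets, skipping internal tables."""
    return [
        InputDataset(namespace=namespace, name=name)
        for name in sorted(set(table_names))
        if not name.startswith(INTERNAL_PREFIX)
    ]


# Table names as they might be collected from the extract step.
extracted = ["users", "orders", "_dlt_pipeline_state"]
inputs = build_input_datasets(extracted)
for dataset in inputs:
    print(dataset.namespace, dataset.name)
```

The namespace stays the logical one (`"dlt"` by default) regardless of the destination, which matches the description above of inputs as source data before loading.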
92+
93+
### Output datasets
94+
95+
Output datasets are built from `LoadInfo.load_packages`, which lists every completed load job with its target table name. Each output dataset includes:
96+
97+
- **Namespace**: derived from the destination type and fingerprint, e.g. `duckdb://local`, `postgres://myhost.com`, `bigquery://project-id`
98+
- **Name**: qualified as `{dataset_name}.{table_name}`, e.g. `raw_data.users`, `raw_data.orders`
99+
- **Schema facet**: column names and dlt data types (e.g. `bigint`, `text`, `decimal`) from the pipeline's default schema
100+
- **Row counts**: per-table row counts from the normalize step's `NormalizeInfo.row_counts`
101+
102+
If `LoadInfo` isn't available (e.g. extract-only runs), output datasets fall back to the pipeline's `default_schema.tables`.
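As a rough sketch of how the output fields fit together, the function below assembles one output dataset as a plain dictionary. The function name, the dictionary keys, and the shape of the `columns` and `row_counts` arguments are assumptions made for illustration; the real package builds OpenLineage objects from `LoadInfo`, the pipeline schema, and `NormalizeInfo`.

```python
def build_output_dataset(destination_namespace, dataset_name, table_name,
                         columns, row_counts):
    """Assemble a hypothetical output dataset description for one table.

    columns:    mapping of column name -> dlt data type (from the schema)
    row_counts: mapping of table name -> row count (from the normalize step)
    """
    return {
        "namespace": destination_namespace,
        # Names are qualified as {dataset_name}.{table_name}.
        "name": f"{dataset_name}.{table_name}",
        "schema": [
            {"name": column, "type": dlt_type}
            for column, dlt_type in columns.items()
        ],
        # None when the normalize step reported no count for this table.
        "row_count": row_counts.get(table_name),
    }


row_counts = {"users": 3}
users = build_output_dataset(
    "duckdb://local",
    "raw_data",
    "users",
    {"id": "bigint", "name": "text"},
    row_counts,
)
print(users["namespace"], users["name"], users["row_count"])
```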
103+
104+
### Important: dlt reshaping and what that means for lineage
105+
106+
dlt aggressively reshapes data between extract and load. This affects what shows up in lineage:
107+
108+
- **Nested data is flattened**: dlt flattens nested JSON objects into `__`-separated columns on the parent table and unnests arrays into separate child tables. A resource `users` with a nested `addresses` array becomes two destination tables: `users` and `users__addresses`. Both appear as output datasets. The input side just shows `users` (the original resource name).
109+
- **Column names are normalized**: dlt converts column names to snake_case and applies naming conventions. The schema facet reflects the normalized names as they exist in the destination, not the original source field names.
110+
- **Column types are dlt types**: schema facets use dlt's internal type system (`bigint`, `text`, `double`, `complex`, `date`, `timestamp`, `wei`, etc.), not the destination's native SQL types.
111+
- **No column-level lineage**: because dlt's reshaping (flattening, renaming, type coercion) isn't tracked as a transformation DAG, we emit table-level lineage only. There is no `ColumnLineageDatasetFacet`. This is an honest representation: we can tell you that the `users` resource produced the `raw_data.users` and `raw_data.users__addresses` tables, but we can't trace individual columns through dlt's normalizer.
112+
- **Internal tables are excluded**: dlt creates `_dlt_loads`, `_dlt_pipeline_state`, `_dlt_version`, and similar bookkeeping tables. These are filtered from both input and output datasets.
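To make the one-input, several-outputs shape concrete, here is a toy unnesting routine. It is not dlt's normalizer: `unnest` is a hypothetical helper that only splits lists into child tables named `{parent}__{field}`, and it omits the bookkeeping columns (such as parent/child keys) that dlt adds to real child tables.

```python
def unnest(table, record, tables=None, sep="__"):
    """Split one record into rows grouped by table name (toy version)."""
    tables = {} if tables is None else tables
    row = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Each list element becomes a row in a child table.
            for child in value:
                child_record = child if isinstance(child, dict) else {"value": child}
                unnest(f"{table}{sep}{key}", child_record, tables, sep)
        else:
            row[key] = value
    tables.setdefault(table, []).append(row)
    return tables


user = {
    "id": 1,
    "name": "Ada",
    "addresses": [{"zip": "10115"}, {"zip": "80331"}],
}
tables = unnest("users", user)
for name in sorted(tables):
    print(name, tables[name])
```

A single `users` record yields rows for two tables, which is why the lineage shows one input (`users`) and two outputs (`raw_data.users` and `raw_data.users__addresses`).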
113+
114+
### Run and job facets
115+
116+
Every event includes:
117+
118+
- **`jobType`**: `{processingType: "BATCH", integration: "DLT", jobType: "PIPELINE"}`
119+
- **`processing_engine`**: dlt version and adapter version
120+
- **`dlt_execution`** (custom facet, on COMPLETE): current step, destination type/name, dataset name, total/failed job counts
121+
- **`errorMessage`** (on FAIL): error message, programming language, stack trace (the string representation dlt provides)
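The facet list above could translate into payloads along these lines. This is a sketch with plain dictionaries rather than the OpenLineage client classes; `build_facets` and the keys inside `dlt_execution` are hypothetical, while the `jobType`, `processing_engine`, and `errorMessage` field names follow the standard OpenLineage facet definitions as far as I know them.

```python
def build_facets(event_type, dlt_version, adapter_version,
                 execution=None, error_message=None, stack_trace=None):
    """Return (job_facets, run_facets) for one event as plain dictionaries."""
    job_facets = {
        "jobType": {
            "processingType": "BATCH",
            "integration": "DLT",
            "jobType": "PIPELINE",
        }
    }
    run_facets = {
        "processing_engine": {
            "name": "dlt",
            "version": dlt_version,
            "openlineageAdapterVersion": adapter_version,
        }
    }
    if event_type == "COMPLETE" and execution is not None:
        # Custom facet; the keys are illustrative, not the package's exact schema.
        run_facets["dlt_execution"] = dict(execution)
    if event_type == "FAIL" and error_message is not None:
        run_facets["errorMessage"] = {
            "message": error_message,
            "programmingLanguage": "python",
            "stackTrace": stack_trace,
        }
    return job_facets, run_facets


# The version strings below are placeholders for the example.
job_facets, complete_facets = build_facets(
    "COMPLETE", "1.0.0", "0.1.0",
    execution={"current_step": "load", "destination_type": "duckdb",
               "dataset_name": "raw_data", "total_jobs": 2, "failed_jobs": 0},
)
_, fail_facets = build_facets(
    "FAIL", "1.0.0", "0.1.0",
    error_message="load failed", stack_trace="Traceback (most recent call last): ...",
)
print(sorted(complete_facets), sorted(fail_facets))
```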
84 122

85 123
## Testing with Marquez
86 124
