mddocs/docs/en/entities/index.md
+67 -39 (lines changed: 67 additions & 39 deletions)
Original file line number
Diff line number
Diff line change
@@ -1,9 +1,6 @@
1
-
(entities)=
1
+
# Entities { #entities }
2
2
3
-
# Entities
4
-
5
-
```{eval-rst}
6
-
.. plantuml::
3
+
```plantuml
7
4
8
5
@startuml
9
6
title Entities diagram
@@ -41,6 +38,52 @@
41
38
@enduml
42
39
```
43
40
41
+
```mermaid
42
+
---
43
+
title: Entities diagram
44
+
---
45
+
46
+
flowchart LR
47
+
subgraph locations1 [locations 1]
48
+
addresses1@{shape: docs, label: "addresses"}
49
+
end
50
+
subgraph locations2 [locations 2]
51
+
addresses2@{shape: docs, label: "addresses"}
52
+
end
53
+
subgraph locations3 [locations 3]
54
+
addresses3@{shape: docs, label: "addresses"}
55
+
end
56
+
dataset1[(dataset 1)]
57
+
dataset2[(dataset 2)]
58
+
operations@{shape: procs}
59
+
runs@{shape: procs, fill: yellow}
60
+
61
+
style runs fill:lightyellow
62
+
job
63
+
style job fill:lightblue
64
+
user@{shape: stadium}
65
+
style user fill:lightblue
66
+
67
+
dataset1 -- SYMLINK ---> dataset2
68
+
dataset2 -- SYMLINK --> dataset1
69
+
70
+
dataset2 -- located in --> locations2
71
+
72
+
dataset1 -. INPUT .-> operations
73
+
operations -. OUTPUT .-> dataset1
74
+
dataset1 -- located in --> locations1
75
+
76
+
operations -- PARENT --> runs
77
+
78
+
runs -- PARENT ----> job
79
+
runs -- started by ----> user
80
+
81
+
job -- located in ---> locations3
82
+
83
+
runs -- PARENT --> runs
84
+
85
+
```
86
+
44
87
## Nodes
45
88
46
89
Nodes are independent entities that describe some real-world entity, such as a table, an ETL job, an ETL job run, and so on.
@@ -74,8 +117,7 @@ It contains following fields:
74
117
75
118
- `url: str` - alternative address, in URL form.
76
119
77
-
```{image} location_list.png
78
-
```
120
+

79
121
80
122
#### Location addresses
81
123
@@ -115,8 +157,7 @@ That's why the information about datasets is very limited:
115
157
- `name: str` - qualified name of Dataset, like `mydb.myschema.mytable` or `/app/warehouse/hive/managed/myschema.df/mytable`
116
158
- `schema: Schema | None` - schema of dataset.
117
159
118
-
```{image} dataset_list.png
119
-
```
160
+

120
161
121
162
#### Dataset schema
122
163
@@ -146,8 +187,7 @@ It contains following fields:
146
187
- `EXACT_MATCH` - returned if all interactions with this dataset used only one schema.
147
188
- `LATEST_KNOWN` - returned if there are multiple interactions with this dataset that used different schemas. In this case, the schema of the most recent interaction is returned.
148
189
149
-
```{image} dataset_schema.png
150
-
```
190
+

151
191
152
192
### Job
153
193
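The `EXACT_MATCH` / `LATEST_KNOWN` rule in the hunk above can be illustrated with a minimal sketch. The names `SchemaRelevance`, `Interaction`, `schema_id`, `created_at`, and `resolve_schema` are hypothetical and chosen only for this example; the actual field and class names are not shown in this diff.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class SchemaRelevance(Enum):
    # Hypothetical enum name; the two values come from the documentation.
    EXACT_MATCH = "EXACT_MATCH"
    LATEST_KNOWN = "LATEST_KNOWN"


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record: one read or write that reported a schema."""

    schema_id: str
    created_at: datetime


def resolve_schema(interactions):
    """Return (schema_id, relevance), or None if no interaction is known."""
    if not interactions:
        return None
    # The most recent interaction always supplies the returned schema.
    latest = max(interactions, key=lambda i: i.created_at)
    distinct = {i.schema_id for i in interactions}
    if len(distinct) == 1:
        relevance = SchemaRelevance.EXACT_MATCH
    else:
        relevance = SchemaRelevance.LATEST_KNOWN
    return latest.schema_id, relevance
```

When every interaction reports the same schema, the most recent schema is the only schema, so a single `max` lookup covers both cases.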
@@ -180,8 +220,7 @@ It contains following fields:
180
220
- `DBT_JOB`
181
221
- `UNKNOWN`
182
222
183
-
```{image} job_list.png
184
-
```
223
+

185
224
186
225
### User
187
226
@@ -241,17 +280,13 @@ It contains following fields:
241
280
242
281
- `persistent_log_url: str | None` - external URL where specific Run logs can be found (e.g. Spark History server, Airflow Web UI).
@@ -309,13 +343,12 @@ It contains following fields:
309
343
- `METASTORE` - from HDFS location to Hive table in metastore.
310
344
- `WAREHOUSE` - from Hive table to HDFS/S3 location.
311
345
312
-
:::{note}
313
-
Currently, OpenLineage sends only symlinks `HDFS location → Hive table` which [do not exist in the real world](https://github.com/OpenLineage/OpenLineage/issues/2718#issuecomment-2134746258).
314
-
Message consumer automatically adds a reverse symlink `Hive table → HDFS location` to simplify building lineage graph, but this is temporary solution.
315
-
:::
346
+
!!! note
316
347
317
-
```{image} dataset_symlinks.png
318
-
```
348
+
Currently, OpenLineage sends only symlinks `HDFS location → Hive table` which [do not exist in the real world](https://github.com/OpenLineage/OpenLineage/issues/2718#issuecomment-2134746258).
349
+
The message consumer automatically adds a reverse symlink `Hive table → HDFS location` to simplify building the lineage graph, but this is a temporary solution.
350
+
351
+

319
352
320
353
### Parent Relation
321
354
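The reverse-symlink behaviour described in the note above could look roughly like the sketch below. This is not the message consumer's actual code: `Symlink`, `SymlinkType`, `from_dataset`, `to_dataset`, and `with_reverse_symlinks` are hypothetical names, and the dataset names in the usage example are made up.

```python
from dataclasses import dataclass
from enum import Enum


class SymlinkType(Enum):
    METASTORE = "METASTORE"  # HDFS location -> Hive table
    WAREHOUSE = "WAREHOUSE"  # Hive table -> HDFS/S3 location


@dataclass(frozen=True)
class Symlink:
    """Hypothetical symlink record between two dataset names."""

    from_dataset: str
    to_dataset: str
    type: SymlinkType


def with_reverse_symlinks(symlinks):
    """Return the input symlinks plus a WAREHOUSE link for each METASTORE link."""
    result = set(symlinks)
    for link in symlinks:
        if link.type is SymlinkType.METASTORE:
            # Swap the endpoints so the graph can be walked in both directions.
            result.add(Symlink(link.to_dataset, link.from_dataset, SymlinkType.WAREHOUSE))
    return result
```

For example, a single `METASTORE` link from `/warehouse/myschema.db/mytable` to `myschema.mytable` yields a set of two links, one per direction. Using a set keeps the function idempotent if a reverse link already exists.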
@@ -331,8 +364,7 @@ It contains following fields:
331
364
- `from: Job | Run` - parent entity.
332
365
- `to: Run | Operation` - child entity.
333
366
334
-
```{image} parent.png
335
-
```
367
+

336
368
337
369
### Input relation
338
370
@@ -348,8 +380,7 @@ It contains following fields:
348
380
- `num_bytes: int | None` - number of bytes read from dataset. For `granularity=JOB|RUN` it is the sum of all bytes read from this dataset. For `granularity=DATASET` it is always `None`.
349
381
- `num_files: int | None` - number of files read from dataset. For `granularity=JOB|RUN` it is the sum of all files read from this dataset. For `granularity=DATASET` it is always `None`.
350
382
351
-
```{image} input.png
352
-
```
383
+

353
384
354
385
### Output relation
355
386
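The `granularity` rule for `num_bytes` and `num_files` in the hunk above can be sketched as a small aggregation. This is an illustration only: `OperationInput`, its fields, and `aggregate_inputs` are hypothetical names, and skipping `None` values while summing is an assumption rather than documented behaviour.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperationInput:
    """Hypothetical raw record: one operation reading one dataset."""

    job: str
    run: str
    dataset: str
    num_bytes: Optional[int]
    num_files: Optional[int]


def _sum_known(values):
    # Assumption: unknown (None) values are skipped; all-unknown stays None.
    known = [v for v in values if v is not None]
    return sum(known) if known else None


def aggregate_inputs(inputs, granularity):
    """Group inputs by JOB, RUN or DATASET and return {key: (num_bytes, num_files)}."""
    if granularity not in ("JOB", "RUN", "DATASET"):
        raise ValueError(f"unsupported granularity: {granularity}")
    groups = defaultdict(list)
    for item in inputs:
        if granularity == "JOB":
            key = (item.job, item.dataset)
        elif granularity == "RUN":
            key = (item.run, item.dataset)
        else:
            key = (item.dataset,)
        groups[key].append(item)
    result = {}
    for key, items in groups.items():
        if granularity == "DATASET":
            # Per the documentation, statistics are always None at this level.
            result[key] = (None, None)
        else:
            result[key] = (
                _sum_known(i.num_bytes for i in items),
                _sum_known(i.num_files for i in items),
            )
    return result
```

Two operations of the same run that read 100 and 50 bytes from one dataset would be reported as a single input with `num_bytes=150` at `RUN` granularity, and with `num_bytes=None` at `DATASET` granularity.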
@@ -381,8 +412,7 @@ It contains following fields:
381
412
382
413
- `num_files: int | None` - number of files written to dataset. For `granularity=JOB|RUN` it is the sum of all files written to this dataset.
383
414
384
-
```{image} output.png
385
-
```
415
+

386
416
387
417
### Direct Column Lineage relation
388
418
@@ -405,8 +435,7 @@ Relation Dataset columns → Dataset columns, describing how each target dataset
405
435
- `AGGREGATION_MASKING` - a masking aggregation function is applied to the column value, e.g. `SELECT count(DISTINCT source_column) AS target_column`