docs/comparison.rst
+5 -2 (5 additions & 2 deletions)
@@ -13,7 +13,7 @@ DataHub cons
 To extract and draw lineage between tables, it is required to *both* connect ingestor to all databases, and to enable integration with ETL (Spark, Airflow, etc).

 There is an option ``spark.datahub.metadata.dataset.materialize=true``, but in this case DataHub creates datasets without schema,
-so ingestors are still required.
+so ingestors are still required for column lineage.

 * DataHub Spark agent doesn't properly work if *Platform Instances* are enabled in DataHub.
 Platform Instance is an additional hierarchy level for databases,
@@ -23,6 +23,8 @@ DataHub cons

 Data.Rentgen has configurable ``granularity`` option while rendering the lineage graph.

+* No support for Job → Job hierarchy like Airflow Task → Spark application, or Airflow Task → Airflow Task dependencies.
+
 * High CPU and memory consumption.

 DataHub pros
@@ -41,6 +43,7 @@ OpenMetadata cons

 * Database ingestors are required to build a lineage graph, just like DataHub.
 * OpenLineage → OpenMetadata integration produces no lineage, for some unknown reason.
+* No support for Job → Job hierarchy like Airflow Task → Spark application, or Airflow Task → Airflow Task dependencies.
 * High CPU and memory consumption.

 OpenMetadata pros
@@ -64,7 +67,7 @@ Marquez cons

 * Severe performance issues while consuming lineage events.
 * No support for dataset symlinks, e.g. HDFS location → Hive table.
-* No support for parent runs, e.g. Airflow task → Spark application.
+* No support for Job → Job hierarchy like Airflow Task → Spark application, or Airflow Task → Airflow Task dependencies.
docs/entities/index.rst
+14 (14 additions & 0 deletions)
@@ -321,6 +321,20 @@ It contains following fields:

 .. image:: parent_relation.png

+Dependency relation
+~~~~~~~~~~~~~~~~~~~
+
+Relation between job/job or run/run which shows the order of executing ETL jobs.
+For example, one Airflow Task can depend on another Airflow Task.
+
+It contains following fields:
+
+- ``from: Job | Run`` - entity which should be waited before current job/run will be started.
+- ``to: Job | Run`` - entity which waits.
+- ``type: str`` - type of dependency, any arbitrary string provided by integration, usually something like ``DIRECT_DEPENDENCY``, ``INDIRECT_DEPENDENCY``.
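
Reviewer note: the three fields described above can be pictured with a small Python sketch. This is only an illustration under assumptions: the class name ``DependencyRelation``, the helper ``upstream_of`` and the job names are hypothetical and are not taken from Data.Rentgen's code; entities are reduced to plain strings instead of Job/Run objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyRelation:
    """Illustrative model only; not Data.Rentgen's actual class."""

    from_: str  # entity that must finish first ("from" is a Python keyword)
    to: str  # entity that waits
    type: str  # arbitrary string supplied by the integration


def upstream_of(entity: str, relations: list[DependencyRelation]) -> list[str]:
    """Return the entities that `entity` directly waits for."""
    return [r.from_ for r in relations if r.to == entity]


# Hypothetical Airflow tasks: extract -> transform -> load
relations = [
    DependencyRelation("my_dag.extract", "my_dag.transform", "DIRECT_DEPENDENCY"),
    DependencyRelation("my_dag.transform", "my_dag.load", "DIRECT_DEPENDENCY"),
]

print(upstream_of("my_dag.load", relations))
```

Reading the relation as "``to`` waits for ``from``" keeps the direction consistent with the field descriptions in the hunk.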
docs/integrations/dbt/index.rst
+63 (63 additions & 0 deletions)
@@ -166,3 +166,66 @@ It is possible to provide custom tags via model config:
 +tags:
 - environment:production
 - layer:bronze
+
+Binding Airflow Task with Spark application
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If OpenLineage event contains `Parent Run facet <https://openlineage.io/docs/spec/facets/run-facets/parent_run/>`_,
+DataRentgen can use this information to bind dbt run to the run it was triggered by, e.g. Airflow task:
+
+.. image:: ../airflow/job_hierarchy.png
+
+To fill up this facet, it is required to:
+
+* Setup OpenLineage integration for dbt
+* Setup :ref:`OpenLineage integration for Airflow <overview-setup-airflow>`
+* Pass parent Run info from Airflow to dbt by using `Airflow macros <https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html#lineage-job-run-macros>`_:
+
+.. tabs::
+
+.. code-tab:: py BashOperator
+
+from airflow.providers.standard.operators.bash import BashOperator
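
Reviewer note: the rest of the ``BashOperator`` example is not visible in this view, so the following is a separate, minimal sketch of the idea rather than the PR's actual snippet. It assumes that the Airflow OpenLineage provider exposes a ``lineage_parent_id(task_instance)`` macro rendering to ``<namespace>/<job_name>/<run_id>``, and that the dbt OpenLineage wrapper reads the parent run from an ``OPENLINEAGE_PARENT_ID`` environment variable; both should be checked against the installed versions. The helper ``dbt_env`` and the sample values are hypothetical.

```python
# Jinja template that Airflow would render at task run time (assumed macro name).
PARENT_ID_TEMPLATE = (
    "{{ macros.OpenLineageProviderPlugin.lineage_parent_id(task_instance) }}"
)


def dbt_env(parent_id: str) -> dict[str, str]:
    """Build the environment for a dbt process from a rendered parent ID.

    The "<namespace>/<job_name>/<run_id>" format is an assumption.
    """
    parts = parent_id.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected parent id format: {parent_id!r}")
    return {"OPENLINEAGE_PARENT_ID": parent_id}


# Hypothetical rendered value, as Airflow might produce it for one task run.
env = dbt_env("airflow_ns/my_dag.run_dbt/0190d1f2-0000-7000-8000-000000000000")
print(env)
```

In a DAG, the template string itself would typically be passed through a templated operator argument (for ``BashOperator``, the ``env`` mapping) so that Airflow renders it before the dbt command starts.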
* SparkSubmitOperator - via `spark_inject_parent_job_info=true in airflow.conf <https://openlineage.io/docs/integrations/spark/configuration/airflow#automatic-injection>`_