[DOP-33903] Document job dependencies

dolfinus · dolfinus · commit 4b03f852a6db · 2026-03-18T13:08:21.000+03:00
diff --git a/README.rst b/README.rst
@@ -106,6 +106,12 @@ Run-level lineage graph
 .. image:: docs/entities/run_lineage.png
     :alt: Job-level lineage graph
 
+Hierarchy graph
+~~~~~~~~~~~~~~~
+
+.. image:: docs/integrations/airflow/job_hierarchy.png
+    :alt: Job hierarchy
+
 Datasets
 ~~~~~~~~
 
@@ -143,13 +149,13 @@ Hive query
     :alt: Hive query details
 
 Airflow DagRun
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~
 
 .. image:: docs/integrations/airflow/dag_run_details.png
     :alt: Airflow DagRun details
 
 Airflow TaskInstance
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~
 
 .. image:: docs/integrations/airflow/task_run_details.png
     :alt: Airflow TaskInstance details
diff --git a/docs/comparison.rst b/docs/comparison.rst
@@ -13,7 +13,7 @@ DataHub cons
   To extract and draw lineage between tables, it is required to *both* connect ingestor to all databases, and to enable integration with ETL (Spark, Airflow, etc).
 
   There is an option ``spark.datahub.metadata.dataset.materialize=true``, but in this case DataHub creates datasets without schema,
-  so ingestors are still required.
+  so ingestors are still required for column lineage.
 
 * DataHub Spark agent doesn't properly work if *Platform Instances* are enabled in DataHub.
   Platform Instance is an additional hierarchy level for databases,
@@ -23,6 +23,8 @@ DataHub cons
 
   Data.Rentgen has configurable ``granularity`` option while rendering the lineage graph.
 
+* No support for Job → Job hierarchy like Airflow Task → Spark application, or Airflow Task → Airflow Task dependencies.
+
 * High CPU and memory consumption.
 
 DataHub pros
@@ -41,6 +43,7 @@ OpenMetadata cons
 
 * Database ingestors are required to build a lineage graph, just like DataHub.
 * OpenLineage → OpenMetadata integration produces no lineage, for some unknown reason.
+* No support for Job → Job hierarchy like Airflow Task → Spark application, or Airflow Task → Airflow Task dependencies.
 * High CPU and memory consumption.
 
 OpenMetadata pros
@@ -64,7 +67,7 @@ Marquez cons
 
 * Severe performance issues while consuming lineage events.
 * No support for dataset symlinks, e.g. HDFS location → Hive table.
-* No support for parent runs, e.g. Airflow task → Spark application.
+* No support for Job → Job hierarchy like Airflow Task → Spark application, or Airflow Task → Airflow Task dependencies.
 * No releases since 2024.
 
 Marquez pros
diff --git a/docs/entities/index.rst b/docs/entities/index.rst
@@ -321,6 +321,20 @@ It contains following fields:
 
 .. image:: parent_relation.png
 
+Dependency relation
+~~~~~~~~~~~~~~~~~~~
+
+A relation between two jobs or two runs that describes the order in which ETL jobs are executed.
+For example, one Airflow Task can depend on another Airflow Task.
+
+It contains the following fields:
+
+- ``from: Job | Run`` - entity that the dependent job/run waits for before it starts.
+- ``to: Job | Run`` - entity that waits for ``from``.
+- ``type: str`` - type of dependency, an arbitrary string provided by the integration, usually something like ``DIRECT_DEPENDENCY`` or ``INDIRECT_DEPENDENCY``.
+
+.. image:: job_dependencies.png
+
 Input relation
 ~~~~~~~~~~~~~~
 
diff --git a/docs/entities/job_dependencies.png b/docs/entities/job_dependencies.png
diff --git a/docs/index.rst b/docs/index.rst
@@ -86,6 +86,12 @@ Run-level lineage graph
 .. image:: entities/run_lineage.png
     :alt: Job-level lineage graph
 
+Hierarchy graph
+~~~~~~~~~~~~~~~
+
+.. image:: integrations/airflow/job_hierarchy.png
+    :alt: Job hierarchy
+
 Datasets
 ~~~~~~~~
 
@@ -123,13 +129,13 @@ Hive query
     :alt: Hive query details
 
 Airflow DagRun
-~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~
 
 .. image:: integrations/airflow/dag_run_details.png
     :alt: Airflow DagRun details
 
 Airflow TaskInstance
-~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~
 
 .. image:: integrations/airflow/task_run_details.png
     :alt: Airflow TaskInstance details
diff --git a/docs/integrations/airflow/index.rst b/docs/integrations/airflow/index.rst
@@ -197,6 +197,11 @@ Job level lineage
 
 .. image:: ./job_lineage.png
 
+Job dependencies
+~~~~~~~~~~~~~~~~
+
+.. image:: ./job_hierarchy.png
+
 Extra configuration
 -------------------
 
diff --git a/docs/integrations/airflow/job_hierarchy.png b/docs/integrations/airflow/job_hierarchy.png
diff --git a/docs/integrations/dbt/index.rst b/docs/integrations/dbt/index.rst
@@ -166,3 +166,66 @@ It is possible to provide custom tags via model config:
         +tags:
             - environment:production
             - layer:bronze
+
+Binding Airflow Task with dbt run
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If an OpenLineage event contains the `Parent Run facet <https://openlineage.io/docs/spec/facets/run-facets/parent_run/>`_,
+Data.Rentgen can use this information to bind a dbt run to the run that triggered it, e.g. an Airflow task:
+
+.. image:: ../airflow/job_hierarchy.png
+
+To populate this facet, you need to:
+
+* Set up OpenLineage integration for dbt
+* Set up :ref:`OpenLineage integration for Airflow <overview-setup-airflow>`
+* Pass parent Run info from Airflow to dbt using `Airflow macros <https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html#lineage-job-run-macros>`_:
+
+  .. tabs::
+
+    .. code-tab:: py BashOperator
+
+        from airflow.providers.standard.operators.bash import BashOperator
+
+        task = BashOperator(
+            task_id="dbt_run_task",
+            cwd="/path/to/project",
+            bash_command="dbt-ol run",
+            append_env=True,
+            env={
+                # Pass parent Run info from Airflow to dbt
+                "OPENLINEAGE_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_parent_id(task_instance) }}",
+                # For apache-airflow-providers-openlineage 2.4.0 or above
+                "OPENLINEAGE_ROOT_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_root_parent_id(task_instance) }}",
+            }
+        )
+
+    .. code-tab:: py SSHOperator
+
+        from airflow.providers.ssh.operators.ssh import SSHOperator
+
+        task = SSHOperator(
+            task_id="dbt_run_task",
+            ssh_conn_id="some_host",
+            command="cd /path/to/project && dbt-ol run",
+            environment={
+                "OPENLINEAGE_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_parent_id(task_instance) }}",
+                # For apache-airflow-providers-openlineage 2.4.0 or above
+                "OPENLINEAGE_ROOT_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_root_parent_id(task_instance) }}",
+            }
+        )
+
+    .. code-tab:: py KubernetesPodOperator
+
+        from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
+
+        task = KubernetesPodOperator(
+            task_id="dbt_run_task",
+            cmds=["bash", "-cx"],
+            arguments=["cd /path/to/project && dbt-ol run"],
+            env_vars={
+                "OPENLINEAGE_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_parent_id(task_instance) }}",
+                # For apache-airflow-providers-openlineage 2.4.0 or above
+                "OPENLINEAGE_ROOT_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_root_parent_id(task_instance) }}",
+            }
+        )
diff --git a/docs/integrations/spark/index.rst b/docs/integrations/spark/index.rst
@@ -304,3 +304,67 @@ It is possible to provide custom job tags using OpenLineage configuration:
     :caption: etl.py
 
     SparkSession.builder.config("spark.openlineage.job.tags", "environment:production;layer:bronze")
+
+Binding Airflow Task with Spark application
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If an OpenLineage event contains the `Parent Run facet <https://openlineage.io/docs/spec/facets/run-facets/parent_run/>`_,
+Data.Rentgen can use this information to bind a Spark application to the run that triggered it, e.g. an Airflow task:
+
+.. image:: ../airflow/job_hierarchy.png
+
+To populate this facet, you need to:
+
+* Set up OpenLineage integration for Spark
+* Set up :ref:`OpenLineage integration for Airflow <overview-setup-airflow>`
+* `Pass parent Run info from Airflow to Spark <https://openlineage.io/docs/integrations/spark/configuration/airflow#preserving-job-hierarchy>`_:
+
+  .. code-block:: python
+    :caption: dag.py
+
+    def my_etl(
+        parent_job_namespace: str,
+        parent_job_name: str,
+        parent_run_id: str,
+        root_job_namespace: str,
+        root_job_name: str,
+        root_run_id: str,
+    ):
+        spark = (
+            SparkSession.builder
+            # install OpenLineage integration (see above)
+            # Pass parent Run info from Airflow to Spark
+            .config("spark.openlineage.parentJobNamespace", parent_job_namespace)
+            .config("spark.openlineage.parentJobName", parent_job_name)
+            .config("spark.openlineage.parentRunId", parent_run_id)
+            .config("spark.openlineage.rootJobNamespace", root_job_namespace)
+            .config("spark.openlineage.rootJobName", root_job_name)
+            .config("spark.openlineage.rootRunId", root_run_id)
+            .getOrCreate()
+        )
+
+        with spark:
+            ...  # actual ETL code
+
+
+    from airflow.providers.standard.operators.python import PythonOperator
+
+    task = PythonOperator(
+        task_id="spark_etl",
+        python_callable=my_etl,
+        # Using Jinja templates to pass Airflow macros to Python function
+        op_kwargs={
+            "parent_job_namespace": "{{ macros.OpenLineageProviderPlugin.lineage_job_namespace() }}",
+            "parent_job_name": "{{ macros.OpenLineageProviderPlugin.lineage_job_name(task_instance) }}",
+            "parent_run_id": "{{ macros.OpenLineageProviderPlugin.lineage_run_id(task_instance) }}",
+            # For apache-airflow-providers-openlineage 2.4.0 or above
+            "root_job_namespace": "{{ macros.OpenLineageProviderPlugin.lineage_root_job_namespace(task_instance) }}",
+            "root_job_name": "{{ macros.OpenLineageProviderPlugin.lineage_root_job_name(task_instance) }}",
+            "root_run_id": "{{ macros.OpenLineageProviderPlugin.lineage_root_run_id(task_instance) }}",
+        },
+    )
+
+  The exact way to pass Airflow macros into the SparkSession config depends on the Airflow operator used:
+    * PythonOperator - via kwargs & `Airflow macros <https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/macros.html#lineage-job-run-macros>`_
+    * BashOperator, SSHOperator, KubernetesPodOperator - via environment variables & Airflow macros
+    * SparkSubmitOperator - via `spark_inject_parent_job_info=true in airflow.cfg <https://openlineage.io/docs/integrations/spark/configuration/airflow#automatic-injection>`_