content/en/data_observability/jobs_monitoring/glue.md
28 additions & 0 deletions
@@ -108,6 +108,34 @@ This helps ensure the logs are searchable and available under the {{< ui >}}Glue
Enable the [Glue Integration][4] tile for Glue metrics collection.
Metrics should be available under the {{< ui >}}Glue{{< /ui >}} job tab in **Data Observability: Jobs Monitoring**.
## (Optional) Enable dataset lineage
Glue jobs that run with the Spark engine can emit OpenLineage events directly to Datadog. This provides dataset-level lineage, showing which datasets your job reads and writes.
**Note**: AWS Glue includes the Spark OpenLineage connector in its default class path. To use a more recent version, add the connector JAR through the `--extra-jars` Glue job parameter and set `--user-jars-first` to `true` so that your JAR overrides the bundled version. For example: `--extra-jars s3://<YOUR_BUCKET>/openlineage-spark-<VERSION>.jar` and `--user-jars-first true`.
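
As a minimal sketch, the two job parameters from the note could be assembled in Python before being handed to Glue. The helper name `openlineage_jar_arguments` and its `bucket` and `version` arguments are hypothetical, introduced only for illustration; the JAR file name follows the pattern shown in the note.

```python
from typing import Dict


def openlineage_jar_arguments(bucket: str, version: str) -> Dict[str, str]:
    """Build the Glue job parameters that load a user-supplied OpenLineage JAR.

    `bucket` and `version` are illustrative placeholders for <YOUR_BUCKET>
    and <VERSION>; this helper is not part of any Glue or Datadog API.
    """
    jar_path = f"s3://{bucket}/openlineage-spark-{version}.jar"
    return {
        # Adds the connector JAR to the job's class path.
        "--extra-jars": jar_path,
        # Gives user JARs priority over the version bundled with Glue.
        "--user-jars-first": "true",
    }
```

If you manage jobs with boto3, a mapping like this can typically be supplied as `DefaultArguments` when creating a job or as `Arguments` when starting a run.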
### Configure the SparkSession
In your Glue job script, configure the `SparkSession` with the following settings:
Replace `<DD_DATA_OBSERVABILITY_INTAKE>` with `https://data-obs-intake.`{{< region-param key="dd_site" code="true" >}} and `<DATADOG_API_KEY>` with your Datadog API key. `spark.glue.JOB_RUN_ID` is the Spark configuration property that AWS Glue sets automatically to the current job run ID; use it verbatim.
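
To illustrate how the placeholders above fit together, here is a hedged sketch that builds the intake URL from a Datadog site and collects HTTP transport settings into a dictionary. It assumes the standard upstream OpenLineage Spark property names (`spark.extraListeners`, `spark.openlineage.transport.*`); the exact set of settings this guide prescribes, including how `spark.glue.JOB_RUN_ID` is referenced, is not shown here and may differ. The helper names `datadog_openlineage_conf` and `apply_conf` are hypothetical.

```python
from typing import Dict

# Listener class from the upstream OpenLineage Spark integration.
OPENLINEAGE_LISTENER = "io.openlineage.spark.agent.OpenLineageSparkListener"


def datadog_openlineage_conf(dd_site: str, api_key: str) -> Dict[str, str]:
    """Collect OpenLineage HTTP transport settings for a given Datadog site.

    Assumption: these are upstream OpenLineage property names, not
    necessarily the complete list required by Datadog.
    """
    intake_url = f"https://data-obs-intake.{dd_site}"
    return {
        "spark.extraListeners": OPENLINEAGE_LISTENER,
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": intake_url,
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Apply each setting to a SparkSession builder via builder.config()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# In a Glue job script this might be used as:
#   from pyspark.sql import SparkSession
#   conf = datadog_openlineage_conf("<YOUR_DD_SITE>", "<DATADOG_API_KEY>")
#   spark = apply_conf(SparkSession.builder, conf).getOrCreate()
```

Keeping the settings in a plain dictionary makes them easy to inspect or log (with the API key redacted) before the session is created.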
### Validate
After enabling OpenLineage, open a job run in [Data Observability: Jobs Monitoring][6]. The flame graph should show additional spans such as `spark.application` or `spark.sql_job`. The payloads of these spans can help you debug dataset extraction.