Commit 170cca6

Add OpenLineage integration to AWS Glue Jobs Monitoring docs (#37632)

* Add OpenLineage integration section to AWS Glue Jobs Monitoring docs

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 05cc88a commit 170cca6

1 file changed

Lines changed: 28 additions & 0 deletions

content/en/data_observability/jobs_monitoring/glue.md

@@ -108,6 +108,34 @@ This helps ensure the logs are searchable and available under the {{< ui >}}Glue
Enable the [Glue Integration][4] tile for Glue metrics collection.
Metrics should be available under the {{< ui >}}Glue{{< /ui >}} job tab in **Data Observability: Jobs Monitoring**.

## (Optional) Enable dataset lineage

Glue jobs that run with the Spark engine can emit OpenLineage events directly to Datadog. This provides dataset-level lineage, showing which datasets your job reads and writes.

**Note**: AWS Glue includes the Spark OpenLineage connector in its default class path. To use a more recent version, add the connector JAR manually through the `--extra-jars` Glue job parameter and set `--user-jars-first` to `true` to override the bundled version. For example: `--extra-jars s3://<YOUR_BUCKET>/openlineage-spark-<VERSION>.jar` and `--user-jars-first true`.

### Configure the SparkSession

In your Glue job script, configure the `SparkSession` with the following settings:

```python
spark = SparkSession.builder \
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .config("spark.openlineage.transport.type", "http") \
    .config("spark.openlineage.transport.url", "<DD_DATA_OBSERVABILITY_INTAKE>") \
    .config("spark.openlineage.transport.auth.type", "api_key") \
    .config("spark.openlineage.transport.auth.apiKey", "<DATADOG_API_KEY>") \
    .config("spark.redaction.regex", "(?i)secret|password|token|access[.]key|apikey") \
    .config("spark.openlineage.capturedProperties", "spark.glue.JOB_RUN_ID") \
    .getOrCreate()
```

Replace `<DD_DATA_OBSERVABILITY_INTAKE>` with `https://data-obs-intake.`{{< region-param key="dd_site" code="true" >}}. Replace `<DATADOG_API_KEY>` with your Datadog API key. `spark.glue.JOB_RUN_ID` is the Spark configuration property that AWS Glue sets automatically to the current job run ID; use it verbatim.

### Validate

After enabling OpenLineage, open a job run in [Data Observability: Jobs Monitoring][6]. In the flame graph, additional spans such as `spark.application` or `spark.sql_job` should appear. The payloads of these spans should be helpful when debugging dataset extraction.

## Next steps

The crawler runs every few minutes.

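As a reviewer-style illustration, not part of the commit: the settings added under "Configure the SparkSession" could be assembled by a small helper, so that the intake URL is derived from the Datadog site and the API key is read from the environment instead of being hard-coded. This is a minimal sketch under assumptions. The helper names `build_openlineage_conf` and `apply_conf`, and the environment variable names `DD_SITE` and `DD_API_KEY`, are hypothetical and do not appear in the documentation being changed.

```python
import os


def build_openlineage_conf(dd_site: str, api_key: str) -> dict:
    """Return the Spark settings from the diff as a plain dict (hypothetical helper)."""
    if not dd_site or not api_key:
        raise ValueError("dd_site and api_key must both be non-empty")
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        # Intake URL pattern taken from the docs text: https://data-obs-intake.<dd_site>
        "spark.openlineage.transport.url": f"https://data-obs-intake.{dd_site}",
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
        "spark.redaction.regex": "(?i)secret|password|token|access[.]key|apikey",
        # Property name is used verbatim, as the docs instruct.
        "spark.openlineage.capturedProperties": "spark.glue.JOB_RUN_ID",
    }


def apply_conf(builder, conf: dict):
    """Chain .config(key, value) calls on a SparkSession.builder-like object."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Example values only; DD_SITE and DD_API_KEY are assumed variable names.
conf = build_openlineage_conf(
    dd_site=os.environ.get("DD_SITE") or "datadoghq.com",
    api_key=os.environ.get("DD_API_KEY") or "<DATADOG_API_KEY>",
)

# In a Glue job script (requires pyspark, so left commented out here):
# from pyspark.sql import SparkSession
# spark = apply_conf(SparkSession.builder, conf).getOrCreate()
```

Keeping the settings in a dict makes them easy to log (with the API key masked) or unit-test before a job run, which may help when the expected `spark.application` spans do not appear.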