
Executor wrapper never transitions ingestion job to SUCCESS after subprocess exits cleanly #17397

@milad1372

Description


When running ingestion via the DataHub executor (acryl-datahub-executor v1.4.0.3), the ingestion subprocess completes successfully and exits cleanly, but the executor wrapper never detects the exit. The job remains permanently stuck in RUNNING in the DataHub UI and never transitions to SUCCESS.

This is not a pipeline issue — the data is ingested correctly. The bug is in the executor's subprocess monitoring loop.


Steps to Reproduce

  1. Run any scheduled ingestion source via the DataHub executor (confirmed with iceberg and azure_ad)
  2. Wait for the pipeline subprocess to complete
  3. Observe that the run status never changes from Running to Success in the UI

Expected Behavior

Job transitions to SUCCESS after the subprocess exits.


Actual Behavior

  • Subprocess exits cleanly (Pipeline finished successfully, pending_requests: 0, checkpoint committed)
  • Executor wrapper continues sending unchanged RUNNING heartbeats to GMS
  • GMS logs: Skipped producing MCL for ingested aspect dataHubExecutionRequestResult ... Aspect has not changed.
  • "Stale logs" warning appears ~4 minutes after completion, job never resolves

Evidence

  • Pipeline finished at 21:20:22 with 0 failures, 0 warnings, all events confirmed
  • Stale-logs warning at ~21:24:12 — 230 seconds later, still RUNNING
  • GMS logs (viewed through k9s) show no write to the dataHubExecutionRequestResult aspect after the subprocess exits
  • Reproduced across multiple sources (iceberg, azure_ad) — confirms systemic executor-level issue, not connector-specific
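As a quick check on the timing above, the gap between the pipeline finishing and the stale-logs warning can be computed from the two timestamps. This is only an arithmetic sketch: the times are copied from the evidence list and assumed to fall on the same day.

```python
from datetime import datetime

# Timestamps taken from the evidence list (assumed to be on the same day).
finished = datetime.strptime("21:20:22", "%H:%M:%S")
stale_warning = datetime.strptime("21:24:12", "%H:%M:%S")

# Seconds between pipeline completion and the stale-logs warning.
gap_seconds = int((stale_warning - finished).total_seconds())
print(gap_seconds)  # 230
```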

Environment

Field                   Value
acryl-datahub version   1.4.0.3
Executor                datahub-executor pod (Kubernetes)
Sources affected        iceberg, azure_ad (likely all sources)

Suspected Root Cause

Possibly a race condition, or a missing waitpid call or signal handler, in the executor's subprocess monitoring loop: the wrapper never receives, or never acts on, the child process's exit notification.
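For illustration, here is a minimal sketch of the behaviour I would expect from a monitoring loop. It is not the executor's actual code: `run_and_monitor`, the `report` callback, and the `FAILURE` status name are hypothetical, and the short heartbeat interval is only there to keep the example fast. The point is that the loop checks the child's exit status on every iteration and emits a terminal status once the child is gone, instead of continuing to heartbeat RUNNING.

```python
import subprocess
import sys
import time


def run_and_monitor(cmd, heartbeat_interval=0.1, report=print):
    """Hypothetical monitor: heartbeat while the child runs, then report a final status."""
    proc = subprocess.Popen(cmd)
    while True:
        # poll() returns None while the child is alive; on POSIX it does a
        # non-blocking waitpid, so an exited child is reaped here.
        returncode = proc.poll()
        if returncode is not None:
            break
        report("RUNNING")
        time.sleep(heartbeat_interval)
    # Terminal transition. "FAILURE" is an assumed name for the non-zero case.
    final = "SUCCESS" if returncode == 0 else "FAILURE"
    report(final)
    return final


# Usage: a stand-in child that sleeps briefly and exits with code 0.
statuses = []
result = run_and_monitor(
    [sys.executable, "-c", "import time; time.sleep(0.3)"],
    report=statuses.append,
)
```

In this sketch the last reported status is always terminal. The behaviour described in this issue looks like the exit check never firing, so only the RUNNING branch is ever reached.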


Workaround

Switching the sink to SYNC mode makes the issue less frequent by eliminating the async drain/flush phase, but does not resolve it in every case.

sink:
    type: datahub-rest
    config:
        mode: SYNC
