Fix separated external source prefetch drain by JanuszL · Pull Request #6397 · NVIDIA/DALI

JanuszL · 2026-06-16T15:29:55Z

Category:

Bug fix

Description:

This PR fixes a hang at the end of an epoch when a Python-managed
external_source is used with separated CPU/GPU prefetch queues.

The newer prefetch path fed all inputs before running the backend. For
separated execution this could leave CPU-prefetched external source batches
without scheduled Mixed/GPU work once the source reached end of epoch. The
consumer then waited indefinitely for output indexes in the separated queue
policy.

The fix keeps prefetching interleaved with backend runs, so input feeding and
backend scheduling stay aligned at epoch boundaries.

Related to #5199

Additional information:

Affected modules and functionalities:

nvidia.dali.Pipeline prefetch scheduling in the legacy executor path.
Python regression coverage for external_source(batch=True, cycle="raise")
with mixed image decoding and separated prefetch queues.

Key points relevant for the review:

Review whether routing prefetch through the interleaved path is acceptable for
legacy separated execution. This restores the behavior that avoids a CPU-only
tail when input callbacks reach end of epoch.

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

JanuszL · 2026-06-16T15:34:04Z

!build

greptile-apps · 2026-06-16T15:39:11Z

Greptile Summary

This PR fixes a hang at epoch boundaries when external_source with cycle="raise" is used with separated CPU/GPU prefetch queues. The old path fed all inputs upfront via _prefetch_inputs before calling _pipe.Prefetch(), leaving CPU-prefetched batches without corresponding Mixed/GPU work when the source reached end of epoch. The fix routes all execution — separated and non-separated — through _legacy_interleaved_prefetch, with the loop count adjusted to max(cpu_size, gpu_size) for separated queues.

pipeline.py: _prefetch_inputs and the is_prefetch parameter to _run_input_callbacks are removed; _prefetch() now unconditionally delegates to _legacy_interleaved_prefetch with a queue-depth-aware iteration count.
async_separated_pipelined_executor.cc: Prefetch() is updated to use max(0, cpu_size - gpu_size) CPU-only extra rounds and InputFeedCount now returns max(cpu_size, gpu_size) instead of cpu_size + gpu_size.
test_pipeline.py: New regression test parametrized over (2,2), (3,2), and (2,3) queue depths verifies no hang and validates decoded image content.

Confidence Score: 5/5

Safe to merge — the fix correctly eliminates the epoch-boundary hang by interleaving input feeding with pipeline runs, and the test covers all three queue-depth configurations.

All three changed files are consistent: the Python loop count, the C++ Prefetch CPU-only tail, and InputFeedCount all converge on max(cpu_size, gpu_size). The removed _prefetch_inputs path was the source of the hang, and its elimination along with the is_prefetch variant of _run_input_callbacks is clean. The test parametrization covers symmetric and both asymmetric depth orderings, and the content assertions (non-zero pixels, correct rank and channel count) go beyond cardinality checks. No blocking issues were introduced by this change.

No files require special attention.

Important Files Changed

Filename	Overview
dali/python/nvidia/dali/pipeline.py	Core fix: removes _prefetch_inputs and the is_prefetch variant of _run_input_callbacks, routing both separated and non-separated execution through _legacy_interleaved_prefetch with a corrected loop count; logic is correct and consistent with the C++ change.
dali/pipeline/executor/async_separated_pipelined_executor.cc	Prefetch() CPU-only tail reduced from cpu_size to max(0, cpu_size - gpu_size) rounds; InputFeedCount returns max(cpu_size, gpu_size) — both consistent with the Python-side loop count change.
dali/test/python/test_pipeline.py	New regression test covers all three queue-depth combinations (symmetric and both asymmetric directions); validates decoded image cardinality, rank, channel count, and non-zero pixel content; expects StopIteration on the over-run call.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Py as pipeline.py (_prefetch)
    participant LIP as _legacy_interleaved_prefetch
    participant ICS as _iter_setup / _run_input_callbacks
    participant Exec as AsyncSeparatedPipelinedExecutor
    participant CPU as CPU stage
    participant Mix as Mixed stage
    participant GPU as GPU stage

    Note over Py,GPU: New flow (both separated and non-separated)
    Py->>LIP: always delegate
    LIP->>LIP: "prefetch_count = max(cpu_size, gpu_size) if exec_separated else cpu_size"

    loop prefetch_count times
        LIP->>ICS: _iter_setup()
        ICS->>ICS: _run_input_callbacks() - feed 1 batch
        LIP->>Exec: _pipe.Run()
        Exec->>CPU: RunCPU()
        Exec->>Mix: RunMixed()
        Exec->>GPU: RunGPU()
    end

    Note over Py,GPU: Old separated flow (removed) - fed cpu_size+gpu_size batches upfront
    Note over Py,GPU: then called _pipe.Prefetch() leaving CPU-only tail at epoch boundary

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Py as pipeline.py (_prefetch)
    participant LIP as _legacy_interleaved_prefetch
    participant ICS as _iter_setup / _run_input_callbacks
    participant Exec as AsyncSeparatedPipelinedExecutor
    participant CPU as CPU stage
    participant Mix as Mixed stage
    participant GPU as GPU stage

    Note over Py,GPU: New flow (both separated and non-separated)
    Py->>LIP: always delegate
    LIP->>LIP: "prefetch_count = max(cpu_size, gpu_size) if exec_separated else cpu_size"

    loop prefetch_count times
        LIP->>ICS: _iter_setup()
        ICS->>ICS: _run_input_callbacks() - feed 1 batch
        LIP->>Exec: _pipe.Run()
        Exec->>CPU: RunCPU()
        Exec->>Mix: RunMixed()
        Exec->>GPU: RunGPU()
    end

    Note over Py,GPU: Old separated flow (removed) - fed cpu_size+gpu_size batches upfront
    Note over Py,GPU: then called _pipe.Prefetch() leaving CPU-only tail at epoch boundary

_{Reviews (23): Last reviewed commit: "Reset all external sources on epoch end" | Re-trigger Greptile}

dali-automaton · 2026-06-16T15:42:04Z

CI MESSAGE: [55012969]: BUILD STARTED

dali-automaton · 2026-06-16T17:18:27Z

CI MESSAGE: [55012969]: BUILD PASSED

JanuszL · 2026-06-17T08:37:19Z

@greptile review

JanuszL · 2026-06-17T08:53:40Z

@greptile review

JanuszL · 2026-06-17T09:13:01Z

@greptile review

JanuszL · 2026-06-17T09:17:37Z

@greptile review

Keep pipeline prefetching interleaved with backend runs so separated execution does not leave CPU-prefetched external source batches without scheduled Mixed/GPU work at end of epoch. Prime separated execution for the maximum of CPU and GPU queue depths to avoid underfilling asymmetric queue configurations. Add a regression that drains a batch external source through mixed image decoding with symmetric and asymmetric separated CPU/GPU prefetch queues. Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-17T09:20:52Z

@greptile review

JanuszL · 2026-06-17T09:26:42Z

@greptile review

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-17T09:30:33Z

@greptile review

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-17T09:40:56Z

@greptile review

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-17T10:20:16Z

!build

dali-automaton · 2026-06-17T10:25:24Z

CI MESSAGE: [55097813]: BUILD STARTED

dali-automaton · 2026-06-17T12:55:28Z

CI MESSAGE: [55097813]: BUILD FAILED

JanuszL · 2026-06-17T13:23:45Z

@greptile review

Make async separated executor prefetch and InputFeedCount use the same maximum queue-depth contract. Keep the Python separated prefetch path for drainable queue shapes, but use interleaved prefetch when the CPU queue is longer than the GPU queue so end-of-epoch Python sources do not leave CPU-only work without scheduled Mixed/GPU stages. Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-17T13:26:35Z

@greptile review

JanuszL · 2026-06-17T13:28:43Z

!build

dali-automaton · 2026-06-17T13:30:21Z

CI MESSAGE: [55110109]: BUILD STARTED

Route separated Python prefetch through the interleaved path for all queue shapes. The bulk _prefetch_inputs path could call _run_input_callbacks with a stale argument and underfeed the backend Prefetch schedule for Python external sources, causing either a TypeError or a hang at epoch end. Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-17T14:49:10Z

@greptile review

JanuszL · 2026-06-17T14:55:00Z

@greptile review

JanuszL · 2026-06-17T14:59:08Z

!build

dali-automaton · 2026-06-17T15:05:46Z

CI MESSAGE: [55116811]: BUILD STARTED

dali-automaton · 2026-06-17T15:57:57Z

CI MESSAGE: [55110109]: BUILD FAILED

dali-automaton · 2026-06-17T17:27:30Z

CI MESSAGE: [55116811]: BUILD FAILED

JanuszL · 2026-06-22T20:27:05Z

!build

dali-automaton · 2026-06-22T20:30:50Z

CI MESSAGE: [55485233]: BUILD STARTED

dali-automaton · 2026-06-22T22:54:35Z

CI MESSAGE: [55485233]: BUILD FAILED

JanuszL · 2026-06-23T06:16:13Z

!build

dali-automaton · 2026-06-23T06:21:09Z

CI MESSAGE: [55529667]: BUILD STARTED

dali-automaton · 2026-06-23T08:43:21Z

CI MESSAGE: [55529667]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

JanuszL · 2026-06-23T08:52:57Z

!build

dali-automaton · 2026-06-23T08:55:32Z

CI MESSAGE: [55541646]: BUILD STARTED

dali-automaton · 2026-06-23T13:11:07Z

CI MESSAGE: [55541646]: BUILD PASSED

JanuszL force-pushed the fix-separated-external-source-prefetch branch from a17716a to d972238 Compare June 16, 2026 15:33

JanuszL requested a review from mzient June 16, 2026 15:37

greptile-apps Bot reviewed Jun 16, 2026

View reviewed changes

Comment thread dali/test/python/test_pipeline.py Outdated

JanuszL force-pushed the fix-separated-external-source-prefetch branch from d972238 to 82c4432 Compare June 17, 2026 08:37

JanuszL force-pushed the fix-separated-external-source-prefetch branch from 82c4432 to fe2b252 Compare June 17, 2026 08:53

JanuszL force-pushed the fix-separated-external-source-prefetch branch from fe2b252 to 3c8286e Compare June 17, 2026 09:12

JanuszL force-pushed the fix-separated-external-source-prefetch branch from 3c8286e to 1276420 Compare June 17, 2026 09:20

Explain separated prefetch queue priming

9ca2e8c

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

Remove stale prefetch callback path

533544f

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

Parametrize separated queue prefetch test

8e20760

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

dali-automaton assigned rostan-t and mzient Jun 17, 2026

mzient reviewed Jun 17, 2026

View reviewed changes

Comment thread dali/python/nvidia/dali/pipeline.py

JanuszL force-pushed the fix-separated-external-source-prefetch branch from a0755d2 to 6fa12d5 Compare June 17, 2026 13:26

greptile-apps Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread dali/python/nvidia/dali/pipeline.py Outdated

Reset all external sources on epoch end

c737089

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

mzient approved these changes Jun 23, 2026

View reviewed changes

rostan-t approved these changes Jun 24, 2026

View reviewed changes

JanuszL merged commit cfb5734 into NVIDIA:main Jun 24, 2026
7 checks passed

JanuszL deleted the fix-separated-external-source-prefetch branch June 24, 2026 14:39

JanuszL mentioned this pull request Jun 24, 2026

external_source doesn't return last incomplete batch #5199

Open

1 task

Uh oh!

Conversation

JanuszL commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Category:

Description:

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Uh oh!

JanuszL commented Jun 16, 2026

Uh oh!

greptile-apps Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

dali-automaton commented Jun 16, 2026

Uh oh!

dali-automaton commented Jun 16, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

dali-automaton commented Jun 17, 2026

Uh oh!

Uh oh!

dali-automaton commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

dali-automaton commented Jun 17, 2026

Uh oh!

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 17, 2026

Uh oh!

dali-automaton commented Jun 17, 2026

Uh oh!

dali-automaton commented Jun 17, 2026

Uh oh!

dali-automaton commented Jun 17, 2026

Uh oh!

JanuszL commented Jun 22, 2026

Uh oh!

dali-automaton commented Jun 22, 2026

Uh oh!

dali-automaton commented Jun 22, 2026

Uh oh!

JanuszL commented Jun 23, 2026

JanuszL commented Jun 16, 2026 •

edited

Loading

greptile-apps Bot commented Jun 16, 2026 •

edited

Loading