Skip to content

Fix separated external source prefetch drain#6397

Merged
JanuszL merged 7 commits into
NVIDIA:mainfrom
JanuszL:fix-separated-external-source-prefetch
Jun 24, 2026
Merged

Fix separated external source prefetch drain#6397
JanuszL merged 7 commits into
NVIDIA:mainfrom
JanuszL:fix-separated-external-source-prefetch

Conversation

@JanuszL

@JanuszL JanuszL commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Category:

Bug fix

Description:

This PR fixes a hang at the end of an epoch when a Python-managed
external_source is used with separated CPU/GPU prefetch queues.

The newer prefetch path fed all inputs before running the backend. For
separated execution this could leave CPU-prefetched external source batches
without scheduled Mixed/GPU work once the source reached end of epoch. The
consumer then waited indefinitely for output indexes in the separated queue
policy.

The fix keeps prefetching interleaved with backend runs, so input feeding and
backend scheduling stay aligned at epoch boundaries.

Related to #5199

Additional information:

Affected modules and functionalities:

  • nvidia.dali.Pipeline prefetch scheduling in the legacy executor path.
  • Python regression coverage for external_source(batch=True, cycle="raise")
    with mixed image decoding and separated prefetch queues.

Key points relevant for the review:

Review whether routing prefetch through the interleaved path is acceptable for
legacy separated execution. This restores the behavior that avoids a CPU-only
tail when input callbacks reach end of epoch.

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
      • test_pipeline.py:test_separated_queue_external_source_drains_prefetched_batches
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

@JanuszL JanuszL force-pushed the fix-separated-external-source-prefetch branch from a17716a to d972238 Compare June 16, 2026 15:33
@JanuszL

JanuszL commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

!build

@JanuszL JanuszL requested a review from mzient June 16, 2026 15:37
@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a hang at epoch boundaries when external_source with cycle="raise" is used with separated CPU/GPU prefetch queues. The old path fed all inputs upfront via _prefetch_inputs before calling _pipe.Prefetch(), leaving CPU-prefetched batches without corresponding Mixed/GPU work when the source reached end of epoch. The fix routes all execution — separated and non-separated — through _legacy_interleaved_prefetch, with the loop count adjusted to max(cpu_size, gpu_size) for separated queues.

  • pipeline.py: _prefetch_inputs and the is_prefetch parameter to _run_input_callbacks are removed; _prefetch() now unconditionally delegates to _legacy_interleaved_prefetch with a queue-depth-aware iteration count.
  • async_separated_pipelined_executor.cc: Prefetch() is updated to use max(0, cpu_size - gpu_size) CPU-only extra rounds and InputFeedCount now returns max(cpu_size, gpu_size) instead of cpu_size + gpu_size.
  • test_pipeline.py: New regression test parametrized over (2,2), (3,2), and (2,3) queue depths verifies no hang and validates decoded image content.

Confidence Score: 5/5

Safe to merge — the fix correctly eliminates the epoch-boundary hang by interleaving input feeding with pipeline runs, and the test covers all three queue-depth configurations.

All three changed files are consistent: the Python loop count, the C++ Prefetch CPU-only tail, and InputFeedCount all converge on max(cpu_size, gpu_size). The removed _prefetch_inputs path was the source of the hang, and its elimination along with the is_prefetch variant of _run_input_callbacks is clean. The test parametrization covers symmetric and both asymmetric depth orderings, and the content assertions (non-zero pixels, correct rank and channel count) go beyond cardinality checks. No blocking issues were introduced by this change.

No files require special attention.

Important Files Changed

Filename Overview
dali/python/nvidia/dali/pipeline.py Core fix: removes _prefetch_inputs and the is_prefetch variant of _run_input_callbacks, routing both separated and non-separated execution through _legacy_interleaved_prefetch with a corrected loop count; logic is correct and consistent with the C++ change.
dali/pipeline/executor/async_separated_pipelined_executor.cc Prefetch() CPU-only tail reduced from cpu_size to max(0, cpu_size - gpu_size) rounds; InputFeedCount returns max(cpu_size, gpu_size) — both consistent with the Python-side loop count change.
dali/test/python/test_pipeline.py New regression test covers all three queue-depth combinations (symmetric and both asymmetric directions); validates decoded image cardinality, rank, channel count, and non-zero pixel content; expects StopIteration on the over-run call.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Py as pipeline.py (_prefetch)
    participant LIP as _legacy_interleaved_prefetch
    participant ICS as _iter_setup / _run_input_callbacks
    participant Exec as AsyncSeparatedPipelinedExecutor
    participant CPU as CPU stage
    participant Mix as Mixed stage
    participant GPU as GPU stage

    Note over Py,GPU: New flow (both separated and non-separated)
    Py->>LIP: always delegate
    LIP->>LIP: "prefetch_count = max(cpu_size, gpu_size) if exec_separated else cpu_size"

    loop prefetch_count times
        LIP->>ICS: _iter_setup()
        ICS->>ICS: _run_input_callbacks() - feed 1 batch
        LIP->>Exec: _pipe.Run()
        Exec->>CPU: RunCPU()
        Exec->>Mix: RunMixed()
        Exec->>GPU: RunGPU()
    end

    Note over Py,GPU: Old separated flow (removed) - fed cpu_size+gpu_size batches upfront
    Note over Py,GPU: then called _pipe.Prefetch() leaving CPU-only tail at epoch boundary
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Py as pipeline.py (_prefetch)
    participant LIP as _legacy_interleaved_prefetch
    participant ICS as _iter_setup / _run_input_callbacks
    participant Exec as AsyncSeparatedPipelinedExecutor
    participant CPU as CPU stage
    participant Mix as Mixed stage
    participant GPU as GPU stage

    Note over Py,GPU: New flow (both separated and non-separated)
    Py->>LIP: always delegate
    LIP->>LIP: "prefetch_count = max(cpu_size, gpu_size) if exec_separated else cpu_size"

    loop prefetch_count times
        LIP->>ICS: _iter_setup()
        ICS->>ICS: _run_input_callbacks() - feed 1 batch
        LIP->>Exec: _pipe.Run()
        Exec->>CPU: RunCPU()
        Exec->>Mix: RunMixed()
        Exec->>GPU: RunGPU()
    end

    Note over Py,GPU: Old separated flow (removed) - fed cpu_size+gpu_size batches upfront
    Note over Py,GPU: then called _pipe.Prefetch() leaving CPU-only tail at epoch boundary
Loading

Reviews (23): Last reviewed commit: "Reset all external sources on epoch end" | Re-trigger Greptile

Comment thread dali/test/python/test_pipeline.py Outdated
@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55012969]: BUILD STARTED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55012969]: BUILD PASSED

@JanuszL JanuszL force-pushed the fix-separated-external-source-prefetch branch from d972238 to 82c4432 Compare June 17, 2026 08:37
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

@JanuszL JanuszL force-pushed the fix-separated-external-source-prefetch branch from 82c4432 to fe2b252 Compare June 17, 2026 08:53
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

@JanuszL JanuszL force-pushed the fix-separated-external-source-prefetch branch from fe2b252 to 3c8286e Compare June 17, 2026 09:12
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

1 similar comment
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

Keep pipeline prefetching interleaved with backend runs so separated execution does not leave CPU-prefetched external source batches without scheduled Mixed/GPU work at end of epoch. Prime separated execution for the maximum of CPU and GPU queue depths to avoid underfilling asymmetric queue configurations.

Add a regression that drains a batch external source through mixed image decoding with symmetric and asymmetric separated CPU/GPU prefetch queues.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL JanuszL force-pushed the fix-separated-external-source-prefetch branch from 3c8286e to 1276420 Compare June 17, 2026 09:20
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

1 similar comment
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55097813]: BUILD STARTED

Comment thread dali/python/nvidia/dali/pipeline.py
@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55097813]: BUILD FAILED

@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

Make async separated executor prefetch and InputFeedCount use the same maximum queue-depth contract. Keep the Python separated prefetch path for drainable queue shapes, but use interleaved prefetch when the CPU queue is longer than the GPU queue so end-of-epoch Python sources do not leave CPU-only work without scheduled Mixed/GPU stages.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL JanuszL force-pushed the fix-separated-external-source-prefetch branch from a0755d2 to 6fa12d5 Compare June 17, 2026 13:26
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55110109]: BUILD STARTED

Comment thread dali/python/nvidia/dali/pipeline.py Outdated
Route separated Python prefetch through the interleaved path for all queue shapes. The bulk _prefetch_inputs path could call _run_input_callbacks with a stale argument and underfeed the backend Prefetch schedule for Python external sources, causing either a TypeError or a hang at epoch end.

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

1 similar comment
@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@greptile review

@JanuszL

JanuszL commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55116811]: BUILD STARTED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55110109]: BUILD FAILED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55116811]: BUILD FAILED

@JanuszL

JanuszL commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55485233]: BUILD STARTED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55485233]: BUILD FAILED

@JanuszL

JanuszL commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55529667]: BUILD STARTED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55529667]: BUILD FAILED

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
@JanuszL

JanuszL commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

!build

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55541646]: BUILD STARTED

@dali-automaton

Copy link
Copy Markdown
Collaborator

CI MESSAGE: [55541646]: BUILD PASSED

@JanuszL JanuszL merged commit cfb5734 into NVIDIA:main Jun 24, 2026
7 checks passed
@JanuszL JanuszL deleted the fix-separated-external-source-prefetch branch June 24, 2026 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants