[https://nvbugs/6336747][fix] Fail fast when executor worker stalls #15561

Draft
2ez4bz wants to merge 2 commits into NVIDIA:main from 2ez4bz:dev-nvbug-6336747

@2ez4bz

@2ez4bz 2ez4bz commented Jun 23, 2026

Collaborator

@coderabbitai summary

Description

  • Why?

A stuck or disconnected executor worker left the proxy blocked
indefinitely: the request queue uses an unbounded send HWM with no
send timeout, so request_queue.put -> socket.send never returned once
the worker stopped draining, and the error monitor never tripped. In
CI this could surface as a ~1h hang ending in an opaque timeout kill.
The stall itself is non-deterministic and not yet root-caused.

  • What?

Make the failure fast and legible instead:

  • Bound request submission: poll the socket for send-readiness and
    check worker liveness, raising RequestError if the worker has not
    accepted the request within a timeout.
  • Add a progress watchdog to the error monitor that marks the worker
    stalled and aborts in-flight requests when no result arrives while
    requests are outstanding.
  • Honor the previously-ignored timeout in GenerationResult.result()
    and bound the per-request wait in the VideoMME evaluator.
  • On a detected stall, signal the worker (SIGUSR1/faulthandler) to
    dump all thread stacks so the next occurrence is diagnosable.
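
As an illustration of the first bullet above (bounded submission), here is a minimal sketch, not the code in this PR. The function name `wait_until_sendable`, its parameters, and the two callables are hypothetical; `RequestError` is a local stand-in for the exception named in the description. The caller would invoke it before the actual `socket.send`.

```python
import time
from typing import Callable


class RequestError(RuntimeError):
    """Local stand-in for the RequestError named in the description."""


def wait_until_sendable(
    poll_writable: Callable[[int], bool],
    worker_alive: Callable[[], bool],
    timeout_s: float,
    poll_interval_ms: int = 100,
) -> None:
    """Return once the socket can accept a send, else raise RequestError.

    poll_writable(ms) should block for at most `ms` milliseconds and report
    whether the socket is ready for sending. With pyzmq this could be
    `lambda ms: sock.poll(ms, zmq.POLLOUT) != 0`; that wiring is an
    assumption, not taken from the PR.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # A dead worker will never drain the queue, so fail immediately.
        if not worker_alive():
            raise RequestError("executor worker is no longer alive")
        remaining_ms = int((deadline - time.monotonic()) * 1000)
        if remaining_ms <= 0:
            raise RequestError(
                f"worker did not accept the request within {timeout_s}s")
        # Poll in short slices so liveness is re-checked between polls.
        if poll_writable(min(poll_interval_ms, remaining_ms)):
            return
```

Polling in short slices rather than once for the full timeout lets a worker that died mid-wait be reported promptly instead of after the whole timeout has elapsed.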

This mitigates the hang and captures worker state; it does not fix
the underlying intermittent stall.
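
The watchdog and stack-dump behaviour described above could look roughly like the following sketch. It is illustrative only: `ProgressWatchdog`, `install_stack_dump_handler`, `request_stack_dump`, and all parameter names are hypothetical and not taken from this PR. The only library calls relied on are `faulthandler.register` and `os.kill`, and `SIGUSR1` is available on Unix only.

```python
import faulthandler
import os
import signal
import sys
import time
from typing import Callable


def install_stack_dump_handler(file=None) -> None:
    """Worker side: dump all thread stacks when SIGUSR1 arrives (Unix only)."""
    target = file if file is not None else sys.stderr
    faulthandler.register(signal.SIGUSR1, file=target, all_threads=True)


def request_stack_dump(worker_pid: int) -> None:
    """Proxy side: ask the worker to dump its stacks; the worker keeps running."""
    os.kill(worker_pid, signal.SIGUSR1)


class ProgressWatchdog:
    """Flags a stall when requests are outstanding but no result arrives.

    Sketch only: it assumes all methods are called from one monitor thread;
    a real implementation shared across threads would need a lock.
    """

    def __init__(self, stall_timeout_s: float, on_stall: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = stall_timeout_s
        self._on_stall = on_stall
        self._clock = clock
        self._outstanding = 0
        self._last_progress = clock()
        self.stalled = False

    def request_submitted(self) -> None:
        if self._outstanding == 0:
            # Idle time before the first request must not count as a stall.
            self._last_progress = self._clock()
        self._outstanding += 1

    def result_received(self, final: bool) -> None:
        self._last_progress = self._clock()
        if final:
            self._outstanding = max(0, self._outstanding - 1)

    def check(self) -> bool:
        """Call periodically; fires on_stall once when a stall is detected."""
        if self.stalled or self._outstanding == 0:
            return self.stalled
        if self._clock() - self._last_progress >= self._timeout:
            self.stalled = True
            self._on_stall()
        return self.stalled
```

In this sketch the proxy would construct the watchdog with an `on_stall` callback that calls `request_stack_dump` with the worker's PID and then aborts in-flight requests, while the worker calls `install_stack_dump_handler()` once at startup.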

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz

2ez4bz commented Jun 23, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-*" --disable-reuse-test

@tensorrt-cicd

Collaborator

PR_Github #55338 [ run ] triggered by Bot. Commit: a2b7c52 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55338 [ run ] completed with state FAILURE. Commit: a2b7c52
/LLM/main/L0_MergeRequest_PR pipeline #44289 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from a2b7c52 to af6cae1 Compare June 23, 2026 23:56
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-*" --disable-reuse-test

@tensorrt-cicd

Collaborator

PR_Github #55353 [ run ] triggered by Bot. Commit: af6cae1 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55353 [ run ] completed with state FAILURE. Commit: af6cae1
/LLM/main/L0_MergeRequest_PR pipeline #44303 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55362 [ run ] triggered by Bot. Commit: af6cae1 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55362 [ run ] completed with state FAILURE. Commit: af6cae1
/LLM/main/L0_MergeRequest_PR pipeline #44311 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from af6cae1 to 620220c Compare June 24, 2026 05:23
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55412 [ run ] triggered by Bot. Commit: 620220c Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 620220c to 8ce44ce Compare June 24, 2026 06:26
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55412 [ run ] completed with state FAILURE. Commit: 620220c
/LLM/main/L0_MergeRequest_PR pipeline #44354 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55427 [ run ] triggered by Bot. Commit: 8ce44ce Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55427 [ run ] completed with state SUCCESS. Commit: 8ce44ce
/LLM/main/L0_MergeRequest_PR pipeline #44365 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 8ce44ce to 2bf1ae1 Compare June 24, 2026 17:55
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 2bf1ae1 to 1a46405 Compare June 24, 2026 18:16
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55562 [ run ] triggered by Bot. Commit: 1a46405 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55563 [ run ] triggered by Bot. Commit: 1a46405 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55562 [ run ] completed with state ABORTED. Commit: 1a46405

Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55563 [ run ] completed with state FAILURE. Commit: 1a46405
/LLM/main/L0_MergeRequest_PR pipeline #44484 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 1a46405 to bfe993b Compare June 24, 2026 20:42
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from bfe993b to 33ef1ec Compare June 24, 2026 20:48
@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55580 [ run ] triggered by Bot. Commit: 33ef1ec Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55580 [ run ] completed with state SUCCESS. Commit: 33ef1ec
/LLM/main/L0_MergeRequest_PR pipeline #44497 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz

2ez4bz commented Jun 24, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55607 [ run ] triggered by Bot. Commit: 33ef1ec Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55607 [ run ] completed with state FAILURE. Commit: 33ef1ec
/LLM/main/L0_MergeRequest_PR pipeline #44524 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 33ef1ec to 183f4f7 Compare June 25, 2026 04:33
@2ez4bz

2ez4bz commented Jun 25, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55697 [ run ] triggered by Bot. Commit: 183f4f7 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55697 [ run ] completed with state FAILURE. Commit: 183f4f7
/LLM/main/L0_MergeRequest_PR pipeline #44602 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch 2 times, most recently from 0b1b09d to 807178c Compare June 25, 2026 05:37
@2ez4bz

2ez4bz commented Jun 25, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55708 [ run ] triggered by Bot. Commit: 807178c Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 807178c to 740a40d Compare June 25, 2026 06:16
2ez4bz added 2 commits June 24, 2026 23:17
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-nvbug-6336747 branch from 740a40d to 2ff36aa Compare June 25, 2026 06:21
@tensorrt-cicd

Collaborator

PR_Github #55708 [ run ] completed with state FAILURE. Commit: 807178c
/LLM/main/L0_MergeRequest_PR pipeline #44611 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@2ez4bz

2ez4bz commented Jun 25, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@github-actions


⚠️ Bot command ignored: The /bot command must appear at the very beginning of the comment (no leading blank lines or spaces). Please post a new comment with /bot as the first character.

@2ez4bz

2ez4bz commented Jun 25, 2026

Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1" --disable-reuse-test --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55715 [ run ] triggered by Bot. Commit: 2ff36aa Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55715 [ run ] completed with state SUCCESS. Commit: 2ff36aa
/LLM/main/L0_MergeRequest_PR pipeline #44618 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

2 participants