Tooling/gap9 cluster hang trace #35

Open: runwangdl wants to merge 1 commit into devel from tooling/gap9-cluster-hang-trace

@runwangdl (Owner)

Describe the intent of your PR here.

Added

Changed

Fixed

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

scripts/gap9-cluster-hang-trace.sh traces a GAP9 (GVSoC) cluster core
(default pe0; use pe8 for the cluster controller) to locate where a
training or inference run hangs or crashes. It distinguishes four outcomes:
a PC-corruption trap loop (c.unimp), an FC Invalid fetch, a stalled/idle
core, and a clean exit. It then prints the last instructions and the
addr2line output for the crash site. The script needs the GVSoC debug
models on LD_LIBRARY_PATH and sets this up internally.
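
For reviewers, the four-way classification described above could look roughly like the sketch below. This is not the code in scripts/gap9-cluster-hang-trace.sh. It assumes a plain-text trace with one executed instruction per line, and the function name `classify_trace`, the thresholds, and the `PROGRAM_EXIT` marker are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the logic of gap9-cluster-hang-trace.sh.
# Assumes a plain-text trace with one executed instruction per line and that
# the marker strings below appear verbatim; both are illustrative assumptions.

classify_trace() {
    local trace_file="$1"
    local window="${2:-20}"                # trailing lines to inspect
    local min_traps="${3:-3}"              # c.unimp hits that count as a loop
    local exit_marker="${4:-PROGRAM_EXIT}" # hypothetical clean-exit marker
    local tail_lines traps

    # An empty or missing trace means the core never executed anything.
    if [ ! -s "$trace_file" ]; then
        echo "stalled-or-idle"
        return 0
    fi

    tail_lines=$(tail -n "$window" "$trace_file")
    # Count lines in the window that contain the c.unimp mnemonic.
    traps=$(printf '%s\n' "$tail_lines" | grep -c 'c\.unimp')

    if printf '%s\n' "$tail_lines" | grep -q 'Invalid fetch'; then
        echo "invalid-fetch"
    elif [ "$traps" -ge "$min_traps" ]; then
        echo "trap-loop"
    elif printf '%s\n' "$tail_lines" | grep -qF "$exit_marker"; then
        echo "clean-exit"
    else
        # The trace ends without any known marker: treat as a stall.
        echo "stalled-or-idle"
    fi
}
```

Calling `classify_trace trace.txt` prints one of `invalid-fetch`, `trap-loop`, `clean-exit`, or `stalled-or-idle`. Checking the invalid fetch first is a design choice in this sketch, made so that a fetch fault followed by trap instructions is reported as the fetch fault.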

@runwangdl force-pushed the tooling/gap9-cluster-hang-trace branch from 86458fe to 832fddf (June 11, 2026 19:40)