
Fix mesh creation to use local devices for single-host benchmarks #125

Open
simrankaurb wants to merge 1 commit into AI-Hypercomputer:chs from simrankaurb:production-mesh-fix


@simrankaurb

Enable independent single-host execution for GEMM and Collectives benchmarks

This PR enables single-host TPU benchmarks (GEMM and single-node collectives) to execute independently on individual hosts within a multi-host slice, without requiring coordination from other hosts.

Previously, mesh creation for these benchmarks defaulted to the global device list (jax.devices()). When diagnostics ran on a single host while the other hosts in the slice were idle, JAX attempted to coordinate execution across the entire global mesh, so execution blocked.
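To check what a given host sees, a small sketch like the one below can help. `device_summary` is a hypothetical helper and not part of this PR; it only calls standard JAX query functions. On a multi-host slice the global count is typically larger than the local count, which is why a mesh built from jax.devices() spans hosts that may not be running.

```python
import jax


def device_summary():
    """Return this process's view of global vs. local devices (hypothetical helper)."""
    return {
        "process_index": jax.process_index(),      # index of this host's process
        "process_count": jax.process_count(),      # number of processes in the job
        "global_devices": jax.device_count(),      # devices across all processes
        "local_devices": jax.local_device_count(), # devices attached to this process
    }


if __name__ == "__main__":
    print(device_summary())
```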

To allow independent single-host execution:

  1. For GEMM: We updated benchmark_utils.py to scope the device mesh to jax.local_devices() and to use jax.local_device_count() for single-host sharding strategies.
  2. For Collectives: We updated benchmark_collectives.py to check the requested ici_size. If the required devices fit within a single node (e.g., ici_size <= 8), the mesh is scoped strictly to jax.local_devices(); it falls back to the global jax.devices() only for multi-host workloads (e.g., ici_size: 16).
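The selection rule for collectives can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `select_devices` and `build_mesh` are hypothetical names, the "fits on one node" check is assumed to compare ici_size against the local device count, and slicing the local list down to ici_size devices is also an assumption.

```python
import jax
import numpy as np
from jax.sharding import Mesh


def select_devices(ici_size, local_devices, global_devices):
    """Pick devices for a benchmark that needs `ici_size` chips (hypothetical helper)."""
    if ici_size <= len(local_devices):
        # Assumption: the request fits on this host, so stay local and
        # avoid involving any other host in the slice.
        return list(local_devices[:ici_size])
    if ici_size > len(global_devices):
        raise ValueError(
            f"ici_size={ici_size} exceeds the {len(global_devices)} available devices"
        )
    # Multi-host workload: fall back to the global device list.
    return list(global_devices[:ici_size])


def build_mesh(ici_size, axis_name="ici"):
    """Build a 1-D mesh over the selected devices (axis name is an assumption)."""
    devices = select_devices(ici_size, jax.local_devices(), jax.devices())
    return Mesh(np.array(devices), (axis_name,))
```

Keeping the selection logic separate from the JAX calls makes the single-host versus multi-host branch easy to unit test with placeholder device lists.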

This allows diagnostic workloads to run independently on active nodes, while preserving standard global mesh behavior for multi-host benchmarks.

@linamy85
Collaborator

linamy85 commented Jun 5, 2026

Thanks @simrankaurb. Have we verified the functionality and metric correctness of this change?

Should we also cover the HBM and H2D/D2H benchmarks?
