QDP currently assumes a single-GPU execution model for amplitude state construction. This becomes a hard limit once the target state no longer fits on one device, even when the aggregate memory of multiple GPUs on the same host would be sufficient.
This issue tracks the first concrete feature target for distributed multi-GPU amplitude execution in QDP.
What
Add an MPI-ready, single-node distributed execution foundation for amplitude state construction in QDP.
It should establish a reusable distributed substrate while remaining scoped to one host and the amplitude path.
The intended output of this issue is:
- a validated multi-device execution context
- distributed amplitude planning and shard placement
- feasibility validation before materialization
- a materialized distributed state representation with per-shard GPU buffers
- a collective seam that works for the current single-process path while remaining extensible toward future MPI-backed execution
- tests and a probe-level executable path that demonstrates the distributed state can be built successfully on real hardware
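To make "feasibility validation before materialization" concrete, here is a minimal sketch in Python. It is illustrative only and not QDP code: `DeviceInfo`, `state_bytes`, and `check_feasibility` are hypothetical names, and it assumes complex128 amplitudes (16 bytes each) and equal-sized shards with one shard per device.

```python
from __future__ import annotations

from dataclasses import dataclass

BYTES_PER_AMPLITUDE = 16  # assumption: complex128 (two float64 values)
GIB = 1 << 30


@dataclass(frozen=True)
class DeviceInfo:
    """Hypothetical per-GPU record; not a QDP type."""
    device_id: int
    free_bytes: int


def state_bytes(num_qubits: int) -> int:
    """Total bytes for a dense amplitude state of `num_qubits` qubits."""
    return (1 << num_qubits) * BYTES_PER_AMPLITUDE


def check_feasibility(num_qubits: int, devices: list[DeviceInfo]) -> list[str]:
    """Return a list of problems; an empty list means the request is feasible.

    Assumes equal-sized shards, one shard per device.
    """
    if not devices:
        return ["no devices available"]
    if (1 << num_qubits) % len(devices) != 0:
        return ["amplitude count is not divisible by device count"]
    shard_bytes = state_bytes(num_qubits) // len(devices)
    return [
        f"device {dev.device_id}: needs {shard_bytes} bytes, "
        f"has {dev.free_bytes} free"
        for dev in devices
        if dev.free_bytes < shard_bytes
    ]


# A 31-qubit state needs 32 GiB at complex128: too large for one 24 GiB GPU,
# but feasible as two 16 GiB shards on the same host.
one_gpu = [DeviceInfo(0, 24 * GIB)]
two_gpus = [DeviceInfo(0, 24 * GIB), DeviceInfo(1, 24 * GIB)]
```

Running the check before any allocation means an infeasible request fails with a per-device explanation instead of a partial allocation that has to be unwound.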
Why
The immediate goal is to let QDP exceed single-GPU limits for amplitude state construction on one machine.
The longer-term goal is to do this without forcing an architectural rewrite as QDP grows toward:
- richer placement strategies
- stronger communication backends
- broader workflow support
- gather/export workflows
- multi-node execution
This issue therefore sits between the current single-GPU implementation and the broader roadmap in #1297.
How
The design should keep the following concerns separate:
- request validation
- device mesh discovery and topology metadata
- placement planning
- shard feasibility validation
- distributed execution context
- logical distributed layout
- materialized distributed state
- collective / communication seam
Key abstractions for this issue may include:
- DeviceMesh
- GpuTopology
- PlacementRequest
- PlacementPlan
- DistributedAmplitudePlan
- DistributedExecutionContext
- DistributedStateLayout
- DistributedStateVector
- CollectiveCommunicator
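As a rough illustration of how a placement plan could stay separate from the materialized state, the sketch below models a plan as pure metadata. It borrows the name `PlacementPlan` from the list above, but the fields, `ShardPlacement`, and `plan_contiguous` are assumptions made for this example, not the actual QDP design. It assumes equal-sized contiguous shards.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ShardPlacement:
    """One contiguous slice of the global amplitude vector on one device."""
    shard_index: int
    device_id: int
    start: int   # first global amplitude index (inclusive)
    length: int  # number of amplitudes in this shard


@dataclass(frozen=True)
class PlacementPlan:
    """Hypothetical plan: metadata only, no GPU buffers are allocated here."""
    num_qubits: int
    shards: tuple[ShardPlacement, ...]

    def owner_of(self, global_index: int) -> ShardPlacement:
        """Return the shard that holds a given global amplitude index."""
        if not 0 <= global_index < (1 << self.num_qubits):
            raise IndexError("amplitude index out of range")
        shard_len = self.shards[0].length  # shards are equal-sized by construction
        return self.shards[global_index // shard_len]


def plan_contiguous(num_qubits: int, device_ids: list[int]) -> PlacementPlan:
    """Split 2**num_qubits amplitudes into equal contiguous shards."""
    total = 1 << num_qubits
    count = len(device_ids)
    if count == 0 or total % count != 0:
        raise ValueError("device count must be nonzero and divide 2**num_qubits")
    shard_len = total // count
    shards = tuple(
        ShardPlacement(i, dev, i * shard_len, shard_len)
        for i, dev in enumerate(device_ids)
    )
    return PlacementPlan(num_qubits, shards)


# 4 qubits (16 amplitudes) across four devices: 4 amplitudes per shard.
plan = plan_contiguous(4, [0, 1, 2, 3])
```

Because the plan holds no device memory, it can be validated, logged, and unit-tested on a machine without GPUs, which fits the separation of planning from materialization described above.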
flowchart TD
A[QDP distributed request] --> B[Request validation]
B --> C[DeviceMesh]
C --> D[GpuTopology]
C --> E[DistributedExecutionContext]
D --> F[PlacementPlanner]
B --> F
F --> G[PlacementPlan]
G --> H[DistributedAmplitudePlan]
H --> I[DistributedStateLayout]
E --> J[Distributed runtime]
I --> J
J --> K[CollectiveCommunicator]
J --> L[DistributedStateVector]
sequenceDiagram
participant U as Upstream Caller
participant E as QdpEngine
participant M as DeviceMesh
participant P as PlacementPlanner
participant X as DistributedExecutionContext
participant R as Distributed Runtime
participant C as CollectiveCommunicator
U->>E: submit distributed amplitude request
E->>E: validate input and resolve request
E->>M: build multi-device mesh
E->>X: construct execution context
E->>P: build placement plan
P-->>E: shard placement metadata
E->>R: execute distributed encode
R->>R: bind planned device handles
R->>C: reduce local norm contributions
C-->>R: global norm result
R-->>U: distributed state handle
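The norm-reduction step in the sequence above ("reduce local norm contributions", then "global norm result") is where a collective seam would sit. The following Python sketch shows one possible shape for it, under stated assumptions: the `all_reduce_sum` method, `SingleProcessCommunicator`, and `normalize_shards` are hypothetical, shards are plain lists of real values standing in for per-GPU buffers, and nothing here reflects the actual QDP interface.

```python
from __future__ import annotations

import math
from typing import Protocol, Sequence


class CollectiveCommunicator(Protocol):
    """Hypothetical seam: the runtime depends only on this interface."""

    def all_reduce_sum(self, local_values: Sequence[float]) -> float:
        ...


class SingleProcessCommunicator:
    """Single-process backend: every shard lives in this process, so the
    reduction is a plain sum over the per-shard contributions."""

    def all_reduce_sum(self, local_values: Sequence[float]) -> float:
        return math.fsum(local_values)


def normalize_shards(
    shards: list[list[float]], comm: CollectiveCommunicator
) -> list[list[float]]:
    """Normalize sharded data so the global sum of squares equals 1."""
    # Each shard contributes its local sum of squares.
    local = [math.fsum(x * x for x in shard) for shard in shards]
    # The communicator combines local contributions into the global norm.
    global_norm = math.sqrt(comm.all_reduce_sum(local))
    if global_norm == 0.0:
        raise ValueError("cannot normalize an all-zero input")
    return [[x / global_norm for x in shard] for shard in shards]


shards = [[3.0, 0.0], [0.0, 4.0]]
normalized = normalize_shards(shards, SingleProcessCommunicator())
```

A future MPI-backed communicator could implement the same method with an allreduce across ranks, leaving `normalize_shards` unchanged. That is the sense in which the seam keeps the single-process path extensible.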
Scope
In scope:
- single-node execution only
- amplitude state construction only
- distributed planning and runtime scaffolding
- distributed execution context and collective seam
- topology-aware placement ordering where helpful
- per-shard buffer materialization
- tests for planning, validation, runtime behavior, and probe-level execution
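One possible reading of "topology-aware placement ordering" is to order devices so that neighbouring shards land on well-connected GPUs. The Python sketch below uses a simple greedy heuristic. The `link_score` table and `order_by_topology` are hypothetical and stand in for whatever topology metadata is actually collected; the heuristic itself is an assumption, not a design taken from this issue.

```python
from __future__ import annotations


def order_by_topology(
    link_score: dict[tuple[int, int], int], device_ids: list[int]
) -> list[int]:
    """Greedy ordering: start at the lowest device id, then repeatedly append
    the unplaced device with the best link to the most recently placed one.

    `link_score` maps an ordered pair (low_id, high_id) to a score where
    higher means a faster link; missing pairs score 0.
    """
    if not device_ids:
        return []

    def score(a: int, b: int) -> int:
        return link_score.get((min(a, b), max(a, b)), 0)

    remaining = sorted(device_ids)
    order = [remaining.pop(0)]
    while remaining:
        last = order[-1]
        # Ties break toward the lower device id so the result is deterministic.
        best = max(remaining, key=lambda d: (score(last, d), -d))
        remaining.remove(best)
        order.append(best)
    return order


# Hypothetical 4-GPU host: 0<->2 and 1<->3 share fast links, 1<->2 a slower one.
links = {(0, 2): 2, (1, 3): 2, (1, 2): 1}
ordering = order_by_topology(links, [0, 1, 2, 3])
```

With no topology information, every score is 0 and the function falls back to ascending device ids, which matches the "where helpful" qualifier: ordering only changes when the metadata gives a reason to.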
Out of scope:
- multi-node execution
- true MPI multi-rank launcher support in this first feature slice
- NCCL collectives
- peer-to-peer optimization
- final gather/export workflow surface
- broader workflow or encoder support beyond what is needed to establish the distributed substrate
Reference: #1210
Roadmap: #1297