[Feature] Add MPI-ready single-node distributed amplitude execution in QDP #1295

QDP currently assumes a single-GPU execution model for amplitude state construction. This becomes a hard limit once the target state no longer fits on one device, even when the aggregate memory of multiple GPUs on the same host would be sufficient.
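
As a rough illustration of that limit, the arithmetic below assumes a dense state vector storing one complex128 amplitude (16 bytes) per basis state. The element type QDP actually uses may differ, and `state_bytes` is a hypothetical helper, not an existing QDP function.

```python
# Assumption: one complex128 amplitude (16 bytes) per basis state.
BYTES_PER_AMPLITUDE = 16
GIB = 1 << 30


def state_bytes(num_qubits: int) -> int:
    """Bytes needed for a dense state vector of 2**num_qubits amplitudes."""
    return (1 << num_qubits) * BYTES_PER_AMPLITUDE


if __name__ == "__main__":
    for n in (30, 32, 33):
        print(f"{n} qubits: {state_bytes(n) // GIB} GiB")
```

Under these assumptions a 33-qubit state needs 128 GiB, which would not fit on a single 80 GiB device but would split into 32 GiB shards across four such devices on one host.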

This issue tracks the first concrete feature target for distributed multi-GPU amplitude execution in QDP.

Reference: #1210
Roadmap: #1297

## What

Add an MPI-ready, single-node distributed execution foundation for amplitude state construction in QDP.

As the first concrete implementation target, it should establish a reusable distributed substrate while staying scoped to one host and the amplitude path.

The intended outputs of this issue are:

- a validated multi-device execution context
- distributed amplitude planning and shard placement
- feasibility validation before materialization
- a materialized distributed state representation with per-shard GPU buffers
- a collective seam that works for the current single-process path while remaining extensible toward future MPI-backed execution
- tests and a probe-level executable path that demonstrates the distributed state can be built successfully on real hardware
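
The "feasibility validation before materialization" item could look roughly like the sketch below: compare each planned shard against the free memory reported for its device before any buffer is allocated. `ShardRequirement`, `infeasible_shards`, and the `reserve_fraction` headroom are hypothetical names and parameters chosen for illustration, and the free-memory figures are assumed to come from whatever device query QDP uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ShardRequirement:
    """Hypothetical record: bytes a planned shard needs on one device."""
    device_id: int
    required_bytes: int


def infeasible_shards(
    requirements: Sequence[ShardRequirement],
    free_bytes: Dict[int, int],
    reserve_fraction: float = 0.1,
) -> List[int]:
    """Return the device ids whose shard would not fit.

    A fraction of each device's free memory is held back as headroom.
    Devices missing from `free_bytes` are treated as having no memory.
    """
    failures = []
    for req in requirements:
        usable = int(free_bytes.get(req.device_id, 0) * (1.0 - reserve_fraction))
        if req.required_bytes > usable:
            failures.append(req.device_id)
    return failures
```

An empty result means every shard fits and materialization can proceed; a non-empty result lets the caller reject the request with the offending devices named, before any GPU allocation has happened.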

## Why

The immediate goal is to let QDP exceed single-GPU limits for amplitude state construction on one machine.

The longer-term goal is to do this in a way that does not force future work to rewrite the architecture when QDP grows toward:

- richer placement strategies
- stronger communication backends
- broader workflow support
- gather/export workflows
- multi-node execution

This issue therefore sits between the current single-GPU implementation and the broader roadmap in #1297.

## How

The design should keep the following concerns separate:

1. request validation
2. device mesh discovery and topology metadata
3. placement planning
4. shard feasibility validation
5. distributed execution context
6. logical distributed layout
7. materialized distributed state
8. collective / communication seam

Key abstractions for this issue may include:

- `DeviceMesh`
- `GpuTopology`
- `PlacementRequest`
- `PlacementPlan`
- `DistributedAmplitudePlan`
- `DistributedExecutionContext`
- `DistributedStateLayout`
- `DistributedStateVector`
- `CollectiveCommunicator`
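
To make the planning abstractions more concrete, here is a minimal sketch of how a `PlacementPlan` might describe contiguous shards of the amplitude index space. The fields, the `Shard` type, and `plan_contiguous` are assumptions for illustration only; the issue does not fix their shape, and a real planner would likely also consult `GpuTopology`.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Shard:
    """Hypothetical: a contiguous amplitude range owned by one device."""
    device_id: int
    offset: int   # index of the first amplitude in this shard
    length: int   # number of amplitudes in this shard


@dataclass(frozen=True)
class PlacementPlan:
    """Hypothetical shape for the plan; the real fields are not specified."""
    num_qubits: int
    shards: List[Shard]


def plan_contiguous(num_qubits: int, device_ids: Sequence[int]) -> PlacementPlan:
    """Split 2**num_qubits amplitudes into contiguous shards, one per device."""
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    if not device_ids:
        raise ValueError("at least one device is required")
    if len(set(device_ids)) != len(device_ids):
        raise ValueError("device ids must be unique")

    total = 1 << num_qubits
    base, extra = divmod(total, len(device_ids))
    shards = []
    offset = 0
    for i, device_id in enumerate(device_ids):
        # The first `extra` shards take one additional amplitude each.
        length = base + (1 if i < extra else 0)
        shards.append(Shard(device_id, offset, length))
        offset += length
    return PlacementPlan(num_qubits, shards)
```

Keeping the plan as plain, immutable metadata means it can be validated and tested without touching a GPU, which matches the separation between planning and materialization listed above.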

```mermaid
flowchart TD
    A[QDP distributed request] --> B[Request validation]
    B --> C[DeviceMesh]
    C --> D[GpuTopology]
    C --> E[DistributedExecutionContext]
    D --> F[PlacementPlanner]
    B --> F
    F --> G[PlacementPlan]
    G --> H[DistributedAmplitudePlan]
    H --> I[DistributedStateLayout]
    E --> J[Distributed runtime]
    I --> J
    J --> K[CollectiveCommunicator]
    J --> L[DistributedStateVector]
```

```mermaid
sequenceDiagram
    participant U as Upstream Caller
    participant E as QdpEngine
    participant M as DeviceMesh
    participant P as PlacementPlanner
    participant X as DistributedExecutionContext
    participant R as Distributed Runtime
    participant C as CollectiveCommunicator

    U->>E: submit distributed amplitude request
    E->>E: validate input and resolve request
    E->>M: build multi-device mesh
    E->>X: construct execution context
    E->>P: build placement plan
    P-->>E: shard placement metadata
    E->>R: execute distributed encode
    R->>R: bind planned device handles
    R->>C: reduce local norm contributions
    C-->>R: global norm result
    R-->>U: distributed state handle
```
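
The norm reduction step in the sequence above is where the collective seam matters most. The sketch below shows one possible shape for it: the runtime only talks to an interface, and the single-process path supplies a trivial implementation. The method name `all_reduce_sum`, the `SingleProcessCommunicator` class, and the use of plain Python sequences in place of GPU buffers are all assumptions for illustration.

```python
import math
from typing import Protocol, Sequence


class CollectiveCommunicator(Protocol):
    """Hypothetical interface for the collective seam."""

    def all_reduce_sum(self, local_values: Sequence[float]) -> float:
        """Combine per-shard partial sums into one global sum."""
        ...


class SingleProcessCommunicator:
    """Single-process path: every shard lives in this process, so just add."""

    def all_reduce_sum(self, local_values: Sequence[float]) -> float:
        return math.fsum(local_values)


def global_norm(
    shards: Sequence[Sequence[complex]],
    comm: CollectiveCommunicator,
) -> float:
    """L2 norm of a state split across shards.

    Each shard contributes the sum of |amplitude|**2 over its own range;
    the communicator combines those contributions.
    """
    local_sums = [math.fsum(abs(a) ** 2 for a in shard) for shard in shards]
    return math.sqrt(comm.all_reduce_sum(local_sums))
```

Because `global_norm` depends only on the interface, a later MPI-backed communicator could implement the same method with an all-reduce across ranks without changing the runtime code that calls it.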

## Scope

In scope:

- single-node execution only
- amplitude state construction only
- distributed planning and runtime scaffolding
- distributed execution context and collective seam
- topology-aware placement ordering where helpful
- per-shard buffer materialization
- tests for planning, validation, runtime behavior, and probe-level execution

Out of scope:

- multi-node execution
- true MPI multi-rank launcher support in this first feature slice
- NCCL collectives
- peer-to-peer optimization
- final gather/export workflow surface
- broader workflow or encoder support beyond what is needed to establish the distributed substrate
- full performance tuning

