
[vLLM v0.13] Re-architect forge's integration with vLLM (generator.py) #669

@JenniferWang

Context

forge currently integrates with vLLM through what is essentially a bespoke fork of vLLM's EngineCore, adapted to Monarch's distributed process-mesh architecture. It manually recreates the EngineCore execution loop and replaces vLLM's multi-process ZMQ communication with Monarch RPC.

However, the attempt to upgrade from v0.10.0 to v0.13.0 has made it clear that this engine-level fork is not sustainable: a complete refactor is required before we add any more optimization logic to generator.py.
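For reference, the loop that generator.py recreates by hand boils down to: schedule a batch, execute it on the workers, then feed the sampled tokens back into scheduler state. Below is a toy, single-process sketch of that shape. `Request`, `ToyScheduler`, and `fake_worker_execute` are hypothetical stand-ins (not vLLM's Scheduler or forge's actual code); the worker call marks where Monarch RPC replaces ZMQ.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    # Hypothetical request record; not vLLM's Request class.
    request_id: str
    prompt_tokens: list[int]
    max_new_tokens: int
    output_tokens: list[int] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens


class ToyScheduler:
    """Hypothetical FIFO scheduler standing in for vLLM's Scheduler."""

    def __init__(self, max_batch: int) -> None:
        self.max_batch = max_batch
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def schedule(self) -> list[Request]:
        # Admit waiting requests until the batch is full.
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def update_from_output(self, tokens: dict[str, int]) -> list[Request]:
        # Append one sampled token per running request; retire finished ones.
        finished = []
        for req in self.running:
            req.output_tokens.append(tokens[req.request_id])
            if req.finished:
                finished.append(req)
        self.running = [r for r in self.running if not r.finished]
        return finished


def fake_worker_execute(batch: list[Request]) -> dict[str, int]:
    # Stand-in for a Monarch RPC to the worker mesh; a real worker would
    # run the model forward pass and sampling here.
    return {r.request_id: len(r.prompt_tokens) + len(r.output_tokens) for r in batch}


def engine_loop(scheduler: ToyScheduler) -> list[Request]:
    # schedule -> execute -> update, until no work remains.
    done: list[Request] = []
    while scheduler.waiting or scheduler.running:
        batch = scheduler.schedule()
        tokens = fake_worker_execute(batch)
        done.extend(scheduler.update_from_output(tokens))
    return done
```

Every piece of state in this loop (waiting/running queues, per-request outputs) is something a hand-rolled engine has to keep consistent with vLLM's internals, which is why each vLLM upgrade touches it.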

What's Implemented (Aligned with vLLM v0.13):
- Scheduler integration (request scheduling, KV cache management)
- InputProcessor (tokenization, validation)
- OutputProcessor (detokenization, output formatting)
- StructuredOutputManager (grammar-based constrained decoding)
- Two-phase execution (execute_model → sample_tokens for grammar constraints)
- Request block hashing (prefix caching)
- KV cache configuration and initialization
- Tensor-parallel (TP) model execution
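Two-phase execution splits a step so that the grammar constraint can be applied between the forward pass and sampling. A minimal, framework-free sketch, assuming a 4-token toy vocabulary: the function names mirror execute_model / sample_tokens above, but their signatures, and the `grammar_bitmask` helper, are hypothetical and not vLLM's API.

```python
from __future__ import annotations

import math


def execute_model(batch_ids: list[str]) -> dict[str, list[float]]:
    """Phase 1 (hypothetical): forward pass returning per-request logits."""
    # Fixed toy logits over a 4-token vocabulary; token 0 scores highest.
    return {rid: [2.0, 1.0, 0.5, -1.0] for rid in batch_ids}


def grammar_bitmask(allowed: dict[str, set[int]], vocab_size: int) -> dict[str, list[bool]]:
    """Which tokens each constrained request's grammar currently permits."""
    return {rid: [t in toks for t in range(vocab_size)] for rid, toks in allowed.items()}


def sample_tokens(
    logits: dict[str, list[float]], bitmask: dict[str, list[bool]]
) -> dict[str, int]:
    """Phase 2: set disallowed logits to -inf, then sample greedily."""
    out = {}
    for rid, row in logits.items():
        mask = bitmask.get(rid)  # None means the request is unconstrained
        masked = [x if mask is None or mask[i] else -math.inf for i, x in enumerate(row)]
        out[rid] = max(range(len(masked)), key=masked.__getitem__)
    return out
```

For example, with `logits = execute_model(["free", "json"])` and a grammar that only allows tokens `{2, 3}` for `"json"`, the unconstrained request picks token 0 while the constrained one picks token 2, the best allowed token. Splitting the step this way lets the mask be produced after (or concurrently with) the forward pass instead of blocking it.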

What's NOT Implemented (vs. vLLM v0.13 EngineCore):
- Pipeline-parallel (PP) execution (Generator assumes single-rank workers)
- Multi-modal input processing (mm_receiver_cache integration incomplete)
- Async scheduling (overlapping schedule/execute for throughput)
- Batch queue for pipeline parallelism (batch_queue_size > 1)
- Data-parallel coordination (DPCoordinator for multi-engine deployments)
- Abort queue (mid-generation request cancellation)
- EngineCoreClient abstraction (direct worker calls instead)
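To make the missing abort queue concrete: one common pattern is a thread-safe queue that callers push request IDs into at any time, drained by the engine at the start of each step. The sketch below assumes that pattern; `AbortableScheduler` and its methods are hypothetical and do not describe how vLLM's EngineCore implements aborts.

```python
from __future__ import annotations

import queue


class AbortableScheduler:
    """Hypothetical scheduler that honors aborts only at step boundaries."""

    def __init__(self) -> None:
        self.running: dict[str, list[int]] = {}
        # Thread-safe: other threads may enqueue aborts at any time.
        self.abort_queue: queue.Queue[str] = queue.Queue()

    def add_request(self, request_id: str) -> None:
        self.running[request_id] = []

    def abort(self, request_id: str) -> None:
        self.abort_queue.put(request_id)

    def _drain_aborts(self) -> list[str]:
        aborted = []
        while True:
            try:
                rid = self.abort_queue.get_nowait()
            except queue.Empty:
                break
            # Ignore aborts for unknown or already-finished requests.
            if self.running.pop(rid, None) is not None:
                aborted.append(rid)
        return aborted

    def step(self) -> list[str]:
        # Process cancellations before scheduling so aborted requests never
        # take a batch slot (in a real engine, their KV blocks would be freed).
        aborted = self._drain_aborts()
        for toks in self.running.values():
            toks.append(len(toks))  # stand-in for one decoded token
        return aborted
```

Deferring cancellation to step boundaries keeps the hot loop free of locks around scheduler state; the only shared structure is the queue itself.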

Goal

  • Re-architect the integration point. At minimum, integrate at the EngineCore level (or higher) so that we no longer manage the scheduler, KV cache, etc. manually.
  • Upgrade to v0.13.0 or the latest RC.
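As an illustration of the first goal, the generator would only submit requests and drain outputs, leaving scheduling and KV cache management behind an engine boundary. The sketch below is loosely modeled on an add-request/step interface; `EngineLike`, `StepOutput`, `FakeEngine`, and `Generator` are all hypothetical, and the method names and signatures are assumptions rather than vLLM's EngineCore API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class StepOutput:
    request_id: str
    new_token_ids: list[int]
    finished: bool


class EngineLike(Protocol):
    """Hypothetical engine-level boundary; NOT vLLM's EngineCore signature."""

    def add_request(self, request_id: str, prompt_token_ids: list[int], max_tokens: int) -> None: ...

    def step(self) -> list[StepOutput]: ...


class FakeEngine:
    """Toy engine emitting one token per request per step; it owns all scheduling state."""

    def __init__(self) -> None:
        self._remaining: dict[str, int] = {}  # request_id -> tokens left
        self._counter = 0

    def add_request(self, request_id: str, prompt_token_ids: list[int], max_tokens: int) -> None:
        self._remaining[request_id] = max_tokens

    def step(self) -> list[StepOutput]:
        outs = []
        for rid in list(self._remaining):
            self._remaining[rid] -= 1
            done = self._remaining[rid] == 0
            outs.append(StepOutput(rid, [self._counter], done))
            self._counter += 1
            if done:
                del self._remaining[rid]
        return outs


class Generator:
    """Hypothetical forge-side wrapper: submits requests and collects outputs only."""

    def __init__(self, engine: EngineLike) -> None:
        self.engine = engine

    def generate(self, prompts: dict[str, list[int]], max_tokens: int) -> dict[str, list[int]]:
        results: dict[str, list[int]] = {rid: [] for rid in prompts}
        for rid, ids in prompts.items():
            self.engine.add_request(rid, ids, max_tokens)
        unfinished = set(prompts)
        while unfinished:
            for out in self.engine.step():
                results[out.request_id].extend(out.new_token_ids)
                if out.finished:
                    unfinished.discard(out.request_id)
        return results
```

Because `Generator` depends only on the protocol, swapping `FakeEngine` for a real engine-level object (local, or behind a Monarch actor) would not touch generator logic; upgrades would then mostly affect the adapter, not the loop.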

(Diagram from https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1)
