[Build] feat: multi-backend build system to consolidate CUDA, ROCm, CPU, etc#1540

Open
AlpinDale wants to merge 1 commit into main from multibackend

@AlpinDale
Collaborator

This PR is an attempt to consolidate all our supported hardware backends, so that we can create one wheel for all of them.

For now, it copies the compiled binaries for each of the 3 backends that compile C++ code to aphrodite/extensions, and those are dispatched based on the detected hardware device.
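The copy step could look roughly like the sketch below. It is not the code in this PR: the function name `copy_backend_binaries`, the suffix list, and the idea of creating an `__init__.py` per backend directory are all assumptions made for illustration.

```python
import shutil
from pathlib import Path

# Assumed suffixes for compiled extension modules (Linux/macOS and Windows).
EXTENSION_SUFFIXES = (".so", ".pyd")


def copy_backend_binaries(build_dir: Path, package_root: Path, backend: str) -> list[Path]:
    """Copy compiled extension modules from build_dir into
    package_root/extensions/<backend>/ and return the copied paths.

    Hypothetical helper; the real build may place files directly via CMake.
    """
    dest = package_root / "extensions" / backend
    dest.mkdir(parents=True, exist_ok=True)
    # An __init__.py makes the backend directory a regular, importable package.
    (dest / "__init__.py").touch()

    copied = []
    for binary in sorted(build_dir.iterdir()):
        # Only pick up compiled modules, e.g. "_C.abi3.so"; skip everything else.
        if binary.suffix in EXTENSION_SUFFIXES:
            copied.append(Path(shutil.copy2(binary, dest / binary.name)))
    return copied
```

Calling `copy_backend_binaries(Path("build/cuda"), Path("aphrodite"), "cuda")` would, under these assumptions, populate `aphrodite/extensions/cuda/`.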

PyTorch on each of the 3 backends needs to be downloaded from a specific index URL, so we may need to select that URL in setup.py based on the detected target device. Other than that, this should be pretty straightforward.
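One way the index URL selection might be sketched is shown below. The environment variable name `APHRODITE_TARGET_DEVICE`, the helper names, and the specific CUDA index (`cu128`) are assumptions for illustration; the ROCm entry follows the ROCm 6.4 wheels mentioned for rocm-build.txt in this PR.

```python
import os

# Assumed mapping from target device to PyTorch wheel index.
# The CUDA version here is a placeholder, not taken from this PR.
TORCH_INDEX_URLS = {
    "cuda": "https://download.pytorch.org/whl/cu128",
    "rocm": "https://download.pytorch.org/whl/rocm6.4",
    "cpu": "https://download.pytorch.org/whl/cpu",
}


def torch_index_url(target_device: str) -> str:
    """Return the PyTorch index URL for a target device (case-insensitive)."""
    key = target_device.lower()
    if key not in TORCH_INDEX_URLS:
        raise ValueError(f"No PyTorch index known for target device {target_device!r}")
    return TORCH_INDEX_URLS[key]


def pip_install_args() -> list[str]:
    """Build pip arguments for installing torch from the matching index.

    Reads the hypothetical APHRODITE_TARGET_DEVICE variable, defaulting to CUDA.
    """
    device = os.environ.get("APHRODITE_TARGET_DEVICE", "cuda")
    return ["install", "torch", "--index-url", torch_index_url(device)]
```

Keeping the mapping in one dictionary means a new backend only needs one new entry, and an unknown device fails loudly instead of silently pulling the default wheel.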

Currently only tested with CUDA; CPU and ROCm still need testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello @AlpinDale, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural change to the project's build and distribution system. The primary objective is to enable the creation of a single Python wheel that supports multiple hardware backends, including CUDA, ROCm, and CPU. This is achieved by centralizing compiled C++ extensions into a new aphrodite/extensions directory, organized by backend, and implementing dynamic loading mechanisms in the Python codebase. This consolidation streamlines the build process and simplifies deployment across diverse hardware environments, ensuring that the correct, optimized extensions are loaded based on the detected platform.

Highlights

  • Build System Consolidation: The build system has been refactored to support a multi-backend approach, allowing for the compilation of extensions for CUDA, ROCm, and CPU within a unified structure, with the goal of creating a single distribution wheel.
  • Dynamic Extension Loading: Python code now dynamically imports backend-specific compiled extensions (e.g., _C, _moe_C, _flashmla_C, cumem_allocator) from dedicated subdirectories within aphrodite/extensions based on the detected hardware platform at runtime.
  • New Directory Structure for Extensions: A new aphrodite/extensions directory has been introduced. This directory will house compiled binaries for each supported backend (e.g., aphrodite/extensions/cuda, aphrodite/extensions/rocm, aphrodite/extensions/cpu), facilitating the consolidated build and dynamic loading.
  • ROCm Version Update: The rocm-build.txt requirements file has been updated to specify PyTorch wheels for ROCm 6.4, moving from the previous 6.3 version.
  • Ray Dependency Adjustment: The ray dependency in rocm.txt has been modified to remove its upper version bound, allowing for greater flexibility in ray versions.
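The mapping from the directory layout to importable module names, and the runtime loading it enables, can be sketched as follows. This is a minimal illustration under assumptions: the helper names `extension_module_name` and `try_import_extension` are hypothetical and do not appear in the PR.

```python
import importlib
from types import ModuleType
from typing import Optional

# Package that holds the per-backend subdirectories (cuda, rocm, cpu).
EXTENSIONS_PACKAGE = "aphrodite.extensions"


def extension_module_name(backend: str, name: str) -> str:
    """Map a backend and extension name to its dotted module path,
    e.g. ("cuda", "_C") -> "aphrodite.extensions.cuda._C"."""
    return f"{EXTENSIONS_PACKAGE}.{backend}.{name}"


def try_import_extension(backend: str, name: str) -> Optional[ModuleType]:
    """Import a backend-specific extension, returning None if it is missing.

    Hypothetical helper: importing the module is what registers its ops.
    """
    try:
        return importlib.import_module(extension_module_name(backend, name))
    except ImportError:
        return None
```

With such a helper, an optional extension check reduces to something like `supports_moe_ops = try_import_extension("cuda", "_moe_C") is not None`.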

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request effectively refactors the build system to support multiple backends (CUDA, ROCm, CPU) within a single wheel, which is a great improvement for distribution and usability. The changes in CMake and setup.py are well-structured to handle the different backends by placing compiled extensions into backend-specific directories. The Python code is then correctly updated to dynamically load these extensions at runtime based on the detected platform.

My review includes a few suggestions:

  • A minor formatting fix in a CMake file for consistency.
  • A refactoring suggestion in _custom_ops.py to reduce code duplication and improve maintainability.
  • A note on a dependency change in requirements/rocm.txt regarding a removed version constraint, which could be a potential risk.

Overall, this is a solid contribution that simplifies the project's packaging and deployment.

Comment thread CMakeLists.txt
Comment on lines +224 to +226
define_gpu_extension_target(
cumem_allocator
DESTINATION aphrodite/extensions/cuda

medium

The indentation for define_gpu_extension_target and its arguments appears to have been accidentally changed, making it inconsistent with the surrounding code style. Please restore the original indentation to improve readability and maintain consistency.

  define_gpu_extension_target(
    cumem_allocator
    DESTINATION aphrodite/extensions/cuda

Comment thread aphrodite/_custom_ops.py
Comment on lines 11 to +36
 if not current_platform.is_tpu() and not current_platform.is_xpu():
     try:
-        import aphrodite._C
+        if current_platform.is_cuda():
+            import aphrodite.extensions.cuda._C  # noqa: F401
+        elif current_platform.is_rocm():
+            import aphrodite.extensions.rocm._C  # noqa: F401
+            # Also register ROCm-specific ops if present
+            with contextlib.suppress(ImportError):
+                import aphrodite.extensions.rocm._rocm_C  # noqa: F401
+        elif current_platform.is_cpu():
+            import aphrodite.extensions.cpu._C  # noqa: F401
+        else:
+            # Other platforms not handled here
+            pass
     except ImportError as e:
-        logger.warning("Failed to import from aphrodite._C with {!r}", e)
+        logger.warning("Failed to import platform-specific _C with {!r}", e)

 supports_moe_ops = False
-with contextlib.suppress(ImportError):
-    import aphrodite._moe_C  # noqa: F401
-    supports_moe_ops = True
+if current_platform.is_cuda():
+    with contextlib.suppress(ImportError):
+        import aphrodite.extensions.cuda._moe_C  # noqa: F401
+        supports_moe_ops = True
+elif current_platform.is_rocm():
+    with contextlib.suppress(ImportError):
+        import aphrodite.extensions.rocm._moe_C  # noqa: F401
+        supports_moe_ops = True

medium

The logic for importing platform-specific extensions and setting supports_moe_ops contains duplicated code for CUDA and ROCm backends. This can be refactored to improve readability and maintainability by first determining the backend and then using a single block of code for the import logic. Using importlib.import_module would make dynamic imports cleaner.

import importlib

if not current_platform.is_tpu() and not current_platform.is_xpu():
    backend = None
    if current_platform.is_cuda():
        backend = "cuda"
    elif current_platform.is_rocm():
        backend = "rocm"
    elif current_platform.is_cpu():
        backend = "cpu"

    if backend:
        try:
            importlib.import_module(f"aphrodite.extensions.{backend}._C")
            if backend == "rocm":
                # Also register ROCm-specific ops if present
                with contextlib.suppress(ImportError):
                    importlib.import_module("aphrodite.extensions.rocm._rocm_C")
        except ImportError as e:
            logger.warning("Failed to import platform-specific _C for backend {} with {!r}", backend, e)
    else:
        # Other platforms not handled here
        pass

supports_moe_ops = False
if current_platform.is_cuda() or current_platform.is_rocm():
    backend = "cuda" if current_platform.is_cuda() else "rocm"
    with contextlib.suppress(ImportError):
        importlib.import_module(f"aphrodite.extensions.{backend}._moe_C")
        supports_moe_ops = True

Comment thread requirements/rocm.txt
 botocore
 datasets
-ray>=2.10.0,<2.45.0
+ray>=2.10.0

medium

Removing the upper version constraint for ray could introduce instability if a future version includes breaking changes. It's generally safer to specify a tested upper bound. If this change is intentional and has been tested, consider adding a comment to explain why the upper bound was removed.
