# Serverless ComfyUI

This guide shows how to deploy a ComfyUI server on Beam using a Pod. We'll set up a server to generate images with Flux1 Schnell, but you can easily adapt it to use other models like Stable Diffusion v1.5.

## View the Code

See the code for this example on GitHub.

## Setting Up the ComfyUI Server

### Create the Deployment Script

Create a file named `app.py` with the following code. This script sets up a Beam Pod with ComfyUI, installs dependencies, downloads the Flux1 Schnell model, and launches the server.

```python
from beam import Image, Pod

ORG_NAME = "Comfy-Org"
REPO_NAME = "flux1-schnell"
WEIGHTS_FILE = "flux1-schnell-fp8.safetensors"
COMMIT = "f2808ab17fe9ff81dcf89ed0301cf644c281be0a"

image = (
    Image()
    .add_commands(["apt update && apt install git -y"])
    .add_python_packages(
        [
            "fastapi[standard]==0.115.4",
            "comfy-cli==1.3.5",
            "huggingface_hub[hf_transfer]==0.26.2",
        ]
    )
    .add_commands(
        [
            "comfy --skip-prompt install --nvidia --version 0.3.10",
            "comfy node install was-node-suite-comfyui@1.0.2",
            "mkdir -p /root/comfy/ComfyUI/models/checkpoints/",
            f"huggingface-cli download {ORG_NAME}/{REPO_NAME} {WEIGHTS_FILE} --cache-dir /comfy-cache",
            f"ln -s /comfy-cache/models--{ORG_NAME}--{REPO_NAME}/snapshots/{COMMIT}/{WEIGHTS_FILE} /root/comfy/ComfyUI/models/checkpoints/{WEIGHTS_FILE}",
        ]
    )
)

comfyui_server = Pod(
    image=image,
    ports=[8000],
    cpu=12,
    memory="32Gi",
    gpu="A100-40",
    entrypoint=["sh", "-c", "comfy launch -- --listen 0.0.0.0 --port 8000"],
)

res = comfyui_server.create()

print("✨ ComfyUI hosted at:", res.url)
```

### Start ComfyUI

```shell
python app.py
```

This deploys the ComfyUI server to Beam. After deployment, you'll see a URL (e.g., `https://pod-12345.apps.beam.cloud`) where your server is hosted. ComfyUI takes a minute or two to start after deploying it for the first time.

### Accessing the Server

Open the URL from your terminal in a browser to access the ComfyUI interface. Use the web UI to load workflows or generate images.
## Using Different Models

You can swap the Flux1 Schnell model for another, such as Stable Diffusion v1.5, by updating the model variables in `app.py`. Here's how:

### Update the Model Variables

Define the organization, repository, weights file, and commit ID for your desired model. For example, to use Stable Diffusion v1.5:

```python
ORG_NAME = "Comfy-Org"
REPO_NAME = "stable-diffusion-v1-5-archive"
WEIGHTS_FILE = "v1-5-pruned-emaonly-fp16.safetensors"
COMMIT = "21e044065c0b2d82dafd35397a553847c70c0445"
```

### Apply to the Image Commands

The rest of the script uses these variables, so no further changes are needed to the image section:

```python
image = (
    Image()
    .add_commands(["apt update && apt install git -y"])
    .add_python_packages(
        [
            "fastapi[standard]==0.115.4",
            "comfy-cli==1.3.5",
            "huggingface_hub[hf_transfer]==0.26.2",
        ]
    )
    .add_commands(
        [
            "comfy --skip-prompt install --nvidia --version 0.3.10",
            "comfy node install was-node-suite-comfyui@1.0.2",
            "mkdir -p /root/comfy/ComfyUI/models/checkpoints/",
            f"huggingface-cli download {ORG_NAME}/{REPO_NAME} {WEIGHTS_FILE} --cache-dir /comfy-cache",
            f"ln -s /comfy-cache/models--{ORG_NAME}--{REPO_NAME}/snapshots/{COMMIT}/{WEIGHTS_FILE} /root/comfy/ComfyUI/models/checkpoints/{WEIGHTS_FILE}",
        ]
    )
)
```

### Find Model Details

To use any other model:

1. Visit the Comfy-Org Hugging Face page and find your desired model.
2. Update `ORG_NAME`, `REPO_NAME`, `WEIGHTS_FILE`, and `COMMIT` with values from the model's repository.
3. Check the "Files and versions" tab for the weights file and commit hash.

## Running Workflows as APIs

You can also expose ComfyUI workflows as APIs using Beam's ASGI support. This allows you to programmatically generate images by sending requests with prompts.
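Before rebuilding the image with new variables, it can help to sanity-check the symlink source path the script derives from them. The sketch below reproduces that derivation; the helper name is our own, while the `models--{org}--{repo}/snapshots/{commit}` layout is the standard Hugging Face cache structure that the `huggingface-cli download --cache-dir` command produces:

```python
def cache_weights_path(
    org: str, repo: str, weights_file: str, commit: str, cache_dir: str = "/comfy-cache"
) -> str:
    """Build the path where `huggingface-cli download --cache-dir` places the weights."""
    return f"{cache_dir}/models--{org}--{repo}/snapshots/{commit}/{weights_file}"


# The symlink source for the Stable Diffusion v1.5 example above:
print(cache_weights_path(
    "Comfy-Org",
    "stable-diffusion-v1-5-archive",
    "v1-5-pruned-emaonly-fp16.safetensors",
    "21e044065c0b2d82dafd35397a553847c70c0445",
))
```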
Below is an example of how to set this up.

### Create the API Script

```python
from beam import Image, asgi, Output

image = (
    Image()
    .add_commands(["apt update && apt install git -y"])
    .add_python_packages(
        [
            "fastapi[standard]==0.115.4",
            "comfy-cli",
            "huggingface_hub[hf_transfer]==0.26.2",
        ]
    )
    .add_commands(
        [
            "yes | comfy install --nvidia --version 0.3.10",
            "comfy node install was-node-suite-comfyui@1.0.2",
            "mkdir -p /root/comfy/ComfyUI/models/checkpoints/",
            "huggingface-cli download Comfy-Org/flux1-schnell flux1-schnell-fp8.safetensors --cache-dir /comfy-cache",
            "ln -s /comfy-cache/models--Comfy-Org--flux1-schnell/snapshots/f2808ab17fe9ff81dcf89ed0301cf644c281be0a/flux1-schnell-fp8.safetensors /root/comfy/ComfyUI/models/checkpoints/flux1-schnell-fp8.safetensors",
        ]
    )
)


def init_models():
    import subprocess

    cmd = "comfy launch --background"
    subprocess.run(cmd, shell=True, check=True)


@asgi(
    name="comfy",
    image=image,
    on_start=init_models,
    cpu=8,
    memory="32Gi",
    gpu="A100-40",
    timeout=-1,
)
def handler():
    from fastapi import FastAPI, HTTPException
    import subprocess
    import json
    from pathlib import Path
    import uuid
    from typing import Dict

    app = FastAPI()

    # This is where you specify the path to your workflow file.
    # Make sure "workflow_api.json" exists in the same directory as this script.
    WORKFLOW_FILE = Path(__file__).parent / "workflow_api.json"
    OUTPUT_DIR = Path("/root/comfy/ComfyUI/output")

    @app.post("/generate")
    async def generate(item: Dict):
        if not WORKFLOW_FILE.exists():
            raise HTTPException(status_code=500, detail="Workflow file not found.")

        workflow_data = json.loads(WORKFLOW_FILE.read_text())
        workflow_data["6"]["inputs"]["text"] = item["prompt"]

        request_id = uuid.uuid4().hex
        workflow_data["9"]["inputs"]["filename_prefix"] = request_id

        new_workflow_file = Path(f"{request_id}.json")
        new_workflow_file.write_text(json.dumps(workflow_data, indent=4))

        # Run inference
        cmd = f"comfy run --workflow {new_workflow_file} --wait --timeout 1200 --verbose"
        subprocess.run(cmd, shell=True, check=True)

        image_files = list(OUTPUT_DIR.glob("*"))

        # Find the latest image
        latest_image = max(
            (f for f in image_files if f.suffix.lower() in {".png", ".jpg", ".jpeg"}),
            key=lambda f: f.stat().st_mtime,
            default=None,
        )
        if not latest_image:
            raise HTTPException(status_code=404, detail="No output image found.")

        output_file = Output(path=latest_image)
        output_file.save()
        public_url = output_file.public_url(expires=-1)
        print(public_url)
        return {"output_url": public_url}

    return app
```

### Prepare a Workflow File

Create a `workflow_api.json` file in the same directory as `app.py`. This file should contain your ComfyUI workflow, which you can export from the ComfyUI web interface.
You can also store your `workflow_api.json` file in your Volume and use it like `WORKFLOW_FILE = Path("/your_volume/workflow_api.json")`.

### Deploy the API

```shell
beam deploy api.py:handler
```

### Use the API

Send a POST request to the `/generate` endpoint with a JSON payload containing a prompt:

```shell
curl -X POST https://12345.apps.beam.cloud/generate \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_BEAM_API' \
  -d '{"prompt": "A cat image"}'
```

The response will include a public URL to the generated image:

```json
{
  "output_url": "https://app.beam.cloud/output/id/9a003889-8345-4969-bdf8-2808eebc1c4b"
}
```
# Parler TTS

This guide demonstrates how to set up and run the Parler TTS text-to-speech model as a serverless API on Beam.

## View the Code

See the code for this example on GitHub.

## Introduction

Parler-TTS Mini is a lightweight text-to-speech (TTS) model, trained on 45K hours of audio data, that can generate high-quality, natural-sounding speech with features that can be controlled using a simple text prompt. This guide explains how to deploy and use it on Beam.

## Deployment Setup

Define the model and its dependencies using the `parlertts_image`:

```python
from beam import endpoint, env, Image, Output

if env.is_remote():
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf
    import uuid


def load_models():
    model = ParlerTTSForConditionalGeneration.from_pretrained(
        "parler-tts/parler-tts-mini-v1"
    ).to("cuda:0")
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
    return model, tokenizer


parlertts_image = (
    Image(
        python_version="python3.10",
        python_packages=[
            "torch",
            "transformers",
            "soundfile",
            "Pillow",
            "wheel",
            "packaging",
            "ninja",
            "huggingface_hub[hf-transfer]",
        ],
    )
    .add_commands(
        [
            "apt update && apt install git -y",
            "pip install git+https://github.com/huggingface/parler-tts.git",
        ]
    )
    .with_envs("HF_HUB_ENABLE_HF_TRANSFER=1")
)
```

## Inference Function

The `generate_speech` function processes text and generates speech audio:

```python
@endpoint(
    name="parler-tts",
    on_start=load_models,
    cpu=2,
    memory="32Gi",
    gpu="A10G",
    gpu_count=2,
    image=parlertts_image,
)
def generate_speech(context, **inputs):
    model, tokenizer = context.on_start_value

    prompt = inputs.pop("prompt", None)
    description = inputs.pop("description", None)
    if not prompt or not description:
        return {"error": "Please provide a prompt and description"}

    device = "cuda:0"
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    generation = model.generate(
        input_ids=input_ids, prompt_input_ids=prompt_input_ids
    )
    audio_arr = generation.cpu().numpy().squeeze()

    file_name = f"/tmp/parler_tts_out_{uuid.uuid4()}.wav"
    sf.write(file_name, audio_arr, model.config.sampling_rate)

    output_file = Output(path=file_name)
    output_file.save()
    public_url = output_file.public_url(expires=1200000000)
    print(public_url)
    return {"output_url": public_url}
```

## Deployment

Deploy the API to Beam:

```shell
beam deploy app.py:generate_speech
```

## API Usage

Send a POST request with the following JSON payload:

```json
{
  "prompt": "Your text to convert to speech",
  "description": "Description of the voice/style"
}
```

### Example Request

```json
{
  "prompt": "On Beam run AI workloads anywhere with zero complexity. One line of Python, global GPUs, full control!!!",
  "description": "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."
}
```

### Example Response

A generated audio file will be returned:

```json
{
  "output_url": "https://app.beam.cloud/output/id/dc443a80-7fcc-42bc-928b-4605e41b0825"
}
```

## Summary

You've successfully deployed a Parler TTS text-to-speech API using Beam.
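Note that the endpoint returns an error unless both fields are present. A small client-side helper, sketched below, mirrors that validation before the request ever leaves your machine (the helper is our own, not part of the Beam SDK):

```python
def build_tts_payload(prompt: str, description: str) -> dict:
    """Assemble the JSON body for the parler-tts endpoint, rejecting empty fields."""
    if not prompt or not description:
        raise ValueError("Please provide a prompt and description")
    return {"prompt": prompt, "description": description}


payload = build_tts_payload(
    "Hello from Beam",
    "A calm female voice, close up, high quality recording.",
)
```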
# Run an OpenAI-Compatible vLLM Server

In this example we are going to use vLLM to host an OpenAI-compatible InternVL2.5 8B API on Beam.

## View the Code

See the code for this example on GitHub.

## Introduction to vLLM

vLLM is a high-performance, easy-to-use library for LLM inference. It can be up to 24 times faster than Hugging Face's Transformers library, and it allows you to easily set up an OpenAI-compatible API for your LLM. Additionally, a number of LLMs (like Llama 3.1) support LoRA. This means that you can easily follow our LoRA guide and host your resulting model using vLLM.
The key to vLLM's performance is PagedAttention. In LLMs, input tokens produce attention keys and value tensors, which are typically stored in GPU memory. PagedAttention stores these continuous keys and values in non-contiguous memory by partitioning them into blocks that are fetched on a need-to-use basis.

> Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS's virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. - vLLM Explainer Doc
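The block-table idea can be illustrated in a few lines. This toy sketch is our own simplification, not vLLM's actual implementation: it maps a sequence's contiguous logical KV blocks onto whatever physical blocks happen to be free, the way an OS page table maps virtual pages to frames.

```python
BLOCK_SIZE = 4  # tokens per block ("bytes per page")


class BlockTable:
    """Toy mapping of a sequence's logical KV blocks to physical blocks."""

    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # pool of free physical block ids
        self.table = []                # logical block index -> physical block id

    def append_token(self, token_index: int) -> int:
        """Return the physical block holding this token, allocating on demand."""
        logical = token_index // BLOCK_SIZE
        if logical == len(self.table):           # crossed into a new logical block
            self.table.append(self.free.pop(0))  # grab any free physical block
        return self.table[logical]


# Physical blocks 7, 2, and 9 are free, in no particular order.
seq = BlockTable(free_blocks=[7, 2, 9])
placements = [seq.append_token(i) for i in range(10)]
print(seq.table)  # logical blocks 0, 1, 2 map to physical 7, 2, 9 (non-contiguous)
```

The sequence's keys and values end up spread across non-adjacent physical blocks, yet lookups through the table stay O(1) per token.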
## Hosting an OpenAI-Compatible Chat API with vLLM

With vLLM, we can host a fully functional chat API and interact with it through existing SDKs. You could build this functionality yourself, but vLLM provides a great out-of-the-box solution.

### Initial Setup

To get started with vLLM on Beam, we can use the `VLLM` class from the Beam SDK. This class supports all of the flags and arguments of the vLLM command line tool as arguments.
### Setup Compute Environment

Let's take a look at the code required to deploy the InternVL2.5 8B model from OpenGVLab. Just like a normal Beam application, we start by defining the environment. For this model, we will use 8 CPUs, 32Gi of memory, and two A10G GPUs. With those details set, we can focus on what arguments we need to pass to our vLLM server.

models.py

```python
from beam.integrations import VLLM, VLLMArgs

INTERNVL2_5 = "OpenGVLab/InternVL2_5-8B"

internvl = VLLM(
    name=INTERNVL2_5.split("/")[-1],
    cpu=8,
    memory="32Gi",
    gpu="A10G",
    gpu_count=2,
    vllm_args=VLLMArgs(
        model=INTERNVL2_5,
        served_model_name=[INTERNVL2_5],
        trust_remote_code=True,
        max_model_len=4096,
        gpu_memory_utilization=0.95,
        limit_mm_per_prompt={"image": 2},
    ),
)
```
The first argument we need to set is the model. Then, for this model, we set `trust_remote_code` to `True`, since it will be using tool-calling functionality. Finally, we set `max_model_len` to 4096, which is the maximum number of tokens that can be used in a single request, and `limit_mm_per_prompt` to `{"image": 2}`, which limits the number of images that can be used in a single request.

The equivalent vLLM command line tool command would be:

```shell
vllm serve OpenGVLab/InternVL2_5-8B --trust-remote-code \
  --max-model-len 4096 --limit-mm-per-prompt image=2
```
### Deploying the API

To deploy our model, we can run the following command:

```shell
beam deploy models.py:internvl
```

The output will look like this:

```
=> Building image
=> Using cached image
=> Syncing files
Reading .beamignore file
Collecting files from /Users/minzi/Dev/beam/ex-repo/vllm
Added /Users/minzi/Dev/beam/ex-repo/vllm/models.py
Added /Users/minzi/Dev/beam/ex-repo/vllm/tool_chat_template_mistral.jinja
Added /Users/minzi/Dev/beam/ex-repo/vllm/README.md
Added /Users/minzi/Dev/beam/ex-repo/vllm/chat.py
Added /Users/minzi/Dev/beam/ex-repo/vllm/inference.py
Collected object is 14.46 KB
=> Files already synced
=> Deploying
=> Deployed 🎉
=> Invocation details
curl -X POST 'https://internvl-15c4487-v4.app.beam.cloud' \
-H 'Connection: keep-alive' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_TOKEN' \
-d '{}'
```
### Using the API

#### Pre-requisites

Once your function is deployed, you can interact with it using the OpenAI Python client.

To get started, you can clone the example repository and run the `chat.py` script.

Make sure you have the `openai` library installed locally, since that is how we interact with the deployed API.

```shell
git clone https://github.com/beam-cloud/examples.git
cd examples/vllm
pip install openai
python chat.py
```
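If you'd rather call the API directly than through the bundled `chat.py`, a minimal client might look like the sketch below. The `build_messages` helper is our own, assembling the OpenAI vision-style message format; the base URL in the usage comment is a placeholder, and the `/v1` suffix is an assumption based on vLLM's conventional OpenAI-compatible routing.

```python
from typing import Optional


def build_messages(question: str, image_url: Optional[str] = None) -> list:
    """Build an OpenAI-style chat message list, optionally attaching one image URL."""
    content = [{"type": "text", "text": question}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return [{"role": "user", "content": content}]


# Usage with the OpenAI client (pip install openai); fill in your own URL and token:
# from openai import OpenAI
# client = OpenAI(base_url="https://your-deployment.app.beam.cloud/v1", api_key="YOUR_TOKEN")
# resp = client.chat.completions.create(
#     model="OpenGVLab/InternVL2_5-8B",
#     messages=build_messages("What is in this image?", "https://example.com/duck.jpg"),
# )
# print(resp.choices[0].message.content)
```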
#### Starting a Dialogue

You will be greeted with a prompt to enter the URL of your deployed function.

Once you enter the URL, the container will initialize on Beam and you will be able to interact with the model.

```
Welcome to the CLI Chat Application!

Type 'quit' to exit the conversation.

Enter the app URL: https://internvl-instruct-15c4487-v3.app.beam.cloud

Model OpenGVLab/InternVL2_5-8B is ready

Question: What is in this image?

Image link (press enter to skip): https://upload.wikimedia.org/wikipedia/commons/7/74/White_domesticated_duck,_stretching.jpg

Assistant: The image you've shared is of a white duck standing on a grassy field. The duck, with its distinctive orange beak and feet, is facing to the left.
```

To host other models, you can simply change the arguments you pass into the `VLLM` class.
For example, Yi Coder 9B Chat:

```python
from beam.integrations import VLLM, VLLMArgs

YI_CODER_CHAT = "01-ai/Yi-Coder-9B-Chat"

yicoder_chat = VLLM(
    name=YI_CODER_CHAT.split("/")[-1],
    cpu=8,
    memory="16Gi",
    gpu="A100-40",
    vllm_args=VLLMArgs(
        model=YI_CODER_CHAT,
        served_model_name=[YI_CODER_CHAT],
        task="chat",
        trust_remote_code=True,
        max_model_len=8096,
    ),
)
```