Model Engine

The Model Engine is an API server that allows users to create, deploy, edit, and delete machine learning endpoints.

Architecture

Core Components

Supporting Services

  • Kubernetes Cache - Stores endpoint metadata in Redis to reduce API server load.
  • Celery Autoscaler - Automatically scales inference pods based on request volume for async endpoints.

Getting Started

Prerequisites

Install global dev requirements and pre-commit hooks from the llm-engine root:

pip install -r ../requirements-dev.txt
pre-commit install

Installation

pip install -r requirements.txt
pip install -r requirements-test.txt
pip install -r requirements_override.txt
pip install -e .

Set up mypy:

mypy . --install-types

Running Tests

pytest

Unit tests are in tests/unit.

OpenAPI Schema Generation

Model Engine is the source of truth for the Launch API schema. We generate OpenAPI schemas that are consumed by client libraries (e.g., launch-python-client).

Why Two Schema Versions?

FastAPI with Pydantic v2 generates OpenAPI 3.1 schemas. However, code generators like OpenAPI Generator 6.x have incomplete 3.1 support. We provide two versions:

File              Version      Use Case
openapi.json      OpenAPI 3.1  Native FastAPI output, documentation
openapi-3.0.json  OpenAPI 3.0  Code generation (OpenAPI Generator 6.x)

Generating Schemas

python scripts/generate_openapi_schemas.py [output_dir]

This generates:

  • openapi.json - Native 3.1 schema
  • openapi-3.0.json - Processed 3.0-compatible schema
  • metadata.json - Generation timestamp and git tag

What the 3.0 Processing Does

The get_openapi_schema(openapi_30_compatible=True) function converts:

  1. Nullable types: anyOf: [{type: string}, {type: null}] → {type: string, nullable: true}
  2. Const removal: Removes const when enum is also present (const is a 3.1-only keyword)
  3. Schema renaming: Converts auto-generated discriminated union names to clean names (e.g., RootModel_Annotated_Union_... → CreateLLMModelEndpointV1Request)
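
The first two conversions above can be sketched as a recursive walk over the schema dictionary. This is an illustration only, not the code behind get_openapi_schema: the function name to_openapi_30 is hypothetical, schema renaming is left out, and the walk does not distinguish schema keywords from property names that happen to be called anyOf or const.

```python
from typing import Any


def to_openapi_30(node: Any) -> Any:
    """Return a rewritten copy of `node`; the input is left untouched.

    Hypothetical helper covering only nullable types and const removal.
    """
    if isinstance(node, list):
        return [to_openapi_30(item) for item in node]
    if not isinstance(node, dict):
        return node

    out = {key: to_openapi_30(value) for key, value in node.items()}

    # Nullable types: drop the {type: null} branch and set nullable: true.
    any_of = out.get("anyOf")
    if isinstance(any_of, list):
        non_null = [s for s in any_of if s != {"type": "null"}]
        if non_null and len(non_null) < len(any_of):
            del out["anyOf"]
            if len(non_null) == 1 and isinstance(non_null[0], dict):
                # Inline the single remaining branch; sibling keys such as
                # title or default take precedence over the branch's keys.
                out = {**non_null[0], **out}
            else:
                out["anyOf"] = non_null
            out["nullable"] = True

    # Const removal: OpenAPI 3.0 has no const keyword, and enum already
    # carries the allowed value.
    if "const" in out and "enum" in out:
        del out["const"]

    return out
```

Applied to {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Name"}, the sketch yields a schema with type string, the original title, and nullable set to true.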

Client Library Workflow

┌─────────────┐     generate      ┌──────────────────┐
│ Model Engine│ ─────────────────▶│ openapi-3.0.json │
│   (FastAPI) │                   └────────┬─────────┘
└─────────────┘                            │
                                           │ copy to client repos
                              ┌────────────┴────────────┐
                              ▼                         ▼
                    ┌─────────────────┐       ┌─────────────────┐
                    │ launch-python-  │       │ other clients   │
                    │ client          │       │                 │
                    └────────┬────────┘       └────────┬────────┘
                             │                         │
                             ▼                         ▼
                    ┌─────────────────┐       ┌─────────────────┐
                    │OpenAPI Generator│       │OpenAPI Generator│
                    │ (python, 6.4.0) │       │ (any language)  │
                    └─────────────────┘       └─────────────────┘

Updating Client Libraries

When the API changes:

  1. Generate new schemas in model-engine:

    python scripts/generate_openapi_schemas.py specs/
  2. Copy specs/openapi-3.0.json to client repos as openapi.json

  3. Run the client's code generator (see client repo for specific commands)

  4. Test and commit
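
Step 2 is easy to get wrong because both schema files sit side by side. The sketch below copies the 3.0 file into a client checkout and refuses to copy a file that does not declare a 3.0.x version. It assumes the processed file sets the top-level openapi field to a 3.0.x value and that the client keeps its spec at the repo root; publish_spec is a hypothetical name, not a script in this repo.

```python
import json
import shutil
from pathlib import Path


def publish_spec(specs_dir: Path, client_repo: Path) -> Path:
    """Copy specs_dir/openapi-3.0.json to client_repo/openapi.json."""
    source = specs_dir / "openapi-3.0.json"
    with source.open() as f:
        version = json.load(f).get("openapi", "")

    # Assumption: the processed schema declares a 3.0.x version string.
    if not version.startswith("3.0"):
        raise ValueError(f"{source} declares OpenAPI {version!r}, expected 3.0.x")

    target = client_repo / "openapi.json"
    shutil.copyfile(source, target)
    return target
```

A call might look like publish_spec(Path("specs"), Path("../launch-python-client")), where the client path is a placeholder for wherever the client repo is checked out.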

Other Scripts

Generating OpenAI Types

For OpenAI-compatible V2 APIs, we generate Pydantic models from OpenAI's spec:

  1. Fetch spec from https://github.com/openai/openai-openapi/blob/master/openapi.yaml
  2. Run scripts/generate-openai-types.sh

Local Development

Control Plane Local Setup

The control plane (Gateway API server, Service Builder, K8s Cache) can be run entirely locally without GPU hardware or cloud credentials. Endpoint creation calls succeed against a fake k8s/SQS/ECR backend, letting you iterate on control plane code quickly.

Prerequisites: Python 3.10+, Docker

One-time setup

cd model-engine/

# Install Python dependencies
make install

# Start Postgres + Redis
make dev-up

# Apply database migrations
make dev-migrate

Run the API server

make dev-server

The gateway starts at http://localhost:5000 with auto-reload on file changes. Authentication is skipped automatically (SKIP_AUTH=true) so any token works.

Make API calls

# List model endpoints
curl http://localhost:5000/v1/model-endpoints \
  -H "Authorization: Bearer test-user"

# Create an LLM endpoint (uses fake k8s — no real infra needed)
curl -X POST http://localhost:5000/v1/llm/model-endpoints \
  -H "Authorization: Bearer test-user" \
  -H "Content-Type: application/json" \
  -d '{"name":"local-test","model_name":"meta-llama/Meta-Llama-3.1-8B-Instruct","inference_framework":"vllm","min_workers":0,"max_workers":1,"gpus":1,"gpu_type":"nvidia-ampere-a10","endpoint_type":"sync"}'
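
The same calls can be made from Python with only the standard library, which is handy in test scripts. This is a minimal sketch: build_request and call are hypothetical helpers, and the URL, paths, and token are the ones from the curl commands above.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:5000"  # gateway address used by make dev-server


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build a request against the local gateway; any bearer token works locally."""
    headers = {"Authorization": "Bearer test-user"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(request: urllib.request.Request):
    """Send the request and decode the JSON response (needs a running gateway)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage, with the gateway running:
#   call(build_request("GET", "/v1/model-endpoints"))
#   call(build_request("POST", "/v1/llm/model-endpoints", {"name": "local-test", ...}))
```

Keeping request construction separate from sending makes the helper testable without a server.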

Stop backing services

make dev-down

What LOCAL=true does

Running with LOCAL=true (set automatically by make dev-server and make dev-migrate):

  • Skips the GIT_TAG env var requirement
  • Uses a fake queue delegate (no SQS/Azure Service Bus needed)
  • Uses a fake Docker repository (no ECR/ACR/GAR needed)
  • Auth is skipped when identity_service_url is absent from config (default)
  • Postgres and Redis are real local services (via docker-compose)

This means you can create/update/delete endpoints via the API and see them reflected in Postgres, without any Kubernetes cluster or cloud account.

Running individual components manually

If you prefer to set env vars yourself rather than use make:

export LOCAL=true
export GIT_TAG=local
export ML_INFRA_DATABASE_URL=postgresql://postgres:password@localhost:5432/llm_engine
export DEPLOY_SERVICE_CONFIG_PATH=$(pwd)/service_configs/service_config_local.yaml

# Gateway
start-fastapi-server --port 5000 --num-workers 1 --debug

# Database migration
bash model_engine_server/db/migrations/run_database_migration.sh

Testing the HTTP Forwarder

Start an endpoint on port 5005:

# Adjust these values for your environment (image registry, model, local paths)
export IMAGE=692474966980.dkr.ecr.us-west-2.amazonaws.com/vllm:0.10.1.1-rc2
export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
export MODEL_PATH=/data/model_files/$MODEL
export REPO_PATH=/mnt/home/dmchoi/repos/scale

docker run \
    --runtime nvidia \
    --shm-size=16gb \
    --gpus '"device=0,1,2,3"' \
    -v $MODEL_PATH:/workspace/model_files:ro \
    -v ${REPO_PATH}/llm-engine/model-engine/model_engine_server/inference/vllm/vllm_server.py:/workspace/vllm_server.py \
    -p 5005:5005 \
    --name vllm \
    ${IMAGE} \
    python -m vllm_server --model model_files --port 5005 --disable-log-requests --max-model-len 4096 --max-num-seqs 16 --enforce-eager

Run the forwarder:

GIT_TAG=test python model_engine_server/inference/forwarding/http_forwarder.py \
    --config model_engine_server/inference/configs/service--http_forwarder.yaml \
    --num-workers 1 \
    --set "forwarder.sync.extra_routes=['/v1/chat/completions','/v1/completions']" \
    --set "forwarder.stream.extra_routes=['/v1/chat/completions','/v1/completions']" \
    --set "forwarder.sync.healthcheck_route=/health" \
    --set "forwarder.stream.healthcheck_route=/health"

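Test it by sending a request through the forwarder (the curl below targets port 5000, not the endpoint's port 5005):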

curl -X POST localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"args": {"model":"meta-llama/Meta-Llama-3.1-8B-Instruct", "messages":[{"role": "system", "content": "Hello"}], "max_tokens":100}}'
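
Note that the request body above nests the OpenAI-style payload under an "args" key. A minimal sketch of building that body in Python follows; forwarder_body is a hypothetical helper, and whether every forwarder route expects this envelope is an assumption drawn only from the curl example.

```python
import json


def forwarder_body(model: str, messages: list, max_tokens: int = 100) -> bytes:
    """Build the JSON body for the forwarder, nesting the payload under "args"."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return json.dumps({"args": payload}).encode("utf-8")
```

The returned bytes can be sent as the data of any HTTP POST with a Content-Type of application/json.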