The Model Engine is an API server that allows users to create, deploy, edit, and delete machine learning endpoints.
- Gateway - REST API server. Routes are defined in `model_engine_server.api`.
- Service Builder - Creates inference pods when endpoints are created/edited via `[POST,PUT] /v1/model-endpoints`.
- Kubernetes Cache - Stores endpoint metadata in Redis to reduce API server load.
- Celery Autoscaler - Automatically scales inference pods based on request volume for async endpoints.
Install global dev requirements and pre-commit hooks from the llm-engine root:
pip install -r ../requirements-dev.txt
pre-commit install

pip install -r requirements.txt
pip install -r requirements-test.txt
pip install -r requirements_override.txt
pip install -e .

Set up mypy:
mypy . --install-types

Run the tests:
pytest

Unit tests are in `tests/unit`.
Model Engine is the source of truth for the Launch API schema. We generate OpenAPI schemas that are consumed by client libraries (e.g., launch-python-client).
FastAPI with Pydantic v2 generates OpenAPI 3.1 schemas. However, code generators like OpenAPI Generator 6.x have incomplete 3.1 support. We provide two versions:
| File | Version | Use Case |
|---|---|---|
| `openapi.json` | OpenAPI 3.1 | Native FastAPI output, documentation |
| `openapi-3.0.json` | OpenAPI 3.0 | Code generation (OpenAPI Generator 6.x) |
python scripts/generate_openapi_schemas.py [output_dir]

This generates:
- `openapi.json` - Native 3.1 schema
- `openapi-3.0.json` - Processed 3.0-compatible schema
- `metadata.json` - Generation timestamp and git tag
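A quick sanity check after generation is to read back the version each file declares. The sketch below relies only on the top-level `openapi` field that the OpenAPI specification defines. The helper name `schema_versions` is hypothetical, and whether the processed file actually declares a `3.0.x` version is an assumption the check is meant to confirm, not something the script is documented to guarantee.

```python
import json
from pathlib import Path


def schema_versions(output_dir: str) -> dict[str, str]:
    """Map each generated schema file name to the OpenAPI version it declares."""
    versions = {}
    for name in ("openapi.json", "openapi-3.0.json"):
        path = Path(output_dir) / name
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # "openapi" is the top-level version string defined by the OpenAPI spec.
        versions[name] = document.get("openapi", "<missing>")
    return versions
```

Calling `schema_versions("specs/")` after running the generator returns a small dictionary that can be eyeballed or asserted on in CI.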
The `get_openapi_schema(openapi_30_compatible=True)` function converts:
- Nullable types: `anyOf: [{type: string}, {type: null}]` → `{type: string, nullable: true}`
- Const removal: Removes `const` when `enum` is present (3.1-only feature)
- Schema renaming: Converts auto-generated discriminated union names to clean names (e.g., `RootModel_Annotated_Union_...` → `CreateLLMModelEndpointV1Request`)
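To make the first two conversions concrete, here is a minimal sketch of how such a rewrite could work on a plain dictionary. It is not the code behind `get_openapi_schema`; the function name `to_openapi_30` is hypothetical, schema renaming is left out, and the walk treats every nested dictionary as a schema, so a property that happens to be named `const` or `anyOf` would need extra care in a real implementation.

```python
from typing import Any


def to_openapi_30(node: Any) -> Any:
    """Return a copy of `node` with two 3.1-only constructs rewritten for 3.0."""
    if isinstance(node, list):
        return [to_openapi_30(item) for item in node]
    if not isinstance(node, dict):
        return node

    # Convert children first so nested schemas are already 3.0-compatible.
    out = {key: to_openapi_30(value) for key, value in node.items()}

    # Nullable types: anyOf [X, {type: null}] becomes X plus nullable: true.
    variants = out.get("anyOf")
    if isinstance(variants, list):
        non_null = [v for v in variants if v != {"type": "null"}]
        if len(non_null) < len(variants):
            del out["anyOf"]
            if len(non_null) == 1 and isinstance(non_null[0], dict):
                out.update(non_null[0])
            elif non_null:
                out["anyOf"] = non_null
            out["nullable"] = True

    # Const removal: drop const only when enum already carries the value.
    if "const" in out and "enum" in out:
        del out["const"]

    return out
```

For example, `to_openapi_30({"anyOf": [{"type": "string"}, {"type": "null"}]})` yields a schema with `type: string` and `nullable: true`, and the input dictionary is left unmodified.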
┌─────────────┐     generate      ┌──────────────────┐
│ Model Engine│ ─────────────────▶│ openapi-3.0.json │
│  (FastAPI)  │                   └────────┬─────────┘
└─────────────┘                            │
                                           │ copy to client repos
                              ┌────────────┴────────────┐
                              ▼                         ▼
                     ┌─────────────────┐       ┌─────────────────┐
                     │ launch-python-  │       │ other clients   │
                     │ client          │       │                 │
                     └────────┬────────┘       └────────┬────────┘
                              │                         │
                              ▼                         ▼
                     ┌─────────────────┐       ┌─────────────────┐
                     │OpenAPI Generator│       │OpenAPI Generator│
                     │ (python, 6.4.0) │       │ (any language)  │
                     └─────────────────┘       └─────────────────┘
When the API changes:
- Generate new schemas in model-engine:
  python scripts/generate_openapi_schemas.py specs/
- Copy `specs/openapi-3.0.json` to client repos as `openapi.json`
- Run the client's code generator (see client repo for specific commands)
- Test and commit
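The copy step is easy to get wrong by hand (for instance, copying the 3.1 file by mistake), so it may be worth scripting. The sketch below is one way to do it. `publish_spec` is a hypothetical helper, and placing `openapi.json` at the root of the client checkout is an assumption; adjust the target path to wherever the client repo actually keeps its spec.

```python
import shutil
from pathlib import Path


def publish_spec(specs_dir: str, client_repo: str) -> Path:
    """Copy the 3.0-compatible schema into a client checkout as openapi.json."""
    source = Path(specs_dir) / "openapi-3.0.json"
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; run the generator first")
    # Assumption: the client repo expects the spec at its root.
    target = Path(client_repo) / "openapi.json"
    shutil.copyfile(source, target)
    return target
```

Running the client's code generator and its tests still happens in the client repo afterwards.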
For OpenAI-compatible V2 APIs, we generate Pydantic models from OpenAI's spec:
- Fetch spec from https://github.com/openai/openai-openapi/blob/master/openapi.yaml
- Run `scripts/generate-openai-types.sh`
The control plane (Gateway API server, Service Builder, K8s Cache) can be run entirely locally without GPU hardware or cloud credentials. Endpoint creation calls succeed against a fake k8s/SQS/ECR backend, letting you iterate on control plane code quickly.
Prerequisites: Python 3.10+, Docker
cd model-engine/
# Install Python dependencies
make install
# Start Postgres + Redis
make dev-up
# Apply database migrations
make dev-migrate

make dev-server

The gateway starts at http://localhost:5000 with auto-reload on file changes.
Authentication is skipped automatically (SKIP_AUTH=true) so any token works.
# List model endpoints
curl http://localhost:5000/v1/model-endpoints \
-H "Authorization: Bearer test-user"
# Create an LLM endpoint (uses fake k8s — no real infra needed)
curl -X POST http://localhost:5000/v1/llm/model-endpoints \
-H "Authorization: Bearer test-user" \
-H "Content-Type: application/json" \
-d '{"name":"local-test","model_name":"meta-llama/Meta-Llama-3.1-8B-Instruct","inference_framework":"vllm","min_workers":0,"max_workers":1,"gpus":1,"gpu_type":"nvidia-ampere-a10","endpoint_type":"sync"}'

# Stop the local services when finished
make dev-down

Running with LOCAL=true (set automatically by make dev-server and make dev-migrate):
- Skips the `GIT_TAG` env var requirement
- Uses a fake queue delegate (no SQS/Azure Service Bus needed)
- Uses a fake Docker repository (no ECR/ACR/GAR needed)
- Auth is skipped when `identity_service_url` is absent from config (default)
- Postgres and Redis are real local services (via docker-compose)
This means you can create/update/delete endpoints via the API and see them reflected in Postgres, without any Kubernetes cluster or cloud account.
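The curl calls earlier in this section can also be scripted with the Python standard library, which is handy for quick iteration loops against the local gateway. The sketch below only builds the requests; `build_request` and `send` are hypothetical helper names, and the base URL and `test-user` token are taken from the curl examples.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://localhost:5000"  # local gateway address used by the curl examples


def build_request(
    path: str, payload: Optional[dict] = None, token: str = "test-user"
) -> urllib.request.Request:
    """Build a GET request, or a JSON POST when a payload is given."""
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(
        f"{BASE_URL}{path}", data=data, headers=headers, method=method
    )


def send(request: urllib.request.Request) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the dev server running, `send(build_request("/v1/model-endpoints"))` mirrors the first curl call, and passing the endpoint JSON as `payload` to `build_request("/v1/llm/model-endpoints", payload)` mirrors the second.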
If you prefer to set env vars yourself rather than use make:
export LOCAL=true
export GIT_TAG=local
export ML_INFRA_DATABASE_URL=postgresql://postgres:password@localhost:5432/llm_engine
export DEPLOY_SERVICE_CONFIG_PATH=$(pwd)/service_configs/service_config_local.yaml
# Gateway
start-fastapi-server --port 5000 --num-workers 1 --debug
# Database migration
bash model_engine_server/db/migrations/run_database_migration.sh

Start an endpoint on port 5005:
export IMAGE=692474966980.dkr.ecr.us-west-2.amazonaws.com/vllm:0.10.1.1-rc2
export MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
export MODEL_PATH=/data/model_files/$MODEL
export REPO_PATH=/mnt/home/dmchoi/repos/scale
docker run \
--runtime nvidia \
--shm-size=16gb \
--gpus '"device=0,1,2,3"' \
-v $MODEL_PATH:/workspace/model_files:ro \
-v ${REPO_PATH}/llm-engine/model-engine/model_engine_server/inference/vllm/vllm_server.py:/workspace/vllm_server.py \
-p 5005:5005 \
--name vllm \
${IMAGE} \
python -m vllm_server --model model_files --port 5005 --disable-log-requests --max-model-len 4096 --max-num-seqs 16 --enforce-eager

Run the forwarder:
GIT_TAG=test python model_engine_server/inference/forwarding/http_forwarder.py \
--config model_engine_server/inference/configs/service--http_forwarder.yaml \
--num-workers 1 \
--set "forwarder.sync.extra_routes=['/v1/chat/completions','/v1/completions']" \
--set "forwarder.stream.extra_routes=['/v1/chat/completions','/v1/completions']" \
--set "forwarder.sync.healthcheck_route=/health" \
--set "forwarder.stream.healthcheck_route=/health"

Test it:
curl -X POST localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"args": {"model":"meta-llama/Meta-Llama-3.1-8B-Instruct", "messages":[{"role": "system", "content": "Hello"}], "max_tokens":100}}'
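Note that the request body in the curl call wraps the chat completion parameters in an `args` object instead of sending them at the top level. A small helper makes that envelope harder to forget when scripting tests. `forwarder_payload` is a hypothetical name, and the envelope shape is inferred from the curl example above, not from a documented schema.

```python
from typing import Any


def forwarder_payload(
    model: str, messages: list[dict[str, str]], max_tokens: int = 100
) -> dict[str, Any]:
    """Wrap chat completion parameters in the `args` envelope used above."""
    return {
        "args": {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
        }
    }
```

Serializing the result with `json.dumps` gives a body equivalent to the one passed to `curl -d`.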