Decreasing warmup for use on spot instances #819

@mattkrick

Feature request

I'd like to deploy TEI (CPU v1.8, serving a quantized Qwen3 0.6B) on a spot instance. To do so, the container would have to start up within 14 seconds to maintain QoS.
Poking around, I saw MAX_WARMUP_SEQUENCE_LENGTH, but that appears to be used only for Intel HPU deployments. I have already set my max tokens to a modest value (MAX_BATCH_TOKENS=1028).
Is there anything else I can do to drive down warmup time, or anything I can cache to speed it up for a subsequent startup?
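
For comparing settings against the 14-second budget, here is a minimal sketch of how time-to-ready could be measured. It assumes the container answers `GET /health` with HTTP 200 once warmup has finished; the helper names `http_probe` and `time_to_ready` and the polling interval are hypothetical, chosen only for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if a GET on `url` answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def time_to_ready(probe, budget_s=14.0, interval_s=0.25,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll `probe` until it returns True.

    Returns the elapsed seconds, or None once `budget_s` has passed
    without a successful probe. `clock` and `sleep` are injectable so
    the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= budget_s:
            return None
        sleep(interval_s)
```

Started right after the container is launched, something like `time_to_ready(lambda: http_probe("http://localhost:8080/health"))` would report the startup time; the host and port are assumptions and depend on how the container's port is published.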

Motivation

Reduce costs by moving TEI replicas to spot instances.

Your contribution

Happy to write a PR if there's a suitable path forward, or to update the docs if a strategy already exists.
