Feature request
I'd like to deploy TEI (CPU v1.8, serving a quantized Qwen3 0.6B) on a spot instance. For that to work, the container would have to start up within 14 seconds to maintain QoS.
Poking around, I saw MAX_WARMUP_SEQUENCE_LENGTH, but that appears to only be used for Intel HPU deployments. I already set my max tokens to a modest value (MAX_BATCH_TOKENS=1028).
Is there anything else I can do to drive down warmup time, or anything I can cache to speed it up for a subsequent startup?
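For anyone who wants to compare startup times against the 14-second budget, here is a minimal sketch of how time-to-ready could be measured by polling a health endpoint. It is only an illustration under assumptions: the helper names `wait_until_ready` and `http_health_probe` are hypothetical, and the URL in the usage note assumes the container is published on localhost port 8080 and answers on `/health`.

```python
import time
import urllib.request
from typing import Callable, Optional


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 14.0,   # startup budget from this issue
    interval_s: float = 0.25,  # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it returns True.

    Returns the elapsed seconds at the first successful probe,
    or None if the budget is exhausted first.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def http_health_probe(url: str) -> Callable[[], bool]:
    """Build a probe that is True once `url` answers with HTTP 200."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=1.0) as resp:
                return resp.status == 200
        except OSError:
            # Covers connection refused, timeouts, and non-2xx responses
            # (urllib's URLError and HTTPError are OSError subclasses).
            return False
    return probe
```

Started right after launching the container, `wait_until_ready(http_health_probe("http://localhost:8080/health"))` would return the number of seconds until the server first reports healthy, or `None` if it misses the budget. The clock and sleep functions are injectable so the loop can be checked without a running server.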
Motivation
Reduce costs by moving TEI replicas to spot instances
Your contribution
Happy to write a PR if there's a suitable path forward, or update the docs if a strategy already exists