Target Version
No response
Category
Load Balancing
Motivation & Problem Statement
负载均衡能够有效平衡api key的消耗,以防止单点密集消耗,触发服务端的限制。
但不同于传统的C/S服务,模型服务的消耗成本更大——token是确定性的费用。
token的消耗主要来源于input和output,output费用往往是不可节省的,但input可以通过cache实现大幅成本削减(对于用户和服务商都是),这一点在长任务下更为显著。
一旦切换负载,cache很可能"丢失",造成隐性的成本。
因此,兼顾负载均衡和缓存命中,是有必要的(当然,也取决于场景和具体模型成本)。
Affinity,在k8s中常用的特性之一,非常适配当前的场景需求,将作为未来的计划之一。
Load balancing can effectively balance the consumption of API keys to prevent excessive concentrated usage on a single endpoint, which could trigger server-side restrictions. However, unlike traditional client-server services, model services come with higher consumption costs—tokens represent a deterministic expense.
Token consumption mainly comes from input and output. While output costs are often unavoidable, input costs can be significantly reduced through caching (benefiting both users and service providers), especially for long-running tasks.
Once the load is switched, the cache is likely to be lost, resulting in hidden costs.
Therefore, it is necessary to balance load balancing and cache hit rates (depending, of course, on the scenario and the specific model cost).
Affinity, a commonly used feature in Kubernetes, is well-suited for this scenario and will be part of future plans.
Proposed Design
No response
Target Version
No response
Category
Load Balancing
Motivation & Problem Statement
负载均衡能够有效平衡api key的消耗,以防止单点密集消耗,触发服务端的限制。
但不同于传统的C/S服务,模型服务的消耗成本更大——token是确定性的费用。
token的消耗主要来源于input和output,output费用往往是不可节省的,但input可以通过cache实现大幅成本削减(对于用户和服务商都是),这一点在长任务下更为显著。
一旦切换负载,cache很可能"丢失",造成隐性的成本。
因此,兼顾负载均衡和缓存命中,是有必要的(当然,也取决于场景和具体模型成本)。
Affinity,在k8s中常用的特性之一,非常适配当前的场景需求,将作为未来的计划之一。
Load balancing can effectively balance the consumption of API keys to prevent excessive concentrated usage on a single endpoint, which could trigger server-side restrictions. However, unlike traditional client-server services, model services come with higher consumption costs—tokens represent a deterministic expense.
Token consumption mainly comes from input and output. While output costs are often unavoidable, input costs can be significantly reduced through caching (benefiting both users and service providers), especially for long-running tasks.
Once the load is switched, the cache is likely to be lost, resulting in hidden costs.
Therefore, it is necessary to balance load balancing and cache hit rates (depending, of course, on the scenario and the specific model cost).
Affinity, a commonly used feature in Kubernetes, is well-suited for this scenario and will be part of future plans.
Proposed Design
No response