[TEP-1076] Affinity - Keep both load balance and token cache #1076

@FFengIll

Description

Target Version

No response

Category

Load Balancing

Motivation & Problem Statement

Load balancing spreads consumption across API keys so that no single key absorbs concentrated usage and triggers server-side restrictions. Unlike traditional client-server services, however, model services carry a higher cost per request: every token is a direct, billed expense.

Token consumption mainly comes from input and output. While output costs are often unavoidable, input costs can be significantly reduced through caching (benefiting both users and service providers), especially for long-running tasks.

Once the load balancer switches a task to a different API key, the cache built up on the previous key is likely to be lost, which adds a hidden cost.
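To make the hidden cost concrete, here is a rough sketch of the input-side arithmetic. The function name, the price, and the cached-token ratio are all hypothetical placeholders, not figures from any particular provider; real billing rules vary by model and vendor.

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_mtok: float, cached_price_ratio: float) -> float:
    """Input cost of one request when `cached_tokens` of the prompt hit the cache.

    All parameters are illustrative assumptions:
    - price_per_mtok: price per million uncached input tokens
    - cached_price_ratio: fraction of that price charged for cached tokens
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_price_ratio
    return billable * price_per_mtok / 1_000_000


# Hypothetical numbers for a long-running task with a 100k-token prompt.
PRICE = 2.0   # assumed price per million input tokens
RATIO = 0.25  # assumed: cached tokens billed at 25% of the normal price

warm = input_cost(100_000, 90_000, PRICE, RATIO)  # same key, cache mostly hit
cold = input_cost(100_000, 0, PRICE, RATIO)       # switched key, cache lost
print(f"warm={warm:.3f} cold={cold:.3f} extra={cold - warm:.3f}")
```

Under these assumed numbers the cold request costs roughly three times the warm one, and the gap grows with every turn of a long task.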

Therefore, it is worth weighing load balancing against cache hit rate (how much weight each deserves depends, of course, on the scenario and the specific model cost).

Affinity, a commonly used feature in Kubernetes, fits this scenario well and is planned as future work.
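Since no design has been proposed yet, the following is only one possible shape for such a feature, not the project's implementation. It assumes each request carries a session identifier, keeps a session pinned to the key it last used, and re-pins only when that key is more than `max_skew` in-flight requests busier than the least-loaded key. `AffinityBalancer`, `acquire`, `release`, and `max_skew` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class AffinityBalancer:
    """Sticky key selection with a load-skew escape hatch (illustrative only)."""
    keys: list[str]
    max_skew: int = 4  # assumed tolerance, measured in in-flight requests
    _load: dict[str, int] = field(default_factory=dict)
    _sticky: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.keys:
            raise ValueError("at least one key is required")
        self._load = {k: 0 for k in self.keys}

    def acquire(self, session_id: str) -> str:
        """Pick a key for this session and count the request as in flight."""
        least = min(self.keys, key=lambda k: self._load[k])
        key = self._sticky.get(session_id)
        # Keep the pinned key (and its warm cache) unless it is too overloaded.
        if key is None or self._load[key] - self._load[least] > self.max_skew:
            key = least
            self._sticky[session_id] = key
        self._load[key] += 1
        return key

    def release(self, key: str) -> None:
        """Mark one request on `key` as finished."""
        self._load[key] = max(0, self._load[key] - 1)


balancer = AffinityBalancer(["key-a", "key-b"], max_skew=1)
first = balancer.acquire("session-1")   # pinned to the least-loaded key
second = balancer.acquire("session-2")  # lands on the other key
again = balancer.acquire("session-1")   # stays on its pinned key
```

A larger `max_skew` favors cache hits; a smaller one favors even load. That single knob is the trade-off the motivation describes, and a real design would also need to expire stale session pins.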

Proposed Design

No response
