# [Performance] `UpdateLastUsed` write amplification causes large amounts of unnecessary Redis traffic (1.85 TB in 6.5 days measured on a single deployment)

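## Environment

- sub2api version: `weishaw/sub2api:0.1.114` (`latest`)
- Deployment: two-node hot standby (primary host B + secondary host C); Redis runs centrally on B, and C connects to B's Redis over the public internet
- Account count: ~900+ accounts (mixed anthropic + openai)
- Typical load: ~1028 Redis commands/s on average

## Problem description

Redis write traffic is abnormally large. Measured over 6.5 days: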

```
redis-cli INFO stats:
  total_commands_processed:   577,053,596  (577 million)
  total_net_input_bytes:    1,880,785,779,720  (1.88 TB)
  total_net_output_bytes:   1,875,727,530,269  (1.88 TB)

INFO commandstats (top by calls):
  cmdstat_set: calls=338,501,700   ← ~58.7% of all commands
  cmdstat_zremrangebyscore: 53,625,772
  cmdstat_zcard: 53,625,853
  ...
```

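- **`SET` accounts for ~58.7% of all commands**: 338 million calls in 6.5 days ≈ 603 calls/s
- Average inbound payload per command is **~3.3 KB** (`total_net_input_bytes` / `total_commands_processed`), far above normal Redis usage patterns (typically <200 B)
- The entire Redis dataset is only 13.8 MB (3929 keys), yet inbound traffic reached about 1.88 TB: **massive write amplification**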
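
The ratios above follow directly from the `INFO` counters. As a quick sanity check, the sketch below recomputes them; the constants are copied from the output above, the 6.5-day window is the stated observation period, and reading 13.8 MB as decimal megabytes is an assumption.

```go
package main

import "fmt"

// Counters copied from the INFO output in this report.
const (
	totalCommands = 577_053_596
	setCalls      = 338_501_700
	netInputBytes = 1_880_785_779_720
	datasetBytes  = 13_800_000         // 13.8 MB, assuming decimal megabytes
	windowSeconds = 6.5 * 24 * 60 * 60 // 6.5-day observation window
)

// trafficRatios holds the figures derived from the raw counters.
type trafficRatios struct {
	SetShare        float64 // fraction of all commands that are SET
	SetPerSecond    float64
	CmdPerSecond    float64
	InputPerCommand float64 // average inbound bytes per command
	InputVsDataset  float64 // inbound bytes divided by resident dataset size
}

func deriveRatios() trafficRatios {
	return trafficRatios{
		SetShare:        float64(setCalls) / float64(totalCommands),
		SetPerSecond:    float64(setCalls) / windowSeconds,
		CmdPerSecond:    float64(totalCommands) / windowSeconds,
		InputPerCommand: float64(netInputBytes) / float64(totalCommands),
		InputVsDataset:  float64(netInputBytes) / float64(datasetBytes),
	}
}

func main() {
	r := deriveRatios()
	fmt.Printf("SET share:             %.1f%%\n", r.SetShare*100)
	fmt.Printf("SET calls per second:  %.0f\n", r.SetPerSecond)
	fmt.Printf("commands per second:   %.0f\n", r.CmdPerSecond)
	fmt.Printf("input bytes / command: %.0f\n", r.InputPerCommand)
	fmt.Printf("input / dataset size:  %.0fx\n", r.InputVsDataset)
}
```

The last figure is the one that stands out: inbound traffic is about five orders of magnitude larger than the data actually held in Redis.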

## Root cause

Sampling with `redis-cli --bigkeys`:

```
Sampled 3918 keys in the keyspace!
3863 strings with 9,623,623 bytes (98.60% of keys, avg size 2491.23 bytes)

Biggest strings:
  sched:acc:54  = 12,426 bytes
  sched:acc:39  = 11,979 bytes
  sched:acc:642 = 10,337 bytes
  ...
```

Each `sched:acc:<id>` key holds the JSON serialization of a complete account object, typically 3-12 KB, containing:

```json
{
  "ID": 968,
  "Name": "...",
  "Platform": "anthropic",
  "Credentials": { "api_key": "...", "base_url": "..." },
  "Extra": { "quota_limit": 100, "quota_used": 2.37 },
  "Priority": 1,
  "Concurrency": 10,
  "RateMultiplier": 1.2,
  "Schedulable": true,
  "Status": "active",
  "LastUsedAt": "2026-04-17T14:13:37.997464+08:00",   // ← frequently changing field
  "UpdatedAt": "...",
  "CreatedAt": "...",
  ... (30+ fields)
}
```

### Offending code

`backend/internal/repository/scheduler_cache.go:170-205`:

```go
func (c *schedulerCache) UpdateLastUsed(ctx context.Context, updates map[int64]time.Time) error {
    if len(updates) == 0 {
        return nil
    }
    keys := make([]string, 0, len(updates))
    ids := make([]int64, 0, len(updates))
    for id := range updates {
        keys = append(keys, schedulerAccountKey(strconv.FormatInt(id, 10)))
        ids = append(ids, id)
    }

    values, err := c.rdb.MGet(ctx, keys...).Result()  // 1. read the full JSON (3-12 KB each)
    if err != nil {
        return err
    }

    pipe := c.rdb.Pipeline()
    for i, val := range values {
        if val == nil {
            continue
        }
        account, err := decodeCachedAccount(val)       // 2. deserialize
        if err != nil {
            return err
        }
        account.LastUsedAt = ptrTime(updates[ids[i]])  // 3. change only the LastUsedAt field
        updated, err := json.Marshal(account)          // 4. re-serialize the whole object
        if err != nil {
            return err
        }
        pipe.Set(ctx, keys[i], updated, 0)             // 5. SET the entire 3-12 KB JSON
    }
    _, err = pipe.Exec(ctx)
    return err
}
```

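**Behavior**: every time the scheduling hot path triggers `UpdateLastUsed`, updating the single `LastUsedAt` timestamp field requires:
1. `MGET` the full account JSON (3-12 KB each)
2. Deserialize it
3. Modify one field
4. Marshal it again
5. `SET` the entire 3-12 KB value back to Redis

Reconciling the numbers:
- 338 million SETs × ~5 KB average ≈ **1.7 TB of SET payload**
- This roughly matches the 1.88 TB reported by `total_net_input_bytes`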
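
The per-update amplification can also be seen locally without Redis. The sketch below compares the bytes produced by re-marshaling a whole object against storing only the timestamp. `cachedAccount` and `demoAccount` are hypothetical stand-ins (the real object has 30+ fields), and the 4000-byte credential is just padding to land in the observed size range.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// cachedAccount is a hypothetical, heavily trimmed stand-in for the cached
// account object; the real one has 30+ fields.
type cachedAccount struct {
	ID          int64
	Name        string
	Platform    string
	Credentials map[string]string
	LastUsedAt  *time.Time
}

// demoAccount builds an account whose credentials pad the JSON to a few KB.
func demoAccount(credentialBytes int) cachedAccount {
	return cachedAccount{
		ID:          968,
		Name:        "demo",
		Platform:    "anthropic",
		Credentials: map[string]string{"api_key": strings.Repeat("x", credentialBytes)},
	}
}

// payloadSizes returns the value size written when the whole object is
// re-marshaled versus when only the timestamp is stored as a decimal
// Unix-nano string. Key names and protocol overhead are not counted.
func payloadSizes(acc cachedAccount, lastUsedUnixNano int64) (fullJSON, timestampOnly int, err error) {
	t := time.Unix(0, lastUsedUnixNano)
	acc.LastUsedAt = &t
	encoded, err := json.Marshal(acc)
	if err != nil {
		return 0, 0, err
	}
	return len(encoded), len(strconv.FormatInt(lastUsedUnixNano, 10)), nil
}

func main() {
	full, ts, err := payloadSizes(demoAccount(4000), time.Now().UnixNano())
	if err != nil {
		panic(err)
	}
	fmt.Printf("full JSON: %d B, timestamp only: %d B, ratio: ~%dx\n", full, ts, full/ts)
}
```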

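## Impact

1. **Public bandwidth cost**: in multi-host deployments sub2api connects to Redis over the public internet, so this write amplification translates directly into the bandwidth bill. On this deployment eth0 was observed at about 286 GB/day (572 GB/day in both directions combined), of which roughly half (~140 GB/day) comes from cross-host Redis writes.
2. **Redis CPU and network**: a single-host deployment does not pay for public bandwidth, but every hot-path call still performs MGET + JSON decode + marshal + SET, so CPU overhead is significantly higher than necessary.
3. **Scheduling latency**: even with pipelining, each call still does N rounds of JSON decode/encode plus large-value SETs, which becomes a bottleneck at high QPS.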

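## Suggested fixes

Ordered from smallest to largest change:

### Option A: throttle writes (smallest change, recommended first)

Add an in-memory coalescing window in front of `UpdateLastUsed`: each account is actually written to Redis at most once every N seconds (for example 30s). The scheduler hot path is the suggested place for this.

- Pros: no schema change; the PR is a few dozen lines
- Cost: `LastUsedAt` precision drops from milliseconds to the 30s range, which the vast majority of scheduling algorithms should not be sensitive to
- Expected effect: SET count and bytes drop by **90%+**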
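
A minimal sketch of what such a window could look like, assuming the caller filters the update map before handing it to `UpdateLastUsed`. `lastUsedThrottle`, `Filter`, and `demoWrites` are hypothetical names, not existing sub2api code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// lastUsedThrottle is a hypothetical helper, not part of sub2api. It lets at
// most one LastUsedAt update per account through in each window.
type lastUsedThrottle struct {
	mu      sync.Mutex
	window  time.Duration
	written map[int64]time.Time // account ID -> timestamp last passed through
}

func newLastUsedThrottle(window time.Duration) *lastUsedThrottle {
	return &lastUsedThrottle{window: window, written: make(map[int64]time.Time)}
}

// Filter returns the subset of updates that are due for a real Redis write.
// It compares the update timestamps themselves, so it needs no wall clock.
func (t *lastUsedThrottle) Filter(updates map[int64]time.Time) map[int64]time.Time {
	t.mu.Lock()
	defer t.mu.Unlock()
	due := make(map[int64]time.Time, len(updates))
	for id, ts := range updates {
		if prev, ok := t.written[id]; ok && ts.Sub(prev) < t.window {
			continue // this account was written less than one window ago
		}
		t.written[id] = ts
		due[id] = ts
	}
	return due
}

// demoWrites feeds one account's updates at the given second offsets through
// a 30s throttle and reports which offsets would reach Redis.
func demoWrites(offsetsSec []int) []int {
	th := newLastUsedThrottle(30 * time.Second)
	base := time.Unix(1_700_000_000, 0)
	var written []int
	for _, off := range offsetsSec {
		ts := base.Add(time.Duration(off) * time.Second)
		if len(th.Filter(map[int64]time.Time{54: ts})) == 1 {
			written = append(written, off)
		}
	}
	return written
}

func main() {
	fmt.Println(demoWrites([]int{0, 10, 29, 30, 45, 61})) // prints [0 30 61]
}
```

The call site would then become something like `c.UpdateLastUsed(ctx, throttle.Filter(updates))`; the existing early return on an empty map already covers the case where everything is filtered out. Two caveats of this particular sketch: an account is marked as written before the Redis call succeeds, so a failed write is not retried until the next window, and the last update of a burst is dropped, so the cached value can lag by up to one window.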

### Option B: split the hot field into its own key (root fix)

Move `LastUsedAt` out of the main `sched:acc:<id>` JSON and store it separately under `sched:acc:last_used:<id>` as a Unix nanosecond timestamp (10-20 bytes):

```go
func (c *schedulerCache) UpdateLastUsed(ctx context.Context, updates map[int64]time.Time) error {
    if len(updates) == 0 {
        return nil
    }
    pipe := c.rdb.Pipeline()
    for id, t := range updates {
        // schedulerLastUsedKey is a new helper (to be added) returning sched:acc:last_used:<id>.
        pipe.Set(ctx, schedulerLastUsedKey(id), t.UnixNano(), 0)
    }
    _, err := pipe.Exec(ctx)
    return err
}
```

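When reading an account, merge the two keys (an MGET can be added inside `GetAccount`).
- Pros: eliminates the write amplification at the root; each SET shrinks from 3-12 KB to 10-30 bytes
- Expected effect: write traffic drops by **98%+** (~1.7 TB → on the order of 30 GB)
- Cost: `GetAccount`/`GetAccounts` need an extra merge read, i.e. one more MGET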
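
The merge read could look roughly like the sketch below. It only covers the pure merge step and assumes the caller has already issued the extra MGET; `account` and `mergeLastUsed` are hypothetical names. go-redis reports each MGET element as either `nil` or a `string`, which is what the type switch relies on.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// account is a hypothetical, trimmed stand-in for the decoded cache object.
type account struct {
	ID         int64
	LastUsedAt *time.Time
}

// mergeLastUsed overlays timestamps read from the dedicated keys onto already
// decoded accounts. values[i] must belong to accounts[i] and is expected to be
// either nil (key missing) or a decimal Unix-nano string.
func mergeLastUsed(accounts []*account, values []interface{}) error {
	if len(accounts) != len(values) {
		return fmt.Errorf("got %d values for %d accounts", len(values), len(accounts))
	}
	for i, v := range values {
		if v == nil {
			continue // no dedicated key yet: keep whatever the main JSON carried
		}
		s, ok := v.(string)
		if !ok {
			return fmt.Errorf("account %d: unexpected value type %T", accounts[i].ID, v)
		}
		ns, err := strconv.ParseInt(s, 10, 64)
		if err != nil {
			return fmt.Errorf("account %d: %w", accounts[i].ID, err)
		}
		t := time.Unix(0, ns)
		accounts[i].LastUsedAt = &t
	}
	return nil
}

func main() {
	accounts := []*account{{ID: 54}, {ID: 39}}
	// Simulated MGET reply: account 54 has a timestamp, account 39 does not.
	if err := mergeLastUsed(accounts, []interface{}{"1776406417997464000", nil}); err != nil {
		panic(err)
	}
	for _, a := range accounts {
		fmt.Println(a.ID, a.LastUsedAt)
	}
}
```

Leaving `LastUsedAt` untouched when the dedicated key is missing is a design choice in this sketch: it lets entries written before the change keep working until their first new-style update.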

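### Option C: switch to a Hash structure

Change `sched:acc:<id>` from a string to a hash and use `HSET`/`HGETALL` in place of `SET`/`GET`, so each update only touches the fields that changed.
- Pros: more Redis-idiomatic
- Cost: a larger change that touches every read/write path and `decodeCachedAccount`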
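
One possible shape for the "only changed fields" part, as a sketch rather than a proposal for the actual layout: each top-level field is stored as its own JSON-encoded hash field, and only the differing fields are passed to `HSET`. `toHashFields` and `changedFields` are hypothetical helpers, and the per-field JSON encoding is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toHashFields encodes every top-level account field as its own JSON string,
// so nested values such as Credentials survive the round trip.
func toHashFields(account map[string]interface{}) (map[string]string, error) {
	fields := make(map[string]string, len(account))
	for name, value := range account {
		encoded, err := json.Marshal(value)
		if err != nil {
			return nil, fmt.Errorf("field %s: %w", name, err)
		}
		fields[name] = string(encoded)
	}
	return fields, nil
}

// changedFields returns the fields of next that are new or differ from prev.
// Fields removed in next are not reported; those would need HDEL.
func changedFields(prev, next map[string]string) map[string]string {
	diff := make(map[string]string)
	for name, value := range next {
		if old, ok := prev[name]; !ok || old != value {
			diff[name] = value
		}
	}
	return diff
}

func main() {
	before, err := toHashFields(map[string]interface{}{
		"ID": 968, "Platform": "anthropic", "Concurrency": 10,
		"LastUsedAt": "2026-04-17T14:13:37+08:00",
	})
	if err != nil {
		panic(err)
	}
	after, err := toHashFields(map[string]interface{}{
		"ID": 968, "Platform": "anthropic", "Concurrency": 10,
		"LastUsedAt": "2026-04-17T14:13:52+08:00",
	})
	if err != nil {
		panic(err)
	}
	// Only LastUsedAt differs, so only that field would be sent via HSET.
	fmt.Println(changedFields(before, after))
}
```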

## How to reproduce

On a sub2api deployment of any size, after it has been running for a while:

```bash
redis-cli INFO commandstats | grep cmdstat_set
redis-cli --bigkeys
redis-cli --scan --count 100 | xargs -I{} redis-cli MEMORY USAGE {}
```

You should see an abnormally high share of `SET` calls, with `sched:acc:*` as the dominant large keys.
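
To turn the raw `commandstats` output into per-command shares, a small helper along these lines may be convenient. It is a sketch that assumes the usual `cmdstat_<name>:calls=<n>,usec=...` line format and reads the `INFO` text from stdin; `commandShares` is a hypothetical name.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"strconv"
	"strings"
)

// commandShares parses INFO commandstats text and returns, per command, its
// fraction of all calls listed in that text.
func commandShares(info string) (map[string]float64, error) {
	calls := make(map[string]int64)
	var total int64
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line) // also drops the trailing \r Redis emits
		if !strings.HasPrefix(line, "cmdstat_") {
			continue
		}
		parts := strings.SplitN(strings.TrimPrefix(line, "cmdstat_"), ":", 2)
		if len(parts) != 2 {
			continue
		}
		for _, kv := range strings.Split(parts[1], ",") {
			if !strings.HasPrefix(kv, "calls=") {
				continue
			}
			n, err := strconv.ParseInt(strings.TrimPrefix(kv, "calls="), 10, 64)
			if err != nil {
				return nil, fmt.Errorf("command %s: %w", parts[0], err)
			}
			calls[parts[0]] = n
			total += n
		}
	}
	shares := make(map[string]float64, len(calls))
	for name, n := range calls {
		shares[name] = float64(n) / float64(total)
	}
	return shares, nil
}

func main() {
	raw, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	shares, err := commandShares(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	names := make([]string, 0, len(shares))
	for name := range shares {
		names = append(names, name)
	}
	// Largest share first.
	sort.Slice(names, func(i, j int) bool { return shares[names[i]] > shares[names[j]] })
	for _, name := range names {
		fmt.Printf("%-20s %5.1f%%\n", name, shares[name]*100)
	}
}
```

Usage, assuming the file is saved as `main.go`: `redis-cli INFO commandstats | go run main.go`.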

