[Docs] Add Mooncake HA redis backend deployment example #8058
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## develop #8058 +/- ##
==========================================
Coverage ? 67.50%
==========================================
Files ? 475
Lines ? 66669
Branches ? 10286
==========================================
Hits ? 45006
Misses ? 18792
Partials ? 2871
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-06-17 02:29:16
📋 Review Summary
PR overview: Adds a Mooncake HA redis backend example and documentation, and adjusts the leader cleanup logic in the etcd HA example.
Scope of changes: examples/cache_storage/, docs/features/, docs/zh/features/
Impact tag: [Docs]
Issues
| Severity | File | Summary |
|---|---|---|
| - | - | No new blocking issues found. See below for the status of previously unresolved items. |
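Status of Previous Findings
| Finding | Issue | Status |
|---|---|---|
| F1 | run_ha_redis.sh still does not assert that the global cache is actually hit after failover | |
| F2 | The redis backend docs still lack a prerequisite note on installing redis-server / redis-cli | |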
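📝 PR Convention Check
Compliant: the title uses the official [Docs] tag, and the PR description contains the sections required by the template: Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist.
Overall Assessment
This round was prioritized by risk and covered, across the 6 changed files, the HA redis script, the HA etcd script changes, the Mooncake config pass-through path, and the run steps in the Chinese and English docs; both the bash -n and JSON parsing checks passed. Apart from the previously suggested items that remain unfixed, no new issues were found that should block merging.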
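One way to address F1 would be for the script to fail when the reuse request after failover does not report a cache hit. The following is a minimal sketch, not code from this PR. It assumes the server returns an OpenAI-style JSON body in which `prompt_tokens_details.cached_tokens` (the field the PR's Accuracy Tests section reads) sits under a `usage` object; the helper names and the sample payloads are hypothetical.

```python
import json


def cached_tokens(response_body: str) -> int:
    """Return usage.prompt_tokens_details.cached_tokens, or 0 if it is absent.

    The `usage` parent key is an assumption (OpenAI-style response layout).
    """
    usage = json.loads(response_body).get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def assert_cache_hit(response_body: str, phase: str) -> int:
    """Raise AssertionError unless the request reused at least one cached token."""
    hit = cached_tokens(response_body)
    assert hit > 0, f"{phase}: expected cached_tokens > 0, got {hit}"
    return hit


# Hypothetical payloads shaped like the responses described in the PR.
warm_up = '{"usage": {"prompt_tokens_details": {"cached_tokens": 0}}}'
reuse_after_failover = '{"usage": {"prompt_tokens_details": {"cached_tokens": 128}}}'

print("warm-up cached_tokens:", cached_tokens(warm_up))
print("reuse cached_tokens:", assert_cache_hit(reuse_after_failover, "after failover"))
```

A script could pipe each captured response body through such a check and exit non-zero on `AssertionError`, so that a failover that silently loses the global pool is reported as a failure rather than only being visible in the logs.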
Motivation
The HA example added in #8051 only supports the etcd coordination backend, which forces every HA deployment to bring up and operate a 3-node etcd cluster. Some environments already run redis and prefer not to introduce etcd as an extra component.
This PR adds a redis coordination backend for the Mooncake HA example: multiple `mooncake_master` instances perform lease-based leader election through a single redis instance, and FastDeploy clients discover the current leader via redis. When the leader fails, a standby is automatically elected as the new leader, transparently to clients. This gives the same guarantees as the etcd path with fewer moving parts.
Modifications
- `examples/cache_storage/run_ha_redis.sh` (new): self-contained mirror of `run_ha.sh` that
  - [...] `mooncake_master` with `--enable_ha --ha_backend_type redis --ha_backend_connstring redis://127.0.0.1:6399` (rpc 8081/8082/8083), one elected leader via a redis lease,
  - [...] `master_view` HASH) to trigger re-election, [...]
- `examples/cache_storage/ha_redis_mooncake_config.json` (new): HA client config; `metadata_server` and `master_server_addr` use the `redis://` prefix for leader discovery.
- `examples/cache_storage/run_ha.sh`: harden the leader teardown. `kill_master_by_rpc_port()` now also kills child PIDs (in case a child's cmdline didn't match the `--rpc_port` grep), and `--root_fs_dir` / `--enable_offload` are dropped from the master launch to align with the redis script.
- `docs/features/global_cache_pooling.md` & `docs/zh/features/global_cache_pooling.md`: restructure the HA section into two backend options, Option A: etcd (`run_ha.sh`) and Option B: redis (`run_ha_redis.sh`), each with its own client config and run steps; add the redis build flags (`-DSTORE_USE_REDIS=ON -DUSE_REDIS=ON`, dep `libhiredis-dev`) and the `--ha_backend_type` / `--ha_backend_connstring` master parameters.
- `examples/cache_storage/README.md`: list `run_ha_redis.sh` and `ha_redis_mooncake_config.json`.
Usage or Command
cd examples/cache_storage
bash run_ha_redis.sh
Accuracy Tests
This PR is docs/examples only and does not change model outputs. Below is the full output of `run_ha_redis.sh`, verifying that cache pooling survives a leader failover on the redis backend.
The key signal is `prompt_tokens_details.cached_tokens`: it is `0` on the warm-up request and `128` on the reuse request, both before and after the leader is killed and a new leader is elected (`127.0.0.1:8081` → `127.0.0.1:8083`).
Result: cache pooling works correctly before and after failover. `cached_tokens=128` on reuse in both phases confirms that the re-elected leader (8083) serves the global pool after the original leader (8081) is killed.
Checklist
- [...] `[Docs]`)
- [...] `pre-commit` before commit.
- [...] `run_ha_redis.sh` failover verification instead.
- [...] `release` branch, make sure the PR has been submitted to the `develop` branch first. Targeting `develop`.
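For readers unfamiliar with the lease-based leader election mentioned under Motivation, the general idea can be sketched as follows. This is an illustration of the generic technique only, not Mooncake's implementation: the PR indicates that the real state lives in a `master_view` HASH, whereas this sketch uses a single key with a TTL, modelled on the semantics of the redis command `SET key value NX PX ttl`. `FakeLeaseStore`, `LEASE_KEY`, and the timings are hypothetical, and the store is an in-memory stand-in so the example runs without a redis server.

```python
from typing import Dict, Optional, Tuple


class FakeLeaseStore:
    """In-memory stand-in for one redis key with a TTL (not a real client)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str, now: float) -> Optional[str]:
        """Return the current holder, or None if the lease is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        holder, expires_at = entry
        if expires_at <= now:
            del self._data[key]  # lease expired, key disappears
            return None
        return holder

    def set_if_absent(self, key: str, value: str, ttl: float, now: float) -> bool:
        """Acquire the lease only if nobody holds a live one (like SET ... NX PX)."""
        if self.get(key, now) is not None:
            return False
        self._data[key] = (value, now + ttl)
        return True

    def renew(self, key: str, value: str, ttl: float, now: float) -> bool:
        """Extend the lease only if `value` is still the holder."""
        if self.get(key, now) != value:
            return False
        self._data[key] = (value, now + ttl)
        return True


LEASE_KEY = "demo:leader"  # hypothetical key name
TTL = 5.0                  # hypothetical lease length in seconds
masters = ["127.0.0.1:8081", "127.0.0.1:8082", "127.0.0.1:8083"]
store = FakeLeaseStore()

# t=0: every master tries to acquire; only the first attempt succeeds.
winners = [m for m in masters if store.set_if_absent(LEASE_KEY, m, TTL, now=0.0)]
print("elected:", winners)

# t=3: the leader renews, pushing expiry to t=8. Standbys are still locked out.
store.renew(LEASE_KEY, "127.0.0.1:8081", TTL, now=3.0)
print("standby at t=7:", store.set_if_absent(LEASE_KEY, "127.0.0.1:8082", TTL, now=7.0))

# The leader is killed and stops renewing. At t=9 the lease has expired,
# so whichever standby tries first (8083 here) becomes the new leader.
store.set_if_absent(LEASE_KEY, "127.0.0.1:8083", TTL, now=9.0)
print("leader at t=9:", store.get(LEASE_KEY, now=9.0))
```

Clients in such a scheme would read the key to find the current leader. Against a real redis server, the renewal step would need to be an atomic compare-and-extend rather than the separate read and write shown here.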