[Docs] Add Mooncake HA redis backend deployment example#8058

Merged
juncaipeng merged 2 commits into PaddlePaddle:develop from jackyYang6:docs/mooncake-ha-redis-backend
Jun 17, 2026

Conversation

@jackyYang6

Contributor

Motivation

The HA example added in #8051 only supports the etcd coordination backend, which forces every HA deployment to bring up and operate a 3-node etcd cluster. Some environments already run redis and prefer not to introduce etcd as an extra component.

This PR adds a redis coordination backend for the Mooncake HA example: multiple mooncake_master instances perform lease-based leader election through a single redis instance, and FastDeploy clients discover the current leader via redis. When the leader fails, a standby is automatically elected as the new leader, transparently to clients. The failover behavior matches the etcd path, with fewer moving parts to operate.

Modifications

  • examples/cache_storage/run_ha_redis.sh (new): self-contained mirror of run_ha.sh that
    1. starts a single redis instance (port 6399),
    2. starts 3 mooncake_master instances with --enable_ha --ha_backend_type redis --ha_backend_connstring redis://127.0.0.1:6399 (rpc ports 8081/8082/8083), one of which is elected leader via a redis lease,
    3. launches 2 FastDeploy instances joining the same cache pool,
    4. verifies cache pooling before failover (prompt A on server_0 → hit on server_1),
    5. kills the current leader (read from the redis master_view HASH) to trigger re-election,
    6. re-verifies pooling after failover with a brand-new prompt B, so the hit on server_1 can only come from the new leader's global pool.
  • examples/cache_storage/ha_redis_mooncake_config.json (new): HA client config; metadata_server and master_server_addr use the redis:// prefix for leader discovery.
  • examples/cache_storage/run_ha.sh: harden the leader teardown — kill_master_by_rpc_port() now also kills child PIDs (in case a child's cmdline didn't match the --rpc_port grep), and drop --root_fs_dir / --enable_offload from the master launch to align with the redis script.
  • docs/features/global_cache_pooling.md & docs/zh/features/global_cache_pooling.md: restructure the HA section into two backend options — Option A: etcd (run_ha.sh) and Option B: redis (run_ha_redis.sh) — each with its own client config and run steps; add the redis build flags (-DSTORE_USE_REDIS=ON -DUSE_REDIS=ON, dep libhiredis-dev) and the --ha_backend_type / --ha_backend_connstring master parameters.
  • examples/cache_storage/README.md: list run_ha_redis.sh and ha_redis_mooncake_config.json.
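
Step 2 of run_ha_redis.sh can be sketched as a dry run that only prints the three master launch commands. This is a minimal sketch, not the script's actual code: the helper names master_cmd and print_master_cmds are hypothetical, the space-separated form of --rpc_port is an assumption, and the real script may pass additional flags that are not listed in this description.

```shell
# Values taken from the PR description.
REDIS_URL="redis://127.0.0.1:6399"
RPC_PORTS="8081 8082 8083"

# Print (do not execute) the launch command for one HA master.
# Hypothetical helper; the flags are the ones named in the description.
master_cmd() {
  local rpc_port="$1"
  printf 'mooncake_master --enable_ha --ha_backend_type redis --ha_backend_connstring %s --rpc_port %s\n' \
    "$REDIS_URL" "$rpc_port"
}

# Print one launch command per RPC port.
print_master_cmds() {
  local port
  for port in $RPC_PORTS; do
    master_cmd "$port"
  done
}
```

Calling print_master_cmds lists one command per port; all three masters point at the same redis connection string, which is what lets them compete for a single lease.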

Usage or Command

cd examples/cache_storage
bash run_ha_redis.sh

Accuracy Tests

This PR is docs/examples only and does not change model outputs. Below is the full output of run_ha_redis.sh verifying cache pooling survives a leader failover on the redis backend.

The key signal is prompt_tokens_details.cached_tokens: it is 0 on the warm-up request and 128 on the reuse request, both before and after the leader is killed and a new leader is elected (127.0.0.1:8081 → 127.0.0.1:8083).
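
Checking this signal by eye can be replaced with an explicit assertion. The following is a minimal sketch that assumes the response body is a single JSON document containing one "cached_tokens" field, as in the output shown in this description; cached_tokens and assert_cached_tokens are hypothetical helper names and are not part of run_ha_redis.sh.

```shell
# Read a response body on stdin and print the first cached_tokens value.
# Uses grep only, so it does not depend on jq being installed.
cached_tokens() {
  grep -o '"cached_tokens":[[:space:]]*[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Compare the cached_tokens value in a response body with an expected count.
# Prints OK on stdout, or FAIL on stderr with a non-zero return code.
assert_cached_tokens() {
  local response="$1" expected="$2" actual
  actual="$(printf '%s' "$response" | cached_tokens || true)"
  if [ "$actual" = "$expected" ]; then
    echo "OK cached_tokens=$actual"
  else
    echo "FAIL cached_tokens=${actual:-<none>} expected=$expected" >&2
    return 1
  fi
}
```

Passing the warm-up response with an expected value of 0 and the reuse response with 128 would turn the visual check into a pass/fail result for each phase.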

=== [1/6] start redis ===
=== redis health check ===
PONG
=== [2/6] start 3 HA mooncake_master (redis backend) ===
waiting for leader election...
✅ current leader: 127.0.0.1:8081
=== [3/6] start FastDeploy instances ===
server 0 port: 52700
server 1 port: 52800
Port 52700: [OK]   200
All services are ready!    [29s]
Port 52800: [OK]   200
All services are ready!    [0s]
=== [4/6] verify pooling before failover ===
>>> warmup msg_a on server_0 (52700)
{... "usage":{"prompt_tokens":191,"total_tokens":241,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0, ...}}}
>>> reuse msg_a on server_1 (52800), expect cache hit
{... "usage":{"prompt_tokens":191,"total_tokens":241,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":128, ...}}}
=== [5/6] kill leader and wait for failover ===
old leader: 127.0.0.1:8081 (rpc_port=8081)
kill leader master pids=104766 104796 104796 (rpc_port=8081)
waiting for a new leader to be elected...
✅ new leader: 127.0.0.1:8083 (was 127.0.0.1:8081)
=== [6/6] verify pooling after failover (new prompt msg_b) ===
>>> warmup msg_b on server_0 (52700)
{... "usage":{"prompt_tokens":147,"total_tokens":197,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0, ...}}}
>>> reuse msg_b on server_1 (52800), expect cache hit via new leader
{... "usage":{"prompt_tokens":147,"total_tokens":197,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":128, ...}}}

=== HA (redis) test completed ===
Check cache hit:  grep -E 'storage_cache_token_num' log_*/cache_storage.log*
Master logs:      log_master_1 / log_master_2 / log_master_3
Redis log:        log_redis
Current leader:   redis-cli -p 6399 hget 'mooncake-store/{mooncake_cluster}/master_view' leader_address

Result: cache pooling works correctly before and after failover — cached_tokens=128 on reuse in both phases, confirming the re-elected leader (8083) serves the global pool after the original leader (8081) is killed.
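
The leader lookup printed at the end of the run can also be wrapped in a polling loop that waits for failover. This is a sketch under stated assumptions: the hash key and the leader_address field are copied from the hint above, while get_leader, rpc_port_of, wait_for_new_leader, and the REDIS_CLI override are hypothetical names introduced for illustration.

```shell
# Connection details taken from the PR description.
REDIS_CLI="${REDIS_CLI:-redis-cli}"
REDIS_PORT="${REDIS_PORT:-6399}"
MASTER_VIEW_KEY='mooncake-store/{mooncake_cluster}/master_view'

# Print the current leader address, for example 127.0.0.1:8081.
get_leader() {
  "$REDIS_CLI" -p "$REDIS_PORT" hget "$MASTER_VIEW_KEY" leader_address
}

# Extract the RPC port from a host:port address.
rpc_port_of() {
  echo "${1##*:}"
}

# Poll once per second until the leader differs from the old one.
# Prints the new leader and returns 0, or returns 1 after the timeout.
wait_for_new_leader() {
  local old="$1" timeout="${2:-30}" leader i=0
  while [ "$i" -lt "$timeout" ]; do
    leader="$(get_leader 2>/dev/null || true)"
    if [ -n "$leader" ] && [ "$leader" != "$old" ]; then
      echo "$leader"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

Bounding the wait with a timeout means a failed election surfaces as a non-zero return code instead of a hung script.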

Checklist

  • Add at least a tag in the PR title. ([Docs])
  • Format your code, run pre-commit before commit.
  • Add unit tests. — N/A: docs/examples only, no production code changed.
  • Provide accuracy results. — N/A for model outputs; included run_ha_redis.sh failover verification instead.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch first. — Targeting develop.

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Jun 16, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@5b3dd38). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8058   +/-   ##
==========================================
  Coverage           ?   67.50%           
==========================================
  Files              ?      475           
  Lines              ?    66669           
  Branches           ?    10286           
==========================================
  Hits               ?    45006           
  Misses             ?    18792           
  Partials           ?     2871           
Flag Coverage Δ
GPU 77.48% <ø> (?)
XPU 6.98% <ø> (?)


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-17 02:29:16

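📋 Review summary

PR overview: adds a Mooncake HA redis backend example and documentation, and adjusts the leader teardown logic in the etcd HA example.
Scope of changes: examples/cache_storage/, docs/features/, docs/zh/features/
Impact tag: [Docs]

Issues

Level | File | Summary
- | - | No new blocking issues found. The status of previously unresolved items is listed below.

Status of previous findings

Finding | Issue | Status
F1 | run_ha_redis.sh still does not assert that the global cache is actually hit after failover | ⚠️ Still open
F2 | The redis backend docs still lack a prerequisite note on installing redis-server / redis-cli | ⚠️ Still open

📝 PR convention check

Compliant: the title uses the official [Docs] tag, and the PR description contains the Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist sections required by the template.

Overall assessment

This round was prioritized by risk and, across the 6 changed files, covered the HA redis script, the changes to the HA etcd script, the Mooncake config pass-through path, and the run steps in the Chinese and English docs; the bash -n and JSON parsing checks both passed. Apart from the previously suggested items that remain unfixed, no new issues were found that should block the merge.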
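
One way to address finding F2 on the script side is a preflight check that fails early when the redis tools are not installed. The sketch below is an illustration only: missing_cmds and require_cmds are hypothetical helpers that are not part of the merged scripts, and the list of commands to check is an assumption based on the finding.

```shell
# Print the names of the given commands that are not found on PATH,
# separated by single spaces. Prints an empty line if all are present.
missing_cmds() {
  local cmd missing=""
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      missing="$missing $cmd"
    fi
  done
  echo "${missing# }"
}

# Return non-zero and report on stderr if any given command is missing.
require_cmds() {
  local missing
  missing="$(missing_cmds "$@")"
  if [ -n "$missing" ]; then
    echo "missing required commands: $missing" >&2
    return 1
  fi
}
```

A script could call require_cmds redis-server redis-cli before starting anything, so a missing dependency is reported before any master or FastDeploy process is launched.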

@juncaipeng juncaipeng merged commit c46930d into PaddlePaddle:develop Jun 17, 2026
40 of 43 checks passed

4 participants