[Docs] Add Mooncake HA redis backend deployment example by jackyYang6 · Pull Request #8058 · PaddlePaddle/FastDeploy

jackyYang6 · 2026-06-16T09:18:04Z

Motivation

The HA example added in #8051 only supports the etcd coordination backend, which forces every HA deployment to bring up and operate a 3-node etcd cluster. Some environments already run redis and prefer not to introduce etcd as an extra component.

This PR adds a redis coordination backend for the Mooncake HA example: multiple mooncake_master instances perform lease-based leader election through a single redis instance, and FastDeploy clients discover the current leader via redis. When the leader fails, one of the standbys is elected as the new leader automatically, and the switch is transparent to clients. The guarantees match the etcd path, without the need to operate a separate etcd cluster.
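As a rough illustration of the lease-based election described above, here is a minimal in-memory sketch in Python. It is not Mooncake's implementation and it does not talk to redis: `FakeLeaseStore`, `ManualClock`, the key name `"leader"`, and the 5-second TTL are all assumptions made for this example. It only shows the general idea that a lease holder has to keep renewing, and that a standby can take over once the lease expires.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ManualClock:
    """Hypothetical test clock so the example does not need to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class FakeLeaseStore:
    """In-memory stand-in for a store with 'set if absent, with TTL' semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expiry)

    def _live(self, key: str) -> Optional[Tuple[str, float]]:
        # Drop the entry if its lease has expired.
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)
            return None
        return entry

    def acquire_or_renew(self, key: str, owner: str, ttl: float) -> bool:
        # Succeeds if nobody holds the lease, or if the caller already holds it.
        entry = self._live(key)
        if entry is None or entry[0] == owner:
            self._data[key] = (owner, self._clock() + ttl)
            return True
        return False

    def holder(self, key: str) -> Optional[str]:
        entry = self._live(key)
        return entry[0] if entry else None


def demo() -> List[Optional[str]]:
    clock = ManualClock()
    store = FakeLeaseStore(clock)
    first, standby = "127.0.0.1:8081", "127.0.0.1:8083"
    history: List[Optional[str]] = []

    store.acquire_or_renew("leader", first, ttl=5.0)    # first master wins
    store.acquire_or_renew("leader", standby, ttl=5.0)  # refused, lease is held
    history.append(store.holder("leader"))

    clock.now = 6.0  # the first master stopped renewing (e.g. it was killed)
    store.acquire_or_renew("leader", standby, ttl=5.0)  # standby takes over
    history.append(store.holder("leader"))
    return history
```

Calling `demo()` walks through one failover: the first master holds the lease, stops renewing, and the standby acquires it after expiry.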

Modifications

examples/cache_storage/run_ha_redis.sh (new): self-contained mirror of run_ha.sh that
1. starts a single redis instance (port 6399),
2. starts 3 mooncake_master instances with --enable_ha --ha_backend_type redis --ha_backend_connstring redis://127.0.0.1:6399 (rpc ports 8081/8082/8083), one of which is elected leader via a redis lease,
3. launches 2 FastDeploy instances joining the same cache pool,
4. verifies cache pooling before failover (prompt A on server_0 → hit on server_1),
5. kills the current leader (read from the redis master_view HASH) to trigger re-election,
6. re-verifies pooling after failover with a brand-new prompt B, so the hit on server_1 can only come from the new leader's global pool.
examples/cache_storage/ha_redis_mooncake_config.json (new): HA client config; metadata_server and master_server_addr use the redis:// prefix for leader discovery.
examples/cache_storage/run_ha.sh: harden the leader teardown — kill_master_by_rpc_port() now also kills child PIDs (in case a child's cmdline didn't match the --rpc_port grep), and drop --root_fs_dir / --enable_offload from the master launch to align with the redis script.
docs/features/global_cache_pooling.md & docs/zh/features/global_cache_pooling.md: restructure the HA section into two backend options — Option A: etcd (run_ha.sh) and Option B: redis (run_ha_redis.sh) — each with its own client config and run steps; add the redis build flags (-DSTORE_USE_REDIS=ON -DUSE_REDIS=ON, dep libhiredis-dev) and the --ha_backend_type / --ha_backend_connstring master parameters.
examples/cache_storage/README.md: list run_ha_redis.sh and ha_redis_mooncake_config.json.
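The leader lookup and the "wait for a new leader" step in the list above can be sketched in Python as follows. This is an illustration only, not the code in run_ha_redis.sh: the function names, the timeout values, and the polling loop are assumptions. The hash key and field are copied verbatim from the `redis-cli` hint the script prints at the end of its run, and `read_leader_via_redis_cli` assumes `redis-cli` is on the PATH.

```python
import subprocess
import time
from typing import Callable, Optional, Tuple

# Key and field copied verbatim from the hint printed at the end of the script.
MASTER_VIEW_KEY = "mooncake-store/{mooncake_cluster}/master_view"
LEADER_FIELD = "leader_address"


def read_leader_via_redis_cli(port: int = 6399) -> Optional[str]:
    """Return the leader address stored in the master_view hash, or None."""
    result = subprocess.run(
        ["redis-cli", "-p", str(port), "hget", MASTER_VIEW_KEY, LEADER_FIELD],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() or None


def split_address(address: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port); the port identifies the master."""
    host, _, port = address.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"not a host:port address: {address!r}")
    return host, int(port)


def wait_for_new_leader(
    read_leader: Callable[[], Optional[str]],
    old_leader: str,
    timeout_s: float = 30.0,      # assumed value, not taken from the script
    interval_s: float = 0.5,      # assumed value, not taken from the script
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the stored leader is non-empty and differs from old_leader."""
    deadline = clock() + timeout_s
    while True:
        leader = read_leader()
        if leader and leader != old_leader:
            return leader
        if clock() >= deadline:
            raise TimeoutError(f"no new leader after {timeout_s}s")
        sleep(interval_s)
```

A typical use would be `wait_for_new_leader(read_leader_via_redis_cli, old_leader)` right after the old leader is killed. Passing `read_leader` as a callable keeps the polling logic testable without a running redis.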

Usage or Command

cd examples/cache_storage
bash run_ha_redis.sh

Accuracy Tests

This PR is docs/examples only and does not change model outputs. Below is the full output of run_ha_redis.sh verifying cache pooling survives a leader failover on the redis backend.

The key signal is prompt_tokens_details.cached_tokens: it is 0 on the warm-up request and 128 on the reuse request — both before and after the leader is killed and a new leader is elected (127.0.0.1:8081 → 127.0.0.1:8083).

=== [1/6] start redis ===
=== redis health check ===
PONG
=== [2/6] start 3 HA mooncake_master (redis backend) ===
waiting for leader election...
✅ current leader: 127.0.0.1:8081
=== [3/6] start FastDeploy instances ===
server 0 port: 52700
server 1 port: 52800
Port 52700: [OK]   200
All services are ready!    [29s]
Port 52800: [OK]   200
All services are ready!    [0s]
=== [4/6] verify pooling before failover ===
>>> warmup msg_a on server_0 (52700)
{... "usage":{"prompt_tokens":191,"total_tokens":241,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0, ...}}}
>>> reuse msg_a on server_1 (52800), expect cache hit
{... "usage":{"prompt_tokens":191,"total_tokens":241,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":128, ...}}}
=== [5/6] kill leader and wait for failover ===
old leader: 127.0.0.1:8081 (rpc_port=8081)
kill leader master pids=104766 104796 104796 (rpc_port=8081)
waiting for a new leader to be elected...
✅ new leader: 127.0.0.1:8083 (was 127.0.0.1:8081)
=== [6/6] verify pooling after failover (new prompt msg_b) ===
>>> warmup msg_b on server_0 (52700)
{... "usage":{"prompt_tokens":147,"total_tokens":197,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0, ...}}}
>>> reuse msg_b on server_1 (52800), expect cache hit via new leader
{... "usage":{"prompt_tokens":147,"total_tokens":197,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":128, ...}}}

=== HA (redis) test completed ===
Check cache hit:  grep -E 'storage_cache_token_num' log_*/cache_storage.log*
Master logs:      log_master_1 / log_master_2 / log_master_3
Redis log:        log_redis
Current leader:   redis-cli -p 6399 hget 'mooncake-store/{mooncake_cluster}/master_view' leader_address

Result: cache pooling works correctly before and after failover — cached_tokens=128 on reuse in both phases, confirming the re-elected leader (8083) serves the global pool after the original leader (8081) is killed.
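The check described in the result above can also be expressed programmatically. The following Python sketch reads `usage.prompt_tokens_details.cached_tokens` from two response bodies and reports whether the reuse request hit the cache. The helper names are hypothetical, and the sample payloads contain only the `usage` fields visible in the log; the rest of each response is elided in the log and is not reconstructed here.

```python
import json
from typing import Any, Mapping


def cached_tokens(response: Mapping[str, Any]) -> int:
    """Return usage.prompt_tokens_details.cached_tokens, or 0 if absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def pooling_hit(warmup: Mapping[str, Any], reuse: Mapping[str, Any]) -> bool:
    """True when the reuse request reports more cached tokens than the warm-up."""
    return cached_tokens(reuse) > cached_tokens(warmup)


# Minimal payloads built from the usage numbers shown for msg_b in the log.
warmup_b = json.loads(
    '{"usage": {"prompt_tokens": 147, "prompt_tokens_details": {"cached_tokens": 0}}}'
)
reuse_b = json.loads(
    '{"usage": {"prompt_tokens": 147, "prompt_tokens_details": {"cached_tokens": 128}}}'
)

if not pooling_hit(warmup_b, reuse_b):
    raise SystemExit("expected a cache hit on the reuse request")
```

Comparing the reuse count against the warm-up count, rather than against a fixed 128, keeps the check independent of prompt length and block size.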

Checklist

Add at least a tag in the PR title. ([Docs])
Format your code, run pre-commit before commit.
Add unit tests. — N/A: docs/examples only, no production code changed.
Provide accuracy results. — N/A for model outputs; included run_ha_redis.sh failover verification instead.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch first. — Targeting develop.

codecov-commenter · 2026-06-16T10:03:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@5b3dd38).

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8058   +/-   ##
==========================================
  Coverage           ?   67.50%           
==========================================
  Files              ?      475           
  Lines              ?    66669           
  Branches           ?    10286           
==========================================
  Hits               ?    45006           
  Misses             ?    18792           
  Partials           ?     2871

Flag	Coverage Δ
GPU	`77.48% <ø> (?)`
XPU	`6.98% <ø> (?)`

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-17 02:29:16

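📋 Review Summary

PR overview: Adds a Mooncake HA redis backend example and documentation, and adjusts the leader cleanup logic in the etcd HA example.
Scope of changes: examples/cache_storage/, docs/features/, docs/zh/features/
Impact tag: [Docs]

Issues

Level	File	Summary
-	-	No new blocking issues found. See below for the status of previously unresolved items.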

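Status of Previous Findings

Finding	Issue	Status
F1	`run_ha_redis.sh` still does not assert that the global cache is actually hit after failover	⚠️ Still present
F2	The redis backend docs still lack installation prerequisites for `redis-server` / `redis-cli`	⚠️ Still present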
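For finding F2, one possible mitigation alongside documenting the prerequisites is a pre-flight check that fails early when the redis tools are missing. The sketch below is a hypothetical Python helper, not part of this PR; it relies only on the standard library's `shutil.which`, and the `which` parameter exists so the check can be exercised without redis installed.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tools named in finding F2.
REQUIRED_TOOLS = ("redis-server", "redis-cli")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on the PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]


def require_tools() -> None:
    """Exit with a readable message if any required tool is missing."""
    missing = missing_tools()
    if missing:
        raise SystemExit("missing prerequisites: " + ", ".join(missing))
```

Calling `require_tools()` before starting redis would turn a confusing mid-run failure into an immediate, explicit error.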

📝 PR Convention Check

Compliant: the title uses the official [Docs] tag, and the PR description contains the Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist sections required by the template.

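Overall Assessment

This round was prioritized by risk and covered, across the 6 changed files, the HA redis script, the changes to the HA etcd script, the Mooncake config pass-through path, and the run steps in the Chinese and English docs; both the bash -n check and the JSON parsing check passed. Apart from the previously suggested items that remain unfixed, no new issues that would block merging were found.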

[Docs] Add Mooncake HA redis backend deployment example

7d40d9d

jackyYang6 temporarily deployed to Metax_ci June 16, 2026 09:18 — with GitHub Actions Inactive

juncaipeng approved these changes Jun 16, 2026

Merge branch 'develop' into docs/mooncake-ha-redis-backend

26789d1

jackyYang6 had a problem deploying to Metax_ci June 16, 2026 11:47 — with GitHub Actions Failure

PaddlePaddle-bot reviewed Jun 16, 2026

juncaipeng merged commit c46930d into PaddlePaddle:develop Jun 17, 2026
40 of 43 checks passed

juncaipeng merged 2 commits into PaddlePaddle:develop from jackyYang6:docs/mooncake-ha-redis-backend

4 participants
