diff --git a/docs/blog/assets/openim-large-group-push-optimization.png b/docs/blog/assets/openim-large-group-push-optimization.png new file mode 100644 index 0000000000..dc862cf04d Binary files /dev/null and b/docs/blog/assets/openim-large-group-push-optimization.png differ diff --git a/docs/blog/golang/optimization/4.md b/docs/blog/golang/optimization/4.md index e40ce285a7..87091c5114 100644 --- a/docs/blog/golang/optimization/4.md +++ b/docs/blog/golang/optimization/4.md @@ -4,4 +4,143 @@ hide_title: true sidebar_position: 3 --- +# 十万大群的推送优化 +![OpenIM 十万大群推送优化架构图](./../../assets/openim-large-group-push-optimization.png) + +十万人大群里,消息推送的难点并不是“把一条消息发给十万人”这么简单。 + +真正的压力来自连续放大的链路:一条群消息进入系统后,要写入消息链路,要找到应该触达的成员,要判断哪些用户在线,要把在线用户路由到对应长连接网关,还要把未在线或在线推送失败的用户交给离线推送。任何一步如果按“全量、同步、逐个处理”的方式实现,都会在大群里被迅速放大。 + +OpenIM 的推送优化思路不是把某个函数写得更快,而是把整条链路拆开:消息转发、在线推送、离线推送、网关触达、第三方厂商推送各自解耦,再用批量、分片、过滤和兜底把压力控制在可预期范围内。 + +## 01. 推送链路先解耦,不能让一条消息拖住全链路 + +在普通群里,推送可以看起来像一个连续动作:消息来了,找到人,直接推过去。 + +但在十万人大群里,这种连续动作会变成风险。只要某个环节变慢,前面的消息写入、后面的在线触达、离线推送都会互相影响。OpenIM 的做法是把消息写入和推送执行拆成不同阶段:消息转发服务负责把消息进入存储和推送队列,推送服务再消费推送任务,在线推送和离线推送也分成不同入口处理。 + +这样做的价值很直接: + +- 消息写入不会被第三方推送服务拖慢。 +- 在线推送和离线推送可以按各自节奏扩容。 +- 推送服务出现短时抖动时,消息主链路仍然可以继续前进。 +- 后续排查问题时,可以区分是消息写入慢、在线网关慢,还是离线厂商推送慢。 + +对于大群来说,解耦不是架构上的“好看”,而是稳定性的前提。 + +## 02. 第一层优化:把单条推送收口成批量处理 + +大群消息最怕的不是单条消息,而是连续消息。 + +如果每条消息都独立完成一次解析、一次路由、一次网关调用、一次提交确认,那么高峰时系统会把大量时间花在重复动作上。OpenIM 会在消息转发阶段按消息队列的 key 做分片,把同一批到达的消息聚合到固定窗口里处理;进入推送服务后,再按会话和接收人范围继续做批量归并。 + +这让推送从“来一条处理一条”变成“按窗口处理一批”。收益主要有三点: + +- 重复的解析、调度和队列处理减少了。 +- 同一会话内连续消息可以合并进入同一次推送处理上下文。 +- worker 的并发数量受控,不会因为瞬时消息峰值无限扩张。 + +这一步看起来只是批处理,但对十万大群很关键。因为大群里的压力通常不是平均压力,而是瞬时压力。批量窗口能把尖峰削平,让后面的在线路由和离线推送有稳定输入。 + +## 03. 
第二层优化:按会话归并,而不是按消息散打 + +推送服务消费到消息后,不是直接把每条消息独立发出去,而是按会话和接收人范围组织批量。对于没有自定义推送选项的连续群消息,系统会把它们合并到同一批次里处理。 + +这对大群尤其重要。十万人大群里的多条消息,本质上属于同一个群会话。如果每条消息都重新走一遍完整成员查找、在线状态判断和网关分发,很多工作会重复。按会话归并后,系统可以把同一会话内的一批消息放在同一次处理上下文里,减少重复调度。 + +同时,单聊、通知会话和大群会话会进入不同处理路径。普通用户消息更关注接收方和发送方多端同步;大群消息更关注群成员范围、在线状态、离线过滤和网关批量触达。把路径分开后,每类会话都可以使用最适合自己的推送策略。 + +## 04. 第三层优化:先缩小接收人范围 + +十万人大群不是每条消息都真的需要推给十万人。 + +OpenIM 在进入实际推送前,会先尽量缩小接收人范围。群推送前回调可以让业务系统调整接收人;消息自身也可以指定只推给部分用户,或者在默认群成员之外补充额外用户。没有特殊规则时,系统才从群成员 ID 缓存中取出群成员列表作为默认推送范围。 + +这层设计给业务留下了很大空间: + +- 普通群消息可以走默认群成员范围。 +- 定向提醒可以只推给被影响的成员。 +- 运营或系统消息可以通过回调做业务侧过滤。 +- 特殊消息可以补充额外接收人,但不破坏默认链路。 + +大群优化的核心不是每次都推得更多,而是先判断哪些人真的应该进入本轮推送。 + +## 05. 第四层优化:在线用户只推到对应网关 + +在分布式 IM 系统里,用户长连接分散在多个网关节点上。 + +最粗暴的做法是把同一条群消息广播给所有网关,让每个网关自己判断有没有目标用户。但十万人大群下,这会让网关之间产生大量无效调用。OpenIM 的在线推送会先查询用户在线状态,并按网关维度整理在线用户:哪些用户在哪个网关上,就把消息推到哪个网关。 + +这样一来,在线推送的目标从“所有网关”缩小为“真正持有目标用户连接的网关”。这会明显减少无效 RPC、降低网关压力,也让推送结果更容易回收。 + +系统也保留了兜底路径:当在线状态不可用、网关映射不可信,或者单机部署场景下无法精确路由时,可以退回到全网关推送。也就是说,精确路由是优先路径,全量广播是安全兜底。 + +## 06. 第五层优化:在线失败后才进入离线推送 + +离线推送是移动端体验的关键,但它不应该替代在线推送。 + +OpenIM 的顺序是先尝试在线触达,再根据网关返回结果计算哪些用户没有成功收到在线推送。只有这些用户才会进入离线推送候选集合。发送者、多端已经在线成功的用户、关闭离线推送的用户,都不会继续进入离线推送。 + +对于群消息,系统还会再做一次会话级过滤。比如用户对某个会话设置了免打扰,或者业务上不需要该会话的离线提醒,就可以在进入第三方厂商推送前被过滤掉。 + +这层过滤非常重要。因为第三方推送通常有配额、限速、厂商策略和成本问题。十万大群里如果不先过滤,而是把所有未确认对象直接丢给厂商通道,很容易把离线推送打成新的瓶颈。 + +## 07. 第六层优化:离线推送再次异步化 + +在线推送失败并不意味着当前推送 worker 要立刻调用第三方厂商。 + +OpenIM 会把需要离线触达的用户重新写入离线推送队列,由离线推送消费者异步处理。离线推送任务还会按用户集合切块,避免一次任务携带过大的用户列表。 + +这个设计把在线链路和厂商链路隔离开来: + +- 在线推送可以尽快完成本轮处理。 +- 离线推送可以独立消费、独立失败重试、独立扩容。 +- 厂商接口慢或抖动时,不会反向拖住在线推送。 +- 大批离线用户会被拆成可控批次,而不是一次性压到单个请求里。 + +对大群而言,这相当于把“在线实时触达”和“离线补偿触达”拆成两条节奏不同的流水线。 + +## 08. 推送内容也要轻量化 + +推送链路里还有一个容易被忽略的点:不是所有内部字段都应该带到网关和离线通道。 + +OpenIM 在进入推送和离线队列前,会清理只服务于内部逻辑的消息选项,只保留真正需要触达客户端或厂商的内容。离线推送展示内容也会优先使用业务传入的推送标题、描述和扩展字段;没有传入时,再根据消息类型生成默认展示文案。 + +这能带来两个好处: + +- 推送载荷更轻,减少网络和序列化成本。 +- 内部控制字段不会泄漏到客户端或第三方厂商通道。 + +十万大群里,任何单条消息多带一点无效字段,都会在大规模扇出时被放大。 + +## 09. 
特殊消息走特殊策略 + +大群里不只有普通文本消息,还会有音视频信令、系统通知、成员变化通知等特殊消息。 + +OpenIM 对这些消息没有简单地“一视同仁”。例如音视频信令类消息,离线推送时会尽量只推给真正相关的邀请对象;对于需要覆盖或撤销的信令提醒,也会结合支持该能力的厂商通道做更精细的处理,避免用户收到过期通知。 + +这类优化的意义不是提升吞吐,而是减少噪音。大群规模越大,错误或多余的通知越容易变成体验问题。只把特殊消息推给真正需要的人,和提升系统性能同样重要。 + +## 10. 这套优化最终解决了什么 + +把这些策略合在一起,OpenIM 的十万大群推送链路可以概括为六个判断: + +| 问题 | 处理方式 | +| --- | --- | +| 消息要不要阻塞主链路 | 推送与存储、在线与离线解耦 | +| 当前要处理多少消息 | 批量窗口和 worker 并发控制 | +| 这批消息属于哪个会话 | 按会话聚合,再选择单聊或群聊路径 | +| 应该推给哪些人 | 回调、消息选项和群成员缓存共同收口 | +| 在线用户在哪里 | 通过在线状态路由到对应网关 | +| 谁还需要离线提醒 | 在线失败集合再经过会话级过滤后异步离线推送 | + +它不是单点优化,而是一套分层削峰机制。 + +## 结语 + +十万人大群的推送优化,本质上不是追求“更快地广播给所有人”,而是避免每一步都默认全量。 + +OpenIM 把推送拆成消息队列、批量处理、会话归并、在线路由、失败回收、离线过滤和厂商推送多个阶段。每一层都在减少不必要的工作:少阻塞、少重复、少广播、少离线、少无效载荷。 + +这也是为什么在大群压测场景下,系统可以面对十万在线用户和高频群消息仍保持稳定触达。真正支撑大群的,不是某个“超大并发发送”技巧,而是整条链路都在持续做减法。 diff --git a/i18n/en/docusaurus-plugin-content-docs-blog/current/assets/openim-large-group-push-optimization.png b/i18n/en/docusaurus-plugin-content-docs-blog/current/assets/openim-large-group-push-optimization.png new file mode 100644 index 0000000000..23aa10e323 Binary files /dev/null and b/i18n/en/docusaurus-plugin-content-docs-blog/current/assets/openim-large-group-push-optimization.png differ diff --git a/i18n/en/docusaurus-plugin-content-docs-blog/current/golang/optimization/4.md b/i18n/en/docusaurus-plugin-content-docs-blog/current/golang/optimization/4.md index bdbd7bf601..21bbd5597d 100644 --- a/i18n/en/docusaurus-plugin-content-docs-blog/current/golang/optimization/4.md +++ b/i18n/en/docusaurus-plugin-content-docs-blog/current/golang/optimization/4.md @@ -3,3 +3,144 @@ title: Push Optimization for 100,000-Member Groups hide_title: true sidebar_position: 3 --- + +# Push Optimization for 100,000-Member Groups + +![OpenIM large-group push optimization architecture](./../../assets/openim-large-group-push-optimization.png) + +In a 100,000-member group, push delivery is not simply about sending one message to 100,000 people. + +The real pressure comes from a chain reaction. 
After a group message enters the system, it must pass through the message path, find the users that should be reached, determine who is online, route online users to the correct long-connection gateways, and then hand users who are offline or missed by online push to the offline push path. If any step is implemented as a full, synchronous, one-by-one operation, the cost expands rapidly in large groups. + +OpenIM does not optimize this by making one function faster. It separates the whole path: message transfer, online push, offline push, gateway delivery, and third-party vendor push are decoupled, then controlled with batching, sharding, filtering, and fallback paths. + +## 01. Decouple the Push Path First + +In a normal group, push delivery may look like one continuous action: a message arrives, recipients are found, and the message is sent. + +In a 100,000-member group, that continuous action becomes risky. If one step slows down, message writes, online delivery, and offline push can block each other. OpenIM separates message write and push execution into different stages: the message transfer service writes messages into storage and push queues, the push service consumes push tasks, and online push and offline push are handled by different entries. + +This has direct value: + +- Message writes are not slowed down by third-party push providers. +- Online push and offline push can scale at their own pace. +- Short push-service jitter does not immediately stop the main message path. +- Troubleshooting can distinguish message-write latency, online-gateway latency, and offline-provider latency. + +For large groups, decoupling is not just architectural neatness. It is the foundation for stability. + +## 02. First Optimization: Batch Single Push Tasks + +The biggest risk in large-group messaging is not one message. It is continuous messages. 
+ +If every message independently goes through parsing, routing, gateway calls, and queue acknowledgment, peak traffic wastes a lot of time on repeated work. OpenIM first shards messages by message-queue key in the message transfer stage and groups messages that arrive in the same window. After they enter the push service, it continues batching by conversation and recipient scope. + +This changes push delivery from “handle every message immediately” to “handle a batch within a window.” The benefits are clear: + +- Repeated parsing, scheduling, and queue handling are reduced. +- Consecutive messages in the same conversation can enter the same push-processing context. +- Worker concurrency stays bounded instead of expanding with instant traffic spikes. + +This may look like ordinary batching, but it matters greatly for 100,000-member groups. Large-group pressure is usually not average pressure; it is burst pressure. A batch window smooths the spike and gives online routing and offline push a stable input shape. + +## 03. Second Optimization: Merge by Conversation + +After the push service consumes messages, it does not treat every message as a fully independent unit. It organizes batches by conversation and recipient scope. Consecutive group messages without custom push options can be merged into the same processing batch. + +This is especially important for large groups. Multiple messages in a 100,000-member group belong to the same group conversation. If every message repeats the entire member lookup, online-state check, and gateway dispatch process, much of the work is duplicated. Grouping by conversation lets the system process a set of messages in one shared context and reduce repeated scheduling. + +At the same time, one-to-one chats, notification conversations, and large-group conversations enter different paths. User messages focus on receiver and sender multi-device synchronization. 
Group messages focus on group-member scope, online state, offline filtering, and gateway batch delivery. Once the paths are separated, each conversation type can use the push strategy that fits it best. + +## 04. Third Optimization: Narrow the Recipient Set First + +A 100,000-member group does not mean every message truly needs to be pushed to 100,000 users. + +Before actual delivery, OpenIM tries to narrow the recipient set. The before-group-online-push callback lets the business server adjust recipients. Message options can also restrict the push to certain users or add extra users outside the default group-member range. Only when there are no special rules does the system use the group-member ID cache as the default push scope. + +This leaves business logic room to work with: + +- Normal group messages can use the default group-member scope. +- Targeted reminders can push only to affected members. +- Operational or system messages can be filtered by business callbacks. +- Special messages can add extra recipients without breaking the default path. + +The core of large-group optimization is not pushing to more people every time. It is first deciding who really belongs in this push round. + +## 05. Fourth Optimization: Push Online Users Only to Their Gateways + +In a distributed IM system, user long connections are spread across multiple gateway nodes. + +The naive approach is to broadcast the same group message to every gateway and let each gateway check whether it owns any target users. In a 100,000-member group, that creates a large number of wasted calls to gateways that hold no target users. OpenIM first queries online state and groups online users by gateway: users connected to a gateway are pushed through that gateway. + +As a result, the online push target shrinks from “all gateways” to “the gateways that actually hold target user connections.” This reduces wasted RPCs, lowers gateway pressure, and makes push results easier to collect. + +OpenIM still keeps a fallback path. 
If online state is unavailable, gateway mapping cannot be trusted, or precise routing is not possible in standalone deployment, the system can fall back to all-gateway push. Precise routing is the preferred path; full broadcast is the safety net. + +## 06. Fifth Optimization: Enter Offline Push Only After Online Failure + +Offline push is critical for mobile experience, but it should not replace online push. + +OpenIM first attempts online delivery, then calculates which users did not receive the message successfully from gateway results. Only those users become offline-push candidates. The sender, users that were already reached online, and users whose message options disable offline push do not continue into offline push. + +For group messages, the system applies one more conversation-level filter. If a user has muted a conversation, or the business does not need offline reminders for that conversation, the user can be removed before the message reaches third-party provider channels. + +This filter is important because third-party push usually has quotas, rate limits, provider policies, and cost. If a 100,000-member group sends every unconfirmed user directly to vendor push without filtering, offline push can become the new bottleneck. + +## 07. Sixth Optimization: Make Offline Push Asynchronous Again + +Online push failure does not mean the current push worker should immediately call a third-party provider. + +OpenIM writes users that need offline delivery into a separate offline push queue, then lets offline-push consumers process them asynchronously. Offline tasks are also chunked by user set, so one task does not carry an oversized recipient list. + +This separates the online path from the vendor path: + +- Online push can finish the current round quickly. +- Offline push can consume, retry, and scale independently. +- Slow or unstable provider APIs do not drag online push backward. 
+- Large offline user sets are split into controlled batches instead of one oversized request. + +For large groups, this creates two pipelines with different rhythms: real-time online delivery and compensating offline delivery. + +## 08. Keep Push Payloads Lightweight + +There is another easily overlooked point in the push path: not every internal field should travel to gateways or offline providers. + +Before messages enter push delivery and offline queues, OpenIM clears message options that are only used for internal control. Offline push display content prefers the title, description, and extension fields supplied by business logic; if they are absent, the system generates default display text based on message type. + +This brings two benefits: + +- Push payloads are lighter, reducing network and serialization cost. +- Internal control fields are not leaked to clients or third-party provider channels. + +In a 100,000-member group, even a small amount of useless data per message is magnified during fan-out. + +## 09. Special Messages Need Special Strategies + +Large groups contain more than normal text messages. They also carry audio/video signaling, system notifications, member-change notifications, and other special messages. + +OpenIM does not treat all of them identically. For example, signaling messages are pushed offline, as far as possible, only to the users who were actually invited. For reminders that need overwrite or revoke behavior, OpenIM can rely on provider channels that support those capabilities and handle the reminder more precisely, so users do not receive stale notifications. + +The point of this kind of optimization is not throughput; it is reducing noise. The larger the group, the more easily incorrect or unnecessary notifications turn into a user-experience problem. Pushing special messages only to the people who need them is as important as improving system performance. + +## 10. 
What This Solves + +Together, these strategies turn large-group push into six decisions: + +| Question | Handling | +| --- | --- | +| Should the message block the main path? | Decouple storage, push, online delivery, and offline delivery | +| How many messages should be processed now? | Use batch windows and bounded worker concurrency | +| Which conversation does this batch belong to? | Merge by conversation, then choose the one-to-one or group path | +| Who should receive it? | Narrow the scope with callbacks, message options, and the group-member cache | +| Where are online users connected? | Route by online state to the corresponding gateways | +| Who still needs offline reminders? | Filter users missed by online push again at the conversation level, then push offline asynchronously | + +This is not a single optimization. It is a layered peak-shaving mechanism. + +## Conclusion + +Push optimization for 100,000-member groups is not about broadcasting faster to everyone. It is about not defaulting to the full member set at every step. + +OpenIM splits push delivery into message queues, batch processing, conversation merging, online routing, failure collection, offline filtering, and provider push. Each layer removes unnecessary work: less blocking, less repetition, less broadcasting, less offline push, and lighter payloads. + +That is why OpenIM can remain stable under large-group benchmark scenarios with 100,000 online users and high-frequency group messages. What supports large groups is not one “massive concurrent send” trick, but a whole path that keeps reducing work at every stage.
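
The last three rows of the table (gateway routing, offline candidate filtering, and chunked offline tasks) can be illustrated with a small Go sketch. This is a minimal illustration under stated assumptions, not OpenIM's implementation: `GatewayID`, `routeByGateway`, `offlineCandidates`, and `chunk` are hypothetical names, the online-state lookup is modeled as a plain map, and the mute filter is modeled as a set of user IDs.

```go
package main

// GatewayID identifies a long-connection gateway node (hypothetical type).
type GatewayID string

// routeByGateway groups recipients by the gateway that holds their connection.
// Recipients absent from the online map are returned separately as not online.
func routeByGateway(recipients []string, online map[string]GatewayID) (map[GatewayID][]string, []string) {
	byGateway := make(map[GatewayID][]string)
	var notOnline []string
	for _, uid := range recipients {
		gw, ok := online[uid]
		if !ok {
			notOnline = append(notOnline, uid)
			continue
		}
		byGateway[gw] = append(byGateway[gw], uid)
	}
	return byGateway, notOnline
}

// offlineCandidates keeps only users who still need an offline reminder:
// it drops the sender, users who muted the conversation, and duplicates.
func offlineCandidates(notReached []string, sender string, muted map[string]bool) []string {
	seen := make(map[string]bool, len(notReached))
	out := make([]string, 0, len(notReached))
	for _, uid := range notReached {
		if uid == sender || muted[uid] || seen[uid] {
			continue
		}
		seen[uid] = true
		out = append(out, uid)
	}
	return out
}

// chunk splits users into batches of at most size entries, so that no single
// offline task carries an oversized recipient list.
func chunk(users []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(users); start += size {
		end := start + size
		if end > len(users) {
			end = len(users)
		}
		batches = append(batches, users[start:end])
	}
	return batches
}
```

In this sketch a caller would push each `byGateway` entry to its gateway, append any users the gateway reports as failed to the not-online list, pass the result through `offlineCandidates`, and enqueue one offline task per batch returned by `chunk`. The batch size is a tuning choice that the article does not specify.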