Skip to content

[orchagent]: Add ZmqRouteServer for concurrent route updates#4564

Open
venkit-nexthop wants to merge 1 commit into
sonic-net:masterfrom
venkit-nexthop:orchagent-zmq-route-server
Open

[orchagent]: Add ZmqRouteServer for concurrent route updates#4564
venkit-nexthop wants to merge 1 commit into
sonic-net:masterfrom
venkit-nexthop:orchagent-zmq-route-server

Conversation

@venkit-nexthop
Copy link
Copy Markdown
Contributor

What I did

Introduce a dedicated ZmqRouteServer (with ZmqRouteOrch and ZmqRouteConsumer) for RouteOrch to receive APPL_DB route updates from fpmsyncd. Non-fabric / non-dpu switches now create a ZmqRouteServer instead of the generic ZmqServer; fabric and DPU continue to use the existing ZmqServer.

ZmqRouteConsumer merges incoming tuples into m_toSync from the mqPollThread ingress callback under m_toSyncMutex, and only notifies the orch main loop once a batch (gMaxBulkSize) has accumulated. To support this concurrent access, ConsumerBase::addToSync and dumpPendingTasks are made virtual so the route consumer can wrap them in a lock, while the default single-threaded base remains lock-free.

Other changes:

  • create_zmq_route_server() factory added in lib/orch_zmq_config.
  • getCfgSwitchType() moved earlier in orchagent main so the server type can be chosen by switch type.
  • fake_zmqserver updated to return a ZmqMessageHandler* from handleReceivedData to match the new signature.
  • New unit tests in tests/mock_tests/zmq_route_orch_ut.cpp.

Why I did it

The generic ZmqServer serializes all message handling on a single thread. For route-heavy workloads, dispatching route updates on a dedicated server with its own poll thread improves throughput and decouples the route ingestion path from the rest of the orchagent ZMQ channel.

How I verified it

  • Built sonic-swss locally (clean make_exit=0).
  • Added zmq_route_orch_ut.cpp covering the new consumer.

Details if related

N/A

Introduce a dedicated ZmqRouteServer (and ZmqRouteOrch/ZmqRouteConsumer)
used by RouteOrch for receiving APPL_DB route updates from fpmsyncd.
Non-fabric/non-dpu switches now create a ZmqRouteServer instead of the
generic ZmqServer; fabric and DPU continue to use ZmqServer.

ZmqRouteConsumer merges incoming tuples into m_toSync from the
mqPollThread ingress callback under m_toSyncMutex, and only notifies the
orch main loop once a batch (gMaxBulkSize) has accumulated. To support
this concurrent access, ConsumerBase::addToSync and dumpPendingTasks are
made virtual so the route consumer can wrap them in a lock, while the
default single-threaded base remains lock-free.

Also:
- create_zmq_route_server() factory added in lib/orch_zmq_config.
- getCfgSwitchType() moved earlier in orchagent main so the server type
  can be chosen based on switch type.
- Update fake_zmqserver to return a ZmqMessageHandler* from
  handleReceivedData to match the new upstream signature.
- Add zmq_route_orch_ut.cpp unit tests.

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>
@venkit-nexthop venkit-nexthop requested a review from prsunny as a code owner May 12, 2026 00:49
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@venkit-nexthop
Copy link
Copy Markdown
Contributor Author

tagging @deepak-singhal0408 @prabhataravind @prsunny @qiluo-msft @vivekrnv for viz

Also refer to the following PRs:

Per Task instrumentation:
sonic-net/sonic-utilities#4536
#4563

Update m_toSync from mqPollThread:
sonic-net/sonic-swss-common#1187
#4564

Comment thread orchagent/routeorch.cpp
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deadlock possibility in case of resync where we call consumer.addToSync(x) from inside doTask(ConsumerBase&)


void ZmqRouteConsumer::drain()
{
std::lock_guard<std::mutex> lk(m_toSyncMutex);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With drain() holding the lock during the entire processing of doTask(), how the new behavior of directly pushing to m_toSync would work as ZmqRouteConsumerStateTable::handleReceivedData cannot acquire the lock. Can we add a comment as To-DO that we would yield in the future based on some time quanta

Comment thread orchagent/zmqrouteorch.h
#include "zmqrouteserver.h"
#include "zmqrouteconsumerstatetable.h"

extern int gZmqExecuteTimeQuantaMsecs;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to add a comment of TODO on this will be used as we are not referencing this now

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test where producers call addToSync while drain() is in progress ?

void ZmqRouteOrch::doTask(Consumer &consumer)
{
// When ZMQ disabled, forward data from Consumer
doTask((ConsumerBase &)consumer);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
doTask((ConsumerBase &)consumer);
doTask(static_cast<ConsumerBase&>(consumer));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

4 participants