Skip to content

Commit b0342e7

Browse files
[show, clear]: add 'orchagent tasks' commands
Surface per-Executor runtime and scheduling-latency statistics from orchagent without rebuilding. Operators can now answer "where is orchagent spending its time?" from the CLI. New commands: - show orchagent tasks -> per-Executor run-time + sched-latency table - sonic-clear orchagent tasks -> reset the counters in place The CLI talks to orchagent over two APPL_DB notification channels (ORCH_TASK_STATS_QUERY / ORCH_TASK_STATS_REPLY) and parses the 14-field pipe-separated reply. Rows are sorted by total run-time descending; the RUN TIME and SCHED LATENCY columns render P^2 median/q1/q3/max quartets in milliseconds, the OUTLIERS column shows Tukey 1.5x IQR sums, and the TOTAL column shows cumulative run/sched wall-clock. Tests mock swsscommon so the round-trip is exercised without Redis. Coverage: header presence, sort order, quartet formatting, outlier counts, empty/zero-slot rendering, timeout/error paths, and that the correct op string is sent. doc/Command-Reference.md is updated with the new Orchagent section. This is the CLI half; the orchagent-side instrumentation that emits the reply lives in a companion sonic-swss change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>
1 parent c37b846 commit b0342e7

7 files changed

Lines changed: 578 additions & 0 deletions

File tree

clear/main.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
from utilities_common import util_base
1212
from show.plugins.pbh import read_pbh_counters
1313
from config.plugins.pbh import serialize_pbh_counters
14+
from . import orchagent as orchagent_clear
1415
from . import plugins
1516
from . import stp
1617
# This is from the aliases example:
@@ -149,6 +150,11 @@ def ipv6():
149150
#
150151
cli.add_command(stp.spanning_tree)
151152

153+
#
154+
# 'orchagent'
155+
#
156+
cli.add_command(orchagent_clear.orchagent)
157+
152158
#
153159
# Inserting BGP functionality into cli's clear parse-chain.
154160
# BGP commands are determined by the routing-stack being elected.

clear/orchagent.py

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
import click
2+
import sys
3+
4+
from swsscommon import swsscommon
5+
6+
7+
TASK_STATS_QUERY_CHANNEL = "ORCH_TASK_STATS_QUERY"
8+
TASK_STATS_REPLY_CHANNEL = "ORCH_TASK_STATS_REPLY"
9+
REPLY_TIMEOUT_MS = 10000
10+
11+
12+
def _send_clear():
13+
db = swsscommon.DBConnector("APPL_DB", 0)
14+
producer = swsscommon.NotificationProducer(db, TASK_STATS_QUERY_CHANNEL)
15+
consumer = swsscommon.NotificationConsumer(db, TASK_STATS_REPLY_CHANNEL)
16+
17+
sel = swsscommon.Select()
18+
sel.addSelectable(consumer)
19+
20+
producer.send("clear", "", swsscommon.FieldValuePairs([]))
21+
22+
state, _ = sel.select(REPLY_TIMEOUT_MS)
23+
if state == swsscommon.Select.TIMEOUT:
24+
raise RuntimeError(
25+
f"Timed out after {REPLY_TIMEOUT_MS} ms waiting for orchagent reply")
26+
if state != swsscommon.Select.OBJECT:
27+
raise RuntimeError(f"Select error waiting for orchagent reply: {state}")
28+
29+
op_ret, data_ret, _ = consumer.pop()
30+
if op_ret != "ok":
31+
raise RuntimeError(
32+
f"orchagent returned error: op={op_ret} data={data_ret}")
33+
34+
35+
@click.group()
36+
def orchagent():
37+
"""Clear orchagent runtime stats"""
38+
pass
39+
40+
41+
@orchagent.command("tasks")
42+
def tasks():
43+
"""Reset per-Executor execution timing counters in orchagent"""
44+
try:
45+
_send_clear()
46+
except RuntimeError as e:
47+
click.echo(f"Error: {e}", err=True)
48+
sys.exit(1)
49+
click.echo("OK")

doc/Command-Reference.md

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -151,6 +151,9 @@
151151
* [NVGRE](#nvgre)
152152
* [NVGRE show commands](#nvgre-show-commands)
153153
* [NVGRE config commands](#nvgre-config-commands)
154+
* [Orchagent](#orchagent)
155+
* [Orchagent show commands](#orchagent-show-commands)
156+
* [Orchagent clear commands](#orchagent-clear-commands)
154157
* [PBH](#pbh)
155158
* [PBH show commands](#pbh-show-commands)
156159
* [PBH config commands](#pbh-config-commands)
@@ -9928,6 +9931,64 @@ This command is used to delete a configured NTP server IP address.
99289931

99299932
Go Back To [Beginning of the document](#) or [Beginning of this section](#NTP)
99309933

9934+
## Orchagent
9935+
9936+
This section explains the orchagent runtime introspection commands.
9937+
9938+
`orchagent` is the SONiC orchestration agent that processes APPL_DB updates and drives the hardware via SAI. Each Executor in orchagent represents a self-scheduled task (per-table consumers, periodic timers, etc.); the commands below surface how long each task is taking and how long it waits between invocations so operators can identify select-loop starvation without rebuilding orchagent.
9939+
9940+
### Orchagent show commands
9941+
9942+
**show orchagent tasks**
9943+
9944+
This command displays the per-Executor execution-time statistics that orchagent maintains internally. Each row corresponds to one orchagent task (an `Orch` subclass or named periodic). Rows are sorted by total run time descending; ties break by task name.
9945+
9946+
Columns:
9947+
9948+
- `TASK` — Executor name.
9949+
- `RUN TIME` — `median/q1/q3/max` of per-invocation wall-clock duration, in milliseconds. Median is the headline, Q1/Q3 give spread, max exposes the worst tail.
9950+
- `RUNS` — total number of completed invocations.
9951+
- `OUTLIERS` — Tukey 1.5×IQR sum (high + low) — invocations whose duration fell outside the bulk of the distribution.
9952+
- `SCHED LATENCY` — `median/q1/q3/max` of the gap between when the task finished and when it was next scheduled, in milliseconds. Exposes select-loop starvation.
9953+
- `TOTAL` — cumulative `<run>/<sched>` wall-clock spent inside the task vs waiting before it, in milliseconds.
9954+
9955+
Slots that have never run print `-` in place of the quartet/total values; `RUNS` / `OUTLIERS` remain integers.
9956+
9957+
The CLI talks to orchagent over two APPL_DB notification channels (`ORCH_TASK_STATS_QUERY` / `ORCH_TASK_STATS_REPLY`); the command exits non-zero with an error message if orchagent does not reply within 10 seconds.
9958+
9959+
- Usage:
9960+
```
9961+
show orchagent tasks
9962+
```
9963+
9964+
- Example:
9965+
```
9966+
admin@sonic:~$ show orchagent tasks
9967+
TASK RUN TIME RUNS OUTLIERS SCHED LATENCY TOTAL
9968+
median/q1/q3/max median/q1/q3/max run/sched
9969+
(in msec) (in msec) (in msec)
9970+
ROUTE_TABLE 1745.53/1391.34/2242.07/3913.36 43 5 1.06/0.41/48.40/1436.01 77619.63/5.04
9971+
```
9972+
9973+
### Orchagent clear commands
9974+
9975+
**sonic-clear orchagent tasks**
9976+
9977+
This command resets orchagent's per-Executor execution-time counters in place. Subsequent `show orchagent tasks` output starts accumulating from zero.
9978+
9979+
- Usage:
9980+
```
9981+
sonic-clear orchagent tasks
9982+
```
9983+
9984+
- Example:
9985+
```
9986+
admin@sonic:~$ sonic-clear orchagent tasks
9987+
OK
9988+
```
9989+
9990+
Go Back To [Beginning of the document](#) or [Beginning of this section](#orchagent)
9991+
99319992
# PFC Watchdog Commands
99329993
Detailed description of the PFC Watchdog can be found on [this wiki page](https://github.com/sonic-net/SONiC/wiki/PFC-Watchdog)
99339994

show/main.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@
7373
from . import switch
7474
from . import icmp
7575
from . import copp
76+
from . import orchagent
7677

7778
# Global Variables
7879
PLATFORM_JSON = 'platform.json'
@@ -334,6 +335,7 @@ def cli(ctx):
334335
cli.add_command(switch.switch)
335336
cli.add_command(icmp.icmp)
336337
cli.add_command(copp.copp)
338+
cli.add_command(orchagent.orchagent)
337339

338340
# syslog module
339341
cli.add_command(syslog.syslog)

show/orchagent.py

Lines changed: 177 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,177 @@
1+
import click
2+
import sys
3+
4+
from swsscommon import swsscommon
5+
from tabulate import tabulate
6+
7+
8+
TASK_STATS_QUERY_CHANNEL = "ORCH_TASK_STATS_QUERY"
9+
TASK_STATS_REPLY_CHANNEL = "ORCH_TASK_STATS_REPLY"
10+
REPLY_TIMEOUT_MS = 10000
11+
12+
13+
def _ms(ns):
14+
"""Convert ns -> ms as a float (no unit suffix)."""
15+
return ns / 1_000_000.0
16+
17+
18+
def _fmt_quartet(median_ns, q1_ns, q3_ns, max_ns):
19+
"""Render a 'median/q1/q3/max' quartet in ms, two decimals each."""
20+
return (f"{_ms(median_ns):.2f}/"
21+
f"{_ms(q1_ns):.2f}/"
22+
f"{_ms(q3_ns):.2f}/"
23+
f"{_ms(max_ns):.2f}")
24+
25+
26+
def _query_orchagent(op):
27+
"""Send a query to orchagent over APPL_DB notification channels and
28+
return (op_ret, data_ret, fvs) on success.
29+
30+
Raises RuntimeError on timeout or transport error.
31+
"""
32+
db = swsscommon.DBConnector("APPL_DB", 0)
33+
producer = swsscommon.NotificationProducer(db, TASK_STATS_QUERY_CHANNEL)
34+
consumer = swsscommon.NotificationConsumer(db, TASK_STATS_REPLY_CHANNEL)
35+
36+
sel = swsscommon.Select()
37+
sel.addSelectable(consumer)
38+
39+
producer.send(op, "", swsscommon.FieldValuePairs([]))
40+
41+
state, _ = sel.select(REPLY_TIMEOUT_MS)
42+
if state == swsscommon.Select.TIMEOUT:
43+
raise RuntimeError(
44+
f"Timed out after {REPLY_TIMEOUT_MS} ms waiting for orchagent reply")
45+
if state != swsscommon.Select.OBJECT:
46+
raise RuntimeError(f"Select error waiting for orchagent reply: {state}")
47+
48+
op_ret, data_ret, fvs = consumer.pop()
49+
if op_ret != "ok":
50+
raise RuntimeError(
51+
f"orchagent returned error: op={op_ret} data={data_ret}")
52+
return op_ret, data_ret, fvs
53+
54+
55+
def _parse_stats(fvs):
56+
"""Parse the FieldValueTuples returned by orchagent into a list of
57+
dicts. Each value is 14 pipe-separated fields:
58+
count | total_run_ns
59+
| median_run_ns | q1_run_ns | q3_run_ns | max_run_ns
60+
| high_outliers | low_outliers
61+
| sched_count | total_sched_ns
62+
| median_sched_ns | q1_sched_ns | q3_sched_ns | max_sched_ns
63+
"""
64+
rows = []
65+
for name, blob in fvs:
66+
parts = blob.split("|")
67+
if len(parts) != 14:
68+
# Skip malformed rows rather than failing the whole table.
69+
continue
70+
try:
71+
count = int(parts[0])
72+
total_ns = int(parts[1])
73+
median_ns = int(parts[2])
74+
q1_ns = int(parts[3])
75+
q3_ns = int(parts[4])
76+
max_ns = int(parts[5])
77+
high_outliers = int(parts[6])
78+
low_outliers = int(parts[7])
79+
sched_count = int(parts[8])
80+
total_sched_ns = int(parts[9])
81+
sched_median_ns = int(parts[10])
82+
sched_q1_ns = int(parts[11])
83+
sched_q3_ns = int(parts[12])
84+
sched_max_ns = int(parts[13])
85+
except ValueError:
86+
continue
87+
rows.append({
88+
"name": name,
89+
"count": count,
90+
"total_ns": total_ns,
91+
"median_ns": median_ns,
92+
"q1_ns": q1_ns,
93+
"q3_ns": q3_ns,
94+
"max_ns": max_ns,
95+
"high_outliers": high_outliers,
96+
"low_outliers": low_outliers,
97+
"sched_count": sched_count,
98+
"total_sched_ns": total_sched_ns,
99+
"sched_median_ns": sched_median_ns,
100+
"sched_q1_ns": sched_q1_ns,
101+
"sched_q3_ns": sched_q3_ns,
102+
"sched_max_ns": sched_max_ns,
103+
})
104+
return rows
105+
106+
107+
def _render_table(rows):
108+
# Sort: total_ns descending, ties broken by name ascending.
109+
rows = sorted(rows,
110+
key=lambda r: (-r["total_ns"], r["name"]))
111+
112+
table = []
113+
for r in rows:
114+
if r["count"] == 0:
115+
run_quartet = "-"
116+
total_run = "-"
117+
else:
118+
run_quartet = _fmt_quartet(r["median_ns"],
119+
r["q1_ns"],
120+
r["q3_ns"],
121+
r["max_ns"])
122+
total_run = f"{_ms(r['total_ns']):.2f}"
123+
124+
if r["sched_count"] == 0:
125+
sched_quartet = "-"
126+
total_sched = "-"
127+
else:
128+
sched_quartet = _fmt_quartet(r["sched_median_ns"],
129+
r["sched_q1_ns"],
130+
r["sched_q3_ns"],
131+
r["sched_max_ns"])
132+
total_sched = f"{_ms(r['total_sched_ns']):.2f}"
133+
134+
# TOTAL column shows "<run>/<sched>" so a viewer can see at a
135+
# glance how much wall-clock the loop spent inside the task vs
136+
# waiting before scheduling it.
137+
total_str = f"{total_run}/{total_sched}"
138+
139+
outliers = r["high_outliers"] + r["low_outliers"]
140+
141+
table.append([
142+
r["name"],
143+
run_quartet,
144+
r["count"],
145+
outliers,
146+
sched_quartet,
147+
total_str,
148+
])
149+
150+
headers = [
151+
"TASK",
152+
"RUN TIME\nmedian/q1/q3/max\n(in msec)",
153+
"RUNS",
154+
"OUTLIERS",
155+
"SCHED LATENCY\nmedian/q1/q3/max\n(in msec)",
156+
"TOTAL\nrun/sched\n(in msec)",
157+
]
158+
return tabulate(table, headers=headers, tablefmt="plain")
159+
160+
161+
@click.group()
162+
def orchagent():
163+
"""Show orchagent runtime information"""
164+
pass
165+
166+
167+
@orchagent.command("tasks")
168+
def tasks():
169+
"""Show per-Executor execution timing in orchagent"""
170+
try:
171+
_, _, fvs = _query_orchagent("show")
172+
except RuntimeError as e:
173+
click.echo(f"Error: {e}", err=True)
174+
sys.exit(1)
175+
176+
rows = _parse_stats(fvs)
177+
click.echo(_render_table(rows))
Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
"""
2+
Unit tests for `clear orchagent tasks`.
3+
4+
Mocks the swsscommon NotificationProducer / NotificationConsumer / Select
5+
to round-trip the request without needing Redis.
6+
"""
7+
8+
from unittest import mock
9+
10+
import pytest
11+
from click.testing import CliRunner
12+
13+
from clear import orchagent as clear_orchagent
14+
15+
16+
@pytest.fixture
17+
def fake_swsscommon():
18+
fake = mock.MagicMock()
19+
fake.Select.OBJECT = 0
20+
fake.Select.TIMEOUT = 1
21+
fake.Select.return_value.select.return_value = (fake.Select.OBJECT, None)
22+
23+
fake.NotificationConsumer.return_value.pop.return_value = ("ok", "", [])
24+
fake.NotificationProducer.return_value.send.return_value = None
25+
fake.FieldValuePairs.side_effect = lambda x: x
26+
fake.DBConnector.return_value = mock.MagicMock()
27+
28+
with mock.patch.object(clear_orchagent, "swsscommon", fake):
29+
yield fake
30+
31+
32+
def test_clear_sends_clear_op_and_prints_ok(fake_swsscommon):
33+
runner = CliRunner()
34+
result = runner.invoke(clear_orchagent.orchagent, ["tasks"])
35+
assert result.exit_code == 0, result.output
36+
assert "OK" in result.output
37+
38+
producer = fake_swsscommon.NotificationProducer.return_value
39+
assert producer.send.call_count == 1
40+
op_arg = producer.send.call_args[0][0]
41+
assert op_arg == "clear"
42+
43+
44+
def test_clear_timeout_reports_error(fake_swsscommon):
45+
fake_swsscommon.Select.return_value.select.return_value = (
46+
fake_swsscommon.Select.TIMEOUT, None)
47+
48+
runner = CliRunner()
49+
result = runner.invoke(clear_orchagent.orchagent, ["tasks"])
50+
assert result.exit_code != 0
51+
52+
53+
def test_clear_orchagent_error_reply_reports_error(fake_swsscommon):
54+
fake_swsscommon.NotificationConsumer.return_value.pop.return_value = (
55+
"error", "unknown op", [])
56+
57+
runner = CliRunner()
58+
result = runner.invoke(clear_orchagent.orchagent, ["tasks"])
59+
assert result.exit_code != 0

0 commit comments

Comments
 (0)