[show, clear]: add 'orchagent tasks' commands to surface per-Executor timing + scheduling latency #4536

Open
venkit-nexthop wants to merge 1 commit into sonic-net:master from venkit-nexthop:venkit/show-orchagent-tasks

@venkit-nexthop
Contributor

What I did

Add two new CLI commands that surface orchagent's per-Executor execution-time
statistics so operators can answer "where is orchagent spending its time?"
without rebuilding it:

  • show orchagent tasks — per-Executor run-time and scheduling-latency table.
  • sonic-clear orchagent tasks — reset the counters in place.

This is the CLI half; the orchagent-side instrumentation that publishes the
reply lives in a companion sonic-swss change. The CLI degrades gracefully
(times out with a non-zero exit code) if the daemon side is not present.

How I did it

Two new click groups (show.orchagent.tasks and clear.orchagent.tasks)
are wired into show/main.py and clear/main.py.

The CLI talks to orchagent over two APPL_DB notification channels
(ORCH_TASK_STATS_QUERY / ORCH_TASK_STATS_REPLY) and parses the
14-field pipe-separated reply produced by orchagent:

count | total_run_ns
  | median_run_ns | q1_run_ns | q3_run_ns | max_run_ns
  | high_outliers | low_outliers
  | sched_count   | total_sched_ns
  | median_sched_ns | q1_sched_ns | q3_sched_ns | max_sched_ns
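
To illustrate the reply layout above, here is a minimal parsing sketch. It is not the code in this PR: the function name `parse_task_stats` is hypothetical, and it assumes every field arrives as an integer (nanoseconds or a count). Only the field order is taken from the description.

```python
from typing import Dict

# Field order as listed in the PR description.
FIELDS = (
    "count", "total_run_ns",
    "median_run_ns", "q1_run_ns", "q3_run_ns", "max_run_ns",
    "high_outliers", "low_outliers",
    "sched_count", "total_sched_ns",
    "median_sched_ns", "q1_sched_ns", "q3_sched_ns", "max_sched_ns",
)


def parse_task_stats(value: str) -> Dict[str, int]:
    """Split one pipe-separated reply value into named integer fields.

    Assumption: all 14 fields are integers; a reply with any other
    field count is rejected rather than partially parsed.
    """
    parts = [p.strip() for p in value.split("|")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, (int(p) for p in parts)))
```

Rejecting a short or long reply outright keeps a version mismatch between the CLI and the daemon visible instead of silently shifting columns.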

Rendering:

  • RUN TIME — P² median/q1/q3/max of per-invocation wall-clock duration.
    Median is the headline, Q1/Q3 give spread, max exposes the worst tail.
  • RUNS — number of completed invocations.
  • OUTLIERS — Tukey 1.5×IQR sum (high + low).
  • SCHED LATENCY — P² median/q1/q3/max of the gap between when the task
    finished and when it was next scheduled. Exposes select-loop starvation.
  • TOTAL — cumulative <run>/<sched> wall-clock spent inside the task vs
    waiting before it.
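
The cell formats described above can be sketched as follows. The helper names (`ns_to_ms`, `format_quartet`, `format_total`) are hypothetical and the actual implementation may differ. The sketch only shows the nanosecond-to-millisecond conversion with two decimals and no per-cell unit suffix.

```python
NS_PER_MS = 1_000_000


def ns_to_ms(ns: int) -> str:
    """Render a nanosecond value as milliseconds with two decimals."""
    return f"{ns / NS_PER_MS:.2f}"


def format_quartet(median_ns: int, q1_ns: int, q3_ns: int, max_ns: int) -> str:
    """Build a median/q1/q3/max cell, as used by RUN TIME and SCHED LATENCY."""
    return "/".join(ns_to_ms(v) for v in (median_ns, q1_ns, q3_ns, max_ns))


def format_total(total_run_ns: int, total_sched_ns: int) -> str:
    """Build the TOTAL cell as <run>/<sched>."""
    return f"{ns_to_ms(total_run_ns)}/{ns_to_ms(total_sched_ns)}"


def outlier_count(high_outliers: int, low_outliers: int) -> int:
    """The OUTLIERS column shows the high and low counts summed."""
    return high_outliers + low_outliers
```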

Rows are sorted by total_run_ns descending; ties break by name ascending.
Empty slots (count = 0) print - for quartet/total cells but keep integer
RUNS / OUTLIERS.
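
The ordering and empty-slot rules can be expressed compactly. This is a sketch under stated assumptions: `sort_rows` and `cell_or_dash` are hypothetical names, each row is assumed to be a dict keyed by the reply field names, and an empty cell is assumed to render as a single `-`.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def sort_rows(stats: Dict[str, Row]) -> List[Tuple[str, Row]]:
    """Order rows by total_run_ns descending, then by task name ascending."""
    return sorted(
        stats.items(),
        key=lambda item: (-item[1]["total_run_ns"], item[0]),
    )


def cell_or_dash(text: str, count: int) -> str:
    """Show '-' for quartet/total cells of a task that has never run."""
    return text if count > 0 else "-"
```

Negating the numeric key while leaving the name untouched gives descending run time and ascending names in one stable sort.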

sonic-clear orchagent tasks sends op="clear" on the same query channel
and prints OK on a successful round-trip.

doc/Command-Reference.md is updated with a new Orchagent section
(TOC + body) per the contributing guideline for new show / sonic-clear
subcommands.

How to verify it

tests/show_orchagent_tasks_test.py and tests/clear_orchagent_tasks_test.py
mock swsscommon.NotificationProducer / NotificationConsumer / Select
so the round-trip is exercised end-to-end without Redis. Coverage:
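
As a rough illustration of the mocking approach (not the tests in this PR), a stand-in producer built with `unittest.mock` lets a test assert on the op string without Redis. The helper `send_query` is hypothetical, and the `(op, data, values)` argument shape passed to `send` is an assumption of this sketch.

```python
from unittest import mock


def send_query(producer, op: str) -> None:
    """Hypothetical helper: publish a query carrying the given op string."""
    # Assumed call shape: send(op, data, values).
    producer.send(op, "", [])


# A MagicMock stands in for swsscommon.NotificationProducer.
producer = mock.MagicMock(name="NotificationProducer")
send_query(producer, "clear")

# Verify that exactly one message was sent and that it carried op="clear".
producer.send.assert_called_once_with("clear", "", [])
```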

  • Header presence (TASK / RUN TIME / median/q1/q3/max / RUNS / OUTLIERS /
    SCHED LATENCY / TOTAL / run/sched / (in msec))
  • Sort order (total_run_ns descending, name asc tiebreak)
  • Quartet formatting (ms, two decimals, no per-cell unit suffix)
  • Outlier counts visible (high + low summed)
  • TOTAL column = <run>/<sched> with empty-row dashes
  • Empty-reply case prints headers only
  • Timeout and orchagent-error reply paths both exit with code 1
  • Correct op string is sent (show / clear)

12 / 12 pytests pass against the mocked swsscommon round-trip.

Manual: with the companion sonic-swss change loaded, run the commands on
a live switch and observe per-Executor stats / reset.

New command output (if the output of a command-line utility has changed)

admin@sonic:~$ show orchagent tasks
TASK         RUN TIME                          RUNS  OUTLIERS  SCHED LATENCY                    TOTAL
             median/q1/q3/max                                  median/q1/q3/max                 run/sched
             (in msec)                                         (in msec)                        (in msec)
ROUTE_TABLE  1745.53/1391.34/2242.07/3913.36     43         5  1.06/0.41/48.40/1436.01          77619.63/5.04

admin@sonic:~$ sonic-clear orchagent tasks
OK

@linux-foundation-easycla

linux-foundation-easycla Bot commented May 11, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: venkit-nexthop / name: Venkit Kasiviswanathan (7019adc)

Surface per-Executor runtime and scheduling-latency statistics from
orchagent without rebuilding. Operators can now answer "where is
orchagent spending its time?" from the CLI.

New commands:
  - show orchagent tasks    -> per-Executor run-time + sched-latency table
  - sonic-clear orchagent tasks -> reset the counters in place

The CLI talks to orchagent over two APPL_DB notification channels
(ORCH_TASK_STATS_QUERY / ORCH_TASK_STATS_REPLY) and parses the
14-field pipe-separated reply. Rows are sorted by total run-time
descending; the RUN TIME and SCHED LATENCY columns render P^2
median/q1/q3/max quartets in milliseconds, the OUTLIERS column shows
Tukey 1.5x IQR sums, and the TOTAL column shows cumulative
run/sched wall-clock.

Tests mock swsscommon so the round-trip is exercised without Redis.
Coverage: header presence, sort order, quartet formatting, outlier
counts, empty/zero-slot rendering, timeout/error paths, and that the
correct op string is sent.

doc/Command-Reference.md is updated with the new Orchagent section.

This is the CLI half; the orchagent-side instrumentation that emits
the reply lives in a companion sonic-swss change.

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>