[show, clear]: add 'orchagent tasks' commands to surface per-Executor timing + scheduling latency #4536
Open
venkit-nexthop wants to merge 1 commit into
Surface per-Executor runtime and scheduling-latency statistics from orchagent without rebuilding. Operators can now answer "where is orchagent spending its time?" from the CLI.

New commands:
- show orchagent tasks -> per-Executor run-time + sched-latency table
- sonic-clear orchagent tasks -> reset the counters in place

The CLI talks to orchagent over two APPL_DB notification channels (ORCH_TASK_STATS_QUERY / ORCH_TASK_STATS_REPLY) and parses the 14-field pipe-separated reply. Rows are sorted by total run-time descending; the RUN TIME and SCHED LATENCY columns render P^2 median/q1/q3/max quartets in milliseconds, the OUTLIERS column shows Tukey 1.5x IQR sums, and the TOTAL column shows cumulative run/sched wall-clock.

Tests mock swsscommon so the round-trip is exercised without Redis. Coverage: header presence, sort order, quartet formatting, outlier counts, empty/zero-slot rendering, timeout/error paths, and that the correct op string is sent.

doc/Command-Reference.md is updated with the new Orchagent section.

This is the CLI half; the orchagent-side instrumentation that emits the reply lives in a companion sonic-swss change.

Signed-off-by: Venkit Kasiviswanathan <venkit@nexthop.ai>
b0342e7 to
7019adc
Compare
What I did
Add two new CLI commands that surface orchagent's per-Executor execution-time
statistics so operators can answer "where is orchagent spending its time?"
without rebuilding it:
- show orchagent tasks: per-Executor run-time and scheduling-latency table.
- sonic-clear orchagent tasks: reset the counters in place.

This is the CLI half; the orchagent-side instrumentation that publishes the reply lives in a companion sonic-swss change. The CLI degrades gracefully (times out with a non-zero exit code) if the daemon side is not present.
How I did it
- Two new click groups (show.orchagent.tasks and clear.orchagent.tasks) wired into show/main.py and clear/main.py.
- The CLI talks to orchagent over two APPL_DB notification channels (ORCH_TASK_STATS_QUERY / ORCH_TASK_STATS_REPLY) and parses the 14-field pipe-separated reply produced by orchagent:
Rendering:
- RUN TIME: median/q1/q3/max of per-invocation wall-clock duration. Median is the headline, Q1/Q3 give spread, max exposes the worst tail.
- SCHED LATENCY: median/q1/q3/max of the gap between when the task finished and when it was next scheduled. Exposes select-loop starvation.
- TOTAL: <run>/<sched> wall-clock spent inside the task vs waiting before it.
- Rows are sorted by total_run_ns descending; ties break by name ascending.
- Empty slots (count = 0) print - for quartet/total cells but keep integer RUNS/OUTLIERS.
- sonic-clear orchagent tasks sends op="clear" on the same query channel and prints OK on a successful round-trip.
- doc/Command-Reference.md is updated with a new Orchagent section (TOC + body) per the contributing guideline for new show/sonic-clear subcommands.
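The sort and cell-rendering rules described in this section can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the row dictionaries, the key names other than total_run_ns (count, run_median_ns and so on), the helper names, and the three-decimal msec precision are all assumptions made for the example.

```python
from typing import Dict, List

QUARTET_KEYS = ("median", "q1", "q3", "max")


def ns_to_ms(ns: int) -> str:
    """Render nanoseconds as milliseconds (3 decimals is an assumed precision)."""
    return f"{ns / 1_000_000:.3f}"


def quartet(row: Dict, prefix: str) -> str:
    """Return 'median/q1/q3/max' in msec, or '-' for an empty slot (count == 0)."""
    if row["count"] == 0:
        return "-"
    # Key layout '<prefix>_<stat>_ns' is hypothetical, chosen for this sketch.
    return "/".join(ns_to_ms(row[f"{prefix}_{k}_ns"]) for k in QUARTET_KEYS)


def sort_rows(rows: List[Dict]) -> List[Dict]:
    """Sort by total_run_ns descending; break ties by name ascending."""
    return sorted(rows, key=lambda r: (-r["total_run_ns"], r["name"]))


# Made-up rows purely for demonstration.
rows = [
    {"name": "b", "count": 2, "total_run_ns": 5_000_000,
     "run_median_ns": 1_500_000, "run_q1_ns": 1_000_000,
     "run_q3_ns": 2_000_000, "run_max_ns": 3_000_000},
    {"name": "a", "count": 0, "total_run_ns": 5_000_000},
    {"name": "c", "count": 1, "total_run_ns": 9_000_000,
     "run_median_ns": 9_000_000, "run_q1_ns": 9_000_000,
     "run_q3_ns": 9_000_000, "run_max_ns": 9_000_000},
]

ordered_names = [r["name"] for r in sort_rows(rows)]
cells = {r["name"]: quartet(r, "run") for r in rows}
```

Negating total_run_ns in the sort key gives the descending primary order while leaving the name tiebreak ascending in a single pass.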
How to verify it
tests/show_orchagent_tasks_test.py and tests/clear_orchagent_tasks_test.py mock swsscommon.NotificationProducer / NotificationConsumer / Select so the round-trip is exercised end-to-end without Redis. Coverage:
- Header presence (SCHED LATENCY / TOTAL / run/sched / (in msec))
- Sort order (total_run_ns descending, name asc tiebreak)
- <run>/<sched> rendering with empty-row dashes
- Correct op string sent (show/clear)

12 / 12 pytests pass against the mocked swsscommon round-trip.
Manual: with the companion sonic-swss change loaded, run the commands on a live switch and observe per-Executor stats / reset.
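For readers unfamiliar with the mocking approach, the idea of checking the op string without Redis can be sketched as below. This is an illustration under stated assumptions, not the tests in this PR: send_op is a hypothetical stand-in for the CLI's send path, the empty data string and empty field list are assumed, and the whole swsscommon module is replaced by a MagicMock rather than the individual classes.

```python
import sys
import types
from unittest import mock


def send_op(op: str) -> None:
    """Hypothetical stand-in for the CLI's send path (not the PR's code)."""
    from swsscommon import swsscommon  # resolved from sys.modules at call time
    db = swsscommon.DBConnector("APPL_DB", 0)
    producer = swsscommon.NotificationProducer(db, "ORCH_TASK_STATS_QUERY")
    # Payload shape (empty data, no field/value pairs) is an assumption.
    producer.send(op, "", swsscommon.FieldValuePairs([]))


def run_with_fake(op: str) -> mock.MagicMock:
    """Run send_op against a fake swsscommon so no Redis is needed."""
    fake = mock.MagicMock()
    pkg = types.ModuleType("swsscommon")
    pkg.swsscommon = fake
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, {"swsscommon": pkg}):
        send_op(op)
    return fake


fake = run_with_fake("clear")
# Positional args recorded by the mock: send(op, data, values).
sent_op = fake.NotificationProducer.return_value.send.call_args[0][0]
# Positional args recorded by the mock: NotificationProducer(db, channel).
channel = fake.NotificationProducer.call_args[0][1]
```

Because the mock records every call, a test can assert on both the channel name and the op string without any daemon or database running.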
New command output (if the output of a command-line utility has changed)