Skip to content

Commit 677ea8e

Browse files
committed
Add info for debugging high CPU utilization
1 parent 8f1b40a commit 677ea8e

3 files changed

Lines changed: 330 additions & 0 deletions

File tree

Lines changed: 181 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,181 @@
1+
# Troubleshooting High CPU Utilization
2+
3+
/// note
4+
We'll cover high memory utilization in a separate document.
5+
///
6+
7+
## What's "High"???
8+
9+
It sounds silly but if you can't define what "high" means to you then
10+
how will you know when you've "fixed" it?
11+
12+
High CPU utilization isn't usually an issue by itself. The real issue would be
13+
the side effects like poor call quality or long call setup times, etc.
14+
Understanding the issue you're tring to solve and what your goals are should
15+
be the first step in any investigation.
16+
17+
## Understand the environment
18+
19+
The next step in the investigation should be understanding your environment thoroughly.
20+
These are just a few things to keep in your mind:
21+
22+
* Is Asterisk running in a virtual machine or a container?
23+
* What else is competing for resources?
24+
* Is the filesystem directly attached or remote via a SAN, etc.?
25+
* What else is sharing the same CPU?
26+
* Databases
27+
* ARI/AMI/AGI applications
28+
* Unrelated applications or servers
29+
* Are you using a realtime database?
30+
* Where is it hosted?
31+
* What else is using the same database instance?
32+
* What's the network configuration?
33+
* Network interfaces.
34+
* Firewalls.
35+
* DNS infrastructure.
36+
* Proxies and session border controllers.
37+
38+
There's more but you get the idea.
39+
40+
## Identify the actors
41+
42+
Identifying _why_ you're seeing high CPU utilization can be complicated but let's start
43+
with the _what_ first because it might not even be Asterisk itself. The tool of choice
44+
here is the venerable `top` or it's relatives `htop` and `btop`.
45+
46+
### Database
47+
48+
If the database has nothing to do with Asterisk itself then it probably shouln't be running
49+
in the same OS instance. If the database _is_ hosting your Asterisk configuration check where
50+
you're storing your incoming PJSIP registrations (non permanent contacts). The default is to
51+
store them in the "astdb" (Sqlite3 database at `/var/lib/asterisk/astdb.sqlite3`) but some people
52+
use sorcery.conf and extconfig.conf to move them to an external database believing it's "better".
53+
Unless you need to expose the active registrations to another entity like Kamailio, leaving them
54+
in the astdb is better for performance because the database is all in-memory.
55+
56+
Another thing to investigate is
57+
[Sorcery Caching](/Fundamentals/Asterisk-Configuration/Sorcery/Sorcery-Caching/).
58+
If your expiration times are too low, you may be running more transactions to the database
59+
than necessary.
60+
61+
Finally, consider moving the database to another OS instance, even if that other instance is
62+
another container or virtual machine on the same physical host. Remember, database access is
63+
usually limited to call setup but if Asterisk is processing the call media (which it usually is)
64+
it's more important for it to have a higher CPU resource allocation priority. Moving the database
65+
someplace else gives you an opportunity to control that allocation. Keep it local however and ideally
66+
use a dedicated LAN segment with a 10G or greater rate.
67+
68+
### ARI/AMI/AGI Applications
69+
70+
Take a close look at what the applications are doing and try to streamline. If you're using "classic"
71+
AGI (`AGI(/path/to/script)`) switch to using [FastAGI](/Latest_API/API_Documentation/Dialplan_Applications/AGI/)
72+
to avoid the overhead of spawning a new process for every call. As with a database, moving the applications to another OS instance can give you more control over resource allocation.
73+
74+
### Asterisk
75+
76+
#### Transcoding
77+
78+
The number one use of CPU resources by Asterisk is usually handling media.
79+
Actually, simply forwarding media between two channels that use the same codec is pretty cheap.
80+
Once you start transcoding however, the number of CPU cycles needed can increase drastically.
81+
82+
```
83+
*CLI> core show translation
84+
Translation times between formats (in microseconds) for one second of data
85+
Source Format (Rows) Destination Format (Columns)
86+
87+
ulaw alaw gsm slin8 slin16 slin32 slin48 g729 g722 opus
88+
ulaw - 9150 15000 9000 17000 17000 17000 15000 17250 23000
89+
alaw 9150 - 15000 9000 17000 17000 17000 15000 17250 23000
90+
gsm 15000 15000 - 9000 17000 17000 17000 15000 17250 23000
91+
slin8 6000 6000 6000 - 8000 8000 8000 6000 8250 14000
92+
slin16 14500 14500 14500 8500 - 8000 8000 14500 6000 14000
93+
slin32 14500 14500 14500 8500 8500 - 8000 14500 14500 14000
94+
slin48 14500 14500 14500 8500 8500 8500 - 14500 14500 6000
95+
g729 15000 15000 15000 9000 17000 17000 17000 - 17250 23000
96+
g722 15600 15600 15600 9600 9000 17000 17000 15600 - 23000
97+
opus 23500 23500 23500 17500 17500 17500 9000 23500 23500 -
98+
```
99+
100+
The absolute numbers are specific to the machine I ran the command on but look at the relative
101+
differences, especially when opus is used on one channel and the other uses something
102+
else.
103+
104+
Here's a different look at opus and what the path is between it and other codecs:
105+
106+
```
107+
*CLI> core show translation paths opus
108+
--- Translation paths SRC Codec "opus" sample rate 48000 ---
109+
opus:48000 To ulaw:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(ulaw@8000)
110+
opus:48000 To alaw:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(alaw@8000)
111+
opus:48000 To gsm:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(gsm@8000)
112+
opus:48000 To slin:8000 : (opus@48000)->(slin@48000)->(slin@8000)
113+
opus:48000 To slin:16000 : (opus@48000)->(slin@48000)->(slin@16000)
114+
opus:48000 To slin:32000 : (opus@48000)->(slin@48000)->(slin@32000)
115+
opus:48000 To slin:48000 : (opus@48000)->(slin@48000)
116+
opus:48000 To g729:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(g729@8000)
117+
opus:48000 To g722:16000 : (opus@48000)->(slin@48000)->(slin@16000)->(g722@16000)
118+
*CLI>
119+
```
120+
121+
The simple answer is "don't use opus" but of course that's not realistic especially
122+
if you have WebRTC endpoints. The more general answer is "don't use expensive codecs
123+
if you don't need to". A good example of this is when using chan_websocket on one leg
124+
of a call to interact with an AI agent. We've seen some folks using opus on the websocket
125+
leg to the agent even when the incoming callers are all using ulaw or alaw and the agent
126+
platform will happily accept 8K signed-linear.
127+
128+
#### Sounds, Music on Hold, Recording, Voicemail
129+
130+
The term "Transcoding" (and all that it implies) applies whenever you have to
131+
convert between two codecs and that includes playing sounds, playing music-on-hold,
132+
recording calls, saving and playing voicemails, etc.
133+
134+
For sounds and MOH, ensure you have files in the formats you most expect your caller
135+
channels to be using. When you have lots of different codecs in use and you're short
136+
on disk space, having at least slin8 and slin16 versions of the files gives you the
137+
least transcoding cost for everything except for opus.
138+
139+
For recording and voicemail, use the translation cost matrix to determine the most
140+
efficient formats to save files in.
141+
142+
#### Logging
143+
144+
Logging is especially costly because it takes CPU cycles to create and manage the
145+
messages plus I/O operations to write them out.
146+
147+
A performance analysis of Asterisk versions prior to 23.3.0, 22.9.0 and 20.19.0, showed
148+
that almost 40% of the CPU instructions executed by the Asterisk process were attributed
149+
to Channel Event Logging. That was more than the instructions used to actually process
150+
calls. We made significant CEL performance improvements in versions 23.3.0, 22.9.0 and 20.19.0
151+
which brought that percentage to below 10% but if you don't need Channel Event logging
152+
_turn it off_!.
153+
154+
Another culprit to watch out for is VERBOSE logging. When VERBOSE logging is enabled,
155+
a log message is generated for every line traversed in the dialplan. On a busy system
156+
this can result in hundreds of messages per second. Unless you have a good reason for
157+
seeing all those messages, you should limit both console and file logging to NOTICE,
158+
WARNING and ERROR. As for DEBUG logging, well enabling debug messages on production
159+
system is just silly unless you're actively trying to diagnose a specific issue.
160+
161+
## Still can't figure it out?
162+
163+
If you're still using more CPU than you think is necessary, you're going to have to
164+
spend time doing some serious quantitative investigation.
165+
166+
* Create a test environment that mirrors your production environment.
167+
* Use tools like [sipp](https://github.com/sipp/sipp) to simulate load.
168+
* Get a baseline by starting with basic two-party calls (no recording, transcoding,
169+
conferences, etc) and increasing volume until you get to a reasonable maximum
170+
acceptable utilization.
171+
* Now start over with a more "stressfull" of your typical call scenarios and
172+
note the difference.
173+
* Look for ways to simplify those call flows and/or the environment.
174+
175+
You can also ask for help on the [Asterisk Community Forums](https://community.asterisk.org)
176+
but be prepared to provide as much detail as you can about your symptoms,
177+
environment and expectations.
178+
179+
If you want to dive deeper into the internals of Asterisk, take a look at the
180+
[Function Tracing](/Development/Debugging/Function-Tracing)
181+
page in the [Development/Debugging](/Development/Debugging) section.

docs/Development/Debugging/.pages

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ nav:
55
- MALLOC_DEBUG-Compiler-Flag.md
66
- Memory-Leak-Debugging.md
77
- Reference-Count-Debugging.md
8+
- Function-Tracing.md
89
- Using-the-Hoard-Memory-Allocator-with-Asterisk.md
910
- Getting-a-Backtrace-Asterisk-versions-13.14.0-and-14.3.0.md
1011

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
2+
# Function Tracing
3+
4+
Function tracing can sometimes be a useful tool when CPU utilization is an issue.
5+
For that, the `perf` suite of Linux performance tools is your best bet. A full discussion
6+
of using perf is beyond the scope of this document but here are some basic instructions
7+
for using `perf record` and `perf report` which can give you a good idea of CPU instruction
8+
counts by Asterisk function. In most cases, `perf` doesn't require special compilation
9+
settings but you may get more accurate results by enabling the recently added
10+
`KEEP_FRAME_POINTERS` menuselect option under `Compiler Flags`.
11+
12+
```shell
13+
# Start perf-record on a running Asterisk process to capture instruction counts and
14+
# latency by function.
15+
# Let it run for a few minutes then use ctrl-c to stop it.
16+
$ sudo perf record -p `pidof asterisk` -q -e instructions --latency --call-graph fp -o perf.out
17+
18+
# Run perf-report to generate the results report.
19+
$ sudo perf report --force -i perf.out --call-graph=none -c asterisk --percentage relative > report.txt
20+
21+
# Filter out kernel and C library calls.
22+
$ sed -i -r -e "/kernel|libc.so.6/d" report.txt
23+
```
24+
25+
The output file will be a few thousand lines long but here's an example:
26+
27+
```
28+
#
29+
# comm: asterisk
30+
#
31+
# Total Lost Samples: 0
32+
#
33+
# Samples: 33K of event 'instructions'
34+
# Event count (approx.): 6561076235
35+
#
36+
# Children Latency Self Latency Shared Object Symbol
37+
# ........ ....... ........ ........ ...................... .................
38+
#
39+
99.46% 99.56% 0.00% 0.00% asterisk [.] dummy_start
40+
70.01% 68.68% 0.02% 0.01% asterisk [.] default_tps_processing_function
41+
62.30% 62.25% 0.19% 0.15% asterisk [.] ast_taskprocessor_execute
42+
38.04% 42.77% 0.01% 0.01% asterisk [.] execute_tasks
43+
22.81% 21.61% 0.75% 0.71% asterisk [.] internal_ao2_traverse
44+
22.73% 21.57% 0.18% 0.14% asterisk [.] __ao2_callback
45+
21.51% 16.72% 0.04% 0.03% asterisk [.] dispatch_exec_async
46+
19.33% 14.91% 0.11% 0.08% asterisk [.] subscription_invoke
47+
18.73% 14.40% 0.04% 0.03% asterisk [.] router_dispatch
48+
17.85% 21.51% 0.00% 0.00% res_pjsip.so [.] distribute
49+
17.77% 21.43% 0.03% 0.04% libasteriskpj.so.2 [.] pjsip_endpt_process_rx_data
50+
15.40% 14.69% 0.00% 0.00% asterisk [.] __ast_pbx_run
51+
15.40% 14.69% 0.00% 0.00% asterisk [.] pbx_thread
52+
14.82% 14.27% 0.00% 0.00% asterisk [.] pbx_extension_helper
53+
14.77% 14.23% 0.00% 0.00% asterisk [.] ast_spawn_extension
54+
14.71% 14.15% 0.00% 0.00% asterisk [.] pbx_exec
55+
14.61% 14.05% 0.00% 0.00% app_dial.so [.] dial_exec
56+
14.60% 14.04% 0.01% 0.01% app_dial.so [.] dial_exec_full
57+
12.51% 15.40% 0.00% 0.00% res_pjsip_session.so [.] session_on_rx_request
58+
12.51% 15.40% 0.00% 0.00% res_pjsip_session.so [.] handle_new_invite_request
59+
11.82% 9.36% 0.00% 0.00% asterisk [.] cel_report_event
60+
11.35% 13.95% 0.00% 0.00% res_pjsip_session.so [.] new_invite
61+
10.82% 9.97% 4.81% 4.70% asterisk [.] __ao2_ref
62+
10.79% 13.57% 0.00% 0.00% res_pjsip_sdp_rtp.so [.] create_rtp
63+
10.54% 12.81% 0.00% 0.01% asterisk [.] taskpool_sync_task
64+
10.42% 13.15% 0.00% 0.00% asterisk [.] ast_rtp_instance_new
65+
10.39% 13.12% 0.00% 0.00% res_rtp_asterisk.so [.] ast_rtp_new
66+
9.86% 12.54% 0.00% 0.00% res_rtp_asterisk.so [.] rtp_allocate_transport
67+
9.24% 7.31% 0.01% 0.01% res_cdrel_custom.so [.] cdrel_logger
68+
9.18% 7.26% 0.29% 0.22% res_cdrel_custom.so [.] log_advanced_record
69+
8.92% 11.37% 0.01% 0.01% res_rtp_asterisk.so [.] ice_create
70+
8.85% 7.15% 0.16% 0.14% asterisk [.] __ao2_find
71+
8.64% 6.90% 0.00% 0.00% asterisk [.] cel_backend_send_cb
72+
8.63% 6.89% 0.00% 0.00% cel_custom.so [.] custom_log
73+
8.20% 6.73% 0.01% 0.01% asterisk [.] cel_snapshot_update_cb
74+
7.98% 6.54% 0.02% 0.01% asterisk [.] cel_channel_state_change
75+
7.61% 6.34% 0.02% 0.01% asterisk [.] ast_sem_wait
76+
7.36% 6.66% 0.00% 0.00% asterisk [.] ast_bridge_call
77+
7.36% 6.66% 0.00% 0.00% asterisk [.] ast_bridge_call_with_flags
78+
7.06% 6.21% 0.37% 0.34% asterisk [.] __ao2_cleanup_debug
79+
6.88% 8.69% 0.00% 0.00% chan_pjsip.so [.] call
80+
6.77% 8.54% 0.01% 0.01% res_pjsip_session.so [.] handle_incoming_sdp
81+
6.71% 8.46% 0.00% 0.00% res_pjsip_sdp_rtp.so [.] negotiate_incoming_sdp_stream
82+
6.62% 6.06% 0.07% 0.05% asterisk [.] pbx_builtin_setvar_helper
83+
6.49% 4.81% 0.00% 0.00% asterisk [.] do_devstate_changes
84+
6.22% 4.59% 0.00% 0.00% asterisk [.] do_state_change
85+
6.20% 7.94% 0.01% 0.02% res_rtp_asterisk.so [.] rtp_add_candidates_to_ice
86+
6.13% 4.52% 0.01% 0.01% asterisk [.] _ast_device_state
87+
6.06% 4.46% 0.06% 0.04% chan_pjsip.so [.] chan_pjsip_devicestate
88+
5.94% 4.94% 0.00% 0.00% chan_pjsip.so [.] hangup
89+
90+
```
91+
92+
What you're seeing is a list of asterisk functions and what percentage of the total
93+
CPU instructions and wall clock time they're responsible for. The percentages won't
94+
add up to 100 because of the recursive nature of function calls. The top dozen or
95+
so are wrapper functions which basically run everything but things get interesting
96+
after that.
97+
98+
If you want to see the results without recussion, add the `--no-children` option to
99+
`perf report`:
100+
101+
```shell
102+
$ sudo perf report -i perf.out --call-graph=none -c asterisk --percentage relative --no-children > report-no-children.txt
103+
```
104+
105+
```
106+
# To display the perf.data header info, please use --header/--header-only options.
107+
#
108+
# comm: asterisk
109+
#
110+
# Total Lost Samples: 0
111+
#
112+
# Samples: 33K of event 'instructions'
113+
# Event count (approx.): 6561076235
114+
#
115+
# Overhead Latency Shared Object Symbol
116+
# ........ ........ ..................................... ..........................................
117+
#
118+
4.81% 4.70% asterisk [.] __ao2_ref
119+
2.57% 4.30% asterisk [.] hash_ao2_find_next
120+
1.96% 1.65% libc.so.6 [.] __printf_buffer
121+
1.77% 1.48% libc.so.6 [.] _int_free_chunk
122+
1.63% 1.62% libc.so.6 [.] _int_malloc
123+
1.54% 1.40% libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
124+
1.54% 1.49% libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
125+
1.53% 1.59% libc.so.6 [.] __strcmp_avx2_rtm
126+
1.32% 1.56% [kernel.kallsyms] [k] perf_iterate_ctx
127+
1.30% 1.14% asterisk [.] __ao2_lock
128+
1.21% 1.48% libc.so.6 [.] __random_r
129+
1.13% 1.45% [kernel.kallsyms] [k] __snmp6_fill_stats64.constprop.0
130+
1.12% 1.09% asterisk [.] hash_ao2_find_first
131+
1.05% 0.98% libc.so.6 [.] __memmove_avx_unaligned_erms_rtm
132+
0.91% 0.80% libc.so.6 [.] tolower
133+
0.91% 0.89% libc.so.6 [.] __libc_calloc
134+
0.91% 0.97% [kernel.kallsyms] [k] memset_orig
135+
0.88% 0.74% libc.so.6 [.] __strlen_avx2_rtm
136+
0.85% 0.72% libc.so.6 [.] __printf_buffer_write
137+
0.83% 0.74% asterisk [.] __ao2_unlock
138+
0.79% 0.63% asterisk [.] ast_str_case_hash
139+
0.79% 0.68% libc.so.6 [.] cfree@GLIBC_2.2.5
140+
0.75% 0.71% asterisk [.] internal_ao2_traverse
141+
0.70% 0.54% asterisk [.] ast_event_iterator_next
142+
0.70% 0.76% [kernel.kallsyms] [k] psi_group_change
143+
0.68% 0.59% asterisk [.] utf8_check_first
144+
0.60% 0.72% libasteriskpj.so.2 [.] pj_pool_alloc_from_block
145+
146+
```
147+
148+
Now the report shows only the instructions directly executed by the function.

0 commit comments

Comments
 (0)