|
| 1 | +# Troubleshooting High CPU Utilization |
| 2 | + |
| 3 | +/// note |
| 4 | +We'll cover high memory utilization in a separate document. |
| 5 | +/// |
| 6 | + |
| 7 | +## What's "High"??? |
| 8 | + |
| 9 | +It sounds silly but if you can't define what "high" means to you then |
| 10 | +how will you know when you've "fixed" it? |
| 11 | + |
| 12 | +High CPU utilization isn't usually an issue by itself. The real issue would be |
| 13 | +the side effects like poor call quality or long call setup times, etc. |
| 14 | +Understanding the issue you're trying to solve and what your goals are should |
| 15 | +be the first step in any investigation. |
| 16 | + |
| 17 | +## Understand the environment |
| 18 | + |
| 19 | +The next step in the investigation should be understanding your environment thoroughly. |
| 20 | +These are just a few things to keep in your mind: |
| 21 | + |
| 22 | +* Is Asterisk running in a virtual machine or a container? |
| 23 | + * What else is competing for resources? |
| 24 | + * Is the filesystem directly attached or remote via a SAN, etc.? |
| 25 | +* What else is sharing the same CPU? |
| 26 | + * Databases |
| 27 | + * ARI/AMI/AGI applications |
| 28 | + * Unrelated applications or servers |
| 29 | +* Are you using a realtime database? |
| 30 | + * Where is it hosted? |
| 31 | + * What else is using the same database instance? |
| 32 | +* What's the network configuration? |
| 33 | + * Network interfaces. |
| 34 | + * Firewalls. |
| 35 | + * DNS infrastructure. |
| 36 | + * Proxies and session border controllers. |
| 37 | + |
| 38 | +There's more but you get the idea. |
| 39 | + |
| 40 | +## Identify the actors |
| 41 | + |
| 42 | +Identifying _why_ you're seeing high CPU utilization can be complicated but let's start |
| 43 | +with the _what_ first because it might not even be Asterisk itself. The tool of choice |
| 44 | +here is the venerable `top` or its relatives `htop` and `btop`. |
| 45 | + |
| 46 | +### Database |
| 47 | + |
| 48 | +If the database has nothing to do with Asterisk itself then it probably shouldn't be running |
| 49 | +in the same OS instance. If the database _is_ hosting your Asterisk configuration, check where |
| 50 | +you're storing your incoming PJSIP registrations (non-permanent contacts). The default is to |
| 51 | +store them in the "astdb" (SQLite3 database at `/var/lib/asterisk/astdb.sqlite3`) but some people |
| 52 | +use sorcery.conf and extconfig.conf to move them to an external database believing it's "better". |
| 53 | +Unless you need to expose the active registrations to another entity like Kamailio, leaving them |
| 54 | +in the astdb is better for performance because the database is all in-memory. |
| 55 | + |
| 56 | +Another thing to investigate is |
| 57 | +[Sorcery Caching](/Fundamentals/Asterisk-Configuration/Sorcery/Sorcery-Caching/). |
| 58 | +If your expiration times are too low, you may be running more transactions to the database |
| 59 | +than necessary. |
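| | + |
| | +As a rough sketch of what that tuning can look like, the `sorcery.conf` fragment below puts a |
| | +memory cache in front of a realtime backend for PJSIP endpoints. The realtime family name |
| | +`ps_endpoints` and the lifetime values are assumptions chosen for illustration; pick values |
| | +that match how often your configuration actually changes. |
| | + |
| | +``` |
| | +[res_pjsip] |
| | +; Hypothetical values: cached endpoints live for at most 30 minutes and are |
| | +; refreshed in the background once they are more than 10 minutes old. |
| | +endpoint/cache = memory_cache,object_lifetime_maximum=1800,object_lifetime_stale=600 |
| | +endpoint = realtime,ps_endpoints |
| | +``` |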
| 60 | + |
| 61 | +Finally, consider moving the database to another OS instance, even if that other instance is |
| 62 | +another container or virtual machine on the same physical host. Remember, database access is |
| 63 | +usually limited to call setup but if Asterisk is processing the call media (which it usually is) |
| 64 | +it's more important for it to have a higher CPU resource allocation priority. Moving the database |
| 65 | +someplace else gives you an opportunity to control that allocation. Keep it local, however, and ideally |
| 66 | +use a dedicated LAN segment with a 10G or greater rate. |
| 67 | + |
| 68 | +### ARI/AMI/AGI Applications |
| 69 | + |
| 70 | +Take a close look at what the applications are doing and try to streamline. If you're using "classic" |
| 71 | +AGI (`AGI(/path/to/script)`) switch to using [FastAGI](/Latest_API/API_Documentation/Dialplan_Applications/AGI/) |
| 72 | +to avoid the overhead of spawning a new process for every call. As with a database, moving the applications to another OS instance can give you more control over resource allocation. |
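| | + |
| | +A minimal dialplan sketch of the difference, assuming a FastAGI server is already listening on |
| | +the default port 4573. The context, extension, host and script name are hypothetical: |
| | + |
| | +``` |
| | +[from-internal] |
| | +; "Classic" AGI: Asterisk spawns a new process for every call. |
| | +;exten => 100,1,AGI(/path/to/script) |
| | +; FastAGI: Asterisk connects to an already-running server over TCP. |
| | +exten => 100,1,AGI(agi://127.0.0.1:4573/script) |
| | +``` |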
| 73 | + |
| 74 | +### Asterisk |
| 75 | + |
| 76 | +#### Transcoding |
| 77 | + |
| 78 | +The number one use of CPU resources by Asterisk is usually handling media. |
| 79 | +Actually, simply forwarding media between two channels that use the same codec is pretty cheap. |
| 80 | +Once you start transcoding however, the number of CPU cycles needed can increase drastically. |
| 81 | + |
| 82 | +``` |
| 83 | +*CLI> core show translation |
| 84 | + Translation times between formats (in microseconds) for one second of data |
| 85 | + Source Format (Rows) Destination Format (Columns) |
| 86 | + |
| 87 | + ulaw alaw gsm slin8 slin16 slin32 slin48 g729 g722 opus |
| 88 | + ulaw - 9150 15000 9000 17000 17000 17000 15000 17250 23000 |
| 89 | + alaw 9150 - 15000 9000 17000 17000 17000 15000 17250 23000 |
| 90 | + gsm 15000 15000 - 9000 17000 17000 17000 15000 17250 23000 |
| 91 | + slin8 6000 6000 6000 - 8000 8000 8000 6000 8250 14000 |
| 92 | + slin16 14500 14500 14500 8500 - 8000 8000 14500 6000 14000 |
| 93 | + slin32 14500 14500 14500 8500 8500 - 8000 14500 14500 14000 |
| 94 | + slin48 14500 14500 14500 8500 8500 8500 - 14500 14500 6000 |
| 95 | + g729 15000 15000 15000 9000 17000 17000 17000 - 17250 23000 |
| 96 | + g722 15600 15600 15600 9600 9000 17000 17000 15600 - 23000 |
| 97 | + opus 23500 23500 23500 17500 17500 17500 9000 23500 23500 - |
| 98 | +``` |
| 99 | + |
| 100 | +The absolute numbers are specific to the machine I ran the command on but look at the relative |
| 101 | +differences, especially when opus is used on one channel and the other uses something |
| 102 | +else. |
| 103 | + |
| 104 | +Here's a different look at opus and what the path is between it and other codecs: |
| 105 | + |
| 106 | +``` |
| 107 | +*CLI> core show translation paths opus |
| 108 | +--- Translation paths SRC Codec "opus" sample rate 48000 --- |
| 109 | + opus:48000 To ulaw:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(ulaw@8000) |
| 110 | + opus:48000 To alaw:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(alaw@8000) |
| 111 | + opus:48000 To gsm:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(gsm@8000) |
| 112 | + opus:48000 To slin:8000 : (opus@48000)->(slin@48000)->(slin@8000) |
| 113 | + opus:48000 To slin:16000 : (opus@48000)->(slin@48000)->(slin@16000) |
| 114 | + opus:48000 To slin:32000 : (opus@48000)->(slin@48000)->(slin@32000) |
| 115 | + opus:48000 To slin:48000 : (opus@48000)->(slin@48000) |
| 116 | + opus:48000 To g729:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(g729@8000) |
| 117 | + opus:48000 To g722:16000 : (opus@48000)->(slin@48000)->(slin@16000)->(g722@16000) |
| 118 | +*CLI> |
| 119 | +``` |
| 120 | + |
| 121 | +The simple answer is "don't use opus" but of course that's not realistic especially |
| 122 | +if you have WebRTC endpoints. The more general answer is "don't use expensive codecs |
| 123 | +if you don't need to". A good example of this is when using chan_websocket on one leg |
| 124 | +of a call to interact with an AI agent. We've seen some folks using opus on the websocket |
| 125 | +leg to the agent even when the incoming callers are all using ulaw or alaw and the agent |
| 126 | +platform will happily accept 8K signed-linear. |
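| | + |
| | +One way to keep expensive codecs off legs that don't need them is to restrict what an endpoint |
| | +will negotiate. The `pjsip.conf` fragment below is only a sketch: the endpoint name is |
| | +hypothetical and the rest of the endpoint's configuration is omitted. |
| | + |
| | +``` |
| | +[carrier-trunk]        ; hypothetical endpoint name |
| | +type = endpoint |
| | +; Only offer and accept the codecs the callers on this leg actually use. |
| | +disallow = all |
| | +allow = ulaw,alaw |
| | +``` |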
| 127 | + |
| 128 | +#### Sounds, Music on Hold, Recording, Voicemail |
| 129 | + |
| 130 | +The term "Transcoding" (and all that it implies) applies whenever you have to |
| 131 | +convert between two codecs and that includes playing sounds, playing music-on-hold, |
| 132 | +recording calls, saving and playing voicemails, etc. |
| 133 | + |
| 134 | +For sounds and MOH, ensure you have files in the formats you most expect your caller |
| 135 | +channels to be using. When you have lots of different codecs in use and you're short |
| 136 | +on disk space, having at least slin8 and slin16 versions of the files gives you the |
| 137 | +least transcoding cost for everything except for opus. |
| 138 | + |
| 139 | +For recording and voicemail, use the translation cost matrix to determine the most |
| 140 | +efficient formats to save files in. |
| 141 | + |
| 142 | +#### Logging |
| 143 | + |
| 144 | +Logging is especially costly because it takes CPU cycles to create and manage the |
| 145 | +messages plus I/O operations to write them out. |
| 146 | + |
| 147 | +A performance analysis of Asterisk versions prior to 23.3.0, 22.9.0 and 20.19.0, showed |
| 148 | +that almost 40% of the CPU instructions executed by the Asterisk process were attributed |
| 149 | +to Channel Event Logging. That was more than the instructions used to actually process |
| 150 | +calls. We made significant CEL performance improvements in versions 23.3.0, 22.9.0 and 20.19.0 |
| 151 | +which brought that percentage to below 10% but if you don't need Channel Event logging |
| 152 | +_turn it off_! |
| 153 | + |
| 154 | +Another culprit to watch out for is VERBOSE logging. When VERBOSE logging is enabled, |
| 155 | +a log message is generated for every line traversed in the dialplan. On a busy system |
| 156 | +this can result in hundreds of messages per second. Unless you have a good reason for |
| 157 | +seeing all those messages, you should limit both console and file logging to NOTICE, |
| 158 | +WARNING and ERROR. As for DEBUG logging, well, enabling debug messages on a production |
| 159 | +system is just silly unless you're actively trying to diagnose a specific issue. |
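| | + |
| | +As a sketch of those two recommendations, the fragments below limit console and file logging |
| | +to NOTICE, WARNING and ERROR in `logger.conf` and disable Channel Event Logging in `cel.conf`. |
| | +The log file name `messages` is an assumption; keep whatever file names your installation already uses. |
| | + |
| | +``` |
| | +; logger.conf |
| | +[logfiles] |
| | +console => notice,warning,error |
| | +messages => notice,warning,error |
| | + |
| | +; cel.conf |
| | +[general] |
| | +enable = no |
| | +``` |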
| 160 | + |
| 161 | +## Still can't figure it out? |
| 162 | + |
| 163 | +If you're still using more CPU than you think is necessary, you're going to have to |
| 164 | +spend time doing some serious quantitative investigation. |
| 165 | + |
| 166 | +* Create a test environment that mirrors your production environment. |
| 167 | +* Use tools like [sipp](https://github.com/sipp/sipp) to simulate load. |
| 168 | +* Get a baseline by starting with basic two-party calls (no recording, transcoding, |
| 169 | +conferences, etc.) and increasing volume until you get to a reasonable maximum |
| 170 | +acceptable utilization. |
| 171 | +* Now start over with the more "stressful" of your typical call scenarios and |
| 172 | +note the difference. |
| 173 | +* Look for ways to simplify those call flows and/or the environment. |
| 174 | + |
| 175 | +You can also ask for help on the [Asterisk Community Forums](https://community.asterisk.org) |
| 176 | +but be prepared to provide as much detail as you can about your symptoms, |
| 177 | +environment and expectations. |
| 178 | + |
| 179 | +If you want to dive deeper into the internals of Asterisk, take a look at the |
| 180 | +[Function Tracing](/Development/Debugging/Function-Tracing) |
| 181 | +page in the [Development/Debugging](/Development/Debugging) section. |