By leveraging these tools, system integrators can efficiently enhance the performance and reliability of their real-time applications.
<span style="color:red"> Note: Please keep in mind that performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.</span>
### 2. Intel® Cache Allocation Technology
In the second step, assume a system architecture where workloads with varying Quality of Service (QoS) requirements are consolidated on a single processor. This includes best-effort workloads, such as User Interface (UI) or Artificial Intelligence (AI) tasks, as well as real-time workloads, like the real-time test application. In such designs, shared resources like cache can become potential sources of contention.
Let's take a look at how Intel® Cache Allocation Technology (CAT) can help mitigate these sources of contention. CAT provides the ability to partition caches at various levels in the caching hierarchy. For example, consider the cache architecture of the Intel® Core™ i5-1350PE processor illustrated below. Initially, the default cache configuration is used, where all cache ways are shared. In the second step, the statistics are compared with the Last Level Cache (LLC) partitioned to provide an exclusive portion of the cache to the real-time test application. In both scenarios, a memory-centric workload on the best-effort cores is simulated using stress-ng.
<p align="center">
<img src="images/setup_CAT.png" alt="Cache Partitioning - System Setup" style="width: 70%;">
</p>
#### Execution and Analysis
1. Start the real-time application if it is not already running, and output the statistics to the Grafana dashboard. While monitoring the statistics, start the AI object classification demo from the OpenVINO model zoo in a second terminal.
You can use the following command as an alternative to the AI object classification demo:

```sh
stress-ng --vm 8 --vm-bytes 128M --fork 4
```

Options:
- `--vm 8`: start 8 virtual memory stressor instances; each instance allocates and stress-tests memory.
- `--vm-bytes 128M`: the amount of memory each virtual memory stressor instance allocates; here, each of the 8 instances allocates 128 megabytes.
- `--fork 4`: fork 4 child processes; each child process executes the stress test independently.

2. Partition the Last Level Cache (LLC) and assign an exclusive portion of the cache to the real-time test application, as demonstrated for the Intel® Core™ i5-1350PE above. Here is how the LLC can be partitioned using the Linux `msr-tools`:
```sh
# define LLC core masks
wrmsr 0xc90 0x30   # best-effort mask
wrmsr 0xc91 0xFC   # real-time mask

# define LLC GT mask
wrmsr 0x18b0 0x80  # iGPU mask

# assign the masks to the cores
# this has to match the core selected for the rt app
wrmsr -a 0xc8f 0x0 # assign all cores to CLOS0

# There is also the pqos Linux command-line utility, part of the
# intel-cmt-cat package, which can be used instead.
```
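To sanity-check capacity bitmasks like those above before writing them, you can decode them offline. The small helper below is hypothetical (not part of the tutorial's scripts); it lists which cache ways a mask selects and verifies the contiguity that CAT requires of capacity masks:

```python
def ways(mask: int) -> list[int]:
    """Return the LLC way indices selected by a CAT capacity bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

def is_contiguous(mask: int) -> bool:
    """CAT capacity masks must be a single contiguous run of set bits."""
    if mask == 0:
        return False
    m = mask // (mask & -mask)   # strip trailing zero bits
    return (m & (m + 1)) == 0

# Masks from the wrmsr example above
print(ways(0x30))           # best-effort mask -> [4, 5]
print(ways(0xFC))           # real-time mask   -> [2, 3, 4, 5, 6, 7]
print(is_contiguous(0xFC))  # True
```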
Alternatively, you can use the script with the `rt_optimized` option to partition the cache as demonstrated above, or with the `default` option for flat partitioning.

```sh
sudo ./setCacheAllocation.sh rt_optimized
```

<span style="color:red"> Note: The script and the masks defined above are examples tailored for the cache topology of the Intel® Core™ i5-1350PE processor and the specific use case. You may need to adapt them to match the cache topology of your processor and use case.</span>

Examining the performance metrics of the real-time application shows that, in the initial phase, the cores running the best-effort workloads are mostly idling. During this phase, the execution time is very consistent and within a reasonable range. However, when stress-ng starts, both the execution time and the number of LLC misses increase significantly as you can see in step 1b. This occurs because the data of the real-time application is frequently evicted from the LLC and must be fetched from the main memory.
As demonstrated in the second step of the measurement, applying cache partitioning helps reduce the LLC misses for the real-time application, bringing the execution time back to a reasonable range.
As demonstrated, partitioning the cache using Intel® Cache Allocation Technology (CAT) is a straightforward way to improve temporal isolation between real-time and best-effort workloads.
This is just an example, and the configuration needs to be adjusted to your specific use case and processor. You can determine the cache topology, including the size and number of ways supported by a particular processor, using the CPUID leaf "Deterministic Cache Parameters Leaf - 0x4". Additionally, Linux utilities like `lstopo` are very useful for getting an overview of the cache topology of a processor. Here are some references if you need more information about CAT ...
- Public Intel® Time Coordinated Computing (TCC) User Guide - RDC #[831067](https://cdrdv2.intel.com/v1/dl/getContent/831067)
### 3. Intel® Speed Shift technology for Edge Computing
In the third and final step of this tutorial, let's examine another aspect of power management: Performance states, or P-States. P-States enable the scaling of the processor's frequency and voltage to reduce CPU power consumption. They are part of Dynamic Voltage and Frequency Scaling (DVFS) features such as Intel® Speed Step, Speed Shift, and Turbo Boost Technology. Speed Step and Speed Shift adjust the processor's voltage and frequency within these P-States to balance power efficiency and performance, while Turbo Boost allows the processor to temporarily exceed the highest P-State to provide additional performance during demanding tasks.
Until the 11th generation of Intel Core processors, it was recommended for hard real-time use cases to disable all DVFS features in the BIOS, which would lock the frequency of all cores permanently to the base frequency. Starting with the 11th generation of Intel Core processors, P-State transitions were optimized. As a result, from the 11th generation onward, Intel® Speed Step, Speed Shift, and Turbo Boost Technology are no longer disabled if you enable <span style="font-family: 'Courier New';">Intel® TCC Mode</span> in the BIOS. You still have the option to lock core frequency during runtime using the HWP MSRs or the intel_pstate driver under Linux.
With this knowledge, let's revisit the performance metrics. First, lock the core frequency of all cores to the base frequency. In the second step, boost the frequency of the real-time core to a value within the turbo frequency range to leverage higher single-threaded performance. Here let's follow the recommendations for the enveloping frequency configurations which are listed in the [TCC User Guide](https://cdrdv2.intel.com/v1/dl/getContent/831067) for the specific processor SKU.
More information about HWP and the MSRs can be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3, section "Power and Thermal Management - Hardware-Controlled Performance States", RDC #[671200](https://cdrdv2.intel.com/v1/dl/getContent/671200).
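As a quick illustration of the IA32_HWP_REQUEST register the manual describes, the sketch below decodes its main fields. The field offsets follow SDM Vol. 3; the decoder itself is a hypothetical helper, not part of this tutorial's scripts:

```python
def decode_hwp_request(val: int) -> dict:
    """Split an IA32_HWP_REQUEST (MSR 0x774) value into its main fields.

    Field layout per SDM Vol. 3: bits 7:0 minimum performance,
    15:8 maximum performance, 23:16 desired performance,
    31:24 energy performance preference (EPP).
    """
    return {
        "min_perf": val & 0xFF,
        "max_perf": (val >> 8) & 0xFF,
        "desired": (val >> 16) & 0xFF,
        "epp": (val >> 24) & 0xFF,
    }

# Example value: min=0x08, max=0x20, desired=0x00, EPP=0x80
print(decode_hwp_request(0x80002008))
```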
<p align="center">
<img src="images/setup_CAT_isol_boost.png" alt="Speed Shift - System Setup" style="width: 70%;">
</p>
#### Execution and Analysis
1. Start the real-time application if it is not already running, and output the statistics to the Grafana dashboard. While monitoring the statistics, start the AI object classification demo from the OpenVINO model zoo in a second terminal.
You can use the following command as an alternative to the AI object classification demo:

```sh
stress-ng --vm 8 --vm-bytes 128M --fork 4
```

Options:
- `--vm 8`: start 8 virtual memory stressor instances; each instance allocates and stress-tests memory.
- `--vm-bytes 128M`: the amount of memory each virtual memory stressor instance allocates; here, each of the 8 instances allocates 128 megabytes.
- `--fork 4`: fork 4 child processes; each child process executes the stress test independently.

Ensure that the CAT partitioning is still applied, or run the following script again:

```sh
sudo ./setCacheAllocation.sh rt_optimized
```

2. Lock the core frequency of all cores to the base frequency and tune the Energy Performance Preference (EPP) towards `performance`; this is equivalent to disabling the DVFS features in the BIOS. Here is an example of how to use the sysfs entries of the intel_pstate driver:
```sh
# Loop through each CPU core and set the min and max frequencies to the base frequency
# (cpu[0-9]* matches all cores; a [0-11] character class would only match digits 0 and 1)
for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
    BASE_FREQUENCY=$(cat $CPU/cpufreq/base_frequency)
    echo $BASE_FREQUENCY | sudo tee $CPU/cpufreq/scaling_min_freq
    echo $BASE_FREQUENCY | sudo tee $CPU/cpufreq/scaling_max_freq
    echo performance | sudo tee $CPU/cpufreq/energy_performance_preference
done
```
Alternatively, you can use the script with the `basefrequency` parameter ...
```sh
sudo ./setCoreFrequency.sh basefrequency
```
3. Boost the frequency of the core running the real-time application as described in the enveloping configuration in the TCC User Guide. In this configuration, the maximum allowed frequency of all best-effort cores is limited to the base frequency, and the Energy Performance Preference (EPP) is set to `power`. This setup allows the best-effort cores to scale their frequency between the minimum and base frequency depending on core utilization. For the real-time core, the frequency is boosted to 3.1 GHz, and the EPP is set to `performance` to ensure Quality of Service (QoS) in case of power-limit throttling.
Use the script with `rt-boost`, followed by the identifier of the real-time core and the desired core frequency ...

<span style="color:red"> Note: The script and the specified frequencies are examples tailored for this tutorial and the Intel® Core™ i5-1350PE processor. You may need to adapt them to match your processor and use case.</span>
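If you script the boost step yourself via sysfs, remember that the cpufreq files (`scaling_min_freq`, `scaling_max_freq`) take integer values in kHz, so 3.1 GHz is written as 3100000. A tiny conversion helper (hypothetical, not part of the tutorial's scripts) avoids unit mistakes:

```python
def ghz_to_sysfs_khz(ghz: float) -> int:
    """Convert a frequency in GHz to the kHz integer expected by cpufreq sysfs files."""
    return round(ghz * 1_000_000)

# 3.1 GHz boost target for the real-time core, as used in this tutorial
print(ghz_to_sysfs_khz(3.1))   # 3100000
```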
For more information on directly accessing the HWP MSRs instead of using the sysfs entries of the intel_pstate driver, please refer to the [TCC User Guide](https://cdrdv2.intel.com/v1/dl/getContent/831067) and the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3, section "Power and Thermal Management - Hardware-Controlled Performance States", RDC #[671200](https://cdrdv2.intel.com/v1/dl/getContent/671200).

Examining the performance metrics, you can observe the following:
- In the first step of the graph, where HWP was still able to scale the P-State, the execution time jitter band remains noisy.
- Step 2 shows that locking the frequency of all cores to the base frequency reduces the jitter band but also significantly increases the execution time.
- Finally, in step 3, locking the core frequency of the core running the real-time application to a turbo frequency and limiting the maximum allowed frequency of the best-effort (BE) cores to the base frequency, following the guidance of the enveloping configurations listed in the TCC User Guide, results in a narrower execution time jitter band and a significantly lower execution time.
As you can see, locking the core frequency of the core running the real-time application helps reduce execution time jitter. Additionally, boosting the core frequency of the real-time core can be highly beneficial for use cases that require higher single-threaded performance.
### Conclusion
In this tutorial, three easy-to-use features of the Intel TCC toolbox, along with some kernel command line parameters, were introduced to optimize real-time performance. These features and techniques help system integrators quickly tune the system for specific real-time use cases:
- TCC Mode: Optimizes firmware for low latency with a single configuration knob.
- Cache Allocation Technology (CAT): Enables quick partitioning of the cache to improve temporal isolation.
- Speed Shift for Edge Computing: Can be used to boost single-threaded performance.
By leveraging these tools, system integrators can efficiently enhance the performance and reliability of their real-time applications.
<span style="color:red"> Note: Please keep in mind that performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.</span>