[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] [Intel] Intel-SIG: backport cluster scheduler wakeup optimization#827
Conversation
ANBZ: #8001 commit b95303e upstream. Add cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with cluster scheduler level. On a machine with clusters cpus_share_resources() will test whether two cpus are within the same cluster. On a non-cluster machine it will behaves the same as cpus_share_cache(). So we use "resources" here for cache resources. Intel-SIG: commit b95303e sched: Add cpus_share_resources API. Cluster based task wakeup optimization backport. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com [ Aubrey Li: amend commit log ] Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
ANBZ: #8001 commit 8881e16 upstream. For platforms having clusters like Kunpeng920, CPUs within the same cluster have lower latency when synchronizing and accessing shared resources like cache. Thus, this patch tries to find an idle cpu within the cluster of the target CPU before scanning the whole LLC to gain lower latency. This will be implemented in 2 steps in select_idle_sibling(): 1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them if they're sharing cluster with the target CPU. Otherwise trying to scan for an idle CPU in the target's cluster. 2. Scanning the cluster prior to the LLC of the target CPU for an idle CPU to wakeup. Testing has been done on Kunpeng920 by pinning tasks to one numa and two numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs. With this patch, We noticed enhancement on tbench and netperf within one numa or cross two numa on top of tip-sched-core commit 9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()") tbench results (node 0): baseline patched 1: 327.2833 372.4623 ( 13.80%) 4: 1320.5933 1479.8833 ( 12.06%) 8: 2638.4867 2921.5267 ( 10.73%) 16: 5282.7133 5891.5633 ( 11.53%) 32: 9810.6733 9877.3400 ( 0.68%) 64: 7408.9367 7447.9900 ( 0.53%) 128: 6203.2600 6191.6500 ( -0.19%) tbench results (node 0-1): baseline patched 1: 332.0433 372.7223 ( 12.25%) 4: 1325.4667 1477.6733 ( 11.48%) 8: 2622.9433 2897.9967 ( 10.49%) 16: 5218.6100 5878.2967 ( 12.64%) 32: 10211.7000 11494.4000 ( 12.56%) 64: 13313.7333 16740.0333 ( 25.74%) 128: 13959.1000 14533.9000 ( 4.12%) netperf results TCP_RR (node 0): baseline patched 1: 76546.5033 90649.9867 ( 18.42%) 4: 77292.4450 90932.7175 ( 17.65%) 8: 77367.7254 90882.3467 ( 17.47%) 16: 78519.9048 90938.8344 ( 15.82%) 32: 72169.5035 72851.6730 ( 0.95%) 64: 25911.2457 25882.2315 ( -0.11%) 128: 10752.6572 10768.6038 ( 0.15%) netperf results TCP_RR (node 0-1): baseline patched 1: 76857.6667 90892.2767 ( 18.26%) 4: 78236.6475 90767.3017 ( 16.02%) 8: 77929.6096 90684.1633 ( 16.37%) 16: 77438.5873 90502.5787 ( 16.87%) 32: 74205.6635 88301.5612 ( 19.00%) 64: 69827.8535 71787.6706 ( 2.81%) 128: 25281.4366 25771.3023 ( 1.94%) netperf results UDP_RR (node 0): baseline patched 1: 96869.8400 110800.8467 ( 14.38%) 4: 97744.9750 109680.5425 ( 12.21%) 8: 98783.9863 110409.9637 ( 11.77%) 16: 99575.0235 110636.2435 ( 11.11%) 32: 95044.7250 97622.8887 ( 2.71%) 64: 32925.2146 32644.4991 ( -0.85%) 128: 12859.2343 12824.0051 ( -0.27%) netperf results UDP_RR (node 0-1): baseline patched 1: 97202.4733 110190.1200 ( 13.36%) 4: 95954.0558 106245.7258 ( 10.73%) 8: 96277.1958 105206.5304 ( 9.27%) 16: 97692.7810 107927.2125 ( 10.48%) 32: 79999.6702 103550.2999 ( 29.44%) 64: 80592.7413 87284.0856 ( 8.30%) 128: 27701.5770 29914.5820 ( 7.99%) Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch in the code has not been tested but it supposed to work. Chen Yu also noticed this will improve the performance of tbench and netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one cluster sharing L2 Cache. Intel-SIG: commit 8881e16 sched/fair: Scan cluster before scanning LLC in wake-up path. Cluster based task wakeup optimization backport. [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com [ Aubrey Li: amend commit log ] Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
… cluster wakeup ANBZ: #8001 commit 22165f6 upstream. Chen Yu reports a hackbench regression of cluster wakeup when hackbench threads equal to the CPU number [1]. Analysis shows it's because we wake up more on the target CPU even if the prev_cpu is a good wakeup candidate and leads to the decrease of the CPU utilization. Generally if the task's prev_cpu is idle we'll wake up the task on it without scanning. On cluster machines we'll try to wake up the task in the same cluster of the target for better cache affinity, so if the prev_cpu is idle but not sharing the same cluster with the target we'll still try to find an idle CPU within the cluster. This will improve the performance at low loads on cluster machines. But in the issue above, if the prev_cpu is idle but not in the cluster with the target CPU, we'll try to scan an idle one in the cluster. But since the system is busy, we're likely to fail the scanning and use target instead, even if the prev_cpu is idle. Then leads to the regression. This patch solves this in 2 steps: o record the prev_cpu/recent_used_cpu if they're good wakeup candidates but not sharing the cluster with the target. o on scanning failure use the prev_cpu/recent_used_cpu if they're recorded as idle [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/ Intel-SIG: commit 22165f6 Use candidate prev/recent_used CPU if scanning failed for cluster wakeup. Cluster based task wakeup optimization backport Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/ Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com [ Aubrey Li: amend commit log ] Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Reviewer's GuideThis PR backports cluster scheduler wakeup optimizations by introducing a new SD_CLUSTER flag with a static key, a cpus_share_resources API, and enhancing the fair scheduler’s wakeup paths (select_idle_cpu and select_idle_sibling) to prioritize CPUs within the same cluster before falling back to a full LLC scan, thus improving task locality and reducing cross-cluster cache contention. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
deepin pr auto review代码审查意见:
总体来说,代码的修改应该更加细致地考虑逻辑和边界情况,以确保代码的正确性和健壮性。 |
There was a problem hiding this comment.
Pull Request Overview
This PR backports wake‐up path optimizations for cluster schedulers on Intel platforms to improve inter-CPU communication by preferentially waking up CPUs within the same cluster.
- Introduces a new per-CPU variable (sd_share_id) and static key (sched_cluster_active) to manage cluster-related scheduling data.
- Adjusts idle CPU selection in the fair scheduler to prefer CPUs within the same cluster and updates related header declarations and flag definitions.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| kernel/sched/topology.c | Adds per-CPU sd_share_id and initializes it based on cluster or LLC id. |
| kernel/sched/sched.h | Declares the new per-CPU variable and extern for sched_cluster_active. |
| kernel/sched/fair.c | Updates idle CPU selection to account for cluster boundaries. |
| kernel/sched/core.c | Implements cpus_share_resources using the new sd_share_id. |
| include/linux/sched/topology.h | Provides an inline cpus_share_resources that currently returns true. |
| include/linux/sched/sd_flags.h | Introduces the SD_CLUSTER flag for indicating cluster sharing. |
Comments suppressed due to low confidence (2)
include/linux/sched/topology.h:236
- The inline implementation of cpus_share_resources always returning true may conflict with the proper implementation in kernel/sched/core.c. Consider removing or updating this inline definition to ensure consistent behavior.
static inline bool cpus_share_resources(int this_cpu, int that_cpu) {
kernel/sched/fair.c:7573
- [nitpick] The use of -1 as an initial value for prev_aff may be less clear; consider using a named constant (e.g., INVALID_CPU) to improve code readability.
int i, recent_used_cpu, prev_aff = -1;
There was a problem hiding this comment.
Hey @Avenger-285714 - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| struct sched_domain *sd; | ||
| unsigned long task_util, util_min, util_max; | ||
| int i, recent_used_cpu; | ||
| int i, recent_used_cpu, prev_aff = -1; |
There was a problem hiding this comment.
suggestion: Initialize recent_used_cpu to a safe default
Explicitly initialize recent_used_cpu to -1 at declaration to clarify its default state and prevent accidental use before assignment.
| int i, recent_used_cpu, prev_aff = -1; | |
| int i, recent_used_cpu = -1, prev_aff = -1; |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: opsiff The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
bugzilla: [https://bugzilla.openanolis.cn/show_bug.cgi?id=8001]
Backport the follow-up work to support cluster scheduler. Previously
we have added cluster level in the scheduler for both ARM64[1] and
X86[2] to support load balance between clusters to bring more memory
bandwidth and decrease cache contention. This patchset, on the other
hand, takes care of wake-up path by giving CPUs within the same cluster
a try before scanning the whole LLC to benefit those tasks communicating
with each other.
Barry Song (2):
sched: Add cpus_share_resources API
sched/fair: Scan cluster before scanning LLC in wake-up path
Yicong Yang (1):
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
Link: https://gitee.com/anolis/cloud-kernel/pulls/2678
Summary by Sourcery
Backport cluster scheduler wakeup optimizations to prioritize CPUs within the same cluster on task wakeup paths.
New Features:
Enhancements: