
Commit 48273ed

joshuahahn authored and gregkh committed
mm/page_alloc/vmstat: simplify refresh_cpu_vm_stats change detection
[ Upstream commit 0acc67c ]

Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly
high number of softlockups. Further investigation showed that the zone
lock in free_pcppages_bulk was being held for a long time, and that the
function was called to free 2k+ pages over 100 times just during boot.

This causes starvation in other processes for the zone lock, which can
lead to the system stalling as multiple threads cannot make progress
without the locks. We can see these issues manifesting as warnings:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:           hardirqs   softirqs   csw/system
[ 4512.638793] rcu:  number:        0        145            0
[ 4512.651177] rcu: cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings don't indicate a crash or a kernel panic, they do
point to the underlying issue of lock contention. To prevent starvation
in both locks, batch the freeing of pages using pcp->batch. Because
free_pcppages_bulk is called with the pcp lock held and acquires the
zone lock itself, relinquishing and reacquiring the locks is only
effective when both of them are broken together (unless the system was
built with queued spinlocks). Thus, instead of modifying
free_pcppages_bulk to break both locks, batch the freeing from its
callers instead.

A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.

Testing
=======

The following are a few synthetic benchmarks, made on three machines.
The first is a large machine with 754GiB memory and 316 processors. The
second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, with 62GiB
memory and 36 processors. On all machines, I kick off a kernel build
with -j$(nproc). A negative delta is better (faster compilation).

Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta (%) |
+------------+---------------+-----------+
| real       |        0.8070 |  - 1.4865 |
| user       |        0.2823 |  + 0.4081 |
| sys        |        5.0267 |  -11.8737 |
+------------+---------------+-----------+

Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta (%) |
+------------+---------------+-----------+
| real       |        0.2806 |   +0.0351 |
| user       |        0.0994 |   +0.3170 |
| sys        |        0.6229 |   -0.6277 |
+------------+---------------+-----------+

Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta (%) |
+------------+---------------+-----------+
| real       |        0.1503 |   -2.6585 |
| user       |        0.0431 |   -2.2984 |
| sys        |        0.1870 |   -3.2013 |
+------------+---------------+-----------+

Here, variation is the coefficient of variation, i.e. standard
deviation / mean. Based on these results, how much lock contention this
reduces varies by machine. For the largest and smallest machines that I
ran the tests on, there is a significant reduction, and some performance
increase is visible from userspace as well.

Interestingly, the performance gains don't scale with the size of the
machine; instead there is a dip in the gain for the medium-sized
machine. One possible theory is that because the high watermark depends
on both memory and the number of local CPUs, what impacts zone
contention the most is not these individual values, but rather the
ratio of mem:processors.

This patch (of 5):

Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future. However, all of refresh_cpu_vm_stats's callers either
(a) ignore the result, only caring about performing the updates, or
(b) only care about whether changes were made, but not *how many*
changes were made. Simplify the code by returning a bool instead, to
indicate whether updates were made. In addition, simplify fold_diff and
decay_pcp_high to return a bool for the same reason.

Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 038a102 ("mm/page_alloc: prevent pcp corruption with SMP=n")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
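The caller-side batching pattern the series describes can be sketched as
follows. This is a minimal illustration, not code from the series:
drain_in_batches is a hypothetical helper, and it assumes only what the
hunks below show, namely that free_pcppages_bulk is called under the pcp
lock and takes the zone lock internally.

/*
 * Minimal sketch of caller-side batching (hypothetical helper, not
 * part of this series): free in pcp->batch sized chunks, dropping the
 * pcp lock between chunks. Since free_pcppages_bulk takes the zone
 * lock internally, both locks are broken together on each iteration,
 * letting other CPUs contending on either lock make progress.
 */
static void drain_in_batches(struct zone *zone, struct per_cpu_pages *pcp)
{
        int batch = READ_ONCE(pcp->batch);
        int todo = READ_ONCE(pcp->count);

        while (todo > 0) {
                int to_free = min(batch, todo);

                spin_lock(&pcp->lock);
                free_pcppages_bulk(zone, to_free, pcp, 0);
                spin_unlock(&pcp->lock);
                todo -= to_free;
        }
}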
1 parent ce35825 commit 48273ed

3 files changed: 20 additions & 18 deletions


include/linux/gfp.h (1 addition, 1 deletion)

@@ -397,7 +397,7 @@ extern void page_frag_free(void *addr);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);

mm/page_alloc.c (4 additions, 4 deletions)

@@ -2363,10 +2363,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to decay the PCP high.
  * Return whether there are addition works to do.
  */
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
         int high_min, to_drain, batch;
-        int todo = 0;
+        bool todo = false;
 
         high_min = READ_ONCE(pcp->high_min);
         batch = READ_ONCE(pcp->batch);
@@ -2379,15 +2379,15 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
                 pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
                                  pcp->high - (pcp->high >> 3), high_min);
                 if (pcp->high > high_min)
-                        todo++;
+                        todo = true;
         }
 
         to_drain = pcp->count - pcp->high;
         if (to_drain > 0) {
                 spin_lock(&pcp->lock);
                 free_pcppages_bulk(zone, to_drain, pcp, 0);
                 spin_unlock(&pcp->lock);
-                todo++;
+                todo = true;
         }
 
         return todo;
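For intuition about the decay arithmetic in the hunk above, here is a
worked example with assumed values; none of these numbers come from the
patch, and CONFIG_PCP_BATCH_SCALE_MAX = 5 is only an assumption here.

/*
 * Assume pcp->count = 1000, batch = 64, CONFIG_PCP_BATCH_SCALE_MAX = 5,
 * pcp->high = 800, high_min = 100 (all assumed values). Then:
 *
 *   pcp->count - (batch << 5)    = 1000 - 2048 = -1048
 *   pcp->high - (pcp->high >> 3) =  800 -  100 =   700
 *   high_min                     =                 100
 *
 * max3() picks 700, so pcp->high decays by one eighth per pass toward
 * high_min, and to_drain = pcp->count - pcp->high = 300 pages are freed
 * under the pcp lock. todo ends up true: pages were drained and
 * pcp->high is still above high_min, so more work remains.
 */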

mm/vmstat.c (15 additions, 13 deletions)

@@ -768,25 +768,25 @@ EXPORT_SYMBOL(dec_node_page_state);
 
 /*
  * Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
  */
 static int fold_diff(int *zone_diff, int *node_diff)
 {
         int i;
-        int changes = 0;
+        bool changed = false;
 
         for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                 if (zone_diff[i]) {
                         atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
-                        changes++;
+                        changed = true;
                 }
 
         for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
                 if (node_diff[i]) {
                         atomic_long_add(node_diff[i], &vm_node_stat[i]);
-                        changes++;
+                        changed = true;
                 }
-        return changes;
+        return changed;
 }
 
 /*
@@ -803,16 +803,16 @@ static int fold_diff(int *zone_diff, int *node_diff)
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
  *
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
  */
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
 {
         struct pglist_data *pgdat;
         struct zone *zone;
         int i;
         int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
         int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
-        int changes = 0;
+        bool changed = false;
 
         for_each_populated_zone(zone) {
                 struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -836,7 +836,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
                 if (do_pagesets) {
                         cond_resched();
 
-                        changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+                        if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+                                changed = true;
 #ifdef CONFIG_NUMA
                         /*
                          * Deal with draining the remote pageset of this
@@ -858,13 +859,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
                         }
 
                         if (__this_cpu_dec_return(pcp->expire)) {
-                                changes++;
+                                changed = true;
                                 continue;
                         }
 
                         if (__this_cpu_read(pcp->count)) {
                                 drain_zone_pages(zone, this_cpu_ptr(pcp));
-                                changes++;
+                                changed = true;
                         }
 #endif
                 }
@@ -884,8 +885,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
                 }
         }
 
-        changes += fold_diff(global_zone_diff, global_node_diff);
-        return changes;
+        if (fold_diff(global_zone_diff, global_node_diff))
+                changed = true;
+        return changed;
 }
 
 /*