Skip to content

Commit 8296039

Browse files
lukaszraczylosmira
authored andcommitted
fix: macb silent TX stall on BCM2712/RP1 (v2 patches from netdev)
Update the three net-macb silent-TX-stall patches from RFC v1 to PATCH net-next v2. The v2 series is on lore at: https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/ v2 changes from v1 (already merged in #1526): * 0001 (PCIe posted-write flush after TSTART doorbell) - now gated behind a new MACB_CAPS_PCIE_POSTED_WRITES capability, set only on raspberrypi_rp1_config. v1 applied the readback to every macb variant; SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx) have no fabric posted-write concern and were paying the non-posted-read latency for nothing. * 0002 (PCIe read barrier before TX completion descriptor check) - replaces the v1 form, which was a regression on read-clear ISR silicon. v1 read ISR with a TCOMP mask in macb_tx_poll(); on raspberrypi_rp1_config (where MACB_CAPS_ISR_CLEAR_ON_WRITE is not set) that read consumes every bit set in ISR, but the use-site masks down to TCOMP and discards the rest -- any RCOMP / ROVR / TXUBR bit at that instant is silently consumed. v2 replaces with (void)queue_readl(queue, IMR), the read-only mask mirror -- non-destructive, same PCIe-barrier effect. * 0003 (TX stall watchdog) - tracks tail movement via a bool flag set by macb_tx_complete() instead of a tx_tail snapshot (form suggested by Phil Elwell on raspberrypi/linux#7340). Adds a netif_carrier_ok() gate. Wraps netdev_warn in printk_ratelimit() so operators can count occurrences while bounding log noise. (An earlier draft used the macro netdev_warn_ratelimited(), which does not exist in this kernel -- caught by John Laur's build test on the v2 patches.) Production runtime so far: 24-node Pi 5 fleet on v2 patch-2 IMR-barrier form since 2026-05-14 14:00 UTC, ~190 cumulative node-hours, zero mid-runtime TX stalls. Pre-patch baseline (~0.5 stall/node-hour) would have predicted ~95 stalls; observed 0. Related: * netdev v1 RFC thread: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/ * netdev v2 series: https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/ * raspberrypi/linux merge: raspberrypi/linux#7340 * raspberrypi/linux v2 PR: raspberrypi/linux#7369 Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com> Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
1 parent c5a1685 commit 8296039

6 files changed

Lines changed: 339 additions & 315 deletions
Lines changed: 72 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -1,81 +1,105 @@
1-
From 3106d546d494f2f52ec832e7f7d04f534286e254 Mon Sep 17 00:00:00 2001
2-
Message-ID: <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
3-
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>
4-
References: <cover.1777064117.git.lukasz@raczylo.com>
1+
From 1c89250b1211cdaa0368dab15686252af6c94a3f Mon Sep 17 00:00:00 2001
52
From: Lukasz Raczylo <lukasz@raczylo.com>
6-
Date: Fri, 24 Apr 2026 21:50:55 +0100
7-
Subject: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after
8-
TSTART doorbell
3+
Date: Thu, 14 May 2026 13:24:48 +0100
4+
Subject: [PATCH net-next v2 1/3] net: macb: flush PCIe posted write after
5+
TSTART doorbell (PCIe-only)
96

107
macb_start_xmit() and macb_tx_restart() kick transmission by
11-
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances
12-
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
13-
have in front of us), writes to NCR are posted PCIe writes: they
14-
are not guaranteed to reach the device before the issuing CPU
15-
returns. If the TSTART doorbell does not reach the MAC, no TX
16-
begins, no TCOMP completion arrives, and the ring remains
17-
quiescent without any kernel-visible indication.
8+
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb
9+
instances (BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is
10+
the case I have in front of me), writes to NCR are posted PCIe
11+
writes: they are not guaranteed to reach the device before the
12+
issuing CPU returns. If the TSTART doorbell does not reach the
13+
MAC, no TX begins, no TCOMP completion arrives, and the ring
14+
remains quiescent without any kernel-visible indication.
1815

19-
Note that the raspberrypi/linux vendor fork carries a local patch
20-
around the TSTART site (a queue->tx_pending breadcrumb that is
21-
promoted to queue->txubr_pending by the next TCOMP interrupt,
22-
triggering macb_tx_restart()). That workaround makes the loss
23-
recoverable under traffic, but it cannot help if TCOMP itself is
24-
not raised because no TX started -- which is exactly the case we
25-
are targeting here. The handshake is not present in mainline.
16+
Add a read-back of NCR after each TSTART write. The read is an
17+
architected PCIe read barrier for earlier posted writes on the
18+
same path; it ensures the doorbell has reached the MAC before
19+
the function returns. As a side effect on macb_start_xmit() it
20+
also flushes the preceding macb_tx_lpi_wake() NCR write -- not
21+
just TSTART -- since the barrier applies to all prior posted
22+
writes by the same requester.
2623

27-
Add a read-back of NCR after each TSTART write in macb_start_xmit()
28-
and macb_tx_restart(). The read is an architected PCIe read
29-
barrier for earlier posted writes on the same path; it ensures the
30-
doorbell has reached the MAC before the functions return.
24+
The cost is one non-posted PCIe read per TSTART. To avoid
25+
imposing this on SoC-integrated macb variants (Atmel, Microchip,
26+
SiFive, Xilinx), where NCR is on-chip MMIO and no fabric
27+
posted-write concern exists, gate the readback behind a new
28+
MACB_CAPS_PCIE_POSTED_WRITES capability set only on
29+
raspberrypi_rp1_config.
3130

32-
We do not yet have direct hardware evidence that TSTART is being
33-
lost on the RP1 path (that would require a PCIe protocol analyser,
34-
or at minimum a before/after counter on queue->tx_stall_last_tail
35-
with and without this patch applied in isolation). This patch is
36-
one of a three-patch series ("candidate fixes for silent TX stall
37-
on BCM2712/RP1"); see the cover letter for context. We have
38-
verified the series compiles and applies cleanly against mainline
39-
HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f;
40-
runtime verification is pending.
31+
Note that the raspberrypi/linux vendor fork carries a local
32+
patch around the TSTART site (a queue->tx_pending breadcrumb
33+
that is promoted to queue->txubr_pending by the next TCOMP
34+
interrupt, triggering macb_tx_restart()). That workaround makes
35+
the loss recoverable under traffic, but it cannot help if TCOMP
36+
itself is not raised because no TX started -- which is exactly
37+
the case targeted here. The handshake is not present in
38+
mainline.
4139

4240
Link: https://github.com/cilium/cilium/issues/43198
4341
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
4442
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
4543
---
46-
drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++
47-
1 file changed, 12 insertions(+)
44+
drivers/net/ethernet/cadence/macb.h | 4 ++++
45+
drivers/net/ethernet/cadence/macb_main.c | 15 +++++++++++++++
46+
2 files changed, 19 insertions(+)
4847

48+
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
49+
index 2de56017e..ce9037f9e 100644
50+
--- a/drivers/net/ethernet/cadence/macb.h
51+
+++ b/drivers/net/ethernet/cadence/macb.h
52+
@@ -791,6 +791,10 @@
53+
#define MACB_CAPS_USRIO_HAS_MII BIT(26)
54+
#define MACB_CAPS_USRIO_HAS_REFCLK_SOURCE BIT(27)
55+
#define MACB_CAPS_USRIO_HAS_TSUCLK_SOURCE BIT(28)
56+
+/* Register writes are posted on the parent fabric and need a non-posted
57+
+ * read-back to guarantee delivery. Currently set only on RP1.
58+
+ */
59+
+#define MACB_CAPS_PCIE_POSTED_WRITES BIT(29)
60+
61+
/* LSO settings */
62+
#define MACB_LSO_UFO_ENABLE 0x01
4963
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
50-
index a12aa2124..b6cca55ad 100644
64+
index a12aa2124..6879f3458 100644
5165
--- a/drivers/net/ethernet/cadence/macb_main.c
5266
+++ b/drivers/net/ethernet/cadence/macb_main.c
53-
@@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue)
67+
@@ -1922,6 +1922,14 @@ static void macb_tx_restart(struct macb_queue *queue)
5468

5569
spin_lock(&bp->lock);
5670
macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
5771
+ /*
58-
+ * Flush the PCIe posted-write queue so the TSTART doorbell
59-
+ * reliably reaches the MAC. Without this, the write can sit
60-
+ * in the fabric and the MAC never advances, causing a silent
61-
+ * TX stall.
72+
+ * On PCIe-attached parts, flush the posted-write queue so the
73+
+ * TSTART doorbell reliably reaches the MAC. Without this the
74+
+ * write can sit in the fabric and the MAC never advances,
75+
+ * causing a silent TX stall.
6276
+ */
63-
+ (void)macb_readl(bp, NCR);
77+
+ if (bp->caps & MACB_CAPS_PCIE_POSTED_WRITES)
78+
+ (void)macb_readl(bp, NCR);
6479
spin_unlock(&bp->lock);
6580

6681
out_tx_ptr_unlock:
67-
@@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
82+
@@ -2560,6 +2568,12 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
6883
spin_lock(&bp->lock);
6984
macb_tx_lpi_wake(bp);
7085
macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
7186
+ /*
72-
+ * Flush the PCIe posted-write queue; see the comment in
73-
+ * macb_tx_restart() for the reasoning.
87+
+ * Flush PCIe posted-write queue; see comment in macb_tx_restart().
88+
+ * Also flushes the preceding macb_tx_lpi_wake() NCR write.
7489
+ */
75-
+ (void)macb_readl(bp, NCR);
90+
+ if (bp->caps & MACB_CAPS_PCIE_POSTED_WRITES)
91+
+ (void)macb_readl(bp, NCR);
7692
spin_unlock(&bp->lock);
7793

7894
if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1)
95+
@@ -5674,6 +5688,7 @@ static const struct macb_config raspberrypi_rp1_config = {
96+
.caps = MACB_CAPS_GIGABIT_MODE_AVAILABLE | MACB_CAPS_CLK_HW_CHG |
97+
MACB_CAPS_JUMBO |
98+
MACB_CAPS_GEM_HAS_PTP |
99+
+ MACB_CAPS_PCIE_POSTED_WRITES |
100+
MACB_CAPS_EEE |
101+
MACB_CAPS_USRIO_HAS_MII,
102+
.dma_burst_length = 16,
79103
--
80-
2.53.0
104+
2.54.0
81105

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
From 3d14f2380e6b568f0fe5a9a9a0b5351aea50af3c Mon Sep 17 00:00:00 2001
2+
From: Lukasz Raczylo <lukasz@raczylo.com>
3+
Date: Thu, 14 May 2026 13:25:25 +0100
4+
Subject: [PATCH net-next v2 2/3] net: macb: insert PCIe read barrier before TX
5+
completion descriptor check
6+
7+
macb_tx_poll() runs with TCOMP masked, drains the TX ring, then
8+
calls napi_complete_done() and re-enables TCOMP via IER. An
9+
existing comment in the function notes that completions raised
10+
while TCOMP is masked do not re-fire on IER re-enable, and
11+
mitigates this by calling macb_tx_complete_pending(), which
12+
inspects driver-visible ring state (descriptor->ctrl, after
13+
rmb()) and reschedules NAPI if a completion is observable in
14+
memory.
15+
16+
On PCIe-attached parts (BCM2712 + RP1 PCIe south bridge on
17+
Raspberry Pi 5 is the case I have in front of me), the
18+
descriptor DMA write that sets TX_USED may not have retired to
19+
system memory at the point macb_tx_complete_pending() runs. The
20+
rmb() synchronises the CPU view of earlier CPU writes; it is
21+
not sufficient to retire an in-flight peripheral DMA write.
22+
Under that ordering the in-memory descriptor can still read
23+
TX_USED=0 when the hardware has in fact completed the frame;
24+
the check returns false; NAPI exits; the quirk above prevents
25+
the re-enabled IER from re-firing; the ring goes quiescent.
26+
27+
Add a side-effect-free MMIO read between the IER write and the
28+
macb_tx_complete_pending() check. The read functions as an
29+
architected PCIe read barrier for earlier peripheral-originated
30+
DMA writes on the same path, so any in-flight TX_USED update
31+
retires to system memory before the descriptor read.
32+
33+
The register chosen is IMR (the read-only interrupt mask
34+
mirror); reading it has no side effects on either read-clear or
35+
W1C ISR silicon (it is not the ISR), and the read still flushes
36+
prior DMA writes via the PCIe completion-ordering guarantee.
37+
38+
Link: https://github.com/cilium/cilium/issues/43198
39+
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
40+
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
41+
---
42+
drivers/net/ethernet/cadence/macb_main.c | 8 ++++++++
43+
1 file changed, 8 insertions(+)
44+
45+
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
46+
index 6879f3458..f7fa9e7ad 100644
47+
--- a/drivers/net/ethernet/cadence/macb_main.c
48+
+++ b/drivers/net/ethernet/cadence/macb_main.c
49+
@@ -1984,6 +1984,14 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
50+
* actions if an interrupt is raised just after enabling them,
51+
* but this should be harmless.
52+
*/
53+
+ /*
54+
+ * PCIe read barrier: flush any in-flight peripheral DMA
55+
+ * writes (descriptor TX_USED updates) so the subsequent
56+
+ * macb_tx_complete_pending() check observes them. IMR is
57+
+ * the read-only interrupt mask mirror; the read has no
58+
+ * side effects on either read-clear or W1C ISR silicon.
59+
+ */
60+
+ (void)queue_readl(queue, IMR);
61+
if (macb_tx_complete_pending(queue)) {
62+
queue_writel(queue, IDR, MACB_BIT(TCOMP));
63+
macb_queue_isr_clear(bp, queue, MACB_BIT(TCOMP));
64+
--
65+
2.54.0
66+

kernel/build/patches/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch

Lines changed: 0 additions & 106 deletions
This file was deleted.

0 commit comments

Comments
 (0)