ucentral-schema: bridge SSID tx_failed/tx_retries gap on mt76 and ath11k #1101

Open
firasshaari wants to merge 1 commit into Telecominfraproject:staging-for-v4.2.0-LTS from firasshaari:firasshaari/state-tx-failed-retries

@firasshaari
Contributor

Problem

mt76 (mt7621/mt7915) and ath11k (QSDK) drivers on the wlan-ap kernel do not propagate per-STA TX status to mac80211. As a result, the Kafka state payload always reports

interfaces[].ssids[].counters.tx_failed       = 0
interfaces[].ssids[].counters.tx_retries      = 0
interfaces[].ssids[].delta_counters.tx_failed = 0
interfaces[].ssids[].delta_counters.tx_retries = 0

…even under heavy traffic. Cloud consumers (Kafka subscribers, dashboards, alerting) get no TX-failure signal at all.

Approach

Each driver does count semantically comparable RF-level retry/failure events in its own debugfs interface; state.uc just has to read them from the right place:

| Driver | File | Counter |
| --- | --- | --- |
| mt76 | /sys/kernel/debug/ieee80211/<phy>/mt76/tx_stats | BA miss count (unicast block-ACK miss) |
| ath11k | /sys/kernel/debug/ieee80211/<phy>/ath11k/htt_stats (write 1 to htt_stats_type first) | tx_xretry from HTT_TX_PDEV_STATS_CMN_TLV (type 1) |

Both counters represent the same thing: unicast frames whose ACK never came back and triggered a retry. Values are directly comparable across driver families.
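For illustration, extracting the two counters from the debugfs text could look roughly like the sketch below. It is written in Python rather than ucode, so it is not the code in the patch. The line formats (`BA miss count: N` and `tx_xretry = N`) and the function name `parse_phy_tx_failures` are assumptions and should be checked against the actual debugfs output on the target.

```python
import re
from typing import Optional

# Assumed line formats; verify against real debugfs output before relying on them.
MT76_BA_MISS = re.compile(r"BA miss count:\s*(\d+)")
ATH11K_XRETRY = re.compile(r"tx_xretry\s*=\s*(\d+)")


def parse_phy_tx_failures(driver: str, text: str) -> Optional[int]:
    """Return the phy-aggregate failure counter, or None if it is not present.

    `driver` is "mt76" or "ath11k"; `text` is the content of the matching
    debugfs file (tx_stats or htt_stats).
    """
    pattern = {"mt76": MT76_BA_MISS, "ath11k": ATH11K_XRETRY}.get(driver)
    if pattern is None:
        return None  # unknown driver: skip silently
    match = pattern.search(text)
    return int(match.group(1)) if match else None
```

Returning `None` instead of raising keeps the "skip silently" behaviour for phys that expose neither file.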

state.uc now reads whichever path matches the phy's driver, then projects the phy-aggregate failure count onto each VAP on the radio, weighted by tx_packets share so heavier-traffic VAPs absorb the larger fraction of failures. The existing generate_deltas pipeline picks up the new values into ssid.delta_counters automatically.
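The weighting can be sketched as follows. This is a Python illustration of the idea, not the ucode in the patch. The function name `project_failures`, the interface names, and the choice of integer floor division for rounding are assumptions.

```python
from typing import Dict


def project_failures(phy_failures: int, tx_packets: Dict[str, int]) -> Dict[str, int]:
    """Split a phy-aggregate failure count across VAPs by tx_packets share.

    `tx_packets` maps a VAP name to its transmitted packet count.
    Floor division is an assumption; a few counts may be lost to rounding.
    """
    total = sum(tx_packets.values())
    if total <= 0:
        # No traffic on the radio: nothing to attribute.
        return {vap: 0 for vap in tx_packets}
    return {vap: phy_failures * pkts // total for vap, pkts in tx_packets.items()}


# Hypothetical three-VAP radio: the idle VAP gets nothing.
shares = project_failures(254, {"wlan0": 900, "wlan0-1": 100, "wlan0-2": 0})
# shares == {"wlan0": 228, "wlan0-1": 25, "wlan0-2": 0}
```

With a single VAP on the radio the whole phy count lands on that VAP, which matches the single-SSID case verified in this PR.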

No schema additions, no new fields — the existing schema fields that consumers already key off just stop being zero.

What changed

One file added under feeds/ucentral/ucentral-schema/patches/:

feeds/ucentral/ucentral-schema/patches/004-state-add-mt76-phy-tx-stats.patch  (new, +125 lines)

Applied by openwrt's package build system to the wlan-ucentral-schema source (system/state.uc). Same mechanism as patches 001/002/003 already in that directory. Drivers, kernel, and the upstream schema repo are untouched.

Verification

Built and deployed on both representative APs against this branch:

| Target | Driver | Result |
| --- | --- | --- |
| yuncore_ax820 (ramips/mt7621) | mt76/mt7915e | ✅ live Kafka shows non-zero tx_failed/tx_retries per SSID |
| yuncore_fap655 (ipq50xx) | ath11k QSDK | ✅ live Kafka shows non-zero tx_failed/tx_retries per SSID |

Live sample from state topic, both APs simultaneously serving real STA traffic:

// AX820 (mt76)  ssid.counters / ssid.delta_counters
{ ..., "tx_failed": 254,  "tx_retries": 254 }
{ ..., "tx_failed": 254,  "tx_retries": 254 }

// FAP655 (ath11k)  ssid.counters / ssid.delta_counters
{ ..., "tx_failed": 2400, "tx_retries": 2400 }
{ ..., "tx_failed": 2400, "tx_retries": 2400 }

Both counters increment monotonically across consecutive 60 s samples. The raw phy counters (mt76 BA miss, ath11k tx_xretry) were confirmed to be of the same order of magnitude across drivers, which is consistent with the two values representing the same event.
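As a rough illustration of how a delta between two consecutive samples can be derived from monotonic counters, here is a minimal Python sketch. It does not reproduce the existing generate_deltas code. The function name `counter_deltas`, the reset handling, and the second-sample values are assumptions made for the example.

```python
from typing import Dict


def counter_deltas(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Per-counter difference between two samples.

    If a counter went backwards (assumed to mean a driver or radio reset),
    the current value is reported as the delta instead of a negative number.
    """
    deltas = {}
    for name, value in current.items():
        before = previous.get(name, 0)
        deltas[name] = value - before if value >= before else value
    return deltas


# First sample taken from this PR; the second sample is hypothetical.
first = {"tx_failed": 254, "tx_retries": 254}
second = {"tx_failed": 300, "tx_retries": 300}
delta = counter_deltas(first, second)
# delta == {"tx_failed": 46, "tx_retries": 46}
```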

Caveats

  • tx_retries is set equal to the per-SSID failure value because neither driver exposes a separate per-frame retry counter at this aggregation level. Better than zero; refine when the per-STA driver fix lands.
  • associations[].tx_failed / associations[].tx_retries (per-STA inside the SSID block) remain zero — those need a driver-level fix in mt76 PPDU-TXS handling and ath11k WBM→sta_info path. Out of scope here.
  • ath11k path triggers a fresh htt_stats request by writing 1 to htt_stats_type on each state.uc tick, with a 100 ms sleep before reading the response. state.uc already runs at intervals of ≥ 60 s, so the overhead is negligible.
  • Phys whose driver exposes neither debugfs file silently skip — no impact on other hardware.
  • Long-term, the same change should be opened against Telecominfraproject/wlan-ucentral-schema upstream; this local patch can then be removed when the schema pin in wlan-ap is bumped.
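The ath11k request/response sequence and the silent-skip behaviour from the caveats above can be sketched like this. It is a Python illustration under stated assumptions, not the ucode in the patch. The function name `read_ath11k_htt_stats` is hypothetical, and both files are assumed to sit in the same ath11k debugfs directory.

```python
import time
from pathlib import Path
from typing import Optional


def read_ath11k_htt_stats(ath11k_dir: Path, settle_s: float = 0.1) -> Optional[str]:
    """Request HTT stats type 1, wait, then read the response text.

    Returns None when either debugfs file is missing, so phys driven by
    other hardware are skipped without error.
    """
    type_file = ath11k_dir / "htt_stats_type"
    stats_file = ath11k_dir / "htt_stats"
    if not type_file.exists() or not stats_file.exists():
        return None
    type_file.write_text("1")  # select stats type 1, as described above
    time.sleep(settle_s)       # 100 ms by default, matching the PR description
    return stats_file.read_text()
```

On a real AP the directory would be the phy's ath11k debugfs directory. Whether 100 ms is always long enough for the firmware to respond is not established here.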

Test plan

  • yuncore_ax820 builds cleanly with the patch
  • yuncore_fap655 builds cleanly with the patch
  • Deployed images on real APs running cloud-managed configs
  • Live Kafka state topic shows non-zero tx_failed / tx_retries on the SSID counters and delta_counters blocks for both AP families
  • Counters increment monotonically across consecutive 60s state samples
  • Phy-aggregate raw counters (mt76 BA miss, ath11k tx_xretry) confirmed in the same magnitude across drivers
  • Functional test on multi-SSID phy to confirm weighted attribution (single-SSID path verified here)

mt76 (mt7621/mt7915) and ath11k (QSDK) drivers on the wlan-ap kernel do
not propagate per-STA TX status to mac80211. As a result, the Kafka
state payload always reports

    interfaces[].ssids[].counters.tx_failed       = 0
    interfaces[].ssids[].counters.tx_retries      = 0
    interfaces[].ssids[].delta_counters.tx_failed = 0
    interfaces[].ssids[].delta_counters.tx_retries = 0

even under heavy traffic, leaving cloud consumers with no TX-failure
signal to work with.

Each driver does count semantically-comparable RF-only retry/fail data
in its own debugfs interface:

  mt76 reads: /sys/kernel/debug/ieee80211/<phy>/mt76/tx_stats
              BA miss count  (unicast block-ACK miss count)

  ath11k reads: /sys/kernel/debug/ieee80211/<phy>/ath11k/htt_stats
                tx_xretry from HTT_TX_PDEV_STATS_CMN_TLV (type 1)
                triggered by writing 1 to .../ath11k/htt_stats_type
                (excess-retry count -- frames whose ACK never came back)

Both counters represent the same thing -- unicast frames that needed a
retry because the receiver never ACKed -- so the values are directly
comparable across driver families.

The phy-aggregate count is then projected onto each VAP on the radio,
weighted by the VAP's tx_packets share. A VAP with no traffic gets
nothing; a VAP carrying most of the load takes most of the failure
budget. The existing generate_deltas pipeline then populates
ssid.delta_counters from those values unchanged. No schema additions,
no new fields -- the fields that consumers already key off just stop
being zero.

Verified end-to-end on:
  - yuncore_ax820  (ramips/mt7621, mt7915e)   -> mt76 BA miss
  - yuncore_fap655 (ipq50xx, ath11k QSDK)     -> ath11k tx_xretry

Both APs show monotonically increasing nonzero tx_failed and tx_retries
on their SSID counters and delta_counters across consecutive samples in
the live Kafka state topic.

tx_retries is set equal to the per-SSID failure value (neither driver
exposes a separate per-frame retry counter at this level).
associations[].tx_failed/tx_retries remain 0 -- those need a
driver-level fix.

Signed-off-by: Firas Shaari <firas@80211networks.com>
@blogic
Contributor

blogic commented May 26, 2026

@firasshaari patches need to land in ucentral-schema HEAD before we can backport them to the LTS branch
