Commit 8ffdb1a

Robert Karp and claude committed
fix: suppress APM error events for receive-loop cancellations during shutdown (5.7.5)
Setting Outcome=Success (5.7.4) was insufficient: Elastic APM captures error events at the DiagnosticSource level before ReceiverWrapper runs, so the error document was already queued regardless of the outcome override.

Registers a one-time Agent.AddFilter(IError) that drops error events whose TransactionId matches a cancelled-receive transaction, preventing them from reaching the APM server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent f0e645f commit 8ffdb1a

2 files changed

Lines changed: 51 additions & 2 deletions

docs/CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## 5.7.5
+- Fixed
+  - `ApmTransactionManager` now registers the APM error filter **at construction time** (application startup) instead of lazily inside `OnReceiveCancelled()`. The lazy approach lost a race: during pod graceful shutdown the APM agent flushes its internal buffer concurrently with Service Bus processor teardown, so error events could be sent to APM before `ReceiverWrapper.OnExceptionOccured` ran and had a chance to register the filter. Registering at construction time, before any message processing starts, closes this window. A fallback call in `OnReceiveCancelled()` handles the edge case where the APM agent was not yet configured at construction.
+
 ## 5.7.4
 - Fixed
   - Prevented `OperationCanceledException` during pod graceful shutdown from being recorded as APM errors. Added `ICancellationAwareTransactionManager`, an optional interface that `ITransactionManager` implementations can implement to react to receive-loop cancellations. `ApmTransactionManager` implements it by setting the current Elastic APM transaction outcome to `Success`, overriding the error state set by the Azure SDK's auto-instrumentation. `ReceiverWrapper` calls `OnReceiveCancelled()` via a runtime cast before logging the shutdown warning.
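
The optional-interface pattern described in the 5.7.4 entry can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the empty `ITransactionManager` here is a simplified stand-in, and `PlainManager`, `CancellationAwareManager`, and `ReceiveLoop.NotifyCancelled` are hypothetical names introduced only for the example.

```csharp
using System;

// Simplified stand-in for the library interface; the real one has members not shown here.
public interface ITransactionManager { }

// Optional capability: managers that implement it are told about receive-loop cancellations.
public interface ICancellationAwareTransactionManager
{
    void OnReceiveCancelled();
}

// Hypothetical manager that does not opt in.
public sealed class PlainManager : ITransactionManager { }

// Hypothetical manager that opts in and counts the notifications it receives.
public sealed class CancellationAwareManager : ITransactionManager, ICancellationAwareTransactionManager
{
    public int CancelledCount { get; private set; }
    public void OnReceiveCancelled() => CancelledCount++;
}

public static class ReceiveLoop
{
    // Hypothetical helper: runtime cast via pattern matching.
    // Returns true when the manager opted in and was notified.
    public static bool NotifyCancelled(ITransactionManager manager)
    {
        if (manager is ICancellationAwareTransactionManager aware)
        {
            aware.OnReceiveCancelled();
            return true;
        }
        return false;
    }
}

public static class Program
{
    public static void Main()
    {
        var aware = new CancellationAwareManager();
        Console.WriteLine(ReceiveLoop.NotifyCancelled(new PlainManager())); // False
        Console.WriteLine(ReceiveLoop.NotifyCancelled(aware));              // True
        Console.WriteLine(aware.CancelledCount);                            // 1
    }
}
```

The runtime cast keeps the base interface unchanged, so existing `ITransactionManager` implementations continue to compile without implementing the new method.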

src/Ev.ServiceBus.Apm/ApmTransactionManager.cs

Lines changed: 47 additions & 2 deletions
@@ -1,6 +1,8 @@
 using System;
+using System.Collections.Concurrent;
 using System.Collections.Generic;
 using System.Diagnostics;
+using System.Threading;
 using System.Threading.Tasks;
 using Elastic.Apm;
 using Elastic.Apm.Api;
@@ -14,6 +16,25 @@ namespace Ev.ServiceBus.Apm;
 /// </summary>
 public class ApmTransactionManager : ITransactionManager, ICancellationAwareTransactionManager
 {
+    // Tracks transaction IDs for which the ASB ProcessErrorAsync callback fired an OperationCanceledException
+    // (the standard signal that the receive loop is being stopped, most commonly during pod graceful shutdown).
+    // The error filter below suppresses APM error events for these transactions so that
+    // shutdown-induced TaskCanceledException entries do not appear in APM.
+    // Capped at 1000 entries as a safety net: normal pod shutdown produces ~50 entries; in any
+    // edge case where the processor is stopped and restarted mid-lifecycle the cap prevents unbounded growth.
+    private static readonly ConcurrentDictionary<string, byte> _cancelledTransactionIds = new();
+    private const int CancelledTransactionIdCap = 1000;
+    private static int _filterRegistered; // 0 = not registered, 1 = registered
+
+    public ApmTransactionManager()
+    {
+        // Register the shutdown-cancellation error filter at construction time (application startup),
+        // not lazily on first OnReceiveCancelled(). During pod graceful shutdown the APM agent flushes
+        // its buffer concurrently with Service Bus processor teardown; registering the filter after the
+        // first OperationCanceledException fires loses that race and lets error events escape to APM.
+        RegisterShutdownErrorFilter();
+    }
+
     public async Task RunWithInTransaction(MessageExecutionContext executionContext, Func<Task> transaction)
     {
         if (IsTraceEnabled())
@@ -73,8 +94,32 @@ private static List<SpanLink> GetSpanLinks(string? diagnosticId)
 
     public void OnReceiveCancelled()
     {
-        if (IsTraceEnabled())
-            Agent.Tracer.CurrentTransaction.Outcome = Outcome.Success;
+        if (!IsTraceEnabled())
+            return;
+
+        var tx = Agent.Tracer.CurrentTransaction;
+        if (tx is null) return;
+        tx.Outcome = Outcome.Success;
+        // Soft cap: Count + TryAdd are not atomic, so the dict can slightly exceed the limit under
+        // concurrent shutdown. This is intentional: the cap is a safety net against unbounded growth
+        // in edge cases, not a strict hard limit. Normal shutdown adds ~50 entries at most.
+        if (_cancelledTransactionIds.Count < CancelledTransactionIdCap)
+            _cancelledTransactionIds.TryAdd(tx.Id, 0);
+
+        // Fallback: if the agent was not yet configured when the constructor ran, register now.
+        RegisterShutdownErrorFilter();
+    }
+
+    private static void RegisterShutdownErrorFilter()
+    {
+        if (!Agent.IsConfigured || Interlocked.CompareExchange(ref _filterRegistered, 1, 0) != 0)
+            return;
+
+        // Returning null from the filter drops the error event before it reaches the APM server.
+        Agent.AddFilter((IError error) =>
+            error.TransactionId is not null && _cancelledTransactionIds.ContainsKey(error.TransactionId)
+                ? null
+                : error);
     }
 
     private static bool IsTraceEnabled()
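
The three mechanisms in this hunk (one-time registration guarded by `Interlocked.CompareExchange`, the soft-capped ID set, and a filter that drops events by returning null) can be tried in isolation with the sketch below. It assumes nothing about Elastic APM: `ErrorEvent`, `ShutdownFilterSketch`, and the in-memory filter list are hypothetical stand-ins for `IError`, `ApmTransactionManager`, and the agent's filter pipeline.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an APM error event; not the Elastic APM IError type.
public sealed record ErrorEvent(string? TransactionId, string Message);

public static class ShutdownFilterSketch
{
    private const int Cap = 1000;
    private static readonly ConcurrentDictionary<string, byte> CancelledIds = new();
    // Hypothetical in-memory filter pipeline standing in for the agent's own.
    private static readonly List<Func<ErrorEvent, ErrorEvent?>> Filters = new();
    private static int _registered; // 0 = not registered, 1 = registered

    // Safe to call from many threads; only the caller that flips 0 -> 1 adds the filter.
    public static void RegisterOnce()
    {
        if (Interlocked.CompareExchange(ref _registered, 1, 0) != 0)
            return;
        lock (Filters)
            Filters.Add(e => e.TransactionId is not null && CancelledIds.ContainsKey(e.TransactionId) ? null : e);
    }

    // Soft cap: Count and TryAdd are separate calls, so the set may slightly overshoot Cap.
    public static void MarkCancelled(string transactionId)
    {
        if (CancelledIds.Count < Cap)
            CancelledIds.TryAdd(transactionId, 0);
    }

    // Runs the event through every filter; a null result drops the event.
    public static ErrorEvent? Apply(ErrorEvent e)
    {
        lock (Filters)
        {
            ErrorEvent? current = e;
            foreach (var filter in Filters)
            {
                current = filter(current);
                if (current is null) return null;
            }
            return current;
        }
    }

    public static int FilterCount { get { lock (Filters) return Filters.Count; } }
}

public static class Program
{
    public static void Main()
    {
        // Eight concurrent registration attempts still yield a single filter.
        Parallel.For(0, 8, _ => ShutdownFilterSketch.RegisterOnce());
        ShutdownFilterSketch.MarkCancelled("tx-1");

        Console.WriteLine(ShutdownFilterSketch.FilterCount);                                           // 1
        Console.WriteLine(ShutdownFilterSketch.Apply(new ErrorEvent("tx-1", "cancelled")) is null);    // True
        Console.WriteLine(ShutdownFilterSketch.Apply(new ErrorEvent("tx-2", "real failure")) is null); // False
    }
}
```

Because the filter only consults the ID set at the time an event passes through it, registering early costs nothing while the set is empty, which is consistent with the commit's choice to register at construction time.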
