fix: High idle CPU in RDS/Terminal Server environments (SCARD_E_INVALID_HANDLE)#445
Conversation
…ard listener (#434) When an RDS/Terminal Server session is disconnected, the Windows Smart Card Service invalidates existing SCARDCONTEXT handles. DesktopSmartCardDeviceListener called SCardGetStatusChange in a tight loop with the stale handle — WinSCard internally raises a C++ exception (CxxThrowException) for each call, pegging a CPU core. Fix: - Add SCARD_E_INVALID_HANDLE, SCARD_E_SYSTEM_CANCELLED, ERROR_BROKEN_PIPE to the UpdateContextIfNonCritical recovery switch - Add Thread.Sleep(1000) backoff after recovery to prevent secondary tight loop - Guard UpdateCurrentContext against failed SCardEstablishContext - Extract ISCardInterop interface enabling cross-platform unit tests without hardware Tests: - DesktopSmartCardDeviceListenerSCardErrorTests: cross-platform mock tests (Track B) - DesktopSmartCardDeviceListenerWindowsTests: Windows CPU regression test (Track A) measuring TotalProcessorTime before/after handle invalidation via real WinSCard Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes a high-idle-CPU regression in DesktopSmartCardDeviceListener seen in RDS/Terminal Server reconnect scenarios by handling invalidated smart-card contexts, adding recovery backoff, and improving testability via an injectable SCard interop abstraction.
Changes:
- Extracts the WinSCard/PCSC calls behind
ISCardInteropwith a productionSCardInteropimplementation to enable deterministic unit testing. - Extends non-critical error recovery to include
SCARD_E_INVALID_HANDLE,SCARD_E_SYSTEM_CANCELLED, andERROR_BROKEN_PIPE, and adds a 1s recovery backoff to prevent tight retry loops. - Adds cross-platform mock tests and Windows-only integration/regression tests (including CPU consumption measurement) covering the RDS invalid-handle scenario.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
Yubico.Core/src/Yubico/Core/Devices/SmartCard/DesktopSmartCardDeviceListener.cs |
Adds injectable SCard interop, expands recoverable error handling, and introduces recovery backoff + safer context refresh logic. |
Yubico.Core/src/Yubico/PlatformInterop/Desktop/SCard/ISCardInterop.cs |
Introduces an internal abstraction for the subset of SCard APIs the listener uses. |
Yubico.Core/src/Yubico/PlatformInterop/Desktop/SCard/SCardInterop.cs |
Provides the production ISCardInterop implementation delegating to NativeMethods. |
Yubico.Core/tests/Yubico/Core/Devices/SmartCard/DesktopSmartCardDeviceListenerSCardErrorTests.cs |
Adds cross-platform mock-based tests to validate recovery and throttling behavior. |
Yubico.Core/tests/Yubico/Core/Devices/SmartCard/DesktopSmartCardDeviceListenerWindowsTests.cs |
Adds Windows-only tests that invalidate a real context and validate CPU + recovery + disposal behavior. |
fix-rds-scard-invalid-handle.md |
Documents root cause analysis, fix details, and how to run/validate tests. |
…try/finally leak fix - Add ArgumentNullException guard for injected ISCardInterop in internal constructor - Check SCardReleaseContext return value in test helper and Skip on failure - Wrap disposal test listener in try/finally to prevent resource leak on Skip Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… CPM override - Replace Assert.NotEqual(..., string) with Assert.True(..., string) — xunit 2.9.3 has no string-message overload for NotEqual (that's xunit 3.x only) - Add Directory.Packages.props to .gitignore (local-only workaround that prevents a parent project's CPM config from breaking restore in this branch) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Test Results: Windows 2 files 2 suites 42s ⏱️ Results for commit c60b66d. ♻️ This comment has been updated with latest results. |
Test Results: Ubuntu 2 files 2 suites 1m 12s ⏱️ Results for commit c60b66d. ♻️ This comment has been updated with latest results. |
Test Results: MacOS 4 files 4 suites 56s ⏱️ Results for commit c60b66d. ♻️ This comment has been updated with latest results. |
- Short-circuit CheckForUpdates after context recovery at all three GetStatusChange call sites to prevent stale newStates from overwriting the freshly refreshed _readerStates (Copilot review finding). - Format SCARD_READER_STATE[] in diagnostic logging so individual reader entries are printed instead of the array type name. - Return distinct non-zero handles from FakeSCardInterop.EstablishContext on success, matching real WinSCard behavior. - Extract DRY helper for context-reestablishment test assertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…for CI Dispose the old SCARDCONTEXT before calling EstablishContext to prevent Windows from recycling the freed handle value for the new context. The previous order (establish-then-dispose) caused the old SafeHandle's Dispose to release the newly established context via SCardReleaseContext, leaving the listener with a dead handle after recovery. Replace the fragile IntPtr comparison in the recovery test with an IsClosed check — handle value recycling is legitimate OS behavior. Raise CPU test threshold from 500ms to 1500ms to tolerate concurrent test activity in CI while still catching the unfixed behavior (≥2500ms). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ive tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Status was set to Error in ListenForReaderChanges' catch (Exception) block but never reset, so callers polling Status saw stale Error after the listener had recovered. CheckForUpdates now restores Started on its successful exit path (after FireEvents), reflecting live health. Adds WhenPollSucceedsAfterManagedException_StatusResetsToStarted, which extends FakeSCardInterop with a throw-once mode to exercise the catch path followed by a recovering poll. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
An unexpected managed exception from CheckForUpdates re-enters the while (_isListening) loop with no delay. This is the only un-throttled path in the listener after PR #445. Add Thread.Sleep(RecoveryBackoffDelay) after setting Status = Error in the catch block to prevent a tight retry loop if the same exception recurs. Add test WhenCatchBlockTriggers_LoopThrottlesBeforeRetry that verifies GetStatusChange call count stays below 5 in 600ms when every poll throws. Pre-Step-2, the count would be hundreds due to immediate retry. Adjust WhenPollSucceedsAfterManagedException_StatusResetsToStarted to sleep 1500ms instead of 500ms to account for the new 1000ms sleep after the thrown exception. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thread.Sleep(RecoveryBackoffDelay) blocks Dispose for up to 1 second per active wait site. _scard.Cancel(_context) only wakes a blocked syscall, not a sleeping thread. Add ManualResetEventSlim _stopRequested field. Signal it in StopListening before joining the listener thread. Replace all three Thread.Sleep(RecoveryBackoffDelay) calls with _stopRequested.Wait(RecoveryBackoffDelay) so StopListening can wake them immediately. Dispose _stopRequested after _context.Dispose in Dispose(bool disposing). The listener thread has already been joined at that point, so no one is in Wait. Add test WhenDisposeCalledDuringRecoveryWait_DisposeReturnsQuickly that verifies Dispose completes in under 200ms when the listener is in a recovery wait. Pre-Step-3, Dispose would block on the full 1000ms sleep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Today's recovery path sleeps a fixed 1s. If WinSCard / Smart Card Service stays broken for minutes, the listener still polls every second. Exponential backoff with a cap gives up CPU more aggressively without losing eventual recovery. Add MaxRecoveryBackoffDelay (30 seconds) and _consecutiveRecoveryAttempts field. Add internal static CalculateRecoveryBackoff method that doubles the base delay for each consecutive attempt, capped at 30 seconds. At each of the three recovery wait sites (catch block in ListenForReaderChanges, default-error path in HandleSCardGetStatusChangeResult, recovery path in UpdateContextIfNonCritical), call CalculateRecoveryBackoff with _consecutiveRecoveryAttempts and increment the counter after the wait. Reset _consecutiveRecoveryAttempts to 0 on successful poll in CheckForUpdates (unconditionally, regardless of Status) so a successful poll always resets the backoff window. Add CalculateRecoveryBackoff_DoublesUntilCap Theory test that verifies the exponential backoff math (0 -> 1s, 1 -> 2s, 2 -> 4s, ..., 5+ -> 30s cap, handles overflow safely). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Performance benchmark comparing pre-#445 (develop) listener behavior against the current #445 + Steps 1-4 implementation under persistent SCARD_E_INVALID_HANDLE failures. Files added: - Yubico.Core/perf/Yubico.Core.Performance.csproj (BDN benchmark exe) - Yubico.Core/perf/Program.cs (BDN entry point) - Yubico.Core/perf/Benchmarks/SmartCardListenerInvalidHandleBenchmark.cs - Yubico.Core/perf/Mocks/AlwaysInvalidHandleScardInterop.cs (test mock) - Yubico.Core/perf/Listeners/LegacyDesktopSmartCardDeviceListener.cs (simplified snapshot preserving the pre-#445 busy-loop bug) - Yubico.NET.SDK.Performance.sln (separate solution for perf projects) - Plans/perf/scard-listener-invalid-handle.md (benchmark report) Modified: - Yubico.Core/src/Yubico.Core.csproj Added InternalsVisibleTo for Yubico.Core.Performance (strong-named). Key result (observed on Apple M1, .NET 8.0.13): - Legacy (develop): 134,789,691 invocations/sec (busy spin) - Fixed (#445): 1 invocation/sec (backoff working) - Ratio: ~135,000,000× reduction This exceeds the >= 1,000× acceptance criterion by ~135,000×, proving that PR #445's recovery wiring eliminates the production busy-loop bug reported in YubiPCS 1.9.1.2 logs. Implementation notes: - Used simplified LegacyDesktopSmartCardDeviceListener (~120 lines) instead of full 513-line develop port per brief fallback guidance. - Both assemblies strong-name signed with Yubico.NET.SDK.snk. - Perf project suppresses CA1707, CA1861, IDE0058, IDE0044 analyzers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add SCardApiCall(..., knownRecoverable) overload that downgrades failures to LogDebug when the error is expected in recovery paths (e.g. EstablishContext and ListReaders failures during UpdateCurrentContext after an RDS disconnect). UpdateCurrentContext recovery-path calls now pass knownRecoverable=true to prevent flooding production logs with error-level entries during prolonged Smart Card Service downtime. Unknown-by-definition errors (default branch in HandleSCardGetStatusChangeResult) remain at Error severity. Add SmartCardLoggerExtensionsTests with 4 unit tests asserting correct severity selection for both overloads. Resolves the logger-spam concern from PR #445 review (Plans/i-need-your-expert-deep-crane.md Step 6). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
oskardanielsson-yubico
left a comment
There was a problem hiding this comment.
All new files (IDK what the practice is for updated files) should be dated as 2026, other than that it looks good
Update: Follow-up stack verified on WindowsUpon merging #460, we will get:
All 4,058 unit tests pass on Windows (483 Core + 3,575 YubiKey), including the three |
After GetStatusChange() returns SCARD_E_TIMEOUT (the normal polling result when nothing has changed), immediately return from the poll iteration. Previously, the code continued processing, causing spurious arrival/removal events to fire every ~3 seconds by triggering RelevantChangesDetected() checks on stale state data. Addresses feedback from internal testing: timeout results now short-circuit without state processing, preventing the listener from reprocessing devices and firing false events. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Commit 7864e009 short-circuited CheckForUpdates on SCARD_E_TIMEOUT to prevent spurious arrival/removal events, but the new return path bypassed the success-path resets at the end of the method: - _consecutiveRecoveryAttempts = 0 (Step 4: exponential backoff counter) - Status = Started when previously Error (Step 1: latent bug fix) A SCARD_E_TIMEOUT proves the syscall path is healthy (the timeout fired naturally, no error), so semantically it is a successful poll outcome and should reset the same health signals as SCARD_S_SUCCESS. This regressed WhenPollSucceedsAfterManagedException_StatusResetsToStarted, which scheduled TIMEOUT as the post-recovery default and asserted Status returned to Started. After 7864e009 the assertion failed because the early return skipped the reset. Restore both resets at the early-return site, mirroring the success-path block. Add two regression tests: - WhenGetStatusChangeReturnsTimeout_NoArrivalOrRemovalEventsFire pins the user-visible behaviour (no spurious events from quiet polls) - WhenPollTimesOutAfterManagedException_StatusResetsToStarted documents that TIMEOUT counts as healthy for recovery accounting 475/475 pass on macOS (was 473 + 2 new = 475). Build clean.
…es no events Internal user reported (PR #460 thread) that the spurious arrival/removal events occurred every ~3 s even after the recovery-hardening stack landed. The two existing tests added in cad24551 pin Haiku's specific SCARD_E_TIMEOUT short-circuit but do not exercise the deeper invariant: a SCARD_S_SUCCESS poll where only SCARD_STATE.CHANGED toggles on the synthetic PnP reader, with no underlying reader-list delta, must produce zero arrival/removal events. Extend FakeSCardInterop with an optional `stateApplier` callback that mutates the post-probe `states` array in place — mirrors how WinSCard / pcscd populate _eventState during a successful poll. SCARD_READER_STATE's backing fields are private (populated by P/Invoke), so the test helper uses reflection to set _eventState; this is contained to a single helper and avoids polluting the production struct with a test-only seam. Honest scope note documented in the test comment: the user's exact path (PRESENT-bit flip-flop on a real reader entry) requires multi-phase ListReaders responses and remains an integration-level scenario. This test pins the simpler invariant on the same code branch (ReaderListChangeDetected → inner while loop), which is the regression class that produced the user's symptom. 476/476 Yubico.Core unit tests pass on macOS (was 475 + 1 new = 476). Build clean.
…n gate Introduces .github/workflows/perf-regression.yml to run the existing SmartCardListenerInvalidHandleBenchmark as a regression gate on PRs targeting develop or main that modify perf code or SmartCard sources. Key features: - Triggers on changes to Yubico.Core/perf/** or SmartCard/** paths - Runs on windows-latest (matches primary perf environment) - Builds Yubico.NET.SDK.Performance.sln and executes BDN with JSON export - PowerShell script parses console output to extract invocation counts - Enforces >= 1000x ratio requirement (Legacy/Fixed syscall reduction) - Uploads BenchmarkDotNet.Artifacts/ for inspection on failure - Supports manual workflow_dispatch triggers Decision rationale: - Console output parsing chosen over in-benchmark assertions for simpler failure diagnosis and artifact inspection - Single-OS (Windows) for v1 is acceptable; can expand to matrix later - Path filters limit CI cost to relevant changes only Caveats: - BDN runs ~12 seconds (warmup + 5 iterations × 2 benchmarks) - GitHub-hosted Windows runners have variable performance; threshold has 135,000x safety margin based on local M1 results - No special runner constraints or self-hosted runners required Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause: actions/setup-dotnet v5 changed behavior - source-url parameter only configures auth but does NOT register a named source for --source "github". The publish-internal job was using --source "github" without ever running `dotnet nuget add source --name github`, causing "invalid source" error. Fix: Mirror the proven working pattern from build-nativeshims.yml line 518 - explicitly register the "github" named source before push. Regression introduced in: 2aa9c06 (setup-dotnet v4 → v5.2.0 upgrade) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make Status property and _isListening field volatile for cross-thread visibility. Log successful context re-establishment at Debug level. Add tests for the 3 remaining recoverable error codes and the unknown- error backoff path. Document benchmark CI format contract and test timing strategy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix: follow-ups on PR #445: harden SCard listener recovery
|
@claude Troubleshoot the failed run of this PR: https://github.com/Yubico/Yubico.NET.SDK/actions/runs/24985228990/job/73156722360?pr=445 |
|
Claude finished @DennisDyallo's task in 2m 22s —— View job Troubleshooting Failed CI Run
Root CauseThe "Parse benchmark results and verify thresholds" step (step 9) fails immediately with: The workflow script looks for The file that was produced (confirmed in the run log, line 356): The filter in the workflow script ( $resultsFile = Get-ChildItem -Path "BenchmarkDotNet.Artifacts/results" -Filter "*-report-full.json" -RecurseThis is a naming mismatch. The Fix OptionsThere are two equivalent single-line fixes: Option A — Change the exporter flag (line 62 of - dotnet run ... -- --filter '*SmartCardListenerInvalidHandleBenchmark*' --exporters json
+ dotnet run ... -- --filter '*SmartCardListenerInvalidHandleBenchmark*' --exporters fullJson
Option B — Change the filter (line 72 of - $resultsFile = Get-ChildItem -Path "BenchmarkDotNet.Artifacts/results" -Filter "*-report-full.json" -Recurse |
+ $resultsFile = Get-ChildItem -Path "BenchmarkDotNet.Artifacts/results" -Filter "*-report-full-compressed.json" -Recurse |Option A is slightly simpler since it's one flag swap and the rest of the script stays valid. Secondary NoteOnce the JSON lookup is fixed, the log-file parsing logic will work correctly. The benchmark The PowerShell regex
|
Bumps copyright on newly-created files, clarifies UpdateCurrentContext docs, and tightens the perf-regression workflow. Changes: - Copyright 2025 -> 2026 on 9 newly-created files (per @oskardanielsson-yubico) - DesktopSmartCardDeviceListener.UpdateCurrentContext: rewrite XML remarks to match actual behavior (old context disposed first to prevent handle recycling; on failure _context is set to invalid handle so the next poll re-enters recovery). The previous remarks described the pre-#460 logic. (Copilot review thread #3146001266) - .github/workflows/perf-regression.yml: * Drop unused 'pull-requests: write' permission (Copilot #3146001299) * Switch BenchmarkDotNet exporter from 'json' to 'fullJson' so the *-report-full.json filter on line 71 matches what BDN produces. The 'json' exporter writes *-report-full-compressed.json, which caused the workflow to fail the parse step (CI run 24985228990). * Cast invocation counters to [long] instead of [int] to avoid overflow on fast runners (Copilot #3146001329). Comments addressed in earlier commits (replied on threads): - #3023191052 null-guard, #3023191093 SCardReleaseContext check, - #3023191111 try/finally leak fix (commit be88315) - #3024468048, #3029767235 CollectionDefinition (commit e9f0373) - #3024468075 stale _readerStates after recovery (commit a9964fe) - #3024468093 reader-state log formatting (commit a9964fe) - #3024468114 EstablishContext returns non-zero handle (commit a9964fe) Comments declined / deferred: - #3028816045 (mock SCardContext native release) - conflicts with #3024468114; CI green on Windows/Ubuntu/macOS proves no harm. - #3029767257 (PnP literal consolidation) - the two literals serve distinct purposes (reader list vs PnP probe). Defer to a follow-up hardening pass once canonical names are verified against MSDN/pcsc-lite. Verified: - dotnet build Yubico.Core: 0 errors, 0 warnings - Yubico.Core unit tests: 480 passed, 0 failed, 21 skipped (Windows-only)
Summary
Closes #434. Also resolves #37 (same root cause, reported 2023-02-13 — the
SCARD_E_INVALID_HANDLEgap inUpdateContextIfNonCriticalsurvived every listener refactor since then).When an RDS/Windows 365/Terminal Server session is disconnected and reconnected, the Windows Smart Card Service invalidates existing
SCARDCONTEXThandles.DesktopSmartCardDeviceListenerwas callingSCardGetStatusChangein a tight loop with the stale handle.WinSCard.dllinternally raises and unwinds a C++ exception (CxxThrowException) for every such call — confirmed in 6 of 10 minidumps. This ran thousands of times per second, pegging one CPU core.The error code returned to managed code (
SCARD_E_INVALID_HANDLE = 0x80100003) was not handled byUpdateContextIfNonCritical. The code logged the error and immediately retried with no backoff and no context re-establishment.Changes
ISCardInteropinterface +SCardInteropconcrete class — extracts the four SCard P/Invoke calls behind an injectable interface, enabling cross-platform unit tests without hardware or WindowsUpdateContextIfNonCritical— addsSCARD_E_INVALID_HANDLE,SCARD_E_SYSTEM_CANCELLED,ERROR_BROKEN_PIPEto the recovery switch (all three surface in RDS session transitions)Thread.Sleep(1000)backoff — applied after every recovery attempt to prevent a secondary tight loop ifSCardEstablishContextalso fails while the service is transitioningUpdateCurrentContextfix — reordered to dispose the old context BEFORE callingEstablishContext, preventing a handle-value recycling bug where Windows could reuse the freed IntPtr value for the new context, and the subsequent oldSafeHandle.Dispose()would release the new context viaSCardReleaseContextIsClosedcheck, since handle value recycling is legitimate OS behaviorTest plan
Cross-platform mock tests (run on any CI platform, no hardware needed)
dotnet test Yubico.Core\tests\Yubico.Core.UnitTests.csproj \ --filter "FullyQualifiedName~DesktopSmartCardDeviceListenerSCardErrorTests" \ --logger "console;verbosity=detailed"WhenGetStatusChangeReturnsInvalidHandle_ContextIsReestablished— fails before fix, passes afterWhenGetStatusChangeAlwaysReturnsInvalidHandle_LoopDoesNotSpin— proves no tight loop (< 15 calls in 600ms)WhenGetStatusChangeReturnsSystemCancelled_ContextIsReestablishedWhenGetStatusChangeReturnsBrokenPipe_ContextIsReestablishedWhenContextReestablishmentFails_ListenerContinuesWithoutCrashingWindows CPU regression test (requires Windows with SCardSvr running — no reader needed)
dotnet test Yubico.Core\tests\Yubico.Core.UnitTests.csproj \ --filter "Category=WindowsOnly" \ --logger "console;verbosity=detailed"RealWinSCard_WhenHandleInvalidated_CpuDoesNotSpike— measuresProcess.TotalProcessorTimeover 3s after handle invalidation. Before fix: ≈2500–3000ms → FAIL. After fix: ≈30–100ms → PASS. Threshold: 1500ms (CI-safe).RealWinSCard_WhenHandleInvalidated_NewContextIsEstablished— verifies functional recovery (valid, non-closed handle; listener still in Started status)RealWinSCard_WhenHandleInvalidatedThenDisposed_DisposalCompletesCleanly— verifies Dispose completes within 5s after recoveryFull unit test suite
All 487 tests pass (468 passed, 19 skipped platform-specific, 0 failed).
🤖 Generated with Claude Code