diff --git a/docs/docs/operator/advanced-configuration.mdx b/docs/docs/operator/advanced-configuration.mdx index 60f4763c..2b19a641 100644 --- a/docs/docs/operator/advanced-configuration.mdx +++ b/docs/docs/operator/advanced-configuration.mdx @@ -129,6 +129,23 @@ builder.Services retryPeriod: TimeSpan.FromSeconds(2))); ``` +### Behavior on Leadership Loss + +With `LeaderElectionType.Single`, both the resource watcher **and** the reconciliation queue are gated on leadership. When an instance loses its lease it: + +1. **suspends the queue's intake** — any further enqueue is dropped, so a still-running reconciler's `RequeueAfter` and error retries cannot leave work behind; +2. **cancels** the queue processing loop and any in-flight reconciliation; and +3. **clears the queue** — pending and scheduled entries accumulated by the former leader are discarded. + +While the instance is not the leader the intake stays closed. On re-acquiring leadership the intake reopens and the watcher re-lists the current state, so processing resumes from a clean slate. + +**What this does and does not guarantee.** The gate protects the *queue* side: no stale entry, timed requeue, or error retry from a former leader survives a leadership transition. It does **not** terminate the process (unlike controller-runtime), so it cannot stop a **non-cooperative** reconciler that ignores its `CancellationToken` from performing external side effects while it was the leader. Therefore: + +- **Reconcilers must honor cancellation** and return promptly once their token is cancelled. +- **Reconcilers must be idempotent.** As a second line of defense, concurrent writes to the same object are serialized by the API server via `resourceVersion`; a stale write fails with HTTP 409 Conflict. + +See the [Kubernetes leases documentation](https://kubernetes.io/docs/concepts/architecture/leases/) for leader-election semantics. + ### Custom Leader Election The `Custom` leader election type allows you to implement your own coordination logic, such as namespace-based leader election. @@ -147,6 +164,7 @@ public sealed class NamespacedLeaderElectionResourceWatcher( ITimedEntityQueue entityQueue, OperatorSettings settings, IEntityLabelSelector labelSelector, + IEntityFieldSelector fieldSelector, IKubernetesClient client, INamespaceLeadershipManager namespaceLeadershipManager) : ResourceWatcher( @@ -156,6 +174,7 @@ public sealed class NamespacedLeaderElectionResourceWatcher( entityQueue, settings, labelSelector, + fieldSelector, client) where TEntity : IKubernetesObject { @@ -222,16 +241,27 @@ public class NamespaceLeadershipManager : INamespaceLeadershipManager } ``` -**Step 3: Register the custom watcher** +**Step 3: Register the controller, the custom watcher, and a queue consumer** + +`LeaderElectionType.Custom` intentionally registers **no** watcher and **no** queue consumer — with custom coordination the SDK cannot know how leadership should drive either. You therefore wire up the full pipeline yourself: `AddController` registers the reconciler and the entity queue, and you add the custom watcher plus a queue consumer as hosted services. A watcher without a consumer would enqueue events that nothing ever drains, so the consumer is required. + +Enabling `WithRegistrationValidation()` makes host startup fail fast with an `InvalidRegistrationException` if a piece is missing, instead of starting an operator that silently processes nothing. ```csharp builder.Services .AddKubernetesOperator(settings => settings - .WithLeaderElection(LeaderElectionType.Custom)) + .WithLeaderElection(LeaderElectionType.Custom) + .WithRegistrationValidation()) + .AddController(); + +builder.Services .AddSingleton() - .AddHostedService>(); + .AddHostedService>() + .AddHostedService>(); ``` +The plain `EntityQueueBackgroundService` is the right consumer here: leadership is handled inside your watcher, so the queue does not need the leadership gate that `LeaderElectionType.Single` adds. + ### Benefits of Custom Leader Election - **Horizontal Scaling**: Multiple instances can process different subsets of resources @@ -239,6 +269,60 @@ builder.Services - **Geographic Distribution**: Route requests to instances in specific regions - **Load Balancing**: Distribute work across multiple instances +## Registration Validation + +The operator can verify, on host startup, that its service registrations are complete and consistent with +the configured `LeaderElectionType` and `QueueStrategy`. This is **disabled by default**; enable it with: + +```csharp +builder.Services.AddKubernetesOperator(settings => settings + .WithRegistrationValidation()); +``` + +When enabled, every managed entity is checked for the components its configuration requires — an +`IReconciler`, a resource watcher, an `ITimedEntityQueue`, and a queue consumer. If +anything is missing, host startup aborts with an `InvalidRegistrationException` that lists the gaps — +turning a silent misconfiguration (an operator that starts but processes nothing) into an immediate, +actionable error. + +Most registrations are wired up by the SDK automatically, so validation simply passes. It is most useful +when you register components yourself — for example with `LeaderElectionType.Custom`, where the watcher +and queue consumer are your responsibility — and a forgotten registration would otherwise go unnoticed. +It also: + +- flags entities that have a finalizer but no controller, since finalizers run as part of reconciliation + and need a controller to execute; and +- with `LeaderElectionType.Single`, requires leadership-loss protection to actually be wired up: the + registered queue must implement `ISuspendableEntityQueue` (when its implementation type can be inspected) + **and** the consumer must implement `ILeaderAwareEntityQueueConsumer` (i.e. actually drive the + gate). The built-in `TimedEntityQueue` and `LeaderAwareEntityQueueBackgroundService` + already do. + +:::note +Components are recognized by the SDK types they derive from / implement: the watcher from +`ResourceWatcher`, the queue consumer from `IEntityQueueConsumer` (or, under `Single`, +`ILeaderAwareEntityQueueConsumer`), and the queue's leadership gate from `ISuspendableEntityQueue`. +`EntityQueueBackgroundService` / `LeaderAwareEntityQueueBackgroundService` implement +these. + +Recognition is by **registered implementation type**, so components must be registered with a concrete +type (`AddHostedService()`, `AddSingleton()`). Open-generic registrations such +as `AddSingleton(typeof(ITimedEntityQueue<>), typeof(MyQueue<>))` are understood (the implementation is +closed to the entity type). A component registered through a DI factory delegate exposes no type: a +factory-registered watcher or consumer is reported as missing, and under `Single` a factory-registered +queue is reported as unverifiable (its `ISuspendableEntityQueue` capability cannot be checked). Register +these concretely, or keep validation disabled. Keyed registrations are ignored — the watcher, reconciler +and consumer take their dependencies unkeyed, so a keyed registration would not satisfy them at runtime. +::: + +:::info[Scope of the check] +Validation confirms that the **registrations** required by your configuration are present and SDK-conform — +that an `ITimedEntityQueue` implements `ISuspendableEntityQueue` and the consumer implements +`ILeaderAwareEntityQueueConsumer` under `Single`. It does **not** prove those components behave +correctly — for example that a custom consumer actually calls `SuspendIntake()`/`Clear()`/`ResumeIntake()` +on leadership transitions. Treat it as a wiring check, not a behavioral guarantee. +::: + ## Queue Strategy KubeOps uses an internal queue to schedule and dispatch reconciliation work. The queue buffers events @@ -253,9 +337,16 @@ builder.Services // QueueStrategy defaults to QueueStrategy.InMemory ``` -With `InMemory`, KubeOps registers `TimedEntityQueue` as `ITimedEntityQueue` and -starts `EntityQueueBackgroundService` as a hosted service. All scheduling state lives in -the operator pod's heap. +With `InMemory`, KubeOps registers `TimedEntityQueue` as `ITimedEntityQueue` and starts a +queue-consumer hosted service. Which consumer is started depends on the configured `LeaderElectionType`: + +- `None`: `EntityQueueBackgroundService`. +- `Single`: `LeaderAwareEntityQueueBackgroundService`, which drives the queue's leadership gate + (suspends intake and clears the queue on leadership loss). This is what provides the leadership-loss + protection. +- `Custom`: no consumer is registered — you supply one yourself (see the Custom leader election example above). + +All scheduling state lives in the operator pod's heap. When the same entity is scheduled multiple times before it becomes ready, the in-memory queue keeps one entry per namespace/name key. The entry scheduled to run earliest is kept, and later enqueue calls @@ -279,18 +370,72 @@ builder.Services // Register one ITimedEntityQueue per entity type builder.Services.AddSingleton, MyCustomEntityQueue>(); -// Register one background service per entity type that drains the queue +// Register one background service per entity type that drains the queue. Implement +// IEntityQueueConsumer (or derive from EntityQueueBackgroundService) so that +// registration validation can recognize it as the queue consumer. builder.Services.AddHostedService>(); ``` :::note Register one `ITimedEntityQueue` implementation and one corresponding background service -for each entity type that should use the custom queue. +for each entity type that should use the custom queue. The consumer should implement +`IEntityQueueConsumer` (the built-in `EntityQueueBackgroundService` already does) — it +is not required at runtime, but [registration validation](#registration-validation) uses it to confirm a +consumer exists. With `LeaderElectionType.Single`, implement `ILeaderAwareEntityQueueConsumer` +instead (or derive from `LeaderAwareEntityQueueBackgroundService`) so validation can confirm the +consumer drives the leadership gate. ::: Custom queues should follow the same de-duplication contract as the built-in implementation: keep at -most one pending entry per namespace/name key, preserve the earliest scheduled execution time, and do -not expose cancellation/removal as part of `ITimedEntityQueue`. +most one pending entry per namespace/name key and preserve the earliest scheduled execution time. + +`Enqueue` returns `Task`: return `true` when the entry was scheduled and `false` when it was +dropped (for example because the `cancellationToken` was already cancelled, or — for a queue that +implements `ISuspendableEntityQueue` — because intake is currently suspended). The reconciler and the +queue consumer rely on this to avoid counting requeue metrics for work that was not actually scheduled, +so returning the correct value keeps those metrics accurate. + +```csharp +public Task Enqueue( + V1DemoEntity entity, + ReconciliationType type, + ReconciliationTriggerSource triggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) +{ + if (cancellationToken.IsCancellationRequested) + { + return Task.FromResult(false); // dropped + } + + // ... schedule the entry (de-duplicated by namespace/name) ... + return Task.FromResult(true); // scheduled +} +``` + +:::warning[Leadership-loss semantics] +The SDK's built-in queue discards a former leader's work on a leadership transition (see +[Behavior on Leadership Loss](#behavior-on-leadership-loss)). It does so through the optional +`ISuspendableEntityQueue` capability — `SuspendIntake()`, `Clear()`, `ResumeIntake()` — which the built-in +`TimedEntityQueue` implements. With `QueueStrategy.Custom` **both** the queue **and** its +background consumer are yours; the SDK does not wire up a leader-aware consumer for you. + +So if you combine `QueueStrategy.Custom` with `LeaderElectionType.Single` and want the same guarantee +(no stale entries, timed requeues, or error retries surviving a leadership transition), you must: + +- have your queue implement **`ISuspendableEntityQueue`** so that `SuspendIntake()` drops subsequent + enqueues, `Clear()` discards pending and scheduled entries, and the suspend/clear/enqueue paths are + mutually atomic; **and** +- have your custom background consumer call `SuspendIntake()` + `Clear()` when leadership is lost and + `ResumeIntake()` when it is (re)acquired — or simply derive from / reuse the SDK's public + `LeaderAwareEntityQueueBackgroundService`, which already does this. + +`ISuspendableEntityQueue` is **opt-in** and separate from `ITimedEntityQueue`: a queue that does +not implement it keeps working but provides no leadership-loss protection, and the leader-aware consumer +logs a warning to make that explicit. If you do not need the guarantee (for example with +`LeaderElectionType.None`), you can ignore the capability entirely. +::: Use this strategy when you need scheduling behaviour that differs from the built-in timer-based queue — for example, priority queues or implementations that integrate with your own observability @@ -414,8 +559,9 @@ KubeOps makes this straightforward by exposing the `EntityQueue` delega any DI-registered component to enqueue an entity for reconciliation on demand. ```csharp -// Signature (fire-and-forget; enqueue is synchronous) -public delegate void EntityQueue( +// Signature — awaitable. The result is true if the entity was scheduled, or false if the enqueue was +// dropped (for example because the queue's intake is suspended during a leadership transition). +public delegate Task EntityQueue( TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, @@ -425,6 +571,8 @@ public delegate void EntityQueue( where TEntity : IKubernetesObject; ``` +Always `await` the delegate and, where it matters, react to a `false` result (a dropped enqueue). + Always pass `ReconciliationTriggerSource.Operator` for externally triggered reconciliations. `ReconciliationTriggerSource.ApiServer` is reserved for events originating from the Kubernetes watch stream. @@ -448,7 +596,7 @@ app.MapPost("/reconcile/{namespace}/{name}", async ( return Results.NotFound(); } - entityQueue( + var scheduled = await entityQueue( entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, @@ -456,7 +604,10 @@ app.MapPost("/reconcile/{namespace}/{name}", async ( retryCount: 0, cancellationToken); - return Results.Accepted(); + // false means the enqueue was dropped (e.g. this instance is not the current leader). + return scheduled + ? Results.Accepted() + : Results.StatusCode(StatusCodes.Status503ServiceUnavailable); }); ``` @@ -918,4 +1069,4 @@ To completely replace how events are published, add your own `IEventPublisherFac - Check queue permissions and quotas when using an external message broker as a reconciliation trigger - Monitor message processing errors in external trigger consumers - Ensure entities still exist before enqueuing — the Kubernetes API is the authoritative source of truth -- If externally triggered reconciliations are not running, confirm the background service is registered and started correctly +- If externally triggered reconciliations are not running, confirm the background service is registered and started correctly \ No newline at end of file diff --git a/docs/docs/operator/building-blocks/controllers.mdx b/docs/docs/operator/building-blocks/controllers.mdx index 66b6d9f1..c29b2c51 100644 --- a/docs/docs/operator/building-blocks/controllers.mdx +++ b/docs/docs/operator/building-blocks/controllers.mdx @@ -194,8 +194,9 @@ public class V1DemoEntityController( V1DemoEntity entity, CancellationToken cancellationToken) { - // Schedule a follow-up reconciliation in 30 seconds without returning a result - queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, + // Schedule a follow-up reconciliation in 30 seconds without returning a result. + // The delegate returns Task; await it (false means the enqueue was dropped). + await queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromSeconds(30), retryCount: 0, cancellationToken); return ReconciliationResult.Success(entity); diff --git a/docs/docs/operator/caching.mdx b/docs/docs/operator/caching.mdx index 0ede24e2..e890068e 100644 --- a/docs/docs/operator/caching.mdx +++ b/docs/docs/operator/caching.mdx @@ -82,6 +82,18 @@ provided that configures the **active** cache (whichever strategy is selected). caches individually, register named FusionCache instances directly on the DI container using the FusionCache builder API. +:::warning[Tagging must stay enabled] +The resource watcher tags every deduplication entry with its entity type so that a leadership-aware +watcher can, on leadership loss, drop only that type's entries via FusionCache's `RemoveByTag` (without +wiping the dedup tokens of other entity types that share the same named cache). FusionCache tagging is +enabled by default — if you supply a custom cache configuration, do **not** disable it +(`FusionCacheOptions.DisableTagging`), otherwise the tagged writes throw at runtime. + +If you enable `OperatorSettings.ValidateRegistrations`, this is checked at startup: a cache with tagging +disabled aborts host startup with a clear `InvalidRegistrationException` instead of failing later at +runtime. +::: + **Example: L2 cache for the default `ByGeneration` strategy** ```csharp diff --git a/examples/OtelOperator/Program.cs b/examples/OtelOperator/Program.cs index 3f92b516..7937db0a 100644 --- a/examples/OtelOperator/Program.cs +++ b/examples/OtelOperator/Program.cs @@ -7,8 +7,6 @@ using KubeOps.Operator.Web.Builder; using OpenTelemetry; -using OpenTelemetry.Metrics; -using OpenTelemetry.Trace; const string operatorName = "otel-operator"; diff --git a/src/KubeOps.Abstractions/Builder/OperatorSettings.cs b/src/KubeOps.Abstractions/Builder/OperatorSettings.cs index e1db2524..1e0c518f 100644 --- a/src/KubeOps.Abstractions/Builder/OperatorSettings.cs +++ b/src/KubeOps.Abstractions/Builder/OperatorSettings.cs @@ -128,4 +128,18 @@ public sealed record OperatorSettings /// scrape the metrics, register an OpenTelemetry exporter for the meter named . /// public required bool EnableMetrics { get; init; } + + /// + /// Indicates whether the operator validates, on host startup, that its dependency injection + /// registrations are complete and consistent with the configured + /// and . Disabled by default. + /// + /// + /// When enabled, the operator verifies — for every managed entity — that the components implied by + /// the configuration are registered. If anything is missing, host startup aborts with an + /// InvalidRegistrationException listing the gaps. This catches registration mistakes + /// (for example a forgotten watcher or queue consumer in a manually wired setup) that would + /// otherwise let the operator start without processing any resources. + /// + public bool ValidateRegistrations { get; init; } } diff --git a/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilder.cs b/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilder.cs index 369bdfc4..575d66cc 100644 --- a/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilder.cs +++ b/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilder.cs @@ -113,6 +113,13 @@ public sealed partial class OperatorSettingsBuilder /// public bool EnableMetrics { get; set; } = true; + /// + /// Indicates whether the operator validates, on host startup, that its dependency injection + /// registrations are complete and consistent with the configuration. Disabled by default. See + /// for details. + /// + public bool ValidateRegistrations { get; set; } + /// /// Produces an immutable record from the current configuration. /// @@ -132,6 +139,7 @@ public sealed partial class OperatorSettingsBuilder ReconcileStrategy = ReconcileStrategy, ParallelReconciliation = ParallelReconciliation.Build(), EnableMetrics = EnableMetrics, + ValidateRegistrations = ValidateRegistrations, }; [GeneratedRegex(@"(\W|_)", RegexOptions.CultureInvariant)] diff --git a/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilderExtensions.cs b/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilderExtensions.cs index 4f085fb6..64a34e51 100644 --- a/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilderExtensions.cs +++ b/src/KubeOps.Abstractions/Builder/OperatorSettingsBuilderExtensions.cs @@ -161,6 +161,21 @@ public static OperatorSettingsBuilder WithMetrics( return builder; } + /// + /// Sets whether the operator validates, on host startup, that its dependency injection registrations + /// are complete and consistent with the configuration. Disabled by default. See + /// for details. + /// + /// The builder to configure. + /// true to enable validation (default); false to disable. + /// The same instance for chaining. + public static OperatorSettingsBuilder WithRegistrationValidation( + this OperatorSettingsBuilder builder, bool value = true) + { + builder.ValidateRegistrations = value; + return builder; + } + /// Configures parallel reconciliation settings inline via a delegate. /// The builder to configure. /// An action that configures the . diff --git a/src/KubeOps.Abstractions/Reconciliation/Queue/EntityQueue.cs b/src/KubeOps.Abstractions/Reconciliation/Queue/EntityQueue.cs index 6c296ff8..f7d35616 100644 --- a/src/KubeOps.Abstractions/Reconciliation/Queue/EntityQueue.cs +++ b/src/KubeOps.Abstractions/Reconciliation/Queue/EntityQueue.cs @@ -36,6 +36,12 @@ namespace KubeOps.Abstractions.Reconciliation.Queue; /// /// A token to monitor for cancellation requests while waiting for the queue duration to elapse. /// +/// +/// A task whose result is if the entity was scheduled, or if it +/// was dropped (for example because was already cancelled, or the queue's +/// intake is suspended on a leadership transition). Await it to observe failures of a custom queue and to react +/// to a dropped enqueue. +/// /// /// /// This delegate is injected into controllers and other components via dependency injection to enable @@ -62,6 +68,6 @@ namespace KubeOps.Abstractions.Reconciliation.Queue; /// /// /// -public delegate void EntityQueue( +public delegate Task EntityQueue( TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, TimeSpan queueIn, int retryCount, CancellationToken cancellationToken) where TEntity : IKubernetesObject; diff --git a/src/KubeOps.Operator/Builder/CacheExtensions.cs b/src/KubeOps.Operator/Builder/CacheExtensions.cs index c0868681..bfe32ea5 100644 --- a/src/KubeOps.Operator/Builder/CacheExtensions.cs +++ b/src/KubeOps.Operator/Builder/CacheExtensions.cs @@ -27,9 +27,7 @@ internal static class CacheExtensions /// The modified service collection with resource watcher caching configured. internal static IServiceCollection WithResourceWatcherEntityCaching(this IServiceCollection services, OperatorSettings settings) { - var cacheName = settings.ReconcileStrategy == ReconcileStrategy.ByResourceVersion - ? CacheConstants.CacheNames.ResourceWatcherByResourceVersion - : CacheConstants.CacheNames.ResourceWatcher; + var cacheName = CacheConstants.ResourceWatcherCacheNameFor(settings.ReconcileStrategy); var builder = services.AddFusionCache(cacheName); diff --git a/src/KubeOps.Operator/Builder/OperatorBuilder.cs b/src/KubeOps.Operator/Builder/OperatorBuilder.cs index b7db4aaa..83e06ace 100644 --- a/src/KubeOps.Operator/Builder/OperatorBuilder.cs +++ b/src/KubeOps.Operator/Builder/OperatorBuilder.cs @@ -33,6 +33,8 @@ namespace KubeOps.Operator.Builder; internal sealed class OperatorBuilder : IOperatorBuilder { + private OperatorRegistrationRegistry? _registrationRegistry; + public OperatorBuilder(IServiceCollection services, OperatorSettings settings) { Settings = settings; @@ -51,6 +53,8 @@ public IOperatorBuilder AddController() Services.TryAddScoped, TImplementation>(); Services.TryAddSingleton, Reconciler>(); + RegisterRegistrationValidation(typeof(TEntity)); + // Queue Services.TryAddTransient(); Services.TryAddTransient>(services => @@ -59,7 +63,16 @@ public IOperatorBuilder AddController() if (Settings.QueueStrategy == QueueStrategy.InMemory) { Services.TryAddSingleton, TimedEntityQueue>(); - Services.AddHostedService>(); + + switch (Settings.LeaderElectionType) + { + case LeaderElectionType.None: + Services.AddHostedService>(); + break; + case LeaderElectionType.Single: + Services.AddHostedService>(); + break; + } } // Leader Election @@ -108,6 +121,8 @@ public IOperatorBuilder AddFinalizer(string identifier services.GetRequiredService() .Create(identifier)); + RegisterRegistrationValidation(typeof(TEntity)); + return this; } @@ -165,4 +180,21 @@ private void AddOperatorBase() Services.AddLeaderElection(); } } + + private void RegisterRegistrationValidation(Type entityType) + { + if (!Settings.ValidateRegistrations) + { + return; + } + + if (_registrationRegistry is null) + { + _registrationRegistry = new(Services); + Services.AddSingleton(_registrationRegistry); + Services.AddHostedService(); + } + + _registrationRegistry.Add(entityType); + } } diff --git a/src/KubeOps.Operator/Builder/OperatorRegistrationRegistry.cs b/src/KubeOps.Operator/Builder/OperatorRegistrationRegistry.cs new file mode 100644 index 00000000..cce05ff7 --- /dev/null +++ b/src/KubeOps.Operator/Builder/OperatorRegistrationRegistry.cs @@ -0,0 +1,34 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using Microsoft.Extensions.DependencyInjection; + +namespace KubeOps.Operator.Builder; + +/// +/// Bridges build-time information to the running at host +/// startup: it tracks the entity types managed by the operator and exposes the service collection so +/// the validator can inspect the registrations. +/// +/// The service collection the operator is registered into. +internal sealed class OperatorRegistrationRegistry(IServiceCollection services) +{ + private readonly HashSet _managedEntities = []; + + /// + /// Gets the entity types that are managed by the operator (i.e. that have a controller registered). + /// + public IReadOnlyCollection ManagedEntities => _managedEntities; + + /// + /// Gets the service collection the operator is registered into. + /// + public IServiceCollection Services => services; + + /// + /// Registers an entity type as managed by the operator. + /// + /// The entity type to register. + public void Add(Type entityType) => _managedEntities.Add(entityType); +} diff --git a/src/KubeOps.Operator/Builder/OperatorRegistrationValidator.cs b/src/KubeOps.Operator/Builder/OperatorRegistrationValidator.cs new file mode 100644 index 00000000..2aa5e578 --- /dev/null +++ b/src/KubeOps.Operator/Builder/OperatorRegistrationValidator.cs @@ -0,0 +1,324 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Globalization; + +using KubeOps.Abstractions.Builder; +using KubeOps.Abstractions.Reconciliation; +using KubeOps.Abstractions.Reconciliation.Finalizer; +using KubeOps.Operator.Constants; +using KubeOps.Operator.Exceptions; +using KubeOps.Operator.Queue; +using KubeOps.Operator.Watcher; + +using Microsoft.Extensions.DependencyInjection; +using Microsoft.Extensions.Hosting; +using Microsoft.Extensions.Logging; + +using ZiggyCreatures.Caching.Fusion; + +namespace KubeOps.Operator.Builder; + +/// +/// Validates, on host startup, that the operator's dependency injection registrations are complete and +/// consistent with the chosen configuration. Registered only when +/// is enabled. +/// +/// +/// The validator inspects the registered services (it does not construct them) and, for every managed +/// entity, verifies that the components implied by the configuration are present. If anything is missing +/// it throws an aggregating all gaps, aborting host startup. +/// The one exception to "inspect only" is the cache-tagging check: the resource watcher tags every +/// deduplication entry, which requires FusionCache tagging to be enabled. Tagging state cannot be read from +/// the cache or its options (a builder-applied DisableTagging is not reflected in IOptionsMonitor), +/// so it is verified with a harmless no-op probe against the resource-watcher cache. +/// +/// Validation runs in , i.e. the host's Starting phase, which completes +/// before any hosted service's is invoked. A failed validation +/// therefore aborts startup before the watcher, queue consumer, leader election or CRD installer perform +/// any work — nothing has started, so there is nothing to unwind. +/// +/// +internal sealed class OperatorRegistrationValidator( + OperatorRegistrationRegistry registry, + OperatorSettings settings, + IFusionCacheProvider cacheProvider, + ILogger logger) : IHostedLifecycleService +{ + private const string TaggingProbeTag = "__kubeops_tagging_probe__"; + + public Task StartingAsync(CancellationToken cancellationToken) + { + if (registry.ManagedEntities.Count == 0) + { + return Task.CompletedTask; + } + + var problems = new List(); + foreach (var entityType in registry.ManagedEntities) + { + ValidateEntity(entityType, problems); + } + + ValidateCacheTagging(problems); + + if (problems.Count > 0) + { + throw new InvalidRegistrationException( + "Operator registration validation failed. The following required components are not " + + "registered for the current configuration " + + $"(LeaderElectionType.{settings.LeaderElectionType}, QueueStrategy.{settings.QueueStrategy}):" + + Environment.NewLine + + string.Join(Environment.NewLine, problems.Select(p => $" - {p}")) + Environment.NewLine + + "Register the missing components or disable validation via " + + "OperatorSettings.ValidateRegistrations."); + } + + logger.LogDebug( + "Operator registration validation passed for {EntityCount} managed entit(y/ies).", + registry.ManagedEntities.Count); + + return Task.CompletedTask; + } + + public Task StartAsync(CancellationToken cancellationToken) => Task.CompletedTask; + + public Task StartedAsync(CancellationToken cancellationToken) => Task.CompletedTask; + + public Task StoppingAsync(CancellationToken cancellationToken) => Task.CompletedTask; + + public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask; + + public Task StoppedAsync(CancellationToken cancellationToken) => Task.CompletedTask; + + // True if a closed registration for the service exists, or an open-generic one the DI container would + // close to it (e.g. AddSingleton(typeof(ITimedEntityQueue<>), typeof(MyQueue<>))). Keyed registrations + // are ignored: the watcher/reconciler take these as plain (unkeyed) constructor dependencies, so a keyed + // registration would not satisfy them. + private static bool HasService(IServiceCollection services, Type serviceType) + { + if (services.Any(d => !d.IsKeyedService && d.ServiceType == serviceType)) + { + return true; + } + + if (!serviceType.IsGenericType) + { + return false; + } + + // An open-generic registration only satisfies the requested closed type if its implementation can + // actually be closed to it. An open implementation whose generic constraints exclude the entity (e.g. + // `where TEntity : ISomeMarker`) would pass a name-only match but fail to resolve at runtime; verify it + // closes. Factory/instance registrations expose no implementation type to check and are assumed to match. + var openServiceType = serviceType.GetGenericTypeDefinition(); + return services.Any(d => + !d.IsKeyedService && d.ServiceType == openServiceType && ClosesToRequestedType(d, serviceType)); + } + + private static bool ClosesToRequestedType(ServiceDescriptor descriptor, Type closedServiceType) + { + if (descriptor.ImplementationType is not { IsGenericTypeDefinition: true } openImplementation) + { + return true; + } + + try + { + _ = openImplementation.MakeGenericType(closedServiceType.GenericTypeArguments); + return true; + } + catch (ArgumentException) + { + return false; + } + } + + private static bool HasHostedServiceAssignableTo(IServiceCollection services, Type targetType) => + services.Any(d => + { + if (d.IsKeyedService || d.ServiceType != typeof(IHostedService)) + { + return false; + } + + var implementationType = d.ImplementationType ?? d.ImplementationInstance?.GetType(); + return implementationType is not null && targetType.IsAssignableFrom(implementationType); + }); + + private static bool HasKeyedFinalizer(IServiceCollection services, Type entityType) + { + var finalizerType = typeof(IEntityFinalizer<>).MakeGenericType(entityType); + return services.Any(d => d.IsKeyedService && d.ServiceType == finalizerType); + } + + // The effective implementation type for a service (the last registration wins in DI). Handles closed + // registrations as well as open-generic ones (closing the open implementation to the requested type). + // Returns null for factory-registered services whose concrete type cannot be determined without + // constructing them. + private static Type? GetImplementationType(IServiceCollection services, Type serviceType) + { + var descriptor = services.LastOrDefault(d => !d.IsKeyedService && d.ServiceType == serviceType); + if (descriptor is not null) + { + return descriptor.ImplementationType ?? descriptor.ImplementationInstance?.GetType(); + } + + if (!serviceType.IsGenericType) + { + return null; + } + + var openDescriptor = services.LastOrDefault(d => + !d.IsKeyedService && d.ServiceType == serviceType.GetGenericTypeDefinition()); + + if (openDescriptor?.ImplementationType is not { IsGenericTypeDefinition: true } openImplementation) + { + return openDescriptor?.ImplementationInstance?.GetType(); + } + + try + { + return openImplementation.MakeGenericType(serviceType.GenericTypeArguments); + } + catch (ArgumentException) + { + return null; + } + } + + // The resource watcher tags every deduplication entry (so a leadership-aware watcher can drop only its own + // entity type's entries via RemoveByTag). That requires FusionCache tagging to be enabled — otherwise every + // tagged write throws at runtime. Tagging is on by default; this catches a configuration that disabled it. + // The flag cannot be read from the cache or its options (a builder-applied DisableTagging is not reflected in + // IOptionsMonitor), so a harmless no-op tag removal for an unused sentinel tag is used as the probe: it is a + // no-op when tagging is enabled and throws InvalidOperationException when it is disabled. + private void ValidateCacheTagging(List problems) + { + var cacheName = CacheConstants.ResourceWatcherCacheNameFor(settings.ReconcileStrategy); + + try + { + cacheProvider.GetCache(cacheName).RemoveByTag(TaggingProbeTag); + } + catch (InvalidOperationException) + { + problems.Add(string.Format( + CultureInfo.InvariantCulture, + "The resource-watcher cache '{0}' has FusionCache tagging disabled " + + "(FusionCacheOptions.DisableTagging), but the watcher tags every deduplication entry (used to " + + "drop an entity type's own entries on leadership loss). Remove DisableTagging from your " + + "ConfigureResourceWatcherEntityCache configuration, or use the default cache.", + cacheName)); + } + } + + private void ValidateEntity(Type entityType, List problems) + { + var services = registry.Services; + var entityName = entityType.Name; + + // Without a controller there is no reconciliation pipeline at all, so the remaining checks would + // only add noise. If the entity is only present because a finalizer was registered for it, point + // at the actual mistake: finalizers run as part of reconciliation and require a controller. + if (!HasService(services, typeof(IReconciler<>).MakeGenericType(entityType))) + { + problems.Add(HasKeyedFinalizer(services, entityType) + ? string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': a finalizer is registered but no controller. Finalizers run as part of " + + "reconciliation; register a controller via AddController<…, {0}>().", + entityName) + : string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': no IReconciler<{0}> is registered.", + entityName)); + return; + } + + // Hosted services are recognized by their registered implementation type. A component registered + // through a DI factory delegate exposes no type and is therefore reported as missing. Register the + // watcher and consumer with a concrete type so validation can inspect them. + if (!HasHostedServiceAssignableTo(services, typeof(ResourceWatcher<>).MakeGenericType(entityType))) + { + problems.Add(string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': no resource watcher is registered (expected a hosted service deriving from ResourceWatcher<{0}>).", + entityName)); + } + + // ITimedEntityQueue is always required: both the resource watcher and the reconciler take + // it as a constructor dependency, regardless of queue strategy. With QueueStrategy.Custom the SDK + // does not register it, so a user who forgets to supply one would only fail at host startup with a + // DI error. + var queueType = typeof(ITimedEntityQueue<>).MakeGenericType(entityType); + var queueRegistered = HasService(services, queueType); + if (!queueRegistered) + { + problems.Add(string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': no ITimedEntityQueue<{0}> is registered.", + entityName)); + } + + // A queue consumer is always required. Under Single leader election it must be leadership-aware + // (drive the queue gate), so a stronger marker is required there; otherwise the base consumer marker + // is enough. The SDK registers one for the in-memory strategy; with QueueStrategy.Custom the user + // supplies it and marks it accordingly. + var single = settings.LeaderElectionType == LeaderElectionType.Single; + var consumerType = single + ? typeof(ILeaderAwareEntityQueueConsumer<>).MakeGenericType(entityType) + : typeof(IEntityQueueConsumer<>).MakeGenericType(entityType); + if (!HasHostedServiceAssignableTo(services, consumerType)) + { + problems.Add(single + ? string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': no leadership-aware queue consumer is registered. LeaderElectionType.Single " + + "requires a consumer implementing ILeaderAwareEntityQueueConsumer<{0}> (e.g. deriving from " + + "LeaderAwareEntityQueueBackgroundService<{0}>) so the queue gate is driven on leadership " + + "transitions.", + entityName) + : string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': no queue consumer is registered (expected a hosted service implementing " + + "IEntityQueueConsumer<{0}>, e.g. deriving from EntityQueueBackgroundService<{0}>).", + entityName)); + } + + if (!single || !queueRegistered) + { + return; + } + + // Under Single leader election the queue must support the leadership gate + // so a former leader leaves no work behind on a leadership transition. + var queueImpl = GetImplementationType(services, queueType); + if (queueImpl is null) + { + // The queue is registered, but its concrete type cannot be determined (e.g. a DI factory), + // so the gate capability cannot be verified. Fail rather than silently assume it is safe — + // consistent with how an unverifiable consumer is handled. + problems.Add(string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': the registered queue cannot be inspected for {1} (it is registered via a " + + "factory delegate). LeaderElectionType.Single requires a queue whose leadership-gate " + + "capability can be verified — register it with a concrete or open-generic type, or use the " + + "built-in TimedEntityQueue<{0}>.", + entityName, + nameof(ISuspendableEntityQueue))); + } + else if (!typeof(ISuspendableEntityQueue).IsAssignableFrom(queueImpl)) + { + problems.Add(string.Format( + CultureInfo.InvariantCulture, + "Entity '{0}': the registered queue ({1}) does not implement {2}, which " + + "LeaderElectionType.Single requires for leadership-loss protection (queue clear and intake " + + "suspension). Implement {2} on your queue or use the built-in TimedEntityQueue<{0}>.", + entityName, + queueImpl.Name, + nameof(ISuspendableEntityQueue))); + } + } +} diff --git a/src/KubeOps.Operator/Constants/CacheConstants.cs b/src/KubeOps.Operator/Constants/CacheConstants.cs index 7719c111..a30ad416 100644 --- a/src/KubeOps.Operator/Constants/CacheConstants.cs +++ b/src/KubeOps.Operator/Constants/CacheConstants.cs @@ -2,6 +2,8 @@ // The .NET Foundation licenses this file to you under the Apache 2.0 License. // See the LICENSE file in the project root for more information. +using KubeOps.Abstractions.Builder; + namespace KubeOps.Operator.Constants; /// @@ -9,6 +11,17 @@ namespace KubeOps.Operator.Constants; /// public static class CacheConstants { + /// + /// Returns the resource-watcher dedup cache name for the given reconcile strategy. Single source of truth + /// shared by the cache registration, the watcher, and the registration validator. + /// + /// The configured reconcile strategy. + /// The named FusionCache instance name to use for that strategy. + public static string ResourceWatcherCacheNameFor(ReconcileStrategy strategy) => + strategy == ReconcileStrategy.ByResourceVersion + ? CacheNames.ResourceWatcherByResourceVersion + : CacheNames.ResourceWatcher; + /// /// Contains constant values representing names used within the operator's caching mechanisms. /// diff --git a/src/KubeOps.Operator/Exceptions/InvalidRegistrationException.cs b/src/KubeOps.Operator/Exceptions/InvalidRegistrationException.cs new file mode 100644 index 00000000..8eae2fad --- /dev/null +++ b/src/KubeOps.Operator/Exceptions/InvalidRegistrationException.cs @@ -0,0 +1,43 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +namespace KubeOps.Operator.Exceptions; + +/// +/// Raised on host startup when the operator's dependency injection registrations are incomplete or +/// inconsistent with the configured leader election and queue strategy. Thrown by the registration +/// validation enabled via . +/// +/// +/// Derives from so existing handlers that catch the latter keep +/// working, while allowing callers to catch this more specific type. +/// +public sealed class InvalidRegistrationException : InvalidOperationException +{ + /// + /// Initializes a new instance of the class. + /// + public InvalidRegistrationException() + { + } + + /// + /// Initializes a new instance of the class. + /// + /// The error message. + public InvalidRegistrationException(string? message) + : base(message) + { + } + + /// + /// Initializes a new instance of the class. + /// + /// The error message. + /// The exception that caused this exception. + public InvalidRegistrationException(string? message, Exception? innerException) + : base(message, innerException) + { + } +} diff --git a/src/KubeOps.Operator/LeaderElection/LeaderElectionBackgroundService.cs b/src/KubeOps.Operator/LeaderElection/LeaderElectionBackgroundService.cs index 456d8ca3..8942792c 100644 --- a/src/KubeOps.Operator/LeaderElection/LeaderElectionBackgroundService.cs +++ b/src/KubeOps.Operator/LeaderElection/LeaderElectionBackgroundService.cs @@ -64,11 +64,19 @@ public async Task StopAsync(CancellationToken cancellationToken) public async ValueTask DisposeAsync() { + _disposed = true; + + // Stop and await the leadership loop before disposing the resources it observes, so a dispose without a + // prior StopAsync cannot tear down _cts/elector underneath the still-running task. + await _cts.CancelAsync(); + if (_leadershipTask is not null) + { + await _leadershipTask; + } + await CastAndDispose(_cts); await CastAndDispose(elector); - _disposed = true; - static async ValueTask CastAndDispose(IDisposable resource) { if (resource is IAsyncDisposable resourceAsyncDisposable) @@ -102,7 +110,17 @@ private async Task RunAndTryToHoldLeadershipForeverAsync() var delay = ExponentialRetryBackoff.GetDelayWithJitter(leadershipRetries); logger.LogError(exception, "Failed to hold leadership. Wait {Seconds}s before attempting to reacquire leadership.", delay.TotalSeconds); - await Task.Delay(delay); + + // Honor the stop token so a shutdown during the back-off does not have to wait out the (growing) + // delay before StopAsync/DisposeAsync — both await this task. + try + { + await Task.Delay(delay, _cts.Token); + } + catch (OperationCanceledException) when (_cts.IsCancellationRequested) + { + return; + } } } } diff --git a/src/KubeOps.Operator/LeaderElection/LeaderElectionSubscription.cs b/src/KubeOps.Operator/LeaderElection/LeaderElectionSubscription.cs new file mode 100644 index 00000000..8392b364 --- /dev/null +++ b/src/KubeOps.Operator/LeaderElection/LeaderElectionSubscription.cs @@ -0,0 +1,67 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using k8s.LeaderElection; + +namespace KubeOps.Operator.LeaderElection; + +/// +/// Shared leadership-callback wiring for the operator's leader-aware hosted services (the queue consumer and the +/// resource watcher). It owns the subscription of a service's OnStartedLeading / OnStoppedLeading +/// handlers to a and guarantees they are removed again. +/// +/// +/// +/// and are idempotent, so repeated calls cannot duplicate or +/// over-remove handlers. +/// +/// +/// It deliberately does not dispose the : the elector is a DI singleton +/// shared with the LeaderElectionBackgroundService, so its lifetime is not owned here. +/// +/// +/// The shared leader elector to subscribe to. +/// The handler to invoke when leadership is acquired. +/// The handler to invoke when leadership is lost. +internal sealed class LeaderElectionSubscription( + LeaderElector elector, Action onStartedLeading, Action onStoppedLeading) +{ + private readonly object _gate = new(); + private bool _subscribed; + + /// Gets a value indicating whether this instance currently holds leadership. + public bool IsLeader => elector.IsLeader(); + + /// Subscribes the handlers to the elector. A no-op if already subscribed. + public void Subscribe() + { + lock (_gate) + { + if (_subscribed) + { + return; + } + + elector.OnStartedLeading += onStartedLeading; + elector.OnStoppedLeading += onStoppedLeading; + _subscribed = true; + } + } + + /// Removes the handlers from the elector. A no-op if not subscribed. + public void Unsubscribe() + { + lock (_gate) + { + if (!_subscribed) + { + return; + } + + elector.OnStartedLeading -= onStartedLeading; + elector.OnStoppedLeading -= onStoppedLeading; + _subscribed = false; + } + } +} diff --git a/src/KubeOps.Operator/Queue/EntityQueueBackgroundService{TEntity}.cs b/src/KubeOps.Operator/Queue/EntityQueueBackgroundService{TEntity}.cs index 5c1bc4ce..563467d8 100644 --- a/src/KubeOps.Operator/Queue/EntityQueueBackgroundService{TEntity}.cs +++ b/src/KubeOps.Operator/Queue/EntityQueueBackgroundService{TEntity}.cs @@ -14,7 +14,6 @@ using KubeOps.Operator.Logging; using KubeOps.Operator.Metrics; -using Microsoft.Extensions.Hosting; using Microsoft.Extensions.Logging; namespace KubeOps.Operator.Queue; @@ -60,58 +59,30 @@ namespace KubeOps.Operator.Queue; /// unbounded task accumulation. /// /// -internal sealed class EntityQueueBackgroundService( +public class EntityQueueBackgroundService( ActivitySource activitySource, IKubernetesClient client, OperatorSettings operatorSettings, ITimedEntityQueue queue, IReconciler reconciler, ILogger> logger, - OperatorMetrics? metrics = null) : IHostedService, IDisposable, IAsyncDisposable + OperatorMetrics? metrics = null) : RestartableHostedService, IEntityQueueConsumer where TEntity : IKubernetesObject { - private readonly CancellationTokenSource _cts = new(); private readonly ConcurrentDictionary _uidEntries = new(); private readonly SemaphoreSlim _parallelismSemaphore = new( operatorSettings.ParallelReconciliation.MaxParallelReconciliations, operatorSettings.ParallelReconciliation.MaxParallelReconciliations); - private volatile bool _disposed; - - /// - /// - /// Schedules the queue processing loop as a background task using . - /// The passed to this method is intentionally not used for the processing loop; - /// cancellation is managed via an internal that is signaled by . - /// This avoids cancelling the scheduled work during the host startup phase. - /// - public Task StartAsync(CancellationToken cancellationToken) - { - // The current implementation of IHostedService expects that StartAsync is "really" asynchronous. - // Blocking calls are not allowed, they would stop the rest of the startup flow. - // - // This has been an open issue since 2019 and is not expected to be closed soon. (https://github.com/dotnet/runtime/issues/36063) - // For reasons unknown at the time of writing this code, "await Task.Yield()" didn't work as expected, it caused - // a deadlock in 1/10 of the cases. - // - // Therefore, we use Task.Run() and put the work to queue. The passed cancellation token of the StartAsync - // method is not used because it would only cancel the scheduling (which we definitely don't want to cancel). - // To make this intention explicit, CancellationToken.None gets passed. - _ = Task.Run(() => WatchAsync(_cts.Token), CancellationToken.None); - - return Task.CompletedTask; - } - - /// - public Task StopAsync(CancellationToken cancellationToken) - => _disposed - ? Task.CompletedTask - : _cts.CancelAsync(); + /// + /// Gets the timed entity queue this service consumes. Exposed for leadership-aware subclasses that need + /// to suspend intake or clear the queue on a leadership transition. + /// + protected ITimedEntityQueue Queue => queue; /// - public void Dispose() + protected override void DisposeManagedResources() { - _cts.Dispose(); _parallelismSemaphore.Dispose(); lock (_uidEntries) @@ -126,42 +97,31 @@ public void Dispose() client.Dispose(); queue.Dispose(); - - _disposed = true; } /// - public async ValueTask DisposeAsync() + protected override async ValueTask DisposeManagedResourcesAsync() { - await CastAndDispose(_cts); - await CastAndDispose(_parallelismSemaphore); + await CastAndDisposeAsync(_parallelismSemaphore); foreach (var entry in _uidEntries.Values) { - await CastAndDispose(entry.Semaphore); + await CastAndDisposeAsync(entry.Semaphore); } _uidEntries.Clear(); - await CastAndDispose(client); - await CastAndDispose(queue); - - _disposed = true; - return; - - static async ValueTask CastAndDispose(IDisposable resource) - { - if (resource is IAsyncDisposable resourceAsyncDisposable) - { - await resourceAsyncDisposable.DisposeAsync(); - } - else - { - resource.Dispose(); - } - } + await CastAndDisposeAsync(client); + await CastAndDisposeAsync(queue); } - private async Task> ReconcileSingleAsync(QueueEntry entry, CancellationToken cancellationToken) + /// + protected override void OnLoopFaulted(Exception exception) => + logger.LogError( + exception, + "The queue processing loop for {Entity} exited unexpectedly and stopped consuming the queue.", + typeof(TEntity).Name); + + protected virtual async Task> ReconcileSingleAsync(QueueEntry entry, CancellationToken cancellationToken) { logger .LogTrace( @@ -189,7 +149,8 @@ private async Task> ReconcileSingleAsync(QueueEntr return ReconciliationResult.Failure(entry.Entity, "Entity was not found."); } - private async Task WatchAsync(CancellationToken cancellationToken) + /// + protected override async Task ExecuteAsync(CancellationToken cancellationToken) { var tasks = new List(operatorSettings.ParallelReconciliation.MaxParallelReconciliations); @@ -208,12 +169,25 @@ private async Task WatchAsync(CancellationToken cancellationToken) tasks.RemoveAll(t => t.IsCompleted); } } - - await Task.WhenAll(tasks); } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { - // Expected during shutdown, no action needed. + // Expected during shutdown / leadership loss, no action needed. + } + finally + { + // Drain the worker tasks already spawned so the loop does not return — and its CancellationTokenSource + // is not disposed (see RunLoopAsync), nor shared resources torn down, nor a new leadership + // term started — while reconciliations are still in flight. Individual worker failures are already + // handled inside ProcessEntryAsync; cancellation surfaces here as OperationCanceledException. + try + { + await Task.WhenAll(tasks); + } + catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) + { + // Workers cancelled as part of the stop; their outcomes are already handled in ProcessEntryAsync. + } } } @@ -319,7 +293,7 @@ private async Task ProcessEntryAsync(QueueEntry entry, CancellationToke // Catches all unexpected errors, including OperationCanceledException that was NOT triggered // by the operator's own cancellation token (i.e. an internal abort from within the reconciler). // Intentional shutdown cancellations (IsCancellationRequested == true) are re-thrown and handled - // by the caller in WatchAsync. + // by the caller in ExecuteAsync. logger .LogError( e, @@ -348,15 +322,20 @@ private async Task ProcessEntryAsync(QueueEntry entry, CancellationToke // what event originally caused this reconciliation (e.g. ApiServer or Operator). // RetryCount is incremented and passed explicitly so the back-off calculation // stays correct across successive failures without losing state. - await queue.Enqueue( + var scheduled = await queue.Enqueue( entry.Entity, entry.ReconciliationType, entry.ReconciliationTriggerSource, delay, nextRetryCount, - CancellationToken.None); + cancellationToken); - metrics?.RecordRequeue(typeof(TEntity).Name, "error_retry"); + // Only count the retry when it was actually scheduled. A leadership-aware queue with + // suspended intake (leadership just lost) drops the entry and returns false. + if (scheduled) + { + metrics?.RecordRequeue(typeof(TEntity).Name, "error_retry"); + } } else { @@ -398,7 +377,7 @@ private async Task HandleLockingConflictAsync(QueueEntry entry, Cancell entry.Entity.ToIdentifierString(), requeueDelay.TotalSeconds); - await queue.Enqueue( + var scheduled = await queue.Enqueue( entry.Entity, entry.ReconciliationType, entry.ReconciliationTriggerSource, @@ -406,11 +385,16 @@ await queue.Enqueue( retryCount: 0, cancellationToken); - metrics?.RecordRequeue(typeof(TEntity).Name, "conflict"); + // Only count the requeue when it was actually scheduled (a suspended gate drops it). + if (scheduled) + { + metrics?.RecordRequeue(typeof(TEntity).Name, "conflict"); + } + break; default: - throw new NotSupportedException($"Conflict strategy {operatorSettings.ParallelReconciliation.ConflictStrategy} is not supported in HandleUidConflictAsync."); + throw new NotSupportedException($"Conflict strategy {operatorSettings.ParallelReconciliation.ConflictStrategy} is not supported in HandleLockingConflictAsync."); } } diff --git a/src/KubeOps.Operator/Queue/EntityQueueFactory.cs b/src/KubeOps.Operator/Queue/EntityQueueFactory.cs index 2292fbf2..88605f6b 100644 --- a/src/KubeOps.Operator/Queue/EntityQueueFactory.cs +++ b/src/KubeOps.Operator/Queue/EntityQueueFactory.cs @@ -34,6 +34,6 @@ public EntityQueue Create() retryCount > 0 ? $" (Retry: {retryCount})" : string.Empty, timeSpan.TotalSeconds); - queue.Enqueue(entity, type, triggerSource, timeSpan, retryCount, cancellationToken); + return queue.Enqueue(entity, type, triggerSource, timeSpan, retryCount, cancellationToken); }; } diff --git a/src/KubeOps.Operator/Queue/IEntityQueueConsumer{TEntity}.cs b/src/KubeOps.Operator/Queue/IEntityQueueConsumer{TEntity}.cs new file mode 100644 index 00000000..9f3898d4 --- /dev/null +++ b/src/KubeOps.Operator/Queue/IEntityQueueConsumer{TEntity}.cs @@ -0,0 +1,32 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics.CodeAnalysis; + +using k8s; +using k8s.Models; + +namespace KubeOps.Operator.Queue; + +/// +/// Marker for a hosted service that drains the for +/// (i.e. the queue consumer). +/// +/// The type of the Kubernetes entity being reconciled. +/// +/// The SDK's (and its leader-aware variant) implement +/// this. Its sole purpose is to let registration validation recognise a queue consumer: a custom consumer +/// supplied with should implement this interface +/// (or derive from ) so that validation can confirm a +/// consumer exists. It is not required at runtime — any hosted service can drain the queue — but without +/// it, registration validation cannot prove a consumer is present. +/// +[SuppressMessage( + "Major Code Smell", + "S2326:Unused type parameters should be removed", + Justification = "TEntity is a phantom type that ties the consumer marker to a specific entity; it is used by registration validation, not by interface members.")] +public interface IEntityQueueConsumer + where TEntity : IKubernetesObject +{ +} diff --git a/src/KubeOps.Operator/Queue/ILeaderAwareEntityQueueConsumer{TEntity}.cs b/src/KubeOps.Operator/Queue/ILeaderAwareEntityQueueConsumer{TEntity}.cs new file mode 100644 index 00000000..fc0cd20c --- /dev/null +++ b/src/KubeOps.Operator/Queue/ILeaderAwareEntityQueueConsumer{TEntity}.cs @@ -0,0 +1,34 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics.CodeAnalysis; + +using k8s; +using k8s.Models; + +namespace KubeOps.Operator.Queue; + +/// +/// Marker for a queue consumer that is leadership-aware: it gates the queue on leadership transitions by +/// driving (suspend intake and clear on leadership loss, resume on +/// re-acquisition). +/// +/// The type of the Kubernetes entity being reconciled. +/// +/// The SDK's implements this. With +/// a queue capable of gating +/// () is not enough — a consumer must actually drive that gate. +/// Registration validation therefore requires the consumer to implement this interface under +/// Single. A custom consumer supplied with +/// should implement it (or derive from ). +/// Like all such markers it declares intent; it cannot prove the consumer actually calls the gate methods. +/// +[SuppressMessage( + "Major Code Smell", + "S2326:Unused type parameters should be removed", + Justification = "TEntity is a phantom type that ties the consumer marker to a specific entity; it is used by registration validation, not by interface members.")] +public interface ILeaderAwareEntityQueueConsumer : IEntityQueueConsumer + where TEntity : IKubernetesObject +{ +} diff --git a/src/KubeOps.Operator/Queue/ISuspendableEntityQueue.cs b/src/KubeOps.Operator/Queue/ISuspendableEntityQueue.cs new file mode 100644 index 00000000..12ad8908 --- /dev/null +++ b/src/KubeOps.Operator/Queue/ISuspendableEntityQueue.cs @@ -0,0 +1,44 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +namespace KubeOps.Operator.Queue; + +/// +/// Optional capability of an that lets the operator gate and +/// drain the queue across a leadership transition. +/// +/// +/// +/// With the leader-aware queue consumer +/// () uses these methods so that a former +/// leader leaves no work behind: it calls + when +/// leadership is lost and when it is (re)acquired. +/// +/// +/// This is a separate, opt-in interface rather than part of +/// so that a custom queue only participates in leadership-loss +/// protection when it deliberately implements it. A queue that does not implement this interface keeps +/// working, but provides no such protection (the consumer logs a warning in that case). The built-in +/// implements it. +/// +/// +public interface ISuspendableEntityQueue +{ + /// + /// Removes all pending (ready) and scheduled entries without closing the queue for future use. + /// + void Clear(); + + /// + /// Suspends acceptance of new entries: subsequent enqueues are dropped until + /// is called. To be effective, the implementation must make the intake check and the scheduling mutation + /// atomic with this method and (e.g. under a shared lock). + /// + void SuspendIntake(); + + /// + /// Resumes acceptance of new entries after a previous call. + /// + void ResumeIntake(); +} diff --git a/src/KubeOps.Operator/Queue/ITimedEntityQueue{TEntity}.cs b/src/KubeOps.Operator/Queue/ITimedEntityQueue{TEntity}.cs index 987f7e76..f53c686e 100644 --- a/src/KubeOps.Operator/Queue/ITimedEntityQueue{TEntity}.cs +++ b/src/KubeOps.Operator/Queue/ITimedEntityQueue{TEntity}.cs @@ -59,10 +59,16 @@ public interface ITimedEntityQueue : IDisposable, IAsyncEnumerable /// - /// A task that represents the asynchronous enqueue operation. + /// A task whose result is if the entry was scheduled, or + /// if it was dropped (for example because was already cancelled, or + /// intake is suspended via ). Callers can use this to + /// avoid recording requeue metrics for work that was not actually scheduled. /// /// /// If an entity with the same key is already queued, the entry scheduled to run earliest is kept. + /// Implementations may drop the entry when is already cancelled, or + /// when intake is suspended via (if the queue + /// implements that optional capability), returning in those cases. /// - Task Enqueue(TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, TimeSpan queueIn, int retryCount, CancellationToken cancellationToken); + Task Enqueue(TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, TimeSpan queueIn, int retryCount, CancellationToken cancellationToken); } diff --git a/src/KubeOps.Operator/Queue/LeaderAwareEntityQueueBackgroundService{TEntity}.cs b/src/KubeOps.Operator/Queue/LeaderAwareEntityQueueBackgroundService{TEntity}.cs new file mode 100644 index 00000000..f07d831a --- /dev/null +++ b/src/KubeOps.Operator/Queue/LeaderAwareEntityQueueBackgroundService{TEntity}.cs @@ -0,0 +1,158 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics; + +using k8s; +using k8s.LeaderElection; +using k8s.Models; + +using KubeOps.Abstractions.Builder; +using KubeOps.Abstractions.Reconciliation; +using KubeOps.KubernetesClient; +using KubeOps.Operator.LeaderElection; +using KubeOps.Operator.Metrics; + +using Microsoft.Extensions.Logging; + +namespace KubeOps.Operator.Queue; + +/// +/// A leadership-aware variant of . The queue is only +/// consumed while this instance holds leadership; when leadership is lost the queue's intake is suspended, the +/// dequeue loop is stopped and cancellation is requested for any in-flight reconciliation. This stops +/// the queue side promptly, but cannot force a non-cooperative reconciler that ignores its +/// to abort — see the remarks. +/// +/// The type of the Kubernetes entity being managed. +/// +/// +/// On leadership loss this service performs a hard stop combined with a queue gate: it +/// suspends the queue's intake, cancels the dequeue loop and any in-flight reconciliation, and clears the +/// queue. While the instance is not the leader the intake stays closed, so neither a still-running +/// reconciler's RequeueAfter nor an error retry can leave work behind. On re-acquiring leadership the +/// intake is reopened and the watcher re-lists the current state. +/// +/// +/// This protects the queue side only. KubeOps does not terminate the process on +/// leadership loss (unlike controller-runtime). It therefore cannot prevent a non-cooperative +/// reconciler that ignores the from performing external side effects while a +/// former leader. Reconciler implementations must honour cancellation and be idempotent. As a second line of +/// defence, concurrent writes to the same object are serialised by the API server via +/// metadata.resourceVersion (a stale write fails with HTTP 409 Conflict). +/// +/// +/// See https://kubernetes.io/docs/concepts/architecture/leases/ for leader-election semantics. +/// +/// +public class LeaderAwareEntityQueueBackgroundService( + ActivitySource activitySource, + IKubernetesClient client, + OperatorSettings operatorSettings, + ITimedEntityQueue queue, + IReconciler reconciler, + ILogger> logger, + LeaderElector elector, + OperatorMetrics? metrics = null) + : EntityQueueBackgroundService( + activitySource, + client, + operatorSettings, + queue, + reconciler, + logger, + metrics), ILeaderAwareEntityQueueConsumer + where TEntity : IKubernetesObject +{ + private LeaderElectionSubscription? _subscription; + + private ISuspendableEntityQueue? Gate => Queue as ISuspendableEntityQueue; + + public override Task StartAsync(CancellationToken cancellationToken) + { + logger.LogDebug("Subscribe for leadership updates."); + + _subscription ??= new(elector, StartedLeading, StoppedLeading); + _subscription.Subscribe(); + + if (Gate is null) + { + logger.LogWarning( + "The configured queue ({QueueType}) does not implement {Capability}; leadership-loss " + + "protection (queue clear and intake suspension) is disabled. A former leader may leave " + + "queued work behind on a leadership transition.", + Queue.GetType().Name, + nameof(ISuspendableEntityQueue)); + } + + if (_subscription.IsLeader) + { + Gate?.ResumeIntake(); + return base.StartAsync(cancellationToken); + } + + // Not leading yet: keep the intake gate closed so nothing accumulates work until leadership is held. + logger.LogDebug("Starting as non-leader; intake gate kept closed until leadership is acquired."); + Gate?.SuspendIntake(); + return Task.CompletedTask; + } + + public override Task StopAsync(CancellationToken cancellationToken) + { + logger.LogDebug("Unsubscribe from leadership updates."); + _subscription?.Unsubscribe(); + + // Always delegate to the base stop: it drains in-flight reconciliations, bounded by the host shutdown + // token, so the processing loop is reliably awaited on host shutdown even when leadership was already lost. + return base.StopAsync(cancellationToken); + } + + /// + protected override void OnDisposing() => _subscription?.Unsubscribe(); + + private void StartedLeading() + { + logger.LogInformation("This instance started leading, starting queue processing."); + + // Open the intake gate before producers run. This service is registered before the watcher + // (see OperatorBuilder.AddController), so its OnStartedLeading runs first and the gate is open + // by the time the watcher starts enqueuing. + Gate?.ResumeIntake(); + _ = base.StartAsync(CancellationToken.None); + } + + private void StoppedLeading() + { + logger.LogInformation("This instance stopped leading, stopping queue processing."); + + // This runs inside the elector's OnStoppedLeading callback, so the safety-critical stop must not be skipped + // by a failure in the (best-effort) gate operations of a custom queue — and no exception may propagate into + // the callback (it would be misattributed as a leadership-hold failure). + // + // Close the intake gate FIRST so nothing — including a still-running reconciler's RequeueAfter or an error + // retry — can enqueue work during or after the stop. + try + { + Gate?.SuspendIntake(); + } + catch (Exception e) + { + logger.LogWarning(e, "Failed to suspend queue intake for {Entity} on leadership loss.", typeof(TEntity).Name); + } + + // Cancel the dequeue loop and any in-flight reconciliation. Non-blocking on purpose: we no longer hold + // leadership, so we abort and move on without waiting (the host-shutdown drain is a separate path). + _ = RequestStopAsync(); + + // Clear the work the former leader had already queued (best-effort). + try + { + Gate?.Clear(); + } + catch (Exception e) + { + logger.LogWarning(e, "Failed to clear the queue for {Entity} on leadership loss.", typeof(TEntity).Name); + } + } +} diff --git a/src/KubeOps.Operator/Queue/TimedEntityQueue.cs b/src/KubeOps.Operator/Queue/TimedEntityQueue.cs index b75b0a00..211ccaf2 100644 --- a/src/KubeOps.Operator/Queue/TimedEntityQueue.cs +++ b/src/KubeOps.Operator/Queue/TimedEntityQueue.cs @@ -33,7 +33,7 @@ namespace KubeOps.Operator.Queue; /// for reconciliation operations. /// /// -public sealed class TimedEntityQueue : ITimedEntityQueue +public sealed class TimedEntityQueue : ITimedEntityQueue, ISuspendableEntityQueue where TEntity : IKubernetesObject { /// @@ -56,9 +56,19 @@ public sealed class TimedEntityQueue : ITimedEntityQueue // Cancellation token source for the timer loop. private readonly CancellationTokenSource _timerCts = new(); + // Guards the intake gate together with all mutations of _management/_queue that must be atomic with the + // gate check (Enqueue scheduling, Clear, and timer promotion). Without this, a leadership transition has + // a TOCTOU race: Enqueue passes the gate check, SuspendIntake + Clear run, and Enqueue then re-adds an + // entry after the clear. + private readonly object _gateLock = new(); + // Task that runs the timer loop. private readonly Task _timerTask; + // Set while the instance does not hold leadership; guarded by _gateLock. When true, Enqueue drops new + // entries and the timer promotes nothing. + private bool _intakeSuspended; + // Read by the meter's observation thread (queue-depth gauge) and written on the dispose thread, // hence volatile to ensure the disposed state is observed promptly across threads. private volatile bool _disposed; @@ -80,61 +90,140 @@ public TimedEntityQueue(ILogger> logger, OperatorMetri internal int Count => _management.Count; + // Number of entries already promoted to the ready queue. + internal int ReadyCount => _queue.Count; + /// - public Task Enqueue(TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, TimeSpan queueIn, int retryCount, CancellationToken cancellationToken) + public Task Enqueue(TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, TimeSpan queueIn, int retryCount, CancellationToken cancellationToken) { + if (cancellationToken.IsCancellationRequested) + { + return Task.FromResult(false); + } + var key = this.GetKey(entity) ?? throw new InvalidOperationException("Cannot enqueue entities without name."); - _management - .AddOrUpdate( - key, - _ => - { - _logger - .LogTrace( - """Scheduling entity "{Identifier}" to reconcile in {Seconds}s.""", - entity.ToIdentifierString(), - queueIn.TotalSeconds); - - return new(entity, type, reconciliationTriggerSource, queueIn, retryCount); - }, - (_, oldEntry) => - { - var newQueueIn = queueIn; - var oldQueueIn = TimeSpan.FromTicks(Math.Max(0, oldEntry.EnqueueAt.Subtract(DateTimeOffset.UtcNow).Ticks)); + // The gate check and the scheduling mutation must be atomic with SuspendIntake/Clear (see _gateLock), + // otherwise a former leader could re-add an entry right after the queue was cleared on leadership loss. + lock (_gateLock) + { + if (_intakeSuspended) + { + _logger + .LogTrace( + """Intake suspended for {Entity}; dropping enqueue of "{Identifier}" (trigger {Trigger}).""", + typeof(TEntity).Name, + entity.ToIdentifierString(), + reconciliationTriggerSource.ToMetricString()); + return Task.FromResult(false); + } - // the earliest execution time should be kept, - if (oldQueueIn <= newQueueIn) + _management + .AddOrUpdate( + key: key, + addValueFactory: _ => { - newQueueIn = oldQueueIn; - _logger .LogTrace( - """Keeping scheduled entity "{Identifier}" to reconcile in {Seconds}s.""", + """Scheduling entity "{Identifier}" to reconcile in {Seconds}s.""", entity.ToIdentifierString(), - newQueueIn.TotalSeconds); - } - else + queueIn.TotalSeconds); + + return new(entity, type, reconciliationTriggerSource, queueIn, retryCount); + }, + updateValueFactory: (_, oldEntry) => { - _logger - .LogTrace( - """Updating scheduled entity "{Identifier}" to reconcile in {Seconds}s.""", - entity.ToIdentifierString(), - newQueueIn.TotalSeconds); - } + var newQueueIn = queueIn; + var oldQueueIn = TimeSpan.FromTicks( + Math.Max(0, oldEntry.EnqueueAt.Subtract(DateTimeOffset.UtcNow).Ticks)); - // schedule deleted reconciliations must not be cancelled - var newReconciliationType = oldEntry.ReconciliationType == ReconciliationType.Deleted - ? oldEntry.ReconciliationType - : type; + // the earliest execution time should be kept, + if (oldQueueIn <= newQueueIn) + { + newQueueIn = oldQueueIn; - oldEntry.Cancel(); - return new(entity, newReconciliationType, reconciliationTriggerSource, newQueueIn, retryCount); - }); + _logger + .LogTrace( + """Keeping scheduled entity "{Identifier}" to reconcile in {Seconds}s.""", + entity.ToIdentifierString(), + newQueueIn.TotalSeconds); + } + else + { + _logger + .LogTrace( + """Updating scheduled entity "{Identifier}" to reconcile in {Seconds}s.""", + entity.ToIdentifierString(), + newQueueIn.TotalSeconds); + } + + // schedule deleted reconciliations must not be cancelled + var newReconciliationType = oldEntry.ReconciliationType == ReconciliationType.Deleted + ? oldEntry.ReconciliationType + : type; + + oldEntry.Cancel(); + return new(entity, newReconciliationType, reconciliationTriggerSource, newQueueIn, retryCount); + }); + } _metrics?.RecordEnqueue(typeof(TEntity).Name, reconciliationTriggerSource.ToMetricString()); - return Task.CompletedTask; + return Task.FromResult(true); + } + + /// + public void Clear() + { + // Drain both collections atomically with the intake gate, but never CompleteAdding() the + // BlockingCollection — the queue must stay usable after ResumeIntake() once leadership is regained. + lock (_gateLock) + { + var scheduled = _management.Count; + _management.Clear(); + + var ready = 0; + while (_queue.TryTake(out _)) + { + // Discard the ready entry; draining only, never CompleteAdding(). + ready++; + } + + _logger + .LogDebug( + "Cleared entity queue for {Entity} on leadership loss: discarded {Scheduled} scheduled and {Ready} ready entries.", + typeof(TEntity).Name, + scheduled, + ready); + } + } + + /// + public void SuspendIntake() + { + lock (_gateLock) + { + _intakeSuspended = true; + } + + _logger + .LogTrace( + "Intake gate for {Entity} suspended; new and scheduled entries are dropped until resumed.", + typeof(TEntity).Name); + } + + /// + public void ResumeIntake() + { + lock (_gateLock) + { + _intakeSuspended = false; + } + + _logger + .LogTrace( + "Intake gate for {Entity} resumed; accepting new entries again.", + typeof(TEntity).Name); } /// @@ -204,12 +293,33 @@ private async Task ProcessScheduledEntriesAsync() continue; } - if (entry.EnqueueAt > now || !_management.TryRemove(key, out _)) + if (entry.EnqueueAt > now) { continue; } - _queue.TryAdd(entry.ToQueueEntry()); + // Promote atomically with the intake gate: while intake is suspended nothing is + // promoted, and a Clear() on leadership loss cannot be raced by a concurrent + // promotion re-adding an entry to the ready queue after the clear. + lock (_gateLock) + { + if (_intakeSuspended) + { + _logger + .LogTrace( + """Intake suspended for {Entity}; skipping promotion of scheduled entry "{Identifier}".""", + typeof(TEntity).Name, + entry.GetEntityIdentifierString()); + continue; + } + + if (!_management.TryRemove(key, out _)) + { + continue; + } + + _queue.TryAdd(entry.ToQueueEntry()); + } _logger .LogTrace( diff --git a/src/KubeOps.Operator/Reconciliation/Reconciler.cs b/src/KubeOps.Operator/Reconciliation/Reconciler.cs index 1e88949a..d766969d 100644 --- a/src/KubeOps.Operator/Reconciliation/Reconciler.cs +++ b/src/KubeOps.Operator/Reconciliation/Reconciler.cs @@ -63,7 +63,7 @@ await ReconcileDeletion(reconciliationContext, cancellationToken), if (result.RequeueAfter.HasValue) { - await entityQueue + var scheduled = await entityQueue .Enqueue( result.Entity, reconciliationContext.EventType, @@ -72,7 +72,12 @@ await entityQueue retryCount: 0, cancellationToken); - metrics?.RecordRequeue(typeof(TEntity).Name, "operator_requeue"); + // Only count the requeue when it was actually scheduled. A leadership-aware queue with suspended + // intake (leadership just lost) drops the entry and returns false. + if (scheduled) + { + metrics?.RecordRequeue(typeof(TEntity).Name, "operator_requeue"); + } } return result; diff --git a/src/KubeOps.Operator/RestartableHostedService.cs b/src/KubeOps.Operator/RestartableHostedService.cs new file mode 100644 index 00000000..3a9e34f6 --- /dev/null +++ b/src/KubeOps.Operator/RestartableHostedService.cs @@ -0,0 +1,393 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics; + +using KubeOps.Operator.Retry; + +using Microsoft.Extensions.Hosting; + +namespace KubeOps.Operator; + +/// +/// Base class for a hosted service that runs a single, restartable background loop and drains all in-flight work +/// before its resources are disposed. It is the shared lifecycle for the operator's leadership-aware background +/// services (the queue consumer and the resource watcher). +/// +/// +/// +/// is idempotent: while a loop is already running it does nothing, so a +/// race where two callers start the loop (for example a leadership-aware StartAsync and a concurrent +/// OnStartedLeading callback) can never start two loops. Each loop owns its own +/// and disposes it only when the loop has finished, so a restart never +/// disposes a token source a still-running former loop is still observing. +/// +/// +/// There are two ways to stop the loop, with deliberately different semantics: +/// +/// +/// +/// — the host-shutdown entry point. It requests cancellation and then awaits the +/// drain of all in-flight work, bounded by its cancellation token (the host shutdown deadline) and +/// . This honors the contract that a graceful host stop +/// waits for the service to stop. +/// +/// +/// — requests cancellation without waiting, for callers that must +/// not block (such as a leadership-loss callback). On leadership loss the in-flight reconciliation is cancelled and +/// the operator moves on — it does not wait for it to finish — which matches controller-runtime / +/// other operator SDKs. (KubeOps does not terminate the process on leadership loss, so cancellation is cooperative +/// rather than a hard os.Exit.) +/// +/// +/// +/// "Draining" never means letting in-flight work complete: the work is cancelled first, and the bounded +/// wait only lets the already-cancelled loop unwind so that shared resources are not torn down underneath +/// a still-running worker. A flap (request-stop then start) can briefly leave the previous loop draining while the +/// next loop already runs; both are tracked, and disposal drains all of them (bounded by +/// ) before subclass resources are released. A reconciler that ignores its +/// cannot block shutdown beyond the grace period. +/// +/// +public abstract class RestartableHostedService : IHostedService, IDisposable, IAsyncDisposable +{ + // Guards the start/stop lifecycle: _running (idempotency gate for the current term) and _activeRuns. + private readonly object _lifecycleLock = new(); + + // Every loop that has been started and not yet finished, with the token source it owns. Normally one, but a + // flap (stop -> start) can briefly leave the previous loop still draining while the next one runs. Disposal + // drains them ALL. + private readonly List<(Task Loop, CancellationTokenSource Cts)> _activeRuns = []; + + private bool _running; + private volatile bool _disposed; + + /// + /// Bounds how long a dispose waits for in-flight work to drain. A non-cooperative loop body that ignores its + /// cannot block shutdown beyond this. Internal so tests can shorten it. + /// + internal TimeSpan DrainGracePeriod { get; set; } = TimeSpan.FromSeconds(30); + + /// + /// Computes the back-off delay before a faulted loop is restarted, from the consecutive fault count. + /// Defaults to an exponential back-off with jitter; internal so tests can make restarts immediate. + /// + internal Func FaultBackoff { get; set; } = ExponentialRetryBackoff.GetDelayWithJitter; + + /// + /// How long a loop iteration must run before a subsequent fault is treated as a fresh failure and the + /// back-off counter is reset to zero. This keeps the back-off escalating for a tight crash loop while not + /// penalising an isolated fault that occurs after a long healthy run. Must exceed the maximum back-off. + /// Internal so tests can control it. + /// + internal TimeSpan FaultBackoffResetThreshold { get; set; } = TimeSpan.FromMinutes(1); + + /// + /// + /// Idempotent: starts the background loop only if one is not already running. The loop is scheduled with + /// ; the is not used + /// for the loop itself (it would only cancel host startup scheduling) — the loop is cancelled via + /// / disposal. + /// + public virtual Task StartAsync(CancellationToken cancellationToken) + { + lock (_lifecycleLock) + { + if (_disposed || _running) + { + return Task.CompletedTask; + } + + var cts = new CancellationTokenSource(); + _running = true; + var loop = Task.Run(() => RunLoopAsync(cts), CancellationToken.None); + _activeRuns.Add((loop, cts)); + + return Task.CompletedTask; + } + } + + /// + /// + /// Honors the contract: it requests the loop to stop and then awaits the + /// drain of all in-flight work, bounded by (the host shutdown + /// deadline) and . It drains whatever is still active even if a prior + /// already cleared the running state. For a non-blocking stop (e.g. from a + /// leadership-loss callback) use . + /// + public virtual async Task StopAsync(CancellationToken cancellationToken) + { + (Task Loop, CancellationTokenSource Cts)[] runs; + lock (_lifecycleLock) + { + if (_disposed) + { + return; + } + + _running = false; + runs = [.. _activeRuns]; + } + + await DrainRunsAsync(runs, cancellationToken); + } + + /// + public void Dispose() + { + Dispose(true); + GC.SuppressFinalize(this); + } + + /// + public async ValueTask DisposeAsync() + { + await DisposeAsyncCore(); + GC.SuppressFinalize(this); + } + + /// Disposes a resource, preferring its asynchronous disposal when available. + /// The resource to dispose. + /// A task that represents the asynchronous dispose operation. + protected static async ValueTask CastAndDisposeAsync(IDisposable resource) + { + if (resource is IAsyncDisposable resourceAsyncDisposable) + { + await resourceAsyncDisposable.DisposeAsync(); + } + else + { + resource.Dispose(); + } + } + + /// + /// Requests the loop to stop without waiting for it to drain — for callers that must not block, such as a + /// leadership-loss callback. The remaining drain is awaited by the next or by disposal. + /// + /// A task that completes once cancellation has been signalled (not when the loop has drained). + protected Task RequestStopAsync() + { + (Task Loop, CancellationTokenSource Cts)[] runs; + lock (_lifecycleLock) + { + if (_disposed || !_running) + { + return Task.CompletedTask; + } + + // Clear the desired-running state so a subsequent StartAsync (e.g. on re-acquired leadership) starts a + // fresh loop instead of being suppressed by the idempotency guard. + _running = false; + runs = [.. _activeRuns]; + } + + return CancelRunsAsync(runs); + } + + /// + /// Runs one iteration of the background loop. Implementations must honor + /// and return when it is cancelled. + /// + /// A token signalled when the loop should stop. + /// A task that completes when the loop has stopped. + protected abstract Task ExecuteAsync(CancellationToken cancellationToken); + + /// + /// Called each time exits with an unexpected error (any exception other than its own + /// cancellation), before the loop is restarted with a back-off (see ). Override to log + /// it so a faulting loop is visible. Defaults to no-op. + /// + /// The exception the loop exited with. + protected virtual void OnLoopFaulted(Exception exception) + { + } + + /// + /// Releases subclass-managed resources synchronously. Called from . + /// + protected virtual void DisposeManagedResources() + { + } + + /// + /// Releases subclass-managed resources asynchronously, after all loops have been drained. Called from + /// . Defaults to . + /// + /// A task that represents the asynchronous release operation. + protected virtual ValueTask DisposeManagedResourcesAsync() + { + DisposeManagedResources(); + return ValueTask.CompletedTask; + } + + /// + /// Cleanup that must run before the loops are drained — for example unsubscribing event handlers + /// so that no leadership callback can start a new loop while disposal is in progress. Runs on both the synchronous + /// () and asynchronous () disposal paths, so subclasses + /// do not have to override both dispose methods to be safe. + /// + protected virtual void OnDisposing() + { + } + + /// Releases the resources used by the service. + /// Whether the method is called from . + /// + /// The synchronous path cannot await the loops to drain (see for the draining + /// path); the container disposes via when available, so this is the best-effort + /// fallback. + /// + protected virtual void Dispose(bool disposing) + { + if (!disposing || _disposed) + { + return; + } + + OnDisposing(); + DisposeManagedResources(); + _disposed = true; + } + + /// Asynchronously drains all loops and releases the resources used by the service. + /// + /// Overriding subclasses must call base.DisposeAsyncCore() so the loops are drained and resources are + /// released on the asynchronous disposal path too. + /// + /// A task that represents the asynchronous dispose operation. + protected virtual async ValueTask DisposeAsyncCore() + { + if (_disposed) + { + return; + } + + (Task Loop, CancellationTokenSource Cts)[] runs; + lock (_lifecycleLock) + { + _running = false; + + // Mark disposed under the lock and before the drain so a leadership callback racing the drain is + // suppressed by StartAsync's guard and cannot add a new loop that would escape the snapshot below. + _disposed = true; + runs = [.. _activeRuns]; + } + + // Unsubscribe before draining so no leadership callback can start a fresh loop during the drain. + OnDisposing(); + await DrainRunsAsync(runs, CancellationToken.None); + await DisposeManagedResourcesAsync(); + } + + private static async Task CancelRunsAsync(IReadOnlyCollection<(Task Loop, CancellationTokenSource Cts)> runs) + { + foreach (var (_, cts) in runs) + { + try + { + await cts.CancelAsync(); + } + catch (ObjectDisposedException) + { + // The loop already finished and disposed its own token source; nothing to cancel. + } + } + } + + private async Task DrainRunsAsync( + IReadOnlyCollection<(Task Loop, CancellationTokenSource Cts)> runs, CancellationToken cancellationToken) + { + if (runs.Count == 0) + { + return; + } + + await CancelRunsAsync(runs); + + try + { + // Bound the wait so a non-cooperative loop body cannot block shutdown indefinitely; after the grace + // elapses we proceed (documented limitation). + await Task.WhenAll(runs.Select(r => r.Loop)).WaitAsync(DrainGracePeriod, cancellationToken); + } + catch (TimeoutException) + { + // Grace elapsed while work was still running; proceed with disposal. + } + catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) + { + // Host shutdown deadline reached; proceed. + } + } + + // Runs one loop and owns its CancellationTokenSource: it disposes the source only after the loop has finished, + // so StartAsync never disposes a token source that this loop is still using. A loop that faults (any exception + // other than its own cancellation) is restarted in place with a back-off, so a single transient error cannot + // silently stop a service for the rest of the process lifetime; only cancellation or a clean return ends it. + private async Task RunLoopAsync(CancellationTokenSource cts) + { + uint faultRetries = 0; + try + { + while (!cts.IsCancellationRequested) + { + var startedAt = Stopwatch.GetTimestamp(); + try + { + await ExecuteAsync(cts.Token); + + // ExecuteAsync returned without being cancelled: the loop body decided it is done. A clean + // return is not a fault, so it is not restarted. + break; + } + catch (OperationCanceledException) when (cts.IsCancellationRequested) + { + // Expected: the loop was cancelled by a stop or disposal. Nothing to report. + break; + } + catch (Exception exception) + { + // The loop exited with an unexpected error (not its own cancellation). Surface it, then wait a + // back-off and restart so the service keeps working after a transient failure. + OnLoopFaulted(exception); + + // If the loop ran healthily for a while before faulting, treat this as a fresh failure and + // restart the back-off from the beginning, so an isolated late fault is not penalised by + // earlier, already-recovered ones. A tight crash loop keeps escalating. + if (Stopwatch.GetElapsedTime(startedAt) >= FaultBackoffResetThreshold) + { + faultRetries = 0; + } + + try + { + await Task.Delay(FaultBackoff(++faultRetries), cts.Token); + } + catch (OperationCanceledException) when (cts.IsCancellationRequested) + { + break; + } + } + } + } + finally + { + lock (_lifecycleLock) + { + _activeRuns.RemoveAll(r => ReferenceEquals(r.Cts, cts)); + + // If no loop remains while we still think we are running, the loop exited without a stop request + // (ExecuteAsync returned cleanly). Reset so a subsequent StartAsync can restart it instead of + // being suppressed forever by the idempotency guard. + if (_activeRuns.Count == 0) + { + _running = false; + } + } + + cts.Dispose(); + } + } +} diff --git a/src/KubeOps.Operator/Watcher/LeaderAwareResourceWatcher{TEntity}.cs b/src/KubeOps.Operator/Watcher/LeaderAwareResourceWatcher{TEntity}.cs index 72e9ca33..fb8896a7 100644 --- a/src/KubeOps.Operator/Watcher/LeaderAwareResourceWatcher{TEntity}.cs +++ b/src/KubeOps.Operator/Watcher/LeaderAwareResourceWatcher{TEntity}.cs @@ -11,10 +11,10 @@ using KubeOps.Abstractions.Builder; using KubeOps.Abstractions.Entities; using KubeOps.KubernetesClient; +using KubeOps.Operator.LeaderElection; using KubeOps.Operator.Metrics; using KubeOps.Operator.Queue; -using Microsoft.Extensions.Hosting; using Microsoft.Extensions.Logging; using ZiggyCreatures.Caching.Fusion; @@ -30,7 +30,6 @@ public class LeaderAwareResourceWatcher( IEntityLabelSelector labelSelector, IEntityFieldSelector fieldSelector, IKubernetesClient client, - IHostApplicationLifetime hostApplicationLifetime, LeaderElector elector, OperatorMetrics? metrics = null) : ResourceWatcher( @@ -45,71 +44,68 @@ public class LeaderAwareResourceWatcher( metrics) where TEntity : IKubernetesObject { - private CancellationTokenSource _cts = new(); - private bool _disposed; + private LeaderElectionSubscription? _subscription; - public override async Task StartAsync(CancellationToken cancellationToken) + public override Task StartAsync(CancellationToken cancellationToken) { logger.LogDebug("Subscribe for leadership updates."); - elector.OnStartedLeading += StartedLeading; - elector.OnStoppedLeading += StoppedLeading; + _subscription ??= new(elector, StartedLeading, StoppedLeading); + _subscription.Subscribe(); - if (elector.IsLeader()) - { - using CancellationTokenSource linkedCancellationTokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken, _cts.Token); - await base.StartAsync(linkedCancellationTokenSource.Token); - } + // Only start watching while leadership is actually held. StartedLeading/StoppedLeading restart and stop + // the base watch loop on leadership transitions. + return _subscription.IsLeader ? base.StartAsync(cancellationToken) : Task.CompletedTask; } public override Task StopAsync(CancellationToken cancellationToken) { logger.LogDebug("Unsubscribe from leadership updates."); - if (_disposed) - { - return Task.CompletedTask; - } - - elector.OnStartedLeading -= StartedLeading; - elector.OnStoppedLeading -= StoppedLeading; + _subscription?.Unsubscribe(); - return elector.IsLeader() ? base.StopAsync(cancellationToken) : Task.CompletedTask; + // Always delegate to the base stop: it drains the watch loop, bounded by the host shutdown token, so the + // watcher is reliably awaited and torn down on host shutdown even when leadership was already lost. + return base.StopAsync(cancellationToken); } - protected override void Dispose(bool disposing) - { - if (!disposing) - { - return; - } - - _cts.Dispose(); - elector.Dispose(); - _disposed = true; - - base.Dispose(disposing); - } + /// + protected override void OnDisposing() => _subscription?.Unsubscribe(); private void StartedLeading() { logger.LogInformation("This instance started leading, starting watcher."); - if (_cts.IsCancellationRequested) - { - _cts.Dispose(); - _cts = new(); - } - - base.StartAsync(_cts.Token); + // The base watcher recreates its cancellation source when it was previously cancelled, so this + // restarts the watch after a leadership loss. The token passed here is unused by the base watcher. + base.StartAsync(CancellationToken.None); } private void StoppedLeading() { - _cts.Cancel(); - logger.LogInformation("This instance stopped leading, stopping watcher."); - EntityCache.Clear(); - _ = base.StopAsync(hostApplicationLifetime.ApplicationStopped); + // Stop FIRST: this runs inside the elector's OnStoppedLeading callback, so the safety-critical action + // (abort the watch so a former leader stops enqueuing) must not be skipped by a failure in the best-effort + // cache cleanup below. RequestStopAsync is non-blocking on purpose — we no longer hold leadership, so we + // cancel and move on without waiting (the host-shutdown drain is a separate path). + _ = RequestStopAsync(); + + // EntityCache is a single named cache shared by all entity watchers (keyed by entity UID). Remove only THIS + // entity type's entries (tagged with EntityCacheTag) so a leadership loss does not wipe the dedup tokens of + // unrelated entity types that share this cache instance. Best-effort and guarded: a cache that has tagging + // disabled (custom configuration) would throw, and that exception must not propagate into the elector + // callback (it would be misattributed as a leadership-hold failure). + try + { + EntityCache.RemoveByTag(EntityCacheTag); + } + catch (Exception e) + { + logger.LogWarning( + e, + "Failed to drop deduplication cache entries for {Entity} on leadership loss. The watch has still " + + "been stopped; stale dedup tokens may briefly survive into the next leadership term.", + typeof(TEntity).Name); + } } } diff --git a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs index c9807b70..e28f1e1f 100644 --- a/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs +++ b/src/KubeOps.Operator/Watcher/ResourceWatcher{TEntity}.cs @@ -21,7 +21,6 @@ using KubeOps.Operator.Reconciliation; using KubeOps.Operator.Retry; -using Microsoft.Extensions.Hosting; using Microsoft.Extensions.Logging; using ZiggyCreatures.Caching.Fusion; @@ -38,15 +37,20 @@ public class ResourceWatcher( IEntityFieldSelector fieldSelector, IKubernetesClient client, OperatorMetrics? metrics = null) - : IHostedService, IAsyncDisposable, IDisposable + : RestartableHostedService where TEntity : IKubernetesObject { - private CancellationTokenSource _cancellationTokenSource = new(); + private readonly string[] _entityCacheTags = [typeof(TEntity).FullName ?? typeof(TEntity).Name]; + private uint _watcherReconnectRetries; - private Task? _eventWatcher; - private bool _disposed; - ~ResourceWatcher() => Dispose(false); + /// + /// Gets the tag applied to every cached deduplication entry of this entity type. The dedup cache is a single + /// named FusionCache instance shared by all entity watchers (keyed by entity UID); the tag lets a + /// leadership-aware subclass drop only this entity type's entries (see + /// ) instead of clearing every entity's entries. + /// + protected string EntityCacheTag => _entityCacheTags[0]; /// /// Gets the fusion cache used to store a strategy-dependent deduplication token for each @@ -60,8 +64,9 @@ public class ResourceWatcher( /// /// /// Subclasses may access this cache to read or invalidate cached tokens. For example, - /// calls - /// when leadership is lost to ensure stale data is not carried over to the next watch session. + /// removes this entity type's entries by + /// when leadership is lost, so stale data is not carried over to the next watch + /// session — without disturbing other entity types that share this cache instance. /// /// /// Note: when an entity has a DeletionTimestamp set (finalizer processing), the @@ -71,96 +76,36 @@ public class ResourceWatcher( /// /// protected IFusionCache EntityCache { get; } = cacheProvider.GetCache( - settings.ReconcileStrategy == ReconcileStrategy.ByResourceVersion - ? CacheConstants.CacheNames.ResourceWatcherByResourceVersion - : CacheConstants.CacheNames.ResourceWatcher); + CacheConstants.ResourceWatcherCacheNameFor(settings.ReconcileStrategy)); - public virtual Task StartAsync(CancellationToken cancellationToken) + /// + public override Task StartAsync(CancellationToken cancellationToken) { logger.LogInformation("Starting resource watcher for {ResourceType}.", typeof(TEntity).Name); - - if (_cancellationTokenSource.IsCancellationRequested) - { - _cancellationTokenSource.Dispose(); - _cancellationTokenSource = new(); - } - - _eventWatcher = WatchClientEventsAsync(_cancellationTokenSource.Token); - + var result = base.StartAsync(cancellationToken); logger.LogInformation("Started resource watcher for {ResourceType}.", typeof(TEntity).Name); - return Task.CompletedTask; + return result; } - public virtual async Task StopAsync(CancellationToken cancellationToken) + /// + public override Task StopAsync(CancellationToken cancellationToken) { logger.LogInformation("Stopping resource watcher for {ResourceType}.", typeof(TEntity).Name); - if (_disposed) - { - return; - } - - await _cancellationTokenSource.CancelAsync(); - if (_eventWatcher is not null) - { - await _eventWatcher.WaitAsync(cancellationToken); - } - - logger.LogInformation("Stopped resource watcher for {ResourceType}.", typeof(TEntity).Name); + return base.StopAsync(cancellationToken); } - public async ValueTask DisposeAsync() - { - await StopAsync(CancellationToken.None); - await DisposeAsyncCore(); - GC.SuppressFinalize(this); - } + /// + protected override void OnLoopFaulted(Exception exception) => + logger.LogError( + exception, + "The watch loop for {ResourceType} exited unexpectedly and stopped watching.", + typeof(TEntity).Name); - public void Dispose() - { - Dispose(true); - GC.SuppressFinalize(this); - } + /// + protected override void DisposeManagedResources() => client.Dispose(); - protected virtual void Dispose(bool disposing) - { - if (!disposing) - { - return; - } - - _cancellationTokenSource.Dispose(); - _eventWatcher?.Dispose(); - client.Dispose(); - - _disposed = true; - } - - protected virtual async ValueTask DisposeAsyncCore() - { - if (_eventWatcher is not null) - { - await CastAndDispose(_eventWatcher); - } - - await CastAndDispose(_cancellationTokenSource); - await CastAndDispose(client); - - _disposed = true; - - return; - - static async ValueTask CastAndDispose(IDisposable resource) - { - if (resource is IAsyncDisposable resourceAsyncDisposable) - { - await resourceAsyncDisposable.DisposeAsync(); - } - else - { - resource.Dispose(); - } - } - } + /// + protected override ValueTask DisposeManagedResourcesAsync() => CastAndDisposeAsync(client); protected virtual async Task OnEventAsync(WatchEventType eventType, TEntity entity, CancellationToken cancellationToken) { @@ -227,7 +172,7 @@ protected virtual async Task OnEventAsync(WatchEventType eventType, TEntity enti } } - await entityQueue + var enqueued = await entityQueue .Enqueue( entity, eventType.ToReconciliationType(), @@ -236,6 +181,15 @@ await entityQueue retryCount: 0, cancellationToken); + if (!enqueued) + { + logger + .LogTrace( + """Enqueue of "{Identifier}" was dropped; leaving deduplication cache unchanged.""", + entity.ToIdentifierString()); + return; + } + if (eventType == WatchEventType.Deleted) { await EntityCache.RemoveAsync(entity.Uid(), token: cancellationToken); @@ -249,6 +203,7 @@ await entityQueue await EntityCache.SetAsync( deletionTrackingEntry.CacheKey, deletionTrackingEntry.Fingerprint, + tags: _entityCacheTags, token: cancellationToken); break; @@ -256,6 +211,7 @@ await EntityCache.SetAsync( await EntityCache.SetAsync( entity.Uid(), entity.Generation() ?? 1, + tags: _entityCacheTags, token: cancellationToken); break; @@ -263,6 +219,7 @@ await EntityCache.SetAsync( await EntityCache.SetAsync( entity.Uid(), entity.ResourceVersion(), + tags: _entityCacheTags, token: cancellationToken); break; @@ -270,33 +227,22 @@ await EntityCache.SetAsync( } } - private static string GetDeletionCacheKey(TEntity entity) - => $"{entity.Uid()}:deletion"; - - private static string GetDeletionFingerprint(TEntity entity) - => string.Join( - ':', - "deleting", - entity.Metadata.DeletionTimestamp?.ToUniversalTime().ToString("O"), - entity.Metadata.DeletionGracePeriodSeconds, - entity.Generation(), - string.Join(',', entity.Finalizers() ?? [])); - - private async Task WatchClientEventsAsync(CancellationToken stoppingToken) + /// + protected override async Task ExecuteAsync(CancellationToken cancellationToken) { string? currentVersion = null; - while (!stoppingToken.IsCancellationRequested) + while (!cancellationToken.IsCancellationRequested) { try { await foreach ((WatchEventType type, TEntity entity) in client.WatchAsync( settings.Namespace, resourceVersion: currentVersion, - labelSelector: await labelSelector.GetLabelSelectorAsync(stoppingToken), - fieldSelector: await fieldSelector.GetFieldSelectorAsync(stoppingToken), + labelSelector: await labelSelector.GetLabelSelectorAsync(cancellationToken), + fieldSelector: await fieldSelector.GetFieldSelectorAsync(cancellationToken), allowWatchBookmarks: true, - cancellationToken: stoppingToken)) + cancellationToken: cancellationToken)) { using var activity = activitySource.StartActivity($"""processing "{type}" event""", ActivityKind.Consumer); using var scope = logger.BeginScope(EntityLoggingScope.CreateFor(type, entity)); @@ -318,7 +264,7 @@ private async Task WatchClientEventsAsync(CancellationToken stoppingToken) try { - await OnEventAsync(type, entity, stoppingToken); + await OnEventAsync(type, entity, cancellationToken); } catch (Exception e) { @@ -333,7 +279,7 @@ private async Task WatchClientEventsAsync(CancellationToken stoppingToken) _watcherReconnectRetries = 0; } - catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested) + catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { break; } @@ -346,10 +292,10 @@ private async Task WatchClientEventsAsync(CancellationToken stoppingToken) } catch (Exception e) { - await OnWatchErrorAsync(e); + await OnWatchErrorAsync(e, cancellationToken); } - if (stoppingToken.IsCancellationRequested) + if (cancellationToken.IsCancellationRequested) { break; } @@ -360,7 +306,19 @@ private async Task WatchClientEventsAsync(CancellationToken stoppingToken) } } - private async Task OnWatchErrorAsync(Exception e) + private static string GetDeletionCacheKey(TEntity entity) + => $"{entity.Uid()}:deletion"; + + private static string GetDeletionFingerprint(TEntity entity) + => string.Join( + ':', + "deleting", + entity.Metadata.DeletionTimestamp?.ToUniversalTime().ToString("O"), + entity.Metadata.DeletionGracePeriodSeconds, + entity.Generation(), + string.Join(',', entity.Finalizers() ?? [])); + + private async Task OnWatchErrorAsync(Exception e, CancellationToken cancellationToken) { switch (e) { @@ -390,7 +348,15 @@ e.InnerException is EndOfStreamException && "There were {Retries} errors / retries in the watcher. Wait {Seconds}s before next attempt to connect.", _watcherReconnectRetries, delay.TotalSeconds); - await Task.Delay(delay); + + try + { + await Task.Delay(delay, cancellationToken); + } + catch (OperationCanceledException) + { + // Stop or leadership loss during the backoff + } } private sealed record DeletionTrackingEntry(string CacheKey, string Fingerprint); diff --git a/test/KubeOps.Operator.Test/Builder/CacheExtensions.Test.cs b/test/KubeOps.Operator.Test/Builder/CacheExtensions.Test.cs index dcfd11e6..de79d482 100644 --- a/test/KubeOps.Operator.Test/Builder/CacheExtensions.Test.cs +++ b/test/KubeOps.Operator.Test/Builder/CacheExtensions.Test.cs @@ -46,4 +46,31 @@ public void WithResourceWatcherEntityCaching_Should_Invoke_Custom_Cache_Configur configuratorInvoked.Should().BeTrue(); } + + [Fact] + public async Task Default_ResourceWatcher_Cache_Supports_Per_Entity_Tag_Removal() + { + // The leadership-aware watcher relies on FusionCache tagging: each dedup entry is tagged with its entity + // type, and on leadership loss only that type's entries are dropped via RemoveByTag. This verifies the + // default cache registration actually supports tagging at runtime (the unit tests elsewhere mock the cache), + // and that a tag removal affects only the matching entries. + var services = new ServiceCollection(); + services.WithResourceWatcherEntityCaching( + new OperatorSettingsBuilder { ReconcileStrategy = ReconcileStrategy.ByGeneration }.Build()); + var provider = services.BuildServiceProvider(); + var cache = provider.GetRequiredService() + .GetCache(CacheConstants.CacheNames.ResourceWatcher); + + var token = TestContext.Current.CancellationToken; + await cache.SetAsync("uid-a", 1L, tags: ["EntityA"], token: token); + await cache.SetAsync("uid-b", 2L, tags: ["EntityB"], token: token); + + await cache.RemoveByTagAsync("EntityA", token: token); + + (await cache.TryGetAsync("uid-a", token: token)).HasValue + .Should().BeFalse("EntityA's tagged entry must be removed"); + var remaining = await cache.TryGetAsync("uid-b", token: token); + remaining.HasValue.Should().BeTrue("EntityB's entry shares the cache but a different tag"); + remaining.Value.Should().Be(2L); + } } diff --git a/test/KubeOps.Operator.Test/Builder/OperatorBuilderQueueStrategy.Test.cs b/test/KubeOps.Operator.Test/Builder/OperatorBuilderQueueStrategy.Test.cs index 1293a79e..453dd94f 100644 --- a/test/KubeOps.Operator.Test/Builder/OperatorBuilderQueueStrategy.Test.cs +++ b/test/KubeOps.Operator.Test/Builder/OperatorBuilderQueueStrategy.Test.cs @@ -33,6 +33,27 @@ public void Should_Register_TimedEntityQueue_And_BackgroundService_For_InMemory_ s.ImplementationType == typeof(EntityQueueBackgroundService)); } + [Fact] + [Trait("Area", "LeaderLoss")] + public void Should_Register_LeaderAware_BackgroundService_For_Single_LeaderElection() + { + var builder = new OperatorBuilder( + new ServiceCollection(), + new OperatorSettingsBuilder + { + QueueStrategy = QueueStrategy.InMemory, + LeaderElectionType = LeaderElectionType.Single, + }.Build()); + builder.AddController(); + + builder.Services.Should().Contain(s => + s.ServiceType == typeof(IHostedService) && + s.ImplementationType == typeof(LeaderAwareEntityQueueBackgroundService)); + builder.Services.Should().NotContain(s => + s.ServiceType == typeof(IHostedService) && + s.ImplementationType == typeof(EntityQueueBackgroundService)); + } + [Fact] public void Should_Not_Register_TimedEntityQueue_Or_BackgroundService_For_Custom_Strategy() { diff --git a/test/KubeOps.Operator.Test/Builder/OperatorBuilderRegistrationValidation.Test.cs b/test/KubeOps.Operator.Test/Builder/OperatorBuilderRegistrationValidation.Test.cs new file mode 100644 index 00000000..bcedfa11 --- /dev/null +++ b/test/KubeOps.Operator.Test/Builder/OperatorBuilderRegistrationValidation.Test.cs @@ -0,0 +1,637 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using FluentAssertions; + +using k8s; +using k8s.Models; + +using KubeOps.Abstractions.Builder; +using KubeOps.Abstractions.Reconciliation; +using KubeOps.Abstractions.Reconciliation.Controller; +using KubeOps.Abstractions.Reconciliation.Finalizer; +using KubeOps.Operator.Builder; +using KubeOps.Operator.Exceptions; +using KubeOps.Operator.Queue; +using KubeOps.Operator.Test.TestEntities; +using KubeOps.Operator.Watcher; + +using Microsoft.Extensions.DependencyInjection; +using Microsoft.Extensions.Hosting; +using Microsoft.Extensions.Logging; + +using Moq; + +using ZiggyCreatures.Caching.Fusion; + +namespace KubeOps.Operator.Test.Builder; + +[Trait("Area", "RegistrationValidation")] +public sealed class OperatorBuilderRegistrationValidationTest +{ + [Fact] + public void Should_Not_Register_Validator_When_Disabled() + { + var builder = new OperatorBuilder( + new ServiceCollection(), + new OperatorSettingsBuilder { ValidateRegistrations = false }.Build()); + builder.AddController(); + + builder.Services.Should().NotContain(s => + s.ServiceType == typeof(IHostedService) && + s.ImplementationType == typeof(OperatorRegistrationValidator)); + } + + [Fact] + public void Should_Register_Validator_When_Enabled() + { + var builder = new OperatorBuilder( + new ServiceCollection(), + new OperatorSettingsBuilder { ValidateRegistrations = true }.Build()); + builder.AddController(); + + builder.Services.Should().Contain(s => + s.ServiceType == typeof(IHostedService) && + s.ImplementationType == typeof(OperatorRegistrationValidator)); + builder.Services.Should().Contain(s => s.ServiceType == typeof(OperatorRegistrationRegistry)); + } + + // With the in-memory strategy the SDK registers the queue, watcher and (for None/Single) the consumer, + // so None/Single validate as complete. The custom queue strategy leaves the queue itself unregistered + // (the SDK only auto-registers it for in-memory), so SDK-only registrations are incomplete and must + // fail. Custom leader election additionally leaves watcher and consumer to the user. + [Theory] + [InlineData(LeaderElectionType.None, QueueStrategy.InMemory, true)] + [InlineData(LeaderElectionType.None, QueueStrategy.Custom, false)] + [InlineData(LeaderElectionType.Single, QueueStrategy.InMemory, true)] + [InlineData(LeaderElectionType.Single, QueueStrategy.Custom, false)] + [InlineData(LeaderElectionType.Custom, QueueStrategy.InMemory, false)] + [InlineData(LeaderElectionType.Custom, QueueStrategy.Custom, false)] + public async Task Should_Validate_Sdk_Registrations_For_All_Configuration_Combinations( + LeaderElectionType leaderElectionType, QueueStrategy queueStrategy, bool expectValid) + { + var validator = CreateValidatorForSdkRegistrations(leaderElectionType, queueStrategy); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + if (expectValid) + { + await act.Should().NotThrowAsync(); + } + else + { + await act.Should().ThrowAsync(); + } + } + + [Fact] + public async Task Should_Report_Missing_Watcher_And_Consumer_For_Custom_LeaderElection() + { + var validator = CreateValidatorForSdkRegistrations(LeaderElectionType.Custom, QueueStrategy.InMemory); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("ResourceWatcher") + .And.Contain("IEntityQueueConsumer") + .And.Contain("LeaderElectionType.Custom") + .And.Contain("QueueStrategy.InMemory"); + } + + [Fact] + public async Task Should_Fail_When_Custom_Queue_Strategy_Does_Not_Register_A_Queue() + { + // The watcher and reconciler take ITimedEntityQueue as a constructor dependency, so it is + // required even with QueueStrategy.Custom (where the SDK does not register it). A missing queue would + // otherwise only fail at host startup with a DI error. Use None leader election so the queue is the + // only gap. + var validator = CreateValidatorForSdkRegistrations(LeaderElectionType.None, QueueStrategy.Custom); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("ITimedEntityQueue"); + } + + [Fact] + public async Task Should_Pass_For_Custom_LeaderElection_When_Watcher_And_Consumer_Are_Registered() + { + var validator = CreateValidator( + LeaderElectionType.Custom, + QueueStrategy.InMemory, + services => + { + services.AddHostedService(); + services.AddHostedService(); + }); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + await act.Should().NotThrowAsync(); + } + + [Fact] + public async Task Should_Pass_For_Custom_LeaderElection_With_Custom_Queue_When_Watcher_Queue_And_Consumer_Are_Registered() + { + // With QueueStrategy.Custom the user supplies the queue (still required, since watcher and reconciler + // depend on it) and the consumer (recognised via IEntityQueueConsumer). With Custom leader + // election the suspendable capability is not required. + var validator = CreateValidator( + LeaderElectionType.Custom, + QueueStrategy.Custom, + services => + { + services.AddHostedService(); + services.AddSingleton, NonSuspendableQueue>(); + services.AddHostedService(); + }); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + await act.Should().NotThrowAsync(); + } + + [Fact] + public async Task Should_Fail_When_Queue_Consumer_Is_Missing() + { + // QueueStrategy.Custom + a registered queue but no consumer: nothing drains the queue. The consumer + // is recognised via IEntityQueueConsumer, so its absence is caught. Use None leader election + // (the SDK registers the watcher) and supply only the queue, so the consumer is the only gap. + var validator = CreateValidator( + LeaderElectionType.None, + QueueStrategy.Custom, + services => services.AddSingleton, NonSuspendableQueue>()); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("IEntityQueueConsumer"); + } + + [Fact] + public async Task Should_Fail_When_Finalizer_Is_Registered_Without_A_Controller() + { + // A finalizer added without a controller can never run: there is no reconciliation pipeline for + // the entity. + var settings = new OperatorSettingsBuilder { ValidateRegistrations = true }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddFinalizer("test.finalizer"); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("finalizer") + .And.Contain("AddController"); + } + + [Fact] + public async Task Should_Pass_When_Finalizer_And_Controller_Are_Registered() + { + var settings = new OperatorSettingsBuilder { ValidateRegistrations = true }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + builder.AddFinalizer("test.finalizer"); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + await act.Should().NotThrowAsync(); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Should_Fail_When_Single_LeaderElection_Queue_Lacks_Suspendable_Capability() + { + // LeaderElectionType.Single requires the queue to support the leadership gate + // (ISuspendableEntityQueue). A custom queue without it would silently lose leadership-loss + // protection, so validation must reject it. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.Single, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + // A leadership-aware consumer is present, so the only gap is the non-suspendable queue. + services.AddSingleton, NonSuspendableQueue>(); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain(nameof(ISuspendableEntityQueue)); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Should_Fail_When_Single_LeaderElection_Consumer_Is_Not_Leader_Aware() + { + // The queue supports the gate (ISuspendableEntityQueue), but a plain consumer never drives it, so + // leadership-loss protection would not actually take effect. Validation must reject it under Single. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.Single, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddSingleton, SuspendableQueue>(); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain(nameof(ILeaderAwareEntityQueueConsumer)); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Should_Pass_For_Single_LeaderElection_With_Suspendable_Queue_And_LeaderAware_Consumer() + { + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.Single, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddSingleton, SuspendableQueue>(); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + await act.Should().NotThrowAsync(); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Should_Recognize_Open_Generic_Queue_Registration() + { + // A queue registered as an open generic (AddSingleton(typeof(ITimedEntityQueue<>), typeof(...<>))) + // is resolvable by the DI container, so the validator must recognise both its presence and its + // ISuspendableEntityQueue capability under Single. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.Single, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddSingleton(typeof(ITimedEntityQueue<>), typeof(OpenSuspendableQueue<>)); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + await act.Should().NotThrowAsync(); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Should_Fail_When_Single_LeaderElection_Queue_Is_Factory_Registered() + { + // A factory-registered queue is resolvable, but its concrete type cannot be inspected, so the gate + // capability cannot be verified. Under Single that must fail (consistent with the consumer), instead + // of silently assuming the queue is gateable. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.Single, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddSingleton>(_ => new SuspendableQueue()); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("factory") + .And.Contain(nameof(ISuspendableEntityQueue)); + } + + [Fact] + public async Task Should_Report_Missing_Queue_When_Registered_As_Keyed_Service() + { + // A keyed queue registration does not satisfy the unkeyed constructor dependency of the watcher and + // reconciler, so the validator must not count it as a registered queue. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.None, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddKeyedSingleton, NonSuspendableQueue>("queue-key"); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("ITimedEntityQueue"); + } + + [Fact] + public async Task Should_Report_Missing_Consumer_When_Registered_Via_Factory() + { + // Documented limitation: validation recognises components by their registered implementation type. + // A consumer registered through a DI factory delegate exposes no type, so it is reported as missing. + // Users must register the consumer with a concrete type (or disable validation). + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.None, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddSingleton, NonSuspendableQueue>(); + services.AddHostedService(_ => new CustomConsumer()); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should().Contain("IEntityQueueConsumer"); + } + + [Fact] + public async Task Should_Report_Missing_Queue_When_Open_Generic_Cannot_Close_For_Entity() + { + // An open-generic queue whose generic constraints exclude the managed entity is registered. The DI + // container cannot close ITimedEntityQueue from it, so it does not actually satisfy the + // watcher/reconciler dependency. A name-only (generic type definition) match would be a false + // positive; validation must report the queue as missing. Use None leader election so the queue is the + // only gap. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.None, + QueueStrategy = QueueStrategy.Custom, + ValidateRegistrations = true, + }.Build(); + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + services.AddSingleton(typeof(ITimedEntityQueue<>), typeof(ConstrainedOpenQueue<>)); + services.AddHostedService(); + + var validator = CreateValidator(services, settings); + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should() + .Contain(nameof(V1OperatorIntegrationTestEntity)) + .And.Contain("ITimedEntityQueue"); + } + + [Fact] + public async Task Should_Fail_When_ResourceWatcher_Cache_Has_Tagging_Disabled() + { + // The watcher tags every dedup entry, so FusionCache tagging must stay enabled. A custom cache config + // that disables it would otherwise make every tagged write throw at runtime; validation catches it early. + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = LeaderElectionType.None, + QueueStrategy = QueueStrategy.InMemory, + ValidateRegistrations = true, + ConfigureResourceWatcherEntityCache = b => b.WithOptions(o => o.DisableTagging = true), + }.Build(); + + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + var validator = CreateValidator(services, settings); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + (await act.Should().ThrowAsync()) + .Which.Message.Should().Contain("tagging"); + } + + [Fact] + public async Task Should_Pass_When_ResourceWatcher_Cache_Has_Tagging_Enabled_By_Default() + { + // The default cache registration keeps tagging enabled, so the tagging probe must not add a problem. + var validator = CreateValidatorForSdkRegistrations(LeaderElectionType.None, QueueStrategy.InMemory); + + var act = async () => await validator.StartingAsync(TestContext.Current.CancellationToken); + + await act.Should().NotThrowAsync(); + } + + private static OperatorRegistrationValidator CreateValidatorForSdkRegistrations( + LeaderElectionType leaderElectionType, QueueStrategy queueStrategy) + => CreateValidator(leaderElectionType, queueStrategy); + + private static OperatorRegistrationValidator CreateValidator( + LeaderElectionType leaderElectionType, + QueueStrategy queueStrategy, + Action? registerUserComponents = null) + { + var settings = new OperatorSettingsBuilder + { + LeaderElectionType = leaderElectionType, + QueueStrategy = queueStrategy, + ValidateRegistrations = true, + }.Build(); + + var services = new ServiceCollection(); + var builder = new OperatorBuilder(services, settings); + builder.AddController(); + registerUserComponents?.Invoke(services); + + return CreateValidator(services, settings); + } + + private static OperatorRegistrationValidator CreateValidator(IServiceCollection services, OperatorSettings settings) + { + var registry = (OperatorRegistrationRegistry)services + .Single(d => d.ServiceType == typeof(OperatorRegistrationRegistry)) + .ImplementationInstance!; + + // The tagging check probes the real resource-watcher cache, so resolve the actual FusionCache provider + // the operator base registered (default config = tagging enabled). + var cacheProvider = services.BuildServiceProvider().GetRequiredService(); + + return new OperatorRegistrationValidator( + registry, settings, cacheProvider, Mock.Of>()); + } + + private sealed class TestController : IEntityController + { + public Task> ReconcileAsync( + V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) + => Task.FromResult(ReconciliationResult.Success(entity)); + + public Task> DeletedAsync( + V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) + => Task.FromResult(ReconciliationResult.Success(entity)); + } + + private sealed class TestFinalizer : IEntityFinalizer + { + public Task> FinalizeAsync( + V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) + => Task.FromResult(ReconciliationResult.Success(entity)); + } + + // Test doubles for a user-provided Custom watcher / queue consumer. Validation inspects the service + // descriptors only (it never constructs these), so the null base arguments are never dereferenced. + private sealed class CustomWatcher() : ResourceWatcher( + null!, null!, null!, null!, null!, null!, null!, null!); + + private sealed class CustomConsumer() : EntityQueueBackgroundService( + null!, null!, null!, null!, null!, null!); + + // A leadership-aware consumer (implements ILeaderAwareEntityQueueConsumer via the base class). + private sealed class CustomLeaderAwareConsumer() + : LeaderAwareEntityQueueBackgroundService( + null!, null!, null!, null!, null!, null!, null!); + + // A custom queue that does NOT implement ISuspendableEntityQueue. Validation inspects the registration + // descriptor only and never constructs it, so the members can throw. + private sealed class NonSuspendableQueue : ITimedEntityQueue + { + public Task Enqueue( + V1OperatorIntegrationTestEntity entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => throw new NotSupportedException(); + + public IAsyncEnumerator> GetAsyncEnumerator( + CancellationToken cancellationToken = default) => throw new NotSupportedException(); + + public void Dispose() + { + } + } + + // An open-generic custom queue (registered as AddSingleton(typeof(ITimedEntityQueue<>), typeof(...<>))). + private sealed class OpenSuspendableQueue : ITimedEntityQueue, ISuspendableEntityQueue + where TEntity : IKubernetesObject + { + public Task Enqueue( + TEntity entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => throw new NotSupportedException(); + + public IAsyncEnumerator> GetAsyncEnumerator(CancellationToken cancellationToken = default) => + throw new NotSupportedException(); + + public void Clear() + { + } + + public void SuspendIntake() + { + } + + public void ResumeIntake() + { + } + + public void Dispose() + { + } + } + + // A constraint that V1OperatorIntegrationTestEntity does not satisfy, used to build an open-generic queue + // whose implementation cannot be closed for the managed entity. + private interface IUnsatisfiedQueueConstraint; + + // An open-generic queue whose extra constraint excludes the managed entity, so the DI container cannot + // close ITimedEntityQueue from it. Validation inspects the registration + // descriptor only and never constructs it, so the members can throw. + private sealed class ConstrainedOpenQueue : ITimedEntityQueue + where TEntity : IKubernetesObject, IUnsatisfiedQueueConstraint + { + public Task Enqueue( + TEntity entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => throw new NotSupportedException(); + + public IAsyncEnumerator> GetAsyncEnumerator(CancellationToken cancellationToken = default) => + throw new NotSupportedException(); + + public void Dispose() + { + } + } + + // A custom queue that supports the leadership gate. + private sealed class SuspendableQueue : ITimedEntityQueue, ISuspendableEntityQueue + { + public Task Enqueue( + V1OperatorIntegrationTestEntity entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => throw new NotSupportedException(); + + public IAsyncEnumerator> GetAsyncEnumerator( + CancellationToken cancellationToken = default) => throw new NotSupportedException(); + + public void Clear() + { + } + + public void SuspendIntake() + { + } + + public void ResumeIntake() + { + } + + public void Dispose() + { + } + } +} diff --git a/test/KubeOps.Operator.Test/Controller/DeletedEntityRequeue.Integration.Test.cs b/test/KubeOps.Operator.Test/Controller/DeletedEntityRequeue.Integration.Test.cs index d2ac46b9..ab718e8d 100644 --- a/test/KubeOps.Operator.Test/Controller/DeletedEntityRequeue.Integration.Test.cs +++ b/test/KubeOps.Operator.Test/Controller/DeletedEntityRequeue.Integration.Test.cs @@ -82,11 +82,11 @@ private sealed class TestController(MethodInvocationObserver queue) : IEntityController { - public Task> ReconcileAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) + public async Task> ReconcileAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) { observer.RecordInvocation(entity); - queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromSeconds(60), retryCount: 0, TestContext.Current.CancellationToken); - return Task.FromResult(ReconciliationResult.Success(entity)); + await queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromSeconds(60), retryCount: 0, TestContext.Current.CancellationToken); + return ReconciliationResult.Success(entity); } public Task> DeletedAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) diff --git a/test/KubeOps.Operator.Test/Controller/EntityRequeue.Integration.Test.cs b/test/KubeOps.Operator.Test/Controller/EntityRequeue.Integration.Test.cs index 8b2124c3..6e1b1028 100644 --- a/test/KubeOps.Operator.Test/Controller/EntityRequeue.Integration.Test.cs +++ b/test/KubeOps.Operator.Test/Controller/EntityRequeue.Integration.Test.cs @@ -100,15 +100,15 @@ private class TestController(InvocationCounter EntityQueue queue) : IEntityController { - public Task> ReconcileAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) + public async Task> ReconcileAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) { svc.Invocation(entity); if (svc.Invocations.Count <= svc.TargetInvocationCount) { - queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromMilliseconds(1), retryCount: 0, TestContext.Current.CancellationToken); + await queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromMilliseconds(1), retryCount: 0, TestContext.Current.CancellationToken); } - return Task.FromResult(ReconciliationResult.Success(entity)); + return ReconciliationResult.Success(entity); } public Task> DeletedAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) diff --git a/test/KubeOps.Operator.Test/Controller/ReplaceEntityRequeue.Integration.Test.cs b/test/KubeOps.Operator.Test/Controller/ReplaceEntityRequeue.Integration.Test.cs index 932bc130..d2b20749 100644 --- a/test/KubeOps.Operator.Test/Controller/ReplaceEntityRequeue.Integration.Test.cs +++ b/test/KubeOps.Operator.Test/Controller/ReplaceEntityRequeue.Integration.Test.cs @@ -87,16 +87,16 @@ private class TestController(InvocationCounter EntityQueue queue) : IEntityController { - public Task> ReconcileAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) + public async Task> ReconcileAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) { // schedule on first invocation if (svc.Invocations.Count == 0) { - queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromSeconds(5), retryCount: 0, TestContext.Current.CancellationToken); + await queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromSeconds(5), retryCount: 0, TestContext.Current.CancellationToken); } svc.Invocation(entity); - return Task.FromResult(ReconciliationResult.Success(entity)); + return ReconciliationResult.Success(entity); } public Task> DeletedAsync(V1OperatorIntegrationTestEntity entity, CancellationToken cancellationToken) diff --git a/test/KubeOps.Operator.Test/Events/EventPublisher.Integration.Test.cs b/test/KubeOps.Operator.Test/Events/EventPublisher.Integration.Test.cs index c6998b6e..b8b6a599 100644 --- a/test/KubeOps.Operator.Test/Events/EventPublisher.Integration.Test.cs +++ b/test/KubeOps.Operator.Test/Events/EventPublisher.Integration.Test.cs @@ -97,7 +97,7 @@ public async Task> Reconci if (svc.Invocations.Count < svc.TargetInvocationCount) { - queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromMilliseconds(10), retryCount: 0, TestContext.Current.CancellationToken); + await queue(entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.FromMilliseconds(10), retryCount: 0, TestContext.Current.CancellationToken); } return ReconciliationResult.Success(entity); diff --git a/test/KubeOps.Operator.Test/LeaderElector/LeaderElectionBackgroundService.Test.cs b/test/KubeOps.Operator.Test/LeaderElector/LeaderElectionBackgroundService.Test.cs index 9507c05c..fe2e84b0 100644 --- a/test/KubeOps.Operator.Test/LeaderElector/LeaderElectionBackgroundService.Test.cs +++ b/test/KubeOps.Operator.Test/LeaderElector/LeaderElectionBackgroundService.Test.cs @@ -2,6 +2,8 @@ // The .NET Foundation licenses this file to you under the Apache 2.0 License. // See the LICENSE file in the project root for more information. +using System.Diagnostics; + using FluentAssertions; using k8s.LeaderElection; @@ -14,6 +16,7 @@ namespace KubeOps.Operator.Test.LeaderElector; +[Trait("Area", "LeaderLoss")] public sealed class LeaderElectionBackgroundServiceTest { [Fact] @@ -59,4 +62,42 @@ public async Task Elector_Throws_Should_Retry() await leaderElectionBackgroundService.StopAsync(TestContext.Current.CancellationToken); } + + [Fact] + public async Task Stop_During_Retry_Backoff_Should_Complete_Promptly() + { + // Arrange. + var logger = Mock.Of>(); + + var electionLock = Mock.Of(); + + var firstCall = new TaskCompletionSource(); + Mock.Get(electionLock) + .Setup(el => el.GetAsync(It.IsAny())) + .Returns(_ => + { + // Fail immediately so the service enters its exponential retry back-off (>= 2s). + firstCall.TrySetResult(); + throw new InvalidOperationException("Unit test exception"); + }); + + var leaderElector = new k8s.LeaderElection.LeaderElector(new LeaderElectionConfig(electionLock)); + var leaderElectionBackgroundService = new LeaderElectionBackgroundService(logger, leaderElector); + + // Act. + await leaderElectionBackgroundService.StartAsync(TestContext.Current.CancellationToken); + + var faulted = await Task.WhenAny( + firstCall.Task, + Task.Delay(TimeSpan.FromSeconds(3), TestContext.Current.CancellationToken)); + faulted.Should().Be(firstCall.Task, "the lock attempt should have failed and entered the back-off"); + + // Stopping during the back-off must cancel the wait instead of blocking until the (>= 2s) delay elapses. + var stopwatch = Stopwatch.StartNew(); + await leaderElectionBackgroundService.StopAsync(TestContext.Current.CancellationToken); + stopwatch.Stop(); + + // Assert. + stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2)); + } } diff --git a/test/KubeOps.Operator.Test/Queue/EntityQueueBackgroundService.Test.cs b/test/KubeOps.Operator.Test/Queue/EntityQueueBackgroundService.Test.cs index 384d344f..ae54d33a 100644 --- a/test/KubeOps.Operator.Test/Queue/EntityQueueBackgroundService.Test.cs +++ b/test/KubeOps.Operator.Test/Queue/EntityQueueBackgroundService.Test.cs @@ -30,9 +30,23 @@ private sealed class ControllableQueue : ITimedEntityQueue private readonly System.Threading.Channels.Channel> _channel = System.Threading.Channels.Channel.CreateUnbounded>(); + private int _getAsyncEnumeratorCallCount; + public int EnqueueCallCount { get; private set; } - public Task Enqueue( + // Number of times a consuming loop began iterating the queue. Each processing loop calls + // GetAsyncEnumerator exactly once, so this equals the number of concurrently started loops. + public int GetAsyncEnumeratorCallCount => Volatile.Read(ref _getAsyncEnumeratorCallCount); + + // Controls the value returned by Enqueue, to simulate a leadership-aware queue dropping the entry + // (returns false) versus scheduling it (returns true). + public bool EnqueueResult { get; set; } = true; + + // Captures the cancellation token of the most recent Enqueue call, so tests can assert which token + // the producer passed (e.g. the processing token vs. CancellationToken.None for error retries). + public CancellationToken LastEnqueueToken { get; private set; } + + public Task Enqueue( TEntity entity, ReconciliationType type, ReconciliationTriggerSource reconciliationTriggerSource, @@ -41,8 +55,9 @@ public Task Enqueue( CancellationToken cancellationToken) { EnqueueCallCount++; + LastEnqueueToken = cancellationToken; _channel.Writer.TryWrite(new(entity, type, reconciliationTriggerSource, retryCount)); - return Task.CompletedTask; + return Task.FromResult(EnqueueResult); } public void Push(TEntity entity, ReconciliationType type, ReconciliationTriggerSource source, int retryCount = 0) @@ -54,6 +69,7 @@ public void Complete() public async IAsyncEnumerator> GetAsyncEnumerator( CancellationToken cancellationToken = default) { + Interlocked.Increment(ref _getAsyncEnumeratorCallCount); await foreach (var entry in _channel.Reader.ReadAllAsync(cancellationToken)) { yield return entry; @@ -64,6 +80,52 @@ public void Dispose() => _channel.Writer.TryComplete(); } + // A queue whose enumerator parks inside MoveNextAsync (holding the processing token) until released, then + // touches the token's wait handle — exactly what BlockingCollection.GetConsumingEnumerable does. If the + // token's source was disposed while the loop was parked, that touch throws ObjectDisposedException. + private sealed class BarrierQueue : ITimedEntityQueue + { + private readonly TaskCompletionSource _entered = new(TaskCreationOptions.RunContinuationsAsynchronously); + private readonly TaskCompletionSource _release = new(TaskCreationOptions.RunContinuationsAsynchronously); + private volatile bool _tokenDisposedWhileInUse; + + public Task LoopEntered => _entered.Task; + + public bool TokenDisposedWhileInUse => _tokenDisposedWhileInUse; + + public void ReleaseLoops() => _release.TrySetResult(); + + public Task Enqueue( + V1ConfigMap entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => Task.FromResult(true); + + public async IAsyncEnumerator> GetAsyncEnumerator( + CancellationToken cancellationToken = default) + { + _entered.TrySetResult(); + await _release.Task; + + try + { + _ = cancellationToken.WaitHandle; + } + catch (ObjectDisposedException) + { + _tokenDisposedWhileInUse = true; + } + + yield break; + } + + public void Dispose() + { + } + } + private static V1ConfigMap CreateEntity(string? uid = null) => new() { @@ -93,7 +155,7 @@ private static EntityQueueBackgroundService CreateService( It.IsAny())) .ReturnsAsync((string _, string? _, CancellationToken _) => entity); - return new( + var service = new EntityQueueBackgroundService( new("test"), clientMock.Object, effectiveSettings, @@ -101,6 +163,11 @@ private static EntityQueueBackgroundService CreateService( reconcilerMock.Object, Mock.Of>>(), metrics); + + // Keep the drain bound short so tests that intentionally block a worker across StopAsync/dispose don't + // wait the production default. Tests that assert the bound itself override this explicitly. + service.DrainGracePeriod = TimeSpan.FromMilliseconds(200); + return service; } [Trait("Area", "Otel")] @@ -179,6 +246,189 @@ public async Task Throwing_Reconciler_Records_Failure_Reconciliation_Metric() } } + [Trait("Area", "LeaderLoss")] + [Fact] + public async Task StartAsync_Can_Restart_After_The_Loop_Exits_Without_A_Stop_Request() + { + // Finding 1: if ExecuteAsync exits on its own (queue enumerator completed, or an unexpected fault) + // without going through StopAsync/RequestStopAsync, _running must be reset so a later StartAsync + // (e.g. a leadership re-acquisition) can restart the loop instead of being suppressed forever. + var queue = new ControllableQueue(); + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + await using var service = CreateService(queue, reconcilerMock, clientMock, CreateEntity()); + + // Completed channel -> the loop's enumerator finishes immediately -> ExecuteAsync returns on its own. + queue.Complete(); + + await service.StartAsync(TestContext.Current.CancellationToken); + await Task.Delay(200, TestContext.Current.CancellationToken); // let the loop start and fully exit + + // A second start (as on re-acquired leadership) must start a fresh loop, not be suppressed. + await service.StartAsync(TestContext.Current.CancellationToken); + await Task.Delay(200, TestContext.Current.CancellationToken); + + queue.GetAsyncEnumeratorCallCount.Should().Be(2); + } + + [Trait("Area", "LeaderLoss")] + [Fact] + public async Task StartAsync_Is_Idempotent_And_Starts_Only_One_Processing_Loop() + { + var queue = new ControllableQueue(); + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + var entity = CreateEntity(); + + await using var service = CreateService(queue, reconcilerMock, clientMock, entity); + + // Two starts without an intervening StopAsync. This mirrors the leadership-aware race where the + // StartAsync IsLeader() branch and a concurrent OnStartedLeading callback both invoke + // base.StartAsync. Only one queue-consuming loop must run, otherwise every entry is reconciled + // twice concurrently. + await service.StartAsync(TestContext.Current.CancellationToken); + await service.StartAsync(TestContext.Current.CancellationToken); + + await Task.Delay(300, TestContext.Current.CancellationToken); + await service.StopAsync(TestContext.Current.CancellationToken); + + queue.GetAsyncEnumeratorCallCount.Should().Be(1); + } + + [Trait("Area", "LeaderLoss")] + [Fact] + public async Task Restart_Does_Not_Dispose_CancellationTokenSource_Still_Used_By_Previous_Loop() + { + // L2: on a leadership flap (StopAsync then StartAsync) the service must not dispose a + // CancellationTokenSource whose token a still-running former loop is observing. The previous code + // reused a shared _cts and eagerly disposed it in StartAsync, so the lingering loop's queue + // enumerator could touch an already-disposed token source (ObjectDisposedException). Each run must + // own its own token source and dispose it only when that run ends. + var queue = new BarrierQueue(); + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + + var service = new EntityQueueBackgroundService( + new("test"), + clientMock.Object, + new OperatorSettingsBuilder().Build(), + queue, + reconcilerMock.Object, + Mock.Of>>()) + { + // The BarrierQueue ignores cancellation, so bound the stop/dispose drain short. + DrainGracePeriod = TimeSpan.FromMilliseconds(200), + }; + + await service.StartAsync(TestContext.Current.CancellationToken); + await queue.LoopEntered.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + + // Flap: stop (cancels the first run's token) then start a fresh run, all while the first loop is + // still parked inside the queue enumerator holding the first run's token. + await service.StopAsync(TestContext.Current.CancellationToken); + await service.StartAsync(TestContext.Current.CancellationToken); + + // Now let the parked loop(s) touch their token. + queue.ReleaseLoops(); + await Task.Delay(200, TestContext.Current.CancellationToken); + + queue.TokenDisposedWhileInUse.Should().BeFalse(); + + await service.StopAsync(TestContext.Current.CancellationToken); + await service.DisposeAsync(); + } + + [Trait("Area", "LeaderLoss")] + [Fact] + public async Task Stop_Does_Not_Dispose_Token_While_A_Reconciliation_Is_Still_In_Flight() + { + // N1: StopAsync only requests cancellation; the processing loop must still drain its in-flight worker + // tasks before its CancellationTokenSource is disposed (and before resources are torn down). Otherwise + // a still-running reconciler that touches its token hits ObjectDisposedException. + var entity = CreateEntity(); + var queue = new ControllableQueue(); + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + + var entered = new TaskCompletionSource(); + var canFinish = new TaskCompletionSource(); + var tokenDisposedWhileInFlight = false; + + reconcilerMock + .Setup(r => r.Reconcile(It.IsAny>(), It.IsAny())) + .Returns(async (ReconciliationContext _context, CancellationToken token) => + { + entered.TrySetResult(); + await canFinish.Task; + + // The worker still holds the processing token after StopAsync cancelled it. If the loop + // disposed the token source underneath the still-running worker, touching the token throws. + try + { + _ = token.WaitHandle; + } + catch (ObjectDisposedException) + { + tokenDisposedWhileInFlight = true; + } + + return ReconciliationResult.Success(entity); + }); + + await using var service = CreateService(queue, reconcilerMock, clientMock, entity); + await service.StartAsync(TestContext.Current.CancellationToken); + + queue.Push(entity, ReconciliationType.Modified, ReconciliationTriggerSource.ApiServer); + await entered.Task; + + // Stop while the reconciliation is in flight, then give the (buggy) loop time to return and dispose + // its token source before the worker resumes and touches the token. + await service.StopAsync(TestContext.Current.CancellationToken); + await Task.Delay(200, TestContext.Current.CancellationToken); + + canFinish.SetResult(); + await Task.Delay(200, TestContext.Current.CancellationToken); + + tokenDisposedWhileInFlight.Should().BeFalse(); + } + + [Trait("Area", "LeaderLoss")] + [Fact] + public async Task StopAsync_Awaits_The_Drain_Of_In_Flight_Reconciliations() + { + // Finding 1: host StopAsync must honor the IHostedService contract and await the drain of in-flight + // reconciliations (bounded), not return immediately while a reconciliation is still running. + var entity = CreateEntity(); + var queue = new ControllableQueue(); + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + + var entered = new TaskCompletionSource(); + var canFinish = new TaskCompletionSource(); + reconcilerMock + .Setup(r => r.Reconcile(It.IsAny>(), It.IsAny())) + .Returns(async (ReconciliationContext _ctx, CancellationToken _token) => + { + entered.TrySetResult(); + await canFinish.Task; + return ReconciliationResult.Success(entity); + }); + + await using var service = CreateService(queue, reconcilerMock, clientMock, entity); + service.DrainGracePeriod = TimeSpan.FromSeconds(5); // long enough to actually await a cooperative drain + await service.StartAsync(TestContext.Current.CancellationToken); + + queue.Push(entity, ReconciliationType.Modified, ReconciliationTriggerSource.ApiServer); + await entered.Task; + + var stopTask = service.StopAsync(TestContext.Current.CancellationToken); + await Task.Delay(150, TestContext.Current.CancellationToken); + stopTask.IsCompleted.Should().BeFalse(); // StopAsync is awaiting the in-flight reconciliation + + canFinish.SetResult(); + await stopTask.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + } + [Fact] public async Task Reconciler_Is_Called_For_Each_Queued_Entry() { @@ -386,6 +636,198 @@ public async Task RequeueAfterDelay_Strategy_Requeues_Concurrent_Entry_For_Same_ queue.EnqueueCallCount.Should().BeGreaterThan(0); } + [Trait("Area", "Otel")] + [Fact] + public async Task Conflict_Requeue_Metric_Is_Not_Recorded_When_Enqueue_Is_Dropped() + { + // A conflicting reconciliation is requeued (RequeueAfterDelay). If a leadership-aware queue drops + // that enqueue (intake suspended), Enqueue returns false and the conflict metric must not be counted. + const string meterName = "test-conflict-requeue-drop-metrics"; + using var meterFactory = new ServiceCollection().AddMetrics().BuildServiceProvider() + .GetRequiredService(); + var metrics = new OperatorMetrics(meterFactory, meterName); + + var requeued = 0; + using var listener = new MeterListener + { + InstrumentPublished = (instrument, l) => + { + if (instrument.Meter.Name == meterName && instrument.Name == "kubeops.operator.queue.requeued") + { + l.EnableMeasurementEvents(instrument); + } + }, + }; + listener.SetMeasurementEventCallback((_, value, _, _) => Interlocked.Add(ref requeued, (int)value)); + listener.Start(); + + var uid = Guid.NewGuid().ToString(); + var entity = CreateEntity(uid); + var firstStarted = new TaskCompletionSource(); + var firstCanFinish = new TaskCompletionSource(); + + var queue = new ControllableQueue { EnqueueResult = false }; + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + + clientMock + .Setup(c => c.GetAsync(It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(entity); + reconcilerMock + .Setup(r => r.Reconcile(It.IsAny>(), It.IsAny())) + .Returns(async (ReconciliationContext _, CancellationToken _) => + { + firstStarted.TrySetResult(); + await firstCanFinish.Task; + return ReconciliationResult.Success(entity); + }); + + var settings = new OperatorSettingsBuilder + { + ParallelReconciliation = new() + { + MaxParallelReconciliations = 4, + ConflictStrategy = ParallelReconciliationConflictStrategy.RequeueAfterDelay, + RequeueDelay = TimeSpan.FromMilliseconds(50), + }, + }.Build(); + + await using var service = CreateService(queue, reconcilerMock, clientMock, entity, settings, metrics); + await service.StartAsync(TestContext.Current.CancellationToken); + + queue.Push(entity, ReconciliationType.Modified, ReconciliationTriggerSource.ApiServer); + await firstStarted.Task; + + queue.Push(entity, ReconciliationType.Modified, ReconciliationTriggerSource.ApiServer); + await Task.Delay(200, TestContext.Current.CancellationToken); + + firstCanFinish.SetResult(); + await service.StopAsync(TestContext.Current.CancellationToken); + + // The conflicting entry's requeue was attempted but dropped, so no conflict metric was recorded. + queue.EnqueueCallCount.Should().BeGreaterThan(0); + requeued.Should().Be(0); + } + + [Trait("Area", "Otel")] + [Fact] + public async Task ErrorRetry_Requeue_Metric_Is_Not_Recorded_When_Enqueue_Is_Dropped() + { + // When a leadership-aware queue drops the retry enqueue (intake suspended after leadership loss), + // Enqueue returns false and the requeue metric must not be recorded. + const string meterName = "test-requeue-drop-metrics"; + using var meterFactory = new ServiceCollection().AddMetrics().BuildServiceProvider() + .GetRequiredService(); + var metrics = new OperatorMetrics(meterFactory, meterName); + + var requeued = 0; + using var listener = new MeterListener + { + InstrumentPublished = (instrument, l) => + { + if (instrument.Meter.Name == meterName && instrument.Name == "kubeops.operator.queue.requeued") + { + l.EnableMeasurementEvents(instrument); + } + }, + }; + listener.SetMeasurementEventCallback((_, value, _, _) => Interlocked.Add(ref requeued, (int)value)); + listener.Start(); + + var entity = CreateEntity(); + var queue = new ControllableQueue { EnqueueResult = false }; + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + + clientMock + .Setup(c => c.GetAsync(It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(entity); + reconcilerMock + .Setup(r => r.Reconcile(It.IsAny>(), It.IsAny())) + .ThrowsAsync(new InvalidOperationException("transient error")); + + var settings = new OperatorSettingsBuilder + { + ParallelReconciliation = new() + { + MaxParallelReconciliations = 2, + MaxErrorRetries = 3, + ErrorBackoffBase = TimeSpan.FromMilliseconds(10), + }, + }.Build(); + + await using var service = CreateService(queue, reconcilerMock, clientMock, entity, settings, metrics); + await service.StartAsync(TestContext.Current.CancellationToken); + + queue.Push(entity, ReconciliationType.Modified, ReconciliationTriggerSource.ApiServer); + queue.Complete(); + + await Task.Delay(300, TestContext.Current.CancellationToken); + await service.StopAsync(TestContext.Current.CancellationToken); + + // The retry was attempted (Enqueue called) but, because it was dropped, no requeue was counted. + queue.EnqueueCallCount.Should().BeGreaterThan(0); + requeued.Should().Be(0); + } + + [Trait("Area", "LeaderLoss")] + [Fact] + public async Task ErrorRetry_Enqueue_Receives_Processing_Token_So_It_Is_Rejected_After_Stop() + { + // Finding 2: a former leadership term's error retry must not leak into the next term. + // The retry enqueue is passed the processing cancellationToken (not CancellationToken.None). + // A non-cooperative reconciler that ignores cancellation and throws a non-OCE *after* StopAsync + // has cancelled the processing loop must therefore enqueue its retry with an already-cancelled + // token, which a leadership-aware queue rejects. + var entity = CreateEntity(); + var queue = new ControllableQueue(); + var reconcilerMock = new Mock>(); + var clientMock = new Mock(); + + var reconcileEntered = new TaskCompletionSource(); + var canThrow = new TaskCompletionSource(); + + clientMock + .Setup(c => c.GetAsync(It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(entity); + + // Non-cooperative: ignores the token, blocks until released, then throws a non-OCE. + reconcilerMock + .Setup(r => r.Reconcile(It.IsAny>(), It.IsAny())) + .Returns(async (ReconciliationContext _, CancellationToken _) => + { + reconcileEntered.TrySetResult(); + await canThrow.Task; + throw new InvalidOperationException("late non-cooperative failure"); + }); + + var settings = new OperatorSettingsBuilder + { + ParallelReconciliation = new() + { + MaxParallelReconciliations = 2, + MaxErrorRetries = 3, + ErrorBackoffBase = TimeSpan.FromMilliseconds(10), + }, + }.Build(); + + await using var service = CreateService(queue, reconcilerMock, clientMock, entity, settings); + await service.StartAsync(TestContext.Current.CancellationToken); + + queue.Push(entity, ReconciliationType.Modified, ReconciliationTriggerSource.ApiServer); + await reconcileEntered.Task; + + // Simulate leadership loss / shutdown: cancel the processing loop while the reconciler is in-flight. + await service.StopAsync(TestContext.Current.CancellationToken); + + // Let the in-flight reconciler throw now; its retry enqueue must carry the cancelled processing token. + canThrow.SetResult(); + await Task.Delay(200, TestContext.Current.CancellationToken); + + queue.EnqueueCallCount.Should().BeGreaterThan(0); + queue.LastEnqueueToken.IsCancellationRequested.Should().BeTrue(); + } + [Fact] public async Task Failed_Reconciliation_Is_Requeued_With_ErrorRetry_Source() { diff --git a/test/KubeOps.Operator.Test/Queue/EntityQueueFactory.Test.cs b/test/KubeOps.Operator.Test/Queue/EntityQueueFactory.Test.cs index 0416ee1d..0411d3d3 100644 --- a/test/KubeOps.Operator.Test/Queue/EntityQueueFactory.Test.cs +++ b/test/KubeOps.Operator.Test/Queue/EntityQueueFactory.Test.cs @@ -18,7 +18,7 @@ namespace KubeOps.Operator.Test.Queue; public sealed class EntityQueueFactoryTest { [Fact] - public void Create_Should_Return_Delegate_That_Calls_Enqueue_On_Queue() + public async Task Create_Should_Return_Delegate_That_Calls_Enqueue_On_Queue() { var mockQueue = new Mock>(); var services = new ServiceCollection() @@ -33,7 +33,7 @@ public void Create_Should_Return_Delegate_That_Calls_Enqueue_On_Queue() var queueIn = TimeSpan.FromSeconds(10); const int retryCount = 3; - enqueue( + await enqueue( entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, diff --git a/test/KubeOps.Operator.Test/Queue/LeaderAwareEntityQueueBackgroundService.Test.cs b/test/KubeOps.Operator.Test/Queue/LeaderAwareEntityQueueBackgroundService.Test.cs new file mode 100644 index 00000000..3326e560 --- /dev/null +++ b/test/KubeOps.Operator.Test/Queue/LeaderAwareEntityQueueBackgroundService.Test.cs @@ -0,0 +1,509 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics.Metrics; +using System.Reflection; + +using FluentAssertions; + +using k8s.Models; + +using KubeOps.Abstractions.Builder; +using KubeOps.Abstractions.Reconciliation; +using KubeOps.KubernetesClient; +using KubeOps.Operator.Metrics; +using KubeOps.Operator.Queue; +using KubeOps.Operator.Test.TestEntities; + +using Microsoft.Extensions.DependencyInjection; +using Microsoft.Extensions.Logging; + +using Moq; + +using ILock = k8s.LeaderElection.ILock; +using LeaderElectorType = k8s.LeaderElection.LeaderElector; + +namespace KubeOps.Operator.Test.Queue; + +[Trait("Area", "LeaderLoss")] +public sealed class LeaderAwareEntityQueueBackgroundServiceTest +{ + [Fact] + public async Task StartedLeading_Should_Begin_Consuming_Queue() + { + var queue = new CapturingQueue(); + await using var service = CreateService(queue); + + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + + await queue.EnumeratorStarted.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + + queue.CapturedToken.IsCancellationRequested.Should().BeFalse(); + } + + [Fact] + public async Task StoppedLeading_Should_Hard_Stop_Queue_Processing() + { + var queue = new CapturingQueue(); + await using var service = CreateService(queue); + + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + await queue.EnumeratorStarted.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + + service.SimulateStoppedLeading(); + + // Cancelling the internal token must abort the in-flight queue consumption. + await queue.EnumeratorCancelled.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + queue.CapturedToken.IsCancellationRequested.Should().BeTrue(); + } + + [Fact] + public async Task StartedLeading_Should_Resume_Intake_And_StoppedLeading_Should_Suspend_Then_Clear() + { + var queue = new CapturingQueue(); + await using var service = CreateService(queue); + + // StartAsync subscribes to the elector; not leading yet, so it closes the gate first. + await service.StartAsync(TestContext.Current.CancellationToken); + queue.GateCalls.Should().Equal("SuspendIntake"); + queue.GateCalls.Clear(); + + service.SimulateStartedLeading(); + service.SimulateStoppedLeading(); + + // ResumeIntake happens before producing; on stop the gate is closed (SuspendIntake) before the + // queue is cleared, so no in-flight requeue can slip in between the clear and the gate closing. + queue.GateCalls.Should().Equal("ResumeIntake", "SuspendIntake", "Clear"); + } + + [Fact] + public async Task StoppedLeading_Should_Clear_Queue_And_Drop_Later_Requeues() + { + using var realQueue = new TimedEntityQueue( + Mock.Of>>()); + await using var service = CreateService(realQueue); + + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + + // Work the former leader scheduled before losing leadership. + await realQueue.Enqueue( + CreateEntity("to-clear"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.FromSeconds(5), + retryCount: 0, + TestContext.Current.CancellationToken); + realQueue.Count.Should().Be(1); + + service.SimulateStoppedLeading(); + + // The pre-stop entry is cleared. + realQueue.Count.Should().Be(0); + realQueue.ReadyCount.Should().Be(0); + + // Simulate an in-flight reconciler finishing AFTER leadership loss and returning + // Success(entity, requeueAfter): the requeue must be dropped, leaving no stale work behind. + await realQueue.Enqueue( + CreateEntity("late-requeue"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.Zero, + retryCount: 0, + TestContext.Current.CancellationToken); + + realQueue.Count.Should().Be(0); + realQueue.ReadyCount.Should().Be(0); + } + + [Fact] + public async Task Queue_Without_Suspendable_Capability_Logs_Warning_And_Degrades_Gracefully() + { + // A custom queue that does not implement ISuspendableEntityQueue must not break leadership + // transitions; gating is skipped and a warning is logged so the missing protection is visible. + var loggerMock = new Mock>>(); + var queue = new NonSuspendableQueue(); + await using var service = CreateService(queue, loggerMock.Object); + + await service.StartAsync(TestContext.Current.CancellationToken); + + loggerMock.Verify( + l => l.Log( + LogLevel.Warning, + It.IsAny(), + It.Is((@object, _) => @object.ToString()!.Contains(nameof(ISuspendableEntityQueue))), + It.IsAny(), + It.IsAny>()), + Times.Once); + + service.SimulateStartedLeading(); + var act = service.SimulateStoppedLeading; + act.Should().NotThrow(); + } + + [Fact] + public async Task Dispose_Should_Unsubscribe_From_Elector_So_Later_Transitions_Are_Ignored() + { + var queue = new CapturingQueue(); + var service = CreateService(queue); + + await service.StartAsync(TestContext.Current.CancellationToken); + queue.GateCalls.Clear(); + + service.Dispose(); + + // After disposal the service must no longer react to leadership callbacks: the handler is + // unsubscribed, so invoking the elector event reaches nothing and no gating happens. + service.SimulateStartedLeading(); + queue.GateCalls.Should().BeEmpty(); + } + + [Fact] + public async Task DisposeAsync_Should_Unsubscribe_From_Elector_So_Later_Transitions_Are_Ignored() + { + var queue = new CapturingQueue(); + var service = CreateService(queue); + + await service.StartAsync(TestContext.Current.CancellationToken); + queue.GateCalls.Clear(); + + await service.DisposeAsync(); + + // Regression guard for the async disposal path: the DI container disposes via IAsyncDisposable + // when available, so DisposeAsync must perform the same unsubscribe as Dispose. The previous code + // skipped the disposal hook on this path, leaving the handler subscribed — a stale StartedLeading + // would then resume intake (this assertion would fail) and touch already-disposed state. + service.SimulateStartedLeading(); + queue.GateCalls.Should().BeEmpty(); + } + + [Fact] + public async Task StopAsync_Stops_Base_Loop_Even_When_No_Longer_Leader() + { + // N2: after leadership was lost the elector reports non-leader. Host shutdown then calls StopAsync, + // which must still stop the base processing loop instead of skipping base.StopAsync and leaving the + // loop (and its in-flight work) running. + var queue = new CapturingQueue(); + await using var service = CreateService(queue); + + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + await queue.EnumeratorStarted.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + + await service.StopAsync(TestContext.Current.CancellationToken); + + await queue.EnumeratorCancelled.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + queue.CapturedToken.IsCancellationRequested.Should().BeTrue(); + } + + [Fact] + public async Task Dispose_After_Flap_Drains_The_Overlapped_Old_Loop() + { + // F1: a StoppedLeading -> StartedLeading flap can leave the previous loop still draining an in-flight + // reconciliation while a new loop runs. Dispose must drain the OLD loop too (not only the latest), + // otherwise that worker can touch a semaphore/client/queue after they were disposed. + using var realQueue = new TimedEntityQueue( + Mock.Of>>()); + + var entity = CreateEntity("flap"); + entity.Metadata.Uid = Guid.NewGuid().ToString(); + var entered = new TaskCompletionSource(); + var canFinish = new TaskCompletionSource(); + + var clientMock = new Mock(); + clientMock + .Setup(c => c.GetAsync( + It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(entity); + + var reconcilerMock = new Mock>(); + reconcilerMock + .Setup(r => r.Reconcile( + It.IsAny>(), It.IsAny())) + .Returns(async (ReconciliationContext _ctx, CancellationToken _token) => + { + entered.TrySetResult(); + await canFinish.Task; + return ReconciliationResult.Success(entity); + }); + + var service = CreateService(realQueue, reconciler: reconcilerMock.Object, client: clientMock.Object); + + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + + await realQueue.Enqueue( + entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.Zero, 0, + TestContext.Current.CancellationToken); + await entered.Task.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + + // Flap: cancel loop1 (now draining the blocked worker) and start loop2. + service.SimulateStoppedLeading(); + service.SimulateStartedLeading(); + + // Dispose must wait for loop1's in-flight worker; it cannot complete while the worker is blocked. + var disposeTask = service.DisposeAsync().AsTask(); + await Task.Delay(200, TestContext.Current.CancellationToken); + disposeTask.IsCompleted.Should().BeFalse(); + + canFinish.SetResult(); + await disposeTask.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + } + + [Fact] + public async Task Dispose_Is_Bounded_When_A_Reconciliation_Ignores_Cancellation() + { + // R1: a non-cooperative reconciler that ignores its token must not block disposal indefinitely; the + // drain is bounded by DrainGracePeriod. + using var realQueue = new TimedEntityQueue( + Mock.Of>>()); + + var entity = CreateEntity("stuck"); + entity.Metadata.Uid = Guid.NewGuid().ToString(); + var entered = new TaskCompletionSource(); + var neverFinishes = new TaskCompletionSource(); + + var clientMock = new Mock(); + clientMock + .Setup(c => c.GetAsync( + It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(entity); + + var reconcilerMock = new Mock>(); + reconcilerMock + .Setup(r => r.Reconcile( + It.IsAny>(), It.IsAny())) + .Returns(async (ReconciliationContext _ctx, CancellationToken _token) => + { + entered.TrySetResult(); + await neverFinishes.Task; + return ReconciliationResult.Success(entity); + }); + + var service = CreateService(realQueue, reconciler: reconcilerMock.Object, client: clientMock.Object); + service.DrainGracePeriod = TimeSpan.FromMilliseconds(300); + + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + + await realQueue.Enqueue( + entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.Zero, 0, + TestContext.Current.CancellationToken); + await entered.Task.WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + + // Must return within the grace, not hang on the stuck reconciler. + await service.DisposeAsync().AsTask().WaitAsync(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken); + } + + [Trait("Area", "Otel")] + [Fact] + public async Task Records_Reconciliation_Metrics_When_Leader() + { + // Finding 2: the leader-aware consumer must forward OperatorMetrics to the base, otherwise leader-elected + // operators (LeaderElectionType.Single) lose all queue-side reconciliation metrics. + const string meterName = "test-leader-aware-reconciliation-metrics"; + using var meterFactory = new ServiceCollection().AddMetrics().BuildServiceProvider() + .GetRequiredService(); + var metrics = new OperatorMetrics(meterFactory, meterName); + + var recorded = 0; + using var listener = new MeterListener + { + InstrumentPublished = (instrument, l) => + { + if (instrument.Meter.Name == meterName && instrument.Name == "kubeops.operator.reconciliation") + { + l.EnableMeasurementEvents(instrument); + } + }, + }; + listener.SetMeasurementEventCallback((_, _, _, _) => Interlocked.Increment(ref recorded)); + listener.Start(); + + using var realQueue = new TimedEntityQueue( + Mock.Of>>()); + var entity = CreateEntity("metric"); + entity.Metadata.Uid = Guid.NewGuid().ToString(); + + var clientMock = new Mock(); + clientMock + .Setup(c => c.GetAsync( + It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(entity); + var reconcilerMock = new Mock>(); + reconcilerMock + .Setup(r => r.Reconcile( + It.IsAny>(), It.IsAny())) + .ReturnsAsync(ReconciliationResult.Success(entity)); + + var service = CreateService(realQueue, reconciler: reconcilerMock.Object, client: clientMock.Object, metrics: metrics); + await service.StartAsync(TestContext.Current.CancellationToken); + service.SimulateStartedLeading(); + await realQueue.Enqueue( + entity, ReconciliationType.Modified, ReconciliationTriggerSource.Operator, TimeSpan.Zero, 0, + TestContext.Current.CancellationToken); + + var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(5); + while (DateTime.UtcNow < deadline && Volatile.Read(ref recorded) == 0) + { + await Task.Delay(25, TestContext.Current.CancellationToken); + } + + Volatile.Read(ref recorded).Should().BeGreaterThan(0); + await service.DisposeAsync(); + } + + private static V1OperatorIntegrationTestEntity CreateEntity(string name) + { + var entity = new V1OperatorIntegrationTestEntity(); + entity.EnsureMetadata(); + entity.Metadata.SetNamespace("unit-test"); + entity.Metadata.Name = name; + return entity; + } + + private static TestableService CreateService( + ITimedEntityQueue queue, + ILogger>? logger = null, + IReconciler? reconciler = null, + IKubernetesClient? client = null, + OperatorMetrics? metrics = null) + { + var lockMock = new Mock(); + lockMock + .Setup(l => l.GetAsync(It.IsAny())) + .Returns(async ct => { await Task.Delay(Timeout.Infinite, ct); return null!; }); + + var elector = new LeaderElectorType(new(lockMock.Object) + { + LeaseDuration = TimeSpan.FromSeconds(1), + RenewDeadline = TimeSpan.FromMilliseconds(500), + RetryPeriod = TimeSpan.FromMilliseconds(100), + }); + + return new( + queue, + elector, + logger ?? Mock.Of>>(), + reconciler, + client, + metrics); + } + + /// + /// Exposes the private leadership callbacks for testing by invoking the delegates registered on the + /// elector, mirroring the approach used in the leader-aware resource watcher tests. + /// + private sealed class TestableService : LeaderAwareEntityQueueBackgroundService + { + private readonly LeaderElectorType _elector; + + public TestableService( + ITimedEntityQueue queue, + LeaderElectorType elector, + ILogger> logger, + IReconciler? reconciler = null, + IKubernetesClient? client = null, + OperatorMetrics? metrics = null) + : base( + new("test"), + client ?? Mock.Of(), + new OperatorSettingsBuilder { Namespace = "unit-test" }.Build(), + queue, + reconciler ?? Mock.Of>(), + logger, + elector, + metrics) + { + _elector = elector; + } + + public void SimulateStartedLeading() => InvokeElectorEvent(nameof(LeaderElectorType.OnStartedLeading)); + + public void SimulateStoppedLeading() => InvokeElectorEvent(nameof(LeaderElectorType.OnStoppedLeading)); + + private void InvokeElectorEvent(string eventName) + { + var field = typeof(LeaderElectorType) + .GetField(eventName, BindingFlags.Instance | BindingFlags.NonPublic); + var handler = (Action?)field?.GetValue(_elector); + handler?.Invoke(); + } + } + + private sealed class CapturingQueue : ITimedEntityQueue, ISuspendableEntityQueue + { + private readonly TaskCompletionSource _enumeratorStarted = new(TaskCreationOptions.RunContinuationsAsynchronously); + private readonly TaskCompletionSource _enumeratorCancelled = new(TaskCreationOptions.RunContinuationsAsynchronously); + + public CancellationToken CapturedToken { get; private set; } + + public Task EnumeratorStarted => _enumeratorStarted.Task; + + public Task EnumeratorCancelled => _enumeratorCancelled.Task; + + // Records the order of intake/clear calls so tests can assert the leadership-transition sequence. + public List GateCalls { get; } = []; + + public Task Enqueue( + V1OperatorIntegrationTestEntity entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => Task.FromResult(true); + + public void Clear() => GateCalls.Add(nameof(Clear)); + + public void SuspendIntake() => GateCalls.Add(nameof(SuspendIntake)); + + public void ResumeIntake() => GateCalls.Add(nameof(ResumeIntake)); + + public async IAsyncEnumerator> GetAsyncEnumerator( + CancellationToken cancellationToken = default) + { + CapturedToken = cancellationToken; + var blocker = new TaskCompletionSource(); + await using var registration = cancellationToken.Register(() => + { + blocker.TrySetResult(); + _enumeratorCancelled.TrySetResult(); + }); + + _enumeratorStarted.TrySetResult(); + await blocker.Task; + yield break; + } + + public void Dispose() + { + } + } + + // A custom queue that deliberately does NOT implement ISuspendableEntityQueue, to verify the + // leader-aware consumer degrades gracefully (skips gating) instead of failing. + private sealed class NonSuspendableQueue : ITimedEntityQueue + { + public Task Enqueue( + V1OperatorIntegrationTestEntity entity, + ReconciliationType type, + ReconciliationTriggerSource reconciliationTriggerSource, + TimeSpan queueIn, + int retryCount, + CancellationToken cancellationToken) => Task.FromResult(true); + + public async IAsyncEnumerator> GetAsyncEnumerator( + CancellationToken cancellationToken = default) + { + await Task.Yield(); + yield break; + } + + public void Dispose() + { + } + } +} diff --git a/test/KubeOps.Operator.Test/Queue/TimedEntityQueue.Test.cs b/test/KubeOps.Operator.Test/Queue/TimedEntityQueue.Test.cs index 4a8009eb..72cf2f9f 100644 --- a/test/KubeOps.Operator.Test/Queue/TimedEntityQueue.Test.cs +++ b/test/KubeOps.Operator.Test/Queue/TimedEntityQueue.Test.cs @@ -182,12 +182,148 @@ await queue.Enqueue( queue.Count.Should().Be(0); } + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Clear_Should_Remove_Scheduled_And_Ready_Entries() + { + using var queue = new TimedEntityQueue(Mock.Of>>()); + + // One entry ready immediately (gets promoted to the ready queue by the timer) and one scheduled. + await queue.Enqueue( + CreateSecret("ns", "ready"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.Zero, + retryCount: 0, + TestContext.Current.CancellationToken); + await queue.Enqueue( + CreateSecret("ns", "scheduled"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.FromSeconds(5), + retryCount: 0, + TestContext.Current.CancellationToken); + + // Give the timer time to promote the ready entry into the ready queue. + await WaitUntil(() => queue.ReadyCount == 1, TimeSpan.FromSeconds(2)); + + queue.Clear(); + + queue.Count.Should().Be(0); + queue.ReadyCount.Should().Be(0); + (await DrainQueue(queue, TimeSpan.FromMilliseconds(200))).Should().BeEmpty(); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Enqueue_Should_Be_Dropped_While_Intake_Is_Suspended() + { + using var queue = new TimedEntityQueue(Mock.Of>>()); + + queue.SuspendIntake(); + + var scheduled = await queue.Enqueue( + CreateSecret("ns", "secret"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.Zero, + retryCount: 0, + TestContext.Current.CancellationToken); + + // Returning false lets callers avoid counting a requeue that was not scheduled. + scheduled.Should().BeFalse(); + queue.Count.Should().Be(0); + (await DrainQueue(queue, TimeSpan.FromMilliseconds(200))).Should().BeEmpty(); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Enqueue_Should_Be_Accepted_After_Intake_Is_Resumed() + { + using var queue = new TimedEntityQueue(Mock.Of>>()); + + queue.SuspendIntake(); + queue.ResumeIntake(); + + var scheduled = await queue.Enqueue( + CreateSecret("ns", "secret"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.Zero, + retryCount: 0, + TestContext.Current.CancellationToken); + + scheduled.Should().BeTrue(); + (await DrainQueue(queue, TimeSpan.FromMilliseconds(500))).Should().ContainSingle(); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Enqueue_Should_Be_Dropped_When_Token_Already_Cancelled() + { + using var queue = new TimedEntityQueue(Mock.Of>>()); + + var scheduled = await queue.Enqueue( + CreateSecret("ns", "secret"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.Zero, + retryCount: 0, + new CancellationToken(canceled: true)); + + scheduled.Should().BeFalse(); + queue.Count.Should().Be(0); + (await DrainQueue(queue, TimeSpan.FromMilliseconds(200))).Should().BeEmpty(); + } + + // Regression guard for the TOCTOU race: gate check + scheduling must be atomic with SuspendIntake/Clear. + // Stressed over many iterations so that a missing lock reliably surfaces (a single parallel round would + // be probabilistic). Without the lock in Enqueue/Clear, an entry survives the clear and this fails. + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task Concurrent_Enqueue_With_Suspend_And_Clear_Leaves_Nothing() + { + const int iterations = 200; + const int enqueuesPerIteration = 32; + var cancellationToken = TestContext.Current.CancellationToken; + + for (var i = 0; i < iterations; i++) + { + using var queue = new TimedEntityQueue(Mock.Of>>()); + + var enqueues = Enumerable.Range(0, enqueuesPerIteration).Select(n => Task.Run( + () => queue.Enqueue( + CreateSecret("ns", $"secret-{n}"), + ReconciliationType.Modified, + ReconciliationTriggerSource.Operator, + TimeSpan.Zero, + retryCount: 0, + cancellationToken), + cancellationToken)); + + var stop = Task.Run( + () => + { + queue.SuspendIntake(); + queue.Clear(); + }, + cancellationToken); + + await Task.WhenAll(enqueues.Append(stop)); + + // Intake stays suspended, so anything enqueued after the clear is dropped and anything before is + // cleared: the queue must end up completely empty. + queue.Count.Should().Be(0); + queue.ReadyCount.Should().Be(0); + } + } + [Fact] public void Dispose_Should_Complete_Without_Exception() { using var queue = new TimedEntityQueue(Mock.Of>>()); - var act = () => queue.Dispose(); + var act = queue.Dispose; act.Should().NotThrow(); } @@ -203,6 +339,15 @@ private static V1Secret CreateSecret(string secretNamespace, string secretName) return secret; } + private static async Task WaitUntil(Func condition, TimeSpan timeout) + { + var deadline = DateTimeOffset.UtcNow + timeout; + while (!condition() && DateTimeOffset.UtcNow < deadline) + { + await Task.Delay(20); + } + } + private static async Task>> DrainQueue( TimedEntityQueue queue, TimeSpan timeout) diff --git a/test/KubeOps.Operator.Test/Reconciliation/Reconciler.Test.cs b/test/KubeOps.Operator.Test/Reconciliation/Reconciler.Test.cs index b70a5561..a2b4734e 100644 --- a/test/KubeOps.Operator.Test/Reconciliation/Reconciler.Test.cs +++ b/test/KubeOps.Operator.Test/Reconciliation/Reconciler.Test.cs @@ -2,6 +2,8 @@ // The .NET Foundation licenses this file to you under the Apache 2.0 License. // See the LICENSE file in the project root for more information. +using System.Diagnostics.Metrics; + using FluentAssertions; using k8s.Models; @@ -11,6 +13,7 @@ using KubeOps.Abstractions.Reconciliation.Controller; using KubeOps.Abstractions.Reconciliation.Finalizer; using KubeOps.KubernetesClient; +using KubeOps.Operator.Metrics; using KubeOps.Operator.Queue; using KubeOps.Operator.Reconciliation; @@ -67,6 +70,52 @@ public async Task Reconcile_Should_Enqueue_Entity_When_Result_Has_RequeueAfter() Times.Once); } + [Fact] + [Trait("Area", "Otel")] + public async Task Reconcile_Should_Record_Requeue_Metric_When_Enqueue_Is_Scheduled() + { + var entity = CreateTestEntity(); + var context = ReconciliationContext.CreateFor(entity, ReconciliationType.Added, ReconciliationTriggerSource.ApiServer); + var controller = CreateMockController( + reconcileResult: ReconciliationResult.Success(entity, TimeSpan.FromMinutes(5))); + _mockQueue + .Setup(q => q.Enqueue( + It.IsAny(), It.IsAny(), It.IsAny(), + It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(true); + + using var harness = new RequeueMetricHarness(); + var reconciler = CreateReconcilerForController(controller, harness.Metrics); + + await reconciler.Reconcile(context, TestContext.Current.CancellationToken); + + harness.RequeueCount.Should().Be(1); + } + + [Fact] + [Trait("Area", "Otel")] + public async Task Reconcile_Should_Not_Record_Requeue_Metric_When_Enqueue_Is_Dropped() + { + // When the queue drops the requeue (intake suspended after leadership loss) Enqueue returns false, + // so no requeue metric must be recorded. + var entity = CreateTestEntity(); + var context = ReconciliationContext.CreateFor(entity, ReconciliationType.Added, ReconciliationTriggerSource.ApiServer); + var controller = CreateMockController( + reconcileResult: ReconciliationResult.Success(entity, TimeSpan.FromMinutes(5))); + _mockQueue + .Setup(q => q.Enqueue( + It.IsAny(), It.IsAny(), It.IsAny(), + It.IsAny(), It.IsAny(), It.IsAny())) + .ReturnsAsync(false); + + using var harness = new RequeueMetricHarness(); + var reconciler = CreateReconcilerForController(controller, harness.Metrics); + + await reconciler.Reconcile(context, TestContext.Current.CancellationToken); + + harness.RequeueCount.Should().Be(0); + } + [Fact] public async Task Reconcile_Should_Not_Enqueue_Entity_When_Result_Has_No_RequeueAfter() { @@ -380,7 +429,8 @@ public async Task Reconcile_Should_Update_Entity_With_CancellationToken_None_Aft Times.Once); } - private Reconciler CreateReconcilerForController(IEntityController controller) + private Reconciler CreateReconcilerForController( + IEntityController controller, OperatorMetrics? metrics = null) { var mockScope = new Mock(); var mockScopeFactory = new Mock(); @@ -406,7 +456,8 @@ private Reconciler CreateReconcilerForController(IEntityController< _mockServiceProvider.Object, _settings, _mockQueue.Object, - _mockClient.Object); + _mockClient.Object, + metrics); } private Reconciler CreateReconcilerForFinalizer(IEntityFinalizer? finalizer, string finalizerName) @@ -489,4 +540,43 @@ private static V1ConfigMap CreateTestEntityForFinalization(string? name = null, Kind = V1ConfigMap.KubeKind, }; } + + // Captures the "kubeops.operator.queue.requeued" counter for a dedicated meter so a test can assert + // whether the operator requeue metric was recorded. + private sealed class RequeueMetricHarness : IDisposable + { + private const string MeterName = "reconciler-requeue-test"; + private readonly ServiceProvider _provider; + private readonly MeterListener _listener; + private int _requeued; + + public RequeueMetricHarness() + { + _provider = new ServiceCollection().AddMetrics().BuildServiceProvider(); + Metrics = new OperatorMetrics(_provider.GetRequiredService(), MeterName); + + _listener = new MeterListener + { + InstrumentPublished = (instrument, l) => + { + if (instrument.Meter.Name == MeterName && instrument.Name == "kubeops.operator.queue.requeued") + { + l.EnableMeasurementEvents(instrument); + } + }, + }; + _listener.SetMeasurementEventCallback((_, value, _, _) => Interlocked.Add(ref _requeued, (int)value)); + _listener.Start(); + } + + public OperatorMetrics Metrics { get; } + + public int RequeueCount => _requeued; + + public void Dispose() + { + _listener.Dispose(); + _provider.Dispose(); + } + } } diff --git a/test/KubeOps.Operator.Test/RestartableHostedServiceTest.cs b/test/KubeOps.Operator.Test/RestartableHostedServiceTest.cs new file mode 100644 index 00000000..a034ffe5 --- /dev/null +++ b/test/KubeOps.Operator.Test/RestartableHostedServiceTest.cs @@ -0,0 +1,246 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the Apache 2.0 License. +// See the LICENSE file in the project root for more information. + +using System.Diagnostics; + +using FluentAssertions; + +namespace KubeOps.Operator.Test; + +[Trait("Area", "LeaderLoss")] +public sealed class RestartableHostedServiceTest +{ + [Fact] + public async Task Faulted_Loop_Is_Restarted_With_Backoff() + { + // The loop body throws on its first run; the service must restart it (immediate back-off in the test) + // instead of leaving it dead for the rest of the process lifetime. + var secondRunStarted = new TaskCompletionSource(); + await using var service = new TestRestartableHostedService((count, ct) => + { + if (count == 1) + { + throw new InvalidOperationException("boom"); + } + + secondRunStarted.TrySetResult(); + return Task.Delay(Timeout.Infinite, ct); + }) + { + FaultBackoff = _ => TimeSpan.Zero, + DrainGracePeriod = TimeSpan.FromSeconds(5), + }; + + await service.StartAsync(TestContext.Current.CancellationToken); + + var restarted = await Task.WhenAny( + secondRunStarted.Task, + Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken)); + restarted.Should().Be(secondRunStarted.Task, "the faulted loop should be restarted"); + service.FaultCount.Should().Be(1); + + await service.StopAsync(TestContext.Current.CancellationToken); + } + + [Fact] + public async Task FaultBackoff_Counter_Escalates_For_A_Tight_Crash_Loop() + { + // With a large reset threshold (default) and instantly-faulting runs, the back-off counter must keep + // escalating: consecutive faults pass 1, 2, 3, ... so the delay grows. + var retryCounts = new List(); + var enough = new TaskCompletionSource(); + await using var service = new TestRestartableHostedService((_, _) => throw new InvalidOperationException("boom")) + { + FaultBackoff = n => + { + retryCounts.Add(n); + if (retryCounts.Count >= 3) + { + enough.TrySetResult(); + } + + return TimeSpan.Zero; + }, + FaultBackoffResetThreshold = TimeSpan.FromMinutes(1), + DrainGracePeriod = TimeSpan.FromSeconds(5), + }; + + await service.StartAsync(TestContext.Current.CancellationToken); + (await Task.WhenAny(enough.Task, Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(enough.Task); + await service.StopAsync(TestContext.Current.CancellationToken); + + retryCounts.Take(3).Should().Equal(1u, 2u, 3u); + } + + [Fact] + public async Task FaultBackoff_Counter_Resets_After_A_Healthy_Run() + { + // With a zero reset threshold every run counts as "healthy", so each fault is treated as fresh and the + // back-off counter restarts at 1 instead of escalating. + var retryCounts = new List(); + var enough = new TaskCompletionSource(); + await using var service = new TestRestartableHostedService((_, _) => throw new InvalidOperationException("boom")) + { + FaultBackoff = n => + { + retryCounts.Add(n); + if (retryCounts.Count >= 3) + { + enough.TrySetResult(); + } + + return TimeSpan.Zero; + }, + FaultBackoffResetThreshold = TimeSpan.Zero, + DrainGracePeriod = TimeSpan.FromSeconds(5), + }; + + await service.StartAsync(TestContext.Current.CancellationToken); + (await Task.WhenAny(enough.Task, Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(enough.Task); + await service.StopAsync(TestContext.Current.CancellationToken); + + retryCounts.Take(3).Should().Equal(1u, 1u, 1u); + } + + [Fact] + public async Task Clean_Return_Is_Not_Restarted() + { + // A clean ExecuteAsync return means the loop is done; it must not be treated as a fault and restarted. + await using var service = new TestRestartableHostedService((_, _) => Task.CompletedTask); + + await service.StartAsync(TestContext.Current.CancellationToken); + await Task.Delay(TimeSpan.FromMilliseconds(250), TestContext.Current.CancellationToken); + + service.ExecuteCount.Should().Be(1); + service.FaultCount.Should().Be(0); + + await service.StopAsync(TestContext.Current.CancellationToken); + } + + [Fact] + public async Task StartAsync_During_Disposal_Does_Not_Start_A_New_Loop() + { + // Simulates a leadership callback racing disposal: a StartAsync triggered while disposing (here from + // OnDisposing) must be suppressed so no new loop escapes the drain and runs on torn-down resources. + var service = new TestRestartableHostedService((_, ct) => Task.Delay(Timeout.Infinite, ct)) + { + DrainGracePeriod = TimeSpan.FromSeconds(5), + }; + + await service.StartAsync(TestContext.Current.CancellationToken); + await WaitUntilAsync(() => service.ExecuteCount == 1); + + service.OnDisposingHook = () => service.StartAsync(CancellationToken.None); + await service.DisposeAsync(); + await Task.Delay(TimeSpan.FromMilliseconds(250), TestContext.Current.CancellationToken); + + service.ExecuteCount.Should().Be(1, "a StartAsync during disposal must not start a new loop"); + } + + [Fact] + public async Task Flap_Starts_New_Loop_While_Previous_Is_Still_Draining_And_Disposal_Drains_Both() + { + // A leadership flap: request-stop (cancel the current loop) then start a new one before the previous loop + // has finished unwinding. Both runs must be tracked — the new loop runs while the old one drains — and a + // later disposal must drain both cleanly (no hang, no fault). Each loop owns its own CancellationTokenSource, + // so the restart must not dispose the source the still-draining former loop is observing. + var run1Started = new TaskCompletionSource(); + var run1Drained = new TaskCompletionSource(); + var run2Started = new TaskCompletionSource(); + var drainGate1 = new TaskCompletionSource(); + + await using var service = new TestRestartableHostedService(async (count, ct) => + { + if (count == 1) + { + run1Started.TrySetResult(); + try + { + await Task.Delay(Timeout.Infinite, ct); + } + catch (OperationCanceledException) + { + // Linger to simulate a slow drain: stay alive until the test releases the gate, so the new + // loop provably runs concurrently with this one still unwinding. + await drainGate1.Task; + run1Drained.TrySetResult(); + throw; + } + } + else + { + run2Started.TrySetResult(); + await Task.Delay(Timeout.Infinite, ct); + } + }) + { + DrainGracePeriod = TimeSpan.FromSeconds(5), + }; + + // Loop 1 running. + await service.StartAsync(TestContext.Current.CancellationToken); + (await Task.WhenAny(run1Started.Task, Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(run1Started.Task); + + // Flap: cancel loop 1 (it lingers on the gate), then immediately start loop 2. + await service.RequestStopForTest(); + await service.StartAsync(TestContext.Current.CancellationToken); + + // Loop 2 runs while loop 1 is still draining (gate not yet released). + (await Task.WhenAny(run2Started.Task, Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(run2Started.Task, "the new loop should start while the previous one is still draining"); + run1Drained.Task.IsCompleted.Should().BeFalse("loop 1 should still be draining during the overlap"); + service.ExecuteCount.Should().Be(2, "exactly one new loop should have been started by the flap"); + + // Release loop 1's drain; it must unwind cleanly. + drainGate1.TrySetResult(); + (await Task.WhenAny(run1Drained.Task, Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(run1Drained.Task); + + // Disposal drains loop 2 (and any residue of loop 1) without hanging or faulting. + await service.DisposeAsync(); + + service.ExecuteCount.Should().Be(2); + service.FaultCount.Should().Be(0, "cancellation is not a fault"); + } + + private static async Task WaitUntilAsync(Func condition) + { + var deadline = Stopwatch.StartNew(); + while (!condition() && deadline.Elapsed < TimeSpan.FromSeconds(5)) + { + await Task.Delay(10); + } + + condition().Should().BeTrue("the awaited condition did not become true in time"); + } + + private sealed class TestRestartableHostedService(Func body) + : RestartableHostedService + { + private int _executeCount; + private int _faultCount; + + public int ExecuteCount => Volatile.Read(ref _executeCount); + + public int FaultCount => Volatile.Read(ref _faultCount); + + public Action? OnDisposingHook { get; set; } + + // Exposes the protected, non-blocking stop so a test can trigger a flap (request-stop then start). + public Task RequestStopForTest() => RequestStopAsync(); + + protected override Task ExecuteAsync(CancellationToken cancellationToken) + { + var count = Interlocked.Increment(ref _executeCount); + return body(count, cancellationToken); + } + + protected override void OnLoopFaulted(Exception exception) => Interlocked.Increment(ref _faultCount); + + protected override void OnDisposing() => OnDisposingHook?.Invoke(); + } +} diff --git a/test/KubeOps.Operator.Test/Watcher/LeaderAwareResourceWatcher.Test.cs b/test/KubeOps.Operator.Test/Watcher/LeaderAwareResourceWatcher.Test.cs index 0b212424..8823c5a0 100644 --- a/test/KubeOps.Operator.Test/Watcher/LeaderAwareResourceWatcher.Test.cs +++ b/test/KubeOps.Operator.Test/Watcher/LeaderAwareResourceWatcher.Test.cs @@ -2,6 +2,12 @@ // The .NET Foundation licenses this file to you under the Apache 2.0 License. // See the LICENSE file in the project root for more information. +using System.Runtime.CompilerServices; + +using FluentAssertions; + +using k8s; + using KubeOps.Abstractions.Builder; using KubeOps.Abstractions.Entities; using KubeOps.KubernetesClient; @@ -10,7 +16,6 @@ using KubeOps.Operator.Test.TestEntities; using KubeOps.Operator.Watcher; -using Microsoft.Extensions.Hosting; using Microsoft.Extensions.Logging; using Moq; @@ -24,7 +29,7 @@ namespace KubeOps.Operator.Test.Watcher; public sealed class LeaderAwareResourceWatcherTest { [Fact] - public async Task StoppedLeading_Should_Clear_EntityCache() + public async Task StoppedLeading_Should_Remove_Only_This_Entity_Types_Cache_Entries() { var mockCache = new Mock(); var mockCacheProvider = Mock.Of(); @@ -32,11 +37,6 @@ public async Task StoppedLeading_Should_Clear_EntityCache() .Setup(cp => cp.GetCache(It.Is(s => s == CacheConstants.CacheNames.ResourceWatcher))) .Returns(mockCache.Object); - var lifetime = Mock.Of(); - Mock.Get(lifetime) - .Setup(l => l.ApplicationStopped) - .Returns(CancellationToken.None); - var lockMock = new Mock(); lockMock .Setup(l => l.GetAsync(It.IsAny())) @@ -51,31 +51,217 @@ public async Task StoppedLeading_Should_Clear_EntityCache() var watcher = new TestableLeaderAwareResourceWatcher( mockCacheProvider, - lifetime, elector, Mock.Of>>(), Mock.Of>(), Mock.Of()); await watcher.StartAsync(TestContext.Current.CancellationToken); - // Trigger the private StoppedLeading handler via the testable wrapper. watcher.SimulateStoppedLeading(); - mockCache.Verify(c => c.Clear(It.IsAny()), Times.Once); + mockCache.Verify( + c => c.RemoveByTag( + It.Is(tag => tag == typeof(V1OperatorIntegrationTestEntity).FullName), + It.IsAny(), + It.IsAny()), + Times.Once); + mockCache.Verify(c => c.Clear(It.IsAny()), Times.Never); + } + + [Fact] + public async Task StoppedLeading_Stops_Watcher_Even_When_Cache_Cleanup_Throws() + { + var mockCache = new Mock(); + mockCache + .Setup(c => c.RemoveByTag( + It.IsAny(), It.IsAny(), It.IsAny())) + .Throws(new InvalidOperationException("tagging disabled")); + + var mockCacheProvider = Mock.Of(); + Mock.Get(mockCacheProvider).Setup(cp => cp.GetCache(It.IsAny())).Returns(mockCache.Object); + + var lockMock = new Mock(); + lockMock + .Setup(l => l.GetAsync(It.IsAny())) + .Returns(async ct => { await Task.Delay(Timeout.Infinite, ct); return null!; }); + var elector = new k8s.LeaderElection.LeaderElector(new(lockMock.Object) + { + LeaseDuration = TimeSpan.FromSeconds(1), + RenewDeadline = TimeSpan.FromMilliseconds(500), + RetryPeriod = TimeSpan.FromMilliseconds(100), + }); + + var watchStarted = new TaskCompletionSource(); + var watchCancelled = new TaskCompletionSource(); + var clientMock = new Mock(); + clientMock + .Setup(c => c.WatchAsync( + "unit-test", null, null, null, true, It.IsAny())) + .Returns((_, _, _, _, _, ct) => + SignalingWatchAsync(watchStarted, watchCancelled, ct)); + + var watcher = new TestableLeaderAwareResourceWatcher( + mockCacheProvider, + elector, + Mock.Of>>(), + Mock.Of>(), + clientMock.Object); + + await watcher.StartAsync(TestContext.Current.CancellationToken); + watcher.SimulateStartedLeading(); + (await Task.WhenAny( + watchStarted.Task, + Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(watchStarted.Task, "the watch loop should have started"); + + watcher.SimulateStoppedLeading(); + + (await Task.WhenAny( + watchCancelled.Task, + Task.Delay(TimeSpan.FromSeconds(5), TestContext.Current.CancellationToken))) + .Should().Be(watchCancelled.Task, "the watch must be stopped even though cache cleanup threw"); + mockCache.Verify( + c => c.RemoveByTag( + It.IsAny(), It.IsAny(), It.IsAny()), + Times.Once); + + await watcher.DisposeAsync(); + } + + [Fact] + public async Task StopAsync_Stops_Base_Watcher_Even_When_No_Longer_Leader() + { + var mockCache = new Mock(); + var mockCacheProvider = Mock.Of(); + Mock.Get(mockCacheProvider) + .Setup(cp => cp.GetCache(It.Is(s => s == CacheConstants.CacheNames.ResourceWatcher))) + .Returns(mockCache.Object); + + var lockMock = new Mock(); + lockMock + .Setup(l => l.GetAsync(It.IsAny())) + .Returns(async ct => { await Task.Delay(Timeout.Infinite, ct); return null!; }); + + var elector = new k8s.LeaderElection.LeaderElector(new(lockMock.Object) + { + LeaseDuration = TimeSpan.FromSeconds(1), + RenewDeadline = TimeSpan.FromMilliseconds(500), + RetryPeriod = TimeSpan.FromMilliseconds(100), + }); + + var loggerMock = new Mock>>(); + var watcher = new TestableLeaderAwareResourceWatcher( + mockCacheProvider, + elector, + loggerMock.Object, + Mock.Of>(), + Mock.Of()); + + await watcher.StartAsync(TestContext.Current.CancellationToken); + await watcher.StopAsync(TestContext.Current.CancellationToken); + + loggerMock.Verify( + l => l.Log( + LogLevel.Information, + It.IsAny(), + It.Is((@object, _) => @object.ToString()!.Contains("Stopping resource watcher")), + It.IsAny(), + It.IsAny>()), + Times.Once); + } + + [Fact] + public async Task StartAsync_Is_Idempotent_And_Starts_Only_One_Watch() + { + var mockCacheProvider = Mock.Of(); + Mock.Get(mockCacheProvider).Setup(cp => cp.GetCache(It.IsAny())).Returns(Mock.Of()); + + var lockMock = new Mock(); + lockMock + .Setup(l => l.GetAsync(It.IsAny())) + .Returns(async ct => { await Task.Delay(Timeout.Infinite, ct); return null!; }); + var elector = new k8s.LeaderElection.LeaderElector(new(lockMock.Object) + { + LeaseDuration = TimeSpan.FromSeconds(1), + RenewDeadline = TimeSpan.FromMilliseconds(500), + RetryPeriod = TimeSpan.FromMilliseconds(100), + }); + + var watchCallCount = 0; + var clientMock = new Mock(); + clientMock + .Setup(c => c.WatchAsync( + "unit-test", null, null, null, true, It.IsAny())) + .Returns((_, _, _, _, _, ct) => + { + Interlocked.Increment(ref watchCallCount); + return WaitForCancellationAsync<(k8s.WatchEventType, V1OperatorIntegrationTestEntity)>(ct); + }); + + var watcher = new TestableLeaderAwareResourceWatcher( + mockCacheProvider, + elector, + Mock.Of>>(), + Mock.Of>(), + clientMock.Object); + + await watcher.StartAsync(TestContext.Current.CancellationToken); + watcher.SimulateStartedLeading(); + watcher.SimulateStartedLeading(); + + await Task.Delay(300, TestContext.Current.CancellationToken); + Volatile.Read(ref watchCallCount).Should().Be(1); + + await watcher.DisposeAsync(); + } + + [Fact] + public async Task DisposeAsync_Unsubscribes_From_Elector() + { + var mockCacheProvider = Mock.Of(); + Mock.Get(mockCacheProvider).Setup(cp => cp.GetCache(It.IsAny())).Returns(Mock.Of()); + + var lockMock = new Mock(); + lockMock + .Setup(l => l.GetAsync(It.IsAny())) + .Returns(async ct => { await Task.Delay(Timeout.Infinite, ct); return null!; }); + var elector = new k8s.LeaderElection.LeaderElector(new(lockMock.Object) + { + LeaseDuration = TimeSpan.FromSeconds(1), + RenewDeadline = TimeSpan.FromMilliseconds(500), + RetryPeriod = TimeSpan.FromMilliseconds(100), + }); + + var watcher = new TestableLeaderAwareResourceWatcher( + mockCacheProvider, + elector, + Mock.Of>>(), + Mock.Of>(), + Mock.Of()); + + await watcher.StartAsync(TestContext.Current.CancellationToken); // subscribes the handlers + GetElectorHandler(elector, nameof(k8s.LeaderElection.LeaderElector.OnStartedLeading)).Should().NotBeNull(); + + await watcher.DisposeAsync(); + + // The async dispose path must have removed both handlers from the elector. + GetElectorHandler(elector, nameof(k8s.LeaderElection.LeaderElector.OnStartedLeading)).Should().BeNull(); + GetElectorHandler(elector, nameof(k8s.LeaderElection.LeaderElector.OnStoppedLeading)).Should().BeNull(); + } + + private static Delegate? GetElectorHandler(k8s.LeaderElection.LeaderElector elector, string eventName) + { + var field = typeof(k8s.LeaderElection.LeaderElector) + .GetField(eventName, System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic); + return (Delegate?)field?.GetValue(elector); } - /// - /// Wraps to expose the private - /// StoppedLeading handler for testing, without needing Moq to raise - /// non-virtual events. - /// private sealed class TestableLeaderAwareResourceWatcher : LeaderAwareResourceWatcher { private readonly k8s.LeaderElection.LeaderElector _elector; public TestableLeaderAwareResourceWatcher( IFusionCacheProvider cacheProvider, - IHostApplicationLifetime lifetime, k8s.LeaderElection.LeaderElector elector, ILogger> logger, ITimedEntityQueue queue, @@ -89,7 +275,6 @@ public TestableLeaderAwareResourceWatcher( new DefaultEntityLabelSelector(), new DefaultEntityFieldSelector(), client, - lifetime, elector) { _elector = elector; @@ -101,16 +286,42 @@ public TestableLeaderAwareResourceWatcher( /// Direct invocation of /// is not permitted outside the declaring class, so reflection is the only option. /// - public void SimulateStoppedLeading() + public void SimulateStoppedLeading() => InvokeElectorEvent(nameof(k8s.LeaderElection.LeaderElector.OnStoppedLeading)); + + public void SimulateStartedLeading() => InvokeElectorEvent(nameof(k8s.LeaderElection.LeaderElector.OnStartedLeading)); + + private void InvokeElectorEvent(string eventName) { var field = typeof(k8s.LeaderElection.LeaderElector) - .GetField( - nameof(k8s.LeaderElection.LeaderElector.OnStoppedLeading), - System.Reflection.BindingFlags.Instance | - System.Reflection.BindingFlags.NonPublic); + .GetField(eventName, System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic); var handler = (Action?)field?.GetValue(_elector); handler?.Invoke(); } } + + private static async IAsyncEnumerable WaitForCancellationAsync( + [EnumeratorCancellation] CancellationToken cancellationToken) + { + await Task.Delay(Timeout.Infinite, cancellationToken); + yield break; + } + + private static async IAsyncEnumerable<(WatchEventType, V1OperatorIntegrationTestEntity)> SignalingWatchAsync( + TaskCompletionSource started, + TaskCompletionSource cancelled, + [EnumeratorCancellation] CancellationToken cancellationToken) + { + started.TrySetResult(); + try + { + await Task.Delay(Timeout.Infinite, cancellationToken); + } + finally + { + cancelled.TrySetResult(); + } + + yield break; + } } diff --git a/test/KubeOps.Operator.Test/Watcher/ResourceWatcher{TEntity}.Test.cs b/test/KubeOps.Operator.Test/Watcher/ResourceWatcher{TEntity}.Test.cs index 34ff92af..6902df8e 100644 --- a/test/KubeOps.Operator.Test/Watcher/ResourceWatcher{TEntity}.Test.cs +++ b/test/KubeOps.Operator.Test/Watcher/ResourceWatcher{TEntity}.Test.cs @@ -43,6 +43,15 @@ public async Task Restarting_Watcher_Should_Trigger_New_Watch() // Restart the watcher await resourceWatcher.StartAsync(TestContext.Current.CancellationToken); + // The watch loop now runs on a background task, so StartAsync returns before WatchAsync is invoked. + // Wait until both watches have begun before asserting. + var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(5); + while (DateTime.UtcNow < deadline && + Mock.Get(kubernetesClient).Invocations.Count(i => i.Method.Name == "WatchAsync") < 2) + { + await Task.Delay(25, TestContext.Current.CancellationToken); + } + // Assert Mock.Get(kubernetesClient) .Verify(client => client.WatchAsync( @@ -62,6 +71,7 @@ public async Task OnEvent_Should_Remove_From_Cache_On_Deleted_When_Strategy_Is_B var entity = CreateTestEntity(); var mockCache = new Mock(); var mockQueue = new Mock>(); + SetupEnqueueResult(mockQueue, true); var watcher = CreateTestableWatcher(cache: mockCache.Object, queue: mockQueue.Object); // Act @@ -119,6 +129,7 @@ public async Task OnEvent_Should_Enqueue_When_Generation_Changed_And_Strategy_Is It.IsAny(), It.IsAny())) .ReturnsAsync(MaybeValue.FromValue(entity.Generation()!.Value - 1)); + SetupEnqueueResult(mockQueue, true); // Act await watcher @@ -321,6 +332,7 @@ public async Task OnEvent_Should_Enqueue_When_Finalizers_Changed_During_Deletion It.IsAny(), It.IsAny())) .ReturnsAsync(MaybeValue.FromValue("deleting:2026-05-28T10:00:00.0000000Z:30:1:operator.test/first-finalizer")); + SetupEnqueueResult(mockQueue, true); // Act await watcher.InvokeOnEventAsync( @@ -410,6 +422,7 @@ public async Task OnEvent_Should_Enqueue_When_ResourceVersion_Changed_And_Strate It.IsAny(), It.IsAny())) .ReturnsAsync(MaybeValue.FromValue("1")); + SetupEnqueueResult(mockQueue, true); // Act await watcher.InvokeOnEventAsync(WatchEventType.Modified, entity, TestContext.Current.CancellationToken); @@ -502,6 +515,7 @@ public async Task OnEvent_Should_Remove_From_Cache_On_Deleted_When_Strategy_Is_B var mockCache = new Mock(); var mockQueue = new Mock>(); var settings = new OperatorSettingsBuilder { Namespace = "unit-test", ReconcileStrategy = ReconcileStrategy.ByResourceVersion }.Build(); + SetupEnqueueResult(mockQueue, true); var watcher = CreateTestableWatcher(cache: mockCache.Object, queue: mockQueue.Object, settings: settings); // Act @@ -519,6 +533,144 @@ public async Task OnEvent_Should_Remove_From_Cache_On_Deleted_When_Strategy_Is_B Times.Once); } + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task OnEvent_Should_Not_Update_Cache_When_Enqueue_Dropped_And_Strategy_Is_ByGeneration() + { + // Arrange – generation changed, but the queue dropped the entry (intake suspended on leadership loss) + var entity = CreateTestEntity(); + var mockCache = new Mock(); + var mockQueue = new Mock>(); + var watcher = CreateTestableWatcher(cache: mockCache.Object, queue: mockQueue.Object); + + mockCache + .Setup(c => c.TryGetAsync( + It.Is(s => s == entity.Uid()), + It.IsAny(), + It.IsAny())) + .ReturnsAsync(MaybeValue.FromValue(entity.Generation()!.Value - 1)); + SetupEnqueueResult(mockQueue, false); + + // Act + await watcher.InvokeOnEventAsync(WatchEventType.Modified, entity, TestContext.Current.CancellationToken); + + // Assert – dedup cache must NOT be advanced for an entry that was never enqueued + mockCache.Verify( + c => c.SetAsync( + It.IsAny(), + It.IsAny(), + It.IsAny(), + It.IsAny?>(), + It.IsAny()), + Times.Never); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task OnEvent_Should_Not_Update_Cache_When_Enqueue_Dropped_And_Strategy_Is_ByResourceVersion() + { + // Arrange – resourceVersion changed, but the queue dropped the entry + var entity = CreateTestEntity(resourceVersion: "2"); + var mockCache = new Mock(); + var mockQueue = new Mock>(); + var settings = new OperatorSettingsBuilder { Namespace = "unit-test", ReconcileStrategy = ReconcileStrategy.ByResourceVersion }.Build(); + var watcher = CreateTestableWatcher(cache: mockCache.Object, queue: mockQueue.Object, settings: settings); + + mockCache + .Setup(c => c.TryGetAsync( + It.Is(s => s == entity.Uid()), + It.IsAny(), + It.IsAny())) + .ReturnsAsync(MaybeValue.FromValue("1")); + SetupEnqueueResult(mockQueue, false); + + // Act + await watcher.InvokeOnEventAsync(WatchEventType.Modified, entity, TestContext.Current.CancellationToken); + + // Assert – dedup cache must NOT be advanced for an entry that was never enqueued + mockCache.Verify( + c => c.SetAsync( + It.IsAny(), + It.IsAny(), + It.IsAny(), + It.IsAny?>(), + It.IsAny()), + Times.Never); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task OnEvent_Should_Not_Update_DeletionTracking_When_Enqueue_Dropped_And_Strategy_Is_ByGeneration() + { + // Arrange – deletion fingerprint changed, but the queue dropped the entry + var entity = CreateTestEntity(); + entity.Metadata.DeletionTimestamp = new DateTime(2026, 05, 28, 10, 00, 00, DateTimeKind.Utc); + entity.Metadata.DeletionGracePeriodSeconds = 30; + entity.Metadata.Finalizers = ["operator.test/second-finalizer"]; + + var mockCache = new Mock(); + var mockQueue = new Mock>(); + var watcher = CreateTestableWatcher(cache: mockCache.Object, queue: mockQueue.Object); + + mockCache + .Setup(c => c.TryGetAsync( + It.Is(s => s == $"{entity.Uid()}:deletion"), + It.IsAny(), + It.IsAny())) + .ReturnsAsync(MaybeValue.FromValue("deleting:2026-05-28T10:00:00.0000000Z:30:1:operator.test/first-finalizer")); + SetupEnqueueResult(mockQueue, false); + + // Act + await watcher.InvokeOnEventAsync(WatchEventType.Modified, entity, TestContext.Current.CancellationToken); + + // Assert – deletion tracking entry must NOT be advanced for an entry that was never enqueued + mockCache.Verify( + c => c.SetAsync( + It.IsAny(), + It.IsAny(), + It.IsAny(), + It.IsAny?>(), + It.IsAny()), + Times.Never); + } + + [Fact] + [Trait("Area", "LeaderLoss")] + public async Task OnEvent_Should_Not_Remove_From_Cache_When_Enqueue_Dropped_On_Deleted() + { + // Arrange – delete event, but the queue dropped the entry + var entity = CreateTestEntity(); + var mockCache = new Mock(); + var mockQueue = new Mock>(); + var watcher = CreateTestableWatcher(cache: mockCache.Object, queue: mockQueue.Object); + + SetupEnqueueResult(mockQueue, false); + + // Act + await watcher.InvokeOnEventAsync(WatchEventType.Deleted, entity, TestContext.Current.CancellationToken); + + // Assert – cache entry must NOT be removed when the delete was never enqueued + mockCache.Verify( + c => c.RemoveAsync( + It.IsAny(), + It.IsAny(), + It.IsAny()), + Times.Never); + } + + private static void SetupEnqueueResult( + Mock> queue, + bool result) + => queue + .Setup(q => q.Enqueue( + It.IsAny(), + It.IsAny(), + It.IsAny(), + It.IsAny(), + It.IsAny(), + It.IsAny())) + .ReturnsAsync(result); + private static V1OperatorIntegrationTestEntity CreateTestEntity(string resourceVersion = "1") => new() { diff --git a/test/KubeOps.Operator.Web.Test/Metrics/MetricsExtensions.Test.cs b/test/KubeOps.Operator.Web.Test/Metrics/MetricsExtensions.Test.cs index 97ee02c8..ed6e30d0 100644 --- a/test/KubeOps.Operator.Web.Test/Metrics/MetricsExtensions.Test.cs +++ b/test/KubeOps.Operator.Web.Test/Metrics/MetricsExtensions.Test.cs @@ -11,7 +11,6 @@ using Microsoft.Extensions.DependencyInjection; -using OpenTelemetry; using OpenTelemetry.Metrics; namespace KubeOps.Operator.Web.Test.Metrics;